count or toPandas taking too long
11-16-2023 08:18 AM
Hi,
I am fetching data from Unity Catalog in notebooks using spark.sql(). The query takes just a few seconds - I am actually trying to retrieve only 2 rows - but some operations like count() or toPandas() take forever. I wonder why they take so long and whether there is a way to speed up those operations.
Compute: Personal Compute, m5d.2xlarge, Databricks Runtime 14.1 (includes Apache Spark 3.5.0, Scala 2.12)
Thanks!
05-15-2024 01:43 AM - edited 05-15-2024 01:44 AM
Hi, it is quite normal for converting a Spark DataFrame to pandas to take time, because all of the data has to be collected to the driver.
There is a way to optimize it, though.
Enable Arrow optimization: starting from Spark 3.0.0, you can enable Arrow optimization, which speeds up the conversion by using Apache Arrow for faster data transfer between Spark and Python.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
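For context, here is a minimal sketch of how that setting might be combined with the conversion itself. It assumes the `spark` session that Databricks notebooks predefine; the helper name `to_pandas_with_arrow` and its `max_rows` cap are hypothetical and only for illustration. Note that spark.sql() is lazy, so the query actually executes when toPandas() is called; Arrow speeds up the transfer to Python, not the query.

```python
def to_pandas_with_arrow(spark, query, max_rows=None):
    """Run `query` and collect the result to pandas with Arrow enabled.

    `spark` is the SparkSession that Databricks notebooks predefine.
    `max_rows` is a hypothetical safety cap, not part of any Spark API.
    """
    # Use Apache Arrow for the Spark -> pandas transfer (Spark 3.0+).
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # spark.sql() only builds a lazy DataFrame; nothing runs yet.
    df = spark.sql(query)

    if max_rows is not None:
        # Cap the number of rows sent to the driver.
        df = df.limit(max_rows)

    # toPandas() is the action that executes the query and collects the rows.
    return df.toPandas()
```

In a notebook this could be called as `pdf = to_pandas_with_arrow(spark, "SELECT * FROM my_catalog.my_schema.my_table", max_rows=2)`, where the table name is a placeholder.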
Data engineer at Rsystema
05-27-2024 10:52 AM
Hey @jimcast, how are you?
You can inspect the internals and get a good hint of what is happening using the Spark UI. Filter for the jobs that take the longest, then check what they request on the SQL/DataFrame tab, as well as their query plans.
If your data is public, please also share more details (such as logs, printed output and dumps) so we can help you better.
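As a complement to the Spark UI, the plan can also be printed from the notebook and the action timed on its own. This is only a sketch: the helper name `inspect_and_time` is hypothetical, and `df` is assumed to be the DataFrame returned by spark.sql().

```python
import time


def inspect_and_time(df):
    """Print the query plan of `df`, then time a count() on it."""
    # explain() only prints the plan; it does not execute the query.
    df.explain(mode="formatted")

    start = time.perf_counter()
    n_rows = df.count()  # action: this is where the query actually runs
    elapsed = time.perf_counter() - start

    print(f"count() returned {n_rows} rows in {elapsed:.2f} s")
    return n_rows, elapsed
```

If the printed plan shows a full scan of a large table or an expensive join feeding the 2-row result, that would explain why the action is slow even though spark.sql() itself returns quickly.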
Best,
Alessandro