Re: count or toPandas taking too long

jimcast · ‎11-16-2023

Hi,

I am fetching data from Unity Catalog in a notebook using spark.sql(). The query itself takes just a few seconds (I am actually trying to retrieve only 2 rows), but operations like count() or toPandas() take forever. I wonder why they take so long and whether there is a way to speed them up.

Compute: personal compute m5d.2xlarge (14.1 (includes Apache Spark 3.5.0, Scala 2.12))

Thanks!

Hkesharwani · ‎05-15-2024

Hi, it is quite normal for converting a Spark DataFrame to pandas to take time.
There is a way to optimize it, though.
Enable Arrow optimization: starting with Spark 3.0.0, you can enable Arrow optimization, which speeds up the conversion by using Apache Arrow for faster data transfer between Spark and Python.

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
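As a minimal sketch of how the setting above might be used in a notebook: the helper name to_pandas_with_arrow and its max_rows parameter are hypothetical, and spark is assumed to be the session that the notebook already provides. Arrow only speeds up the transfer from Spark to Python; if the underlying query is slow to execute, this alone will probably not help much.

```python
def to_pandas_with_arrow(spark, df, max_rows=1000):
    """Convert a Spark DataFrame to pandas with Arrow enabled.

    Hypothetical helper for illustration; `max_rows` is an assumed
    safety cap so that only a small result is pulled to the driver.
    """
    # Ask Spark to use Apache Arrow for the Spark -> pandas transfer.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # limit() keeps the collected result small; toPandas() is the action
    # that actually runs the query and collects the rows.
    return df.limit(max_rows).toPandas()
```

In a notebook this could be called as to_pandas_with_arrow(spark, spark.sql(query), max_rows=2), where query is your own SQL string.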

Harshit Kesharwani
Data engineer at Rsystema

anardinelli · ‎05-27-2024

Hey @jimcast, how are you?

You can check the internals and get a good idea of what's happening using the Spark UI. Filter for the jobs that take the longest, then check what is being requested on the SQL/DataFrame tab, as well as the query plans.
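A rough sketch of how to narrow this down from the notebook itself, alongside the Spark UI: for a SELECT, spark.sql() only builds the query plan, and an action such as count() or toPandas() is what actually runs it, which would explain a "fast" query followed by a slow count(). The helper names timed and profile_query are hypothetical; explain(mode="formatted") is the standard DataFrame method for printing the plan.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print how long it took, return (result, seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result, elapsed


def profile_query(spark, query):
    """Time plan creation and execution separately (hypothetical helper)."""
    # Building the DataFrame does not scan the table for a SELECT.
    df, plan_seconds = timed("spark.sql (builds the plan)", lambda: spark.sql(query))
    # Print the physical plan that the action below will execute.
    df.explain(mode="formatted")
    # count() is an action: this is where the query actually runs.
    row_count, action_seconds = timed("count() (runs the query)", df.count)
    return {
        "rows": row_count,
        "plan_seconds": plan_seconds,
        "action_seconds": action_seconds,
    }
```

Calling profile_query(spark, query) with your own SQL string prints both timings and the plan, which can then be compared with what the SQL/DataFrame tab shows for the same job.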

If your data is public, please also share more details (such as logs, printed output, and dumps) so we can help you better.

Best,

Alessandro