
count or toPandas taking too long

jimcast — Thu, 16 Nov 2023 16:18:41 GMT

Hi,

I am fetching data from Unity Catalog in notebooks using spark.sql(). The query takes just a few seconds - I am actually trying to retrieve only 2 rows - but some operations like count() or toPandas() take forever. I wonder why they take so long and whether there is a way to speed them up.

Compute: personal compute m5d.2xlarge (14.1 (includes Apache Spark 3.5.0, Scala 2.12))

Thanks!

Re: count or toPandas taking too long

Hkesharwani — Wed, 15 May 2024 08:44:45 GMT

Hi, it is quite normal for converting a DataFrame from Spark to pandas to take time.
However, there is a way to optimize it.
Enable Arrow optimization: starting with Spark 3.0.0, you can set the configuration below, which speeds up the conversion by using Apache Arrow for faster data transfer between Spark and Python.

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
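
As a rough sketch of how that setting might be used in a notebook, the helper below enables Arrow and then converts a bounded number of rows to pandas. The function name `to_pandas_with_arrow` and the `max_rows` cap are illustrative assumptions, not part of any Spark API; `spark` is assumed to be the session object that notebooks normally provide.

```python
def to_pandas_with_arrow(df, spark, max_rows=1000):
    """Convert a Spark DataFrame to pandas with Arrow enabled.

    `to_pandas_with_arrow` and `max_rows` are hypothetical names used
    for illustration only.
    """
    # Ask Spark to use Apache Arrow when moving data to Python.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Cap the rows collected to the driver; toPandas() pulls everything
    # that remains into driver memory.
    return df.limit(max_rows).toPandas()
```

In a notebook this could be called as `pdf = to_pandas_with_arrow(spark.sql(query), spark)`, where `query` stands for your own SQL string. Note that Arrow only speeds up the transfer step; if the underlying query is expensive, the conversion will still wait for it to finish.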

Re: count or toPandas taking too long

anardinelli — Mon, 27 May 2024 17:52:19 GMT

Hey @jimcast how are you?

You can check the internals and get a good hint of what's happening using the Spark UI. Filter and select the jobs that are taking the longest, then check what is being executed on the SQL/DataFrame tab, as well as their query plans.
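
The plan can also be inspected from the notebook itself. Keep in mind that Spark evaluates lazily: spark.sql() only builds a plan, and the real work runs when an action such as count() or toPandas() is called, which may be why the query appears fast while the action is slow. The sketch below prints the plan and times a count(); the helper name `explain_and_time` is a hypothetical one chosen for illustration.

```python
import time


def explain_and_time(df):
    """Print the query plan, then time a count() on the DataFrame.

    `explain_and_time` is a hypothetical helper, not a Spark API.
    """
    # Show the physical plan Spark will run (mode= is available in Spark 3.0+).
    df.explain(mode="formatted")
    # count() is an action, so this is where the query actually executes.
    start = time.perf_counter()
    n_rows = df.count()
    elapsed = time.perf_counter() - start
    return n_rows, elapsed
```

For example, `n, secs = explain_and_time(spark.sql(query))`, with `query` standing for your own SQL string. A large scan or join in the printed plan, even for a 2-row result, would point at the query rather than at the pandas conversion.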

If your data is public, please also share more details (such as logs, printed output and dumps) so we can help you better.

Best,

Alessandro