06-22-2022 05:57 AM
I have a pandas-on-Spark dataframe with 8 million rows and 20 columns. Running df.shape took 3.48 minutes, and running df.head took 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 seconds.
I am a bit surprised that shape and head, the simplest of the dataframe functions, take this long. I would assume that value_counts should take longer, because if the var1 values are split across different nodes then a data shuffle is needed. shape is a simple count, whereas head is a simple fetch of 5 rows from any node.
Am I doing something wrong? Is there any documentation on best practices and guidance on how to use the pandas API on Spark?
06-23-2022 12:48 PM
Be sure to import the API with "import pyspark.pandas as ps".
Please first compare the timings with the equivalent operations on a regular Spark DataFrame. There can be several problems related to the dataset itself.
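One way to run that comparison is sketched below. It is only an illustration: `timed` and `compare_shape_and_head` are hypothetical helper names, `sdf` is assumed to be an existing `pyspark.sql.DataFrame`, and `pandas_api()` is assumed to be available (Spark 3.2 or later).

```python
import time


def timed(fn):
    """Call fn() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare_shape_and_head(sdf, n=5):
    """Time count/limit on a Spark DataFrame against shape/head on its pandas-on-Spark view.

    sdf is assumed to be a pyspark.sql.DataFrame; nothing here is specific
    to the dataset from the question.
    """
    psdf = sdf.pandas_api()  # no index_col given, so a default index is attached
    timings = {}
    # Plain Spark equivalents of shape and head
    _, timings["spark count"] = timed(sdf.count)
    _, timings["spark limit + collect"] = timed(lambda: sdf.limit(n).collect())
    # pandas-on-Spark versions of the same operations
    _, timings["pandas-on-Spark shape"] = timed(lambda: psdf.shape)
    # to_pandas() forces the head to actually be computed
    _, timings["pandas-on-Spark head"] = timed(lambda: psdf.head(n).to_pandas())
    return timings
```

Calling `compare_shape_and_head(sdf)` returns a dict of seconds per operation. If the plain Spark calls are also slow, the cause is more likely the data source or cluster than the pandas API.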
06-27-2022 09:10 AM
Hi @Wiki, we haven’t heard from you since the last response from @Hubert Dudek, and I was checking back to see if his suggestions helped you. If you have found a solution, please share it with the community, as it can be helpful to others.
07-01-2022 02:58 PM
Corroborating Vik's experience. head() and shape are extremely slow, along with info().
Hubert provides some suggestions, but I don't think any of them explains why such basic functions perform so poorly when other functions can run in half a second.
01-12-2023 04:36 PM
This is slow because the pandas API needs an index column to perform `shape` or `head`. If you don't provide one, pandas-on-Spark enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in a Spark dataframe `df` with a million rows, `df.pandas_api().head()` will take a long time, but `df.pandas_api(index_col='A').head()` will complete quickly.
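A minimal sketch of that workaround follows, assuming `sdf` is a Spark dataframe that already has a suitable column to index on. The two function names are hypothetical. The second function is an alternative not mentioned above: it switches the `compute.default_index_type` option to `"distributed"`, which as far as I know avoids the full enumeration at the cost of index values that are neither consecutive nor deterministic.

```python
def with_explicit_index(sdf, index_col):
    """Convert a Spark DataFrame to pandas-on-Spark, reusing an existing column as the index.

    Because an index column is supplied, no default index has to be generated.
    """
    return sdf.pandas_api(index_col=index_col)


def use_distributed_default_index():
    """Alternative (an assumption, not from the post above): change the default index type.

    Only relevant when no existing column is a sensible index.
    """
    import pyspark.pandas as ps  # imported here so the helper above works without it

    ps.set_option("compute.default_index_type", "distributed")
```

For the example above, `with_explicit_index(df, "A").head()` corresponds to `df.pandas_api(index_col='A').head()`.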