I have a pandas-on-Spark DataFrame with 8 million rows and 20 columns. Running df.shape took 3.48 minutes, and running df.head took 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 seconds.
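For anyone who wants to reproduce the comparison, here is a minimal sketch of how the three timings above could be taken. The helper names (time_call, time_dataframe_ops, main) and the Parquet input are assumptions for illustration only, not the code that produced the numbers in this question.

```python
import time


def time_call(fn):
    """Call fn() once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_dataframe_ops(df, column="var1"):
    """Time shape, head and value_counts on a DataFrame-like object."""
    ops = {
        "shape": lambda: df.shape,
        "head": lambda: df.head(),
        "value_counts": lambda: df[column].value_counts().reset_index(),
    }
    # Keep only the elapsed time of each operation.
    return {name: time_call(op)[1] for name, op in ops.items()}


def main(path):
    # Imported here so the helpers above can be used without Spark installed.
    import pyspark.pandas as ps

    # Hypothetical input: the question does not say how the data was loaded.
    df = ps.read_parquet(path)
    for name, seconds in time_dataframe_ops(df).items():
        print(f"{name}: {seconds:.2f} s")
```

Calling main with the path to a dataset prints one wall-clock timing per operation. Each operation runs only once, so the figures include any one-off setup cost.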
I am a bit surprised that shape and head, the simplest of the DataFrame operations, take this long. I would have expected value_counts to take longer, because if the var1 values are split across different nodes, a data shuffle is needed. By comparison, shape is a simple count and head is a simple fetch of 5 rows from any node.
Am I doing something wrong? Is there documentation on best practices and guidance for using the pandas API on Spark?