by
Vik1
• New Contributor II
- 5997 Views
- 4 replies
- 5 kudos
I have a pandas on spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape and it takes. It also takes a long time to run df.head took 4.55 minutes . By contrast df.var1.value_counts().reset_index() took only 0.18 sec...
- 5997 Views
- 4 replies
- 5 kudos
Latest Reply
The reason why this is slow is because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
3 More Replies
- 2864 Views
- 7 replies
- 8 kudos
Hi, I'm doing some something simple on Databricks notebook:spark.sparkContext.setCheckpointDir("/tmp/")
import pyspark.pandas as ps
sql=("""select
field1, field2
From table
Where date>='2021-01.01""")
df = ps.sql(sql)
df.spark.checkpoint()That...
- 2864 Views
- 7 replies
- 8 kudos
Latest Reply
If you need checkpointing, please try the below code. Thanks to persist, you will avoid reprocessing:df = ps.sql(sql).persist()
df.spark.checkpoint()
6 More Replies
- 2488 Views
- 8 replies
- 3 kudos
When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points. For example, if I try to plot two columns from a table with 1000000 rows, I only see some of the data - i...
- 2488 Views
- 8 replies
- 3 kudos
Latest Reply
Hi @Davide Cagnoni​ , The Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team. Use the Ideas Portal to:Enter feature requests.View, comment, and vote up other users’ requests.Monitor the p...
7 More Replies