- 6219 Views
- 8 replies
- 9 kudos
Hi, I'm doing something simple in a Databricks notebook:

```python
spark.sparkContext.setCheckpointDir("/tmp/")
import pyspark.pandas as ps

sql = """select
    field1, field2
from table
where date >= '2021-01-01'"""

df = ps.sql(sql)
df.spark.checkpoint()
```

That...
Latest Reply
`checkpoint()` returns a checkpointed DataFrame, so you need to assign the result to a new variable:

```python
checkpointedDF = df.checkpoint()
```
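For a pandas-on-Spark DataFrame like the one in the question, `checkpoint()` is reached through the `.spark` accessor and likewise returns a new DataFrame rather than mutating `df`. A minimal sketch, assuming a hypothetical table name `some_table`:

```python
import pyspark.pandas as ps

# The checkpoint directory must be set before checkpointing.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

# "some_table" is a hypothetical name used for illustration.
df = ps.sql("select field1, field2 from some_table")

# checkpoint() returns a new, checkpointed DataFrame; capture the result
# instead of discarding it.
checkpointed_df = df.spark.checkpoint()
```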
by Vik1 • New Contributor II
- 9798 Views
- 3 replies
- 5 kudos
I have a pandas-on-Spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head took 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...
Latest Reply
This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark.pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
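Two common mitigations, sketched below; the table and column names (`my_table`, `id`) are hypothetical:

```python
import pyspark.pandas as ps

# Option 1: use an existing column as the index so no default index
# has to be computed when reading the table.
df = ps.read_table("my_table", index_col="id")

# Option 2: switch the default index type away from the sequential one,
# which skips the global enumeration at the cost of non-sequential
# (but still unique) index values.
ps.set_option("compute.default_index_type", "distributed")
```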
- 5298 Views
- 4 replies
- 1 kudos
When I load a table as a `pandas_on_spark` dataframe and try to, e.g., scatterplot two columns, what I obtain is a subset of the desired points. For example, if I try to plot two columns from a table with 1,000,000 rows, I only see some of the data - i...
Latest Reply
@Kaniz Fatma The problem is not about performance or plotly. It is about the pandas_on_spark dataframe arbitrarily subsampling the input data when plotting, without notifying the user about it. While subsampling is understandable and maybe even nece...
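For context, the sampling is governed by pandas-on-Spark plotting options. A minimal sketch of raising or disabling it (collecting every point to the driver can be slow on large tables):

```python
import pyspark.pandas as ps

# Sample-based plots (e.g., scatter, area) draw a fraction of the rows;
# a ratio of 1.0 plots every row.
ps.set_option("plotting.sample_ratio", 1.0)

# Truncation-based plots (e.g., bar, pie) use the first N rows instead.
ps.set_option("plotting.max_rows", 10000)
```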