Data Engineering

Forum Posts

Vik1
by New Contributor II
  • 5997 Views
  • 4 replies
  • 5 kudos

Some very simple functions in Pandas on Spark are very slow

I have a pandas on Spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head also takes a long time: 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...

Latest Reply
PeterDowdy
New Contributor II
  • 5 kudos

This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pandas on Spark enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...

3 More Replies
alejandrofm
by Valued Contributor
  • 2864 Views
  • 7 replies
  • 8 kudos

Resolved! Pandas.spark.checkpoint() doesn't break lineage

Hi, I'm doing something simple in a Databricks notebook:
spark.sparkContext.setCheckpointDir("/tmp/")
import pyspark.pandas as ps
sql = ("""select field1, field2 From table Where date>='2021-01-01'""")
df = ps.sql(sql)
df.spark.checkpoint()
That...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 8 kudos

If you need checkpointing, please try the code below. Thanks to persist, you will avoid reprocessing:
df = ps.sql(sql).spark.persist()
df.spark.checkpoint()

6 More Replies
DavideCagnoni
by Contributor
  • 2488 Views
  • 8 replies
  • 3 kudos

Resolved! How to force pandas_on_spark plots to use all dataframe data?

When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points. For example, if I try to plot two columns from a table with 1000000 rows, I only see some of the data - i...

Latest Reply
Kaniz
Community Manager
  • 3 kudos

Hi @Davide Cagnoni, the Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team. Use the Ideas Portal to: Enter feature requests. View, comment, and vote up other users’ requests. Monitor the p...

7 More Replies