Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

Vik1
by New Contributor II
  • 7955 Views
  • 4 replies
  • 5 kudos

Some very simple functions in Pandas on Spark are very slow

I have a pandas-on-Spark dataframe with 8 million rows and 20 columns. It took 3.48 minutes to run df.shape, and df.head took 4.55 minutes. By contrast, df.var1.value_counts().reset_index() took only 0.18 sec...

Latest Reply
PeterDowdy
New Contributor II
  • 5 kudos

This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark.pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
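A minimal sketch of two ways to sidestep the default-index scan described in this reply, assuming it applies to the poster's setup: load the table with an existing column as the index, or switch the default index type to one that does not need a full pass. The function names and the idea of reading from a table are hypothetical and not from the thread; `compute.default_index_type` and the `index_col` parameter of `ps.read_table` are existing pyspark.pandas features.

```python
# Sketch only: the helper names below are illustrative, not from the thread.
VALID_INDEX_TYPES = ("sequence", "distributed-sequence", "distributed")


def use_default_index_type(index_type="distributed"):
    """Choose how pandas-on-Spark builds an index when none is supplied."""
    if index_type not in VALID_INDEX_TYPES:
        raise ValueError(f"unknown default index type: {index_type!r}")
    import pyspark.pandas as ps  # imported lazily; requires a Spark runtime

    ps.set_option("compute.default_index_type", index_type)


def read_with_index(table_name, index_col):
    """Load a table using an existing column as the index,
    so no default index has to be generated."""
    import pyspark.pandas as ps  # imported lazily; requires a Spark runtime

    return ps.read_table(table_name, index_col=index_col)
```

Note that the `distributed` index type produces increasing but non-contiguous, non-deterministic values, so it only suits code that does not rely on positional index values.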

3 More Replies
alejandrofm
by Valued Contributor
  • 4549 Views
  • 7 replies
  • 9 kudos

Resolved! Pandas.spark.checkpoint() doesn't break lineage

Hi, I'm doing something simple in a Databricks notebook:
spark.sparkContext.setCheckpointDir("/tmp/")
import pyspark.pandas as ps
sql = ("""select field1, field2 From table Where date>='2021-01-01'""")
df = ps.sql(sql)
df.spark.checkpoint()
That...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 9 kudos

If you need checkpointing, please try the code below. Thanks to persist, you will avoid reprocessing:
df = ps.sql(sql).persist()
df.spark.checkpoint()
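One thing worth checking in snippets like the ones in this thread: `DataFrame.spark.checkpoint()` returns a checkpointed DataFrame rather than modifying the one it is called on, so the result has to be kept. The sketch below is an assumption about what may help, not the thread's accepted answer; the function name and its parameters are hypothetical.

```python
def checkpointed_query(spark, sql, checkpoint_dir="/tmp/"):
    """Run a SQL query as a pandas-on-Spark DataFrame and return
    its checkpointed version.

    `spark` is an active SparkSession; `checkpoint_dir` mirrors the
    directory used in the question.
    """
    import pyspark.pandas as ps  # imported lazily; requires a Spark runtime

    spark.sparkContext.setCheckpointDir(checkpoint_dir)
    df = ps.sql(sql)
    # checkpoint() is eager by default and returns a new DataFrame;
    # the original `df` still carries the full lineage.
    return df.spark.checkpoint()
```

Subsequent operations should use the returned DataFrame, for example `df = checkpointed_query(spark, sql)`.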

  • 9 kudos
6 More Replies
DavideCagnoni
by Contributor
  • 4258 Views
  • 8 replies
  • 3 kudos

Resolved! How to force pandas_on_spark plots to use all dataframe data?

When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points. For example, if I try to plot two columns from a table with 1000000 rows, I only see some of the data - i...
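For context, pandas-on-Spark limits how much data reaches the plotting backend through the `plotting.max_rows` and `plotting.sample_ratio` options. The sketch below raises both for a single plot; which of the two governs scatter plots may depend on the version, so setting both is an assumption made here for safety. The function name is hypothetical, and plotting every row collects the data on the driver, which may not be practical for large tables.

```python
def plot_all_rows(psdf, x, y):
    """Scatter-plot two columns of a pandas-on-Spark DataFrame
    without the default row limit or sampling."""
    import pyspark.pandas as ps  # imported lazily; requires a Spark runtime

    n_rows = max(len(psdf), 1)  # len() triggers a count over the data
    # Options are restored to their previous values when the block exits.
    with ps.option_context(
        "plotting.max_rows", n_rows,
        "plotting.sample_ratio", 1.0,
    ):
        return psdf.plot.scatter(x=x, y=y)
```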

Latest Reply
Kaniz_Fatma
Community Manager
  • 3 kudos

Hi @Davide Cagnoni, the Ideas Portal lets you influence the Databricks product roadmap by providing feedback directly to the product team. Use the Ideas Portal to: Enter feature requests. View, comment, and vote up other users’ requests. Monitor the p...

7 More Replies