Topics with Label: Spark Pandas Api

Forum Posts

by alejandrofm • Valued Contributor

03-31-2022 7:39:01 AM

6219 Views
8 replies
9 kudos

Resolved! Pandas.spark.checkpoint() doesn't break lineage

Hi, I'm doing something simple in a Databricks notebook:

    spark.sparkContext.setCheckpointDir("/tmp/")
    import pyspark.pandas as ps
    sql = ("""select field1, field2 From table Where date>='2021-01-01'""")
    df = ps.sql(sql)
    df.spark.checkpoint()

That...

Data Engineering

Latest Reply

annafina
New Contributor II

11-21-2024 6:34:04 AM

9 kudos

checkpoint() returns a checkpointed DataFrame, so you need to assign it to a new variable: `checkpointedDF = df.checkpoint()`
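
The point of this reply can be sketched for the pandas-on-Spark accessor used in the question (`df.spark.checkpoint()`). This is only an illustration: the helper names `checkpoint_frame` and `demo` and the sample data are hypothetical, and `demo` assumes a working Spark installation. The key detail is that the call returns a new frame instead of changing the existing one, so the return value has to be kept.

```python
def checkpoint_frame(psdf, eager=True):
    """Checkpoint a pandas-on-Spark DataFrame and return the new frame.

    `psdf.spark.checkpoint()` does not modify `psdf` in place; it returns a
    checkpointed frame, so the caller has to keep the return value.
    """
    return psdf.spark.checkpoint(eager=eager)


def demo():
    # Needs a working Spark installation, so the imports are done lazily.
    import pyspark.pandas as ps
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/")  # directory taken from the question

    # Made-up data standing in for the table queried in the question.
    psdf = ps.DataFrame({"field1": [1, 2, 3], "field2": ["a", "b", "c"]})
    filtered = psdf[psdf["field1"] > 1]

    # Keep the returned frame; `filtered` itself still carries the original plan.
    checkpointed = checkpoint_frame(filtered)
    checkpointed.spark.explain()  # print the plan of the checkpointed frame
    return checkpointed
```

Calling `filtered.spark.explain()` next to `checkpointed.spark.explain()` inside `demo` is one way to compare the two plans.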

7 More Replies

by Vik1 • New Contributor II

06-22-2022 5:57:03 AM

9798 Views
3 replies
5 kudos

Some very simple functions in Pandas on Spark are very slow

I have a pandas-on-Spark DataFrame with 8 million rows and 20 columns. Running `df.shape` took 3.48 minutes, and `df.head` took 4.55 minutes. By contrast, `df.var1.value_counts().reset_index()` took only 0.18 sec...

Data Engineering

Latest Reply

PeterDowdy
New Contributor II

01-12-2023 4:36:35 PM

5 kudos

This is slow because pandas needs an index column to perform `shape` or `head`. If you don't provide one, pyspark pandas enumerates the entire dataframe to create a default one. For example, given columns A, B, and C in dataframe `d...
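
Two ways of avoiding a generated default index, as described in this reply, might look like the sketch below. The helper names are hypothetical, the column name passed as `index_col` is whatever existing column suits your data, and `DataFrame.pandas_api` assumes Spark 3.3 or later (earlier releases expose `to_pandas_on_spark` instead). The module is passed in as an argument so the sketch does not need Spark at import time.

```python
def to_pandas_on_spark(spark_df, index_col=None):
    """Convert a Spark DataFrame to pandas-on-Spark.

    When `index_col` names an existing column, that column becomes the index
    and no default index has to be generated for the frame.
    """
    return spark_df.pandas_api(index_col=index_col)


def use_distributed_default_index(ps_module):
    """Switch the default index type to 'distributed'.

    `ps_module` is expected to be `pyspark.pandas`. The 'distributed' type
    yields increasing but non-consecutive index values and avoids building
    a global sequence over the whole frame.
    """
    ps_module.set_option("compute.default_index_type", "distributed")


# Hypothetical usage, assuming `sdf` is a Spark DataFrame with a column "A":
#   import pyspark.pandas as ps
#   psdf = to_pandas_on_spark(sdf, index_col="A")
#   use_distributed_default_index(ps)
```

Whether either change helps depends on the workload; the 'distributed' index in particular is not suitable when code relies on consecutive row positions.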

2 More Replies

by DavideCagnoni • Contributor

02-11-2022 1:09:31 AM

5298 Views
4 replies
1 kudos

How to force pandas_on_spark plots to use all dataframe data?

When I load a table as a `pandas_on_spark` dataframe, and try to e.g. scatterplot two columns, what I obtain is a subset of the desired points. For example, if I try to plot two columns from a table with 1000000 rows, I only see some of the data - i...

Data Engineering

Latest Reply

DavideCagnoni
Contributor

03-03-2022 12:07:53 AM

1 kudos

@Kaniz Fatma The problem is not about performance or plotly. It is about the pandas_on_spark dataframe arbitrarily subsampling the input data when plotting, without notifying the user about it. While subsampling is comprehensible and maybe even nece...
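
The subsampling described here is governed by two pandas-on-Spark options, `plotting.max_rows` and `plotting.sample_ratio`. The sketch below is not taken from the thread: the helper names are hypothetical, and which option applies to a given plot kind can differ between Spark versions, so both are set. The first function mirrors the documented rule for sample-based plots when no ratio is set; the second returns an option context that lifts both limits.

```python
def default_sample_fraction(n_rows, max_rows=1000):
    """Fraction of rows kept by sample-based plots when no ratio is set.

    Mirrors the documented rule: the ratio of `plotting.max_rows` (1000 by
    default) to the total row count, capped at 1.0.
    """
    if n_rows <= 0:
        return 1.0
    return min(1.0, max_rows / n_rows)


def plot_all_rows(ps_module, n_rows):
    """Return an option context under which plots are given every row.

    `ps_module` is expected to be `pyspark.pandas`; it is passed in so this
    sketch does not require Spark at import time.
    """
    return ps_module.option_context(
        "plotting.max_rows", max(1, int(n_rows)),  # limit for top-n based plots
        "plotting.sample_ratio", 1.0,              # ratio for sample-based plots
    )


# Hypothetical usage, assuming `psdf` has numeric columns "a" and "b":
#   import pyspark.pandas as ps
#   with plot_all_rows(ps, len(psdf)):
#       psdf.plot.scatter(x="a", y="b")
```

With the defaults, a 1000000-row frame gives `default_sample_fraction(1_000_000)` equal to 0.001, which matches seeing only a small part of the data. Plotting every row brings all plotted points to the driver, so this is only sensible when that fits in memory.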

3 More Replies
