
Pandas.spark.checkpoint() doesn't broke lineage

alejandrofm — Thu, 31 Mar 2022 14:39:01 GMT

Hi, I'm doing something simple in a Databricks notebook:

spark.sparkContext.setCheckpointDir("/tmp/")
 
import pyspark.pandas as ps
 
sql = ("""select
field1, field2
From table
Where date >= '2021-01-01'""")
 
df = ps.sql(sql)
df.spark.checkpoint()

That runs fine and saves the RDD to /tmp/. Then I want to save the df with

df.to_csv('/FileStore/tables/test.csv', index=False)

df1.spark.coalesce(1).to_csv('/FileStore/tables/test.csv', index=False)

And it runs the query again (first for the checkpoint, and then again to save the file).

What am I doing wrong? Currently, to work around this, I save the first DataFrame without a checkpoint, read it back, and save it again with coalesce.

If I use coalesce(1) directly, it doesn't parallelize.

EDIT:

Tried

df.spark.cache()

But it still reprocesses when I try to save to CSV. I'm looking to avoid reprocessing and to avoid saving twice. Thanks!

The question is: why does it recalculate df1 after the checkpoint?

Thanks!

Re: Pandas.spark.checkpoint() doesn't broke lineage

Hubert-Dudek — Thu, 31 Mar 2022 15:38:16 GMT

Please use localCheckpoint(True) so it will be stored on executors and trigger immediately.
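For a pandas-on-Spark DataFrame like the one in the question, the equivalent call appears to be `local_checkpoint` under the `.spark` accessor. The following is a minimal sketch, not a confirmed fix for this thread; the helper name `local_checkpoint_eagerly` is made up for illustration.

```python
def local_checkpoint_eagerly(psdf):
    """Return a locally checkpointed copy of a pandas-on-Spark DataFrame.

    `psdf` is assumed to be a pyspark.pandas DataFrame, for example the
    result of ps.sql(...). The helper name is hypothetical.
    """
    # local_checkpoint returns a NEW DataFrame with truncated lineage.
    # The input DataFrame keeps its original plan, so use the return value.
    return psdf.spark.local_checkpoint(eager=True)


# Usage (needs a running Spark session, e.g. a Databricks notebook):
#   import pyspark.pandas as ps
#   df = ps.sql("select field1, field2 from table")
#   df = local_checkpoint_eagerly(df)
```

A local checkpoint is kept on the executors rather than in the checkpoint directory, so it is faster but is lost if an executor goes away.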

Re: Pandas.spark.checkpoint() doesn't broke lineage

alejandrofm — Thu, 31 Mar 2022 15:44:02 GMT

@Hubert Dudek, no luck with that. How do you use it on a ps DataFrame?

Why do you think it doesn't work saving to DBFS?

Thanks!

Re: Pandas.spark.checkpoint() doesn't broke lineage

Hubert-Dudek — Mon, 04 Apr 2022 10:53:16 GMT

The path passed to to_csv should be a directory, not a file, because each partition is written as one file.
Try checkpoint(eager=True).
Run df.spark.explain() before and after checkpointing and compare the plans.
Checkpointing writes files to disk, which costs disk space and computation, and it removes the RDD from memory, so reading it back from disk then requires recomputation. I don't think it makes sense here. I have used checkpoint only once, with a UDF that made REST API calls and had to be executed at that exact point in the code. Databricks/Spark uses lazy evaluation and applies many optimizations to your code. Sometimes you need a checkpoint so that it does not optimize across that point.
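The explain-before-and-after suggestion could be sketched roughly as follows. This assumes `psdf` is a pandas-on-Spark DataFrame and that a checkpoint directory has already been set with `spark.sparkContext.setCheckpointDir(...)`; the helper name `checkpoint_with_plans` is invented for this example.

```python
def checkpoint_with_plans(psdf, eager=True):
    """Print the plan before and after checkpointing and return the new frame.

    Hypothetical helper: `psdf` is assumed to be a pyspark.pandas DataFrame.
    """
    print("Plan before checkpoint:")
    psdf.spark.explain()

    # checkpoint() does not modify psdf; it returns a new DataFrame whose
    # plan starts from the checkpointed data.
    checkpointed = psdf.spark.checkpoint(eager=eager)

    print("Plan after checkpoint:")
    checkpointed.spark.explain()
    return checkpointed
```

If the second plan still shows the full original query, the checkpoint is probably not the frame being used downstream.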

Re: Pandas.spark.checkpoint() doesn't broke lineage

alejandrofm — Mon, 04 Apr 2022 15:33:25 GMT

Hi, I'm repartitioning to 1 because it's easier and faster to move 1 file later than 10k files.

What I'm looking for is the possibility to use this or similar:

df.spark.checkpoint()

and later use df.head() or to_csv without recomputing, paying only the time it takes to merge all the already calculated partitions.

I thought eager defaulted to true; I will check on that. But from what I can see, it created the RDD file on disk but isn't using it: it recomputes the query.

Thanks!

Re: Pandas.spark.checkpoint() doesn't broke lineage

alejandrofm — Thu, 21 Apr 2022 14:25:38 GMT

Hi, back here. Any idea what approach I should take if I want to do something like:

df.head()

df.info

df.to_csv

and run the computation only once instead of three times?

Thanks!!!

Re: Pandas.spark.checkpoint() doesn't broke lineage

alejandrofm — Tue, 03 May 2022 12:20:29 GMT

Sorry for the bump, I still haven't found the proper way to do this.

Thanks!

Re: Pandas.spark.checkpoint() doesn't broke lineage

Hubert-Dudek — Tue, 03 May 2022 12:31:34 GMT

If you need checkpointing, please try the below code. Thanks to persist, you will avoid reprocessing:

df = ps.sql(sql).spark.persist()  # persist lives under the .spark accessor
df = df.spark.checkpoint()  # keep the returned, checkpointed DataFrame

Re: Pandas.spark.checkpoint() doesn't broke lineage

annafina — Thu, 21 Nov 2024 14:34:04 GMT

checkpoint() returns a checkpointed DataFrame, so you need to assign it to a new variable:

checkpointedDF = df.checkpoint()
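For the pandas-on-Spark DataFrame from the original question, the method sits under the `.spark` accessor, and the same rule applies: keep working with the returned object. Here is a rough sketch of computing once and then previewing and exporting. The helper name `checkpoint_preview_export` and its parameters are made up for illustration, and it assumes the checkpoint directory was set beforehand as in the first post.

```python
def checkpoint_preview_export(psdf, out_dir, preview_rows=5):
    """Checkpoint once, then preview and export from the checkpointed frame.

    Hypothetical helper: `psdf` is assumed to be a pyspark.pandas DataFrame
    and `out_dir` a directory path (to_csv writes part files into it).
    """
    # Assign the result: the original psdf still carries the full lineage.
    checkpointed = psdf.spark.checkpoint(eager=True)

    # Both actions below start from the checkpointed data.
    preview = checkpointed.head(preview_rows)

    # coalesce(1) merges the partitions so the directory gets one part file.
    checkpointed.spark.coalesce(1).to_csv(out_dir)
    return checkpointed, preview


# Usage (needs a running Spark session, e.g. a Databricks notebook):
#   spark.sparkContext.setCheckpointDir("/tmp/")
#   df = ps.sql(sql)
#   df, preview = checkpoint_preview_export(df, "/FileStore/tables/test_csv")
```

The output path in the usage comment is only an example of a directory name, not a path taken from the thread.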