<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Pandas.spark.checkpoint() doesn't broke lineage in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24060#M16683</link>
    <description>&lt;P&gt;Hi, I'm doing some something simple on Databricks notebook:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sparkContext.setCheckpointDir("/tmp/")
&amp;nbsp;
import pyspark.pandas as ps
&amp;nbsp;
sql=("""select 
field1, field2
From table
Where date&amp;gt;='2021-01.01""")
&amp;nbsp;
df = ps.sql(sql)
df.spark.checkpoint()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;That runs great, saves the rdd on /mp/ then I want to save the df with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.to_csv('/FileStore/tables/test.csv', index=False)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;or&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1.spark.coalesce(1).to_csv('/FileStore/tables/test.csv', index=False)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And it recalculates the query again (it first did it on the checkpoint and then again to save the file).&lt;/P&gt;&lt;P&gt;What i'm doing wrong? currently, to solve this I'm saving the first dataframe without checkpoint, opening again and saving with coalesce.&lt;/P&gt;&lt;P&gt;If I use the coalesce(1) directly it doesn't parallelize.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;EDIT:&lt;/P&gt;&lt;P&gt;Tried&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.spark.cache()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But still reprocesses when I try to save to CSV, I'm looking to avoid reprocessing and avoid saving twice. Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;the question is, why it recalculates df1 after the checkpoint?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Thu, 31 Mar 2022 14:39:01 GMT</pubDate>
    <dc:creator>alejandrofm</dc:creator>
    <dc:date>2022-03-31T14:39:01Z</dc:date>
    <item>
      <title>Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24060#M16683</link>
      <description>&lt;P&gt;Hi, I'm doing some something simple on Databricks notebook:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sparkContext.setCheckpointDir("/tmp/")
&amp;nbsp;
import pyspark.pandas as ps
&amp;nbsp;
sql=("""select 
field1, field2
From table
Where date&amp;gt;='2021-01.01""")
&amp;nbsp;
df = ps.sql(sql)
df.spark.checkpoint()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;That runs great, saves the rdd on /mp/ then I want to save the df with&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.to_csv('/FileStore/tables/test.csv', index=False)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;or&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1.spark.coalesce(1).to_csv('/FileStore/tables/test.csv', index=False)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And it recalculates the query again (it first did it on the checkpoint and then again to save the file).&lt;/P&gt;&lt;P&gt;What i'm doing wrong? currently, to solve this I'm saving the first dataframe without checkpoint, opening again and saving with coalesce.&lt;/P&gt;&lt;P&gt;If I use the coalesce(1) directly it doesn't parallelize.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;EDIT:&lt;/P&gt;&lt;P&gt;Tried&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.spark.cache()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;But still reprocesses when I try to save to CSV, I'm looking to avoid reprocessing and avoid saving twice. Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;the question is, why it recalculates df1 after the checkpoint?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 14:39:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24060#M16683</guid>
      <dc:creator>alejandrofm</dc:creator>
      <dc:date>2022-03-31T14:39:01Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24061#M16684</link>
      <description>&lt;P&gt;Please use localCheckpoint(True) so it will be stored on executors and trigger immediately.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 15:38:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24061#M16684</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-31T15:38:16Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24062#M16685</link>
      <description>&lt;P&gt;@Hubert Dudek​&amp;nbsp;, No luck with that, how do you use it on a ps dataframe?&lt;/P&gt;&lt;P&gt;Why do you think it doesn't work saving to DBFS?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 15:44:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24062#M16685</guid>
      <dc:creator>alejandrofm</dc:creator>
      <dc:date>2022-03-31T15:44:02Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24063#M16686</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;path should be directory in to_csv, not file as one file = 1 partition&lt;/LI&gt;&lt;LI&gt;try checkpoint(&lt;I&gt;eager=True&lt;/I&gt;)&lt;/LI&gt;&lt;LI&gt;use df.spark.explain() before and after checkpointing&lt;/LI&gt;&lt;LI&gt;checkpointing saving files to disk requires disk and computation and removes RDD from memory. So then, when you read it from disk, it requires recomputation. I think it doesn't make sense. I used checkpoint only once with some UDF function that made REST API calls and needed to have that executed in that place of code. Ddatabricks/spark using lazy evaluation and many optimizations of your code. Sometimes you need to do a checkpoint, so it will not do it optimized way.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2022 10:53:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24063#M16686</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-04T10:53:16Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24064#M16687</link>
      <description>&lt;P&gt;Hi, i'm repartitioning to 1 because it's easier and faster later to move 1 file instead of 10k files.&lt;/P&gt;&lt;P&gt;What I'm looking for is the possibility to use this or similar:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df.spark.checkpoint()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;and later use df.head() without recompute or to_csv without recompute, just the time it takes to merge al the calculated partitions.&lt;/P&gt;&lt;P&gt;Thought eager was default true, will check on that, but as what i'm looking it created the rdd file on disk but isn't using it, it recomputes the query.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 04 Apr 2022 15:33:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24064#M16687</guid>
      <dc:creator>alejandrofm</dc:creator>
      <dc:date>2022-04-04T15:33:25Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24065#M16688</link>
      <description>&lt;P&gt;Hi, back here, any idea of what approach I should take if I want to do something like:&lt;/P&gt;&lt;P&gt;df.head()&lt;/P&gt;&lt;P&gt;--&lt;/P&gt;&lt;P&gt;df.info&lt;/P&gt;&lt;P&gt;--&lt;/P&gt;&lt;P&gt;df.to_csv&lt;/P&gt;&lt;P&gt;and make the computation only once and not three times&lt;/P&gt;&lt;P&gt;Thanks!!!&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2022 14:25:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24065#M16688</guid>
      <dc:creator>alejandrofm</dc:creator>
      <dc:date>2022-04-21T14:25:38Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24066#M16689</link>
      <description>&lt;P&gt;Sorry for the bump, still don't find the proper way to do this.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Tue, 03 May 2022 12:20:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24066#M16689</guid>
      <dc:creator>alejandrofm</dc:creator>
      <dc:date>2022-05-03T12:20:29Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24067#M16690</link>
      <description>&lt;P&gt;If you need checkpointing, please try the below code. Thanks to persist, you will avoid reprocessing:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = ps.sql(sql).persist()
df.spark.checkpoint()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 03 May 2022 12:31:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/24067#M16690</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-05-03T12:31:34Z</dc:date>
    </item>
    <item>
      <title>Re: Pandas.spark.checkpoint() doesn't broke lineage</title>
      <link>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/99637#M40053</link>
      <description>&lt;P&gt;checkpoint() returns a checkpointed DataFrame, so you need to assign it to a new variable:&lt;/P&gt;&lt;P&gt;checkpointedDF = df.checkpoint()&lt;/P&gt;</description>
      <pubDate>Thu, 21 Nov 2024 14:34:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pandas-spark-checkpoint-doesn-t-broke-lineage/m-p/99637#M40053</guid>
      <dc:creator>annafina</dc:creator>
      <dc:date>2024-11-21T14:34:04Z</dc:date>
    </item>
  </channel>
</rss>

