<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Dataframe rows missing after write_to_delta and read_from_delta in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24112#M16735</link>
    <description>&lt;P&gt;Hi @mime liu, do you have any error message other than the one reported?&lt;/P&gt;</description>
    <pubDate>Thu, 03 Nov 2022 06:11:12 GMT</pubDate>
    <dc:creator>Debayan</dc:creator>
    <dc:date>2022-11-03T06:11:12Z</dc:date>
    <item>
      <title>Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24110#M16733</link>
      <description>&lt;P&gt;Hi, I am trying to load MongoDB data into S3 as Parquet using PySpark 3.1.1.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My code looks like this:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df = spark \&lt;/P&gt;&lt;P&gt;    .read \&lt;/P&gt;&lt;P&gt;    .format("mongo") \&lt;/P&gt;&lt;P&gt;    .options(**read_options) \&lt;/P&gt;&lt;P&gt;    .load(schema=schema)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df = df.coalesce(64)&lt;/P&gt;&lt;P&gt;write_df_to_delta(spark, df, s3_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;read_count = df.count()&lt;/P&gt;&lt;P&gt;inserted_df = read_delta_to_df(spark, s3_path)&lt;/P&gt;&lt;P&gt;inserted_count = inserted_df.count()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The SparkSession, MongoDB connection, and S3 path are all configured correctly. However, read_count and inserted_count do not match: there is a gap of around 300-1200 rows, even though the write to Delta did not report any error. Why is this the case, and what is causing it?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I can see from Rancher: 'read_count': 1373432, 'inserted_count': 1372492&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;def read_delta_to_df&lt;/B&gt;(&lt;/P&gt;&lt;P&gt;    spark: SparkSession,&lt;/P&gt;&lt;P&gt;    s3_path: str&lt;/P&gt;&lt;P&gt;) -&amp;gt; DataFrame:&lt;/P&gt;&lt;P&gt;    log.info("Reading delta table from path {} to df".format(s3_path))&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;    df = spark \&lt;/P&gt;&lt;P&gt;        .read \&lt;/P&gt;&lt;P&gt;        .format("delta") \&lt;/P&gt;&lt;P&gt;        .load(s3_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;    return df&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;def write_df_to_delta&lt;/B&gt;(&lt;/P&gt;&lt;P&gt;    spark: SparkSession,&lt;/P&gt;&lt;P&gt;    df: DataFrame,&lt;/P&gt;&lt;P&gt;    s3_path: str,&lt;/P&gt;&lt;P&gt;    mode: Optional[str] = "overwrite",&lt;/P&gt;&lt;P&gt;    partition_by: Optional[Union[str, List[str]]] = None,&lt;/P&gt;&lt;P&gt;    retention: Optional[int] = 0&lt;/P&gt;&lt;P&gt;) -&amp;gt; None:&lt;/P&gt;&lt;P&gt;    log.info("Writing df to delta table, {}".format(s3_path))&lt;/P&gt;&lt;P&gt;    df.printSchema()&lt;/P&gt;&lt;P&gt;    try:&lt;/P&gt;&lt;P&gt;        df \&lt;/P&gt;&lt;P&gt;            .write \&lt;/P&gt;&lt;P&gt;            .format("delta") \&lt;/P&gt;&lt;P&gt;            .mode(mode) \&lt;/P&gt;&lt;P&gt;            .option("overwriteSchema", "true") \&lt;/P&gt;&lt;P&gt;            .save(&lt;/P&gt;&lt;P&gt;                path=s3_path,&lt;/P&gt;&lt;P&gt;                partitionBy=partition_by)&lt;/P&gt;&lt;P&gt;    except Exception as e:&lt;/P&gt;&lt;P&gt;        log.error(f"error occurred with error msg: {e}")&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 01:46:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24110#M16733</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2022-11-03T01:46:44Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24111#M16734</link>
      <description>&lt;P&gt;In general, avoiding rm on Delta tables is a good idea. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and recreate a table within a short time, different versions of the transaction log can flicker in and out of existence.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table, you can run:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df.write.format("delta").mode("overwrite").save("/delta/events")&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 06:01:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24111#M16734</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-11-03T06:01:21Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24113#M16736</link>
      <description>&lt;P&gt;The code is correct. The only problem I can imagine is that something is left over at s3_path (such as a stray partition). I think it would be better to register the Delta table in the metastore and use .write.saveAsTable("table_name") instead of writing to the path.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 09:07:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24113#M16736</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-11-03T09:07:30Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24114#M16737</link>
      <description>&lt;P&gt;Hi Debayan, no, no error was reported throughout.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 22:42:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24114#M16737</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2022-11-03T22:42:33Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24115#M16738</link>
      <description>&lt;P&gt;Hi @May Olszewski, thanks for replying. The mode I used was already "overwrite"; sorry, I forgot to include it in the demo code above because it is predefined. Any other suggestions? I also vacuumed that directory before writing the new Delta table into it.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 22:44:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24115#M16738</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2022-11-03T22:44:29Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24116#M16739</link>
      <description>&lt;P&gt;Hi @Hubert Dudek, thanks for the reply. Yes, that may be worth trying. I am also considering removing format("delta") to see whether the issue persists, which would show whether this is a Delta-related issue.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 22:45:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24116#M16739</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2022-11-03T22:45:32Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24117#M16740</link>
      <description>&lt;P&gt;I still haven't found an answer to this; I just got back from holiday. I will keep digging, and if I find the cause I will post an update here.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Nov 2022 09:20:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24117#M16740</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2022-11-29T09:20:33Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24118#M16741</link>
      <description>&lt;P&gt;So I think I have solved the mystery here &lt;span class="lia-unicode-emoji" title=":grinning_face:"&gt;😀&lt;/span&gt;. It had to do with the retention config. With retentionEnabled set to True and the retention hours set to 0, we lost a few rows from the first file because they were mistaken for files from the last session and were vacuumed. For further reading, see: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/kb/delta/data-missing-vacuum-parallel-write" target="test_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/kb/delta/data-missing-vacuum-parallel-write&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 27 Jan 2023 05:45:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24118#M16741</guid>
      <dc:creator>mimezzz</dc:creator>
      <dc:date>2023-01-27T05:45:26Z</dc:date>
    </item>
    <item>
      <title>Re: Dataframe rows missing after write_to_delta and read_from_delta</title>
      <link>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24112#M16735</link>
      <description>&lt;P&gt;Hi @mime liu, do you have any error message other than the one reported?&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 06:11:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dataframe-rows-missing-after-write-to-delta-and-read-from-delta/m-p/24112#M16735</guid>
      <dc:creator>Debayan</dc:creator>
      <dc:date>2022-11-03T06:11:12Z</dc:date>
    </item>
  </channel>
</rss>

