<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Trigger bad records in databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</link>
    <description>&lt;P&gt;It helps :). Thank you.&lt;/P&gt;&lt;P&gt;I have two questions to clarify and possibly optimize.&lt;/P&gt;&lt;P&gt;1) Since I write the DataFrame to a table later, I'm wondering whether that causes another full evaluation of the DataFrame. In that case there would be two full evaluations, one triggered by&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P&gt;and another one by writing to the table. Is that correct?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Does writing to a table trigger a full evaluation of a DataFrame?&lt;/P&gt;&lt;P&gt;I use the following command for writing:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")&lt;/LI-CODE&gt;&lt;P&gt;I could restructure the control flow to first write the DataFrame to a table and then use the written bad records.&lt;/P&gt;</description>
    <pubDate>Wed, 07 May 2025 11:39:07 GMT</pubDate>
    <dc:creator>bjn</dc:creator>
    <dc:date>2025-05-07T11:39:07Z</dc:date>
    <item>
      <title>Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118083#M45601</link>
      <description>&lt;P&gt;I use bad records while reading a CSV as follows (the read is completed with a load call on the input path):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load(input_path)
)&lt;/LI-CODE&gt;&lt;DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Since bad records are not written immediately,&lt;STRONG&gt; I want to know how I can trigger the write of them efficiently.&amp;nbsp;&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Currently, I use df.collect() to trigger the bad records write, which causes massive overhead and even out-of-memory problems, which is not acceptable.&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 07 May 2025 08:37:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118083#M45601</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-07T08:37:32Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118121#M45605</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/163612"&gt;@bjn&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;If I understand you correctly:&lt;BR /&gt;&lt;BR /&gt;To efficiently trigger the writing of bad records captured via &lt;SPAN class=""&gt;.option("badRecordsPath", ...)&lt;/SPAN&gt; &lt;SPAN class=""&gt;without causing memory overhead or driver-side issues&lt;/SPAN&gt;, the best option is:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P class=""&gt;This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It’s more efficient than using &lt;SPAN class=""&gt;.collect()&lt;/SPAN&gt; or even &lt;SPAN class=""&gt;.count()&lt;/SPAN&gt;, since it avoids aggregations and doesn’t load data into the driver.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Hope this helps, &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 10:28:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118121#M45605</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-05-07T10:28:50Z</dc:date>
    </item>
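The "noop" pattern described in the answer above can be sketched as follows. This is a minimal illustration, not the poster's exact pipeline: `input_path`, `bad_records_path`, and the helper names are hypothetical, and a live Spark session is assumed to be available (as on a Databricks cluster).

```python
def reader_options(bad_records_path):
    """Options for a CSV read that captures malformed rows via badRecordsPath.

    Pure helper (no Spark needed) so the option set can be inspected/tested.
    """
    return {"badRecordsPath": bad_records_path, "header": "true"}


def trigger_bad_records(spark, schema, input_path, bad_records_path):
    """Read the CSV, then force a full evaluation with a noop write.

    The noop sink (built into Spark 3+) evaluates every partition, which
    flushes bad records to badRecordsPath, but writes no output data, so it
    is cheaper than df.collect() or df.count().
    """
    df = (
        spark.read.format("csv")
        .schema(schema)
        .options(**reader_options(bad_records_path))
        .load(input_path)
    )
    df.write.format("noop").mode("overwrite").save()
    return df
```

The key design point, per the answer: any full action materializes the DataFrame, but `noop` does so without driver-side collection or an aggregation pass.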
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</link>
      <description>&lt;P&gt;It helps :). Thank you.&lt;/P&gt;&lt;P&gt;I have two questions to clarify and possibly optimize.&lt;/P&gt;&lt;P&gt;1) Since I write the data frame to a table later, I'm wondering if there is again a full evaluation of the DataFrame. Consequently, there are two full evaluations, one triggered by&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P&gt;and another one by writing to the table. Is that correct?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Does writing to table trigger a full evaluation of a data frame?&lt;/P&gt;&lt;P&gt;I use the following command for writing:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")&lt;/LI-CODE&gt;&lt;P&gt;I could rewrite the control flow to first write data frame to a table and then use the written bad records.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 11:39:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-07T11:39:07Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118145#M45612</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/163612"&gt;@bjn&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;1) &lt;/STRONG&gt;&lt;/H3&gt;&lt;P class=""&gt;Yes, if you run both &lt;SPAN class=""&gt;df.write.format("noop")...&lt;/SPAN&gt; &lt;SPAN class=""&gt;and&lt;/SPAN&gt; &lt;SPAN class=""&gt;df.write.format("delta").saveAsTable(...)&lt;/SPAN&gt;, you’re triggering &lt;STRONG&gt;&lt;SPAN class=""&gt;two separate actions&lt;/SPAN&gt;&lt;/STRONG&gt;, and Spark will &lt;SPAN class=""&gt;evaluate the DataFrame twice&lt;/SPAN&gt;. That includes parsing the CSV and, importantly, processing bad records each time.&lt;/P&gt;&lt;P class=""&gt;So you’re right to be cautious.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;2) &lt;/STRONG&gt;&lt;/H3&gt;&lt;P class=""&gt;Yes, writing to a table &lt;SPAN class=""&gt;is a full action&lt;/SPAN&gt;. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad record handling. So &lt;SPAN class=""&gt;this alone is sufficient to force the write to badRecordsPath&lt;/SPAN&gt;; there’s no need to call &lt;SPAN class=""&gt;.write.format("noop")&lt;/SPAN&gt; beforehand.&lt;BR /&gt;&lt;BR /&gt;Best Regards,&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 12:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118145#M45612</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-05-07T12:06:40Z</dc:date>
    </item>
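The advice above (a table write is itself a full action, so no preliminary noop pass is needed) can be sketched like this. Names are hypothetical placeholders; a Spark session with Delta support is assumed, and nothing here is run at import time.

```python
def write_table(df, table_name):
    """Write the DataFrame to a Delta table.

    Per the answer above, saveAsTable is a full Spark action: every input
    record is parsed, so badRecordsPath output is produced by this call
    alone, with no separate noop evaluation beforehand.
    """
    (
        df.write.format("delta")
        .option("optimizeWrite", "true")
        .mode("overwrite")
        .saveAsTable(table_name)
    )


def needs_noop_first(writes_to_table):
    """Tiny decision helper capturing the rule from the discussion:

    an extra noop evaluation is only useful when the pipeline does NOT
    already perform a full action such as a table write.
    """
    return not writes_to_table
```

The practical consequence, as the thread concludes: restructure the control flow so the table write happens first, then read the bad records it flushed.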
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119758#M45965</link>
      <description>&lt;LI-CODE lang="markup"&gt;    data_frame.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable(table_name)&lt;/LI-CODE&gt;&lt;P&gt;doesn't trigger a bad records write. How is that possible?&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 14:14:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119758#M45965</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-20T14:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119773#M45967</link>
      <description>&lt;P&gt;I found out why the code didn't trigger the bad records write: I had emptied the bad records folder. After fixing that, it works. Thanks for the help Isi &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;    data_frame.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable(table_name)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 14:25:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119773#M45967</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-20T14:25:10Z</dc:date>
    </item>
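Once the table write has flushed bad records (the resolution reached above), they can be inspected from storage. A sketch, assuming the layout Databricks documents for badRecordsPath, where records land under a timestamped subdirectory as JSON; path names here are hypothetical.

```python
def bad_records_glob(bad_records_path):
    """Glob matching bad-record JSON files.

    Databricks writes them as <badRecordsPath>/<timestamp>/bad_records/part-*,
    so a wildcard over the timestamp level picks up every run.
    Pure string helper, testable without Spark.
    """
    return bad_records_path.rstrip("/") + "/*/bad_records/*"


def load_bad_records(spark, bad_records_path):
    """Read captured bad records back into a DataFrame.

    Each JSON record includes the source file path, the raw record text,
    and the reason parsing failed, which is useful for triage after the
    main table write has run.
    """
    return spark.read.json(bad_records_glob(bad_records_path))
```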
  </channel>
</rss>

