<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/110083#M43485</link>
    <description>&lt;P&gt;Thank you so much for the suggestion.&lt;/P&gt;</description>
    <pubDate>Thu, 13 Feb 2025 07:14:28 GMT</pubDate>
    <dc:creator>CamdenJacobs</dc:creator>
    <dc:date>2025-02-13T07:14:28Z</dc:date>
    <item>
      <title>Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing</title>
      <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107287#M42757</link>
      <description>&lt;P&gt;I have created a notebook that builds three different gold-layer objects from a single silver table. All of these tables are processed incrementally. I want to develop failure handling for the case where the pipeline fails after loading only some of the records into the first table, or where one gold table loads successfully and the second fails.&lt;BR /&gt;&lt;BR /&gt;In that case, when re-running the pipeline from scratch, I don't want to insert already-inserted records into any of the gold tables again. How can I handle this type of scenario?&lt;/P&gt;</description>
      <pubDate>Mon, 27 Jan 2025 19:17:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107287#M42757</guid>
      <dc:creator>rrajan</dc:creator>
      <dc:date>2025-01-27T19:17:23Z</dc:date>
    </item>
    <item>
      <title>Re: Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing</title>
      <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107290#M42759</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;To handle the scenario where your pipeline fails after loading some records into the first gold table or if one gold table loads successfully while the second fails, you can implement a failure handling mechanism that ensures already inserted records are not reprocessed when the pipeline is re-run. Here are some steps you can follow:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Use Delta Lake for ACID Transactions&lt;/STRONG&gt;: Delta Lake provides ACID transactions, which can help ensure that your data is consistent and reliable. If a failure occurs, you can use Delta Lake's transaction log to identify which records have already been processed.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Implement Checkpoints&lt;/STRONG&gt;: Use checkpoints to save the state of your data processing at various stages. This way, if a failure occurs, you can restart the pipeline from the last successful checkpoint rather than from scratch.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Idempotent Writes&lt;/STRONG&gt;: Ensure that your write operations are idempotent. This means that re-running the same operation multiple times will not result in duplicate records. You can achieve this by using upsert operations (merge) instead of insert operations.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Delta Live Tables (DLT)&lt;/STRONG&gt;: Consider using Delta Live Tables, which provide built-in capabilities for handling incremental data processing and failure recovery. DLT can automatically manage the state of your data pipeline and ensure that only new or changed data is processed.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;SPAN&gt;&lt;STRONG&gt;Repair and Rerun&lt;/STRONG&gt;: Utilize the "Repair and Rerun" feature in Databricks jobs. This feature allows you to rerun only the tasks that were impacted by a failure, without reprocessing the entire pipeline. This can save time and resources. You can find more details about this feature in the Databricks blog post titled "Save Time and Money on Data and ML Workflows With 'Repair and Rerun'".&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
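The idempotent-write idea in step 3 can be sketched outside Spark as a plain-Python merge keyed on a business key. In a real notebook this would be a Delta Lake MERGE INTO; the key and column names (`id`, `amount`) are illustrative assumptions, not from the thread:

```python
# Hypothetical sketch of an idempotent upsert (the effect of Delta's MERGE INTO).
# Re-running the same batch leaves the gold table unchanged, so a retried
# pipeline cannot create duplicates. Column and key names are illustrative only.

def upsert(gold, batch, key="id"):
    """Merge batch rows into gold keyed on `key`; matching keys are updated."""
    merged = {row[key]: row for row in gold}
    for row in batch:
        merged[row[key]] = row          # insert new key or overwrite existing
    return list(merged.values())

gold = [{"id": 1, "amount": 10}]
batch = [{"id": 1, "amount": 12}, {"id": 2, "amount": 7}]

once = upsert(gold, batch)
twice = upsert(once, batch)             # simulate a retry after a failure
assert once == twice                    # idempotent: the retry adds nothing new
```

Because the merge is keyed rather than append-only, re-running the failed notebook simply converges on the same gold state instead of duplicating rows.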
      <pubDate>Mon, 27 Jan 2025 20:56:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107290#M42759</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-01-27T20:56:00Z</dc:date>
    </item>
    <item>
      <title>Re: Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing</title>
      <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107369#M42785</link>
      <description>&lt;P&gt;&lt;FONT size="3"&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/138469"&gt;@rrajan&lt;/a&gt;&amp;nbsp;,&lt;/FONT&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;FONT size="3"&gt;The simplest solution is to check the max timestamp in each gold table when processing incrementally to get source data. Here's how you can handle this (this would be your source in MERGE statement with every rum:&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT * FROM silver AS s
WHERE s.last_update_timestamp 
   &amp;gt; (SELECT MAX(last_update_timestamp) FROM gold_table_1)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;Similarly for the other gold tables.&lt;/FONT&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;This way:&lt;/FONT&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;&lt;FONT size="3"&gt;If one gold table fails, other tables' timestamps remain unchanged&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;On retry, you'll only process records newer than what's already in each gold table&lt;/FONT&gt;&lt;/LI&gt;&lt;LI&gt;&lt;FONT size="3"&gt;No need for additional tracking tables or complex logic&lt;/FONT&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;For more robust pipelines, you could consider:&lt;/FONT&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;FONT size="3"&gt;1. A separate metadata table to avoid scanning full gold tables for timestamps&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;CREATE TABLE metadata_control (
    table_name STRING,
    last_processed_timestamp TIMESTAMP
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT size="3"&gt;In this scenario,&amp;nbsp;when loading each gold table, you filter your silver source data based on the last processed timestamp for that specific gold table:&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT * FROM silver 
WHERE last_update_timestamp &amp;gt; (
    SELECT last_processed_timestamp 
    FROM metadata_control 
    WHERE table_name = 'gold_table_1'
)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After successful processing of each gold table, you update its timestamp in the metadata table. This ensures that if a later table fails, you won't reprocess already loaded data on retry.&lt;/P&gt;&lt;P&gt;2.&amp;nbsp;You could also consider using Structured Streaming with separate checkpoints for each gold table as an alternative approach. This provides automatic failure handling and exactly-once guarantees. See the Databricks documentation on streaming writes for details: &lt;A href="https://docs.databricks.com/en/structured-streaming/delta-lake.html" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/structured-streaming/delta-lake.html&lt;/A&gt;&lt;/P&gt;</description>
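The per-table watermark pattern behind the `metadata_control` table can be sketched in plain Python. The real implementation would be a Delta table updated at the end of each gold load; the record layout and timestamps below are assumptions for illustration:

```python
# Hypothetical sketch of the per-gold-table watermark pattern described above.
# Each gold table consumes only silver rows newer than its own watermark, and
# the watermark advances only after that table's load succeeds, so a failed
# table simply re-reads the same window on retry without duplicating others.

watermarks = {"gold_table_1": 0, "gold_table_2": 0}   # metadata_control stand-in

def incremental_batch(silver, table):
    """Silver rows strictly newer than the table's last processed timestamp."""
    return [r for r in silver if r["ts"] > watermarks[table]]

def load(table, silver, sink):
    batch = incremental_batch(silver, table)
    sink.extend(batch)                  # stands in for the actual gold write
    if batch:
        watermarks[table] = max(r["ts"] for r in batch)

silver = [{"ts": 1, "v": "a"}, {"ts": 2, "v": "b"}]
gold_1 = []
load("gold_table_1", silver, gold_1)    # first run loads both rows
load("gold_table_1", silver, gold_1)    # retry loads nothing: no duplicates
```

Because each table owns its watermark, a failure in gold_table_2 never causes gold_table_1 to reprocess rows it already committed.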
      <pubDate>Tue, 28 Jan 2025 08:46:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107369#M42785</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2025-01-28T08:46:37Z</dc:date>
    </item>
    <item>
      <title>Re: Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing</title>
      <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107373#M42787</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/88823"&gt;@Walter_C&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;Thanks for your suggestion.&lt;BR /&gt;&lt;BR /&gt;Can you help me with all the possible failure scenarios that need to be handled while doing an incremental load? The gold tables are Delta only. The data coming into the gold tables is from a DLT silver table. We are not supposed to use DLT in the gold layer because of the complex transformations. We have created a single PySpark notebook that reads the silver table, extracts the incremental data from it based on the last run timestamp, performs the transformations on the dataframe, and then loads all three final objects. We have a single task, so a repair run is not possible here.&lt;/P&gt;</description>
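With everything in one notebook task, one common pattern (a sketch under assumptions; the thread shows no actual code, and the loader names are hypothetical) is to wrap each gold load in its own try/except and record which loads succeeded, so a re-run skips tables that already committed:

```python
# Hypothetical sketch: load each gold table independently inside one notebook,
# recording which loads succeeded so a re-run only touches the failed ones.
# The loader functions and table names are illustrative, not from the thread.

def run_gold_loads(loaders, last_status=None):
    """loaders: dict mapping table name to a zero-arg load function.
    Skips tables already marked 'ok' in a previous run; returns new status."""
    status = dict(last_status or {})
    for table, load in loaders.items():
        if status.get(table) == "ok":
            continue                      # already loaded before the failure
        try:
            load()
            status[table] = "ok"
        except Exception as exc:
            status[table] = f"failed: {exc}"
    return status

calls = []
loaders = {
    "gold_1": lambda: calls.append("gold_1"),
    "gold_2": lambda: (_ for _ in ()).throw(RuntimeError("boom")),
}
first = run_gold_loads(loaders)           # gold_1 ok, gold_2 failed
loaders["gold_2"] = lambda: calls.append("gold_2")
second = run_gold_loads(loaders, first)   # retry runs only gold_2
```

Persisting the status dict (for example to a small Delta table) gives a single-task notebook the same "rerun only what failed" behavior that multi-task repair runs provide.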
      <pubDate>Tue, 28 Jan 2025 09:06:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/107373#M42787</guid>
      <dc:creator>rrajan</dc:creator>
      <dc:date>2025-01-28T09:06:07Z</dc:date>
    </item>
    <item>
      <title>Re: Urgent Help Needed - Databricks Notebook Failure Handle for Incremental Processing</title>
      <link>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/110083#M43485</link>
      <description>&lt;P&gt;Thank you so much for the suggestion.&lt;/P&gt;</description>
      <pubDate>Thu, 13 Feb 2025 07:14:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/urgent-help-needed-databricks-notebook-failure-handle-for/m-p/110083#M43485</guid>
      <dc:creator>CamdenJacobs</dc:creator>
      <dc:date>2025-02-13T07:14:28Z</dc:date>
    </item>
  </channel>
</rss>

