<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Looking for Advice: Robust Backup Strategy for Databricks System Tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137674#M50792</link>
    <description>&lt;P&gt;Great, thank you for your robust explanation and suggestions. Just wondering—can external tables be supported by LDP? For example, can we use &lt;STRONG&gt;CREATE OR REFRESH STREAMING TABLE&lt;/STRONG&gt; with an external location or table like &lt;STRONG&gt;"..."&lt;/STRONG&gt;?&lt;/P&gt;</description>
    <pubDate>Wed, 05 Nov 2025 05:00:59 GMT</pubDate>
    <dc:creator>Danish11052000</dc:creator>
    <dc:date>2025-11-05T05:00:59Z</dc:date>
    <item>
      <title>Looking for Advice: Robust Backup Strategy for Databricks System Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137620#M50777</link>
      <description>&lt;P&gt;HI,&lt;/P&gt;&lt;P&gt;I’m planning to &lt;STRONG&gt;build a backup system for all Databricks system tables (audit, usage, price, history, etc.)&lt;/STRONG&gt; to preserve data beyond retention limits. Currently, I’m using S&lt;STRONG&gt;park Streaming with readStream + writeStrea&lt;/STRONG&gt;m and &lt;STRONG&gt;checkpointing&lt;/STRONG&gt; in LakeFlow jobs for durability.&lt;/P&gt;&lt;P&gt;I recently noticed the &lt;STRONG&gt;CREATE OR REFRESH STREAMING TABLE&lt;/STRONG&gt; feature in declarative Lakeflow pipelines and I’m wondering:&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":backhand_index_pointing_right:"&gt;👉&lt;/span&gt; Would it better switch to a &lt;STRONG&gt;declarative pipeline&lt;/STRONG&gt; for automatic checkpointing and simpler setup?&lt;BR /&gt;&lt;span class="lia-unicode-emoji" title=":backhand_index_pointing_right:"&gt;👉&lt;/span&gt; Or stick with &lt;STRONG&gt;manual read/write streams&lt;/STRONG&gt; for more control?&lt;/P&gt;&lt;P&gt;Would love to hear thoughts on which approach is more &lt;STRONG&gt;robust and maintainable&lt;/STRONG&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 04 Nov 2025 17:56:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137620#M50777</guid>
      <dc:creator>Danish11052000</dc:creator>
      <dc:date>2025-11-04T17:56:03Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Advice: Robust Backup Strategy for Databricks System Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137626#M50781</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/188867"&gt;@Danish11052000&lt;/a&gt;&amp;nbsp;, here’s a pragmatic way to choose, based on the nature of Databricks system tables and the guarantees you want.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 class="paragraph"&gt;Bottom line&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;For ongoing replication to preserve data beyond free retention, a &lt;STRONG&gt;Lakeflow Declarative Pipeline with Streaming Tables&lt;/STRONG&gt; is the most robust and maintainable default—use the Python interface so you can set the required &lt;CODE&gt;skipChangeCommits&lt;/CODE&gt; option when reading system tables. This gives you automatic checkpointing, scheduling, lifecycle management, lineage, and guardrails with far less code to maintain.&lt;/LI&gt;
&lt;LI class="paragraph"&gt;Keep &lt;STRONG&gt;manual Structured Streaming (readStream/writeStream)&lt;/STRONG&gt; for edge cases where you need knobs LDP doesn’t offer (custom triggers not supported for system tables via Delta Sharing, bespoke failure handling, complex multi-sink writes, custom exactly-once semantics, or non-append processing).&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Why declarative pipelines are strong for system tables backup&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;&lt;STRONG&gt;Automatic checkpointing/state and managed maintenance.&lt;/STRONG&gt; Streaming tables are stateful and optimized for incremental append; Lakeflow Declarative Pipelines manage updates, maintenance, and optimization tasks on your behalf, reducing operational toil.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Simple scheduling.&lt;/STRONG&gt; You can declare refresh schedules (&lt;CODE&gt;SCHEDULE EVERY ...&lt;/CODE&gt; or &lt;CODE&gt;TRIGGER ON UPDATE&lt;/CODE&gt;) directly on streaming tables; creation/refresh are run by serverless Lakeflow pipelines rather than warehouse compute.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Security/governance aligned.&lt;/STRONG&gt; Unity Catalog governs access; pipelines publish streaming tables into UC with lineage visible in Catalog Explorer, which is helpful for auditability of your backups.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Clear limitations that fit the “replicate append-only” use case.&lt;/STRONG&gt; Streaming tables are designed for append-only sources and don’t support schema-evolving DML on the target; owners refresh; ALTER is restricted—perfect for a write-once backup pattern.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;System tables specifics that drive the design&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Many system tables support streaming (for example, &lt;CODE&gt;system.access.audit&lt;/CODE&gt;, &lt;CODE&gt;system.billing.usage&lt;/CODE&gt;, job/cluster timelines), but some don’t (for example, &lt;CODE&gt;system.billing.list_prices&lt;/CODE&gt; and &lt;CODE&gt;system.query.history&lt;/CODE&gt;). Plan per-table: streaming ingestion for supported ones; periodic batch materialization for the rest.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;When streaming system tables, you must set &lt;STRONG&gt;&lt;CODE&gt;skipChangeCommits = true&lt;/CODE&gt;&lt;/STRONG&gt; because Databricks may publish deletions/updates in system tables; this avoids stream disruption. Also, with Delta Sharing sources, &lt;STRONG&gt;&lt;CODE&gt;Trigger.AvailableNow&lt;/CODE&gt; is not supported&lt;/STRONG&gt; (it becomes Trigger.Once), so rely on scheduled refreshes instead.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;The &lt;STRONG&gt;free retention period&lt;/STRONG&gt; varies by table (commonly 365 days, but some are 180/90/30 days or indefinite). Your backup target will preserve data beyond those windows as you continuously append, which is exactly why replication is valuable.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;System tables can &lt;STRONG&gt;add new columns any time&lt;/STRONG&gt;; avoid brittle fixed schemas—use &lt;CODE&gt;SELECT * BY NAME&lt;/CODE&gt; semantics and enable schema evolution on the target. This favors declarative ingestion where you don’t hand-code schema mappings.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H3 class="paragraph"&gt;Recommended architecture (declarative first, with Python)&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;1. &lt;STRONG&gt;Bronze backup streaming tables (append-only) for streaming-supported system tables&lt;/STRONG&gt;&lt;BR /&gt;Define each source in Python so you can set &lt;CODE&gt;skipChangeCommits&lt;/CODE&gt;, then create a streaming table with clustering/partitioning on date/workspace for efficient queries.&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE class="markdown-code-python"&gt;   from pyspark import pipelines as dp

   @dp.table(name="backup_billing_usage")
   def read_usage():
       return (spark.readStream
               .option("skipChangeCommits", "true")
               .table("system.billing.usage"))

   # Create the target streaming table and add schedule
   dp.create_streaming_table(
       name="backup_billing_usage",
       table_properties={
           "delta.appendOnly": "true",    # defensive: no deletes on target
           "pipelines.channel": "current" # production channel
       }
   )
   &lt;/CODE&gt;&lt;/PRE&gt;
&lt;DIV class="paragraph"&gt;If you prefer SQL for the table definition and scheduling, still read via Python to apply &lt;CODE&gt;skipChangeCommits&lt;/CODE&gt;, then use &lt;CODE&gt;CREATE OR REFRESH STREAMING TABLE ... SCHEDULE EVERY 1 hour&lt;/CODE&gt; on the target definition to declare the refresh cadence.&lt;/DIV&gt;
&lt;OL start="2"&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Materialize non-streaming system tables periodically&lt;/STRONG&gt;&lt;BR /&gt;For example, &lt;CODE&gt;system.billing.list_prices&lt;/CODE&gt; (indefinite) and &lt;CODE&gt;system.query.history&lt;/CODE&gt; (non-streaming) should be copied via:
&lt;UL&gt;
&lt;LI&gt;A &lt;STRONG&gt;materialized view&lt;/STRONG&gt; with &lt;CODE&gt;SCHEDULE&lt;/CODE&gt; for incremental refresh, or&lt;/LI&gt;
&lt;LI&gt;A &lt;STRONG&gt;CREATE FLOW ... INSERT INTO ONCE&lt;/STRONG&gt; backfill plus a periodic job to append deltas.&lt;BR /&gt;The query-history doc even provides a sample job pattern for periodic materialization.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Initial backfill flows&lt;/STRONG&gt;&lt;BR /&gt;Use &lt;CODE&gt;CREATE FLOW ... INSERT INTO ONCE&lt;/CODE&gt; to seed historical ranges if you need to segment or parallelize the catch-up. The ONCE flow re-runs on a full pipeline refresh, which is useful if you ever need to rebuild the target cleanly.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Schema evolution and expectations&lt;/STRONG&gt;&lt;BR /&gt;Since columns can be added anytime in system tables, prefer &lt;CODE&gt;BY NAME&lt;/CODE&gt; semantics and add table-level expectations only for critical quality constraints (drop row or fail update). Manage them via &lt;CODE&gt;CREATE OR REFRESH STREAMING TABLE&lt;/CODE&gt; properties; alter via &lt;CODE&gt;ALTER STREAMING TABLE&lt;/CODE&gt; as needed.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Operational guardrails&lt;/STRONG&gt;&lt;BR /&gt;
&lt;UL&gt;
&lt;LI&gt;Avoid full refreshes on streaming tables backed by sources with limited retention (like message buses or system tables) because it will truncate and reprocess only the still-available window and you can’t recover older records from the source.&lt;/LI&gt;
&lt;LI&gt;Use the &lt;STRONG&gt;current&lt;/STRONG&gt; runtime channel for production; preview channel only for testing new features.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;HR /&gt;
&lt;H3 class="paragraph"&gt;When to stick with manual readStream/writeStream&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;Use manual Structured Streaming if you need any of the following:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;&lt;STRONG&gt;Custom triggers&lt;/STRONG&gt; and backpressure strategies not available in LDP for Delta Sharing sources (for example, nuanced use of Trigger.AvailableNow beyond its conversion to Once).&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Per-microbatch logic&lt;/STRONG&gt; (&lt;CODE&gt;foreachBatch&lt;/CODE&gt;), complex dedup keyed by composite IDs, multi-target writes from one stream, or &lt;STRONG&gt;fine-grained checkpoint location control&lt;/STRONG&gt; and custom failure handling/retries that exceed pipeline defaults.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;&lt;STRONG&gt;Specialized CDC semantics&lt;/STRONG&gt; beyond AUTO CDC flows, or patterns that materially conflict with streaming-table limitations (for example, frequent schema-changing DML on the target).&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
&lt;H3 class="paragraph"&gt;Decision guide Choose declarative streaming tables if:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Your goal is ongoing append-only replication for durability beyond retention, with simple schedules and minimal ops overhead.&lt;/LI&gt;
&lt;LI class="paragraph"&gt;You can read in Python to apply &lt;CODE&gt;skipChangeCommits&lt;/CODE&gt; and accept streaming-table limitations (no ALTER TABLE, owner refresh, no DML that changes schema).&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;Choose manual streaming if:&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;You need the advanced controls above and are comfortable owning the code/ops (monitoring, restarts, checkpoint paths, backpressure, and custom error handling).&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;In practice, many teams use a hybrid: LDP streaming tables for the bulk of append replication and small pockets of Structured Streaming where special control is required.&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H3 class="paragraph"&gt;Example SQL schedule for a streaming-table refresh&lt;/H3&gt;
&lt;DIV class="paragraph"&gt;If you’ve defined the table (via Python function as above), you can attach a schedule with SQL: &lt;CODE&gt;sql
CREATE OR REFRESH STREAMING TABLE backup_billing_usage
SCHEDULE EVERY 1 hour
AS SELECT * FROM STREAM(backup_billing_usage);
&lt;/CODE&gt;&lt;/DIV&gt;
&lt;HR /&gt;
&lt;H3 class="paragraph"&gt;Final recommendations for your backup system&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Default to &lt;STRONG&gt;Lakeflow Declarative Pipelines + Streaming Tables&lt;/STRONG&gt;, and read system tables via Python to set &lt;CODE&gt;skipChangeCommits&lt;/CODE&gt;. This yields simple, durable, and low-touch replication with automatic checkpointing. Set refresh schedules per table and use ONCE flows for initial backfills where needed.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;For tables without streaming support (like &lt;CODE&gt;system.query.history&lt;/CODE&gt;), set up &lt;STRONG&gt;materialized views or periodic jobs&lt;/STRONG&gt; to copy new rows; don’t leave them out just because they’re non-streaming.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Partition/cluster backup targets by &lt;CODE&gt;event_date&lt;/CODE&gt; and &lt;CODE&gt;workspace_id&lt;/CODE&gt;, mark them append-only, and avoid full refreshes on short-retention sources.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;Expect schema evolution—prefer &lt;CODE&gt;SELECT * BY NAME&lt;/CODE&gt; and avoid strict schemas for system tables to keep pipelines resilient to new columns.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;I trust these suggestions help you form an opinion that best suits your situation.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Cheers, Louis.&lt;/DIV&gt;</description>
      <pubDate>Tue, 04 Nov 2025 18:09:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137626#M50781</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-04T18:09:01Z</dc:date>
    </item>
    <item>
      <title>Re: Looking for Advice: Robust Backup Strategy for Databricks System Tables</title>
      <link>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137674#M50792</link>
      <description>&lt;P&gt;Great, thank you for your robust explanation and suggestions. Just wondering—can external tables be supported by LDP? For example, can we use &lt;STRONG&gt;CREATE OR REFRESH STREAMING TABLE&lt;/STRONG&gt; with an external location or table like &lt;STRONG&gt;"..."&lt;/STRONG&gt;?&lt;/P&gt;</description>
      <pubDate>Wed, 05 Nov 2025 05:00:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/looking-for-advice-robust-backup-strategy-for-databricks-system/m-p/137674#M50792</guid>
      <dc:creator>Danish11052000</dc:creator>
      <dc:date>2025-11-05T05:00:59Z</dc:date>
    </item>
  </channel>
</rss>

