Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Looking for Advice: Robust Backup Strategy for Databricks System Tables

Danish11052000
New Contributor II

Hi,

I’m planning to build a backup system for all Databricks system tables (audit, usage, price, history, etc.) to preserve data beyond retention limits. Currently, I’m using Structured Streaming (readStream + writeStream) with checkpointing in Lakeflow jobs for durability.

I recently noticed the CREATE OR REFRESH STREAMING TABLE feature in declarative Lakeflow pipelines and I’m wondering:
šŸ‘‰ Would it be better to switch to a declarative pipeline for automatic checkpointing and a simpler setup?
šŸ‘‰ Or stick with manual read/write streams for more control?

Would love to hear thoughts on which approach is more robust and maintainable.

2 REPLIES

Louis_Frolio
Databricks Employee

Greetings @Danish11052000, here’s a pragmatic way to choose, based on the nature of Databricks system tables and the guarantees you want.

 

Bottom line

  • For ongoing replication to preserve data beyond free retention, a Lakeflow Declarative Pipeline with Streaming Tables is the most robust and maintainable default—use the Python interface so you can set the required skipChangeCommits option when reading system tables. This gives you automatic checkpointing, scheduling, lifecycle management, lineage, and guardrails with far less code to maintain.
  • Keep manual Structured Streaming (readStream/writeStream) for edge cases where you need knobs LDP doesn’t offer (custom triggers not supported for system tables via Delta Sharing, bespoke failure handling, complex multi-sink writes, custom exactly-once semantics, or non-append processing).
 

Why declarative pipelines are strong for system tables backup

  • Automatic checkpointing/state and managed maintenance. Streaming tables are stateful and optimized for incremental append; Lakeflow Declarative Pipelines manage updates, maintenance, and optimization tasks on your behalf, reducing operational toil.
  • Simple scheduling. You can declare refresh schedules (SCHEDULE EVERY ... or TRIGGER ON UPDATE) directly on streaming tables; creation/refresh are run by serverless Lakeflow pipelines rather than warehouse compute.
  • Security/governance aligned. Unity Catalog governs access; pipelines publish streaming tables into UC with lineage visible in Catalog Explorer, which is helpful for auditability of your backups.
  • Clear limitations that fit the ā€œreplicate append-onlyā€ use case. Streaming tables are designed for append-only sources and don’t support schema-evolving DML on the target; owners refresh; ALTER is restricted—perfect for a write-once backup pattern.
 

System tables specifics that drive the design

  • Many system tables support streaming (for example, system.access.audit, system.billing.usage, job/cluster timelines), but some don’t (for example, system.billing.list_prices and system.query.history). Plan per-table: streaming ingestion for supported ones; periodic batch materialization for the rest (a sketch of such a plan follows this list).
  • When streaming system tables, you must set skipChangeCommits = true because Databricks may publish deletions/updates in system tables; this avoids stream disruption. Also, with Delta Sharing sources, Trigger.AvailableNow is not supported (it becomes Trigger.Once), so rely on scheduled refreshes instead.
  • The free retention period varies by table (commonly 365 days, but some are 180/90/30 days or indefinite). Your backup target will preserve data beyond those windows as you continuously append, which is exactly why replication is valuable.
  • System tables can add new columns any time; avoid brittle fixed schemas—use SELECT * BY NAME semantics and enable schema evolution on the target. This favors declarative ingestion where you don’t hand-code schema mappings.
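
As a concrete illustration of the per-table planning point above, here is a minimal sketch in Python. It covers only the tables named in this thread, and the shape (a simple dict driving your pipeline code) is an assumption for illustration, not a Databricks construct:

    # Minimal per-table ingestion plan (illustrative): streaming-capable
    # system tables are replicated with streaming ingestion; the rest are
    # copied by periodic batch jobs. Extend with any other system tables
    # you want to preserve.
    INGESTION_PLAN = {
        "system.access.audit":        "streaming",
        "system.billing.usage":       "streaming",
        "system.billing.list_prices": "batch",   # no streaming support
        "system.query.history":       "batch",   # no streaming support
    }

The "streaming" entries become streaming tables in the declarative pipeline below; the "batch" entries become materialized views or scheduled copy jobs.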

Recommended architecture (declarative first, with Python)

1. Bronze backup streaming tables (append-only) for streaming-supported system tables
Define each source in Python so you can set skipChangeCommits; decorating the streaming read with @dp.table creates the target streaming table (and you can add clustering/partitioning on date/workspace for efficient queries).
    from pyspark import pipelines as dp

    # One streaming table per system table. skipChangeCommits is required
    # because Databricks may publish updates/deletes in system tables.
    # Refresh cadence comes from the pipeline schedule (or the SQL SCHEDULE
    # clause shown later).
    @dp.table(
        name="backup_billing_usage",
        table_properties={
            "delta.appendOnly": "true"   # defensive: no deletes on the target
        }
    )
    def backup_billing_usage():
        return (spark.readStream
                .option("skipChangeCommits", "true")
                .table("system.billing.usage"))
   
If you prefer SQL for the table definition and scheduling, still read via Python to apply skipChangeCommits, then use CREATE OR REFRESH STREAMING TABLE ... SCHEDULE EVERY 1 hour on the target definition to declare the refresh cadence.
2. Materialize non-streaming system tables periodically (see the sketch after this list)
     For example, system.billing.list_prices (indefinite retention) and system.query.history (non-streaming) should be copied via:
    • A materialized view with SCHEDULE for incremental refresh, or
    • A CREATE FLOW ... INSERT INTO ONCE backfill plus a periodic job to append deltas.
      The query-history doc even provides a sample job pattern for periodic materialization.
3. Initial backfill flows
    Use CREATE FLOW ... INSERT INTO ONCE to seed historical ranges if you need to segment or parallelize the catch-up. The ONCE flow re-runs on a full pipeline refresh, which is useful if you ever need to rebuild the target cleanly.
4. Schema evolution and expectations
    Since columns can be added anytime in system tables, prefer BY NAME semantics and add table-level expectations only for critical quality constraints (drop row or fail update). Manage them via CREATE OR REFRESH STREAMING TABLE properties; alter via ALTER STREAMING TABLE as needed.
5. Operational guardrails
    • Avoid full refreshes on streaming tables backed by sources with limited retention (like message buses or system tables) because it will truncate and reprocess only the still-available window and you can’t recover older records from the source.
    • Use the current runtime channel for production; preview channel only for testing new features.
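
For item 2 above, here is a minimal sketch of a periodic batch job that appends only new system.query.history rows to a backup table. The target name (main.backups.query_history) and the watermark column (end_time) are assumptions for illustration; adjust them to your catalog layout and the column you key on:

    from pyspark.sql import functions as F

    TARGET = "main.backups.query_history"   # hypothetical backup table
    SOURCE = "system.query.history"         # non-streaming system table

    # Watermark: the latest timestamp already backed up (None on the first run).
    last_ts = (
        spark.table(TARGET).agg(F.max("end_time")).collect()[0][0]
        if spark.catalog.tableExists(TARGET)
        else None
    )

    new_rows = spark.read.table(SOURCE)
    if last_ts is not None:
        new_rows = new_rows.filter(F.col("end_time") > F.lit(last_ts))

    # Append only the delta; mergeSchema tolerates newly added source columns.
    (new_rows.write
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable(TARGET))

Run this from a scheduled Lakeflow job at a cadence comfortably inside the table’s retention window so no rows age out between runs.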

When to stick with manual readStream/writeStream

Use manual Structured Streaming if you need any of the following:
  • Custom triggers and backpressure strategies not available in LDP for Delta Sharing sources (for example, nuanced use of Trigger.AvailableNow beyond its conversion to Once).
  • Per-microbatch logic (foreachBatch), complex dedup keyed by composite IDs, multi-target writes from one stream, or fine-grained checkpoint location control and custom failure handling/retries that exceed pipeline defaults.
  • Specialized CDC semantics beyond AUTO CDC flows, or patterns that materially conflict with streaming-table limitations (for example, frequent schema-changing DML on the target).
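
For comparison, here is a minimal sketch of the manual Structured Streaming pattern the original question describes; the checkpoint path and target table name are placeholders:

    # Manual replication of one system table with explicit checkpointing.
    (spark.readStream
        .option("skipChangeCommits", "true")        # required for system tables
        .table("system.access.audit")
        .writeStream
        .option("checkpointLocation",
                "/Volumes/main/backups/_checkpoints/access_audit")  # placeholder path
        .trigger(availableNow=True)  # converted to Trigger.Once for Delta Sharing-backed sources
        .toTable("main.backups.access_audit"))      # placeholder target table

This is where you gain the extra knobs listed above (foreachBatch, custom retries, multi-sink writes), at the cost of owning the checkpoint locations, monitoring, and restarts yourself.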

Decision guide

Choose declarative streaming tables if:

  • Your goal is ongoing append-only replication for durability beyond retention, with simple schedules and minimal ops overhead.
  • You can read in Python to apply skipChangeCommits and accept streaming-table limitations (no ALTER TABLE, owner refresh, no DML that changes schema).
Choose manual streaming if:
  • You need the advanced controls above and are comfortable owning the code/ops (monitoring, restarts, checkpoint paths, backpressure, and custom error handling).
In practice, many teams use a hybrid: LDP streaming tables for the bulk of append replication and small pockets of Structured Streaming where special control is required.

Example SQL schedule for a streaming-table refresh

If you prefer to define the target and its refresh schedule in SQL (the hybrid pattern described above), read the system table through a Python view that applies skipChangeCommits (here named billing_usage_source purely as an example) and stream from that view:

    CREATE OR REFRESH STREAMING TABLE backup_billing_usage
      SCHEDULE EVERY 1 hour
      AS SELECT * FROM STREAM(billing_usage_source);

Final recommendations for your backup system

  • Default to Lakeflow Declarative Pipelines + Streaming Tables, and read system tables via Python to set skipChangeCommits. This yields simple, durable, and low-touch replication with automatic checkpointing. Set refresh schedules per table and use ONCE flows for initial backfills where needed.
  • For tables without streaming support (like system.query.history), set up materialized views or periodic jobs to copy new rows; don’t leave them out just because they’re non-streaming.
  • Partition/cluster backup targets by event_date and workspace_id (see the sketch after this list), mark them append-only, and avoid full refreshes on short-retention sources.
  • Expect schema evolution—prefer SELECT * BY NAME and avoid strict schemas for system tables to keep pipelines resilient to new columns.
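
Regarding the partitioning/clustering recommendation above, a hedged sketch: it assumes the dp.table decorator accepts the same cluster_by (or partition_cols) argument as the classic dlt.table API, and it uses usage_date and workspace_id, which exist on system.billing.usage:

    # Illustrative only: clustering the backup target for efficient
    # date/workspace queries. Swap cluster_by for partition_cols if you
    # prefer classic partitioning.
    @dp.table(
        name="backup_billing_usage",
        cluster_by=["usage_date", "workspace_id"],
        table_properties={"delta.appendOnly": "true"},
    )
    def backup_billing_usage():
        return (spark.readStream
                .option("skipChangeCommits", "true")
                .table("system.billing.usage"))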
 
I hope these suggestions help you settle on the approach that best suits your situation.
 
Cheers, Louis.

Great, thank you for the thorough explanation and suggestions. Just wondering: does LDP support external tables? For example, can we use CREATE OR REFRESH STREAMING TABLE with an external location or table like "..."?