Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Ingesting data from views

naman0012
New Contributor

Hi all,

I am looking to create gold tables from views, and I would also like streaming and change data capture features.

I know this is not possible in DLT workflows, so I was wondering whether there is another way to achieve it.
The ingestion from the views will run daily, and I want to preserve history in my gold table while adding the latest data from the views after transformations.

5 REPLIES

MoJaMa
Databricks Employee

If you are already using Declarative Pipelines, you can indeed encapsulate the transformation logic in a view inside the pipeline. That view would read from a source (let's say a silver streaming table) and populate a (let's say) gold materialized view (MV).

So raw files (autoloader) --> bronze (append streaming table) --> silver (SCD streaming table) --> view (some additional transformations, joins) --> gold (MV).

This is already possible.
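As a rough sketch of the view-inside-the-pipeline step, assuming the `pyspark.pipelines` module: the decorator names and all table names below are assumptions to verify against the current docs, not a definitive implementation.

```python
# Sketch of the silver -> view -> gold MV step of the flow above.
# Decorator and table names are placeholders/assumptions; `spark` is
# provided by the pipeline runtime, so this only runs inside a pipeline.
from pyspark import pipelines as dp
from pyspark.sql import functions as F

@dp.temporary_view(name="silver_enriched")
def silver_enriched():
    # Additional transformations/joins on top of the silver streaming table
    return (spark.read.table("silver_orders")
            .withColumn("order_date", F.to_date("order_ts")))

@dp.materialized_view(name="gold_orders")
def gold_orders():
    # The gold MV reads from the intermediate view
    return spark.read.table("silver_enriched")
```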

Maybe I'm misunderstanding the request. Feel free to clarify.

yoongkang
Databricks Partner

You can try using Change Data Feed in combination with Structured Streaming to incrementally process changes from Delta tables.

Hope it helps 

pradeep_singh
Contributor

I believe you are looking for two things here:

  1. Incrementally refreshed gold tables (materialized views)
  2. History on those tables (materialized views)

You get the first with materialized views in the gold layer if you use SDP (serverless pipelines, powered by Enzyme). History is still problematic with MVs. The second requirement needs a regular Delta table; that is when you have to switch to Delta CDF (Change Data Feed) with Structured Streaming.



Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

Start from Bronze ingestion (with metadata tables), refine to Silver (with mdd), then aggregate to Gold materialized views for optimized Delta persistence. Schedule full/incremental modes via Fabric's UI, integrating data quality rules for governance. I think Fabric's features are a good fit here.

SteveOstrowski
Databricks Employee

Hi @naman0012,

There are a few different approaches depending on where these "views" live and what your exact architecture looks like. Let me walk through the options.


CLARIFYING THE SCENARIO

The key question is: are these database views on an external system (SQL Server, Oracle, Postgres, etc.), or are they views/tables already inside Databricks (Unity Catalog)?

The approaches differ significantly depending on the answer.


OPTION 1: SOURCE VIEWS ARE IN AN EXTERNAL DATABASE

If you are reading from views in an external database and want to land that data into gold Delta tables with history tracking, Lakeflow Spark Declarative Pipelines (SDP) streaming tables cannot directly stream from external views. Streaming tables require append-only sources like Auto Loader (cloud files), Kafka, or Delta tables with change data feed.

Instead, you can use a batch ingestion pattern:

1. Use a scheduled notebook or workflow task to read from the external view via JDBC/query federation and write into a Bronze Delta table.
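A minimal sketch of that first step, assuming the standard Spark JDBC reader (connection URL, secret scope, and table names below are placeholders, not working values):

```python
# Sketch of step 1: batch-read an external database view over JDBC and
# append it into a Bronze Delta table. URL, secret names, and table
# names are placeholders to replace with your own.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<db>")
    .option("dbtable", "reporting.daily_orders_view")
    .option("user", dbutils.secrets.get("scope", "db_user"))
    .option("password", dbutils.secrets.get("scope", "db_password"))
    .load()
)

(source_df.write
    .mode("append")
    .saveAsTable("catalog.schema.bronze_table"))
```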

2. Enable Change Data Feed on the Bronze table:

ALTER TABLE catalog.schema.bronze_table
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

3. Then use Structured Streaming with Change Data Feed to process incremental changes downstream:

from pyspark.sql import functions as F

changes_df = (
    spark.readStream
    .option("readChangeFeed", "true")
    .table("catalog.schema.bronze_table")
)

# Apply transformations
gold_df = changes_df.filter(...).withColumn(...)

(gold_df.writeStream
    .outputMode("append")
    .option("checkpointLocation", "/path/to/checkpoint")
    .toTable("catalog.schema.gold_table")
)

For history preservation specifically, you can use the _change_type, _commit_version, and _commit_timestamp metadata columns from the change data feed to build an audit/history table that tracks all changes over time.
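To make that history-building idea concrete outside of Spark, here is a plain-Python sketch (illustrative only, not a Databricks API); the field names mirror the CDF metadata columns, everything else is made up for the example:

```python
# Plain-Python sketch of folding change-data-feed events into an
# append-only audit log. Field names mirror Delta CDF metadata columns
# (_change_type, _commit_version); the rest is illustrative.

def build_audit_log(changes):
    """Keep every post-image row, ordered by commit version."""
    audit = []
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        # 'update_preimage' rows show the value *before* an update;
        # an audit trail keeps inserts, deletes, and update post-images.
        if row["_change_type"] == "update_preimage":
            continue
        audit.append(row)
    return audit

changes = [
    {"id": 1, "amount": 10, "_change_type": "insert", "_commit_version": 1},
    {"id": 1, "amount": 10, "_change_type": "update_preimage", "_commit_version": 2},
    {"id": 1, "amount": 15, "_change_type": "update_postimage", "_commit_version": 2},
    {"id": 1, "amount": 15, "_change_type": "delete", "_commit_version": 3},
]

log = build_audit_log(changes)
# Three entries survive: the insert, the update post-image, and the delete.
```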


OPTION 2: USE SDP WITH AUTO CDC FOR SCD TYPE 2

If you want automatic history tracking (slowly changing dimension Type 2), Lakeflow Spark Declarative Pipelines (SDP) supports this natively through the AUTO CDC API. You would:

1. Land data into a Bronze streaming table (via Auto Loader or another supported source).
2. Use AUTO CDC with SCD Type 2 to automatically maintain history in your gold table.

In Python (using the current pyspark.pipelines module):

from pyspark import pipelines as dp
from pyspark.sql.functions import col

dp.create_auto_cdc_flow(
    target="gold_customers",
    source="bronze_customers",
    keys=["customer_id"],
    sequence_by=col("updated_at"),
    stored_as_scd_type=2
)

This automatically creates __START_AT and __END_AT columns to track the validity period of each record version. SCD Type 2 preserves full history of changes.
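To illustrate what those validity columns mean, here is a plain-Python sketch of SCD Type 2 semantics (illustrative only; in practice AUTO CDC maintains this for you, and all names here are made up for the example):

```python
# Plain-Python sketch of SCD Type 2: each key keeps a history of
# versions with validity windows, like the __START_AT/__END_AT columns
# AUTO CDC maintains. Illustrative only -- AUTO CDC does this for you.

def apply_scd2(history, change, key="customer_id", seq="updated_at"):
    """Close the current version for the key (if any) and open a new one."""
    for row in history:
        if row[key] == change[key] and row["__END_AT"] is None:
            row["__END_AT"] = change[seq]  # close the open version
    new_row = dict(change)
    new_row["__START_AT"] = change[seq]
    new_row["__END_AT"] = None  # open-ended: the current version
    history.append(new_row)
    return history

history = []
apply_scd2(history, {"customer_id": 1, "city": "Oslo", "updated_at": "2024-01-01"})
apply_scd2(history, {"customer_id": 1, "city": "Bergen", "updated_at": "2024-06-01"})
# history now holds two versions for customer 1: the Oslo row closed at
# 2024-06-01, and the Bergen row still open (the current version).
```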

In SQL:

CREATE OR REFRESH STREAMING TABLE gold_customers;

AUTO CDC INTO gold_customers
FROM bronze_customers
KEYS (customer_id)
SEQUENCE BY updated_at
STORED AS SCD TYPE 2;


OPTION 3: MATERIALIZED VIEWS FOR SIMPLER AGGREGATION

If you do not need full row-level history but want a gold table that stays current with transformations applied, materialized views in SDP are a good fit. They incrementally refresh from upstream streaming tables or other Delta tables:

CREATE MATERIALIZED VIEW gold_summary AS
SELECT
  customer_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount
FROM silver_orders
GROUP BY customer_id;

Materialized views support automatic incremental refresh when the source tables have row tracking and change data feed enabled. You can schedule refreshes or use TRIGGER ON UPDATE for production freshness.

Note: materialized views replace their contents on each refresh, so they do not preserve history by default. For history, pair them with a separate history/audit table or use the SCD Type 2 approach above.
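The contrast can be sketched in plain Python (illustrative only, no Spark involved): an MV-style refresh recomputes the latest state from scratch, so earlier states survive only if you append each result to a separate history structure yourself.

```python
# Plain-Python contrast between a materialized-view-style full refresh
# (latest state only) and an explicit history of snapshots. Illustrative
# only; all names are made up for the example.
from collections import defaultdict

def refresh_summary(orders):
    """Recompute the aggregate from scratch, like an MV refresh."""
    totals = defaultdict(float)
    for o in orders:
        totals[o["customer_id"]] += o["amount"]
    return dict(totals)

orders = [{"customer_id": 1, "amount": 10.0}]
snapshots = []                          # explicit history: one entry per refresh
snapshots.append(refresh_summary(orders))

orders.append({"customer_id": 1, "amount": 5.0})
snapshots.append(refresh_summary(orders))

latest = snapshots[-1]                  # what the MV shows: current state only
# Earlier totals survive only because each refresh was appended to `snapshots`.
```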


RECOMMENDED ARCHITECTURE FOR YOUR USE CASE

Based on your requirements (daily ingestion from views, history preservation, transformations in gold), here is the recommended pattern:

1. BRONZE: Scheduled notebook/task reads from external views daily via JDBC and writes (appends or merges) into a Bronze Delta table with Change Data Feed enabled.

2. SILVER: Use SDP with a streaming table to clean and validate the data.

3. GOLD: Use AUTO CDC with SCD Type 2 to maintain a gold table with full history tracking. This gives you the change data capture semantics you are looking for.

This pattern ensures daily incremental processing, full history in your gold layer, and clean separation of concerns across the medallion architecture.


USEFUL DOCUMENTATION

Change Data Feed:
https://docs.databricks.com/aws/en/delta/delta-change-data-feed

Materialized Views:
https://docs.databricks.com/aws/en/views/materialized

AUTO CDC in SDP:
https://docs.databricks.com/aws/en/delta-live-tables/cdc

Structured Streaming with Delta:
https://docs.databricks.com/aws/en/structured-streaming/delta-lake

Hope this helps. Let me know if you can share more about where the source views live and I can refine the recommendation further.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.