Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Are streaming tables suitable for Gold layer Star schema?

yit337
Contributor

Based on docs, we can't use identity columns and ANALYZE TABLE on streaming tables. So, should we avoid using streaming tables for Gold layer Star schema?
https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-streaming-table#limitations 

4 REPLIES

MoJaMa
Databricks Employee

The power of streaming tables is that you can more easily do streaming plus APPLY CHANGES (deletes/upserts, SCD Type 1/2) without having to reason about how it is done. That makes them a natural fit for Bronze/Silver, with materialized views as your Gold layer, since your consumption layer is likely a join of multiple dims and facts presented in a cleansed or aggregated fashion for your Gold consumers.

There are definitely customers who use streaming tables in Gold, but I think materialized views are a better fit for a general consumption layer. You can then take it a step further with metric views (business semantics) when you want BI-friendly dimensions and measures for your analyst personas.

Thanks for the reply. However, I want to have tables in Gold. To get identity columns and ANALYZE TABLE, do I have to use a regular Delta table, or can I use a streaming table too?

balajij8
Contributor

If the Gold table is modeled as a star schema, you can avoid streaming tables.

SteveOstrowski
Databricks Employee

Hi @yit337,

You are on the right track noticing those limitations. The short answer is: for a Gold layer star schema, materialized views are generally the better fit, though streaming tables are not completely ruled out depending on the specific table role.

Here is how to think about it:

WHY MATERIALIZED VIEWS ARE PREFERRED FOR GOLD STAR SCHEMA

The Databricks best practices documentation explicitly recommends materialized views for the Gold layer, stating they should be used to pre-compute "aggregations, metrics, and summaries served to dashboards." There are several reasons this guidance applies strongly to star schema designs:

1. Correct join behavior: In a star schema, your fact tables join to dimension tables. When a dimension changes (a customer updates their address, a product gets reclassified), you need the Gold layer to reflect those changes. Materialized views automatically recompute joins when underlying dimensions change, keeping results correct. Streaming tables, by contrast, snapshot dimension values at stream start time and do not go back and fix already-processed rows when dimensions update. This can lead to stale or incorrect data in your Gold tables.

2. Aggregation support: Gold layer tables in a star schema are often aggregated fact tables or summary tables. Materialized views handle aggregations natively and refresh them incrementally when possible.

3. The limitations you noted: As you found, streaming tables do not support identity columns (which are commonly used for surrogate keys in star schemas) or ANALYZE TABLE (which helps the query optimizer produce better plans for complex star schema queries).
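
To make the materialized-view side concrete, here is a minimal sketch in Lakeflow/DLT SQL of a Gold aggregate joining a fact to a dimension. The table and column names (silver_sales, silver_customers, amount, customer_region) are illustrative assumptions, not from your pipeline:

```sql
-- Hypothetical Gold-layer materialized view: joins the fact to a
-- dimension and aggregates. If a customer's region changes in
-- silver_customers, the next refresh recomputes the affected rows,
-- which a streaming table would not do for already-processed rows.
CREATE MATERIALIZED VIEW gold_sales_by_region AS
SELECT
  c.customer_region,
  SUM(s.amount) AS total_amount,
  COUNT(*)      AS order_count
FROM silver_sales s
JOIN silver_customers c
  ON s.customer_id = c.customer_id
GROUP BY c.customer_region;
```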

WHEN STREAMING TABLES CAN STILL PLAY A ROLE

Streaming tables are not entirely off-limits in a Gold layer, but their sweet spot is different:

- Append-only fact logging: If you have a fact table that is strictly append-only (event logs, transaction records) and you do not need to join against changing dimensions at this layer, a streaming table can work well for low-latency ingestion into Gold.

- Pre-aggregation staging: You could use a streaming table to land near-real-time data, then have a materialized view on top that performs the joins and aggregations for the final star schema output.
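
That staging pattern could look roughly like this (names and the source path are illustrative assumptions): the streaming table lands events with low latency, and the materialized view on top performs the dimension join so the final output stays correct when dimensions change.

```sql
-- Hypothetical: land events continuously into a staging streaming table
CREATE STREAMING TABLE gold_events_staging AS
SELECT * FROM STREAM read_files('/Volumes/landing/events/');

-- Final star-schema output: the join lives in the MV, so dimension
-- changes are picked up on refresh rather than frozen at ingest time
CREATE MATERIALIZED VIEW gold_events_fact AS
SELECT
  e.event_id,
  d.product_category,
  e.event_ts,
  e.amount
FROM gold_events_staging e
JOIN silver_products d
  ON e.product_id = d.product_id;
```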

For surrogate keys without identity columns, you can use deterministic hashing as an alternative:

SELECT
  md5(cast(customer_id AS STRING)) AS customer_sk,
  customer_name,
  customer_region
FROM silver_customers

RECOMMENDED PATTERN

A common and effective pattern in Lakeflow Spark Declarative Pipelines (SDP) for a star schema Gold layer:

- Bronze: Streaming tables for raw ingestion
- Silver: Streaming tables for cleaned/conformed data
- Gold dimensions: Materialized views (they recompute when source data changes, keeping dimensions current)
- Gold facts: Materialized views if you need joins to dimensions or aggregations. Streaming tables only if the fact is purely append-only with no dimension lookups at this layer.
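
Put together, the end-to-end pattern might look like the sketch below. All table names, columns, and the source path are illustrative assumptions; treat it as a shape to adapt, not a drop-in pipeline.

```sql
-- Bronze: raw streaming ingestion (illustrative source path)
CREATE STREAMING TABLE bronze_orders AS
SELECT * FROM STREAM read_files('/Volumes/raw/orders/');

-- Silver: cleaned/conformed streaming table
CREATE STREAMING TABLE silver_orders AS
SELECT
  cast(order_id AS BIGINT)       AS order_id,
  cast(customer_id AS BIGINT)    AS customer_id,
  cast(amount AS DECIMAL(18, 2)) AS amount,
  to_timestamp(order_ts)         AS order_ts
FROM STREAM bronze_orders
WHERE order_id IS NOT NULL;

-- Gold dimension: materialized view, recomputed as the source changes,
-- with a hashed surrogate key standing in for an identity column
CREATE MATERIALIZED VIEW gold_dim_customer AS
SELECT
  md5(cast(customer_id AS STRING)) AS customer_sk,
  customer_name,
  customer_region
FROM silver_customers;

-- Gold fact: materialized view so the surrogate-key lookup stays
-- consistent with the dimension on every refresh
CREATE MATERIALIZED VIEW gold_fact_orders AS
SELECT
  md5(cast(o.customer_id AS STRING)) AS customer_sk,
  o.order_id,
  o.amount,
  o.order_ts
FROM silver_orders o;
```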

This aligns with the official best practices guidance here:
https://docs.databricks.com/aws/en/ldp/best-practices.html

And here is the streaming tables limitations reference you already found:
https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-streaming-table.html

SUMMARY

For a Gold layer star schema, lean toward materialized views. They handle dimension changes correctly, support the aggregation patterns star schemas require, and avoid the identity column and ANALYZE TABLE limitations you identified. Reserve streaming tables for Bronze/Silver layers or for specific append-only fact tables where those constraints do not apply.

* This reply used an agent system I built to research and draft this response, based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.