Hi @FAHADURREHMAN,
There are several layers to optimizing query performance on a multi-TB materialized view, and the other replies here cover the ingestion/refresh side well. Let me add some guidance on the query-side tuning and help you decide between the options you mentioned.
LIQUID CLUSTERING: VERIFY IT IS ACTUALLY APPLIED
Having automatic liquid clustering enabled at the catalog level means new tables will get CLUSTER BY AUTO by default, but it is worth confirming that the materialized view itself actually has clustering applied. Run this to check:
DESCRIBE DETAIL your_catalog.your_schema.your_mv;
Look at the clusteringColumns field in the output. If it is empty, the MV was likely created before the catalog-level setting was enabled, or the setting was not picked up. You can explicitly set it:
CREATE OR REPLACE MATERIALIZED VIEW your_catalog.your_schema.your_mv
CLUSTER BY AUTO
AS
SELECT ...;
CLUSTER BY AUTO lets Databricks choose the best clustering keys based on observed query patterns. Alternatively, if you know your queries consistently filter on specific columns (e.g., a date column or a join key), you can specify them explicitly:
CLUSTER BY (your_date_column, your_join_key)
Up to four clustering columns are supported.
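If a regular Delta source table already exists, you can switch its clustering keys in place with ALTER TABLE (a materialized view, by contrast, needs its CLUSTER BY clause changed in the CREATE OR REPLACE definition as shown above). A minimal sketch, with placeholder table and column names:

```sql
-- Sketch: change clustering keys on an existing Delta table
-- (names are placeholders for your own catalog/schema/table/columns)
ALTER TABLE your_catalog.your_schema.your_source_table
CLUSTER BY (your_date_column, your_join_key);

-- New keys apply to newly written data; run OPTIMIZE to
-- incrementally recluster existing files as well
OPTIMIZE your_catalog.your_schema.your_source_table;
```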
DO NOT ADD PARTITIONING ON TOP OF LIQUID CLUSTERING
To answer your question about partitioning by date: liquid clustering and PARTITIONED BY are mutually exclusive; a table cannot use both. Liquid clustering is the recommended approach and provides the data skipping and layout optimization benefits that partitioning would give you, without the small-file and over-partitioning problems that date partitioning can cause on very large tables. So no, do not add date partitioning while liquid clustering is active.
ENABLE ROW TRACKING, DELETION VECTORS, AND CHANGE DATA FEED ON SOURCE TABLES
For incremental refresh to work most efficiently, Databricks recommends enabling these three features on all source tables that feed the materialized view:
ALTER TABLE your_source_table SET TBLPROPERTIES (
'delta.enableRowTracking' = 'true',
'delta.enableDeletionVectors' = 'true',
'delta.enableChangeDataFeed' = 'true'
);
Without row tracking, many incremental refresh operators fall back to full recomputation.
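You can confirm the properties took effect with SHOW TBLPROPERTIES (table name is a placeholder):

```sql
-- List all table properties on a source table
SHOW TBLPROPERTIES your_catalog.your_schema.your_source_table;

-- Or check one specific property
SHOW TBLPROPERTIES your_catalog.your_schema.your_source_table ('delta.enableRowTracking');
```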
USE A STREAMING TABLE AS AN INTERMEDIATE LAYER
As @aleksandra_ch mentioned, landing your Parquet files into a Streaming Table first using Auto Loader, and then building the materialized view on top of that Streaming Table, is the recommended pattern. This enables true incremental refresh so that only newly arrived data gets processed during each pipeline run:
CREATE OR REFRESH STREAMING TABLE raw_data
AS SELECT * FROM STREAM read_files('s3://your-bucket/path/', format => 'parquet');
CREATE OR REPLACE MATERIALIZED VIEW your_mv
CLUSTER BY AUTO
AS
SELECT ... FROM your_catalog.your_schema.raw_data ...;
RUN THE PIPELINE ON SERVERLESS COMPUTE
Incremental refresh for materialized views is only available on serverless compute. If your Lakeflow Spark Declarative Pipeline is running on classic compute, the MV will do a full recompute every refresh. Switch the pipeline to serverless in the pipeline settings to take advantage of incremental refresh.
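One way to verify whether a given refresh actually ran incrementally is to query the MV's event log with the event_log table-valued function; the planning_information events record which refresh technique was chosen. A sketch (the MV name is a placeholder, and this assumes your workspace has access to the event log):

```sql
-- The message indicates whether the refresh was incremental
-- or fell back to a full recompute
SELECT timestamp, message
FROM event_log(TABLE(your_catalog.your_schema.your_mv))
WHERE event_type = 'planning_information'
ORDER BY timestamp DESC
LIMIT 10;
```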
QUERY-SIDE OPTIMIZATIONS
Beyond the data layout, a few things to check for your slow JOIN queries:
1. Check the query profile in the SQL warehouse to see where time is being spent (scanning, shuffling, or spilling to disk).
2. Make sure your SQL warehouse is appropriately sized. For multi-TB tables with JOINs, a Medium or larger warehouse often makes a meaningful difference.
3. If you are joining the MV against another large table, make sure that table also has liquid clustering or appropriate data layout on the join key columns.
4. Consider whether you need all billion+ rows in a single MV, or if you can push filters into the MV definition itself to reduce its size.
5. Predictive Optimization handles OPTIMIZE and VACUUM scheduling automatically, but verify it is actually running compaction on the MV. Check the operation history:
SELECT * FROM system.storage.predictive_optimization_operations_history
WHERE table_name = 'your_mv'
ORDER BY start_time DESC;
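To illustrate point 4 above: if downstream queries only ever touch recent data, pushing that filter into the MV definition keeps the precomputed result much smaller. A sketch with a hypothetical date column and a 90-day window as the assumed retention need:

```sql
CREATE OR REPLACE MATERIALIZED VIEW your_catalog.your_schema.recent_mv
CLUSTER BY AUTO
AS
SELECT ...
FROM your_catalog.your_schema.raw_data
WHERE your_date_column >= current_date() - INTERVAL 90 DAYS;
```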
MATERIALIZED VIEW VS. STREAMING TABLE + DELTA TABLE
To your question about whether read_stream with Delta tables is better: it depends on your use case.
- Use a materialized view when you need aggregations, JOINs, or transformations computed and maintained automatically with each refresh. The MV stores the precomputed results so downstream queries are fast.
- Use a streaming table when your transformation is append-only or you need exactly-once processing guarantees. Streaming tables process each record only once.
For a read-heavy analytics workload with JOINs, a materialized view on top of a streaming table (as described above) is typically the best pattern. The streaming table handles efficient ingestion, and the MV handles the transformation/aggregation layer.
DOCUMENTATION REFERENCES
- Liquid clustering: https://docs.databricks.com/en/delta/clustering.html
- Incremental refresh for materialized views: https://docs.databricks.com/en/optimizations/incremental-refresh.html
- CREATE MATERIALIZED VIEW syntax: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-materialized-view.html
- Auto Loader file ingestion: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html
* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.