Hi @FAHADURREHMAN,
There are several layers to optimizing query performance on a multi-TB materialized view, and the other replies here cover the ingestion/refresh side well. Let me add some guidance on the query-side tuning and help you decide between the options you mentioned.
LIQUID CLUSTERING: VERIFY IT IS ACTUALLY APPLIED
Having automatic liquid clustering enabled at the catalog level means new tables will get CLUSTER BY AUTO by default, but it is worth confirming that the materialized view itself actually has clustering applied. Run this to check:
DESCRIBE DETAIL your_catalog.your_schema.your_mv;
Look at the clusteringColumns field in the output. If it is empty, the MV was likely created before the catalog-level setting was enabled, or the setting was not picked up. You can explicitly set it:
CREATE OR REPLACE MATERIALIZED VIEW your_catalog.your_schema.your_mv
CLUSTER BY AUTO
AS
SELECT ...;
CLUSTER BY AUTO lets Databricks choose the best clustering keys based on observed query patterns. Alternatively, if you know your queries consistently filter on specific columns (e.g., a date column or a join key), you can specify them explicitly:
CLUSTER BY (your_date_column, your_join_key)
Up to four clustering columns are supported.
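If a regular Delta source table already exists, you can switch its clustering keys in place with ALTER TABLE (a materialized view, by contrast, needs its CLUSTER BY clause changed in the CREATE OR REPLACE definition as shown above). A minimal sketch, with placeholder table and column names:

```sql
-- Sketch: change clustering keys on an existing Delta table
-- (names are placeholders for your own catalog/schema/table/columns)
ALTER TABLE your_catalog.your_schema.your_source_table
CLUSTER BY (your_date_column, your_join_key);

-- New keys apply to newly written data; run OPTIMIZE to
-- incrementally recluster existing files as well
OPTIMIZE your_catalog.your_schema.your_source_table;
```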
DO NOT ADD PARTITIONING ON TOP OF LIQUID CLUSTERING
To answer your question about partitioning by date: liquid clustering and PARTITIONED BY are mutually exclusive; a table cannot use both. Liquid clustering is the recommended approach and provides the data skipping and layout optimization benefits that partitioning would give you, without the small-file and over-partitioning problems that date partitioning can cause on very large tables. So no, do not add date partitioning while liquid clustering is active.
ENABLE ROW TRACKING, DELETION VECTORS, AND CHANGE DATA FEED ON SOURCE TABLES
For incremental refresh to work most efficiently, Databricks recommends enabling these three features on all source tables that feed the materialized view:
ALTER TABLE your_source_table SET TBLPROPERTIES (
'delta.enableRowTracking' = 'true',
'delta.enableDeletionVectors' = 'true',
'delta.enableChangeDataFeed' = 'true'
);
Without row tracking, many incremental refresh operators fall back to full recomputation.
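You can confirm the properties took effect with SHOW TBLPROPERTIES (table name is a placeholder):

```sql
-- List all table properties on a source table
SHOW TBLPROPERTIES your_catalog.your_schema.your_source_table;

-- Or check one specific property
SHOW TBLPROPERTIES your_catalog.your_schema.your_source_table ('delta.enableRowTracking');
```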
USE A STREAMING TABLE AS AN INTERMEDIATE LAYER
As @aleksandra_ch mentioned, landing your Parquet files into a Streaming Table first using Auto Loader, and then building the materialized view on top of that Streaming Table, is the recommended pattern. This enables true incremental refresh so that only newly arrived data gets processed during each pipeline run:
CREATE OR REFRESH STREAMING TABLE raw_data
AS SELECT * FROM STREAM read_files('s3://your-bucket/path/', format => 'parquet');
CREATE OR REPLACE MATERIALIZED VIEW your_mv
CLUSTER BY AUTO
AS
SELECT ... FROM your_catalog.your_schema.raw_data ...;
RUN THE PIPELINE ON SERVERLESS COMPUTE
Incremental refresh for materialized views is only available on serverless compute. If your Lakeflow Spark Declarative Pipeline is running on classic compute, the MV will do a full recompute every refresh. Switch the pipeline to serverless in the pipeline settings to take advantage of incremental refresh.
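One way to verify whether a given refresh actually ran incrementally is to query the MV's event log with the event_log table-valued function; the planning_information events record which refresh technique was chosen. A sketch (the MV name is a placeholder, and this assumes your workspace has access to the event log):

```sql
-- The message indicates whether the refresh was incremental
-- or fell back to a full recompute
SELECT timestamp, message
FROM event_log(TABLE(your_catalog.your_schema.your_mv))
WHERE event_type = 'planning_information'
ORDER BY timestamp DESC
LIMIT 10;
```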
QUERY-SIDE OPTIMIZATIONS
Beyond the data layout, a few things to check for your slow JOIN queries:
1. Check the query profile in the SQL warehouse to see where time is being spent (scanning, shuffling, or spilling to disk).
2. Make sure your SQL warehouse is appropriately sized. For multi-TB tables with JOINs, a Medium or larger warehouse often makes a meaningful difference.
3. If you are joining the MV against another large table, make sure that table also has liquid clustering or appropriate data layout on the join key columns.
4. Consider whether you need all billion+ rows in a single MV, or if you can push filters into the MV definition itself to reduce its size.
5. Predictive Optimization handles OPTIMIZE and VACUUM scheduling automatically, but verify it is actually running compaction on the MV. Check the operation history:
SELECT * FROM system.storage.predictive_optimization_operations_history
WHERE table_name = 'your_mv'
ORDER BY start_time DESC;
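To illustrate point 4 above: if downstream queries only ever touch recent data, pushing that filter into the MV definition keeps the precomputed result much smaller. A sketch with a hypothetical date column and a 90-day window as the assumed retention need:

```sql
CREATE OR REPLACE MATERIALIZED VIEW your_catalog.your_schema.recent_mv
CLUSTER BY AUTO
AS
SELECT ...
FROM your_catalog.your_schema.raw_data
WHERE your_date_column >= current_date() - INTERVAL 90 DAYS;
```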
MATERIALIZED VIEW VS. STREAMING TABLE + DELTA TABLE
To your question about whether read_stream with Delta tables is better: it depends on your use case.
- Use a materialized view when you need aggregations, JOINs, or transformations computed and maintained automatically with each refresh. The MV stores the precomputed results so downstream queries are fast.
- Use a streaming table when your transformation is append-only or you need exactly-once processing guarantees. Streaming tables process each record only once.
For a read-heavy analytics workload with JOINs, a materialized view on top of a streaming table (as described above) is typically the best pattern. The streaming table handles efficient ingestion, and the MV handles the transformation/aggregation layer.
DOCUMENTATION REFERENCES
- Liquid clustering: https://docs.databricks.com/en/delta/clustering.html
- Incremental refresh for materialized views: https://docs.databricks.com/en/optimizations/incremental-refresh.html
- CREATE MATERIALIZED VIEW syntax: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-materialized-view.html
- Auto Loader file ingestion: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/index.html
* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.