DLT || Python || Aggregate Functions recomputing all the records

Hi all,

I am building a real-time dashboard using a Databricks Delta Live Tables (DLT) pipeline with the following steps:

Bronze table: Using Databricks Auto Loader, records from new files are ingested incrementally into a bronze table.
Silver table: Using the read_stream function (Spark Structured Streaming), the silver table is created by filtering the bronze records and selecting only the few fields that are required.
Gold table: Using the read function, which reads the complete silver table, the gold table is created as a materialized view with an aggregate function (SUM) and a GROUP BY clause.

Problem:
The bronze and silver tables are updated incrementally. In the gold table, however, all records are recomputed every time a new record arrives in the silver table.

What I want is for only the groups (GROUP BY keys) affected by new records to be updated; all other rows should be left untouched and should not need recomputation.
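
To make the goal concrete, here is a minimal plain-Python sketch of the behaviour I am after. It is not DLT or Spark code, and the group keys ("A", "B") and amounts are made-up sample data. full_recompute mimics what the gold table does today (rescan every silver row), while incremental_update shows the desired behaviour (only the groups present in the new batch change).

```python
from collections import defaultdict


def full_recompute(silver_rows):
    """Current behaviour: scan every silver row and rebuild every group's SUM."""
    totals = defaultdict(float)
    for key, amount in silver_rows:
        totals[key] += amount
    return dict(totals)


def incremental_update(gold_totals, new_rows):
    """Desired behaviour: apply only the new rows and report which groups changed."""
    touched = set()
    for key, amount in new_rows:
        gold_totals[key] = gold_totals.get(key, 0.0) + amount
        touched.add(key)
    return touched


# Hypothetical sample data: (group key, amount) pairs already in silver.
silver = [("A", 10.0), ("B", 5.0), ("A", 2.5)]
gold = full_recompute(silver)  # initial load of the gold aggregates

# A new record arrives in silver; only group "B" should be updated.
new_batch = [("B", 1.0)]
touched = incremental_update(gold, new_batch)

# Both approaches agree on the result, but the incremental one touched one group.
assert touched == {"B"}
assert gold == full_recompute(silver + new_batch)
print(gold)
```

This shortcut works because SUM can be updated from the previous total plus the new rows alone; that is the property I am hoping the pipeline can exploit for the gold table.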

I have also tried using a streaming table instead of a materialized view for the gold table; in that case, too, all records are recomputed.
Any help would be appreciated.
