Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Incrementally ingesting from a static db into a Delta Table

rk1994
New Contributor

Hello everyone,

I’m very new to Delta Live Tables (and Delta Tables too), so please forgive me if this question has been asked here before.

Some context: I have over 100M records stored in a Postgres table. I can connect to this table using the conventional spark.read.format("jdbc")….load() method.
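For reference, the read looks roughly like this (the host, table, and secret names below are placeholders):

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.events")
    .option("user", dbutils.secrets.get("my-scope", "pg-user"))
    .option("password", dbutils.secrets.get("my-scope", "pg-password"))
    .option("driver", "org.postgresql.Driver")
    .load()
)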

Unfortunately, due to some organizational restrictions, I cannot use streaming frameworks such as Kafka or Debezium, so using Auto Loader is out of scope for me.

I have already created a materialized view and backfilled it with ~100M records.

Now, the use case: I ingest ~500k new data points into the Postgres table every day, and I would like to schedule a workflow/DLT pipeline to append this new data to the existing Delta table. Is DLT a feasible solution for this, or should I stick to a regular notebook-based workflow?

I tried using DLT because I would like to perform some intermediate transforms using expectations, but every time I run the pipeline, the existing data in the materialized view is overwritten. To counter this, I used a hacky workaround: check whether the table exists and, if it does, read the existing table and union() the new data onto it. But then I'm not leveraging DLT optimally. Is there a better way to avoid recomputing the materialized view(s) on every run?
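For reference, the workaround looks roughly like this (simplified; the catalog, table, and column names are placeholders):

import dlt
from pyspark.sql import functions as F

JDBC_URL = "jdbc:postgresql://<host>:5432/<database>"  # placeholder connection string

@dlt.table(name="events_mv")
def events_mv():
    # Pull only the recent rows from Postgres (created_at is a placeholder watermark column).
    new_df = (
        spark.read.format("jdbc")
        .option("url", JDBC_URL)
        .option("dbtable", "public.events")  # user/password/driver options omitted for brevity
        .load()
        .where(F.col("created_at") >= F.date_sub(F.current_date(), 1))
    )
    # The hacky part: if the target already exists, read it back and union the new rows onto it.
    if spark.catalog.tableExists("my_catalog.my_schema.events_mv"):
        return spark.read.table("my_catalog.my_schema.events_mv").unionByName(new_df)
    return new_df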

One last question: is DLT even a good fit for my use case? I know I can achieve this just as easily with a regular workflow, so I'd like to know what advantages I would get by using DLT vs. not using it.

Thanks!

2 REPLIES

TPSteve
New Contributor II

First, you need to understand why your current solution is failing.

Materialized views and views in DLT don't differ conceptually from materialized views and views in PostgreSQL. Every time the pipeline runs, both the materialized view and the view are recalculated; the difference is that a materialized view is persisted, while a view is only available during the lifetime of the pipeline. If your backfilled records are not in the materialized view's source table, they will not be included in the refreshed result. You also don't want to recalculate all of the historic data unless you're updating results that aggregate the historic and new data together, and it doesn't sound like that's your use case. You're interested in appending the incremental results to the table.

As always, there are multiple ways to construct a pipeline. Without knowing more details about your data, DLT should be an option; whether it's the optimal one is something else to consider.

I also suggest you follow the lakehouse paradigm: land the untransformed data in a Delta table first and then process the transformations, roughly along the lines of the sketch below. If you ever need to recalculate all the data, you won't need to read it from PostgreSQL again.
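A minimal sketch of that pattern, assuming Postgres has a monotonically increasing created_at column you can use as a watermark (all table, column, and connection names here are placeholders): a scheduled batch job appends only the new rows into a bronze Delta table, and the DLT pipeline reads that table as a stream and applies expectations.

# --- Scheduled batch job (plain notebook/job, not DLT) ---
from pyspark.sql import functions as F

BRONZE = "my_catalog.my_schema.events_bronze"    # placeholder target table
JDBC_URL = "jdbc:postgresql://<host>:5432/<db>"  # placeholder connection string

# Find the current high-water mark, if the bronze table already exists.
last_ts = (
    spark.read.table(BRONZE).agg(F.max("created_at")).first()[0]
    if spark.catalog.tableExists(BRONZE)
    else None
)

# Push the incremental filter down to Postgres so only the new rows cross the wire.
src = "(SELECT * FROM public.events"
if last_ts is not None:
    src += f" WHERE created_at > '{last_ts}'"
src += ") AS src"

new_rows = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", src)   # user/password/driver options omitted for brevity
    .load()
)
new_rows.write.mode("append").saveAsTable(BRONZE)

# --- DLT pipeline (separate notebook): incremental processing with expectations ---
import dlt

@dlt.table(name="events_silver")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")   # example expectation
def events_silver():
    # A streaming read of the append-only bronze table only processes new rows on each update.
    return spark.readStream.table("my_catalog.my_schema.events_bronze")

Because the bronze table is append-only, the downstream streaming table doesn't have to recompute the full history on each pipeline update.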

rdmeyersf
New Contributor II

If I'm reading this right, you created a materialized view to prep your data in Postgres. You may not need to do that, and it will also limit your integration options: it puts more work on Postgres, usually creates more data to move, and fewer integration tools support views. Databricks and ETL tools will be faster at it.

If you want to look at other CDC/ETL/ELT options, you can read this article on Substack: https://estuarydev.substack.com/p/the-data-engineers-guide-to-cdc-for

If you're looking for batch, it covers that as well; you may not need real-time for this. Estuary has real-time and batch Postgres connectors, and Databricks connectors too.

The main reason for looking at tools is that you're starting to integrate. You'll save yourself time now and avoid an integration nightmare later if you start integrating the right way. Just make sure you can replace your tools quickly; design for change.

Several ETL tools are free to start with. Try https://estuary.dev/, which is arguably the best at CDC and great at batch too. It's free for up to 10 GB/month if that's within your limit.
