Data Engineering

Incrementally ingesting from a static db into a Delta Table

rk1994
New Contributor

Hello everyone,

I’m very new to Delta Live Tables (and Delta Tables too), so please forgive me if this question has been asked here before.

Some context: I have over 100M records stored in a Postgres table. I can connect to this table using the conventional spark.read.format("jdbc")...load() method.
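For reference, the read looks roughly like this (host, database, table, and credentials below are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- substitute your own host, database, and credentials
events_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.events")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "org.postgresql.Driver")
    .load()
)
```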

Unfortunately, due to some organizational restrictions, I cannot use streaming frameworks such as Kafka or Debezium, so using the AutoLoader is out of scope for me.

I have already created a materialized view and backfilled it with ~100M records.

Now, the use case: ~500k new records land in the Postgres table every day, and I would like to schedule a workflow/DLT pipeline to append this new data to the existing Delta table. Is DLT a feasible solution for this? Or should I stick to a regular notebook-based workflow?

I tried using DLT because I would like to perform some intermediate transforms using expectations, but every time I run the pipeline, the existing data in the materialized view is overwritten. To counter this, I used a hacky workaround: check whether the table exists and, if it does, read the existing table and union() the new data onto it. But then I'm not leveraging DLT optimally. Is there a better way to avoid recomputing the materialized view(s) every time?
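Roughly, the workaround looks like this (the table name, connection details, and date filter are illustrative placeholders, not my exact code):

```python
import dlt

TARGET = "main.default.events"  # hypothetical name of the table the pipeline maintains

@dlt.table(name="events")
def events():
    # New rows pulled from Postgres (same JDBC read as above, filtered to recent data)
    new_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<database>")
        .option("dbtable", "(SELECT * FROM public.events WHERE created_at > now() - interval '1 day') AS src")
        .option("user", "<user>")
        .option("password", "<password>")
        .load()
    )
    if spark.catalog.tableExists(TARGET):
        # The hacky part: read the already-materialized rows and union them back in
        # so the full refresh doesn't drop them
        return spark.read.table(TARGET).unionByName(new_df)
    return new_df
```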

A last question: is DLT even a good fit for my use case? I know I can achieve this just as easily with a regular workflow, so I'd like to know what advantages I'd get by using DLT vs. not using it.

Thanks!

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @rk1994, based on the information provided, it seems that Delta Live Tables (DLT) may not be the optimal solution for your specific use case.

In your use case, where you have a relatively static data source (Postgres table) and need to perform incremental updates, a regular notebook-based workflow might be a more suitable and straightforward solution compared to using DLT. The key advantages of DLT are:

  • Declarative pipeline definition
  • Automated task orchestration and cluster management
  • Integrated data quality monitoring with expectations
  • Improved lineage and observability

However, these benefits may not be as impactful in your specific scenario, where you are dealing with a relatively simple ETL process without the need for complex streaming or batch processing.

Overall, a regular notebook-based workflow looks like the more appropriate fit here. It gives you more control over the ingestion and transformation process, and you can still get data quality monitoring in Databricks through other means, such as custom SQL checks or Delta table constraints.
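As a rough sketch of what such a notebook could look like, the job below pulls only rows newer than the target table's current high-water mark and appends them. The table name, watermark column, and connection details are illustrative placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

TARGET = "main.bronze.events"   # illustrative target Delta table
WATERMARK_COL = "created_at"    # illustrative monotonically increasing column

# Find the current high-water mark in the target table (None on the very first run)
if spark.catalog.tableExists(TARGET):
    high_water = spark.read.table(TARGET).agg({WATERMARK_COL: "max"}).collect()[0][0]
else:
    high_water = None

# Push the filter down to Postgres so only the new rows are transferred
if high_water is not None:
    src = f"(SELECT * FROM public.events WHERE {WATERMARK_COL} > '{high_water}') AS src"
else:
    src = "public.events"

new_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", src)
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)

# A basic quality check in place of DLT expectations
new_rows = new_rows.filter(col("id").isNotNull())

# Append only the new rows to the existing Delta table
new_rows.write.format("delta").mode("append").saveAsTable(TARGET)
```

Scheduled as a daily Databricks Job, this leaves the ~100M historic rows untouched and only moves the ~500k new records each run.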

TPSteve
New Contributor II

First, you need to understand why your current solution is failing.

Materialized views and views in DLT don't differ conceptually from mat. views and views in PostgreSQL: every time the pipeline runs, both are recalculated. The difference is that a mat. view is persisted, while a view is only available during the lifetime of the pipeline. If your backfilled records are not in the mat. view's source table, they will not be included in the refreshed result. You also don't want to recalculate all of the historic data unless you're updating results that aggregate the historic and new data together, and it doesn't sound like that's the use case. You're interested in appending the incremental results to the table.

As always, there are multiple ways to construct a pipeline. Without knowing more details about your data, DLT should be an option. Whether it's the optimal one is something else to consider.

I also suggest you follow the lakehouse paradigm. Land the untransformed data in a Delta table first and then process the transformations. If you ever need to recalculate all of the data, you won't need to read it from PostgreSQL again.
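One way to wire that up (not spelled out above, so treat it as an assumption) is to have a scheduled job append the raw Postgres rows to a bronze Delta table, and then have a DLT streaming table read the bronze table incrementally while still applying expectations. All names below are illustrative:

```python
import dlt

# Assumes a separate scheduled job appends the raw Postgres rows to this bronze Delta table
BRONZE = "main.bronze.events_raw"  # illustrative name

@dlt.table(name="events_silver", comment="Incrementally processed events")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # illustrative expectation
def events_silver():
    # Reading the bronze table as a stream means each pipeline update only processes
    # rows appended since the last run; historic data is not recomputed.
    return spark.readStream.table(BRONZE)  # intermediate transforms would go here
```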

rdmeyersf
New Contributor

If I'm reading this right, you created a materialized view to prep your data in Postgres. You may not need to do that, and it will also limit your integration options: it puts more work on Postgres, usually creates more data to move, and not as many integration tools support views. Databricks and ETL tools will be faster at it.

If you want to look at other CDC/ETL/ELT options, you can read this article on Substack: https://estuarydev.substack.com/p/the-data-engineers-guide-to-cdc-for

If you're looking for batch, it covers that as well; you may not need real-time for this. Estuary has real-time and batch Postgres connectors, and Databricks connectors too.

The main reason for looking at tools is that you're starting to integrate. You'll save yourself time now and avoid an integration nightmare in the future if you start integrating the right way now. Just make sure you can replace your tools quickly; design for change.

Several ETL tools are free to start with. Try https://estuary.dev/, which is arguably the best for CDC and great at batch too. It's free for up to 10GB/month if that's within your limit.
