Hello everyone,
I’m very new to Delta Live Tables (and Delta Tables too), so please forgive me if this question has been asked here before.
Some context: I have over 100M records stored in a Postgres table. I can connect to this table using the conventional spark.read.format("jdbc")...load() method.
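For reference, the read itself looks roughly like this (a simplified sketch; the host, database, table name, and secret scope are placeholders, not my real values):

```python
# Minimal sketch of the JDBC read -- connection details are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://<host>:5432/<database>")
    .option("dbtable", "public.my_source_table")
    .option("user", dbutils.secrets.get("my-scope", "pg-user"))
    .option("password", dbutils.secrets.get("my-scope", "pg-password"))
    .option("driver", "org.postgresql.Driver")
    .load()
)
```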
Unfortunately, due to some organizational restrictions, I cannot use streaming frameworks such as Kafka or Debezium, so Auto Loader is out of scope for me.
I have already created a materialized view and backfilled it with ~100M records.
Now, the use case: ~500k new records land in the Postgres table every day, and I would like to schedule a workflow/DLT pipeline to append this new data to the existing Delta table. Is DLT a feasible solution for this, or should I stick to a regular notebook-based workflow?
I tried DLT because I would like to apply some intermediate transforms and expectations, but every time the pipeline runs, the existing data in the materialized view is overwritten. To work around this, I added a hacky check: if the target table already exists, I read the existing table and union() the new data onto it. But then I'm not really leveraging DLT, so is there a better way to avoid recomputing the materialized view(s) on every run? A sketch of what I'm doing now is below.
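(This is a simplified sketch of the workaround, not my real pipeline; the table names, the expectation, and the JDBC options are placeholders.)

```python
import dlt

def read_from_postgres():
    # Placeholder for the JDBC read shown above.
    return (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://<host>:5432/<database>")
        .option("dbtable", "public.my_source_table")
        .load()
    )

@dlt.table(name="my_target_table")
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")  # one of the intermediate checks
def my_target_table():
    new_data = read_from_postgres()
    # The hacky part: if the table already exists, union the previously
    # ingested data back in so the daily run effectively appends instead of
    # rebuilding the materialized view from scratch.
    if spark.catalog.tableExists("my_catalog.my_schema.my_target_table"):
        existing = spark.read.table("my_catalog.my_schema.my_target_table")
        return existing.union(new_data)
    return new_data
```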
One last question: is DLT even a good fit for my use case? I know I can achieve this just as easily with a regular workflow, so I'd like to understand what advantages DLT would give me over not using it.
Thanks!