Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to activate ignoreChanges in Delta Live Tables read_stream?

adrianlwn
New Contributor III

Hello everyone,

I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that will read the DLT as a stream (dlt.read_stream("<tablename>")).

I keep receiving this error:

> Detected a data update (for example part-00000-6723832a-b8ca-4a20-b576-d69bd5e42652-c000.snappy.parquet) in the source table at version 11. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.

I've tried the following ways of enabling this option:

# Attempt 1: spark_conf on the view decorator
@dlt.view(name="_wp_strategies_dup",
          comment="This table contains the test strategy table",
          spark_conf={"ignoreChanges": "true"})
# Attempt 2: option on spark.readStream
spark.readStream.option("ignoreChanges", "true").table("LIVE.wp_parameters")
# Attempt 3: option on dlt
dlt.option("ignoreChanges", "true").read_stream("wp_parameters")

So far nothing has worked. Is it because this option cannot be used with DLT, or is there another way to set it?

18 REPLIES

TH
New Contributor II

Hi @Swapnil Kamle,

We also implemented change data capture for deduplication purposes in DLT. We do it in SQL using the APPLY CHANGES INTO command. How does your workaround solve the issue of updates in such a case? Would you mind explaining?

Thanks

SRK
Contributor III

Hi TH,

If you look at the code I shared, I first write the data as JSON in append mode, then read the JSON files using Auto Loader.

df_table.write.mode("append").json("/mnt/temp_table/Employee", ignoreNullFields=False)

So it only appends data and never updates it, which is how I avoid the issue with updates.

Thanks

TH
New Contributor II

Thanks for your answer. But then you are not doing Change Data Capture (for deduplication purposes) as initially asked. I am looking for a solution that still lets me do deduplication...

gopínath
New Contributor II

With DLT read_stream, we can't use ignoreChanges / ignoreDeletes. These options help avoid the failure, but they do so by ignoring the operations performed on the upstream table, so you would have to apply the deletes or updates to the downstream table manually. (Spark Structured Streaming only supports ever-growing, append-only sources.)
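
To illustrate why these options only hide the failure, here is a toy, pure-Python model of a table stored as files. It is not Delta's actual implementation. It assumes, as the Delta documentation describes for ignoreChanges, that an update rewrites a whole file and that the rewritten file is re-emitted to the stream in full, so unchanged rows can show up twice and the downstream has to clean up. All function names are hypothetical.

```python
# Toy model: a table is a list of "files"; each file is a tuple of (id, value) rows.
# Assumes ids are unique across files.

def update_row(files, row_id, new_value):
    """Rewrite the file containing row_id, as an UPDATE would.

    Returns (new_files, rewritten_file); rewritten_file is None if no row matched.
    """
    new_files, rewritten = [], None
    for f in files:
        if any(rid == row_id for rid, _ in f):
            rewritten = tuple(
                (rid, new_value if rid == row_id else val) for rid, val in f
            )
            new_files.append(rewritten)
        else:
            new_files.append(f)
    return new_files, rewritten


def emitted_rows(initial_files, rewritten_file):
    """Rows an append-only reader sees if rewritten files are re-sent in full."""
    rows = [row for f in initial_files for row in f]
    if rewritten_file is not None:
        rows.extend(rewritten_file)
    return rows


def latest_per_key(rows):
    """Manual downstream fix-up: keep the last value seen for each id."""
    latest = {}
    for rid, val in rows:
        latest[rid] = val
    return latest


files = [((1, "a"), (2, "b")), ((3, "c"),)]
_, rewritten = update_row(files, row_id=2, new_value="B")
rows = emitted_rows(files, rewritten)
deduped = latest_per_key(rows)
```

In this example row 1 was never changed, yet it appears twice in `rows` because it shared a file with row 2. `latest_per_key` stands in for the manual cleanup the downstream table would need.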

If your upstream table can have updates / deletes and you want these operations passed to the downstream tables automatically, you can follow either of the DLT architectures suggested below. In both setups, using live tables helps handle updates / deletes from upstream.

Architecture 1:

You can use live tables to handle this. For use cases where you perform updates/deletes on the bronze table and want them reflected in the silver table, you can create the silver table as a live table.

Refer to the diagram below:

[image]
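
A minimal sketch of Architecture 1 in Python might look like the following. The table names `bronze_events` and `silver_events` and the `id` column are hypothetical. Because the `dlt` module can only be imported inside a pipeline, the sketch takes it as a parameter; in a pipeline notebook you would `import dlt` and pass that module in, or simply inline the body.

```python
def define_live_silver(dlt):
    """Register silver as a live table that is computed from bronze."""

    @dlt.table(
        name="silver_events",  # hypothetical table name
        comment="Deduplicated rows from bronze",
    )
    def silver_events():
        # dlt.read (not dlt.read_stream) performs a complete read, so updates
        # and deletes in bronze are picked up when silver is recomputed.
        return dlt.read("bronze_events").dropDuplicates(["id"])
```

The design choice here is to give up streaming on this hop: silver no longer fails on upstream changes because it does not read bronze as an append-only stream.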

Architecture 2:

Another way to handle updates / deletes and pass them downstream is to use DLT CDC. The CDC architecture looks something like this:

DLT bronze table --> DLT silver using CDC apply_changes --> DLT gold live table

Here the silver table picks up the change data from bronze (updates or deletes) and applies the necessary operations.

In both setups, if you delete/update any record in the bronze table for use cases like GDPR, the delete/update automatically flows to the silver table (you do not need to manually delete/update in silver and then in gold). Gold then reads this silver table and performs a full refresh, since it is a live table.
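
As a rough sketch of the bronze --> silver (apply_changes) --> gold flow in Python, assuming a JSON source read with Auto Loader: the table names, the source path, and the `id`, `updated_at`, and `event_type` columns are all hypothetical. The `dlt` module and `spark` session are passed in as parameters because `dlt` is only importable inside a pipeline. Older DLT releases exposed `create_streaming_table` under a different name, so check the version you run.

```python
def define_cdc_pipeline(dlt, spark, source_path="/mnt/raw/events"):
    """Register bronze -> silver (apply_changes) -> gold (live table)."""

    @dlt.table(name="bronze_events")
    def bronze_events():
        # Append-only ingestion with Auto Loader; the path is a placeholder.
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load(source_path)
        )

    # The target of apply_changes has to be declared first.
    dlt.create_streaming_table(name="silver_events")

    dlt.apply_changes(
        target="silver_events",
        source="bronze_events",
        keys=["id"],               # hypothetical key column
        sequence_by="updated_at",  # hypothetical ordering column
    )

    @dlt.table(name="gold_event_counts")
    def gold_event_counts():
        # Complete read of silver, so gold reflects its updates and deletes.
        return dlt.read("silver_events").groupBy("event_type").count()
```

In a pipeline notebook this would be called once as `define_cdc_pipeline(dlt, spark)`. Deduplication happens in the `apply_changes` step, which keeps one row per key, while gold avoids the streaming error by reading silver completely rather than as a stream.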

DLT also has a special feature called Enzyme. Enzyme helps avoid full recomputation of the LIVE table and improves performance.

What is Enzyme?

Compared to the existing method of fully recomputing all rows in the live table – even rows which do not need to be changed – Enzyme may significantly reduce resource utilization and improve overall pipeline latency by only updating the rows in the live table which are necessary to materialize the result.

For more details on Enzyme, you can refer to this blog: https://www.databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performa...
