Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to activate ignoreChanges in a Delta Live Tables read_stream?

adrianlwn
New Contributor III

Hello everyone,

I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that reads the DLT table as a stream (dlt.read_stream("<tablename>")).

I keep receiving this error:

> Detected a data update (for example part-00000-6723832a-b8ca-4a20-b576-d69bd5e42652-c000.snappy.parquet) in the source table at version 11. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.

I've tried these approaches to enable the option:

# 1. Setting it in the view's spark_conf:
@dlt.view(name="_wp_strategies_dup",
          comment="This table contains the test strategy table",
          spark_conf={"ignoreChanges": "true"})

# 2. Passing it as a reader option:
spark.readStream.option("ignoreChanges", "true").table("LIVE.wp_parameters")

# 3. Setting it on the dlt reader directly:
dlt.option("ignoreChanges", "true").read_stream("wp_parameters")

So far nothing has worked. Is this configuration simply not possible with DLT, or is there another way to set it up?
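For contrast, outside of DLT the option from the error message is set on a plain Delta Structured Streaming read. A minimal sketch (table names and the checkpoint path are placeholders, not from the thread):

```python
# Plain (non-DLT) Structured Streaming read of a Delta table that
# tolerates re-transmitted files from UPDATE/MERGE/DELETE in the source.
# Note: ignoreChanges does not filter the changed rows themselves;
# downstream logic must be able to handle duplicates.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream
    .format("delta")
    .option("ignoreChanges", "true")   # skip the "data update detected" failure
    .table("wp_parameters")            # placeholder source table
)

(stream.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/wp_parameters")  # placeholder
    .toTable("wp_parameters_downstream"))                            # placeholder
```

The question in this thread is whether the same option can be applied when the read goes through dlt.read_stream instead.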

14 REPLIES

Hi @Kaniz Fatma, thank you for your answer. Unfortunately it doesn't solve my issue.

My question was about Delta Live Tables, not classic Delta tables. I was wondering whether the suggested setting, ignoreChanges, is even possible in DLT...

Hubert-Dudek
Esteemed Contributor III

Hi team @Prabakar Ammeappin @Werner Stinckens @Jose Gonzalez @Lindsay Olson. Recently I had the same issue with .option("ignoreChanges", "true") not working for DLT tables, and it was frustrating 🙂 Maybe we could get some internal insight on that.

TH
New Contributor II

Any update on this? Will this be possible anytime soon with DLT?

We would also be interested in this. It is critical functionality for us, as we need to handle changes in the data. Otherwise we cannot consider DLT a viable solution, although we would want to.

I am also facing the same issue. Is there any update on how to enable ignoreChanges for DLT tables?

Below is my code, and it's not working:

import dlt

@dlt.table
def messages_raw():
    # load incrementally with Auto Loader
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(JSONschema)
        .option("ignoreChanges", "true")
        # .load("/mnt/raj-zuk-comparis-poc/messages*.json")
        .load("s3://zuk-comparis-poc/")
    )

Louis_Perreault
New Contributor II

Hi @Kaniz Fatma​ ,

We're facing the same issue, but with the "ignoreDeletes" option. Has there been any progress on solving the problem?

fecavalc08
New Contributor III

Has anyone found a solution? We are facing the same thing.

AMadan
New Contributor II

Hi @Kaniz Fatma​ ,

I am working on a use case where I keep customer data in a medallion architecture using Delta Live Tables.

I would also like to delete data for GDPR compliance, so I tried a simple delete script that removes consumers older than five years from the bronze, silver, and gold tables.

After that, I tried to run the DLT pipeline again and ran into the issue mentioned above:

" Detected a data update (for example part-00000-6723832a-b8ca-4a20-b576-d69bd5e42652-c000.snappy.parquet) in the source table at version 11. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory."

Any idea how to implement ignore changes and ignore deletes in DLT?

Hubert-Dudek
Esteemed Contributor III

Yes, that is a pain currently. I bet that for now, you need to perform a full refresh with cleaned checkpoints.
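A full refresh can also be triggered outside the UI. A minimal sketch using the Databricks pipelines REST endpoint (POST /api/2.0/pipelines/{pipeline_id}/updates with full_refresh set); the host, pipeline ID, and token below are placeholders:

```python
import json
import urllib.request

def full_refresh_request(host: str, pipeline_id: str, token: str) -> urllib.request.Request:
    """Build a request that starts a pipeline update with full_refresh=True,
    which recomputes all tables and resets streaming state."""
    url = f"{host}/api/2.0/pipelines/{pipeline_id}/updates"
    body = json.dumps({"full_refresh": True}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )

req = full_refresh_request("https://example.cloud.databricks.com",
                           "1234-abcd", "dapiXXXX")  # placeholder credentials
# urllib.request.urlopen(req)  # would actually start the update
```

This does the same thing as clicking "Full refresh all" in the pipeline UI, so it can be scheduled after a GDPR delete job if needed.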

SRK
Contributor III

We have identified a workaround for this issue. First, append the source data to JSON files:

df_table = spark.sql("SELECT * FROM Employee")
df_table.write.mode("append").json("/mnt/temp_table/Employee", ignoreNullFields=False)

Then read the JSON files back with Auto Loader and apply changes:

CREATE STREAMING LIVE TABLE Employee_temp
COMMENT "Employee temp"
AS SELECT * FROM cloud_files("/mnt/temp_table/Employee", "json");

-- Create and populate the target table.
CREATE OR REFRESH STREAMING LIVE TABLE dim_employee;

APPLY CHANGES INTO live.dim_employee
FROM stream(live.Employee_temp)
KEYS (employeeid)
IGNORE NULL UPDATES
SEQUENCE BY load_datetime
STORED AS SCD TYPE 2;

TH
New Contributor II

Hi @Swapnil Kamle,

We also implemented Change Data Capture for deduplication purposes in DLT. We do it in SQL using the APPLY CHANGES INTO command. How does your workaround solve the issue of updates in such a case? Would you mind explaining?

Thanks

SRK
Contributor III

Hi TH,

If you look at the code I shared, I first use append mode to write the data to JSON, then read the JSON files back with Auto Loader:

df_table.write.mode("append").json("/mnt/temp_table/Employee", ignoreNullFields=False)

So it only ever appends data, never updates it, which avoids the issue with updates.

Thanks

TH
New Contributor II

Thanks for your answer. But then you are not doing Change Data Capture (for deduplication purposes) as initially asked. I am looking for a solution that still lets me do deduplication...

gopínath
New Contributor II

In DLT read_stream, we can't use ignoreChanges / ignoreDeletes. These configs help avoid the failures, but they do so by ignoring the operations performed upstream, so you would still need to manually perform the deletes or updates downstream. (Spark Structured Streaming expects ever-growing, append-only sources.)

If you have use cases where the upstream can have updates/deletes and you want these operations to propagate automatically downstream, you can follow the architectures suggested below. In both setups, using live tables helps handle upstream updates/deletes.

Architecture 1:

Use live tables to handle this: when you perform updates/deletes on the bronze table and want them reflected in the silver table, create the silver table as a live table.

Refer to the diagram below:

[Diagram: bronze (streaming) table feeding a silver live table]
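A minimal sketch of Architecture 1 (table names, paths, and the filter column are placeholders): bronze is a streaming ingest, while silver is a live table defined with a batch read, so it is recomputed from the current state of bronze and upstream deletes/updates flow through on each pipeline update.

```python
import dlt
from pyspark.sql import functions as F

# Bronze: streaming (append-only) ingest with Auto Loader.
@dlt.table(name="customers_bronze")          # placeholder names throughout
def customers_bronze():
    return (spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("s3://example-bucket/customers/"))

# Silver: a live (materialized) table defined with a *batch* read.
# Because it is recomputed from the current contents of bronze rather
# than consumed as a stream, deletes/updates in bronze are reflected here
# without hitting the "data update detected" streaming error.
@dlt.table(name="customers_silver")
def customers_silver():
    return dlt.read("customers_bronze").where(F.col("customer_id").isNotNull())
```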

Architecture 2:

Another way to handle updates/deletes and pass them downstream is DLT CDC. The CDC architecture looks like this:

DLT bronze table --> DLT silver using CDC apply_changes --> DLT gold live table

Here the silver table picks up change data from bronze (updates or deletes) and performs the necessary operations.
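The apply_changes step above can be sketched in the DLT Python API roughly as follows. This assumes the bronze records carry an operation indicator column and an ordering column; all table and column names are placeholders:

```python
import dlt
from pyspark.sql.functions import expr

# Target table that apply_changes will maintain.
dlt.create_streaming_table("customers_silver")

# Silver picks up change records from bronze via CDC:
# upserts are keyed by customer_id, ordered by load_datetime,
# and rows flagged as deletes remove the matching silver record.
dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze",
    keys=["customer_id"],
    sequence_by="load_datetime",
    apply_as_deletes=expr("operation = 'DELETE'"),  # propagate upstream deletes
    stored_as_scd_type=1,
)
```

With this in place, a GDPR delete only has to be issued once as a change record in bronze; the pipeline propagates it to silver on the next update.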

In both setups, if you delete/update any record in the bronze table (for use cases like GDPR), the change automatically flows to the silver table; you don't need to manually delete/update silver and then gold. Gold then reads the silver table and recomputes fully, since it is a live table.

DLT also has a special feature called Enzyme. Enzyme helps avoid full recomputation of LIVE tables and improves performance.

What is Enzyme?

Compared to the existing method of fully recomputing all rows in the live table, even rows which do not need to be changed, Enzyme may significantly reduce resource utilization and improve overall pipeline latency by only updating the rows in the live table which are necessary to materialize the result.

For more details on Enzyme, you can refer to this blog: https://www.databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performa...
