How to activate ignoreChanges in Delta Live Tables read_stream?
10-06-2022 10:35 AM
Hello everyone,
I'm using DLT (Delta Live Tables) and I've implemented some Change Data Capture for deduplication purposes. Now I am creating a downstream table that reads the DLT table as a stream (dlt.read_stream("<tablename>")).
I keep receiving this error:
> Detected a data update (for example part-00000-6723832a-b8ca-4a20-b576-d69bd5e42652-c000.snappy.parquet) in the source table at version 11. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory.
I've tried these options to activate this setting:
@dlt.view(name="_wp_strategies_dup",
          comment="This table contains the test strategy table",
          spark_conf={"ignoreChanges": "true"})
spark.readStream.option("ignoreChanges","true").table("LIVE.wp_parameters")
dlt.option("ignoreChanges","true").read_stream("wp_parameters")
So far nothing has worked. Is it because this configuration is not possible with DLT? Or is there another way to set this configuration up?
- Labels: Delta, DLT, Spark structured streaming
10-10-2022 08:55 AM
Hi @Kaniz Fatma, thank you for your answer. Unfortunately it doesn't solve my issue.
My question was about Delta Live Tables, not classical Delta tables. I was wondering whether the suggested setting, ignoreChanges, is even possible in DLT...
10-14-2022 05:31 AM
Hi team @Prabakar Ammeappin @Werner Stinckens @Jose Gonzalez @Lindsay Olson. Recently I had the same issue with .option("ignoreChanges", "true") not working for DLT tables, and it was frustrating. Maybe we could get some internal insights about that.
01-25-2023 12:23 PM
Any update on this? Will this be possible anytime soon with DLTs?
04-19-2023 10:42 AM
We would also be interested in this. It is critical functionality for us, since we need to handle changes in the data. Otherwise we cannot consider DLT a viable solution, although we would want to.
01-18-2023 04:42 AM
I am also facing the same issue. Is there any update on how to enable ignoreChanges for DLT tables, please?
Below is my code, and it's not working:
@dlt.table  # decorator assumed; needed for the function to define a DLT dataset
def messages_raw():
    return (
        # load incrementally with Auto Loader
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(JSONschema)
        .option("ignoreChanges", "true")
        # .load("/mnt/raj-zuk-comparis-poc/messages*.json")
        .load("s3://zuk-comparis-poc/")
    )
10-27-2022 07:04 AM
Hi @Kaniz Fatma,
We're facing the same issue, but with the "ignoreDeletes" option. Is there any progress on solving this problem?
12-05-2022 09:23 AM
Has anyone found a solution? We are facing the same thing.
01-10-2023 02:17 AM
Hi @Kaniz Fatma,
I am working on a use case where I keep customer data in a medallion architecture built with Delta Live Tables.
I also need to delete data for GDPR compliance, so I tried a simple delete script that removes customers older than 5 years from the bronze, silver and gold tables (sketched below).
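For context, a minimal sketch of the kind of retention delete described here; the table and column names are hypothetical and not from the original post:

# Hypothetical GDPR retention delete: remove customers older than 5 years
# from each medallion layer. Table and column names are placeholders.
for table in ["bronze.customers", "silver.customers", "gold.customers"]:
    spark.sql(f"""
        DELETE FROM {table}
        WHERE customer_since < date_sub(current_date(), 5 * 365)
    """)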
After that, I tried to run the DLT pipeline again and ran into the issue mentioned above:
" Detected a data update (for example part-00000-6723832a-b8ca-4a20-b576-d69bd5e42652-c000.snappy.parquet) in the source table at version 11. This is currently not supported. If you'd like to ignore updates, set the option 'ignoreChanges' to 'true'. If you would like the data update to be reflected, please restart this query with a fresh checkpoint directory."
Any idea how to implement ignoreChanges and ignoreDeletes in DLT?
01-10-2023 04:54 AM
Yes, that is a pain currently. I suspect that, for now, you need to perform a full refresh with cleaned checkpoints.
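For reference, a minimal sketch of triggering such a full refresh programmatically, assuming the Databricks Pipelines REST API endpoint POST /api/2.0/pipelines/{pipeline_id}/updates; the host, token and pipeline ID below are placeholders:

import requests

host = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
token = "<personal-access-token>"                        # placeholder access token
pipeline_id = "<pipeline-id>"                            # placeholder pipeline ID

# Start a pipeline update that recomputes all tables and resets streaming checkpoints.
resp = requests.post(
    f"{host}/api/2.0/pipelines/{pipeline_id}/updates",
    headers={"Authorization": f"Bearer {token}"},
    json={"full_refresh": True},
)
resp.raise_for_status()
print(resp.json())  # contains the update_id of the triggered run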
01-18-2023 09:44 PM
We have identified a workaround to resolve this issue:
# Export the table to JSON files in append mode, so downstream reads see append-only data.
df_table = spark.sql("SELECT * FROM Employee")
df_table.write.mode("append").json("/mnt/temp_table/Employee", ignoreNullFields=False)

-- Stage the exported JSON as a streaming live table via Auto Loader.
CREATE STREAMING LIVE TABLE Employee_temp
COMMENT "Employee temp"
AS SELECT * FROM cloud_files("/mnt/temp_table/Employee", "json");

-- Create and populate the target table.
CREATE OR REFRESH STREAMING LIVE TABLE dim_employee;

APPLY CHANGES INTO live.dim_employee
FROM stream(live.Employee_temp)
KEYS (employeeid)
IGNORE NULL UPDATES
SEQUENCE BY load_datetime
STORED AS SCD TYPE 2;
01-24-2023 12:38 PM
Hi @Swapnil Kamle,
We also implemented change data capture for deduplication purposes in DLT. We do it in SQL using the APPLY CHANGES INTO command. How does your workaround solve the issue of updates in such a case? Would you mind explaining?
Thanks
01-24-2023 10:56 PM
Hi TH,
If you look at the code I shared, I first write the data to JSON in append mode and then read the JSON files back with Auto Loader:
df_table.write.mode("append").json("/mnt/temp_table/Employee", ignoreNullFields=False)
So it only ever appends data, never updates in place, which avoids the issue related to updates.
Thanks
01-25-2023 12:21 PM
Thanks for your answer. But then you are not doing Change Data Capture (for deduplication purposes) as initially asked. I am looking for a solution that still lets me do deduplication...
02-27-2023 07:03 PM
In DLT read_stream we can't use ignoreChanges / ignoreDeletes. These options only avoid the failure; they actually ignore the update/delete operations done upstream, so you would still need to apply those deletes or updates manually in the downstream table (Spark Structured Streaming supports ever-growing, append-only sources).
If your upstream can have updates/deletes and you want those operations to flow downstream automatically, you can follow one of the architectures suggested below. In both setups, using live tables handles the updates/deletes coming from upstream.
Architecture 1:
Use live tables. For use cases where you perform updates/deletes on the bronze table and want them reflected in the silver table, create the silver table as a live table, as in the sketch below.
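A minimal Python sketch of Architecture 1; table, column and path names are hypothetical. The silver dataset is a regular live table defined with dlt.read() instead of dlt.read_stream(), so it is recomputed from bronze on each pipeline update and therefore reflects upstream updates/deletes:

import dlt
from pyspark.sql import functions as F

# Hypothetical bronze streaming table ingested with Auto Loader.
@dlt.table(name="customers_bronze", comment="Raw customer records")
def customers_bronze():
    return (
        spark.readStream                      # spark session is provided by the DLT runtime
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/customers")           # placeholder path
    )

# Silver as a LIVE (materialized) table: dlt.read() is a batch read, so any
# updates/deletes in bronze are simply reflected on the next pipeline update.
@dlt.table(name="customers_silver", comment="Cleaned customers, recomputed each update")
def customers_silver():
    return (
        dlt.read("customers_bronze")
        .dropDuplicates(["customer_id"])      # hypothetical key column
        .withColumn("processed_at", F.current_timestamp())
    )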
Architecture 2:
Another way to handle updates/deletes and pass them downstream is to use DLT CDC. The CDC architecture looks like this:
DLT bronze table --> DLT silver using CDC apply_changes --> DLT gold live table
Here the silver table picks up the change data from bronze (updates or deletes) and applies the necessary operations; see the sketch below.
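A minimal Python sketch of Architecture 2; table, key and column names are hypothetical. apply_changes maintains the silver streaming table from the change records in bronze, and gold is a live table on top:

import dlt
from pyspark.sql import functions as F

# Target streaming table that apply_changes will maintain.
dlt.create_streaming_table("customers_silver_cdc")

# Apply inserts, updates and deletes from the bronze CDC records into silver.
dlt.apply_changes(
    target="customers_silver_cdc",
    source="customers_bronze",                        # hypothetical bronze table with CDC records
    keys=["customer_id"],                             # hypothetical primary key
    sequence_by=F.col("load_datetime"),               # ordering column for out-of-order events
    apply_as_deletes=F.expr("operation = 'DELETE'"),  # hypothetical CDC operation column
    stored_as_scd_type=1,                             # keep only the latest version of each key
)

# Gold as a live table: recomputed from silver, so CDC changes flow through.
@dlt.table(name="customers_gold")
def customers_gold():
    return dlt.read("customers_silver_cdc").groupBy("country").count()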
In both setups, if you delete/update any record in the bronze table (for example for GDPR), the delete/update automatically flows through to the silver table (you don't need to manually delete/update silver and then gold). Gold then reads the silver table and performs a full refresh (live table).
DLT also has a special feature called Enzyme. Enzyme helps avoid full recomputation of LIVE tables and improves performance.
What is Enzyme?
Compared to the existing method of fully recomputing all rows in the live table, even rows which do not need to be changed, Enzyme may significantly reduce resource utilization and improve overall pipeline latency by only updating the rows in the live table that are necessary to materialize the result.
For more details on Enzyme you can refer to this blog: https://www.databricks.com/blog/2022/06/29/delta-live-tables-announces-new-capabilities-and-performa...