DLT - Continuously Updated File Issue
11-13-2024 01:42 AM
Hi everyone,
I'm encountering an issue with my DLT pipeline that I haven't been able to resolve. My pipeline reads a single CSV file that is over 100 GB in size. This file is continuously updated throughout the day. When DLT attempts to read the file (which takes a few minutes), it fails with an 'underlying files have been updated' error. As a result, I have to read snapshots, and the pipeline reads the entire file each time instead of just the recently added rows.
What would you recommend in this scenario?
Please note that the file size and type are managed by a third party, so I can't make any changes on that front.
Labels: Workflows
11-13-2024 06:10 AM
Hi @cat017,
Here are a few recommendations:
Use Auto Loader with File Notification Mode: Instead of re-reading the entire CSV file on every run, you can use Databricks Auto Loader in File Notification Mode. In this mode, Auto Loader subscribes to storage events and processes data files as they arrive in cloud storage, rather than repeatedly listing and re-reading the source. On AWS, file notifications are delivered through SNS and SQS, which helps in detecting new files or changes to existing files. This approach reduces the risk of hitting the "underlying files have been updated" error.
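As a rough sketch of what this could look like in a DLT pipeline (not a tested configuration for your environment): the table name `raw_feed`, the `source_dir` argument, and the `header` setting are assumptions for illustration. The `cloudFiles.*` options are standard Auto Loader options. Note that Auto Loader tracks files rather than rows, so with `cloudFiles.allowOverwrites` enabled a modified file is, as far as I know, picked up again as a whole and not just its appended rows.

```python
def autoloader_csv_options(use_notifications=True, allow_overwrites=False):
    """Build Auto Loader reader options for a CSV source.

    Option values are passed as strings, as Spark reader options expect.
    """
    return {
        "cloudFiles.format": "csv",
        # File notification mode instead of directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
        # Whether a changed file at an already-seen path is processed again.
        "cloudFiles.allowOverwrites": str(allow_overwrites).lower(),
        # Assumption: the third-party CSV has a header row.
        "header": "true",
    }


def define_raw_table(spark, source_dir):
    """Register a streaming DLT table fed by Auto Loader.

    `source_dir` is a hypothetical cloud storage path. The `dlt` module is
    only importable inside a Delta Live Tables pipeline, so it is imported
    here and not at module level.
    """
    import dlt

    @dlt.table(name="raw_feed", comment="Raw CSV rows ingested with Auto Loader")
    def raw_feed():
        return (
            spark.readStream.format("cloudFiles")
            .options(**autoloader_csv_options())
            .load(source_dir)
        )
```

In a pipeline notebook you would call `define_raw_table(spark, "<your source directory>")` once at the top level so the table gets registered.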
Implement Change Data Capture (CDC): If the third-party system supports it, consider implementing a Change Data Capture (CDC) mechanism. A CDC feed contains only the changes made to the data (inserts, updates, deletes), and that feed can be ingested into your DLT pipeline. This way, you process only the new or changed rows instead of the entire file. Databricks provides APIs to simplify CDC with Delta Live Tables:
https://docs.databricks.com/en/delta-live-tables/cdc.html
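For illustration only, a minimal sketch of the `apply_changes` API described on that page. Every table and column name here (`feed_changes`, `feed_current`, `record_id`, `updated_at`, `operation`) is hypothetical, since the actual shape of a change feed depends on what the third party can provide.

```python
# Hypothetical settings: all table and column names are placeholders.
CDC_SETTINGS = {
    "target": "feed_current",        # streaming table kept up to date
    "source": "feed_changes",        # streaming source of change records
    "keys": ["record_id"],           # column(s) identifying a row
    "sequence_by": "updated_at",     # ordering column for out-of-order changes
    "apply_as_deletes": "operation = 'DELETE'",  # rows matching this are deleted
    "except_column_list": ["operation"],         # metadata column left out of the target
    "stored_as_scd_type": 1,         # keep only the latest version of each row
}


def register_cdc_flow(settings=CDC_SETTINGS):
    """Create the target table and attach the CDC flow to it.

    The `dlt` module is only importable inside a Delta Live Tables
    pipeline, so it is imported here and not at module level.
    """
    import dlt

    dlt.create_streaming_table(settings["target"])
    dlt.apply_changes(**settings)
```

With SCD type 1, the target holds one current row per key; the `sequence_by` column decides which change wins when records arrive out of order.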
It would also be a good idea to file a support case with us so that we can better understand your use case and suggest a solution: https://docs.databricks.com/en/resources/support.html
11-14-2024 12:26 PM
Hi @Alberto_Umana,
Thank you for the reply. I've already tried Auto Loader a few times, but it didn't work. I will try it again. But, as you suggest, it's better to file a case.

