
DLT - Continuously Updated File Issue

cat017
New Contributor III

Hi everyone,

I'm encountering an issue with my DLT pipeline that I haven't been able to resolve. My pipeline reads a single CSV file that is over 100 GB in size. This file is continuously updated throughout the day. When DLT attempts to read the file (which takes a few minutes), it fails with an 'underlying files have been updated' error. As a result, I have to read snapshots, and the pipeline reads the entire file each time instead of just the recently added rows.

What would you recommend in this scenario?

Please note that the file size and type are managed by a third party, so I can't make any changes on that front.

2 REPLIES

Alberto_Umana
Databricks Employee

Hi @cat017,

Here are a few recommendations:

Use Auto Loader with File Notification Mode: Instead of reading the entire CSV file on every run, you can use Databricks Auto Loader with File Notification Mode, which incrementally processes new data files as they arrive in cloud storage. On AWS, Auto Loader can be configured to use SQS for file notifications, which helps in detecting new files or changes to existing files. This approach reduces the risk of encountering the "underlying files have been updated" error.
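
As a rough illustration of the Auto Loader suggestion, here is a minimal sketch of a DLT table that reads CSV through Auto Loader with file notifications enabled. The landing path `/Volumes/main/landing/vendor_feed/` and the table name `vendor_feed_raw` are hypothetical placeholders, not details from this thread. One caveat to keep in mind: Auto Loader tracks ingestion per file, so this pattern fits best when the feed lands as separate files. For a single file that is appended in place, turning on `cloudFiles.allowOverwrites` would, as far as I understand it, cause the whole file to be reprocessed rather than only the new rows.

```python
# Sketch only: the path and table name are hypothetical placeholders.
LANDING_PATH = "/Volumes/main/landing/vendor_feed/"  # assumed location


def autoloader_options():
    """Reader options for Auto Loader in file notification mode."""
    return {
        "cloudFiles.format": "csv",             # source files are CSV
        "cloudFiles.useNotifications": "true",  # file notification mode
        "header": "true",                       # assumes a header row
    }


try:
    import dlt  # only importable inside a Delta Live Tables pipeline
except ImportError:
    dlt = None

if dlt is not None:

    @dlt.table(name="vendor_feed_raw", comment="Raw CSV rows from Auto Loader")
    def vendor_feed_raw():
        # `spark` is provided by the pipeline runtime.
        return (
            spark.readStream.format("cloudFiles")
            .options(**autoloader_options())
            .load(LANDING_PATH)
        )
```

Keeping the options in a plain function makes them easy to inspect or adjust outside the pipeline; the table definition itself only takes effect when the code runs inside DLT.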

Implement Change Data Capture (CDC): If the third-party system supports it, consider implementing a Change Data Capture (CDC) mechanism. CDC captures only the changes (inserts, updates, deletes) made to the data and can be ingested into your DLT pipeline. This way, you can process only the new or changed rows instead of the entire file. Databricks provides APIs to simplify CDC with Delta Live Tables.

https://docs.databricks.com/en/delta-live-tables/cdc.html
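
If the third party can deliver a change feed, the CDC approach could look roughly like the sketch below, which uses `dlt.create_streaming_table` and `dlt.apply_changes`. Every name in it is an assumption made for illustration: the source `vendor_changes` (assumed to be a streaming table or view defined elsewhere in the same pipeline), the target `vendor_current`, the key column `record_id`, the ordering column `updated_at`, and the `operation` column that flags deletes.

```python
# Sketch only: all table and column names below are hypothetical.


def cdc_settings():
    """Arguments for dlt.apply_changes, kept in one place for readability."""
    return {
        "target": "vendor_current",                  # table kept up to date
        "source": "vendor_changes",                  # assumed change feed
        "keys": ["record_id"],                       # assumed primary key
        "sequence_by": "updated_at",                 # orders the change events
        "apply_as_deletes": "operation = 'DELETE'",  # rows treated as deletes
        "except_column_list": ["operation", "updated_at"],
        "stored_as_scd_type": 1,                     # keep latest version only
    }


try:
    import dlt  # only importable inside a Delta Live Tables pipeline
except ImportError:
    dlt = None

if dlt is not None:
    settings = cdc_settings()
    # The target must exist before changes are applied to it.
    dlt.create_streaming_table(settings["target"])
    dlt.apply_changes(**settings)
```

With SCD type 1 the target holds only the latest version of each key; switching `stored_as_scd_type` to 2 would keep history instead.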

It would also be a good idea to file a support case with us so we can better understand your use case and suggest next steps: https://docs.databricks.com/en/resources/support.html

cat017
New Contributor III

Hi @Alberto_Umana ,

Thank you for the reply. I've already tried Auto Loader a few times, but it didn't work. I will try it again. But, as you suggest, it's better to file a case.
