Hi @cat017,
Here are a few recommendations:
Use Auto Loader with File Notification Mode: Instead of re-reading the entire CSV file each time, you can use Databricks Auto Loader in file notification mode. Auto Loader incrementally processes new data files as they arrive in cloud storage, and in file notification mode it subscribes to storage events (on AWS, via SQS) rather than repeatedly listing the directory, so new files are detected efficiently. Having the third party deliver each batch as a new file, rather than overwriting an existing one, also avoids the "underlying files have been updated" error.
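A minimal sketch of such a stream (the bucket paths, table name, and CSV options below are placeholders for your setup):

```python
# Auto Loader stream in file notification mode (AWS).
# Assumes the pipeline runs on Databricks, where `spark` is provided.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    # File notification mode: new files are discovered via cloud
    # storage events (SQS on AWS) instead of directory listing.
    .option("cloudFiles.useNotifications", "true")
    .option("header", "true")
    .load("s3://your-bucket/landing/csv/")  # placeholder path
)

(
    df.writeStream
    .option("checkpointLocation", "s3://your-bucket/checkpoints/csv_ingest/")  # placeholder
    .trigger(availableNow=True)
    .toTable("bronze.csv_ingest")  # placeholder target table
)
```

Auto Loader will set up and manage the SQS queue and S3 event notifications for you, given the appropriate IAM permissions.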
Implement Change Data Capture (CDC): If the third-party system supports it, consider implementing a Change Data Capture (CDC) mechanism. CDC captures only the changes (inserts, updates, deletes) made to the data, so your DLT pipeline processes just the new or changed rows instead of the entire file. Delta Live Tables provides the APPLY CHANGES API to simplify applying a CDC feed to a target table:
https://docs.databricks.com/en/delta-live-tables/cdc.html
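A sketch of what that could look like in a DLT pipeline (the source view, key, sequencing, and `op` column names are hypothetical and depend on what the third party's feed provides):

```python
import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES will maintain.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customers_cdc_feed",            # placeholder: streaming view of CDC rows
    keys=["customer_id"],                   # placeholder primary key column(s)
    sequence_by=col("event_ts"),            # ordering column to resolve out-of-order events
    apply_as_deletes=col("op") == "DELETE", # assumes the feed carries an 'op' column
    stored_as_scd_type=1,                   # keep only the latest version of each row
)
```

With SCD type 1 the target holds only the current state; switch to `stored_as_scd_type=2` if you need full change history.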
It would also be a good idea to file a case with us so we can better understand your use case and suggest next steps: https://docs.databricks.com/en/resources/support.html