duplicate files in delta table

New Contributor

I have been facing this issue for a long time, and so far I have not found a solution. I have a Delta table, and my bronze layer randomly picks up old files (mostly files that are about 8 days old). The source of the files is Azure Blob Storage.


Contributor III

Hey @jaimeperry12345 

I will need more information to point you in the right direction:

  1. Confirm the behavior: Double-check that your Delta table is indeed reading 8-day-old files randomly. Provide any logs or error messages you have regarding this.
  2. Expected behavior: Explain how the table should be functioning ideally. Are you expecting it to pick up the latest files only?
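
One way to confirm the behavior is to compare, for every ingested row, the source file's modification time with the time it was loaded. The sketch below is only an illustration in plain Python: it assumes you have already collected `(file_path, modified_at, ingested_at)` records from the bronze table (on Databricks these could come from columns such as `_metadata.file_path` and `_metadata.file_modification_time` captured at ingest, plus a load timestamp you add yourself). The function name, the one-day threshold, and the sample paths are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_suspect_ingestions(records, max_age=timedelta(days=1)):
    """Flag files that were ingested more than once or were old when ingested.

    records: iterable of (file_path, modified_at, ingested_at) tuples.
    Returns (duplicates, stale) where
      duplicates maps file_path -> ingestion count for paths seen more than once,
      stale lists (file_path, age_at_ingestion) for files older than max_age.
    """
    counts = defaultdict(int)
    stale = []
    for path, modified_at, ingested_at in records:
        counts[path] += 1
        age = ingested_at - modified_at
        if age > max_age:
            stale.append((path, age))
    duplicates = {path: n for path, n in counts.items() if n > 1}
    return duplicates, stale


# Hypothetical sample: a.json is loaded once on arrival and again 8 days later.
records = [
    ("container/in/a.json", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 5)),
    ("container/in/b.json", datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 0, 5)),
    ("container/in/a.json", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 9, 0, 5)),
]
duplicates, stale = find_suspect_ingestions(records)
print(duplicates)
print(stale)
```

If the same path shows up in `duplicates` with an unchanged modification time, the file was re-read rather than rewritten upstream, which points toward the checkpoint or the Auto Loader options rather than the source system.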

Based on the details you have shared so far, please check:

  1. Check file timestamps: Verify that the file timestamps on Azure Blob Storage accurately reflect when each file was actually created or last modified. Inconsistent timestamps can mislead Auto Loader.
  2. Review Auto Loader configuration: Ensure your Auto Loader configuration points to the correct directory, and review the options that control which files are ingested, such as cloudFiles.includeExistingFiles and cloudFiles.allowOverwrites.
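  3. Spark configuration: Make sure your Spark session configuration doesn't have any settings that might interfere with reading the latest files (e.g., caching). Also confirm that the stream's checkpoint location is not changed or cleared between runs, because a new or empty checkpoint makes previously ingested files look new.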
  4. Cluster Termination: If you're using a managed Databricks cluster, ensure it's not automatically terminating and restarting, as this can sometimes cause the autoloader to pick up older files.
  5. Logs and diagnostics: Analyze the streaming query logs and Spark driver logs for any clues about what might be causing the issue. There might be specific error messages or warnings related to Auto Loader.
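
The configuration and checkpoint points above can be sketched roughly as follows. This is not a definitive setup: the helper names, paths, and table name are hypothetical placeholders, and `spark` is assumed to be the session a Databricks notebook provides. The `cloudFiles.*` keys are Auto Loader options; to my understanding `cloudFiles.allowOverwrites` is off by default, and turning it on lets a modified file be ingested again, so it is worth checking its value in your job.

```python
def build_autoloader_options(file_format, schema_location,
                             include_existing_files=True,
                             allow_overwrites=False):
    """Collect Auto Loader options in one place so they are easy to review."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Whether files already in the directory are loaded on the first run.
        "cloudFiles.includeExistingFiles": str(include_existing_files).lower(),
        # If "true", a file that is modified after ingestion can be read again.
        "cloudFiles.allowOverwrites": str(allow_overwrites).lower(),
    }


def start_bronze_stream(spark, source_path, target_table,
                        checkpoint_location, options):
    """Start an incremental load from source_path into target_table.

    Keeping checkpoint_location fixed across runs is what lets the stream
    remember which files it has already processed.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_location)
        .trigger(availableNow=True)
        .toTable(target_table)
    )


# Hypothetical values, for illustration only.
options = build_autoloader_options(
    file_format="json",
    schema_location="/mnt/bronze/_schemas/events",
)
print(options)

# In a notebook, with hypothetical paths and table name:
# start_bronze_stream(spark, "/mnt/landing/events", "bronze.events",
#                     "/mnt/bronze/_checkpoints/events", options)
```

Printing the options before starting the stream makes it easy to spot a setting that differs from what you expect, and keeping the checkpoint path in one variable reduces the chance of it silently changing between deployments.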

Follow ups are appreciated! 

Leave a like if this helps! Kudos,