Data Engineering

DLT File-Level Deduplication

dgahram
New Contributor

I want to create a DLT pipeline that incrementally processes CSV files arriving daily. However, some of those files are duplicates: they have the same names and data but sit in different directories. What is the best way to handle this? I'm assuming that row-level deduplication would be inefficient, but I'm not sure whether file-level deduplication is possible with DLT streaming.

1 REPLY

K_Anudeep
Databricks Employee

Hello @dgahram ,

 

  • Auto Loader tracks ingestion progress by persisting discovered file metadata in a RocksDB store within the checkpoint, which provides “exactly-once” processing for discovered files (see the sketch after this list). Doc: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/

  • However, if the same content appears under different file paths (for example, duplicated into another directory), Auto Loader will still “discover” it as a new file and ingest it (because it is, from a file-discovery standpoint).
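
For reference, here is a minimal sketch of such a pipeline (the table name and landing path are hypothetical). Note that the exactly-once guarantee applies per file *path*, not per file *content*:

```python
import dlt

@dlt.table(comment="Raw CSV files ingested incrementally with Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # Auto Loader records each discovered file path in RocksDB state
        # inside the checkpoint, so a given path is ingested exactly once.
        # A copy of the same content under a different path is a new path
        # and will be ingested again.
        .load("/Volumes/main/landing/events/")  # hypothetical landing path
    )
```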

In DLT itself, there is also no option for file-level deduplication. The best practice is to de-duplicate the files before DLT ingestion. Inside the stream, file-level dedup is awkward because the data is already row-oriented after CSV parsing. A pre-ingestion sketch follows.
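
Here is a minimal pre-ingestion sketch, assuming the source directories are reachable through the /dbfs/ local mount; all directory names and helper functions are hypothetical. It hashes file contents, so a duplicate is caught even if it arrives under a different path, and copies only unseen files into a single staging directory that Auto Loader would then monitor:

```python
import hashlib
import os
import shutil

SOURCE_DIRS = ["/dbfs/landing/dir_a", "/dbfs/landing/dir_b"]  # hypothetical
STAGING_DIR = "/dbfs/landing/staging"                          # hypothetical

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash file contents so duplicates are detected regardless of path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def stage_unique_files() -> None:
    os.makedirs(STAGING_DIR, exist_ok=True)
    seen: set[str] = set()
    for src_dir in SOURCE_DIRS:
        for name in sorted(os.listdir(src_dir)):
            if not name.endswith(".csv"):
                continue
            src = os.path.join(src_dir, name)
            digest = file_sha256(src)
            if digest in seen:
                continue  # same content already staged from another directory
            seen.add(digest)
            shutil.copy2(src, os.path.join(STAGING_DIR, name))

stage_unique_files()
```

For daily incremental runs, the seen hashes would need to be persisted across runs (for example, in a small Delta table) so that a duplicate arriving on a later day is still skipped.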

Anudeep