Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT File Level Deduplication

dgahram
Visitor

I want to create a DLT pipeline that incrementally processes CSV files arriving daily. However, some of those files are duplicates: they have the same names and data but land in different directories. What is the best way to handle this? I'm assuming that row-level deduplication would be inefficient, but I'm not sure whether file-level deduplication is possible with DLT streaming.

1 REPLY

K_Anudeep
Databricks Employee

Hello @dgahram ,

 

  • Auto Loader tracks ingestion progress by persisting discovered file metadata in a RocksDB store within the checkpoint, which provides "exactly-once" processing for discovered files. Doc: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/

  • However, if the same content appears under different file paths (for example, duplicated into another directory), Auto Loader will still "discover" it as a new file and ingest it, because from a file-discovery standpoint it is a new file (see the sketch after this list).
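
For illustration, here is a minimal ingestion sketch. The landing path, table name, and column alias are hypothetical placeholders, not from the original post; the point is that Auto Loader keys its checkpoint on the file path, so the same file name under two directories is ingested twice:

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="daily_csv_bronze")
def daily_csv_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # Auto Loader records every discovered *path* in the checkpoint's
        # RocksDB store, so the same file name under two directories is
        # treated as two distinct files and ingested twice.
        .load("/Volumes/main/raw/daily_csv/")  # hypothetical landing path
        # _metadata exposes the source file name per row, which helps with
        # any downstream file-level deduplication.
        .select("*", col("_metadata.file_name").alias("source_file_name"))
    )
```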

DLT also has no built-in option for file-level deduplication. The best practice is to de-duplicate files before DLT ingestion; one possible pre-ingestion approach is sketched below. Inside the stream itself, file-level dedup is awkward because the data is already row-oriented after CSV parsing.
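
As a hedged sketch of that pre-ingestion approach (all paths are hypothetical, and it assumes the file name uniquely identifies the content), a job step run before the pipeline update could copy only the first occurrence of each file name into a single staging directory that Auto Loader watches:

```python
# Run as a job step before the DLT pipeline update.
LANDING_DIRS = ["/Volumes/main/raw/site_a/", "/Volumes/main/raw/site_b/"]  # hypothetical
STAGING_DIR = "/Volumes/main/raw/staging/"                                 # hypothetical

# File names already staged; the pipeline's Auto Loader source reads only STAGING_DIR.
seen = {f.name for f in dbutils.fs.ls(STAGING_DIR)}

for landing in LANDING_DIRS:
    for f in dbutils.fs.ls(landing):
        if f.name not in seen:
            # First occurrence of this file name: stage it for ingestion.
            dbutils.fs.cp(f.path, STAGING_DIR + f.name)
            seen.add(f.name)
        # A duplicate name arriving in another directory is simply skipped.
```

Listing the staging directory on every run keeps this simple; for very large file counts, persisting the seen file names in a small Delta table would scale better.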

Anudeep