11-12-2024 09:43 PM
Hi,
I'm importing large numbers of Parquet files (ca. 5,200 files per day, each landing in a separate folder) into Azure Data Lake Storage (ADLS).
I have a DLT streaming table reading from the root folder.
I noticed a massive spike in storage account costs due to file system reads.
Questions: How does DLT identify newly arriving files? Does it always have to monitor the entire folder including all historical files?
Are there any design patterns to resolve this (e.g. regarding folder structure or archiving of processed files)?
Many thanks for your help!
Accepted Solutions
11-12-2024 10:02 PM
Please refer to the Auto Loader documentation for details: https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/ You can use Auto Loader in DLT to detect new files. The documentation also describes the file name patterns that work with Auto Loader.
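For illustration, here is a minimal sketch of an Auto Loader source inside a DLT pipeline. It is not taken from the original poster's pipeline: the table name `raw_events`, the helper names, and the placeholder path are assumptions. By default Auto Loader discovers files by listing the input directory, which is the kind of activity that shows up as storage read/list operations. Setting `cloudFiles.useNotifications` switches to file notification mode, where new files are announced through storage events instead of repeated listings. On Azure this mode needs additional permissions or credentials options that are omitted here.

```python
def auto_loader_options(use_notifications=True):
    """Build Auto Loader options for a Parquet source (illustrative helper)."""
    opts = {"cloudFiles.format": "parquet"}
    if use_notifications:
        # File notification mode: avoids repeatedly listing the whole tree.
        # Assumption: the required Azure permissions/credentials are set up.
        opts["cloudFiles.useNotifications"] = "true"
    return opts


def define_pipeline(source_path):
    """Register a streaming table; call this only inside a DLT pipeline."""
    import dlt  # available only in a Databricks DLT pipeline

    @dlt.table(name="raw_events")  # hypothetical table name
    def raw_events():
        # `spark` is the session Databricks provides in pipeline notebooks.
        return (
            spark.readStream.format("cloudFiles")
            .options(**auto_loader_options())
            .load(source_path)
        )


# Usage inside a pipeline notebook (placeholder path, fill in your own):
# define_pipeline("abfss://<container>@<account>.dfs.core.windows.net/<root>")
```

Whether notification mode actually lowers the bill depends on the setup, so it is worth comparing storage transaction metrics before and after the change.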
11-20-2024 02:30 AM
To resolve the issue of excessive directory scanning, I changed the folder structure to separate historical files from current files, which reduces the number of folders and files that the Databricks process has to monitor.
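A rough sketch of how such a split could be planned is shown below. The exact layout used is not described in this thread, so everything here is hypothetical: the `landing/current` and `landing/archive` roots, the two-day retention window, and the function name. The helper only computes which folders should move. The actual move would be done with whatever storage tooling you use, and only after the files have been ingested.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath


def plan_archive_moves(folders, today, keep_days=2,
                       current_root="landing/current",
                       archive_root="landing/archive"):
    """Return (source, destination) pairs for folders older than keep_days.

    folders: iterable of (folder_name, landed_on) where landed_on is a date.
    All roots and the retention window are illustrative assumptions.
    """
    cutoff = today - timedelta(days=keep_days)
    moves = []
    for name, landed_on in folders:
        if landed_on < cutoff:
            src = PurePosixPath(current_root) / name
            # Group archived folders by landing date: archive/YYYY/MM/DD/name
            dst = (PurePosixPath(archive_root)
                   / landed_on.strftime("%Y/%m/%d") / name)
            moves.append((str(src), str(dst)))
    return moves


if __name__ == "__main__":
    sample = [("batch_0001", date(2024, 11, 10)),
              ("batch_0002", date(2024, 11, 19))]
    for src, dst in plan_archive_moves(sample, today=date(2024, 11, 20)):
        print(src, "->", dst)
```

With the streaming table pointed only at the current root, the monitored tree stays small no matter how much history accumulates in the archive.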

