Switching to Auto Loader
11-25-2024 10:22 AM
I have an S3 bucket with data being written into it continuously. My script reads these files, parses them, and appends the result to a Delta table.
The data goes back to 2022, with millions of files stored in partitions based on year/month/dayOfMonth/hourOfDay.
Until now, I have been using the previous day as a filter to read and process the data. However, I now want to switch to incremental batch streaming using Auto Loader in directory listing mode. How do I switch to it without needing to scan the entire S3 bucket to create the initial checkpoint?
Labels: Delta Lake, Spark, Workflows
11-25-2024 12:37 PM - edited 11-25-2024 12:39 PM
Hi @hk-modi,
If I understand correctly, you have an existing Delta table with lots of data already processed, and you want to switch to Auto Loader: read files, parse them, and process the data incrementally into that Delta table as the sink. The goal is to start processing only newly arrived files without reprocessing all of the historical data.
If so, there are some options you can leverage; they are described in the Auto Loader docs.
These look promising for your scenario:
cloudFiles.includeExistingFiles
Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect.
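For example, here is a minimal sketch in PySpark (run on Databricks, where `spark` is predefined). The bucket path, file format, checkpoint location, and target table name are placeholders, not taken from your setup:

```python
# Minimal sketch, assuming a JSON source and hypothetical paths/table names.
# includeExistingFiles=false tells Auto Loader to ignore files already present
# in the input path at the first stream start, so no backfill of the 2022+
# history is attempted.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                 # assumed file format
    .option("cloudFiles.includeExistingFiles", "false")  # skip historical files
    .load("s3://my-bucket/events/")                      # hypothetical path
)

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")  # hypothetical
    .trigger(availableNow=True)  # incremental batch: process available files, then stop
    .toTable("main.default.events_delta")                # hypothetical target table
)
```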
modifiedAfter
Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0
An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.
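And a sketch of the same read using modifiedAfter instead, if you want a precise cutoff (the timestamp below is a hypothetical value, e.g. the last day your old batch job already processed):

```python
# Alternative sketch: keep the default includeExistingFiles=true, but use
# modifiedAfter so only files with a modification timestamp after the cutoff
# are ingested. Cutoff, path, and format are placeholders.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                          # assumed file format
    .option("modifiedAfter", "2024-11-24 00:00:00.000000 UTC+0")  # hypothetical cutoff
    .load("s3://my-bucket/events/")                               # hypothetical path
)
```

In both cases, the checkpoint records which files have been processed, so subsequent runs only pick up new arrivals.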
Best,
Radek

