Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Switching to autoloader

hk-modi
New Contributor

I have an S3 bucket that has continuous data being written into it. My script reads these files, parses them, and appends them to a Delta table.

The data goes back to 2022, with millions of files stored in partitions based on year/month/dayOfMonth/hourOfDay.

Up until now, I have been using the previous day as a filter to read and process the data. However, I now want to switch to incremental batch processing with Auto Loader in directory listing mode. How do I switch to it without having to list and parse the entire S3 bucket to create the initial checkpoint?

1 REPLY

radothede
Contributor II

Hi @hk-modi,

If I understand correctly, you have an existing Delta table with a lot of data already processed. You want to switch to Auto Loader, read and parse the files, and process data incrementally into that Delta table as the sink. The goal is to start processing only newly arrived files, without reprocessing all of the historical data.

If so, there are a couple of options you can leverage; they are described in the Auto Loader docs.

These look promising for your scenario (see the sketch after the option descriptions):

cloudFiles.includeExistingFiles

Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect.

modifiedAfter

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.
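
Here is a minimal sketch of how the switch could look (PySpark on Databricks, where spark is the session provided by the runtime). The bucket path, schema/checkpoint locations, table name, and file format are placeholders I made up for illustration, so adapt them to your job and keep your current parsing logic where the comment indicates:

# Minimal sketch - bucket path, schema/checkpoint locations, table name,
# and file format are placeholders; adapt them to your existing job.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                  # format of your source files
    .option("cloudFiles.includeExistingFiles", "false")   # skip the historical backlog on the first run
    # Alternatively, keep includeExistingFiles = true and bound the backfill instead:
    # .option("modifiedAfter", "2025-01-01 00:00:00.000000 UTC+0")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")  # required if you don't pass a schema
    .load("s3://my-bucket/events/")
)

# ... apply the same parsing/transformations your daily batch script does ...

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)      # incremental batch: process what is new, then stop
    .toTable("main.default.events")  # appends to your existing Delta table
)

Keep in mind that cloudFiles.includeExistingFiles is only evaluated on the very first run against a fresh checkpoint, so set it before the stream is ever started. The availableNow trigger gives you the incremental-batch behaviour you described: each run picks up only the files that arrived since the last checkpoint and then stops, so it can keep running on your existing daily schedule.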

 

Best,

Radek
