Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Switching to autoloader

hk-modi
New Contributor

I have an S3 bucket that has continuous data being written into it. My script reads these files, parses them, and appends them to a Delta table.

The data goes back to 2022, with millions of files stored in partitions based on year/month/dayOfMonth/hourOfDay.

Up until now, I have been using the previous day as a filter to read and process the data. However, I now want to switch to incremental batch processing with Auto Loader in directory listing mode. How do I switch to it without having to list and parse the entire S3 bucket to create the initial checkpoint?

1 REPLY

radothede
Contributor II

Hi @hk-modi,

If I understand correctly, you have an existing Delta table with a lot of data already processed. You want to switch to Auto Loader, read and parse the files, and process data incrementally into that Delta table as the sink. The goal is to start processing only newly arrived files, without reprocessing all of the historical data.

If so, there are a couple of options you can leverage; they are described in the Auto Loader docs.

These look promising for your scenario (see the sketch after the option descriptions):

cloudFiles.includeExistingFiles

Whether to include existing files in the stream processing input path or to only process new files arriving after initial setup. This option is evaluated only when you start a stream for the first time. Changing this option after restarting the stream has no effect.

modifiedAfter

Type: Timestamp String, for example, 2021-01-01 00:00:00.000000 UTC+0

An optional timestamp to ingest files that have a modification timestamp after the provided timestamp.
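
Here is a minimal sketch of how the switch could look (PySpark on Databricks, where spark is the session provided by the runtime). The bucket path, schema/checkpoint locations, table name, and file format are placeholders I made up for illustration, so adapt them to your job and keep your current parsing logic where the comment indicates:

# Minimal sketch - bucket path, schema/checkpoint locations, table name,
# and file format are placeholders; adapt them to your existing job.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                  # format of your source files
    .option("cloudFiles.includeExistingFiles", "false")   # skip the historical backlog on the first run
    # Alternatively, keep includeExistingFiles = true and bound the backfill instead:
    # .option("modifiedAfter", "2025-01-01 00:00:00.000000 UTC+0")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")  # required if you don't pass a schema
    .load("s3://my-bucket/events/")
)

# ... apply the same parsing/transformations your daily batch script does ...

(
    df.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
    .trigger(availableNow=True)      # incremental batch: process what is new, then stop
    .toTable("main.default.events")  # appends to your existing Delta table
)

Keep in mind that cloudFiles.includeExistingFiles is only evaluated on the very first run against a fresh checkpoint, so set it before the stream is ever started. The availableNow trigger gives you the incremental-batch behaviour you described: each run picks up only the files that arrived since the last checkpoint and then stops, so it can keep running on your existing daily schedule.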

 

Best,

Radek
