
How to get autoloader to load files in order

159312
New Contributor III

I'm new to Spark and Databricks, and I'm trying to write a pipeline that ingests CDC data from a Postgres database stored in S3. The file names are numerically ascending unique IDs based on datetime (e.g., 20220630-215325970.csv). Right now Auto Loader seems to fetch all files at the source in random order. This means that updates to rows in the DB may not be applied in the correct order.

I have attached a screenshot with an example. Updates 1, 2, and 3 were entered sequentially after all the other displayed records, but they appear in the DataFrame in the order shown in the screenshot.

I've tried using latestFirst to see if I can get the files processed in a predictable order, but that option doesn't seem to have any effect.

Is there a way to load and write files in order by file name using Auto Loader?

Thanks,

Ben

1 ACCEPTED SOLUTION


Noopur_Nigam
Databricks Employee

Hi @Ben Bogart, for files whose names are generated in lexicographic order, Auto Loader can leverage the lexical file ordering and optimized listing APIs to list new files more efficiently. For more information on lexical ordering, please see the link below: https://docs.databricks.com/ingestion/auto-loader/file-detection-modes.html#lexical-ordering-of-file...

Since Spark is a distributed system, no ordering beyond the above is guaranteed.
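
Because the read order is not guaranteed, one common workaround is to carry the source file name along with each record and resolve the order downstream. The sketch below illustrates that idea in plain Python rather than Spark, so it is not an Auto Loader feature and not necessarily what the original poster ended up doing. It assumes each record has been tagged with its file name (in Spark this could come from something like `input_file_name()`), and that the zero-padded datetime file names sort lexically in chronological order. The field names `source_file`, `line_no`, `id`, `op`, and `value` are hypothetical, as are all file names except the one quoted in the question.

```python
from typing import Dict, Iterable


def latest_change_per_key(records: Iterable[dict]) -> Dict[int, dict]:
    """Replay CDC records in file-name order and keep the last change per row id."""
    latest: Dict[int, dict] = {}
    # Zero-padded datetime file names sort lexically in chronological order,
    # so sorting on (file name, line number within the file) restores the
    # order in which the changes were written.
    for rec in sorted(records, key=lambda r: (r["source_file"], r["line_no"])):
        latest[rec["id"]] = rec  # later changes overwrite earlier ones
    return latest


# Records as they might arrive in a single micro-batch, in arbitrary order.
# All field names and most file names here are made up for illustration.
batch = [
    {"source_file": "20220630-215325970.csv", "line_no": 0, "id": 1, "op": "U", "value": "update 3"},
    {"source_file": "20220630-215301120.csv", "line_no": 0, "id": 1, "op": "U", "value": "update 1"},
    {"source_file": "20220630-215310450.csv", "line_no": 0, "id": 1, "op": "U", "value": "update 2"},
    {"source_file": "20220630-215301120.csv", "line_no": 1, "id": 2, "op": "I", "value": "insert"},
]

result = latest_change_per_key(batch)
print(result[1]["value"])  # update 3
```

In a Spark pipeline the same logic would typically be expressed inside a `foreachBatch` function, for example with a window function over the key ordered by file name, followed by a merge into the target table. Note that this only orders records within one batch: a later batch can still contain an older file, so the merge condition would also need to compare the incoming file name (or a change timestamp) against the one already stored for that row.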

