Databricks Community

159312 · 06-30-2022

I'm new to Spark and Databricks, and I'm trying to write a pipeline that takes CDC data from a Postgres database, stored in S3, and ingests it. The file names are numerically ascending unique IDs based on datetime (e.g., 20220630-215325970.csv). Right now autoloader seems to fetch all files at the source in random order. This means that updates to rows in the DB may not be applied in the correct order.

I have attached a screenshot with an example. Updates 1, 2, and 3 were entered sequentially after all other displayed records, but they appear in the df in that order.

I've tried using latestFirst to see if I can get the files processed in a predictable order but that option doesn't seem to have any effect.

Is there a way to load and write files in order by filename using autoloader?

Thanks,

Ben

Noopur_Nigam · 07-25-2022

Hi @Ben Bogart, for files whose names are generated in lexicographic order, Auto Loader can leverage lexical file ordering and optimized listing APIs. For more information on lexical ordering, please see the link below: https://docs.databricks.com/ingestion/auto-loader/file-detection-modes.html#lexical-ordering-of-file...

Since Spark is a distributed system, no ordering other than the one described above is guaranteed.
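As a rough illustration of the two points above, here is a minimal plain-Python sketch. It does not call Auto Loader or Spark. It only shows that fixed-width, timestamp-based file names sort chronologically under ordinary string comparison, and that the order of changes can be recovered from the file name even when records arrive shuffled. The record layout (`id`, `value`, `source_file`), the helper names, and every file name except 20220630-215325970.csv are hypothetical. It also assumes each file carries at most one change per key.

```python
from typing import Dict, Iterable, List, TypedDict


class ChangeRecord(TypedDict):
    id: int           # primary key of the changed row (hypothetical column)
    value: str        # value carried by the change (hypothetical column)
    source_file: str  # name of the CDC file the record came from


def files_in_commit_order(names: Iterable[str]) -> List[str]:
    # Zero-padded, fixed-width timestamp names compare chronologically
    # when compared as plain strings, so a lexical sort is enough.
    return sorted(names)


def latest_change_per_key(records: Iterable[ChangeRecord]) -> Dict[int, ChangeRecord]:
    # Keep, for every key, the record that came from the lexically greatest
    # (and therefore most recent) file, regardless of arrival order.
    latest: Dict[int, ChangeRecord] = {}
    for rec in records:
        current = latest.get(rec["id"])
        if current is None or rec["source_file"] > current["source_file"]:
            latest[rec["id"]] = rec
    return latest


# Records deliberately listed out of order, as they might arrive in a batch.
batch: List[ChangeRecord] = [
    {"id": 1, "value": "update 3", "source_file": "20220630-215325970.csv"},
    {"id": 1, "value": "update 1", "source_file": "20220630-215301112.csv"},
    {"id": 2, "value": "insert", "source_file": "20220630-215240005.csv"},
    {"id": 1, "value": "update 2", "source_file": "20220630-215312480.csv"},
]

ordered_files = files_in_commit_order(r["source_file"] for r in batch)
final_state = latest_change_per_key(batch)

print(ordered_files)
print(final_state[1]["value"])
```

In a streaming job, the same idea would typically be applied per micro-batch (for example inside a `foreachBatch` function) with the source file name carried along as a column, so that correctness does not depend on the order in which files happen to be read.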

How to get autoloader to load files in order