I'm new to Spark and Databricks, and I'm trying to write a pipeline that ingests CDC data from a Postgres database that lands in S3. The file names are numerically ascending unique IDs based on datetime (e.g. `20220630-215325970.csv`). Right now Auto Loader seems to fetch the files at the source in random order, which means updates to rows in the DB may not be applied in the correct order.
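For what it's worth, since the names are zero-padded, most-significant-digit-first timestamps, a plain lexicographic sort of the file names already gives chronological order (the names below are made up for illustration):

```python
# Timestamp-based CDC file names: zero-padded, so lexicographic
# order == chronological order.
names = [
    "20220630-215325970.csv",
    "20220629-101500123.csv",
    "20220630-215326001.csv",
]
print(sorted(names))
# ['20220629-101500123.csv', '20220630-215325970.csv', '20220630-215326001.csv']
```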
I have attached a screenshot with an example. Updates 1, 2, and 3 were entered sequentially after all the other displayed records, but they don't appear in the df in that order.
I've tried setting `latestFirst` to see if I can get the files processed in a predictable order, but that option doesn't seem to have any effect.
Is there a way to have Auto Loader load and write the files in order by filename?
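To make the concern concrete, here's a toy sketch in plain Python (not Auto Loader itself; file names and values are made up) of how out-of-order processing corrupts an upsert target:

```python
# Each hypothetical CDC "file" carries an update to the row with id 1.
files = {
    "20220630-215325970.csv": {"id": 1, "name": "update 1"},
    "20220630-215326100.csv": {"id": 1, "name": "update 2"},
    "20220630-215326200.csv": {"id": 1, "name": "update 3"},
}

def apply_updates(order):
    table = {}
    for fname in order:
        row = files[fname]
        table[row["id"]] = row["name"]  # upsert by primary key
    return table

# Processing in file-name (i.e. chronological) order keeps the latest value...
in_order = apply_updates(sorted(files))
print(in_order)   # {1: 'update 3'}

# ...while a shuffled order leaves a stale value behind.
shuffled = apply_updates([
    "20220630-215326200.csv",
    "20220630-215325970.csv",
    "20220630-215326100.csv",
])
print(shuffled)   # {1: 'update 2'}
```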
Thanks,
Ben