Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

need to ingest millions of csv files from aws s3

Kumarashokjmu
New Contributor II

I need to ingest millions of CSV files from an AWS S3 bucket. I am running into AWS S3 throttling, and the notebook process runs for 8+ hours and sometimes fails. Looking at cluster performance, utilization is only around 60%.

I am looking for suggestions on how to avoid AWS throttling, what the source file size should be if I combine small files into bigger ones for processing, how to speed up ingestion, and whether any other Spark parameters need tuning.

Thanks in advance.

Ash

4 REPLIES

Thank you so much Kaniz, I really appreciate your detailed response on each topic. I will post more over time to get your help.

Ashok

 

jose_gonzalez
Databricks Employee

Hi @Kumarashokjmu,

I would recommend using Databricks Auto Loader to ingest your CSV files incrementally. You can find examples and more details here: https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader
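For example, a minimal Auto Loader sketch in Python; the bucket paths, schema/checkpoint locations, and target table name are placeholders you would replace with your own:

# Minimal Auto Loader sketch for incremental CSV ingestion from S3.
# `spark` is the SparkSession predefined in a Databricks notebook.
# Paths and the target table below are placeholders, not your actual layout.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/ingest")  # tracks the inferred schema
    .option("header", "true")
    .load("s3://my-bucket/raw/csv/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/ingest")     # enables incremental, exactly-once ingestion
    .trigger(availableNow=True)                                             # process the current backlog, then stop
    .toTable("bronze.raw_csv"))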

jose_gonzalez
Databricks Employee

Hi @Kumarashokjmu,

Just a friendly follow-up: did you have time to test Auto Loader? Do you have any follow-up questions? Please let us know.

kulkpd
Contributor

If you want to load all the data at once, use Auto Loader or a DLT pipeline with directory listing, provided the files are lexically ordered.

OR
If you want to perform an incremental load, divide it into two jobs: a historic data load and a live data load.
Live data:
Use Auto Loader or a Delta Live Tables pipeline with file notification to load the data into a Delta table. File notification is scalable and is the solution recommended by Databricks (a minimal sketch follows the link below).
https://docs.databricks.com/en/ingestion/auto-loader/options.html#directory-listing-options
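A minimal sketch of the file-notification variant, assuming the SQS/SNS permissions described in the docs are in place; paths and the table name are placeholders:

# Same Auto Loader stream, but file discovery comes from S3 event notifications
# instead of directory listing, which scales better at millions of files.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.useNotifications", "true")                        # switch to file-notification mode
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/live")
    .load("s3://my-bucket/raw/csv/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/live")
    .toTable("bronze.raw_csv_live"))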

Historic load:
Use an Auto Loader job to load all the data. If the files are not lexically ordered, try using the S3 inventory option to divide the workload into micro-batches; with this approach multiple batches can be executed in parallel (see the sketch below).
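A hedged sketch of the S3-inventory idea: read the inventory report, derive prefixes (here, day-level folders), and load each prefix as its own batch so several can run in parallel from separate jobs. The inventory location, key layout, and table name are all assumptions:

from pyspark.sql import functions as F

# Read the S3 inventory report (assumed CSV format and location).
inventory = (spark.read.csv("s3://my-bucket/inventory/latest/")
    .toDF("bucket", "key", "size", "last_modified"))

# Derive one prefix per day, assuming keys look like raw/csv/YYYY/MM/DD/file.csv.
prefixes = [r.prefix for r in
            inventory.select(F.regexp_extract("key", r"^(raw/csv/\d{4}/\d{2}/\d{2})", 1).alias("prefix"))
                     .where("prefix != ''")
                     .distinct()
                     .collect()]

# Each prefix becomes one micro-batch; the loop can be split across parallel jobs.
for prefix in prefixes:
    (spark.read.option("header", "true").csv(f"s3://my-bucket/{prefix}/")
        .write.mode("append").saveAsTable("bronze.raw_csv_history"))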

Handle S3 throttling issues:
If you are facing S3 throttling, try limiting maxFilesPerTrigger to 10k-15k.
Increase the spark.network.timeout configuration in the cluster's Spark config / init block (both illustrated below).
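A sketch of those two knobs with illustrative values only; note that spark.network.timeout generally belongs in the cluster's Spark config (it is read at cluster start), not set from the notebook:

# In the cluster's Spark config (or init script), per the suggestion above:
#   spark.network.timeout 600s

# Cap how many files Auto Loader picks up per micro-batch to reduce S3 request pressure.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.maxFilesPerTrigger", "10000")   # 10k-15k per the reply; tune as needed
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/ingest")
    .load("s3://my-bucket/raw/csv/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/ingest")
    .toTable("bronze.raw_csv"))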

Let us know if you need more information.

 
