need to ingest millions of csv files from aws s3

Kumarashokjmu
New Contributor II

I need to ingest millions of CSV files from an AWS S3 bucket. I am running into S3 throttling, and on top of that the notebook process runs for 8+ hours and sometimes fails. Looking at cluster performance, it is only about 60% utilized.

I need suggestions on avoiding throttling by AWS, what the source file size should be if I have to combine small files into bigger ones for processing, how to speed up ingestion, and whether any other Spark parameters need tuning.

Thanks in advance.

Ash

5 REPLIES

Kaniz
Community Manager

Hi @Kumarashokjmu, Certainly! Let’s address each part of your query:

 

Avoiding AWS S3 Throttling:

  • Throttling in Amazon S3 is not specific to Availability Zones (AZs); it applies to the entire bucket. When you encounter throttling, Amazon S3 may return “503 Slow Down” errors while it scales to handle the request rate.
  • To minimize throttling:
    • Verify Prefixes: Ensure that the number of unique prefixes in your bucket supports your required transactions per second (TPS). Evenly distribute objects and requests across these prefixes.
    • Exponential Backoff: Implement exponential backoff with retries (see the sketch after this list). If your application is sensitive to performance, consider handing off uploads to a background process that can be retried later.
  • Note that throttling is not mitigated by moving workloads between AZs, since they share the same endpoints.
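
For example, if the file listing or upload is scripted with boto3, its built-in retry modes already implement exponential backoff; a minimal sketch (the boto3 usage, bucket name, prefix, and retry limits are illustrative, not from this thread):

import boto3
from botocore.config import Config

# "adaptive" mode retries throttled requests with exponential backoff
# and adds client-side rate limiting on top.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})
s3 = boto3.client("s3", config=retry_config)

# Placeholder bucket and prefix: list one page of objects
response = s3.list_objects_v2(Bucket="your-bucket", Prefix="raw/csv/")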

Combining Small Files:

  • Combining many small files into larger ones can improve performance.
  • Here are some approaches:
    • Tar or Zip: Use tar or zip to bundle files together. For example: tar czf - * | aws s3 cp - s3://your-bucket/archive.tar.gz
    • Parallel Compression: Combine compression with the upload in a pipeline: tar cf - * | gzip -c | aws s3 cp - s3://your-bucket/archive.tar.gz
    • CPIO: Consider using cpio for faster and smaller archives: ls | cpio -o | gzip -c | aws s3 cp - s3://your-bucket/archive.cpio.gz
  • Adjust the approach based on your specific use case and requirements; a Spark-based compaction alternative is sketched below.
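
The compaction can also be done in Spark itself, reading the small CSVs and writing out fewer, larger files; a rough sketch (the paths and partition count are placeholders, aimed at output files in the roughly 128 MB-1 GB range):

# Read the many small CSVs (placeholder path)
df = (spark.read
      .option("header", "true")
      .csv("s3://your-bucket/raw/csv/"))

# Repartition so each output file is large; the count is illustrative
# and should be derived from total data volume / target file size.
(df.repartition(200)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv("s3://your-bucket/compacted/csv/"))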

Spark Parameter Tuning:
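
A few settings commonly adjusted when reading very many small files (the values below are illustrative starting points, not prescriptions from this reply):

# Pack more data (and more small files) into each input partition
spark.conf.set("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)

# Per-file "open cost" counted toward the partition size above;
# lowering it lets more tiny files be grouped into one task.
spark.conf.set("spark.sql.files.openCostInBytes", 1 * 1024 * 1024)

# Roughly match the cluster's total number of cores
spark.conf.set("spark.sql.shuffle.partitions", 400)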

By following these steps, you can improve performance, reduce costs, and enhance the efficiency of your Spark jobs. 🚀

Kumarashokjmu
New Contributor II

Thank you so much Kaniz, I really appreciate your detailed response on each topic. I will post more over time to get help from you.

Ashok

 

jose_gonzalez
Moderator

Hi @Kumarashokjmu,

I would recommend using Databricks Auto Loader to ingest your CSV files incrementally. You can find examples and more details here: https://docs.databricks.com/en/ingestion/auto-loader/index.html#what-is-auto-loader
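
A minimal Auto Loader sketch for CSV (the paths, schema location, and target table are placeholders; trigger(availableNow=True) assumes a recent Databricks Runtime):

# Incrementally discover and read new CSV files from S3
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", "s3://your-bucket/_schemas/csv_ingest")
      .option("header", "true")
      .load("s3://your-bucket/raw/csv/"))

# Process the current backlog in batches, then stop
(df.writeStream
   .option("checkpointLocation", "s3://your-bucket/_checkpoints/csv_ingest")
   .trigger(availableNow=True)
   .toTable("bronze.csv_ingest"))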

jose_gonzalez
Moderator

Hi @Kumarashokjmu,

Just a friendly follow-up. Did you have time to test Auto Loader? Do you have any follow-up questions? Please let us know.

kulkpd
Contributor

If you want to load all the data at once, use Auto Loader or a DLT pipeline with directory listing, provided the files are lexically ordered.

OR
If you want to perform an incremental load, divide it into two jobs, a historic data load and a live data load:
Live data:
Use Auto Loader or a Delta Live Tables pipeline with file notification to load the data into a Delta table (see the sketch below). File notification is the scalable solution recommended by Databricks.
https://docs.databricks.com/en/ingestion/auto-loader/options.html#directory-listing-options
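
A sketch of the file-notification variant (placeholder paths; cloudFiles.useNotifications tells Auto Loader to discover files via S3 event notifications instead of directory listing):

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.useNotifications", "true")   # SQS/SNS-based file discovery
      .option("cloudFiles.schemaLocation", "s3://your-bucket/_schemas/csv_live")
      .load("s3://your-bucket/raw/csv/"))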

Historic load:
Use an Auto Loader job to load all the data. If the files are not lexically ordered, try using the S3 inventory option to divide the workload into micro-batches; with this approach, multiple batches can be executed in parallel.

Handling S3 throttling issues:
If you are facing S3 throttling, try limiting maxFilesPerTrigger to 10k-15k (see the sketch below) and increase the spark.network.timeout configuration in the Spark init/cluster configuration.
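
For example, with Auto Loader the cap is spelled cloudFiles.maxFilesPerTrigger (the limit and paths here are illustrative):

# spark.network.timeout is best set in the cluster's Spark config or init script,
# e.g.:  spark.network.timeout 600s

df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.maxFilesPerTrigger", 10000)   # 10k-15k as suggested above
      .option("cloudFiles.schemaLocation", "s3://your-bucket/_schemas/csv_ingest")
      .load("s3://your-bucket/raw/csv/"))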

Let us know if you need more information.

 
