10-09-2025 02:49 AM
Hi @bunny1174 ,
You have 4-5 million files in S3 and their total size is 1.5 GB, which clearly indicates a small files problem. You need to compact those files into larger ones. There is no way your pipeline will be performant with that many files when each one is only a few hundred bytes to a couple of kilobytes.
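To put rough numbers on it, here is a minimal sketch of the arithmetic. The 4.5 million file count is just the midpoint of your range, and the 128 MB target file size is my assumption (a commonly used ballpark, not a value from your setup). The function names, the paths and the formats in the trailing comment are hypothetical too.

```python
import math

def average_file_size(total_bytes: int, file_count: int) -> float:
    """Average size of one file in bytes."""
    return total_bytes / file_count

def target_file_count(total_bytes: int, target_file_bytes: int = 128 * 1024**2) -> int:
    """Number of output files needed so each is about target_file_bytes.

    The 128 MB default is an assumed target, tune it for your workload.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))

total_bytes = int(1.5 * 1024**3)   # 1.5 GB in total
file_count = 4_500_000             # assumed midpoint of 4-5 million files

avg = average_file_size(total_bytes, file_count)
n_out = target_file_count(total_bytes)

print(f"average file size: {avg:.0f} bytes")
print(f"output files at ~128 MB each: {n_out}")

# With PySpark, a one-off compaction job could then look roughly like this
# (hypothetical paths; JSON input and Parquet output are assumptions):
#   df = spark.read.json("s3://your-bucket/raw/")
#   df.repartition(n_out).write.mode("overwrite").parquet("s3://your-bucket/compacted/")
```

So the same data would fit in about a dozen files instead of millions, and downstream jobs would no longer spend most of their time listing and opening objects.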
You can read more about this problem in general in the following articles:
Breaking the Big Data Bottleneck: Solving Spark’s “Small Files” Problem
Tackling the Small Files Problem in Apache Spark | by Henkel Data & Analytics | Henkel Data & Analyt...
Spark Small Files Problem: Optimizing Data Processing