I need to ingest millions of CSV files from an AWS S3 bucket. I am running into AWS S3 throttling, and on top of that the notebook process runs for 8+ hours and sometimes fails. Looking at cluster performance, utilization sits at around 60%.
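For context, the read in the notebook is roughly along the lines of the sketch below (bucket path, schema columns, and the destination table are simplified placeholders, not the real names):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Explicit schema so Spark does not have to list and sample the files again for inference
schema = StructType([
    StructField("id", StringType()),      # placeholder column
    StructField("value", DoubleType()),   # placeholder column
])

df = (
    spark.read
    .option("header", "true")
    .schema(schema)
    .csv("s3://my-bucket/raw/csv/")       # prefix holding the millions of small CSV files
)

# Destination simplified; the real job writes to a managed table after some light transforms
df.write.mode("overwrite").format("delta").saveAsTable("raw_events")
```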
I'm looking for suggestions on how to avoid the AWS throttling, what target file size I should aim for if I need to combine the small files into bigger ones for processing, how to speed up the ingestion, and which other Spark parameters need tuning.
Thanks in advance.
Ash