hari
Contributor

I am not clear on why concurrency would be affected by the filesystem. It seems strange, since we can produce the same volume of writes to the filesystem with fewer than 1k concurrent jobs (by simply increasing the number of worker nodes or cores). So if the concurrency limit is due to a filesystem limitation, it should vary with the worker node configuration.

I understand that Spark is meant to work with large amounts of data split across workers. Sorry, I might not have been clear about our use case. We actually have use cases where the task to be performed varies with each customer.

Our pipeline formats customer data into a single unified format. After this stage, we can process the entire dataset with a single job. But to get to that stage, we need to process the raw data from each customer differently.
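
To illustrate the shape of this, here is a minimal PySpark sketch. The paths, column names, and `normalize_customer_*` functions are made up for illustration; the point is that stage 1 is a bespoke transform per customer, and stage 2 is one job over the unified result:

```python
from functools import reduce
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("unify-customers").getOrCreate()

# Hypothetical per-customer normalizers: each raw layout is different,
# but every one returns the same unified schema (id, event_time).
def normalize_customer_a(spark):
    return (spark.read.json("s3://bucket/customer_a/raw")
            .select(F.col("id"), F.to_timestamp("ts").alias("event_time")))

def normalize_customer_b(spark):
    return (spark.read.option("header", True).csv("s3://bucket/customer_b/raw")
            .select(F.col("customer_id").alias("id"),
                    F.to_timestamp("timestamp").alias("event_time")))

# Stage 1: run each customer's bespoke transform.
parts = [fn(spark) for fn in (normalize_customer_a, normalize_customer_b)]

# Stage 2: union into one unified dataset and process it with a single job.
unified = reduce(DataFrame.unionByName, parts)
unified.write.mode("overwrite").parquet("s3://bucket/unified")
```

With many customers, stage 1 fans out into many small independent jobs, which is where the concurrency question above comes from.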