This is a generic problem.
The cheap solution is to increase the number of shuffle partitions (in case the load is skewed) or to restart the cluster; a minimal tuning sketch follows below.
The safe solution is to increase the cluster size or the node sizes (SSD, RAM, …).
Eventually, you have to make sure that you ...
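For the shuffle-partition route, here is a minimal tuning sketch, assuming PySpark; the partition count, path, and key column are placeholders, not values from your job:

```python
from pyspark.sql import SparkSession

# Raise shuffle parallelism above the default of 200 before the wide
# transformation that is skewing; tune the number to your data volume.
spark = (
    SparkSession.builder
    .config("spark.sql.shuffle.partitions", "800")
    .getOrCreate()
)

# The setting can also be changed at runtime, right before the heavy stage:
spark.conf.set("spark.sql.shuffle.partitions", "800")

df = spark.read.parquet("/mnt/data/events")  # placeholder path
# Repartitioning on the hot key spreads a skewed load across more tasks.
result = df.repartition(800, "customer_id").groupBy("customer_id").count()
```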
My guess is that the reason this may not work is that the dictionary input does not have unique keys. With this syntax, the column names are the keys, so if you have two or more aggregations for the same column, the later entry overwrites the earlier one and only a single aggregation per column actually gets applied.
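If that is what is happening, the pitfall is easy to reproduce; a small sketch, assuming a toy DataFrame with an `amount` column (the names are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 25.0)], ["id", "amount"])

# Dict syntax: the column name is the key, so the second aggregation on
# "amount" overwrites the first one already at the Python-dict level.
df.agg({"amount": "max", "amount": "min"}).show()  # only min(amount) remains

# Explicit column expressions keep every aggregation.
df.agg(
    F.max("amount").alias("max_amount"),
    F.min("amount").alias("min_amount"),
).show()
```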
Your schema is tight, but make sure that converting the data to it does not throw an exception.
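A sketch of what I mean, assuming a JSON source (the path, column names, and types are illustrative): read with the explicit schema in FAILFAST mode so a row that cannot be converted raises immediately instead of silently becoming nulls.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])

df = (
    spark.read
    .schema(schema)
    .option("mode", "FAILFAST")    # fail loudly on rows that do not fit the schema
    .json("/mnt/data/raw_events")  # placeholder path
)
df.printSchema()
```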
Try with Memory Optimized nodes; you may be fine.
My problem was parsing a lot of data from sequence files containing 10K XML files and saving them as a table...
In a similar case, the following fixed the problem:
- Using Memory Optimised Nodes (Compute Optimised had problems)
- Tighter definition of the schema (especially for nested structures in PySpark, where field order may matter; see the sketch after this list)
- Using an S3a mount instead of an S3n mount
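For the schema point, here is a sketch of a tightly defined nested schema, with every field and its position spelled out explicitly (the field names are illustrative, not from my actual job):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, ArrayType, DoubleType,
)

# Nested struct fields are listed in the exact order they appear in the
# data, since (as noted above) field order can matter for nested structs.
xml_record_schema = StructType([
    StructField("doc_id", StringType(), nullable=False),
    StructField("header", StructType([
        StructField("source", StringType(), nullable=True),
        StructField("timestamp", StringType(), nullable=True),
    ]), nullable=True),
    StructField("values", ArrayType(DoubleType()), nullable=True),
])

# Used like any other schema, e.g.:
# df = spark.createDataFrame(parsed_rows, schema=xml_record_schema)
```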