Re: How to efficiently process a 50Gb JSON file an...

Hubert-Dudek · ‎04-05-2022

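@Radu Gheorghiu , Just save the JSON to delta immediately, without any processing (so you create the bronze delta layer): spark.read.option('multiline', 'true').json('dbfs:/mnt/data/delta/bronze/BBR_Actual_Totals.json').write.format("delta").save("/mnt/.../delta/"). Note the .write before .format("delta"), since format/save belong to the DataFrame writer, not to the DataFrame itself.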

All processing is done in the next step/notebook.

I think you will need a lot of small partitions. Remember that exploding makes each partition grow at least twofold.
In the Spark UI, also look for data skew and spills. Skew shouldn't be a problem here, but it is better to check.
When you save to bronze delta, you can add a salt column with values from 1 to 512 and partition by it. You will then have 512 delta partitions (folders/files), each holding a similarly sized part of your JSON. I would pick a multiple of the number of executor cores: for example, with 32 cores, 512 partitions/folders (files).
The driver could be a bit bigger than the executors.
When you load that bronze delta into silver, also control how many Spark partitions you have (these are different from the delta partitions, but 512 could be a good number here as well; each partition should be around 100 MB).
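
To make the numbers above concrete, here is a rough sketch, not something tested on your data. It assumes the file is about 50 GB and the cluster has 32 executor cores. The names partition_count and salt_and_write are made up for illustration, and the Spark part assumes an environment where Delta is available (as on Databricks).

```python
def partition_count(total_bytes: int, target_bytes: int, cores: int) -> int:
    """Smallest multiple of `cores` that keeps each partition at or below `target_bytes`."""
    needed = (total_bytes + target_bytes - 1) // target_bytes  # integer ceiling
    return ((needed + cores - 1) // cores) * cores             # round up to a multiple of cores


MB = 1024 ** 2
GB = 1024 ** 3

# Assumed figures: ~50 GB of input, ~100 MB per partition, 32 executor cores.
num_partitions = partition_count(50 * GB, 100 * MB, cores=32)
print(num_partitions)  # 512


def salt_and_write(df, path: str, num_salts: int) -> None:
    """Hypothetical helper: add a random salt in 1..num_salts and write Delta folders by salt."""
    # Imported here so the arithmetic above also runs without a Spark installation.
    from pyspark.sql import functions as F

    # rand() is uniform in [0, 1), so floor(rand() * n) + 1 falls in 1..n.
    salted = df.withColumn("salt", (F.floor(F.rand() * num_salts) + 1).cast("int"))
    (
        salted.repartition(num_salts, "salt")  # shuffle so rows with the same salt sit together
        .write.format("delta")
        .partitionBy("salt")                   # one folder per salt value
        .save(path)
    )
```

With these assumptions the count comes out at 512, which is 16 × 32 cores, so every core gets the same number of roughly 100 MB partitions. When loading into silver, something like spark.read.format("delta").load(path).repartition(num_partitions) would set the Spark partition count in the same way.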

I love processing big data sets 🙂 so let me know how it goes. There are some other solutions as well, so no worries, Spark will handle it 🙂

My blog: https://databrickster.medium.com/