Re: Parsing 5 GB json file is running long on clus...

-werners- · ‎03-03-2022

The Databricks docs state the following:

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.

What this means is that with multiline enabled you get no parallelism while reading the JSON file: the whole file is read as one piece.

So you have a few options:

- Do not use multiline. This is only possible if your JSON file contains one JSON object per line. Try it and see if it works.
- Use a larger cluster, i.e. nodes with more memory. In multi-line mode the whole file is read by a single task, so the node running that task needs enough memory. The number of cores is less important.
- If you can, split up the file.
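
To make the first and last options concrete, here is a minimal sketch in plain Python (no Spark needed) that streams a multi-line JSON file into several JSON Lines part files, one object per line. It assumes the file is a single top-level JSON array whose elements are all JSON objects; I don't know whether that matches your file. The function name `array_to_jsonl_parts`, the `part-00000.jsonl` naming and the default sizes are made up for illustration.

```python
import json
from pathlib import Path

# Characters that may sit between array elements: whitespace and commas.
_SKIP = " \t\r\n,"


def array_to_jsonl_parts(src, out_dir, records_per_part=100_000, chunk_chars=1 << 20):
    """Stream a top-level JSON array of objects into JSON Lines part files.

    Hypothetical helper: reads `src` in chunks so the whole file never has
    to fit in memory, and returns the list of part file paths it wrote.
    It is lenient about stray commas between elements.
    """
    decoder = json.JSONDecoder()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts, out, count = [], None, 0
    buf, pos, started, done = "", 0, False, False

    try:
        with open(src, "r", encoding="utf-8") as fh:
            while not done:
                chunk = fh.read(chunk_chars)
                eof = chunk == ""
                # Drop what was already consumed, then append the new chunk.
                buf = buf[pos:] + chunk
                pos = 0
                while True:
                    while pos < len(buf) and buf[pos] in _SKIP:
                        pos += 1
                    if pos == len(buf):
                        break  # buffer exhausted, read more input
                    ch = buf[pos]
                    if not started:
                        if ch != "[":
                            raise ValueError("expected a top-level JSON array")
                        started = True
                        pos += 1
                        continue
                    if ch == "]":
                        done = True  # end of the array
                        break
                    if ch != "{":
                        raise ValueError("expected every array element to be a JSON object")
                    try:
                        obj, pos = decoder.raw_decode(buf, pos)
                    except json.JSONDecodeError:
                        if eof:
                            raise  # truly malformed, not just cut off
                        break  # object is cut off at the chunk boundary
                    if count % records_per_part == 0:
                        # Start a new part file.
                        if out is not None:
                            out.close()
                        path = out_dir / f"part-{len(parts):05d}.jsonl"
                        parts.append(path)
                        out = open(path, "w", encoding="utf-8")
                    # json.dumps without indent always yields a single line.
                    out.write(json.dumps(obj) + "\n")
                    count += 1
                if eof and not done:
                    raise ValueError("unexpected end of file inside the array")
    finally:
        if out is not None:
            out.close()
    return parts
```

Once the parts are written to storage the cluster can reach, the output directory can be read in the default single-line mode, e.g. `spark.read.json("/path/to/out")` (path is a placeholder), and Spark can then read the parts in parallel.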
