-werners-

The Databricks docs state the following:

You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.

What this means is that in multi-line mode you get no parallelism while reading a JSON file: each file is read as a whole by a single task.

So you have a few options:

  1. Do not use multi-line mode. This is only possible if your JSON file contains one JSON object per line (the JSON Lines layout). Try it and see if it works.
  2. Use a cluster with more memory per node. Because the file cannot be split, it is read as a whole by a single task, so the node running that task needs enough memory. The number of cores is less important.
  3. If you can, split up the file.
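
Options 1 and 3 can be combined by rewriting the file before Spark reads it. Below is a minimal sketch in plain Python, assuming the source file is a single top-level JSON array. The function name `split_json_array`, the `part-NNNNN.jsonl` file names and the `records_per_file` default are made up for illustration and are not part of any Databricks or Spark API.

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, records_per_file=100_000):
    """Rewrite one multi-line JSON array file as several JSON Lines files.

    Hypothetical helper: assumes the top-level value in `src` is an array.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # json.load reads the whole file into memory, so run this on a machine
    # with enough RAM for the original file.
    with open(src, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    paths = []
    for start in range(0, len(records), records_per_file):
        chunk = records[start:start + records_per_file]
        path = out_dir / f"part-{len(paths):05d}.jsonl"
        with open(path, "w", encoding="utf-8") as out:
            for record in chunk:
                # Without `indent`, json.dumps emits no raw newlines,
                # so every record ends up on exactly one line.
                out.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths
```

Once the output directory is somewhere the cluster can reach, `spark.read.json(...)` on that directory without the `multiLine` option should read the parts in single-line mode, so they can be read in parallel. The conversion itself still loads the original file in one go, so it is a one-off cost rather than a way around the memory requirement.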
