03-03-2022 10:28 AM
The Databricks docs state the following:
You can read JSON files in single-line or multi-line mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.
This means that in multi-line mode a single JSON file is read without any parallelism.
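For reference, the two modes differ only in the `multiLine` reader option. Here is a minimal sketch, assuming an existing `SparkSession` is passed in as `spark` (as in a Databricks notebook). The helper name `read_json` is made up for illustration:

```python
def read_json(spark, path, multiline=False):
    """Read JSON with Spark in single-line (default) or multi-line mode.

    `spark` is assumed to be an existing SparkSession; `read_json` is a
    hypothetical helper, not part of Spark.
    """
    reader = spark.read
    if multiline:
        # Multi-line mode: each file is parsed as a whole and cannot be split.
        reader = reader.option("multiLine", True)
    # Single-line mode (the default) expects one JSON object per line.
    return reader.json(path)
```

Calling `read_json(spark, path)` uses the splittable single-line mode, while `read_json(spark, path, multiline=True)` gives the behaviour described in the quote.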
So you have a few options:
- Do not use multi-line mode. This only works if your JSON file contains one JSON object per line, so try it and see whether the file reads correctly.
- Use a cluster with larger nodes. A multi-line file is read as a whole by a single task, so the node running that task needs enough memory. The number of cores is less important.
- If you can, split up the file.
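If you go for splitting, one possible approach is to convert the file into several JSON Lines files (one object per line) outside Spark, which also makes single-line mode usable. The sketch below uses only the Python standard library and rests on two assumptions: the file holds a single top-level JSON array, and it fits in memory once on the machine doing the conversion. The function name `split_json_array` and the `part-*.jsonl` naming are made up for illustration:

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, records_per_file=100_000):
    """Split a file holding one top-level JSON array into JSON Lines parts.

    Returns the list of part files written. Assumes the whole array fits
    in memory once (hypothetical helper, not a Spark or Databricks API).
    """
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    parts = []
    for start in range(0, len(records), records_per_file):
        part = out / f"part-{start // records_per_file:05d}.jsonl"
        with part.open("w", encoding="utf-8") as fh:
            for record in records[start:start + records_per_file]:
                # json.dumps without indent writes each record on one line.
                fh.write(json.dumps(record) + "\n")
        parts.append(part)
    return parts
```

The output directory can then be read in single-line mode, for example with `spark.read.json(out_dir)`, so the parts can be processed in parallel.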