Hi @chris84,
You already identified the root cause: the JSON file is pretty-printed across multiple lines. By default, Spark's JSON reader expects one complete JSON record per line (the "JSON Lines" or NDJSON format). When a single JSON object spans multiple lines, Spark tries to parse each line independently; no individual line is valid JSON on its own, so the read fails or the rows come back malformed (in the default PERMISSIVE mode they typically surface in a _corrupt_record column).
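To see why line-at-a-time parsing breaks, here is a minimal sketch using plain Python's json module (no Spark needed; the sample records are made up):

```python
import json

# A pretty-printed file: ONE JSON object spread across several lines.
pretty = '{\n  "id": 1,\n  "name": "alice"\n}'

# JSON Lines (NDJSON): one complete JSON object PER line.
ndjson = '{"id": 1, "name": "alice"}\n{"id": 2, "name": "bob"}'

# Parsing the pretty-printed file line by line fails: no single line
# is a complete JSON document on its own.
for line in pretty.splitlines():
    try:
        json.loads(line)
    except json.JSONDecodeError:
        print(f"not valid JSON on its own: {line!r}")

# Parsing JSON Lines line by line works: every line is a full record.
records = [json.loads(line) for line in ndjson.splitlines()]
print(records)
```

This is the same mismatch Spark hits with multiline left at its default of false.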
Rather than reformatting your file to a single line, you can tell Spark to treat the entire file as one JSON record by using the multiline option.
PYTHON (PYSPARK)
df = spark.read.option("multiline", "true").json("/Volumes/workspace/default/test_volume/user_0.json")
df.show()
SQL (USING read_files)
SELECT * FROM read_files(
  '/Volumes/workspace/default/test_volume/user_0.json',
  format => 'json',
  multiLine => true
)
SQL (USING A TEMPORARY VIEW)
CREATE TEMPORARY VIEW user_data
USING json
OPTIONS (
  path '/Volumes/workspace/default/test_volume/user_0.json',
  multiline 'true'
);
SELECT * FROM user_data;
WHY THIS HAPPENS
Spark's default behavior (multiline = false) assumes each line in the file is a complete, self-contained JSON record. This is optimized for parallel reads of large files. When a single JSON object is formatted with line breaks and indentation (pretty-printed), each line is not valid JSON on its own, so parsing fails.
Setting multiline to true tells Spark to read the entire file as one entity and parse it as a whole, which handles pretty-printed JSON correctly.
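Note that multiline reads sacrifice that per-line parallelism, so if you will be reading the file repeatedly, you may prefer to normalize it once into JSON Lines and keep the default reader. A minimal sketch with plain Python (the paths and the helper name are hypothetical, not part of any Spark API):

```python
import json

def pretty_to_json_lines(src_path, dst_path):
    """Rewrite a pretty-printed JSON file as JSON Lines (one record per line)."""
    with open(src_path) as f:
        data = json.load(f)  # parse the whole file as one document
    # A top-level array becomes one line per element; a single
    # top-level object becomes a one-line file.
    records = data if isinstance(data, list) else [data]
    with open(dst_path, "w") as f:
        for record in records:
            # json.dumps without indent emits the record on a single line
            f.write(json.dumps(record) + "\n")

# Example usage (hypothetical paths):
# pretty_to_json_lines("user_0.json", "user_0.jsonl")
```

After converting, spark.read.json(...) works on the output with no extra options.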
DOCUMENTATION REFERENCES
- JSON file format documentation: https://docs.databricks.com/aws/en/query/formats/json
- read_files SQL function: https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files.html
* This reply was drafted with an agent system I built, which researches and drafts responses from the wide set of documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I correct it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.