bianca_unifeye
Databricks MVP

You can read a JSON file into relational columns in Databricks by explicitly binding a schema and choosing the right parsing mode.

PySpark (recommended when you already have a schema JSON)

import json
from pyspark.sql.types import StructType

schema_json = json.loads(dbutils.fs.head(
    "file:/Workspace/path/receipt.schema.json", 1000000))
schema = StructType.fromJson(schema_json)

df = (spark.read
      .schema(schema)
      .option("multiLine", "true")  # if JSON is multi-line
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("file:/Workspace/path/receipt.json"))

display(df)

Error handling

  • PERMISSIVE → keeps rows, bad records go to _corrupt_record

  • DROPMALFORMED → drops bad rows

  • FAILFAST → fails on first error

Databricks SQL

SELECT
  from_json(content,
    schema_of_json(
      (SELECT content
       FROM read_files('file:/Workspace/path/receipt.schema.json', format => 'text')
      )
  ) AS parsed
FROM read_files('file:/Workspace/path/receipt.json', format => 'text');

Then select fields from parsed to get relational columns.

Notes

  • Use file:/Workspace/... paths for workspace files (best for small dev/test data).

  • For production, use DBFS Volumes or cloud storage.

Docs