Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading JSON file to columns as relational ?

RIDBX
Contributor

Thanks for reviewing my threads. I would like to explore reading a JSON file into relational columns within Databricks.

I have an input file at workspace path > path\receipt.json

I have a schema at workspace path > path\receipt.schema.json

How can I read the input JSON file, bind it to the schema, and output the data in relational columns?

I would like to have both options: SQL and PySpark.

What should I do with errors while parsing?

Are there any docs/whitepapers on this subject?

Thanks for your insights.

5 REPLIES

bianca_unifeye
Databricks MVP

You can read a JSON file into relational columns in Databricks by explicitly binding a schema and choosing the right parsing mode.

PySpark (recommended when you already have a schema JSON)

import json
from pyspark.sql.types import StructType

schema_json = json.loads(dbutils.fs.head(
    "file:/Workspace/path/receipt.schema.json", 1000000))
schema = StructType.fromJson(schema_json)

df = (spark.read
      .schema(schema)
      .option("multiLine", "true")  # if JSON is multi-line
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json("file:/Workspace/path/receipt.json"))

display(df)

Error handling

  • PERMISSIVE → keeps rows, bad records go to _corrupt_record

  • DROPMALFORMED → drops bad rows

  • FAILFAST → fails on first error

Databricks SQL

SELECT
  from_json(
    value,
    schema_of_json(
      (SELECT value
       FROM read_files('file:/Workspace/path/receipt.schema.json', format => 'text'))
    )
  ) AS parsed
FROM read_files('file:/Workspace/path/receipt.json', format => 'text');

Then select fields from parsed to get relational columns.

Notes

  • Use file:/Workspace/... paths for workspace files (best for small dev/test data).

  • For production, use Unity Catalog volumes or cloud storage.


RIDBX
Contributor

Thanks for weighing in. I tested this piece and I am getting:

"Execution Error: An error occurred while calling o481.head
: org.apache.spark.security.exception: [INSUFFICIENT_PERMISSION] Insufficient privileges:
user does not have permission to select on any file. SQLSTATE: 42501"

 

How do I fix this? Is it a policy issue with my credentials?

Thanks for your guidance.

 

emma_s
Databricks Employee

Hi, it looks like you don't have SELECT permission on the location where you saved the file. Is it in a UC volume?

RIDBX
Contributor

Hi, I have not seen any response yet.

Is there an alternate way to use schema = StructType.fromJson(schema_json)?

 

Thanks 

 

SteveOstrowski
Databricks Employee

Hi @RIDBX,

Thanks for the thorough description. Flattening JSON into relational columns is one of the most common data engineering tasks in Databricks, and there are several powerful approaches depending on your JSON structure. Let me walk you through them.


STEP 1: READ THE JSON FILE

The first step is reading your JSON file into a DataFrame. Spark can infer the schema automatically.

For single-line JSON (one JSON object per line):

df = spark.read.json("/path/to/your/file.json")

For multi-line JSON (a single JSON object or array spanning multiple lines, which is the more common format for files exported from APIs):

df = spark.read.option("multiLine", "true").json("/path/to/your/file.json")

You can inspect the inferred schema with:

df.printSchema()
df.display()

This is the most important first step because it tells you exactly what nested structures you are dealing with (structs, arrays, arrays of structs, etc.).


STEP 2: ACCESS NESTED STRUCT FIELDS (DOT NOTATION)

If your JSON has nested objects like:

{"name": "Alice", "address": {"city": "Seattle", "state": "WA"}}

Spark reads "address" as a StructType. You can flatten it by selecting individual fields using dot notation:

from pyspark.sql.functions import col

df_flat = df.select(
    col("name"),
    col("address.city").alias("city"),
    col("address.state").alias("state")
)

You can also use the star (*) operator to expand ALL fields in a struct:

df_flat = df.select("name", "address.*")

This automatically creates one column per field inside the "address" struct.


STEP 3: EXPLODE ARRAYS INTO ROWS

If your JSON has arrays like:

{"customer": "Alice", "orders": [{"id": 1, "amount": 50}, {"id": 2, "amount": 75}]}

You need to use explode() to turn each array element into its own row:

from pyspark.sql.functions import explode

df_exploded = df.select(
    col("customer"),
    explode(col("orders")).alias("order")
)

This gives you one row per order. Since each element is a struct, combine with dot notation:

df_final = df_exploded.select(
    col("customer"),
    col("order.id").alias("order_id"),
    col("order.amount").alias("order_amount")
)

Or do it in one step using inline(), which explodes an array of structs directly into columns:

from pyspark.sql.functions import inline

df_final = df.select(
    col("customer"),
    inline(col("orders"))
)

Note: inline() is available in Databricks Runtime 12.2 LTS and above as a table-valued generator.


STEP 4: HANDLE DEEPLY NESTED JSON

For deeply nested JSON, you may need to chain these operations. For example:

{
  "company": "Acme",
  "departments": [
    {
      "name": "Engineering",
      "employees": [
        {"name": "Alice", "skills": ["Python", "Spark"]},
        {"name": "Bob", "skills": ["Java", "Scala"]}
      ]
    }
  ]
}

You would flatten this step by step:

from pyspark.sql.functions import col, explode

# Step 1: Explode departments array
df1 = df.select(
    col("company"),
    explode(col("departments")).alias("dept")
)

# Step 2: Explode employees array within each department
df2 = df1.select(
    col("company"),
    col("dept.name").alias("department"),
    explode(col("dept.employees")).alias("emp")
)

# Step 3: Explode skills array and extract employee name
df3 = df2.select(
    col("company"),
    col("department"),
    col("emp.name").alias("employee_name"),
    explode(col("emp.skills")).alias("skill")
)


SQL APPROACH: THE COLON (:) SYNTAX

If you prefer SQL, Databricks has a powerful colon operator for accessing nested JSON fields. The examples below assume a string column named raw that holds the JSON text. First, register your data as a view:

df.createOrReplaceTempView("raw_data")

Then query nested fields directly:

SELECT
  raw:customer::string AS customer,
  raw:orders[0].id::int AS first_order_id,
  raw:orders[0].amount::double AS first_order_amount
FROM raw_data

For exploding arrays in SQL:

SELECT
  raw:customer::string AS customer,
  order_data.id::int AS order_id,
  order_data.amount::double AS order_amount
FROM raw_data
LATERAL VIEW explode(from_json(
  raw:orders,
  'ARRAY<STRUCT<id: INT, amount: DOUBLE>>'
)) AS order_data

Or if your table already has a properly inferred schema (not stored as a raw string column):

SELECT
  customer,
  o.id AS order_id,
  o.amount AS order_amount
FROM my_table
LATERAL VIEW explode(orders) AS o


BONUS: USING from_json() FOR STRING COLUMNS

If your JSON is stored as a plain string column (for example, from a Kafka topic or a text file), use from_json() with a schema to parse it:

from pyspark.sql.functions import from_json, schema_of_json

# Let Spark infer the schema from a sample
json_sample = '{"name": "Alice", "address": {"city": "Seattle"}}'
schema = schema_of_json(json_sample)

df_parsed = df.select(
    from_json(col("json_string_column"), schema).alias("parsed")
).select("parsed.*")


QUICK REFERENCE: WHICH FUNCTION TO USE

- Nested objects (structs) --> col("parent.child") or select("parent.*")
- Arrays --> explode() to get one row per element
- Arrays of structs --> inline() to get rows AND columns in one step
- Nested arrays --> flatten() first, then explode()
- JSON string column --> from_json() to parse, then use the above
- SQL queries --> use the : colon operator for extraction


DOCUMENTATION LINKS

- Reading JSON files: https://docs.databricks.com/en/query/formats/json.html
- Querying semi-structured data (colon syntax): https://docs.databricks.com/en/sql/language-manual/sql-ref-json-path-expression.html
- explode() function: https://docs.databricks.com/en/sql/language-manual/functions/explode.html
- inline() function: https://docs.databricks.com/en/sql/language-manual/functions/inline.html
- from_json() function: https://docs.databricks.com/en/sql/language-manual/functions/from_json.html
- schema_of_json() function: https://docs.databricks.com/en/sql/language-manual/functions/schema_of_json.html
- flatten() function: https://docs.databricks.com/en/sql/language-manual/functions/flatten.html

If you can share a sample of your JSON structure, I can give you a more specific solution tailored to your data. Hope this helps!

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.