Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Flatten a complex JSON file and load into a delta table

BeginnerBob
New Contributor III

Hi,

I am loading a JSON file into Databricks by simply doing the following:

from pyspark.sql.functions import *
from pyspark.sql.types import *

bronze_path = "wasbs://....../140477.json"

df_incremental = spark.read.option("multiline", "true").json(bronze_path)
display(df_incremental)

My JSON file is complicated and is displayed below:

[screenshot of the nested JSON output]

I want to be able to load this data into a delta table.

My schema is:

type AutoGenerated struct {
	Audit struct {
		Refno         string `json:"refno"`
		Formid        string `json:"formid"`
		AuditName     string `json:"audit_name"`
		AuditorName   string `json:"auditor_name"`
		Location      string `json:"location"`
		Fulllocation  string `json:"fulllocation"`
		Published     string `json:"published"`
		Date          string `json:"date"`
		Compliant     string `json:"compliant"`
		Archived      string `json:"archived"`
		Score         string `json:"score"`
		PossibleScore string `json:"possible_score"`
		Percentage    string `json:"percentage"`
		Answers       []struct {
			QuestionNumber   string        `json:"question_number"`
			Question         string        `json:"question"`
			Status           string        `json:"status"`
			Answerid         string        `json:"answerid"`
			Questionid       string        `json:"questionid"`
			Answer           string        `json:"answer"`
			Ansoptid         string        `json:"ansoptid,omitempty"`
			Observation      string        `json:"observation"`
			Compliant        string        `json:"compliant"`
			Score            string        `json:"score"`
			PossibleScore    string        `json:"possible_score"`
			DateResolved     string        `json:"date_resolved"`
			ResolvedByUser   string        `json:"resolved_by_user"`
			DateCompliant    string        `json:"date_compliant"`
			Compliancy       []interface{} `json:"compliancy"`
			HookName         string        `json:"hookName"`
			DateAuthorised   string        `json:"date_authorised,omitempty"`
			AuthorisedByUser string        `json:"authorised_by_user,omitempty"`
		} `json:"answers"`
	} `json:"audit"`
}

Any idea how to do this?

4 REPLIES

-werners-
Esteemed Contributor III

Delta can handle nested columns, so you can just write the DataFrame to Delta Lake as-is.

Have you tried it already? It will probably just work.

If you are talking about new incoming data that you want to merge into an existing table, that will be a tad more complex: you need to define a merge key, which decides which operation is performed (insert, update, delete).
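For the merge case, here is a minimal sketch using the Delta Lake Python API. The target path and helper name are my own, and it assumes audit.refno uniquely identifies an audit record and that a Delta table already exists at the target path — adjust the key to whatever actually identifies your rows:

```python
# Sketch of an upsert into an existing Delta table.
# Assumptions (adjust for your data):
#   - a Delta table already exists at TARGET_PATH (hypothetical location)
#   - audit.refno uniquely identifies an audit, so it can act as the merge key
TARGET_PATH = "/mnt/silver/audits"  # hypothetical target location
MERGE_CONDITION = "target.audit.refno = source.audit.refno"

def upsert_audits(spark, df_incremental):
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, TARGET_PATH)
    (target.alias("target")
           .merge(df_incremental.alias("source"), MERGE_CONDITION)
           .whenMatchedUpdateAll()     # existing refno -> update the row
           .whenNotMatchedInsertAll()  # new refno      -> insert the row
           .execute())
```

whenMatchedUpdateAll / whenNotMatchedInsertAll copy every column, which is convenient for a nested schema like this one; use the column-level variants if you only want to touch specific fields.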

BeginnerBob
New Contributor III
How would I write this to a delta table to see if it works?

-werners-
Esteemed Contributor III

OK, so you already have a DataFrame with your JSON data (with inferred schema).

The next step is:

df.write \
  .format("delta") \
  .mode("overwrite") \
  .save("<whereveryouwanttostoreyourdata>")
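Since the original question asks about flattening, here is a sketch of one way to do it before writing. The helper name is my own; the field names are taken from the schema posted above, assuming Spark inferred them the same way. It explodes the answers array so each answer becomes a row, and prefixes the audit-level fields to avoid clashing with the answer-level score, possible_score, and compliant columns:

```python
# Hypothetical helper: flatten the nested audit JSON into one row per answer.
# Field names follow the schema posted in the question; adjust if Spark
# inferred anything differently.
AUDIT_FIELDS = [
    "refno", "formid", "audit_name", "auditor_name", "location",
    "fulllocation", "published", "date", "compliant", "archived",
    "score", "possible_score", "percentage",
]

def flatten_audit(df):
    from pyspark.sql.functions import col, explode_outer

    # One row per element of audit.answers; explode_outer keeps audits
    # that have no answers at all.
    exploded = df.select(
        *[col(f"audit.{f}").alias(f"audit_{f}") for f in AUDIT_FIELDS],
        explode_outer("audit.answers").alias("answer"),
    )
    # Promote the answer struct's fields to top-level columns. The
    # "audit_" prefix above prevents name collisions with answer.score,
    # answer.possible_score and answer.compliant.
    return exploded.select(*[f"audit_{f}" for f in AUDIT_FIELDS], "answer.*")
```

You could then write the flattened frame with the same pattern shown above, e.g. flatten_audit(df_incremental).write.format("delta").mode("overwrite").save(...).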

Vidula
Honored Contributor

Hi @Lloyd Vickery​ 

Does @Werner Stinckens​'s response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?

We'd love to hear from you.

Thanks!
