Data Engineering

Flatten a complex JSON file and load into a delta table

BeginnerBob
New Contributor III

Hi,

I am loading a JSON file into Databricks by simply doing the following:

from pyspark.sql.functions import *
from pyspark.sql.types import *

bronze_path = "wasbs://....../140477.json"

df_incremental = spark.read.option("multiline", "true").json(bronze_path)
display(df_incremental)

My JSON file is complex and displays as:

[screenshot of the nested JSON output]

I want to be able to load this data into a Delta table.

My schema (expressed as a Go struct) is:

type AutoGenerated struct {
	Audit struct {
		Refno         string `json:"refno"`
		Formid        string `json:"formid"`
		AuditName     string `json:"audit_name"`
		AuditorName   string `json:"auditor_name"`
		Location      string `json:"location"`
		Fulllocation  string `json:"fulllocation"`
		Published     string `json:"published"`
		Date          string `json:"date"`
		Compliant     string `json:"compliant"`
		Archived      string `json:"archived"`
		Score         string `json:"score"`
		PossibleScore string `json:"possible_score"`
		Percentage    string `json:"percentage"`
		Answers       []struct {
			QuestionNumber   string        `json:"question_number"`
			Question         string        `json:"question"`
			Status           string        `json:"status"`
			Answerid         string        `json:"answerid"`
			Questionid       string        `json:"questionid"`
			Answer           string        `json:"answer"`
			Ansoptid         string        `json:"ansoptid,omitempty"`
			Observation      string        `json:"observation"`
			Compliant        string        `json:"compliant"`
			Score            string        `json:"score"`
			PossibleScore    string        `json:"possible_score"`
			DateResolved     string        `json:"date_resolved"`
			ResolvedByUser   string        `json:"resolved_by_user"`
			DateCompliant    string        `json:"date_compliant"`
			Compliancy       []interface{} `json:"compliancy"`
			HookName         string        `json:"hookName"`
			DateAuthorised   string        `json:"date_authorised,omitempty"`
			AuthorisedByUser string        `json:"authorised_by_user,omitempty"`
		} `json:"answers"`
	} `json:"audit"`
}
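In case it helps, here is my rough attempt at expressing that schema explicitly in PySpark instead of relying on inference. I have assumed every leaf is a string, as in the struct, and approximated compliancy's []interface{} as an array of strings:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# PySpark sketch of the Go struct above; all leaves assumed to be strings.
answer_schema = StructType([
    StructField("question_number", StringType()),
    StructField("question", StringType()),
    StructField("status", StringType()),
    StructField("answerid", StringType()),
    StructField("questionid", StringType()),
    StructField("answer", StringType()),
    StructField("ansoptid", StringType()),
    StructField("observation", StringType()),
    StructField("compliant", StringType()),
    StructField("score", StringType()),
    StructField("possible_score", StringType()),
    StructField("date_resolved", StringType()),
    StructField("resolved_by_user", StringType()),
    StructField("date_compliant", StringType()),
    StructField("compliancy", ArrayType(StringType())),  # []interface{} approximated
    StructField("hookName", StringType()),
    StructField("date_authorised", StringType()),
    StructField("authorised_by_user", StringType()),
])

audit_schema = StructType([
    StructField("audit", StructType([
        StructField("refno", StringType()),
        StructField("formid", StringType()),
        StructField("audit_name", StringType()),
        StructField("auditor_name", StringType()),
        StructField("location", StringType()),
        StructField("fulllocation", StringType()),
        StructField("published", StringType()),
        StructField("date", StringType()),
        StructField("compliant", StringType()),
        StructField("archived", StringType()),
        StructField("score", StringType()),
        StructField("possible_score", StringType()),
        StructField("percentage", StringType()),
        StructField("answers", ArrayType(answer_schema)),
    ]))
])

df_incremental = spark.read.option("multiline", "true").schema(audit_schema).json(bronze_path)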

Any idea how to do this?

4 REPLIES

-werners-
Esteemed Contributor III

Delta can handle nested columns, so you can just write it to Delta Lake.

Have you tried it already? It will probably just work.

If you are talking about new incoming data that you want to merge into an existing table, that is a tad more complex: you need to define a merge key, which decides which operation (insert, update, or delete) is applied to each record.
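For example, a rough sketch of such a merge (assuming refno uniquely identifies an audit and the target Delta table already exists; adjust the key and path to your data):

from delta.tables import DeltaTable

# Hypothetical merge key: audit.refno is assumed to be unique per record.
target = DeltaTable.forPath(spark, "<pathtoyourdeltatable>")

(target.alias("t")
    .merge(df_incremental.alias("s"), "t.audit.refno = s.audit.refno")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())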

BeginnerBob
New Contributor III

How would I write this to a Delta table to see if it works?

-werners-
Esteemed Contributor III

OK, so you already have a DataFrame with your JSON data (with an inferred schema).

Next step is:

df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("<whereveryouwanttostoreyourdata>")
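And if you want to flatten the nested structure into one row per answer (as your title suggests), something along these lines might work. It is only a sketch based on the schema you posted: I prefix the answer-level fields because compliant, score, and possible_score exist at both levels, and note that explode drops records whose answers array is empty (use explode_outer to keep them):

from pyspark.sql.functions import col, explode

# One row per element of audit.answers, with the audit-level fields
# repeated on every row.
df_flat = (df_incremental
    .select("audit.*")
    .withColumn("ans", explode("answers"))
    .drop("answers"))

# Prefix the answer-level fields to avoid clashes with audit-level
# columns of the same name (compliant, score, possible_score).
for f in df_flat.schema["ans"].dataType.fieldNames():
    df_flat = df_flat.withColumn("answer_" + f, col("ans." + f))
df_flat = df_flat.drop("ans")

df_flat.write.format("delta").mode("overwrite").save("<whereveryouwanttostoreyourdata>")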

Vidula
Honored Contributor

Hi @Lloyd Vickery,

Does @Werner Stinckens' response answer your question? If so, would you be happy to mark it as best so that other members can find the solution more quickly?

We'd love to hear from you.

Thanks!
