Data Engineering

Flatten a complex JSON file and load into a delta table

BeginnerBob
New Contributor III

Hi,

I am loading a JSON file into Databricks by simply doing the following:

from pyspark.sql.functions import *
from pyspark.sql.types import *

bronze_path = "wasbs://....../140477.json"

df_incremental = spark.read.option("multiline", "true").json(bronze_path)
display(df_incremental)

My JSON file is complex and displays as:

[screenshot of the nested JSON output]

I want to be able to load this data into a Delta table.

My schema (expressed as a Go struct) is:

type AutoGenerated struct {
	Audit struct {
		Refno         string `json:"refno"`
		Formid        string `json:"formid"`
		AuditName     string `json:"audit_name"`
		AuditorName   string `json:"auditor_name"`
		Location      string `json:"location"`
		Fulllocation  string `json:"fulllocation"`
		Published     string `json:"published"`
		Date          string `json:"date"`
		Compliant     string `json:"compliant"`
		Archived      string `json:"archived"`
		Score         string `json:"score"`
		PossibleScore string `json:"possible_score"`
		Percentage    string `json:"percentage"`
		Answers       []struct {
			QuestionNumber   string        `json:"question_number"`
			Question         string        `json:"question"`
			Status           string        `json:"status"`
			Answerid         string        `json:"answerid"`
			Questionid       string        `json:"questionid"`
			Answer           string        `json:"answer"`
			Ansoptid         string        `json:"ansoptid,omitempty"`
			Observation      string        `json:"observation"`
			Compliant        string        `json:"compliant"`
			Score            string        `json:"score"`
			PossibleScore    string        `json:"possible_score"`
			DateResolved     string        `json:"date_resolved"`
			ResolvedByUser   string        `json:"resolved_by_user"`
			DateCompliant    string        `json:"date_compliant"`
			Compliancy       []interface{} `json:"compliancy"`
			HookName         string        `json:"hookName"`
			DateAuthorised   string        `json:"date_authorised,omitempty"`
			AuthorisedByUser string        `json:"authorised_by_user,omitempty"`
		} `json:"answers"`
	} `json:"audit"`
}
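In case it helps, here is my rough attempt at expressing that schema explicitly in PySpark instead of relying on inference. I have assumed every leaf is a string, as in the struct, and approximated compliancy's []interface{} as an array of strings:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

# PySpark sketch of the Go struct above; all leaves assumed to be strings.
answer_schema = StructType([
    StructField("question_number", StringType()),
    StructField("question", StringType()),
    StructField("status", StringType()),
    StructField("answerid", StringType()),
    StructField("questionid", StringType()),
    StructField("answer", StringType()),
    StructField("ansoptid", StringType()),
    StructField("observation", StringType()),
    StructField("compliant", StringType()),
    StructField("score", StringType()),
    StructField("possible_score", StringType()),
    StructField("date_resolved", StringType()),
    StructField("resolved_by_user", StringType()),
    StructField("date_compliant", StringType()),
    StructField("compliancy", ArrayType(StringType())),  # []interface{} approximated
    StructField("hookName", StringType()),
    StructField("date_authorised", StringType()),
    StructField("authorised_by_user", StringType()),
])

audit_schema = StructType([
    StructField("audit", StructType([
        StructField("refno", StringType()),
        StructField("formid", StringType()),
        StructField("audit_name", StringType()),
        StructField("auditor_name", StringType()),
        StructField("location", StringType()),
        StructField("fulllocation", StringType()),
        StructField("published", StringType()),
        StructField("date", StringType()),
        StructField("compliant", StringType()),
        StructField("archived", StringType()),
        StructField("score", StringType()),
        StructField("possible_score", StringType()),
        StructField("percentage", StringType()),
        StructField("answers", ArrayType(answer_schema)),
    ]))
])

df_incremental = spark.read.option("multiline", "true").schema(audit_schema).json(bronze_path)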

Any idea how to do this?

4 REPLIES

-werners-
Esteemed Contributor III

Delta can handle nested columns, so you can just write it to Delta Lake.

Have you tried it already? It will probably just work.

If you are talking about new incoming data that you want to merge into an existing table, that is a tad more complex: you need to define a merge key, which decides which operation (insert, update, or delete) is applied to each record.
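For example, a rough sketch of such a merge (assuming refno uniquely identifies an audit and the target Delta table already exists; adjust the key and path to your data):

from delta.tables import DeltaTable

# Hypothetical merge key: audit.refno is assumed to be unique per record.
target = DeltaTable.forPath(spark, "<pathtoyourdeltatable>")

(target.alias("t")
    .merge(df_incremental.alias("s"), "t.audit.refno = s.audit.refno")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())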

BeginnerBob
New Contributor III

How would I write this to a Delta table to see if it works?

-werners-
Esteemed Contributor III

OK, so you already have a DataFrame with your JSON data (with an inferred schema).

Next step is:

df.write \
    .format("delta") \
    .mode("overwrite") \
    .save("<whereveryouwanttostoreyourdata>")
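And if you want to flatten the nested structure into one row per answer (as your title suggests), something along these lines might work. It is only a sketch based on the schema you posted: I prefix the answer-level fields because compliant, score, and possible_score exist at both levels, and note that explode drops records whose answers array is empty (use explode_outer to keep them):

from pyspark.sql.functions import col, explode

# One row per element of audit.answers, with the audit-level fields
# repeated on every row.
df_flat = (df_incremental
    .select("audit.*")
    .withColumn("ans", explode("answers"))
    .drop("answers"))

# Prefix the answer-level fields to avoid clashes with audit-level
# columns of the same name (compliant, score, possible_score).
for f in df_flat.schema["ans"].dataType.fieldNames():
    df_flat = df_flat.withColumn("answer_" + f, col("ans." + f))
df_flat = df_flat.drop("ans")

df_flat.write.format("delta").mode("overwrite").save("<whereveryouwanttostoreyourdata>")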

Vidula
Honored Contributor

Hi @Lloyd Vickery,

Does @Werner Stinckens' response answer your question? If so, would you be happy to mark it as best so that other members can find the solution more quickly?

We'd love to hear from you.

Thanks!
