Data Engineering

Failed to merge incompatible data types

hare
New Contributor III

We process JSON files from the storage location every day, and each file is archived once its records have been appended to the respective tables.

source_location_path: "..../mon=05/day=01/fld1", "..../mon=05/day=01/fld2", ..., "..../mon=05/day=01/fldN"

To read:

df = spark.read.json(source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")

To write:

df.write.partitionBy("month").option("mergeSchema", "true").mode("append").saveAsTable(table_name)

Day 1 file content:

{
		"id": "1",
		"name": "test",
		"check": [
			{
				"id": "1",
				"value": "con1",
				"comment": "test"
			},
			{
				"id": "2",
				"value": "con2",
				"comment": "test"
			}
		],
		"detail": [],
		"namedetail": {
			"1abc2efg": "Test Name"
			}
	}

Day 2 file content:

{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {}
	}
 
 
or
 
 
{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {"94jut2jfn": "Tester"}
	}

Issue 1: When we try to append the Day 2 records into the existing table, we get "Failed to merge incompatible data types". The Day 1 file has values for all of the complex and primitive columns, so their data types are inferred fully.

The Day 2 file has empty values, so the inferred data types change.

For example:

Day 1: column "check" is inferred as "array of struct" - it has a nested structure.

Day 2: column "check" is inferred as a plain "array" - since the column doesn't have any records, the element type can't be determined (illustrated below).

Issue 2: Day 1 - column "namedetail" is a struct type and it has the value ("1abc2efg": "Test Name").

Day 2: the same column "namedetail" has a different key ("94jut2jfn": "Tester").

So Spark treats the Day 2 "namedetail" key as a new column, and we aren't able to merge it. We also face an issue when "namedetail" is empty.
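One common workaround (a sketch, assuming the field names shown in the sample files above) is to skip inference and supply an explicit schema, modelling "namedetail" as a map so its dynamic keys become map entries rather than new struct fields:

from pyspark.sql.types import (ArrayType, MapType, StringType,
                               StructField, StructType)

# Schema assumed from the sample files; with inference off, an empty "check"
# array keeps its array<struct> type, and dynamic "namedetail" keys such as
# "1abc2efg" or "94jut2jfn" land as map entries instead of new columns.
file_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("check", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("value", StringType()),
        StructField("comment", StringType()),
    ]))),
    StructField("detail", ArrayType(StringType())),
    StructField("namedetail", MapType(StringType(), StringType())),
])

df = spark.read.json(source_location_path, schema=file_schema, multiLine=True,
                     pathGlobFilter=f"*_{file_name}")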

Can you please help me handle data type mismatches and dynamic column changes?

Thanks,

HK

1 REPLY

Shalabh007
Honored Contributor

@Hare Krishnan the issues highlighted can easily be handled by using .option("mergeSchema", "true") at the time of reading all the files.

Sample code:

spark.read.option("mergeSchema", "true").json(<file paths>, multiLine=True)

The only scenario this will not be able to handle is when the type inside your nested column is not the same.

Sample file 1:

{
	"name": "test",
	"check": [
		{
			"id": "1"
		}
	]
}

Sample file 2:

{
	"name": "test",
	"check": [
		{
			"id": 1
		}
	]
}

For the above two files the mergeSchema option will fail, because the "id" column inside "check" has two different types of values: string (in file 1) and int (in file 2).

To handle this scenario you will have to write a custom function.
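A minimal sketch of such a function (an assumed approach, not spelled out here; df_file1 and df_file2 stand for the two sample files read separately) casts the conflicting nested field to one target type before the union:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Target type for "check": pin the nested "id" to string, so the string "1"
# (file 1) and the int 1 (file 2) both end up as strings.
target_check = ArrayType(StructType([StructField("id", StringType())]))

def harmonize(df):
    # cast() on an array<struct> casts the nested fields element-wise
    return df.withColumn("check", F.col("check").cast(target_check))

merged = harmonize(df_file1).unionByName(harmonize(df_file2))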
