
Failed to merge incompatible data types

hare
New Contributor III

We process JSON files from the storage location every day, and each file is archived once its records have been appended to the respective tables.

source_location_path: "..../mon=05/day=01/fld1" , "..../mon=05/day=01/fld2" ..... "..../mon=05/day=01/fldN"

To read:

df = spark.read.json(source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")

To write:

df.write.partitionBy("month").option("mergeSchema", "true").mode("append").saveAsTable(table_name)

Day 1 file content:

{
	"id": "1",
	"name": "test",
	"check": [
		{
			"id": "1",
			"value": "con1",
			"comment": "test"
		},
		{
			"id": "2",
			"value": "con2",
			"comment": "test"
		}
	],
	"detail": [],
	"namedetail": {
		"1abc2efg": "Test Name"
	}
}

Day 2 file content:

{
	"id": "2",
	"name": "test2",
	"check": [],
	"detail": [],
	"namedetail": {}
}
 
 
or
 
 
{
	"id": "2",
	"name": "test2",
	"check": [],
	"detail": [],
	"namedetail": {"94jut2jfn": "Tester"}
}

Issue 1: When we try to append values into the existing table, we get "Failed to merge incompatible data types". This happens because the Day 1 file has values for all of the complex and primitive columns.

The Day 2 file has empty values, so the inferred data types change.

For example:

Day 1: column "check" is inferred as an array of structs, since it has a nested structure.

Day 2: column "check" is inferred as a plain array, since it has no records.

Issue 2: On Day 1, column "namedetail" is a struct type holding the value ("1abc2efg": "Test Name").

Day 2: The same column "namedetail" holds a different key ("94jut2jfn": "Tester").

So Spark treats the Day 2 "namedetail" key as a new column, and we are not able to merge it.

Also, we face the same issue when "namedetail" is empty.
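
The same kind of repro for "namedetail" shows why: every new key becomes a new struct field (key names taken from the files above):

d1 = spark.read.json(spark.sparkContext.parallelize(['{"namedetail": {"1abc2efg": "Test Name"}}']))
d1.printSchema()  # namedetail: struct<1abc2efg:string>

d2 = spark.read.json(spark.sparkContext.parallelize(['{"namedetail": {"94jut2jfn": "Tester"}}']))
d2.printSchema()  # namedetail: struct<94jut2jfn:string>  <- a new field per key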

Can you please help me handle the data type mismatches and the dynamically changing columns?

Thanks,

HK

1 REPLY

Shalabh007
Honored Contributor

@Hare Krishnan the issues highlighted can easily be handled by using .option("mergeSchema", "true") at the time of reading all the files.

Sample code:

spark.read.option("mergeSchema", "true").json(<file paths>, multiLine=True)

The only scenario this will not be able to handle is when the type inside your nested column is not the same.

Sample file 1:

{
	"name": "test",
	"check": [
		{
			"id": "1"
		}
	]
}

Sample file 2:

{
	"name": "test",
	"check": [
		{
			"id": 1
		}
	]
}

For the above two files, the mergeSchema option will fail because the "id" column inside "check" has two different types of values: string (in file 1) and int (in file 2).

To handle this scenario you will have to write a custom function.
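
For example, one possible approach (an untested sketch; the field list is taken from the Day 1 file above, and the element type of "detail" is an assumption since the samples only show it empty) is to skip inference entirely with an explicit schema: pin the conflicting "id" leaves to string, and read "namedetail" as a map so new keys stop becoming new columns:

from pyspark.sql.types import (ArrayType, MapType, StringType,
                               StructField, StructType)

# Elements of "check": all leaves pinned to string, so a numeric
# token like 1 is read as its text form instead of conflicting.
check_element = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
    StructField("comment", StringType()),
])

schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("check", ArrayType(check_element)),
    # Assumption: "detail" is only ever empty in the samples,
    # so its element type is defaulted to string here.
    StructField("detail", ArrayType(StringType())),
    # A map instead of a struct: new keys become new map entries,
    # not new columns, and {} is just an empty map.
    StructField("namedetail", MapType(StringType(), StringType())),
])

df = spark.read.schema(schema).json(source_location_path, multiLine=True)

With the schema fixed up front, empty arrays and new "namedetail" keys no longer change the table schema, so the append should stop failing.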
