Data Engineering

Failed to merge incompatible data types

hare
New Contributor III

We process JSON files from the storage location every day, and each file is archived once its records have been appended to the respective tables.

source_location_path: "..../mon=05/day=01/fld1", "..../mon=05/day=01/fld2", ..., "..../mon=05/day=01/fldN"

To read:

df = spark.read.json(source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")

To write:

df.write.partitionBy("month").option("mergeSchema", "true").mode("append").saveAsTable(table_name)

Day 1 file content:

{
		"id": "1",
		"name": "test",
		"check": [
			{
				"id": "1",
				"value": "con1",
				"comment": "test"
			},
			{
				"id": "2",
				"value": "con2",
				"comment": "test"
			}
		],
		"detail": [],
		"namedetail": {
			"1abc2efg": "Test Name"
			}
	}

Day 2 file content:

{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {}
	}
 
 
or
 
 
{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {"94jut2jfn": "Tester"}
	}

Issue 1: When we try to append the Day 2 records into the existing table, we get "Failed to merge incompatible data types". The Day 1 file has values for all of the complex and primitive columns, so their data types are inferred fully.

The Day 2 file has empty values, so the inferred data types change.

For example:

Day 1: column "check" is inferred as "array of struct" - it has a nested structure.

Day 2: column "check" is inferred as a plain "array" - since the column doesn't have any records, the element type can't be determined (illustrated below).

Issue 2: Day 1 - column "namedetail" is a struct type and it has the value ("1abc2efg": "Test Name").

Day 2: the same column "namedetail" has a different key ("94jut2jfn": "Tester").

So Spark treats the Day 2 "namedetail" key as a new column, and we aren't able to merge it. We also face an issue when "namedetail" is empty.
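One common workaround (a sketch, assuming the field names shown in the sample files above) is to skip inference and supply an explicit schema, modelling "namedetail" as a map so its dynamic keys become map entries rather than new struct fields:

from pyspark.sql.types import (ArrayType, MapType, StringType,
                               StructField, StructType)

# Schema assumed from the sample files; with inference off, an empty "check"
# array keeps its array<struct> type, and dynamic "namedetail" keys such as
# "1abc2efg" or "94jut2jfn" land as map entries instead of new columns.
file_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("check", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("value", StringType()),
        StructField("comment", StringType()),
    ]))),
    StructField("detail", ArrayType(StringType())),
    StructField("namedetail", MapType(StringType(), StringType())),
])

df = spark.read.json(source_location_path, schema=file_schema, multiLine=True,
                     pathGlobFilter=f"*_{file_name}")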

Can you please help me handle data type mismatches and dynamic column changes?

Thanks,

HK

1 REPLY

Shalabh007
Honored Contributor

@Hare Krishnan the issues highlighted can easily be handled by using .option("mergeSchema", "true") at the time of reading all the files.

Sample code:

spark.read.option("mergeSchema", "true").json(<file paths>, multiLine=True)

The only scenario this will not be able to handle is when the type inside your nested column is not the same.

Sample file 1:

{
	"name": "test",
	"check": [
		{
			"id": "1"
		}
	]
}

Sample file 2:

{
	"name": "test",
	"check": [
		{
			"id": 1
		}
	]
}

For the above two files the mergeSchema option will fail, because the "id" column inside "check" has two different types of values: string (in file 1) and int (in file 2).

To handle this scenario you will have to write a custom function.
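A minimal sketch of such a function (an assumed approach, not spelled out here; df_file1 and df_file2 stand for the two sample files read separately) casts the conflicting nested field to one target type before the union:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Target type for "check": pin the nested "id" to string, so the string "1"
# (file 1) and the int 1 (file 2) both end up as strings.
target_check = ArrayType(StructType([StructField("id", StringType())]))

def harmonize(df):
    # cast() on an array<struct> casts the nested fields element-wise
    return df.withColumn("check", F.col("check").cast(target_check))

merged = harmonize(df_file1).unionByName(harmonize(df_file2))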
