Data Engineering

Failed to merge incompatible data types

hare
New Contributor III

We process a JSON file from the storage location every day, and it is archived once its records have been appended into the respective tables.

source_location_path: "..../mon=05/day=01/fld1" , "..../mon=05/day=01/fld2" ..... "..../mon=05/day=01/fldN"

To read:

df = spark.read.json(source_location_path,multiLine=True, pathGlobFilter=f"*_{file_name}")

To write:

df.write.partitionBy("month").option("mergeSchema","true").mode("append").saveAsTable(table_name)

Day 1 file content:

{
		"id": "1",
		"name": "test",
		"check": [
			{
				"id": "1",
				"value": "con1",
				"comment": "test"
			},
			{
				"id": "2",
				"value": "con2",
				"comment": "test"
			}
		],
		"detail": [],
		"namedetail": {
			"1abc2efg": "Test Name"
			}
	}

Day 2 file content:

{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {}
	}
 
 
or
 
 
{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {"94jut2jfn": "Tester"}
	}

Issue 1: When we try to append the values into the existing table, we get "Failed to merge incompatible data types", because the day 1 file has values for all of the corresponding complex and primitive columns.

The day 2 file has empty values, so the inferred data types are different.

For example:

Day 1: column "check" is an array of struct, since it has a nested structure.

Day 2: column "check" is just an array, since it doesn't contain any records.

Issue 2: Day 1: column "namedetail" is a struct type and holds the value ("1abc2efg": "Test Name").

Day 2: The same column "namedetail" holds a different key ("94jut2jfn": "Tester").

So Spark treats the day 2 "namedetail" key as a new column, and we aren't able to merge it.

We also face an issue when "namedetail" has no records at all.
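
For illustration, this is roughly what Spark's schema inference produces for the two days (a sketch; day1_path and day2_path are hypothetical placeholders for the daily folders, and day 2 here is the second variant above):

spark.read.json(day1_path, multiLine=True).printSchema()
# check: array<struct<comment:string,id:string,value:string>>
# namedetail: struct<1abc2efg:string>

spark.read.json(day2_path, multiLine=True).printSchema()
# check: array<string>  (empty array, so the element type falls back to string)
# namedetail: struct<94jut2jfn:string>  (the dynamic key becomes a new field)

Neither pair of types can be merged on append, hence the error.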

Can you please help me handle data type mismatches and dynamic column changes?

Thanks,

HK

1 REPLY

Shalabh007
Honored Contributor

@Hare Krishnan, the issues highlighted can easily be handled by using .option("mergeSchema", "true") at the time of reading all the files.

Sample code:

spark.read.option("mergeSchema", "true").json(<file paths>, multiLine=True)

The only scenario this will not be able to handle is when the type inside your nested column is not the same across files.

Sample file 1:

{
	"name": "test",
	"check": [
		{
			"id": "1"
		}
	]
}

Sample file 2:

{
	"name": "test",
	"check": [
		{
			"id": 1
		}
	]
}

For the above two files the mergeSchema option will fail, because the "id" field inside "check" has two different types of values: string (in file 1) and int (in file 2).

To handle this scenario you will have to write some custom function; one option is sketched below.
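
For example, a minimal sketch of one such custom handling (an illustration only; the exact schema depends on your data): define the schema explicitly instead of relying on inference, and read "namedetail" as a map so that its dynamic keys become map entries rather than new struct fields.

from pyspark.sql.types import (
    ArrayType, MapType, StringType, StructField, StructType
)

# Explicit schema: inference is skipped, so empty or changed values in
# later files can no longer alter the column types.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("check", ArrayType(StructType([
        StructField("id", StringType()),
        StructField("value", StringType()),
        StructField("comment", StringType()),
    ]))),
    # "detail" is always empty in the samples; string elements are an assumption.
    StructField("detail", ArrayType(StringType())),
    # A map instead of a struct: keys like "1abc2efg" and "94jut2jfn" become
    # entries in the same column instead of new nested fields.
    StructField("namedetail", MapType(StringType(), StringType())),
])

df = spark.read.schema(schema).json(source_location_path, multiLine=True,
                                    pathGlobFilter=f"*_{file_name}")

Note that if the target table was already created with "namedetail" inferred as a struct, it would have to be recreated once for the map type to take effect.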
