<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Failed to merge incompatible data types in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/failed-to-merge-incompatible-data-types/m-p/20526#M13862</link>
    <description>&lt;P&gt;@Hare Krishnan​&amp;nbsp;the issues highlighted can easily be handled using the .option("mergeSchema", "true") at the time of reading all the files.&lt;/P&gt;&lt;P&gt;Sample code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.option("mergeSchema", "true").json(&amp;lt;file paths&amp;gt;, multiLine=True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The only scenario this will not be able to handle if the type inside your nested column is not same.&lt;/P&gt;&lt;P&gt;Sample file 1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
	"name": "test",
	"check": [
		{
			"id": "1",
		},
	]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Sample file 2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
	"name": "test",
	"check": [
		{
			"id": 1,
		},
	]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;for above 2 files mergeSchema option will fail because "id" column inside check has 2 different types of values: string (in file 1) and int (in file 2).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To handle this scenario you will have to write some custom function.&lt;/P&gt;</description>
    <pubDate>Tue, 29 Nov 2022 10:48:55 GMT</pubDate>
    <dc:creator>Shalabh007</dc:creator>
    <dc:date>2022-11-29T10:48:55Z</dc:date>
    <item>
      <title>Failed to merge incompatible data types</title>
      <link>https://community.databricks.com/t5/data-engineering/failed-to-merge-incompatible-data-types/m-p/20525#M13861</link>
      <description>&lt;P&gt;We are processing the josn file from the storage location on every day and it will get archived once the records are appended into the respective tables.&lt;/P&gt;&lt;P&gt;source_location_path:  "..../mon=05/day=01/fld1" , "..../mon=05/day=01/fld2" ..... "..../mon=05/day=01/fldN"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;To read:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;df = spark.read.json(source_location_path,multiLine=True, pathGlobFilter=f"*_{file_name}")&lt;/P&gt;&lt;P&gt;&lt;B&gt;To write:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;df.write.partitionBy("month").option("mergeSchema","true").mode("append").saveAsTable(table_name)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Day 1 file content:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
		"id": "1",
		"name": "test",
		"check": [
			{
				"id": "1",
				"value": "con1",
				"comment": "test"
			},
			{
				"id": "2",
				"value": "con2",
				"comment": "test"
			}
		],
		"detail": [],
		"namedetail": {
			"1abc2efg": "Test Name"
			}
	}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Day 2 file content:&lt;/B&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {}
	}
&amp;nbsp;
&amp;nbsp;
or
&amp;nbsp;
&amp;nbsp;
{
		"id": "2",
		"name": "test2",
		"check": [],
		"detail": [],
		"namedetail": {"94jut2jfn": "Tester"}
	}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Issue 1: &lt;/B&gt;when we try to append the value into existing table we are getting "Failed to merge incompatible data types". Because day1 file has all the values for the corresponding complex &amp;amp; primitive columns.&lt;/P&gt;&lt;P&gt;Day 2 file has empty values since data type got changed.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For Ex:&lt;/P&gt;&lt;P&gt; Day 1: column "check" is "Array of struct"  - It has nested structure &lt;/P&gt;&lt;P&gt; Day 2: column "check is "Array" - since the column doesn't have any records&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Issue 2:&lt;/B&gt;   Day 1 - Column "namedetail" is Struct type and it has the value ("10abc2efg": "Test Name")&lt;/P&gt;&lt;P&gt;Day 2: The same column  "namedetail" has another value  ("94jut2jfn": "Tester")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;So its considering the day 2 "namedetail" column value as a new column. so we couldn't able to merge it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, if the  "namedetail" has empty records then we are facing an issue.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can you please help me to handle data types mismatch and dynamic column changes?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;HK&lt;/P&gt;</description>
      <pubDate>Mon, 16 May 2022 06:13:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/failed-to-merge-incompatible-data-types/m-p/20525#M13861</guid>
      <dc:creator>hare</dc:creator>
      <dc:date>2022-05-16T06:13:34Z</dc:date>
    </item>
    <item>
      <title>Re: Failed to merge incompatible data types</title>
      <link>https://community.databricks.com/t5/data-engineering/failed-to-merge-incompatible-data-types/m-p/20526#M13862</link>
      <description>&lt;P&gt;@Hare Krishnan​&amp;nbsp;the issues highlighted can easily be handled using the .option("mergeSchema", "true") at the time of reading all the files.&lt;/P&gt;&lt;P&gt;Sample code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.option("mergeSchema", "true").json(&amp;lt;file paths&amp;gt;, multiLine=True)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The only scenario this will not be able to handle if the type inside your nested column is not same.&lt;/P&gt;&lt;P&gt;Sample file 1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
	"name": "test",
	"check": [
		{
			"id": "1",
		},
	]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Sample file 2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;{
	"name": "test",
	"check": [
		{
			"id": 1,
		},
	]
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;for above 2 files mergeSchema option will fail because "id" column inside check has 2 different types of values: string (in file 1) and int (in file 2).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To handle this scenario you will have to write some custom function.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Nov 2022 10:48:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/failed-to-merge-incompatible-data-types/m-p/20526#M13862</guid>
      <dc:creator>Shalabh007</dc:creator>
      <dc:date>2022-11-29T10:48:55Z</dc:date>
    </item>
  </channel>
</rss>

