We process the JSON files from the storage location every day, and each file is archived once its records are appended into the respective tables.
source_location_path: "..../mon=05/day=01/fld1" , "..../mon=05/day=01/fld2" ..... "..../mon=05/day=01/fldN"
To read:
df = spark.read.json(source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")
To write:
df.write.partitionBy("month").option("mergeSchema", "true").mode("append").saveAsTable(table_name)
Day 1 file content:
{
  "id": "1",
  "name": "test",
  "check": [
    {
      "id": "1",
      "value": "con1",
      "comment": "test"
    },
    {
      "id": "2",
      "value": "con2",
      "comment": "test"
    }
  ],
  "detail": [],
  "namedetail": {
    "1abc2efg": "Test Name"
  }
}
Day 2 file content:
{
  "id": "2",
  "name": "test2",
  "check": [],
  "detail": [],
  "namedetail": {}
}

or

{
  "id": "2",
  "name": "test2",
  "check": [],
  "detail": [],
  "namedetail": {
    "94jut2jfn": "Tester"
  }
}
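
To make the type mismatch concrete, here is a minimal sketch of the schemas Spark infers from these payloads (assuming a local Spark session; the records are trimmed versions of the sample files above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Day 1 style record: "check" carries nested structs, "namedetail" has one key
day1 = spark.read.json(sc.parallelize([
    '{"id": "1", "check": [{"id": "1", "value": "con1", "comment": "test"}], '
    '"namedetail": {"1abc2efg": "Test Name"}}'
]))
day1.printSchema()
# check:      array<struct<comment:string, id:string, value:string>>
# namedetail: struct<1abc2efg:string>

# Day 2 style record: "check" is empty and "namedetail" has a different key
day2 = spark.read.json(sc.parallelize([
    '{"id": "2", "check": [], "namedetail": {"94jut2jfn": "Tester"}}'
]))
day2.printSchema()
# check:      array<string>             <- empty array falls back to string elements
# namedetail: struct<94jut2jfn:string>  <- the new key becomes a new struct field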
Issue 1: When we try to append the Day 2 data into the existing table, we get "Failed to merge incompatible data types". The Day 1 file has values for every complex and primitive column, so the full nested types are inferred; the Day 2 file has empty values, so the inferred data types change.
For example:
Day 1: column "check" is inferred as an array of struct, since it has nested records.
Day 2: column "check" is inferred as a plain array (of strings), since the column has no records.
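
One direction we are considering for Issue 1 is to supply an explicit schema at read time, so that an empty "check" array keeps the Day 1 element type instead of being re-inferred. A minimal sketch of the idea (the schema below covers only a few columns as an example; our real files have more, and source_location_path / file_name are the variables from the read snippet above):

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Hypothetical explicit schema; the real one would have to list every
# column that can appear in the Day 1 files.
check_element = StructType([
    StructField("id", StringType()),
    StructField("value", StringType()),
    StructField("comment", StringType()),
])
explicit_schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("check", ArrayType(check_element)),
])

# With a fixed schema, an empty "check" in a Day 2 file still reads as
# array<struct<...>>, so the append should no longer hit the merge error.
df = spark.read.schema(explicit_schema).json(
    source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")

Is this the right approach, or is there a way to keep schema inference working?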
Issue 2: Day 1: column "namedetail" is inferred as a struct whose only field comes from the key in its value ("1abc2efg": "Test Name").
Day 2: the same column "namedetail" arrives with a different key ("94jut2jfn": "Tester").
So the Day 2 "namedetail" key is treated as a brand-new column, and we are not able to merge it.
We also face the same issue when "namedetail" is empty ({}).
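
For "namedetail", the keys look like data rather than column names, so we were wondering whether declaring the column as a map would stop every new key from becoming a new struct field. A sketch of what we mean (again an assumption on our side, not our production schema):

from pyspark.sql.types import MapType, StringType, StructField, StructType

# Declaring namedetail as map<string,string> keeps the dynamic keys as data:
# "1abc2efg" and "94jut2jfn" become map entries instead of new columns, and
# an empty {} should simply read as an empty map.
map_schema = StructType([
    StructField("namedetail", MapType(StringType(), StringType())),
])
df = spark.read.schema(map_schema).json(
    source_location_path, multiLine=True, pathGlobFilter=f"*_{file_name}")
df.printSchema()  # namedetail: map<string,string>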
Can you please help me handle these data type mismatches and dynamic column changes?
Thanks,
HK