02-25-2022 09:52 PM
Azure DBR - Have to load list of json files into data frame and then from DF to data bricks table but the column has special character and getting below error.
Both column(key) and value (as json record) has special characters in the json file.
# Cannot create a table having a nested column whose name contains invalid characters (',', ':', ';') in Hive metastore. Table: `db`.`tablename`; Column: hydra:xxxx
Also needs to convert the struct type into string.
Ex:
{
"@context": "/contexts/xxxxx",
"@id": "/xxxxx",
"@type": "hydra:Collection",
"hydra:xxxx": [
{
"@id": "/xxxxx/1",
"@type": "product",
"type": "update",
"id": null,
"pid": 2,
"pCode": null,
"details": "abcdefgh",
"updt": "2019-01-09T03:04:27+01:00",
"codes": [
"123456789101"
]
},
{
"@id": "/xxxxx/2",
"@type": "product",
"type": "update",
"id": null,
"pid": 2,
"pCode": null,
"details": "abcdefgh",
"updt": "2019-01-09T03:04:27+01:00",
"codes": [
"123456789101"
]
}
],
"hydra:totalItems": 2,
"hydra:view": {
"@id": "/xxxxx?page=1",
"@type": "hydra:vw",
"hydra:first": "/xxxxx?page=1",
"hydra:last": "/xxxxx?page=12",
"hydra:next": "/xxxxx?page=245"
},
"hydra:search": {
"@type": "hydra:IriTemplate",
"hydra:template": "xxxxxxxxxxxxxxxxx",
"hydra:variableRepresentation": "BasicRepresentation",
"hydra:mapping": [
{
"hydra:value": "zzzzz"
"@type": "xx",
"variable": "yy[before]",
"property": "yy",
"required": false
},
{
"hydra:value": "ssssss"
"@type": "sf",
"variable": "asf",
"property": "sfs",
"required": false
}
]
}
}
Please provide your suggestion to resolve it. I am using pyspark in azure databricks.
02-25-2022 11:43 PM
You can rename the columns like here on SO (pyspark) or here in scala.
You can also define your schema manually.
To flatten the json struct, there is for example the explode function.
02-28-2022 08:37 AM
Thanks for your solution werners. Yes,we can able to rename the column as you suggested. But the problem here is the value of the column is also a json (Array of struct).
For an instance;
"hydra:mapping" - is a column which has the value called "hydra:value". Here, we have to rename the "hydra:value" as "hydra_value".
"hydra:mapping": [
{
"hydra:value": "zzzzz"
"@type": "xx",
"variable": "yy[before]",
"property": "yy",
"required": false
},
02-26-2022 10:05 AM
The best is just define schema manually. There is nice article from person who had exactly the same problem https://towardsdev.com/create-a-spark-hive-meta-store-table-using-nested-json-with-invalid-field-nam...
02-28-2022 08:39 AM
Thanks for the solution @Hubert Dudek . In our scenario we couldn't able to define the schema manually since schema may vary file to file. So i am thinking how to do it dynamically based on the Json schema. Please do let me know if you have any taught on this.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group