Data Engineering
Azure DBR - Have to load a list of JSON files, but a column name contains special characters (ex: {"hydra:xxxx": {"hydra:value": "yyyy", "hydra:value1": "zzzzz"}})

hare
New Contributor III

Azure DBR - I have to load a list of JSON files into a DataFrame, and then from the DataFrame into a Databricks table, but a column name contains special characters and I get the error below.

Both the column names (keys) and the values (nested JSON records) in the file contain special characters.

# Cannot create a table having a nested column whose name contains invalid characters (',', ':', ';') in Hive metastore. Table: `db`.`tablename`; Column: hydra:xxxx

I also need to convert the struct type into a string.

Ex:

{
  "@context": "/contexts/xxxxx",
  "@id": "/xxxxx",
  "@type": "hydra:Collection",
  "hydra:xxxx": [
    {
      "@id": "/xxxxx/1",
      "@type": "product",
      "type": "update",
      "id": null,
      "pid": 2,
      "pCode": null,
      "details": "abcdefgh",
      "updt": "2019-01-09T03:04:27+01:00",
      "codes": ["123456789101"]
    },
    {
      "@id": "/xxxxx/2",
      "@type": "product",
      "type": "update",
      "id": null,
      "pid": 2,
      "pCode": null,
      "details": "abcdefgh",
      "updt": "2019-01-09T03:04:27+01:00",
      "codes": ["123456789101"]
    }
  ],
  "hydra:totalItems": 2,
  "hydra:view": {
    "@id": "/xxxxx?page=1",
    "@type": "hydra:vw",
    "hydra:first": "/xxxxx?page=1",
    "hydra:last": "/xxxxx?page=12",
    "hydra:next": "/xxxxx?page=245"
  },
  "hydra:search": {
    "@type": "hydra:IriTemplate",
    "hydra:template": "xxxxxxxxxxxxxxxxx",
    "hydra:variableRepresentation": "BasicRepresentation",
    "hydra:mapping": [
      {
        "hydra:value": "zzzzz",
        "@type": "xx",
        "variable": "yy[before]",
        "property": "yy",
        "required": false
      },
      {
        "hydra:value": "ssssss",
        "@type": "sf",
        "variable": "asf",
        "property": "sfs",
        "required": false
      }
    ]
  }
}

Please suggest how to resolve this. I am using PySpark in Azure Databricks.

4 REPLIES

-werners-
Esteemed Contributor III

You can rename the columns; there are worked examples on Stack Overflow for both PySpark and Scala.

You can also define your schema manually.

To flatten the JSON struct, there is, for example, the explode function.

hare
New Contributor III

Thanks for your solution, werners. Yes, we are able to rename the columns as you suggested. But the problem here is that the value of the column is also JSON (an array of structs).

For instance:

"hydra:mapping" is a column whose elements contain the key "hydra:value". Here, we have to rename "hydra:value" to "hydra_value".

"hydra:mapping": [
  {
    "hydra:value": "zzzzz",
    "@type": "xx",
    "variable": "yy[before]",
    "property": "yy",
    "required": false
  },

Hubert-Dudek
Esteemed Contributor III

The best option is to define the schema manually. There is a nice article from someone who had exactly the same problem: https://towardsdev.com/create-a-spark-hive-meta-store-table-using-nested-json-with-invalid-field-nam...

hare
New Contributor III

Thanks for the solution @Hubert Dudek​. In our scenario we can't define the schema manually, since the schema may vary from file to file. So I am thinking about how to do it dynamically based on the JSON schema. Please let me know if you have any thoughts on this.
