02-15-2022 03:49 PM
I am currently using a Python notebook with a defined schema to import fairly unstructured documents from MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:
StructField("My Field Name", StringType())
Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?
02-15-2022 11:01 PM
If the structure does not change all the time, you could rename the columns in a more automated way, as described here.
Note, though, that this example does not handle nested columns.
You could also try to create a schema without spaces and pass it when you read the data.
This can be done manually or programmatically (although that can be a challenge for deeply nested structures).
The second method seems better in my opinion, since the schema is returned as a nested structure of StructType/StructField objects.
Python and Scala both have good facilities for processing such collections, and the fact that the StructField type has a 'name' attribute is useful here.
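A rough sketch of the programmatic approach. It operates on the JSON-compatible dict that PySpark's `StructType.jsonValue()` produces, so it needs no Spark session; `sanitize_names` is a hypothetical helper name, and the cleaned dict can be turned back into a schema with `StructType.fromJson(...)`:

```python
def sanitize_names(node):
    """Recursively copy a schema dict/list, replacing spaces in every
    'name' entry with underscores (descends into structs and arrays)."""
    if isinstance(node, dict):
        return {
            key: (value.replace(" ", "_")
                  if key == "name" and isinstance(value, str)
                  else sanitize_names(value))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [sanitize_names(item) for item in node]
    return node  # plain type strings, booleans, etc. pass through

# Example input in the shape StructType.jsonValue() emits
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "My Field Name", "type": "string",
         "nullable": True, "metadata": {}},
        {"name": "nested doc",
         "type": {
             "type": "struct",
             "fields": [
                 {"name": "inner field", "type": "long",
                  "nullable": True, "metadata": {}},
             ],
         },
         "nullable": True, "metadata": {}},
    ],
}

clean = sanitize_names(schema_json)
# clean["fields"][0]["name"] is now "My_Field_Name"
```

With PySpark loaded you would then do `StructType.fromJson(clean)` to get a schema object with the underscore names.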
02-15-2022 09:29 PM
Solution: It turns out the issue is not reading the schema in, but the fact that I am writing to Delta tables, which do not currently support spaces in column names. So I need to transform the names prior to dumping. I've been following a pattern of reading in raw data, which has spaces in the fields, and then transforming after the fact. Since this is a highly nested structure (MongoDB), renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?
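For the top-level fields, one common pattern is to build alias expressions from the column names and feed them to `selectExpr` before writing to Delta. A minimal sketch (`rename_exprs` is a hypothetical helper; backticks escape the original space-containing names, and nested struct fields would still need separate recursive handling):

```python
def rename_exprs(columns):
    """Build selectExpr strings that alias each top-level column to a
    version with spaces replaced by underscores."""
    return [f"`{c}` AS {c.replace(' ', '_')}" for c in columns]

exprs = rename_exprs(["My Field Name", "already_clean"])
# exprs[0] == "`My Field Name` AS My_Field_Name"
```

Applied as `df.selectExpr(*rename_exprs(df.columns))` on the raw DataFrame, this renames every top-level column in one pass.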
04-02-2022 03:38 AM
Thanks! I used this pattern of adding underscores to simplify raw dumping.