- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2022 03:49 PM
I am currently using a Python notebook with a defined schema to import fairly unstructured documents in MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:
StructField("My Field Name", StringType())
Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?
- Labels:
-
Field Names
-
MongoDB
-
Python notebook
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2022 11:01 PM
if the structure does not change all the time you could use the renaming of columns in a more automated way like described here.
But this example does not handle nested columns.
You could also try to create a schema without spaces and pass that when you read the data.
This can be done manually or programatically (although this can be a challenge for deeply nested structures).
The second method seems better imo. As the schema method returns a nested list/array/...
Python and Scala have quite some collection parsing possibilities. Also the fact that the StructField type has an attribute called 'name' is useful.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2022 09:29 PM
Solution: It turns out the issue is not the schema reading in, but the fact that I am writing to Delta tables, which do not currently support spaces. So, I need to transform them prior to dumping. I've been following a pattern of reading in raw data, which has spaces in the fields, then transforming after the fact. Since this is a highly nested structure (MongoDB), using renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2022 11:01 PM
if the structure does not change all the time you could use the renaming of columns in a more automated way like described here.
But this example does not handle nested columns.
You could also try to create a schema without spaces and pass that when you read the data.
This can be done manually or programatically (although this can be a challenge for deeply nested structures).
The second method seems better imo. As the schema method returns a nested list/array/...
Python and Scala have quite some collection parsing possibilities. Also the fact that the StructField type has an attribute called 'name' is useful.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
04-02-2022 03:38 AM
Thanks! I used this pattern of adding underscores to simplify raw dumping.

