cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Importing MongoDB with field names containing spaces

Mr__E
Contributor II

I am currently using a Python notebook with a defined schema to import fairly unstructured documents in MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:

StructField("My Field Name", StringType())

Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

if the structure does not change all the time you could use the renaming of columns in a more automated way like described here.

But this example does not handle nested columns.

You could also try to create a schema without spaces and pass that when you read the data.

This can be done manually or programatically (although this can be a challenge for deeply nested structures).

The second method seems better imo. As the schema method returns a nested list/array/...

Python and Scala have quite some collection parsing possibilities. Also the fact that the StructField type has an attribute called 'name' is useful.

Example

View solution in original post

3 REPLIES 3

Mr__E
Contributor II

Solution: It turns out the issue is not the schema reading in, but the fact that I am writing to Delta tables, which do not currently support spaces. So, I need to transform them prior to dumping. I've been following a pattern of reading in raw data, which has spaces in the fields, then transforming after the fact. Since this is a highly nested structure (MongoDB), using renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?

-werners-
Esteemed Contributor III

if the structure does not change all the time you could use the renaming of columns in a more automated way like described here.

But this example does not handle nested columns.

You could also try to create a schema without spaces and pass that when you read the data.

This can be done manually or programatically (although this can be a challenge for deeply nested structures).

The second method seems better imo. As the schema method returns a nested list/array/...

Python and Scala have quite some collection parsing possibilities. Also the fact that the StructField type has an attribute called 'name' is useful.

Example

Thanks! I used this pattern of adding underscores to simplify raw dumping.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!