Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Importing MongoDB with field names containing spaces

Mr__E
Contributor II

I am currently using a Python notebook with a defined schema to import fairly unstructured documents in MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:

from pyspark.sql.types import StructField, StringType
StructField("My Field Name", StringType())

Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

If the structure does not change all the time, you could rename the columns in a more automated way, as described here. Note, though, that that example does not handle nested columns.

You could also try to create a schema without spaces and pass it when you read the data. This can be done manually or programmatically (although it can be a challenge for deeply nested structures).

The second method seems better to me: the schema is exposed as a nested structure of StructTypes and StructFields, Python and Scala both have rich collection-processing facilities, and the fact that StructField has a 'name' attribute is useful here.

Example
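The schema-walking approach above can be sketched in pure Python by round-tripping the schema through its JSON form (`StructType.jsonValue()` on the way out, `StructType.fromJson()` on the way back in PySpark). The helper below is an illustrative sketch, not the linked example; the function name `sanitize_schema_json` is mine.

```python
def sanitize_schema_json(node, repl="_"):
    """Recursively replace spaces in every field 'name' of a Spark schema
    expressed as its JSON dict (the output of StructType.jsonValue()).
    Walks the whole dict/list tree, so arbitrarily nested structs,
    arrays, and maps are covered."""
    if isinstance(node, dict):
        return {
            k: (v.replace(" ", repl) if k == "name" and isinstance(v, str)
                else sanitize_schema_json(v, repl))
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [sanitize_schema_json(v, repl) for v in node]
    return node


# A schema fragment like the one in the question, as jsonValue() output:
schema = {
    "type": "struct",
    "fields": [
        {"name": "My Field Name", "type": "string",
         "nullable": True, "metadata": {}},
        {"name": "nested doc",
         "type": {"type": "struct",
                  "fields": [{"name": "inner field", "type": "long",
                              "nullable": True, "metadata": {}}]},
         "nullable": True, "metadata": {}},
    ],
}

clean = sanitize_schema_json(schema)
# clean["fields"][0]["name"] == "My_Field_Name"
```

In a notebook you would then rebuild the schema with `StructType.fromJson(clean)` and pass it to the MongoDB reader, so the DataFrame never carries a column name with a space.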


3 REPLIES

Mr__E
Contributor II

Solution: It turns out the issue is not reading the schema in, but the fact that I am writing to Delta tables, which do not currently support spaces in column names. So I need to transform them prior to dumping. I've been following a pattern of reading in the raw data, which has spaces in its fields, and transforming after the fact. Since this is a highly nested structure (MongoDB), renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?
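One way to "transform the raw data immediately" is to sanitize the document keys themselves before Spark ever applies a schema, so spaces never reach the Delta writer. A minimal sketch, assuming you can intercept documents as Python dicts; the function name `sanitize_doc` is hypothetical:

```python
def sanitize_doc(doc, repl="_"):
    """Recursively replace spaces in the keys of a MongoDB-style document
    so the resulting column names are legal for a Delta table.
    Recurses through nested dicts and lists of sub-documents.
    Caveat: keys that differ only by spaces vs. underscores
    (e.g. "a b" and "a_b") would collide after renaming."""
    if isinstance(doc, dict):
        return {k.replace(" ", repl): sanitize_doc(v, repl)
                for k, v in doc.items()}
    if isinstance(doc, list):
        return [sanitize_doc(v, repl) for v in doc]
    return doc


doc = {"My Field Name": "x",
       "nested doc": [{"inner field": 1}]}
# sanitize_doc(doc) == {"My_Field_Name": "x",
#                       "nested_doc": [{"inner_field": 1}]}
```

Renaming at the document level avoids touching each nested column individually after the DataFrame is built.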


Thanks! I used this pattern of adding underscores to simplify raw dumping.
