Importing MongoDB with field names containing spaces

Mr__E
Contributor II

I am currently using a Python notebook with a defined schema to import fairly unstructured documents from MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:

StructField("My Field Name", StringType())

Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

If the structure does not change all the time, you could rename the columns in a more automated way, as described here.

But that example does not handle nested columns.

You could also try to create a schema without spaces and pass that when you read the data.

This can be done manually or programmatically (although the latter can be a challenge for deeply nested structures).

The second method seems better, in my opinion, since the schema method returns a nested list/array structure.

Python and Scala have plenty of collection-parsing options. The fact that the StructField type has a 'name' attribute is also useful.

Example


3 REPLIES

Mr__E
Contributor II

Solution: It turns out the issue is not the schema used for reading, but the fact that I am writing to Delta tables, which do not currently support spaces in column names. So, I need to transform the names prior to dumping. I've been following a pattern of reading in raw data, which has spaces in the fields, then transforming after the fact. Since this is a highly nested structure (MongoDB), renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?
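Since the error message lists exactly which characters are rejected (space, comma, semicolon, braces, parentheses, newline, tab, equals), one option before writing to Delta is a small normalizer applied to every column name. This is a sketch, and the helper name is illustrative:

```python
import re

# The characters the error message reports as invalid: " ,;{}()\n\t="
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def normalize_name(name: str) -> str:
    """Replace each character Delta disallows in column names with an underscore."""
    return _INVALID_CHARS.sub("_", name)
```

For top-level columns this can be applied in one pass with `df.toDF(*[normalize_name(c) for c in df.columns])`; nested struct fields would still need the recursive schema-rewriting approach.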


Thanks! I used this pattern of adding underscores to simplify raw dumping.
