topic Re: Importing MongoDB with field names containing spaces in Data Engineering

Importing MongoDB with field names containing spaces

Mr__E — Tue, 15 Feb 2022 23:49:45 GMT

I am currently using a Python notebook with a defined schema to import fairly unstructured documents in MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:

StructField("My Field Name", StringType())

Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?

Re: Importing MongoDB with field names containing spaces

Mr__E — Wed, 16 Feb 2022 05:29:42 GMT

Solution: It turns out the issue is not the schema reading in, but the fact that I am writing to Delta tables, which do not currently support spaces. So, I need to transform them prior to dumping. I've been following a pattern of reading in raw data, which has spaces in the fields, then transforming after the fact. Since this is a highly nested structure (MongoDB), using renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?

Re: Importing MongoDB with field names containing spaces

-werners- — Wed, 16 Feb 2022 07:01:16 GMT

if the structure does not change all the time you could use the renaming of columns in a more automated way like described here.

But this example does not handle nested columns.

You could also try to create a schema without spaces and pass that when you read the data.

This can be done manually or programatically (although this can be a challenge for deeply nested structures).

The second method seems better imo. As the schema method returns a nested list/array/...

Python and Scala have quite some collection parsing possibilities. Also the fact that the StructField type has an attribute called 'name' is useful.

Example

Re: Importing MongoDB with field names containing spaces

Mr__E — Sat, 02 Apr 2022 10:38:08 GMT

Thanks! I used this pattern of adding underscores to simplify raw dumping.