Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Importing MongoDB with field names containing spaces

Mr__E
Contributor II

I am currently using a Python notebook with a defined schema to import fairly unstructured documents in MongoDB. Some of these documents have spaces in their field names. I define the schema for the MongoDB PySpark connector like the following:

from pyspark.sql.types import StructField, StringType
StructField("My Field Name", StringType())

Unfortunately, this gives me the error "Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema." I would be happy to rename the column, but I have to be able to import it from MongoDB first. Is there a way to do this with the schema? Or am I forced to write a UDF to convert a JSON string with the bad field name into normalized columns?

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

If the structure does not change all the time, you could rename the columns in a more automated way, as described here. Note, though, that that example does not handle nested columns.

You could also try to create a schema without spaces and pass it when you read the data. This can be done manually or programmatically (although it can be a challenge for deeply nested structures).

The second method seems better to me: the schema is exposed as a nested structure of StructTypes and StructFields, Python and Scala both have rich collection-processing facilities, and the fact that StructField has a 'name' attribute is useful here.

Example
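The schema-walking approach above can be sketched in pure Python by round-tripping the schema through its JSON form (`StructType.jsonValue()` on the way out, `StructType.fromJson()` on the way back in PySpark). The helper below is an illustrative sketch, not the linked example; the function name `sanitize_schema_json` is mine.

```python
def sanitize_schema_json(node, repl="_"):
    """Recursively replace spaces in every field 'name' of a Spark schema
    expressed as its JSON dict (the output of StructType.jsonValue()).
    Walks the whole dict/list tree, so arbitrarily nested structs,
    arrays, and maps are covered."""
    if isinstance(node, dict):
        return {
            k: (v.replace(" ", repl) if k == "name" and isinstance(v, str)
                else sanitize_schema_json(v, repl))
            for k, v in node.items()
        }
    if isinstance(node, list):
        return [sanitize_schema_json(v, repl) for v in node]
    return node


# A schema fragment like the one in the question, as jsonValue() output:
schema = {
    "type": "struct",
    "fields": [
        {"name": "My Field Name", "type": "string",
         "nullable": True, "metadata": {}},
        {"name": "nested doc",
         "type": {"type": "struct",
                  "fields": [{"name": "inner field", "type": "long",
                              "nullable": True, "metadata": {}}]},
         "nullable": True, "metadata": {}},
    ],
}

clean = sanitize_schema_json(schema)
# clean["fields"][0]["name"] == "My_Field_Name"
```

In a notebook you would then rebuild the schema with `StructType.fromJson(clean)` and pass it to the MongoDB reader, so the DataFrame never carries a column name with a space.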


3 REPLIES

Mr__E
Contributor II

Solution: It turns out the issue is not reading the schema in, but the fact that I am writing to Delta tables, which do not currently support spaces in column names. So I need to transform them prior to dumping. I've been following a pattern of reading in the raw data, which has spaces in its fields, and transforming after the fact. Since this is a highly nested structure (MongoDB), renaming columns individually will be difficult. Any thoughts on the best practice? Should I just start transforming the raw data immediately?
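One way to "transform the raw data immediately" is to sanitize the document keys themselves before Spark ever applies a schema, so spaces never reach the Delta writer. A minimal sketch, assuming you can intercept documents as Python dicts; the function name `sanitize_doc` is hypothetical:

```python
def sanitize_doc(doc, repl="_"):
    """Recursively replace spaces in the keys of a MongoDB-style document
    so the resulting column names are legal for a Delta table.
    Recurses through nested dicts and lists of sub-documents.
    Caveat: keys that differ only by spaces vs. underscores
    (e.g. "a b" and "a_b") would collide after renaming."""
    if isinstance(doc, dict):
        return {k.replace(" ", repl): sanitize_doc(v, repl)
                for k, v in doc.items()}
    if isinstance(doc, list):
        return [sanitize_doc(v, repl) for v in doc]
    return doc


doc = {"My Field Name": "x",
       "nested doc": [{"inner field": 1}]}
# sanitize_doc(doc) == {"My_Field_Name": "x",
#                       "nested_doc": [{"inner_field": 1}]}
```

Renaming at the document level avoids touching each nested column individually after the DataFrame is built.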


Thanks! I used this pattern of adding underscores to simplify raw dumping.
