Data Engineering

How to write ObjectId value using Spark connector 10.2.2

ask005
New Contributor

In the PySpark MongoDB connector, how do I handle _id as an ObjectId when updating records?

spark 3.2.4
scala2.13
sparkMongoConnector 2.12-10.2.2

1 REPLY

mark_ott
Databricks Employee

To write an ObjectId value with Spark Mongo Connector 10.2.2 in PySpark while updating records, you must express the ObjectId hex string as MongoDB Extended JSON. The connector does not recognize a plain string as an ObjectId; without conversion it stores the value in MongoDB as a string rather than as the expected BSON ObjectId type.

Required Technique

  • Format the ObjectId value using a JSON structure: {"$oid": "<hex string here>"}

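  • When creating or updating your DataFrame, convert the _id field (or any ObjectId field) into a string column that holds this JSON.

  • Set the write option convertJson to objectOrArrayOnly (or any), for example .config("spark.mongodb.write.convertJson", "objectOrArrayOnly") on the SparkSession builder.

    • This tells the connector to parse JSON strings as Extended JSON, so a value of the form {"$oid": "<hex string here>"} is written as a BSON ObjectId.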
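The formatting step above can be sketched in plain Python. The helper name to_oid_json and the sample id are illustrative assumptions, not part of the connector; the sketch only shows the string shape the connector is expected to parse.

```python
import json
import re

# A 12-byte ObjectId is conventionally written as 24 hexadecimal characters.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def to_oid_json(hex_id: str) -> str:
    """Wrap an ObjectId hex string as Extended JSON: {"$oid": "<hex>"}.

    Hypothetical helper for illustration; rejects values that are not
    24 hex characters, since those cannot be valid ObjectIds.
    """
    if not _OBJECT_ID_HEX.fullmatch(hex_id):
        raise ValueError(f"not a 24-character hex ObjectId: {hex_id!r}")
    return json.dumps({"$oid": hex_id})


# Sample id chosen only for demonstration.
example = to_oid_json("507f1f77bcf86cd799439011")
print(example)
```

On a DataFrame, the same string can be built column-wise with concat(lit('{"$oid": "'), col("_id"), lit('"}')), which avoids a Python UDF.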

PySpark Example

python
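from pyspark.sql.functions import col, concat, lit

# Example with an existing DataFrame `df` whose '_id' column holds the ObjectId hex string.
# Wrap it as the Extended JSON string {"$oid": "<hex>"} so the connector can parse it.
df = df.withColumn("_id", concat(lit('{"$oid": "'), col("_id"), lit('"}')))

# Write with JSON conversion enabled so the string is stored as a BSON ObjectId.
# Host, port, database, and collection are placeholders.
df.write \
    .format("mongodb") \
    .option("spark.mongodb.write.connection.uri", "mongodb://host:port") \
    .option("spark.mongodb.write.database", "database") \
    .option("spark.mongodb.write.collection", "collection") \
    .option("spark.mongodb.write.convertJson", "objectOrArrayOnly") \
    .mode("append") \
    .save()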
  • Ensure that each ObjectId column is a string of the form {"$oid": "hexid"} before writing.

MongoDB Connector Version, Spark, and Scala

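  • The approach applies to Spark 3.2.4 with connector 10.2.2. Note that the Scala suffix of the connector artifact must match the Scala version of your Spark build: mongo-spark-connector_2.12 is built for Scala 2.12, so a Scala 2.13 cluster needs mongo-spark-connector_2.13 instead.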

Key Details

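  • Plain string ObjectIds will be written as strings; you must use the Extended JSON string format above for true ObjectId behavior in MongoDB.

  • This method is necessary for both insertion and update operations. For updates in particular, _id must be written as an ObjectId so that it matches the existing documents.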
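For the update case, a minimal sketch of the write options might look like the following. The function name mongo_write_options and the connection values are hypothetical; the option keys (connection.uri, database, collection, convertJson, operationType, idFieldList) reflect my understanding of the 10.x write configuration and should be checked against the connector documentation for your version.

```python
def mongo_write_options(uri, database, collection, operation_type="update"):
    """Build an options dict for a MongoDB write (hypothetical helper).

    Assumes the 10.x short-form option keys; verify them for your version.
    """
    allowed = {"insert", "replace", "update"}
    if operation_type not in allowed:
        raise ValueError(f"operation_type must be one of {sorted(allowed)}")
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
        # Parse {"$oid": "..."} strings into BSON ObjectIds on write.
        "convertJson": "objectOrArrayOnly",
        # Match existing documents on _id and update them.
        "operationType": operation_type,
        "idFieldList": "_id",
    }


# Placeholder connection values for demonstration only.
opts = mongo_write_options("mongodb://host:27017", "database", "collection")
print(opts["operationType"])
```

With a DataFrame prepared as described above, the options can be applied with df.write.format("mongodb").options(**opts).mode("append").save().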

This approach should allow you to maintain the ObjectId type as required by MongoDB in your updates and inserts using PySpark.