mark_ott
Databricks Employee

To write an ObjectId value with Spark Mongo Connector 10.2.2 in PySpark while updating records, you must convert the ObjectId hex string into MongoDB Extended JSON form. The connector does not recognize a plain string as an ObjectId; without this conversion it stores the value as a string in MongoDB rather than as the expected BSON ObjectId type.

Required Technique

  • Format the ObjectId value as a MongoDB Extended JSON string: {"$oid": "<hex string here>"}

  • When creating or updating your DataFrame, convert the _id field (or any ObjectId field) into this format.

  • Set the Spark option: .config("spark.mongodb.write.convertJson", "objectOrArrayOnly")

    • This enables the connector to convert the JSON structure to a BSON ObjectId when writing to MongoDB.
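As a minimal sketch of the formatting step above, the helper below builds the {"$oid": ...} string in plain Python and rejects values that are not 24 hexadecimal characters. The name to_oid_json is hypothetical and not part of the connector or PySpark; it only illustrates the string the connector is expected to parse.

```python
import json
import re

# 24 hexadecimal characters is the textual form of a 12-byte ObjectId.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def to_oid_json(hex_id: str) -> str:
    """Return the Extended JSON string for an ObjectId (hypothetical helper)."""
    if not isinstance(hex_id, str) or _OBJECT_ID_HEX.fullmatch(hex_id) is None:
        raise ValueError(f"not a 24-character hex ObjectId: {hex_id!r}")
    # json.dumps takes care of quoting, so the result is always valid JSON.
    return json.dumps({"$oid": hex_id})


print(to_oid_json("507f1f77bcf86cd799439011"))
# {"$oid": "507f1f77bcf86cd799439011"}
```

A helper like this could be wrapped in a PySpark UDF when validation matters; for large DataFrames, building the same string with built-in column functions avoids the UDF overhead.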

PySpark Example

python
from pyspark.sql.functions import col, concat, lit

# Example with an existing DataFrame `df` whose '_id' column holds the hex string.
# Wrap the value in an Extended JSON string: {"$oid": "<hex>"}
df = df.withColumn("_id", concat(lit('{"$oid": "'), col("_id"), lit('"}')))

# Write with JSON conversion enabled so the string becomes a BSON ObjectId.
df.write \
    .format("mongodb") \
    .option("connection.uri", "mongodb://host:port/database.collection") \
    .option("spark.mongodb.write.convertJson", "objectOrArrayOnly") \
    .mode("append") \
    .save()

  • Ensure that each ObjectId column holds a string of the form {"$oid": "hexid"} before writing.

MongoDB Connector Version, Spark, and Scala

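  • Applies to Spark 3.2.4 with mongo-spark-connector 10.2.2. The Scala suffix of the connector artifact must match the Scala version of the Spark build: use mongo-spark-connector_2.12 with Scala 2.12 and mongo-spark-connector_2.13 with Scala 2.13.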
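To make the version pairing concrete, here is a small sketch that derives the Maven coordinate from the cluster's Scala version. The function connector_coordinate is a hypothetical helper; the resulting string is the kind of value typically passed to spark.jars.packages, and the exact artifact should be checked against what is installed on your cluster.

```python
def connector_coordinate(scala_version: str, connector_version: str = "10.2.2") -> str:
    """Build the Maven coordinate for the MongoDB Spark connector (hypothetical helper)."""
    # Keep only the binary version, e.g. "2.12.15" -> "2.12".
    scala_binary = ".".join(scala_version.split(".")[:2])
    if scala_binary not in ("2.12", "2.13"):
        raise ValueError(f"unsupported Scala binary version: {scala_binary}")
    return f"org.mongodb.spark:mongo-spark-connector_{scala_binary}:{connector_version}"


print(connector_coordinate("2.12.15"))
# org.mongodb.spark:mongo-spark-connector_2.12:10.2.2
```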

Key Details

  • Plain string ObjectIds will be written as strings; you must use the Extended JSON string format above to get a true ObjectId in MongoDB.

  • This method is necessary for both insertion and update operations.
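For the update case, one way to keep the write settings in one place is a small options builder, sketched below. The function mongo_write_options is hypothetical. The option names (connection.uri, database, collection, operationType, idFieldList, convertJson) are the connector 10.x write options as I understand them, so treat them as assumptions and verify them against the documentation for your connector version.

```python
def mongo_write_options(connection_uri, database, collection, operation_type="replace"):
    """Assemble write options for the MongoDB Spark connector (hypothetical helper)."""
    if operation_type not in ("insert", "replace", "update"):
        raise ValueError(f"unknown operationType: {operation_type}")
    return {
        "connection.uri": connection_uri,
        "database": database,
        "collection": collection,
        # "update" modifies the matched document instead of replacing it.
        "operationType": operation_type,
        # Field used to match existing documents; assumed to be _id here.
        "idFieldList": "_id",
        # Parse {"$oid": "..."} strings into BSON ObjectId values.
        "convertJson": "objectOrArrayOnly",
    }


opts = mongo_write_options("mongodb://host:port", "database", "collection", "update")

# Usage with an existing DataFrame `df` (requires a Spark session and the connector):
# df.write.format("mongodb").options(**opts).mode("append").save()
```

Whether the converted _id is also used when matching documents for the update is worth confirming with a small test write before running this against production data.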

This approach should let you keep the ObjectId type that MongoDB expects in your updates and inserts using PySpark.