To write an ObjectId value with the Spark MongoDB Connector 10.2.2 in PySpark, whether inserting or updating records, you must express the ObjectId as extended JSON rather than a plain string. The connector does not automatically recognize a hex string as an ObjectId; without conversion it stores the value as a string in MongoDB instead of the expected BSON ObjectId type.
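For illustration, the difference between the two outcomes looks like this (a minimal sketch using pymongo's bson.ObjectId and a made-up hex value):

from bson import ObjectId  # from pymongo, shown only to illustrate the type difference

# Without conversion, the connector stores the hex string as a plain string field:
stored_as_string = {"_id": "665f1c2ab3d4e5f6a7b8c9d0"}
# The goal is a document whose _id is a true BSON ObjectId:
stored_as_objectid = {"_id": ObjectId("665f1c2ab3d4e5f6a7b8c9d0")}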
Required Technique
- Format the ObjectId value as an extended JSON string: {"$oid": "<hex string here>"}
- When creating or updating your DataFrame, convert the _id field (or any ObjectId field) into this format.
- Set the Spark option .config("spark.mongodb.write.convertJson", "objectOrArrayOnly"), or pass convertJson as a write option (a SparkSession sketch follows this list).
- This tells the connector to parse the extended JSON string and convert it to a BSON ObjectId when writing to MongoDB.
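As a rough sketch of the SparkSession configuration (hostname, database, and collection names below are placeholders, not values from the original question):

from pyspark.sql import SparkSession

# Placeholder connection details; adjust host, database, and collection for your cluster.
spark = (
    SparkSession.builder
    .appName("objectid-write-example")
    .config("spark.mongodb.write.connection.uri", "mongodb://host:27017")
    .config("spark.mongodb.write.database", "database")
    .config("spark.mongodb.write.collection", "collection")
    # Parse extended JSON strings such as {"$oid": "..."} into BSON types on write.
    .config("spark.mongodb.write.convertJson", "objectOrArrayOnly")
    .getOrCreate()
)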
PySpark Example
from pyspark.sql.functions import col, concat, lit

# Example with an existing DataFrame `df` whose '_id' field holds the ObjectId hex string.
# Wrap each hex string in extended JSON ({"$oid": "<hex>"}) so the connector can convert it.
df = df.withColumn("_id", concat(lit('{"$oid": "'), col("_id"), lit('"}')))

# Writing with the ObjectId conversion enabled (connector 10.x option names)
df.write \
    .format("mongodb") \
    .option("connection.uri", "mongodb://host:port") \
    .option("database", "database") \
    .option("collection", "collection") \
    .option("convertJson", "objectOrArrayOnly") \
    .mode("append") \
    .save()
- Ensure that each ObjectId column is a string formatted as {"$oid": "<hex id>"} before writing; a quick verification with pymongo is sketched below.
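To confirm the write produced real ObjectIds, one option is to read a document back with pymongo (a sketch assuming pymongo is available and the same placeholder connection details as above):

from pymongo import MongoClient
from bson import ObjectId

# Placeholder connection details; match them to the URI used in the Spark write.
client = MongoClient("mongodb://host:27017")
doc = client["database"]["collection"].find_one()

# True if the connector converted the extended JSON string into a BSON ObjectId.
print(isinstance(doc["_id"], ObjectId))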
MongoDB Connector Version, Spark, and Scala
- Compatible with Spark 3.2.4 and mongo-spark-connector 10.2.2; pick the artifact whose suffix matches your Scala build (mongo-spark-connector_2.12-10.2.2 for Scala 2.12, the _2.13 artifact for Scala 2.13). A dependency sketch follows below.
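One way to pull in the connector, assuming you let Spark resolve it from Maven Central via spark.jars.packages (shown for the Scala 2.12 artifact):

from pyspark.sql import SparkSession

# Resolve the connector from Maven Central at startup; swap _2.12 for _2.13 if your
# Spark build uses Scala 2.13.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.2.2")
    .getOrCreate()
)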
Key Details
- Direct string ObjectIds will be written as plain strings; you must use the extended JSON format above for a true ObjectId in MongoDB.
- The same conversion is needed for both insert and update operations (see the update sketch after this list).
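For the update case, a possible sketch using the connector's operationType and idFieldList write options (connection details remain placeholders, and _id must already be in the {"$oid": "<hex>"} string form so it matches the stored ObjectId after conversion):

# Update existing documents matched by _id instead of inserting new ones.
df.write \
    .format("mongodb") \
    .option("connection.uri", "mongodb://host:port") \
    .option("database", "database") \
    .option("collection", "collection") \
    .option("convertJson", "objectOrArrayOnly") \
    .option("operationType", "update") \
    .option("idFieldList", "_id") \
    .mode("append") \
    .save()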
This approach should allow you to maintain the ObjectId type as required by MongoDB in your updates and inserts using PySpark.