To write an ObjectId value with the Spark MongoDB Connector 10.2.2 in PySpark, whether inserting or updating records, you must express the ObjectId as extended JSON rather than a plain string. The connector does not automatically recognize a hex string as an ObjectId; without conversion it stores the value as a string in MongoDB instead of the expected BSON ObjectId type.
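For illustration, the difference between the two outcomes looks like this (a minimal sketch using pymongo's bson.ObjectId and a made-up hex value):

from bson import ObjectId  # from pymongo, shown only to illustrate the type difference

# Without conversion, the connector stores the hex string as a plain string field:
stored_as_string = {"_id": "665f1c2ab3d4e5f6a7b8c9d0"}
# The goal is a document whose _id is a true BSON ObjectId:
stored_as_objectid = {"_id": ObjectId("665f1c2ab3d4e5f6a7b8c9d0")}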
Required Technique
- Format the ObjectId value as an extended JSON string: {"$oid": "<hex string here>"}
- When creating or updating your DataFrame, convert the _id field (or any ObjectId field) into this format.
- Set the Spark option .config("spark.mongodb.write.convertJson", "objectOrArrayOnly"), or pass convertJson as a write option (a SparkSession sketch follows this list).
- This tells the connector to parse the extended JSON string and convert it to a BSON ObjectId when writing to MongoDB.
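As a rough sketch of the SparkSession configuration (hostname, database, and collection names below are placeholders, not values from the original question):

from pyspark.sql import SparkSession

# Placeholder connection details; adjust host, database, and collection for your cluster.
spark = (
    SparkSession.builder
    .appName("objectid-write-example")
    .config("spark.mongodb.write.connection.uri", "mongodb://host:27017")
    .config("spark.mongodb.write.database", "database")
    .config("spark.mongodb.write.collection", "collection")
    # Parse extended JSON strings such as {"$oid": "..."} into BSON types on write.
    .config("spark.mongodb.write.convertJson", "objectOrArrayOnly")
    .getOrCreate()
)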
PySpark Example
from pyspark.sql.functions import col, concat, lit

# Example with an existing DataFrame `df` whose '_id' field holds the ObjectId hex string.
# Wrap each hex string in extended JSON ({"$oid": "<hex>"}) so the connector can convert it.
df = df.withColumn("_id", concat(lit('{"$oid": "'), col("_id"), lit('"}')))

# Writing with the ObjectId conversion enabled (connector 10.x option names)
df.write \
    .format("mongodb") \
    .option("connection.uri", "mongodb://host:port") \
    .option("database", "database") \
    .option("collection", "collection") \
    .option("convertJson", "objectOrArrayOnly") \
    .mode("append") \
    .save()
- Ensure that each ObjectId column is a string formatted as {"$oid": "<hex id>"} before writing; a quick verification with pymongo is sketched below.
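To confirm the write produced real ObjectIds, one option is to read a document back with pymongo (a sketch assuming pymongo is available and the same placeholder connection details as above):

from pymongo import MongoClient
from bson import ObjectId

# Placeholder connection details; match them to the URI used in the Spark write.
client = MongoClient("mongodb://host:27017")
doc = client["database"]["collection"].find_one()

# True if the connector converted the extended JSON string into a BSON ObjectId.
print(isinstance(doc["_id"], ObjectId))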
MongoDB Connector Version, Spark, and Scala
- Compatible with Spark 3.2.4 and mongo-spark-connector 10.2.2; pick the artifact whose suffix matches your Scala build (mongo-spark-connector_2.12-10.2.2 for Scala 2.12, the _2.13 artifact for Scala 2.13). A dependency sketch follows below.
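One way to pull in the connector, assuming you let Spark resolve it from Maven Central via spark.jars.packages (shown for the Scala 2.12 artifact):

from pyspark.sql import SparkSession

# Resolve the connector from Maven Central at startup; swap _2.12 for _2.13 if your
# Spark build uses Scala 2.13.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.2.2")
    .getOrCreate()
)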
Key Details
- Direct string ObjectIds will be written as plain strings; you must use the extended JSON format above for a true ObjectId in MongoDB.
- The same conversion is needed for both insert and update operations (see the update sketch after this list).
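For the update case, a possible sketch using the connector's operationType and idFieldList write options (connection details remain placeholders, and _id must already be in the {"$oid": "<hex>"} string form so it matches the stored ObjectId after conversion):

# Update existing documents matched by _id instead of inserting new ones.
df.write \
    .format("mongodb") \
    .option("connection.uri", "mongodb://host:port") \
    .option("database", "database") \
    .option("collection", "collection") \
    .option("convertJson", "objectOrArrayOnly") \
    .option("operationType", "update") \
    .option("idFieldList", "_id") \
    .mode("append") \
    .save()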
This approach should allow you to maintain the ObjectId type as required by MongoDB in your updates and inserts using PySpark.