Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Question on how to properly write a dataset of custom objects to MongoDB

Mathias_Peters
Contributor II

Hi, 

I am implementing a Spark job in Kotlin (unfortunately a must-have) that reads from and writes to MongoDB. The reason for this is to reuse existing code in a MapFunction. The result of applying that map is a Dataset of type Consumer, a custom object from our code base that is serializable with the kotlinx serializer. I also have code available to serialize that Consumer into a BsonDocument.

In my first attempt, I typed the MapFunction to return a BsonDocument and then called:

rm.write().format("mongodb").mode("append").save()

with rm being the Dataset of type BsonDocument. However, that stores the data in binary form, like this:

Binary.createFromBase64('rO0ABXNyAChvcmcuYnNvbi5Cc29uRG9jdW1lbnQkU2VyaWFsaXphdGlvblByb3h5AAAAAAAAAAECAAFbAAVieXRlc3QAAltCeHB1…', 0)

I assume that the MongoDB connector's dataset writer serializes the BsonDocuments again.

Is this the case?
How can I write the dataset of consumers to MongoDB and have them stored as normal documents?

Thank you

1 REPLY

mark_ott
Databricks Employee

You are correct: when you pass a BsonDocument to Spark's MongoDB connector using .write().format("mongodb"), Spark treats the unknown type as a generic serialized blob, so each record is stored as a single binary field (as you observed) rather than as a normal document.

Why Binary Data Appears

  • Spark's DataFrame/Dataset write to MongoDB expects Row-like (schema-based) objects or case classes (for Scala/Java APIs), not a raw BSON type.

  • If you give it a raw BsonDocument (or a serialized object/byte array), the connector serializes it again (usually via Java serialization), and the result is a single binary or base64 field in the stored document.

How to Store as Normal BSON Documents

Approach 1: Convert to a Schema-based Object/Map

Convert each Consumer (or the BsonDocument you already produce from it) into a format Spark can natively map to columns/fields:

  • Convert to a Map<String, Any?> or a case class/data class structure matching your document fields.

  • Use Dataset<Row> (or Spark StructType schema inference from your case/data class).

  • Spark's MongoDB connector will map fields 1:1 with BSON document fields with no binary wrapper.

Example Workflow

  1. Map your Consumer to a Map<String, Any?> or Kotlin data class mirroring the MongoDB schema.

  2. Create a Dataset<Row> from this transformed data.

  3. Write with Spark’s MongoDB connector:

kotlin
val schema: StructType = ... // define your schema
val df = spark.createDataFrame(rdd, schema)
df.write()
    .format("mongodb")
    .option("uri", "mongodb://...")
    .mode("append")
    .save()

Or, if you already have a typed dataset of data classes (using Spark's Java bean encoder support, which works from Kotlin):

kotlin
val consumerDataSet: Dataset<ConsumerPojo> = ...
consumerDataSet.write()
    .format("mongodb")
    .option("uri", "mongodb://...")
    .mode("append")
    .save()

Approach 2: Use Document-to-Row Conversion

You can use your own conversion logic (or existing helper code) to translate each BsonDocument into a Row with an associated schema. This enables Spark to treat each key-value pair as a document field rather than exposing an opaque binary blob.
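
A minimal sketch of that conversion, assuming a hypothetical document shape with id, name, and age fields (bsonDocs is a List<BsonDocument> and spark a SparkSession, both assumed to be in scope):

kotlin
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.types.DataTypes
import org.bson.BsonDocument

// Hypothetical schema matching the assumed document shape.
val schema = DataTypes.createStructType(listOf(
    DataTypes.createStructField("id", DataTypes.StringType, false),
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)
))

// Pull each field out of the BsonDocument in the same order as the schema.
fun toRow(doc: BsonDocument): Row = RowFactory.create(
    doc.getString("id").value,
    doc.getString("name").value,
    doc.getInt32("age").value
)

val rows = bsonDocs.map { toRow(it) }
val df = spark.createDataFrame(rows, schema)

df.write()
    .format("mongodb")
    .option("uri", "mongodb://...")
    .mode("append")
    .save()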

Approach 3: Custom Encoder (if using kotlin/kotlinx.serializer)

If you leverage the kotlinx serializer, serialize to a format Spark understands (e.g., a Map or a POJO/data class) and avoid serializing to a byte array. Spark must see the logical fields of your object in order to map them directly to MongoDB document fields.
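
One practical way to do this (a sketch, not the connector's own API) is to serialize each Consumer to a JSON string with kotlinx.serialization and let Spark infer the schema from the JSON before writing. This assumes Consumer is annotated with @Serializable, as described above, and that spark and a local consumers list are in scope:

kotlin
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import org.apache.spark.sql.Encoders

// Serialize each Consumer to a JSON string instead of a byte array.
val jsonStrings = consumers.map { Json.encodeToString(it) }
val jsonDs = spark.createDataset(jsonStrings, Encoders.STRING())

// Spark infers a logical schema from the JSON, so the connector
// writes ordinary documents instead of one opaque binary field.
val df = spark.read().json(jsonDs)

df.write()
    .format("mongodb")
    .option("uri", "mongodb://...")
    .mode("append")
    .save()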

Key Takeaways

  • Never provide raw BsonDocument or serialized objects to Spark's MongoDB connector. This causes the binary field issue.

  • Represent each record as a structured object (Map/data class/Row) with simple types.

  • When Spark sees a logical schema, it automatically serializes objects to BSON fields correctly in MongoDB.