You are correct: when you pass a BsonDocument to Spark's MongoDB connector using .write().format("mongodb"), Spark treats the unknown type as a generic serialized blob, so each document is stored as a single binary field (as you observed) rather than as a normal document with embedded fields.
Why Binary Data Appears
- Spark's DataFrame/Dataset write to MongoDB expects Row-like (schema-based) objects or case classes (for the Scala/Java APIs), not a raw BSON type.
- If you give it a Kotlin BsonDocument (or a serialized object/byte array), Spark falls back to a generic binary encoder (Java or Kryo serialization), and the result is a single binary or base64 field in the stored document, as the sketch below illustrates.
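For illustration, a minimal sketch of how this happens (the function and variable names here are assumptions, not taken from your code): since Spark cannot derive a schema for BsonDocument, the only way to build a Dataset from it directly is a generic binary encoder, which collapses each document into one binary column:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession
import org.bson.BsonDocument

fun showBinaryColumn(spark: SparkSession, docs: List<BsonDocument>) {
    // BsonDocument carries no schema Spark understands, so the only encoder
    // available for it is a generic binary one (Kryo or Java serialization).
    val ds = spark.createDataset(docs, Encoders.kryo(BsonDocument::class.java))

    // The resulting Dataset has a single binary column named "value",
    // which is exactly the opaque blob that ends up stored in MongoDB.
    ds.printSchema()
    // root
    //  |-- value: binary (nullable = true)
}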
How to Store as Normal BSON Documents
Approach 1: Convert to a Schema-based Object/Map
Convert each Consumer (or each BsonDocument, if that is what you already have) into a format Spark can natively map to columns/fields:
- Convert to a Map<String, Any?> or a case class/data class structure matching your document fields.
- Use Dataset<Row> (or let Spark infer a StructType schema from your case/data class).
- Spark's MongoDB connector will then map fields 1:1 to BSON document fields, with no binary wrapper.
Example Workflow
- Map your Consumer to a Map<String, Any?> or a Kotlin data class mirroring the MongoDB schema.
- Create a Dataset<Row> from this transformed data.
- Write with Spark's MongoDB connector:
val schema: StructType = ... // define your schema
val df = spark.createDataFrame(rdd, schema)
df.write()
    .format("mongodb")
    .option("connection.uri", "mongodb://...") // connector 10.x uses "connection.uri" (plus database/collection options as needed)
    .mode("append")
    .save()
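To make the elided parts concrete, here is a sketch assuming a hypothetical Consumer(id, name, age); the field names and types are placeholders you would replace with your real model:
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DataTypes

// Hypothetical domain class standing in for your real Consumer.
data class Consumer(val id: String, val name: String, val age: Int)

fun writeConsumers(spark: SparkSession, consumers: List<Consumer>) {
    // Explicit schema whose fields mirror the desired MongoDB document fields.
    val schema = DataTypes.createStructType(listOf(
        DataTypes.createStructField("id", DataTypes.StringType, false),
        DataTypes.createStructField("name", DataTypes.StringType, true),
        DataTypes.createStructField("age", DataTypes.IntegerType, true)
    ))

    // One Row per Consumer, with plain values in the same order as the schema.
    val rows: List<Row> = consumers.map { RowFactory.create(it.id, it.name, it.age) }

    spark.createDataFrame(rows, schema)
        .write()
        .format("mongodb")
        .option("connection.uri", "mongodb://...") // plus database/collection options as needed
        .mode("append")
        .save()
}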
Or, if you have a Dataset of data classes (via Spark's Java bean encoders, which work fine from Kotlin):
val consumerDataSet: Dataset<ConsumerPojo> = ...
consumerDataSet.write()
    .format("mongodb")
    .option("connection.uri", "mongodb://...")
    .mode("append")
    .save()
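One way to obtain such a Dataset<ConsumerPojo> (a sketch, assuming a simple bean-style class, since Spark's bean encoder needs a no-arg constructor and getters/setters):
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

// Bean-style class: default values give Kotlin a no-arg constructor, and
// var properties give the getters/setters Spark's bean encoder relies on.
data class ConsumerPojo(
    var id: String = "",
    var name: String = "",
    var age: Int = 0
)

fun toConsumerDataset(spark: SparkSession, consumers: List<ConsumerPojo>): Dataset<ConsumerPojo> =
    // Each bean property becomes a column, so the connector writes it as its own BSON field.
    spark.createDataset(consumers, Encoders.bean(ConsumerPojo::class.java))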
Approach 2: Use Document-to-Row Conversion
You can write your own conversion logic (Spark has no built-in understanding of BsonDocument) to translate each BsonDocument into a Row with an associated schema. This lets Spark treat each key-value pair as a document field rather than as an opaque binary blob.
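A sketch of what such a conversion could look like, assuming flat documents with known id/name/age fields (nested documents would need a nested StructType and recursive handling):
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory
import org.apache.spark.sql.types.DataTypes
import org.apache.spark.sql.types.StructType
import org.bson.BsonDocument

// Schema describing the fields expected in each BsonDocument.
val consumerSchema: StructType = DataTypes.createStructType(listOf(
    DataTypes.createStructField("id", DataTypes.StringType, false),
    DataTypes.createStructField("name", DataTypes.StringType, true),
    DataTypes.createStructField("age", DataTypes.IntegerType, true)
))

// Pull each expected field out of the BsonDocument and place it, as a plain
// JVM value, into a Row whose positions match consumerSchema.
fun bsonToRow(doc: BsonDocument): Row = RowFactory.create(
    doc.getString("id").value,
    if (doc.containsKey("name")) doc.getString("name").value else null,
    if (doc.containsKey("age")) doc.getInt32("age").value else null
)
The resulting rows plus consumerSchema can then be passed to spark.createDataFrame(...) and written exactly as in Approach 1.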
Approach 3: Custom Encoder (if using kotlinx.serialization)
If you use kotlinx.serialization, serialize to a format Spark understands (e.g., a Map or a POJO/data class); avoid serializing to a byte array. Spark must see the logical fields of your object to map them directly to MongoDB document fields.
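To make the distinction concrete, a sketch using the same hypothetical Consumer fields as above (JSON-to-bytes is just one example of a byte-producing path; the point applies to any serializer output):
import kotlinx.serialization.Serializable
import kotlinx.serialization.encodeToString
import kotlinx.serialization.json.Json
import org.apache.spark.sql.Row
import org.apache.spark.sql.RowFactory

@Serializable
data class Consumer(val id: String, val name: String, val age: Int)

fun demo(consumer: Consumer) {
    // Anti-pattern: encoding the whole object to text/bytes hides its fields
    // from Spark, so only a single opaque value can be stored in MongoDB.
    val blob: ByteArray = Json.encodeToString(consumer).toByteArray()

    // Instead, expose the logical fields directly (a Row here; a Map or a
    // bean works too), so each one becomes its own BSON field when written.
    val row: Row = RowFactory.create(consumer.id, consumer.name, consumer.age)
}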
Key Takeaways
- Never pass raw BsonDocument or pre-serialized objects to Spark's MongoDB connector; this is what causes the binary field issue.
- Represent each record as a structured object (Map/data class/Row) with simple types.
- When Spark sees a logical schema, it serializes each object's fields to BSON fields in MongoDB correctly.