Data Engineering

Databricks PySpark DataFrame error while displaying data read from MongoDB

pankaj30
New Contributor II

Hi,

We are trying to read data from MongoDB in a Databricks notebook using PySpark.

When we try to display the DataFrame with show() or display(), it fails with the error "org.bson.BsonInvalidOperationException: Document does not contain key count".

The data in the MongoDB collection is in time series (struct) format.

connectionString='mongodb+srv://CONNECTION_STRING_HERE/'
database="sample_supplies"
collection="sales"
salesDF = spark.read.format("mongo").option("database", database).option("collection", collection).option("spark.mongodb.input.uri", connectionString).load()
display(salesDF)

"org.bson.BsonInvalidOperationException:Document does not contain key count" 

Accepted Solutions

an313x
New Contributor III

Thanks, @Kaniz_Fatma, for your input. I had the same problem: I couldn't display the DataFrame with only mongo-spark-connector installed on my cluster (DBR 14.3 LTS, Spark 3.5.0, Scala 2.12). Installing the rest of the suggested JAR files didn't help, but after I switched to DBR 13.3 LTS (Spark 3.4.1, Scala 2.12) it worked.


an313x
New Contributor III

UPDATE:
With mongo-spark-connector_2.12-10.3.0-all.jar from Maven installed on the cluster, the JAR files below do NOT need to be installed separately to display the DataFrame:

  • bson
  • mongodb-driver-core
  • mongodb-driver-sync

Also, I noticed that both DBR 13.3 LTS and 14.3 LTS work fine with this specific spark connector JAR file installed on the cluster.
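
For anyone landing here later, here is a minimal sketch of what the read might look like with the 10.x connector. It assumes the 10.x data source name "mongodb" and the "connection.uri" reader option (the original post uses the older "mongo" format and "spark.mongodb.input.uri"), and it assumes `spark` is the SparkSession that a Databricks notebook already provides. The helper names `mongo_read_options` and `read_collection` are made up for illustration.

```python
def mongo_read_options(connection_uri, database, collection):
    """Build reader options for the 10.x connector (hypothetical helper)."""
    return {
        # Assumption: 10.x reads the URI from "connection.uri",
        # not from the older "spark.mongodb.input.uri" key.
        "connection.uri": connection_uri,
        "database": database,
        "collection": collection,
    }


def read_collection(spark, connection_uri, database, collection):
    """Load a MongoDB collection into a DataFrame using an existing SparkSession."""
    reader = spark.read.format("mongodb")  # assumption: 10.x source name
    for key, value in mongo_read_options(connection_uri, database, collection).items():
        reader = reader.option(key, value)
    return reader.load()


# Usage in a notebook, with the placeholders from the original post:
# salesDF = read_collection(spark, "mongodb+srv://CONNECTION_STRING_HERE/",
#                           "sample_supplies", "sales")
# display(salesDF)
```

Passing the session in as an argument keeps the helpers importable outside a notebook; nothing here touches the cluster until `load()` is called.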


4 REPLIES

Kaniz_Fatma
Community Manager

Hi @pankaj30, thank you for your question! This error typically occurs when there's a mismatch between the MongoDB driver and Spark connector versions.

  • Are you sure your code has all the necessary MongoDB drivers and BSON libraries available for Spark?
  • If yes, please check that you have downloaded the JAR files below (for the appropriate Spark and Scala versions) from Maven:
    • mongo-spark-connector
    • mongodb-driver-sync
    • mongodb-driver-core
    • bson
  • These JAR files contain the necessary classes and methods for MongoDB connectivity.
  • You can find these JAR files on Maven Central or other repositories.
  • Place these JAR files in a directory accessible to your Spark cluster.
  • In your Databricks Notebook, set the Spark configuration to include the paths to the downloaded JAR files:
spark.conf.set("spark.jars", "/path/to/mongo-spark-connector.jar,/path/to/mongodb-driver-sync.jar,/path/to/mongodb-driver-core.jar,/path/to/bson.jar")
  • Ensure that your data schema matches the expected schema when reading it into a DataFrame. If there are missing fields or inconsistencies, it can lead to issues like the one you're encountering.
  • Make sure your connectionString is correctly formatted. It should include the MongoDB server details, username, password, and other required parameters.
  • Verify that the database and collection names match the actual names in your MongoDB instance.

  • Once you've resolved the BSON reference issue, use the display(salesDF) command again to show the data in your DataFrame.

  • If you encounter any further issues, please ask for additional assistance! 
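
The connection-string check in the list above can be roughed out in a few lines. This is only a sketch using Python's standard urllib.parse; `check_mongo_uri` is a hypothetical helper, and it checks nothing beyond the scheme, host, and credentials. It does not validate the URI against MongoDB's full connection string rules and does not contact the server.

```python
from urllib.parse import urlsplit


def check_mongo_uri(uri):
    """Return a list of obvious problems in a MongoDB connection string.

    Hypothetical helper: a shallow sanity check, not a full validator.
    """
    problems = []
    parts = urlsplit(uri)
    if parts.scheme not in ("mongodb", "mongodb+srv"):
        problems.append("scheme should be mongodb:// or mongodb+srv://")
    if not parts.hostname:
        problems.append("no host found")
    # Assumption: the deployment requires a username and password in the URI.
    if parts.username is None or parts.password is None:
        problems.append("no username/password found")
    return problems


# Example with made-up credentials and host:
# check_mongo_uri("mongodb+srv://user:secret@cluster0.example.net/")
```

An empty list only means the string has the expected shape; a version mismatch between the connector and the runtime can still produce the error from the original post.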

pankaj30
New Contributor II

Hi @Kaniz_Fatma, I tried all the steps above, but it still didn't work. I'm checking with the Mongo team in parallel.
