Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DATA_SOURCE_NOT_FOUND Error with MongoDB (Suggestions in other similar posts have not worked)

kahrees
New Contributor

I am trying to load data from MongoDB into Spark. I am using the Community/Free Edition of Databricks, so my Jupyter notebook runs in a Chrome browser.

Here is my code:

from pyspark.sql import SparkSession

# uri, db, and tweets are string variables defined in an earlier cell
spark = SparkSession.builder \
    .config("spark.mongodb.read.connection.uri", uri) \
    .config("spark.mongodb.output.uri", uri) \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1") \
    .getOrCreate()

database = db
collection = tweets

df = spark.read.format("mongodb") \
    .option("uri", uri) \
    .option("database", database) \
    .option("collection", collection) \
    .load()


This is the error:


df.display()
[DATA_SOURCE_NOT_FOUND] Failed to find the data source: mongodb. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02

This project is for a class, so please treat me as a novice. The data is in the correct MongoDB collection, my uri and all the other variables are correct, and the MongoDB connection/deployment pinged successfully. I am willing to provide any necessary information; I have spent over three hours trying to fix this.

Please help me, thank you.

1 ACCEPTED SOLUTION

K_Anudeep
Databricks Employee

Hey @kahrees ,

Good Day!

I tested this internally, and I was able to reproduce the issue. Screenshot below:

[Screenshot: the DATA_SOURCE_NOT_FOUND error reproduced in a Databricks notebook]


You’re getting [DATA_SOURCE_NOT_FOUND] ... mongodb because the MongoDB Spark connector jar isn’t actually on your cluster’s classpath. On Databricks (including Community Edition), setting spark.jars.packages inside SparkSession.builder usually does not install cluster libraries—the cluster must have the jar pre-installed.
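For contrast, here is a minimal sketch of the one situation where that config does take effect: in a fresh local PySpark session (outside Databricks), the builder runs before the JVM starts, so Spark resolves and downloads the package at startup. On Databricks, getOrCreate() returns the already-running session and the setting is ignored.

from pyspark.sql import SparkSession

# Local PySpark only: the package is resolved when the JVM is launched.
# On Databricks this has no effect, because the session already exists.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1")
    .getOrCreate()
)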


  • Install the connector as a cluster library (Libraries → Maven → Install → Restart); a minimal read after installation is sketched further below.

  • Use a connector that matches your cluster’s Scala version (Databricks Runtime typically uses Scala 2.12, so pick the _2.12 artifact). Also check your connectivity from Databricks to MongoDB before rerunning; see the sketch right after this list.
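For the connectivity check, a quick sketch using pymongo (this assumes pymongo is available, e.g. via %pip install pymongo; a failure here points to a network or credentials problem rather than a Spark one):

from pymongo import MongoClient

# uri is the same MongoDB connection string used in the question
client = MongoClient(uri, serverSelectionTimeoutMS=5000)
client.admin.command("ping")  # raises ServerSelectionTimeoutError if unreachable
print("MongoDB deployment is reachable")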

I tested this, and it works in my environment.
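For reference, once the connector is installed as a cluster library, a minimal read looks like the sketch below. The option name connection.uri follows the 10.x connector docs (the 3.x connector used uri); the database and collection names are placeholders taken from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # no spark.jars.packages needed now

df = (
    spark.read.format("mongodb")
    .option("connection.uri", uri)   # uri defined earlier in the notebook
    .option("database", "db")        # placeholder database name
    .option("collection", "tweets")  # placeholder collection name
    .load()
)
df.display()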


Anudeep


3 REPLIES

kahrees
New Contributor

Thank you. Using the information you gave me, I was able to move a step further. It turns out that because I am using a serverless cluster, I cannot install the Maven library. I am not sure how to move to a classic (non-serverless) cluster, so I will continue the project in another way.

Here are the two links that helped me.
https://docs.databricks.com/aws/en/libraries/package-repositories

https://docs.databricks.com/aws/en/libraries/cluster-libraries#install-a-library-on-a-cluster

And the response from @Louis_Frolio here:
https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/td-...
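For anyone else hitting the serverless limitation: Maven (JVM) libraries cannot be installed there, but Python packages can be added with %pip. One possible workaround—just a sketch, and only sensible for collections small enough to pull to the driver—is to fetch the documents with pymongo and build a DataFrame from them:

# %pip install pymongo   <- run this in its own notebook cell first
from pymongo import MongoClient

uri = "mongodb+srv://<user>:<password>@<host>/"  # placeholder connection string
client = MongoClient(uri)

# Exclude the BSON ObjectId so the rows map cleanly onto Spark types.
docs = list(client["db"]["tweets"].find({}, {"_id": 0}))

df = spark.createDataFrame(docs)  # schema inferred from the documents
display(df)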


K_Anudeep
Databricks Employee

Thanks for the update! Yes, you cannot install Maven libraries on serverless compute, but for a classic (non-serverless) cluster the approach shared above is the right way. If your question is answered, could you please accept this as the solution?


Anudeep