a week ago
I am trying to load data from MongoDB into Spark. I am using the Community/Free edition of Databricks, so my Jupyter notebook runs in a Chrome browser.
Here is my code:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.mongodb.read.connection.uri", uri) \
    .config("spark.mongodb.output.uri", uri) \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1") \
    .getOrCreate()

database = db
collection = tweets

df = spark.read.format("mongodb") \
    .option("uri", uri) \
    .option("database", database) \
    .option("collection", collection) \
    .load()
This is the error:
df.display()
[DATA_SOURCE_NOT_FOUND] Failed to find the data source: mongodb. Make sure the provider name is correct and the package is properly registered and compatible with your Spark version. SQLSTATE: 42K02
This project is for a class, so please treat me as a novice. The data is in the correct MongoDB collection, my uri and all other variables are correct, and the MongoDB connection/deployment pinged successfully. I am willing to provide any necessary information. I have spent over three hours trying to fix this.
Please help me, thank you.
a week ago
Hey @kahrees ,
Good Day!
I tested this internally, and I was able to reproduce the issue. Screenshot below:
You’re getting [DATA_SOURCE_NOT_FOUND] ... mongodb because the MongoDB Spark connector jar isn’t actually on your cluster’s classpath. On Databricks (including Community Edition), setting spark.jars.packages inside SparkSession.builder usually does not install cluster libraries—the cluster must have the jar pre-installed.
Install the connector as a cluster library (Libraries → Maven → Install, then restart the cluster).
Use a connector that matches your cluster's Scala version (Databricks Runtime typically uses Scala 2.12, so pick the _2.12 artifact).
Also check your connectivity from Databricks to MongoDB before re-running; a minimal read sketch for after the install is at the end of this reply.
I tested this, and it works locally in my env.
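For reference, once the connector is installed as a cluster library, the read itself only needs the data source options. Here is a minimal sketch, using the database and collection names from your post and a placeholder connection string (note that the 10.x connector expects the read option name connection.uri rather than uri):
# Minimal sketch, assuming org.mongodb.spark:mongo-spark-connector_2.12:10.1.1
# has been installed on the cluster via Libraries -> Maven and the cluster restarted.
# `spark` is the SparkSession that Databricks creates for every notebook.
uri = "mongodb+srv://<user>:<password>@<your-cluster-host>/"  # placeholder connection string

df = (
    spark.read.format("mongodb")      # resolves only once the connector jar is on the classpath
    .option("connection.uri", uri)    # 10.x connector option name is connection.uri
    .option("database", "db")         # database name taken from your post
    .option("collection", "tweets")   # collection name taken from your post
    .load()
)

df.display()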
Thursday
Thank you. Using the information you gave me, I was able to move a step further. It turns out that because I am using a serverless cluster, I am unable to install the Maven library. I am not sure how to switch to a non-serverless cluster, but I will continue the project in another way.
Here are the two links that helped me.
https://docs.databricks.com/aws/en/libraries/package-repositories
https://docs.databricks.com/aws/en/libraries/cluster-libraries#install-a-library-on-a-cluster
And the response from @Louis_Frolio here:
https://community.databricks.com/t5/data-engineering/install-maven-package-to-serverless-cluster/td-...
Thursday
Thanks for the update! Yes, you cannot do this on serverless compute, but for a non-serverless cluster the approach shared above is the right way. If your question is answered, could you please accept this as the solution?