I am encountering an issue while trying to read data from MongoDB on a Unity Catalog cluster using PySpark. My code is below:
from pyspark.sql import SparkSession

database = "cloud"
collection = "data"
scope = "XXXXXXXX"
key = "XXXXXX-YYYYYY-ZZZZZZ"

# Retrieve the MongoDB connection string from a Databricks secret scope
connectionString = dbutils.secrets.get(scope=scope, key=key)

spark = (
    SparkSession.builder
    .config("spark.mongodb.input.uri", connectionString)
    .config("spark.mongodb.output.uri", connectionString)
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.2.0")
    .getOrCreate()
)
# Reading from MongoDB
df = (
    spark.read.format("mongo")
    .option("uri", connectionString)
    .option("database", database)
    .option("collection", collection)
    .load()
)
However, this fails with the following error:
org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find data source: mongo. Please find packages at `https://spark.apache.org/third-party-projects.html`.
I have already included the MongoDB Spark Connector package via spark.jars.packages, but Spark still seems unable to find the mongo data source. Can someone help me understand what might be causing this issue and how to resolve it? Any insights or suggestions would be greatly appreciated. Thank you!