cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Mongo Spark Connector 3.0.1 seems not working with Databricks-Connect, but works fine in Databricks Cloud

Shadowsong27
New Contributor III

On latest DB-Connect==9.1.3 and dbr == 9.1, retrieving data from mongo using Maven coordinate of Mongo Spark Connector: org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 - https://docs.mongodb.com/spark-connector/current/ - working fine previously throws

...
 
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.ClassNotFoundException: Failed to find data source: mongo. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:765)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:819)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
        at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:101)
        at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:80)
        at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4724)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
        at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
        at com.databricks.service.SparkServiceRPCHandler$.getOrLoadAnonymousRelation(SparkServiceRPCHandler.scala:80)
        at com.databricks.service.SparkServiceRPCHandler.execute0(SparkServiceRPCHandler.scala:715)
        at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC0$1(SparkServiceRPCHandler.scala:478)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at com.databricks.service.SparkServiceRPCHandler.executeRPC0(SparkServiceRPCHandler.scala:370)
        at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:321)
        at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:307)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC$1(SparkServiceRPCHandler.scala:357)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at com.databricks.service.SparkServiceRPCHandler.executeRPC(SparkServiceRPCHandler.scala:334)
        at com.databricks.service.SparkServiceRPCServlet.doPost(SparkServiceRPCServer.scala:153)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:550)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: mongo.DefaultSource
        at java.lang.ClassLoader.findClass(ClassLoader.java:524)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
        at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:739)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:739)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:739)
        ... 47 more

reverting DBC to version 8 won't really help since the supported dbr version (8.1) is no longer available. reverting DBC to version 7 throws other issue.

Any workarounds / solutions?

1 ACCEPTED SOLUTION

Accepted Solutions

Shadowsong27
New Contributor III

Folks the latest databricks-connect==9.1.7 fixed this.

View solution in original post

14 REPLIES 14

tigger
New Contributor III

I have exactly the same problem with my databricks-connect 9.1.2. Also tried explicit format name instead of 'mongo' but it didn't work. Please help!

spark.read.format('com.mongodb.spark.sql.DefaultSource')

Kaniz_Fatma
Community Manager
Community Manager

Hi @tigger and Shadowsong27! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

Kaniz_Fatma
Community Manager
Community Manager

Hi @Yikun Song​  , Can you please attach the snap of your code?

tigger
New Contributor III

Hi @Kaniz Fatma​ 

The code is simple and it worked on a databricks notebook:

(spark.read.format('mongo')
.option("uri", uri)
.option("database", "abc")
.option("collection", "xyz")
.load()).display()

Is it possible to run this code via databricks-connect?

Shadowsong27
New Contributor III

@Kaniz Fatma​  Mine is very very similar to @Hugh Vo​ , it's just a standard spark read using `mongo` format.

Anonymous
Not applicable

@Yikun Song​ - Would you be happy to mark which answer is best so that others can find the solution more easily?

i am more than happy too, but there is no answer yet.

Anonymous
Not applicable

@Yikun Song​ - Thank you.

Deepak_Bhutada
Contributor III

Hi @Yikun Song​ 

We found an issue with it that will be fixed in the next round of patches (to be released mid-January).

As a workaround,

  1. You need to use the assembly jar like https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/3.0.1/mongo-spark-connec... to also include in transitive dependencies.
  2. After adding the JAR, you need to first run any query, like a “SELECT 1” query to get the JAR properly synced. Otherwise, if spark.read.format(“mongo”) is called directly, a request to use it to resolve the datasource will reach DBR too early, before the library is synced.

So adding the assembly jar to --jars, and first running a SELECT 1 query to make sure that it gets synced to the server should be a temp working workaround.

Shadowsong27
New Contributor III

Folks the latest databricks-connect==9.1.7 fixed this.

Awesome!

tigger
New Contributor III

It works now. Thanks!

trini
New Contributor II

Hi, I encountered the same issue with your problem. I know it's late, but do u still have a copy of your code so I can try it also with mine? Thanks so much 🙂

mehdi3x
New Contributor II

Hi everyone the solution for me it was to replace spark.read.format("mongo") by spark.read.format("mongodb") my spark version is 3.3.2 and my mongodb version is 6.0.6 .

 

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!