Mongo Spark Connector 3.0.1 seems not to work with Databricks Connect, but works fine in Databricks Cloud

Shadowsong27
New Contributor III

On the latest databricks-connect==9.1.3 and DBR 9.1, retrieving data from MongoDB using the Maven coordinate of the Mongo Spark Connector (org.mongodb.spark:mongo-spark-connector_2.12:3.0.1, https://docs.mongodb.com/spark-connector/current/), which previously worked fine, now throws:

...
 
py4j.protocol.Py4JJavaError: An error occurred while calling o45.load.
: java.lang.ClassNotFoundException: Failed to find data source: mongo. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:765)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:819)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:367)
        at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:101)
        at com.databricks.service.SparkServiceRPCHandler$$anon$1.call(SparkServiceRPCHandler.scala:80)
        at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4724)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3522)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2315)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2278)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2193)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3932)
        at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4721)
        at com.databricks.service.SparkServiceRPCHandler$.getOrLoadAnonymousRelation(SparkServiceRPCHandler.scala:80)
        at com.databricks.service.SparkServiceRPCHandler.execute0(SparkServiceRPCHandler.scala:715)
        at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC0$1(SparkServiceRPCHandler.scala:478)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at com.databricks.service.SparkServiceRPCHandler.executeRPC0(SparkServiceRPCHandler.scala:370)
        at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:321)
        at com.databricks.service.SparkServiceRPCHandler$$anon$2.call(SparkServiceRPCHandler.scala:307)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at com.databricks.service.SparkServiceRPCHandler.$anonfun$executeRPC$1(SparkServiceRPCHandler.scala:357)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at com.databricks.service.SparkServiceRPCHandler.executeRPC(SparkServiceRPCHandler.scala:334)
        at com.databricks.service.SparkServiceRPCServlet.doPost(SparkServiceRPCServer.scala:153)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:550)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
        at org.eclipse.jetty.server.Server.handle(Server.java:516)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:388)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:633)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:380)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
        at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:383)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:882)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1036)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: mongo.DefaultSource
        at java.lang.ClassLoader.findClass(ClassLoader.java:524)
        at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
        at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
        at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:739)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:739)
        at scala.util.Failure.orElse(Try.scala:224)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:739)
        ... 47 more
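
For context, the failing call is an ordinary DataFrame read. A minimal sketch (the uri, database, and collection values are placeholders):

from pyspark.sql import SparkSession

# With databricks-connect, getOrCreate() returns a session backed by the remote cluster
spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("mongo")                 # resolving "mongo" is what fails above
      .option("uri", "mongodb://host:27017")     # placeholder connection string
      .option("database", "abc")                 # placeholder database
      .option("collection", "xyz")               # placeholder collection
      .load())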

Reverting databricks-connect to version 8 won't really help since the supported DBR version (8.1) is no longer available, and reverting to version 7 throws other issues.

Any workarounds / solutions?


11 REPLIES

tigger
New Contributor III

I have exactly the same problem with databricks-connect 9.1.2. I also tried the explicit format name instead of 'mongo', but it didn't work. Please help!

spark.read.format('com.mongodb.spark.sql.DefaultSource')

tigger
New Contributor III

Hi @Kaniz Fatma

The code is simple and it worked on a databricks notebook:

(spark.read.format('mongo')
    .option("uri", uri)
    .option("database", "abc")
    .option("collection", "xyz")
    .load()).display()

Is it possible to run this code via databricks-connect?

Shadowsong27
New Contributor III

@Kaniz Fatma Mine is very similar to @Hugh Vo's; it's just a standard Spark read using the `mongo` format.

Anonymous
Not applicable

@Yikun Song​ - Would you be happy to mark which answer is best so that others can find the solution more easily?

I am more than happy to, but there is no answer yet.

Anonymous
Not applicable

@Yikun Song​ - Thank you.

Deepak_Bhutada
Contributor III

Hi @Yikun Song​ 

We found an issue with it that will be fixed in the next round of patches (to be released mid-January).

As a workaround,

  1. You need to use the assembly jar, like https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/3.0.1/mongo-spark-connec... , so that transitive dependencies are also included.
  2. After adding the JAR, you need to first run any query, like a "SELECT 1" query, to get the JAR properly synced. Otherwise, if spark.read.format("mongo") is called directly, the request to resolve the data source will reach DBR too early, before the library is synced.

So adding the assembly jar to --jars and first running a SELECT 1 query to make sure it gets synced to the server should work as a temporary workaround; see the sketch below.
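
A minimal sketch of that sequence (assuming the assembly jar has already been attached via --jars; connection values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Step 1: run any trivial query first so databricks-connect syncs the attached jar to DBR
spark.sql("SELECT 1").collect()

# Step 2: only now resolve the mongo data source
df = (spark.read.format("mongo")
      .option("uri", "mongodb://host:27017")     # placeholder
      .option("database", "abc")
      .option("collection", "xyz")
      .load())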

Shadowsong27
New Contributor III
(Accepted Solution)

Folks, the latest databricks-connect==9.1.7 fixed this.

tigger
New Contributor III

It works now. Thanks!

trini
New Contributor II

Hi, I encountered the same issue. I know it's late, but do you still have a copy of your code so I can try it with mine? Thanks so much 🙂

mehdi3x
New Contributor II

Hi everyone, the solution for me was to replace spark.read.format("mongo") with spark.read.format("mongodb"). My Spark version is 3.3.2 and my MongoDB version is 6.0.6.
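
That short-name change comes from the newer Mongo Spark Connector (the 10.x series registers "mongodb" instead of "mongo"). A sketch of the equivalent read, assuming the 10.x option names (connection.uri instead of uri; values are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("mongodb")               # 10.x registers "mongodb", not "mongo"
      .option("connection.uri", "mongodb://host:27017")  # 3.x "uri" becomes "connection.uri" in 10.x
      .option("database", "abc")
      .option("collection", "xyz")
      .load())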

 
