11-09-2021 02:44 AM
Hi,
I'm working for Couchbase on the Couchbase Spark Connector and noticed something weird that I haven't been able to get to the bottom of so far.
For query DataFrames we use the Datasource v2 API and delegate the JSON parsing to org.apache.spark.sql.catalyst.json.CreateJacksonParser (https://github.com/couchbase/couchbase-spark-connector/blob/master/src/main/scala/com/couchbase/spark/query/QueryPartitionReader.scala#L56). This all works fine, both in a local IDE setup and when the job is sent to a local Spark distributed setup.
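For context, a boiled-down sketch of that delegation (simplified from the linked QueryPartitionReader; the object name and schema handling are placeholders, and it has to live in the org.apache.spark.sql package because the catalyst JSON helpers are private[sql]):
```scala
package org.apache.spark.sql

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.json.{CreateJacksonParser, JacksonParser, JSONOptions}
import org.apache.spark.sql.types.StructType
import org.apache.spark.unsafe.types.UTF8String

// Placeholder name; the real code sits in CouchbaseJsonUtils / QueryPartitionReader.
object JsonDelegationSketch {

  // Turn one JSON row returned by a Couchbase query into Spark InternalRows.
  def parseRow(schema: StructType, json: String): Iterable[InternalRow] = {
    val options = new JSONOptions(Map.empty[String, String], "UTC")
    val parser  = new JacksonParser(schema, options, allowArrayAsStructs = true)
    // This is the call that triggers the NoSuchMethodError below on Databricks:
    // CreateJacksonParser.string(JsonFactory, String): JsonParser
    parser.parse(json, CreateJacksonParser.string _, UTF8String.fromString)
  }
}
```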
But when I run it in a Databricks notebook, I get:
Job aborted due to stage failure.
Caused by: NoSuchMethodError: org.apache.spark.sql.catalyst.json.CreateJacksonParser$.string(Lcom/fasterxml/jackson/core/JsonFactory;Ljava/lang/String;)Lcom/fasterxml/jackson/core/JsonParser;
at org.apache.spark.sql.CouchbaseJsonUtils$.$anonfun$createParser$1(CouchbaseJsonUtils.scala:41)
at org.apache.spark.sql.catalyst.json.JacksonParser.$anonfun$parse$1(JacksonParser.scala:490)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2952)
at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:490)
at com.couchbase.spark.query.QueryPartitionReader.$anonfun$rows$2(QueryPartitionReader.scala:54)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
....
at java.lang.Thread.run(Thread.java:748)
Any idea why org.apache.spark.sql.catalyst.json.CreateJacksonParser$.string(Lcom/fasterxml/jackson/core/JsonFactory;Ljava/lang/String;)Lcom/fasterxml/jackson/core/JsonParser; is not available in this environment?
Thanks,
Michael
12-27-2021 07:22 AM
Since there hasn't been any progress on this for over a month, I applied a workaround and copied the classes into the connector source code, so we don't have to rely on the Databricks classloader. It seems to work in my testing and will be released with the next minor version (connector 3.2.0). Nonetheless, I still think this is an issue in the Databricks notebook environment and should be addressed on your side.
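Roughly, the idea looks like this (a hypothetical sketch; the real change copies the full Spark classes, and the package name here is illustrative):
```scala
// Ship a private copy of the helper under the connector's own package so it
// always resolves from the connector jar, regardless of which (possibly
// patched) CreateJacksonParser the runtime's Spark distribution provides.
package com.couchbase.spark.json // hypothetical package name

import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

object CreateJacksonParser {
  def string(jsonFactory: JsonFactory, record: String): JsonParser =
    jsonFactory.createParser(record)
}
```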
11-09-2021 11:03 PM
@Kaniz Fatma thanks for your reply. Since this question is very implementation-specific and not really related to general usage, would it make sense to connect me to an engineer familiar with the environment and the internals of the Datasource v2 API? This can also be via email or a different channel.
11-10-2021 09:08 AM
@Kaniz Fatma the entire question is in the original post. If further clarification is needed, I'm happy to provide it.
11-10-2021 10:48 AM
@Kaniz Fatma I would appreciate it if you could assign someone to help us get past this hurdle.
11-10-2021 11:44 PM
@Kaniz Fatma I think you are not quite understanding: we are currently in the process of updating the exact page you linked (we work for Couchbase!), and in the process of updating to Spark 3 we ran into the issue above. So this is specific to the Databricks notebook platform, since it works with a standalone Spark application. What you are telling us here amounts to "turn it off and on again", and we'd appreciate it if we could get some input from actual Databricks engineers working on that environment. Thank you!
11-12-2021 05:39 AM
Hello @Michael Nitschinger, I am not aware of your cluster config, but you may consider updating this jar as a library and see if you are still running into this issue.
Also, please look into this note from the docs:
"You cannot access this data source from a cluster running Databricks Runtime 7.0 or above because a Couchbase connector that supports Apache Spark 3.0 is not available."
11-12-2021 05:51 AM
@Atanu Sarkar what do you mean by updating the jar? The Couchbase connector does support Apache Spark 3.0; I wrote the new connector. We are planning to update the page you linked, and in doing so we ran into the issue above. I need someone to help me debug why our Spark connector works under Spark 3 but not in a Databricks notebook.
11-29-2021 11:31 AM
Hi @Michael Nitschinger,
Are you unblocked or still facing this issue?
11-29-2021 11:57 AM
Yes, still facing the issue as described above!
12-02-2021 02:20 PM
@Michael Nitschinger Could you let us know your cluster DBR versions?
If you can add the Spark configuration spark.driver.extraJavaOptions -verbose:class to your cluster and run your use case again, the driver stdout logs will show which jar the class org.apache.spark.sql.catalyst.json.CreateJacksonParser is loaded from.
With those two pieces of information, I can decompile the jar and find the root cause of the issue.
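Alternatively, a quick check that can be run straight from a notebook cell (just a sketch; note that getCodeSource can be null for classes loaded by the bootstrap classloader):
```scala
// Print which jar the catalyst JSON helper object was actually loaded from.
val cls = Class.forName("org.apache.spark.sql.catalyst.json.CreateJacksonParser$")
val location = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
println(location.getOrElse("bootstrap/platform classpath (no code source)"))
```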
12-02-2021 11:19 PM
answered below, hope that helps!
12-02-2021 11:18 PM
I've been using DBR 10.1 right now. Here are all the driver logs: https://gist.github.com/daschl/8f3e996caf003a903006fff57d6396e3
If needed, I can also give you access (via email) to the JAR I'm using, as well as access to a Couchbase cluster to actually test it end-to-end.
12-06-2021 09:16 AM
Hello @Xin Wang, thank you for helping us out. Any further updates on this?
12-10-2021 09:18 AM
Hello @Xin Wang, do you have everything you asked for previously? Any ETA on this? Please let us know, we are currently blocked. We'd appreciate a quick turnaround 🙂
12-10-2021 09:27 AM
Hello @ARUN VIJAYRAGHAVAN, I really apologize for the late response. Could you add spark.driver.extraJavaOptions -verbose:class to your local Spark distributed setup, so that we have the same logs as you posted before for the Databricks cluster? I need to compare the local Spark distributed setup against the Databricks cluster.
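For example, if the local job is launched via spark-submit (a sketch; the main class and jar names below are placeholders):
```
# Sketch: class and jar names are placeholders for your actual test job.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-verbose:class" \
  --class com.example.CouchbaseQueryJob \
  your-connector-test.jar
```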