Machine Learning

java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.DocumentAssembler

Shreyash
New Contributor II

I am trying to serve a PySpark model through a serving endpoint. I was able to load and register the model normally, and I can also load that model and run inference. While serving the model, however, I get the following error:

 

[94fffqts54] ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
[94fffqts54] ERROR StatusLogger Reconfiguration failed: No configuration found for '5ffd2b27' at 'null' in 'null'
[94fffqts54] ERROR StatusLogger Reconfiguration failed: No configuration found for 'Default' at 'null' in 'null'
[94fffqts54] An error occurred while loading the model. An error occurred while calling o63.load.
[94fffqts54] : java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.DocumentAssembler
[94fffqts54] at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)

 

My conf file looks like this:

 

conda_env_conf = {
    "channels": ["defaults"],
    "dependencies": [
        "python=3.9.5",
        "pip",
        {
            "pip": [
                "spark-nlp==5.3.1",
                "pyspark==3.3.2",
                "mlflow==2.9.2"
            ],
            "maven": [
              {"coordinates":"com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1"},
              {"coordinates":"mx.com.sw:sdk-java18:0.0.1.5"}
            ]
        },
    ],
    "name": "bert_env",
}

 

Please help!

5 REPLIES

Kaniz_Fatma
Community Manager

Hi @Shreyash, it looks like your code is hitting a java.lang.ClassNotFoundException for the com.johnsnowlabs.nlp.DocumentAssembler class while serving your PySpark model. This error occurs when the JVM cannot find the required class on its classpath.

 

  • The spark-nlp library relies on a JAR file that must be present in the Spark classpath.
  • There are three ways to provide this JAR:
    • Automatically: When you start your Python app through an interpreter, call sparknlp.start(). The JAR will be automatically downloaded.
    • Manually: Pass the JAR to the pyspark command using the --jars switch. You can download the JAR manually from the releases page.
    • Using Maven Coordinates: Start pyspark and pass --packages. For example:
      pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1
      
      Make sure to choose the version you need.
  • Ensure that the environment where you’re serving the model has the necessary dependencies installed.
  • Verify that the Spark-NLP JAR is accessible to your Spark context.
  • If you’re running your code in a specific environment (e.g., PyCharm with Anaconda), make sure the JAR is correctly configured in the classpath.
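  • You can also try adding the JAR explicitly. The classpath is fixed when the JVM starts, so this has to be set when the session is created; calling spark.conf.set() on a session that is already running does not change it:
    spark = SparkSession.builder.config("spark.executor.extraClassPath", "/path/to/spark-nlp.jar").getOrCreate()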
    
  • Ensure that the versions of Spark, Spark-NLP, and other dependencies are compatible.
  • Sometimes issues arise due to version mismatches.
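
The Maven-coordinates route from the list above can also be expressed in Python at the point where the session is created. This is a minimal sketch, assuming the coordinates from the question; the helper names `jar_packages_conf` and `build_session` are hypothetical, and `spark.jars.packages` only takes effect if it is set before the session starts.

```python
def jar_packages_conf(coordinates):
    """Return Spark settings that ask Spark to resolve the given Maven coordinates at startup."""
    return {"spark.jars.packages": ",".join(coordinates)}


def build_session(conf, app_name="spark-nlp-app"):
    """Create (or reuse) a SparkSession with the given settings applied.

    pyspark is imported lazily so jar_packages_conf stays usable without it.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Coordinates copied from the question's environment definition.
conf = jar_packages_conf(["com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1"])
# spark = build_session(conf)  # run where pyspark is installed and Maven Central is reachable
```

If a session already exists in the process, `getOrCreate()` returns it and the JAR settings are not applied, so this has to run before anything else starts Spark.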

Hey Kaniz,

Thank you for that response. I had already passed the JARs via the conf shown above, and I also tried passing them in the cluster conf. I checked version compatibility as well, and it seems to be fine. It still does not work.

Hi @Shreyash

  • Ensure that the JAR files are available in both the Spark driver and executor classpaths.
  • You can add the JARs to the classpath using the --driver-class-path and --conf spark.executor.extraClassPath options when submitting your Spark job.
  • Depending on how you’re running your Spark application (cluster mode or client mode), the classpath behavior may differ.
  • In cluster mode, the driver runs on a worker node, so make sure the JARs are accessible there.
  • In client mode, the driver runs on your local machine, so ensure the classpath is correctly set there.
  • Verify that the environment variables (SPARK_HOME, PYSPARK_PYTHON, etc.) are consistent across your local machine and the cluster.
  • Sometimes discrepancies in environment variables can cause issues.
  • Check the Spark logs (both driver and executor logs) for any additional error messages or warnings.
  • Look for any specific details related to class loading or missing dependencies.
  • Adjust the log level if needed. Spark 3.3 and later use Log4j 2, so the option there is --conf spark.driver.extraJavaOptions="-Dlog4j2.configurationFile=file:/path/to/log4j2.properties".
  • Ensure there are no conflicting dependencies between Spark-NLP and other libraries you’re using.
  • Sometimes different versions of the same library can cause issues.
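  • As a last resort, you can name the JARs explicitly in your PySpark code. Note that SparkContext.addPyFile() only distributes Python dependencies (.py, .zip, .egg) and PySpark's SparkContext has no addJar() method, so set spark.jars when building the session instead.
  • For example:
    from pyspark.sql import SparkSession

    # spark.jars must be set before the session (and its JVM) starts
    spark = SparkSession.builder.config("spark.jars", "/path/to/spark-nlp_2.12-5.3.1.jar").getOrCreate()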
  • If you’ve made changes to the configuration or classpath, consider restarting your Spark cluster to ensure the changes take effect.
  • Remember to thoroughly check each step and verify that the necessary JARs are accessible in both the driver and executor environments. If the issue persists, feel free to provide additional details, and we’ll continue troubleshooting! 😊

Thanks for the reply, Kaniz. I was able to recreate the model locally, and it worked when I gave it the right JARs using spark.config. The catch is that I am trying to do this in MLflow, and I have no way of specifying them explicitly there. How can I provide these JARs in MLflow?
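
One possible workaround, offered as an unverified sketch rather than a confirmed fix: wrap the saved pipeline in a custom `mlflow.pyfunc.PythonModel` that builds its own Spark session, with the JARs configured, inside `load_context`. The class name `SparkNlpWrapper`, the factory function `make_wrapper_class`, and the artifact key `"pipeline"` are hypothetical. The approach also assumes the serving container has a Java runtime and can reach Maven Central, neither of which is confirmed in this thread.

```python
# Spark settings applied before the session starts; coordinates taken from the question.
SPARK_CONF = {
    "spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.3.1",
}


def make_wrapper_class():
    """Build the wrapper class lazily so this module can be imported without mlflow."""
    import mlflow.pyfunc

    class SparkNlpWrapper(mlflow.pyfunc.PythonModel):
        def load_context(self, context):
            # Imported here because this method runs inside the serving environment.
            from pyspark.ml import PipelineModel
            from pyspark.sql import SparkSession

            builder = SparkSession.builder.appName("spark-nlp-serving")
            for key, value in SPARK_CONF.items():
                builder = builder.config(key, value)
            self.spark = builder.getOrCreate()
            # "pipeline" is a hypothetical artifact key pointing at the saved PipelineModel.
            self.model = PipelineModel.load(context.artifacts["pipeline"])

        def predict(self, context, model_input):
            # model_input is expected to be a pandas DataFrame: convert, transform, convert back.
            sdf = self.spark.createDataFrame(model_input)
            return self.model.transform(sdf).toPandas()

    return SparkNlpWrapper
```

Such a model would be logged with `mlflow.pyfunc.log_model`, passing an instance of the class as `python_model` and the path of the saved pipeline in the `artifacts` dictionary. The design choice is to move JAR configuration out of the environment definition and into model-loading code, where the session is actually created.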

Rajora
New Contributor II

I'm having the same problem and have tried various solutions with no luck. I found some potentially relevant information on the following link: https://www.johnsnowlabs.com/serving-spark-nlp-via-api-3-3-databricks-jobs-and-mlflow-serve-apis/  

In the link I found the following answer:

IMPORTANT: As of 17/02/2022, there is an issue being studied by the Databricks team, regarding the creation on the fly of job clusters to serve MLFlow models that require configuring the Spark Session with specific jars. This will be fixed in later versions of Databricks. In the meantime, the way to go is using Databricks Jobs API. 

Has this already been resolved? Would it be possible to have a hands-on example showing how to solve this?
