cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Installed Library / Module not found through Databricks connect LST 12.2

maartenvr
New Contributor III

Hi all,

We recently upgraded our databricks compute cluster from runtime version 10.4 LST, to 12.2 LST.
After the upgrade one of our python scripts suddenly fails with a module not found error; indicating that our customly created module "xml_parser" is not found on the spark executors. This is strange since we did install the module / library through the databrick UI on the new upgraded cluster; in exactly the same way as we installed it on the old cluster. Everything was running fine on the old LST. Therefore, I am wondering what causes this issue.
Has anything changed between the two runtimes? Am I missing a new setting?

FYI:
- Our spark jobs run from scripts using databricks connect (not through DB notebooks) and we have updated all the databricks connect packages from 10.4.X to 12.2.X.
- We upload a python wheel file to the UI, which gets stored on the DBFS to be picked up by the cluster. 
The installation shows a success mark in the UI.

The error message is as follows:


```

Exception has occurred: Py4JJavaError
An error occurred while calling o52.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 19) (10.161.130.19 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):
    File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length return self.loads(obj)
    File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads return cloudpickle.loads(obj, encoding=encoding)
    ModuleNotFoundError: No module named 'xml_parser''.
    Full traceback below:
    Traceback (most recent call last): File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length return self.loads(obj) File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads return cloudpickle.loads(obj, encoding=encoding) ModuleNotFoundError: No module named 'xml_parser'
```
5 REPLIES 5

Debayan
Databricks Employee
Databricks Employee

Hi, This looks like package dependency issue. Could you also please try to update the DB connect to its latest version and try again? 

Also, please tag @Debayan with your next response so that I will get notified. Thanks.

maartenvr
New Contributor III

Hi @Debayan ,

We were already on the latest databricks-connect version (12.2.12) compatibly with LTS 12.2
I tried to run the code with a downgraded version as well after your comment (12.2.10) but that didn't do the trick.

@Retired_mod, thanks for the suggestions / checks.
We double checked all these points, and everything is fine except for the first one.
I am sure our application runs on the same Python version as the cluster (3.9.5.), but we have not set the PYSPARK_PYTHON environment variable. Where do we need to set it?  On the machine making the connection to the cluster or inside the databricks compute cluster itself?
Just for my information, isn't the databricks-connect package responsible for this? 
We didn't set it before either while everything was working fine.

Debayan
Databricks Employee
Databricks Employee

Hi, Also, what if you are trying with DBR version 13.x? 

maartenvr
New Contributor III

Going to 13.3 (LTS) unfortunately requires quite some extra work for our team.
We would need to start using / configure Unity Catalog.

For now I have opened a ticket with the Databricks support team.
If I find any solution I will post it here.

maartenvr
New Contributor III

FYI: For now we have found a workaround.
We are adding the package as ZIP file to the current spark session with .addyFiles.
So after creating a spark session using Databricks-connect we run the following:
spark.sparkContext.addPyFile("C:/path/to/custom_package.zip")

We still have the question open to the DB team on why our installed package is not found anymore by the spark workers.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group