Databricks Community

maartenvr · ‎09-05-2023

Hi all,

We recently upgraded our Databricks compute cluster from runtime version 10.4 LTS to 12.2 LTS.
After the upgrade, one of our Python scripts suddenly fails with a module-not-found error, indicating that our custom module "xml_parser" cannot be found on the Spark executors. This is strange, since we installed the module / library through the Databricks UI on the upgraded cluster in exactly the same way as on the old cluster, where everything ran fine. I am therefore wondering what causes this issue.
Has anything changed between the two runtimes? Am I missing a new setting?

FYI:
- Our Spark jobs run from scripts using Databricks Connect (not through Databricks notebooks), and we have updated all the databricks-connect packages from 10.4.X to 12.2.X.
- We upload a Python wheel file through the UI, which stores it on DBFS to be picked up by the cluster.
The installation shows a success mark in the UI.

The error message is as follows:

```

Exception has occurred: Py4JJavaError

An error occurred while calling o52.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 19) (10.161.130.19 executor 1): org.apache.spark.api.python.PythonException: 'pyspark.serializers.SerializationError: Caused by Traceback (most recent call last):

File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length return self.loads(obj)

File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads return cloudpickle.loads(obj, encoding=encoding)

ModuleNotFoundError: No module named 'xml_parser''.

Full traceback below:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 188, in _read_with_length
    return self.loads(obj)
  File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'xml_parser'
```

Debayan · ‎09-07-2023

Hi, this looks like a package dependency issue. Could you please update Databricks Connect to its latest version and try again?

Also, please tag @Debayan with your next response so that I will get notified. Thanks.

maartenvr · ‎09-12-2023

Hi @Debayan ,

We were already on the latest databricks-connect version (12.2.12) compatible with 12.2 LTS.
After your comment I also tried running the code with a downgraded version (12.2.10), but that didn't do the trick.

@Retired_mod, thanks for the suggestions / checks.
We double checked all these points, and everything is fine except for the first one.
I am sure our application runs on the same Python version as the cluster (3.9.5), but we have not set the PYSPARK_PYTHON environment variable. Where do we need to set it? On the machine making the connection to the cluster, or inside the Databricks compute cluster itself?
Just for my information, isn't the databricks-connect package responsible for this?
We didn't set it before either while everything was working fine.
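On the Python version point, a minimal sketch of how the client and executor interpreters could be compared, assuming `spark` is the session created through Databricks Connect and that `sparkContext` is available on it. The helper names `python_minor_version` and `mismatched_versions` are made up for illustration and are not part of any Databricks API.

```python
import sys


def python_minor_version(_=None):
    """Return (major, minor) of the interpreter that runs this function."""
    return sys.version_info[:2]


def mismatched_versions(local, remote_versions):
    """Return the distinct remote versions that differ from the local one."""
    local = tuple(local)
    return sorted({tuple(v) for v in remote_versions if tuple(v) != local})


# Hypothetical usage (assumes `spark` already exists; not executed here):
# remote = (
#     spark.sparkContext.parallelize(range(8), 8)
#     .map(python_minor_version)
#     .collect()
# )
# print(mismatched_versions(python_minor_version(), remote))
```

An empty list would suggest that the minor versions agree, which would point away from the interpreter as the cause.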

Debayan · ‎09-13-2023

Hi, also, what happens if you try DBR version 13.x?

maartenvr · ‎09-14-2023

Going to 13.3 (LTS) unfortunately requires quite a bit of extra work for our team.
We would need to start using and configuring Unity Catalog.

For now I have opened a ticket with the Databricks support team.
If I find any solution I will post it here.

maartenvr · ‎09-28-2023

FYI: For now we have found a workaround.
We are adding the package as a ZIP file to the current Spark session with .addPyFile.
So after creating a Spark session using Databricks Connect, we run the following:
spark.sparkContext.addPyFile("C:/path/to/custom_package.zip")
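
For anyone reproducing this workaround, a rough sketch of how such a ZIP could be built before calling addPyFile. The helper `zip_package` is a hypothetical name, the paths are placeholders, and the sketch assumes the package contains only .py files (data files would be skipped). The package folder has to sit at the root of the archive so that `import xml_parser` resolves on the executors.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package directory so the package folder sits at the archive root."""
    package_dir = Path(package_dir).resolve()
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for file in sorted(package_dir.rglob("*.py")):
            # Archive names are relative to the parent directory,
            # e.g. "xml_parser/__init__.py".
            zf.write(file, file.relative_to(package_dir.parent).as_posix())
    return str(zip_path)


# Hypothetical usage (assumes `spark` already exists; not executed here):
# archive = zip_package("C:/path/to/xml_parser", "C:/path/to/custom_package.zip")
# spark.sparkContext.addPyFile(archive)
```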

We still have the question open to the DB team on why our installed package is not found anymore by the spark workers.

jguski · ‎07-18-2025

Hi @maartenvr , hi @Debayan ,
Are there any updates on this? Have you found a solution, or can the problem at least be narrowed down to specific DBR versions? I am on a cluster with 11.3 LTS and deploy my custom packaged code (named simply 'src') as a Python wheel using Databricks Asset Bundles. Even though the package is installed successfully and can generally be used by the job, execution fails as soon as I try to parallelize anything using PySpark. The executors cannot find the module 'src', nor any of its dependencies (e.g., 'xgboost').

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5285.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5285.0 (TID 13412) (10.22.37.185 executor 2): org.apache.spark.api.python.PythonException: 'ModuleNotFoundError: No module named 'src''. Full traceback below:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 1018, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/databricks/spark/python/pyspark/worker.py", line 92, in read_command
    command = serializer.loads(command.value)
  File "/databricks/spark/python/pyspark/serializers.py", line 540, in loads
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'src'
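
One way to narrow this down might be to ask each executor directly whether it can locate the module, without importing it. The sketch below is only an illustration: `probe_module` is a made-up helper, and it assumes a `spark` session with `sparkContext` available. It should be defined in the main script or notebook so that cloudpickle ships it by value; if it lived inside the missing package, it would fail with the same error.

```python
import importlib.util
import socket


def probe_module(module_name):
    """Report whether this Python process can locate `module_name`."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised for dotted names with a missing parent, or for broken specs.
        spec = None
    return {
        "host": socket.gethostname(),
        "module": module_name,
        "found": spec is not None,
    }


# Hypothetical usage (assumes `spark` already exists; not executed here):
# results = (
#     spark.sparkContext.parallelize(range(8), 8)
#     .map(lambda _: probe_module("src"))
#     .collect()
# )
# for row in results:
#     print(row)
print(probe_module("json"))  # local check against a standard-library module
```

If the driver reports the module as found while the executors do not, that would suggest the library is installed on the driver only, or on a path the worker Python processes do not see.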

Installed Library / Module not found through Databricks connect LST 12.2
