Installed Library / Module not found through Databricks Connect on 12.2 LTS
09-05-2023 07:16 AM
Hi all,
We recently upgraded our Databricks compute cluster from runtime version 10.4 LTS to 12.2 LTS.
After the upgrade, one of our Python scripts suddenly fails with a module-not-found error, indicating that our custom module "xml_parser" is not found on the Spark executors. This is strange, since we installed the module / library through the Databricks UI on the upgraded cluster in exactly the same way as we installed it on the old cluster. Everything was running fine on the old LTS runtime, so I am wondering what causes this issue.
Has anything changed between the two runtimes? Am I missing a new setting?
FYI:
- Our spark jobs run from scripts using databricks connect (not through DB notebooks) and we have updated all the databricks connect packages from 10.4.X to 12.2.X.
- We upload a python wheel file to the UI, which gets stored on the DBFS to be picked up by the cluster.
The installation shows a success mark in the UI.
The error message is as follows:
```
```
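One way to narrow this down is to ask the executors directly whether they can locate the module. The following is a minimal sketch, not a confirmed diagnosis: it assumes the Databricks Connect session exposes `spark.sparkContext` and RDD operations, and the helper name `probe_executors` is hypothetical.

```python
import importlib.util
import socket


def probe_executors(spark, module_name, num_tasks=8):
    """Return sorted (hostname, module_found) pairs reported by the tasks.

    `spark` is assumed to be an existing session created elsewhere,
    for example through Databricks Connect.
    """

    def _probe(_):
        # Runs inside a task; only the standard library is used here so the
        # probe itself cannot fail because of a missing custom package.
        found = importlib.util.find_spec(module_name) is not None
        return (socket.gethostname(), found)

    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return sorted(set(rdd.map(_probe).collect()))


# Hypothetical usage with an existing session:
#   print(probe_executors(spark, "xml_parser"))
```

If every host reports `False` for "xml_parser", the wheel is not on the executors' import path even though the UI shows it as installed.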
- Labels: 12.2 LST, Compute, CustomLibrary
09-07-2023 12:28 AM
Hi, this looks like a package dependency issue. Could you please update Databricks Connect to its latest version and try again?
Also, please tag @Debayan with your next response so that I will get notified. Thanks.
09-12-2023 01:26 AM
Hi @Debayan ,
We were already on the latest databricks-connect version (12.2.12) compatible with 12.2 LTS.
I tried to run the code with a downgraded version as well after your comment (12.2.10) but that didn't do the trick.
@Retired_mod, thanks for the suggestions / checks.
We double checked all these points, and everything is fine except for the first one.
I am sure our application runs on the same Python version as the cluster (3.9.5), but we have not set the PYSPARK_PYTHON environment variable. Where do we need to set it? On the machine making the connection to the cluster, or inside the Databricks compute cluster itself?
Just for my information, isn't the databricks-connect package responsible for this?
We didn't set it before either while everything was working fine.
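For reference, the version match can be checked from the client rather than assumed. This is a minimal sketch under the same assumption that `spark.sparkContext` and RDD operations are available through Databricks Connect; all function names are hypothetical, and it compares only major.minor, which is the level at which PySpark requires driver and worker Python to agree.

```python
import sys


def local_python_version():
    """Version of the interpreter running the client script."""
    return tuple(sys.version_info[:3])


def executor_python_versions(spark, num_tasks=8):
    """Distinct Python versions reported by tasks on the cluster."""

    def _version(_):
        # Imported inside the task so the function is self-contained
        import sys
        return tuple(sys.version_info[:3])

    rdd = spark.sparkContext.parallelize(range(num_tasks), num_tasks)
    return sorted(set(rdd.map(_version).collect()))


def minor_versions_match(local, remote_versions):
    """True if every remote version shares major.minor with the local one."""
    return all(remote[:2] == local[:2] for remote in remote_versions)


# Hypothetical usage with an existing session:
#   remote = executor_python_versions(spark)
#   print(local_python_version(), remote,
#         minor_versions_match(local_python_version(), remote))
```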
09-13-2023 09:43 PM
Hi, also, what happens if you try DBR version 13.x?
09-14-2023 02:12 AM
Going to 13.3 (LTS) unfortunately requires quite some extra work for our team.
We would need to start using / configure Unity Catalog.
For now I have opened a ticket with the Databricks support team.
If I find any solution I will post it here.
09-28-2023 01:02 AM
FYI: For now we have found a workaround.
We are adding the package as a ZIP file to the current Spark session with .addPyFile.
So after creating a spark session using Databricks-connect we run the following:
spark.sparkContext.addPyFile("C:/path/to/custom_package.zip")
We still have the question open to the DB team on why our installed package is not found anymore by the spark workers.
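The workaround can be sketched end to end as follows. This is a minimal illustration, not the exact script we run: the helper names `zip_package` and `ship_package` and the example paths are hypothetical, and it assumes the package is a plain directory with an `__init__.py`. The archive is built so that the package folder sits at the top level of the ZIP, which is what Python needs to import it from the archive.

```python
import shutil
from pathlib import Path


def zip_package(package_dir, out_dir):
    """Zip a package directory so the package folder is the archive's top level.

    Returns the path of the created .zip file.
    """
    package_dir = Path(package_dir).resolve()
    out_dir = Path(out_dir).resolve()
    out_dir.mkdir(parents=True, exist_ok=True)
    # root_dir/base_dir make the archive contain "<package>/__init__.py"
    # rather than a bare "__init__.py" at the archive root.
    return shutil.make_archive(
        str(out_dir / package_dir.name),
        "zip",
        root_dir=str(package_dir.parent),
        base_dir=package_dir.name,
    )


def ship_package(spark, package_dir, out_dir):
    """Zip the package and register it with the current Spark session."""
    archive = zip_package(package_dir, out_dir)
    spark.sparkContext.addPyFile(archive)
    return archive


# Hypothetical usage with an existing session and illustrative paths:
#   ship_package(spark, "C:/path/to/custom_package", "C:/path/to/build")
```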