@Sarosh Ahmad , You haven't provided all the details, but the issue is so close to one I've seen in the past that I'm fairly certain it's the same one.
Long story short: when the executor runs a UDF, it will, regardless of how you registered the function, attempt to resolve it by its fully qualified name (module path plus function name).
That is to say, if you create a file like optimize_assortments/foo.py (imports added here for completeness):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hello():
    ...

hello_udf = udf(hello, StringType())
df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())  # `spark`: your active SparkSession
Then the Spark executor will attempt to resolve `optimize_assortments.foo.hello`, which it cannot import, and the dreaded ModuleNotFoundError will be thrown.
This is because the function 'hello' is scoped to the module it is defined in, and the serialised UDF only carries a reference to that fully qualified name, not the function's code.
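To make that concrete, here is a Spark-free sketch of what the serialiser records for a module-level function. It assumes the optimize_assortments package from the example is importable locally; cloudpickle, which PySpark uses to ship UDFs, falls back to the same by-reference behaviour for functions that live in importable modules:

import pickle
import pickletools

from optimize_assortments import foo

# A module-level function is pickled by reference: the payload names the
# module and the function, it does not contain the function's code.
payload = pickle.dumps(foo.hello)

# The disassembly shows the strings 'optimize_assortments.foo' and 'hello'.
# Whoever unpickles this (the executor) must import that module to get the
# function back, and on the executor that import fails.
pickletools.dis(payload)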
You can resolve this by defining the UDF inside a function. The inner function is not module scoped, so it is resolved simply as 'hello' when it is invoked, like this (again with imports added):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def run():
    def hello():
        ...

    hello_udf = udf(hello, StringType())
    df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())
Many people will recommend this approach, but most of them can't tell you why it works.
The reason is that a function defined inside another function is not module scoped and therefore has no module namespace to resolve: Spark has to serialise the function itself (by value) rather than a reference to `optimize_assortments.foo.hello`, so nothing needs to be imported on the executor.
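A quick way to see the difference, again outside of Spark (this assumes the standalone cloudpickle package is installed; PySpark bundles its own copy as pyspark.cloudpickle):

import cloudpickle

def run():
    def hello():
        return 'hi'
    return hello

nested = run()

# 'run.<locals>.hello': there is no name the executor could import this by,
# so cloudpickle embeds the function's code in the payload itself.
print(nested.__qualname__)

restored = cloudpickle.loads(cloudpickle.dumps(nested))
print(restored())  # 'hi', no module import needed on the receiving side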
This code will also work fine in a notebook (as opposed to databricks-connect) because notebook cells run in a single top-level module (`__main__`), so there is no package namespace to resolve in the first place.
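You can check that claim in any notebook cell or plain Python REPL, sketched the same way:

import cloudpickle

def hello():
    return 'hi'

# Interactively defined functions belong to __main__, not to a package, and
# cloudpickle ships __main__ functions by value as well, so the executor never
# needs an optimize_assortments module to exist.
print(hello.__module__)                               # '__main__'
print(cloudpickle.loads(cloudpickle.dumps(hello))())  # 'hi'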
I've tried to explain the gist here, but you can read more and see a full, detailed example on Stack Overflow here -> https://stackoverflow.com/questions/59322622/how-to-use-a-udf-defined-in-a-sub-module-in-pyspark/671...