Data Engineering

ModuleNotFoundError / SerializationError when executing over databricks-connect

sarosh
New Contributor

I am running into the following error when I run a model fitting process over databricks-connect.

(Screenshot of the ModuleNotFoundError traceback raised by the workers.) It looks like worker nodes are unable to access modules from the project's parent directory.

Note that the program runs successfully up to this point: no ModuleNotFoundError is raised early on, and Spark actions run fine until the collect statement is called. Also, I have packaged my project as a wheel and installed it directly on the cluster to ensure the module is available to the workers.

I have the python project set up as follows:

optimize-assortments
| - configs/
| - tests/
| - optimize_assortments/
    | - process.py
    | - sub_process_1.py
    | - sub_process_2.py
    | - sub_process_3.py

process.py imports classes from each sub_process module, instantiates them, and runs their methods. These are a collection of Spark transformations along with a pandas UDF that fits a scikit-learn model, distributed across the worker nodes. The error is raised after some subprocesses have already executed Spark commands successfully across the workers.
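For context, here is a minimal sketch of that pattern; the names fit_model, the grouping column, and the schema are hypothetical, not the actual project code:

import pandas as pd
from sklearn.linear_model import LinearRegression

# One scikit-learn model is fitted per group, on the workers.
def fit_model(pdf: pd.DataFrame) -> pd.DataFrame:
    model = LinearRegression().fit(pdf[["x"]], pdf["y"])
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

fitted = df.groupBy("group").applyInPandas(fit_model, schema="group string, coef double")
fitted.collect()  # the ModuleNotFoundError surfaces at an action like this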

Some things I've tried/verified:

  • Python/DBR/databricks-connect versions are compatible (a quick sanity check is sketched after this list).
  • Moving all the code from the sub-modules into the parent process.py.
  • Building a wheel and installing it on the cluster:
    • Running via databricks-connect still raises ModuleNotFoundError halfway through execution, as described above.
    • Importing the module/submodule from a Databricks notebook executes successfully.
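For reference, a quick client-side check of the version pairing (illustrative only; legacy databricks-connect also ships a built-in `databricks-connect test` command):

import sys
import pyspark

# The local Python major.minor must match the cluster's Python version, and
# the databricks-connect release must match the cluster's DBR version.
print(sys.version_info[:2])
print(pyspark.__version__)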
3 REPLIES

DouglasLinder
New Contributor III

@Sarosh Ahmad, you haven't provided all the details, but the issue is so close to one I've seen in the past that I'm fairly certain it's the same issue.

Long story short: when an executor runs a UDF, it will, regardless of how you register the function, attempt to resolve it by its fully qualified name.

That is to say, if you create a file like optimize_assortments/foo.py:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hello():
    ...

hello_udf = udf(hello, StringType())
df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())

Then the Spark executor will attempt to import and execute `optimize_assortments.foo.hello`, which is not available on the worker, and the dreaded ModuleNotFoundError will be thrown.

This is because the function 'hello' is scoped to the module that it is defined in.

You can resolve this by defining the UDF inside another function; the nested function has no module-level scope and is therefore resolved simply as 'hello' when it is invoked, like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def run():
    def hello():
        ...

    hello_udf = udf(hello, StringType())
    df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())

Many people recommend this approach without being able to explain why it works.

The reason is that when you define a function inside another function, it is not module-scoped and therefore has no module namespace, so the serializer ships the function body itself rather than a reference that the worker would have to re-import.

This code will work fine in a notebook (rather than over databricks-connect) because notebooks use a single top-level (i.e. no namespace) module scope.
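A quick way to see the difference for yourself (illustrative; the package module name below is an assumption based on the layout above):

def hello():
    ...

# In a notebook cell the function lives in the top-level module, so the
# serializer ships the function body itself to the executors:
print(hello.__module__)   # '__main__'

# Imported from an installed package, the function is module-scoped, so the
# executor instead tries to re-import it by name and fails:
# from optimize_assortments.foo import hello
# print(hello.__module__)  # 'optimize_assortments.foo'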

I've tried to explain, but you can read more and see a full detailed example on stack overflow here -> https://stackoverflow.com/questions/59322622/how-to-use-a-udf-defined-in-a-sub-module-in-pyspark/671...

Manjunath
New Contributor III

@Sarosh Ahmad, could you try adding a zip of the module with addPyFile, like below:

spark.sparkContext.addPyFile("src.zip")
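For completeness, a minimal sketch of producing that archive; the name src.zip and the package path are just examples:

import shutil

# Zip the package directory so that 'optimize_assortments' sits at the root
# of the archive and is importable on the workers.
shutil.make_archive("src", "zip", root_dir=".", base_dir="optimize_assortments")

# Ship the archive to every executor and add it to their import path.
spark.sparkContext.addPyFile("src.zip")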

Kaniz
Community Manager

Hi @Sarosh Ahmad, just a friendly follow-up: do you still need help, or did the above responses help you find a solution? Please let us know.
