
ModuleNotFoundError / SerializationError when executing over databricks-connect

sarosh
New Contributor

I am running into the following error when I run a model fitting process over databricks-connect.

[Screenshot: ModuleNotFoundError traceback] It looks like the worker nodes are unable to access modules from the project's parent directory.

Note that the program runs successfully up to this point: no ModuleNotFoundError is raised early on, and Spark actions run just fine until this collect statement is called. Also, I have packaged my project as a wheel and installed it directly on the cluster to ensure the module is available to the workers.
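
Roughly, the failing call has this shape (the entry-point name below is an illustrative placeholder, not my actual code):

# Transformations are lazy, so a module that is missing on the workers only
# surfaces once an action ships the UDF out for execution.
result_df = Process(spark).run()   # hypothetical entry point; only builds the plan
rows = result_df.collect()         # ModuleNotFoundError is raised here, on the workers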

I have the Python project set up as follows:

optimize-assortments
| - configs/
| - tests/
| - optimize_assortments/
   | - process.py
   | - sub_process_1.py
   | - sub_process_2.py
   | - sub_process_3.py

process.py imports the classes from each sub_process module, instantiates them, and runs their methods. They are a collection of Spark transformations along with a Pandas UDF that fits a scikit-learn model distributed across the worker nodes. The error is raised after some sub-processes have already executed Spark commands successfully across the workers.
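
The Pandas UDF follows the usual grouped-map pattern, roughly like the sketch below (the column names, schema, and estimator are illustrative placeholders, not my actual code):

import pandas as pd
from sklearn.linear_model import LinearRegression

# Fit one scikit-learn model per group; Spark runs each call on a worker.
def fit_model(pdf: pd.DataFrame) -> pd.DataFrame:
    model = LinearRegression().fit(pdf[["feature"]], pdf["target"])
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]],
                         "coef": [float(model.coef_[0])]})

fitted = (df.groupBy("group_id")
            .applyInPandas(fit_model, schema="group_id long, coef double"))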

Some things I've tried/verified:

  • Python/DBR/databricks-connect versions are compatible with each other.
  • Moving all the code from the sub-modules into the parent process.py.
  • Building a wheel and installing it on the cluster:
    • Running via databricks-connect still gives the ModuleNotFoundError halfway through execution, as described above.
    • Importing the module/submodule from a Databricks notebook executes successfully.
2 REPLIES

DouglasLinder
New Contributor III

@Sarosh Ahmad, you haven't provided all the details, but the issue is so close to one I've seen in the past that I'm fairly certain it's the same issue.

Long story short: when the executor executes a UDF, it will, regardless of the function you register, attempt to execute the function using a fully qualified namespace.

That is to say, if you create a file like optimize_assortments/foo.py:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hello():
   ...

hello_udf = udf(hello, StringType())
df = (spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf()))

Then the Spark executor will attempt to resolve `optimize_assortments.foo.hello`, which it cannot import on the worker, and the dreaded ModuleNotFoundError will be thrown.

This is because the function 'hello' is scoped to the module that it is defined in.
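
A quick, purely illustrative way to see this: a module-level function carries the name of the module that defines it, and pickle serializes the UDF as a reference to that fully qualified name, which the worker then has to re-import:

from optimize_assortments import foo

# The worker must be able to `import optimize_assortments.foo` to resolve this
# reference; if it can't, you get ModuleNotFoundError.
print(foo.hello.__module__)    # 'optimize_assortments.foo'
print(foo.hello.__qualname__)  # 'hello'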

You can resolve this by defining the UDF inside another function, so it is not module-scoped and is resolved locally as 'hello' when it is invoked, like this:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def run():
  def hello():
     ...

  hello_udf = udf(hello, StringType())
  df = (spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf()))

Many people will recommend this approach, but few of them can explain why it works.

The reason is that when you define a function inside another function, it is not module-scoped, and therefore has no module namespace.

This code will work fine in a notebook (rather than over databricks-connect) because notebooks use a single top-level (i.e. no namespace) module scope.
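
As a purely illustrative check of the difference: a nested function's qualified name contains '<locals>', so it cannot be looked up by name on the worker, and PySpark's cloudpickle serializes it by value instead; functions defined in a notebook cell live in __main__ and get the same by-value treatment:

def run():
  def hello():
     ...

  # Cannot be imported by name on the worker, so it is pickled by value
  # rather than as a reference to a module path.
  print(hello.__qualname__)   # 'run.<locals>.hello'

run()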

I've tried to explain it here, but you can read more and see a fully detailed example on Stack Overflow: https://stackoverflow.com/questions/59322622/how-to-use-a-udf-defined-in-a-sub-module-in-pyspark/671...

Manjunath
Databricks Employee

@Sarosh Ahmad, could you try adding a zip of the module via addPyFile, like below?

spark.sparkContext.addPyFile("src.zip")
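
Something like the sketch below should make the package importable on the workers; the paths are assumptions based on your project layout above, so adjust them as needed:

import shutil

# Zip the package directory and ship it to every executor; the zip is added to
# the workers' sys.path, so `import optimize_assortments` can resolve there.
shutil.make_archive("src", "zip", root_dir=".", base_dir="optimize_assortments")
spark.sparkContext.addPyFile("src.zip")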
