
ModuleNotFoundError / SerializationError when executing over databricks-connect

sarosh — Mon, 27 Sep 2021 20:36:38 GMT

I am running into the following error when I run a model fitting process over databricks-connect.

It looks like worker nodes are unable to access modules from the project's parent directory.

Note that the program runs successfully up to this point; no ModuleNotFoundError is raised at the start, and Spark actions run just fine until this collect statement is called. Also, I have packaged my project as a wheel and installed it directly on the cluster to ensure the module is available to workers.

I have the python project set up as follows:

optimize-assortments
| - configs/
| - tests/
| - optimize_assortments/
|     | - process.py
|     | - sub_process_1.py
|     | - sub_process_2.py
|     | - sub_process_3.py

process.py imports classes from each sub_process module, instantiates them, and runs their methods. They are a collection of Spark transformations along with a Pandas UDF, which fits a scikit-learn model distributed across worker nodes. The error is raised after some subprocesses have already executed Spark commands successfully across workers.

Some things I've tried/verified:

- Python / DBR / databricks-connect versions.
- Moving all code from the sub_module into the parent Process.
- Building a wheel and installing it on my cluster:
  - Running via databricks-connect gives me ModuleNotFoundError halfway through execution, as described above.
  - If I import the module/submodule from within a Databricks notebook, the code executes successfully.
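A small check like the following can help narrow down where the module is visible. It is only a sketch using the standard library; the helper name module_available is made up for illustration and is not part of the project.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located by the interpreter running this call."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError (incl. ModuleNotFoundError): a parent package is missing.
        # ValueError: the module is loaded but has no usable __spec__.
        return False


print(module_available("json"))
print(module_available("no_such_pkg_xyz.sub_process_1"))
```

Called on the driver, this only describes the local environment. To learn what a worker sees, the same check would have to run inside a function that is actually executed on the workers.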

Re: ModuleNotFoundError / SerializationError when executing over databricks-connect

DouglasLinder — Tue, 28 Sep 2021 03:27:12 GMT

@Sarosh Ahmad , You haven't provided all the details, but the issue is so close to one I've seen in the past that I'm fairly certain it is the same issue.

Long story short: when the executor executes a UDF, it will, regardless of the function you register, attempt to execute the function using a fully qualified namespace.

That is to say, if you create a file like optimize_assortments/foo.py:

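from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hello():
   ...

# module-level UDF; `spark` is assumed to be an existing SparkSession
hello_udf = udf(hello, StringType())
df = (spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf()))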

Then the Spark executor will attempt to execute `optimize_assortments.foo.hello`, which it cannot resolve on the worker, and the dreaded ModuleNotFoundError will be thrown.

This is because the function 'hello' is scoped to the module that it is defined in.
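The failure mode can be reproduced without Spark. The sketch below uses the standard-library pickle as a stand-in for the serializer PySpark uses (cloudpickle, which as far as I know also stores importable module-level functions by reference). The module name fake_pkg_foo is made up for illustration and plays the role of optimize_assortments.foo; removing it from sys.modules simulates a worker where the module cannot be imported.

```python
import pickle
import sys
import types

# Hypothetical stand-in for optimize_assortments/foo.py, built in memory.
mod = types.ModuleType("fake_pkg_foo")
exec("def hello():\n    return 'hi'\n", mod.__dict__)
sys.modules["fake_pkg_foo"] = mod

# "Driver" side: a module-level function is pickled as a name, not as code.
payload = pickle.dumps(mod.hello)
stored_by_name = b"fake_pkg_foo" in payload and b"hello" in payload

# "Worker" side: the module is not importable here.
del sys.modules["fake_pkg_foo"]
try:
    pickle.loads(payload)
    error_name = None
except ModuleNotFoundError as exc:
    error_name = type(exc).__name__

print(stored_by_name, error_name)  # prints: True ModuleNotFoundError
```

The payload only carries the module path and the function name, so the receiving side has to import the module itself before it can call the function.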

You can resolve this by defining the UDF inside a function, so that it has no module scope and is resolved as plain 'hello' when it is invoked, like this:

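from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def run():
  def hello():
     ...

  # hello is local to run(), so it is not referenced by a module path
  hello_udf = udf(hello, StringType())
  df = (spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf()))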

Many people will recommend this approach for many reasons, but largely they have no idea why.

The reason is that when you define a function inside a function, it is not module scoped, and therefore has no module namespace.

This code will work fine in a notebook (rather than in databricks-connect) because notebooks use a single top-level (i.e. no namespace) module scope.
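To make the difference between the two definitions concrete, here is a rough sketch of the kind of lookup a serializer can do to decide whether a function is reachable by name. The helper resolvable_by_name is made up for illustration; it is not cloudpickle's actual implementation, only an approximation of the idea that a function which cannot be found as module.name has to be shipped by value instead.

```python
import importlib
import json


def resolvable_by_name(fn):
    """True if fn can be found again as <module>.<qualname>."""
    try:
        obj = importlib.import_module(fn.__module__)
        for part in fn.__qualname__.split("."):
            obj = getattr(obj, part)
    except (ImportError, AttributeError, TypeError, ValueError):
        return False
    return obj is fn


def run():
    def hello():
        return "hi"
    return hello


nested = run()
print(nested.__qualname__)             # contains "<locals>", not a plain module attribute
print(resolvable_by_name(json.dumps))  # module-level function: reachable by name
print(resolvable_by_name(nested))      # nested function: not reachable by name
```

A nested function has "<locals>" in its qualified name, so there is no module attribute the worker could import to get it back.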

I've tried to explain, but you can read more and see a full, detailed example on Stack Overflow here: https://stackoverflow.com/questions/59322622/how-to-use-a-udf-defined-in-a-sub-module-in-pyspark/67158697#67158697

Re: ModuleNotFoundError / SerializationError when executing over databricks-connect

Manjunath — Tue, 02 Nov 2021 07:01:43 GMT

@Sarosh Ahmad , Could you try adding a zip of the module via addPyFile, like below?

spark.sparkContext.addPyFile("src.zip")
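For the import to work from the zip, the package folder needs to sit at the root of the archive. Below is a minimal sketch of building such a zip with the standard library. The helper names zip_package and ship_package and the throwaway project layout are made up for illustration, and `spark` is assumed to be an already created SparkSession.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_package(project_root, package_name, out_dir):
    """Zip <project_root>/<package_name> so the package folder is at the zip root."""
    base = Path(out_dir) / package_name
    return shutil.make_archive(
        str(base), "zip", root_dir=str(project_root), base_dir=package_name
    )


def ship_package(spark, zip_path):
    """Register the zip with an existing SparkSession (assumed to be available)."""
    spark.sparkContext.addPyFile(zip_path)


# Example with a throwaway project layout (hypothetical file contents).
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "project" / "optimize_assortments"
    pkg.mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "process.py").write_text("VALUE = 1\n")

    archive = zip_package(Path(tmp) / "project", "optimize_assortments", tmp)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
    print(names)
```

With the real project, this would be something like ship_package(spark, zip_package(".", "optimize_assortments", "dist")), run before the first action that uses the UDF.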