@Sarosh Ahmad , You haven't provided all the details, but the issue is so close to one I've seen in the past that I'm fairly certain it's the same one.
Long story short: when the executor runs a UDF, it will, regardless of how you registered the function, attempt to resolve it by its fully qualified name (module path plus function name).
That is to say, if you create a file like optimize_assortments/foo.py (imports added here for completeness):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def hello():
    ...

hello_udf = udf(hello, StringType())
df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())  # `spark`: your active SparkSession
Then the Spark executor will attempt to resolve `optimize_assortments.foo.hello`, which it cannot import, and the dreaded ModuleNotFoundError will be thrown.
This is because the function 'hello' is scoped to the module it is defined in, and the serialised UDF only carries a reference to that fully qualified name, not the function's code.
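To make that concrete, here is a Spark-free sketch of what the serialiser records for a module-level function. It assumes the optimize_assortments package from the example is importable locally; cloudpickle, which PySpark uses to ship UDFs, falls back to the same by-reference behaviour for functions that live in importable modules:

import pickle
import pickletools

from optimize_assortments import foo

# A module-level function is pickled by reference: the payload names the
# module and the function, it does not contain the function's code.
payload = pickle.dumps(foo.hello)

# The disassembly shows the strings 'optimize_assortments.foo' and 'hello'.
# Whoever unpickles this (the executor) must import that module to get the
# function back, and on the executor that import fails.
pickletools.dis(payload)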
You can resolve this by defining the UDF inside a function. The inner function is not module scoped, so it is resolved simply as 'hello' when it is invoked, like this (again with imports added):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def run():
    def hello():
        ...

    hello_udf = udf(hello, StringType())
    df = spark.sql('SELECT * FROM foo').withColumn('hello', hello_udf())
Many people will recommend this approach, but most of them can't tell you why it works.
The reason is that a function defined inside another function is not module scoped and therefore has no module namespace to resolve: Spark has to serialise the function itself (by value) rather than a reference to `optimize_assortments.foo.hello`, so nothing needs to be imported on the executor.
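A quick way to see the difference, again outside of Spark (this assumes the standalone cloudpickle package is installed; PySpark bundles its own copy as pyspark.cloudpickle):

import cloudpickle

def run():
    def hello():
        return 'hi'
    return hello

nested = run()

# 'run.<locals>.hello': there is no name the executor could import this by,
# so cloudpickle embeds the function's code in the payload itself.
print(nested.__qualname__)

restored = cloudpickle.loads(cloudpickle.dumps(nested))
print(restored())  # 'hi', no module import needed on the receiving side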
This code will also work fine in a notebook (as opposed to databricks-connect) because notebook cells run in a single top-level module (`__main__`), so there is no package namespace to resolve in the first place.
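You can check that claim in any notebook cell or plain Python REPL, sketched the same way:

import cloudpickle

def hello():
    return 'hi'

# Interactively defined functions belong to __main__, not to a package, and
# cloudpickle ships __main__ functions by value as well, so the executor never
# needs an optimize_assortments module to exist.
print(hello.__module__)                               # '__main__'
print(cloudpickle.loads(cloudpickle.dumps(hello))())  # 'hi'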
I've tried to explain the gist here, but you can read more and see a full, detailed example on Stack Overflow here -> https://stackoverflow.com/questions/59322622/how-to-use-a-udf-defined-in-a-sub-module-in-pyspark/671...