Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Using pyspark databricks UDFs with outside function imports

tp992
New Contributor II

Problem with minimal example

The minimal example below does not run locally with databricks-connect==15.3, but it does run within the Databricks workspace.

main.py

from databricks.connect import DatabricksSession

from module.udf import send_message, send_complex_message

spark = DatabricksSession.builder.getOrCreate()


def main_works():
    df = spark.createDataFrame([
        (1, "James"),
        (2, "John"),
    ], ["id", "name"])

    df = df.withColumn("message", send_message("name"))

    df.show()


def main_fails():
    df = spark.createDataFrame([
        (1, "James"),
        (2, "John"),
    ], ["id", "name"])

    df = df.withColumn("message", send_complex_message("name"))

    df.show()

if __name__ == "__main__":
    main_works()
    main_fails()

module/udf.py

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

GREETING = "Hi"
COWBOY_GREETING = "Howdy"

def _get_greeting() -> str:
    return COWBOY_GREETING

@udf(returnType=StringType())
def send_message(message: str) -> str:
    return f"{GREETING}, {message}!"

@udf(returnType=StringType())
def send_complex_message(message: str) -> str:
    greeting_str = _get_greeting()
    return f"{greeting_str}, {message}!"

Traceback when running locally

The second function fails:

Traceback (most recent call last):
  File "main.py", line 31, in <module>
    main_fails()
  File "main.py", line 27, in main_fails
    df.show()
  [shortened]
  File "/databricks/spark/python/pyspark/serializers.py", line 572, in loads
    return cloudpickle.loads(obj, encoding=encoding)
ModuleNotFoundError: No module named 'module'

Expected result and question

The expected output is shown below. How do you properly debug Databricks workflows locally if UDFs cannot call helper functions defined elsewhere? It is very hard to debug them. For now we develop against a local-use == True flag that works around UDFs entirely.

+---+-----+----------+
| id| name|   message|
+---+-----+----------+
|  1|James|Hi, James!|
|  2| John| Hi, John!|
+---+-----+----------+

and

+---+-----+-------------+
| id| name|      message|
+---+-----+-------------+
|  1|James|Howdy, James!|
|  2| John| Howdy, John!|
+---+-----+-------------+
 
2 REPLIES

tp992
New Contributor II

I think the solution is in .addArtifact, if I read this correctly:

 

But I have not gotten it to work just yet.

 

```
import venv_pack
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Pack the local virtualenv and ship it to the cluster as an archive.
venv_pack.pack(output='pyspark_venv.tar.gz')
spark.addArtifact("pyspark_venv.tar.gz#environment", archive=True)
spark.conf.set("spark.sql.execution.pyspark.python", "environment/bin/python")

# Ship the UDF module file so the workers can import it.
spark.addArtifact("minimal_example_1/module/udf.py", pyfile=True)
```
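Possibly the missing piece: adding udf.py on its own would make it importable as top-level udf, not module.udf, so the pickled reference to module would still fail. A sketch of the same idea that zips the whole module package instead (paths reuse minimal_example_1 from the snippet above; the venv-pack steps are unchanged and omitted):

```
import shutil

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Build module.zip containing the module/ package (with its __init__.py),
# so that `from module.udf import ...` also resolves on the workers.
shutil.make_archive("module", "zip", root_dir="minimal_example_1", base_dir="module")

# pyfile=True puts the zip on the workers' Python path.
spark.addArtifact("module.zip", pyfile=True)
```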

mark_ott
Databricks Employee

The core issue is that PySpark UDFs require their entire closure—including any helper functions they call, such as _get_greeting—to be serializable and available on the worker nodes. In Databricks Workspace, the module distribution and packaging are managed transparently. Locally with databricks-connect, however, the worker processes may not have access to your full Python environment or custom modules, which causes pickling or module import errors like:

ModuleNotFoundError: No module named 'module'

Why This Happens

  • Single-file/inline UDFs work because all definitions are in the main context.

  • UDFs that reference nested functions or imports depend on either closure serialization (which has quirks with PySpark/CloudPickle) or require module visibility from the execution directory of the PySpark worker (often different from your local directory).

  • Databricks Workspace handles code distribution, so helpers within the same file or package are available.

  • Local/Databricks-Connect runs worker processes that may not have your project's structure in their PYTHONPATH, so import/module issues arise for non-trivial UDFs.

Recommendations for Local Debugging

1. Flatten UDF Definitions

  • Avoid calling helpers inside UDFs unless you are sure the helpers will always be in the worker's PYTHONPATH.

  • Inline logic directly into the UDF for local testing (see the sketch below).
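For example, a flattened version of the failing UDF, defined directly in main.py with the helper logic inlined (a minimal sketch that assumes the spark session from the question; the "Howdy" literal stands in for _get_greeting):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def send_complex_message_inline(message: str) -> str:
    # Helper logic inlined, so the UDF body has no reference to module.udf
    # and cloudpickle can serialize everything the workers need.
    greeting_str = "Howdy"  # stands in for _get_greeting()
    return f"{greeting_str}, {message}!"

df = spark.createDataFrame([(1, "James"), (2, "John")], ["id", "name"])
df.withColumn("message", send_complex_message_inline("name")).show()
```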

2. Package Your Code

  • Package your Python code (as a .whl or .egg) and install it in your local Spark environment before running your script. Example:

```bash
python setup.py bdist_wheel
pip install dist/your_package.whl
```

Start your Spark app only after your local cluster (or databricks-connect environment) has the same package installed, ensuring all workers find your module.
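A minimal packaging sketch for the module package from the question, assuming a standard setuptools layout (the project name is illustrative):

```python
# setup.py -- assumes module/__init__.py and module/udf.py exist alongside it
from setuptools import setup, find_packages

setup(
    name="my-udf-module",  # illustrative name
    version="0.1.0",
    packages=find_packages(include=["module", "module.*"]),
)
```

Installing the wheel into the interpreter that databricks-connect uses makes from module.udf import send_complex_message resolve on the client; the cluster workers still need the same wheel installed (for example as a cluster library) to import it during UDF execution.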

3. Use Spark's addPyFile

  • You can distribute Python files or zip archives to Spark workers with:

```python
spark.sparkContext.addPyFile("path/to/your/module.zip")
```

This works for small, self-contained modules.
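For the module package from the question, that could look like the sketch below. It assumes a plain local SparkSession; databricks-connect (Spark Connect) sessions do not expose sparkContext, so there the addArtifact approach from the earlier reply applies instead.

```python
import shutil

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Zip the module/ package so the package path (module.udf) survives on the workers.
shutil.make_archive("module", "zip", root_dir=".", base_dir="module")

# Distribute the zip; workers add it to their Python path.
spark.sparkContext.addPyFile("module.zip")
```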

4. Use pandas_udf for Easier Debugging

  • Vectorized UDFs (pandas_udf) are often easier to debug and more performant, and they behave more like regular Python functions; a short sketch follows below.
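A minimal sketch of the greeting logic as a pandas_udf, assuming the spark session from the question; the plain pandas helper can be unit-tested without Spark at all:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

def _greet(names: pd.Series) -> pd.Series:
    # Plain pandas logic, easy to test locally: _greet(pd.Series(["James"]))
    return "Howdy, " + names + "!"

# Wrap it as a vectorized UDF for Spark.
send_message_vectorized = pandas_udf(_greet, returnType=StringType())

df = spark.createDataFrame([(1, "James"), (2, "John")], ["id", "name"])
df.withColumn("message", send_message_vectorized("name")).show()
```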

5. Conditional Logic for Local Mode

  • Use a local-mode fallback: implement logic to bypass UDFs for local or unit testing:

```python
if local_use:
    df = df.withColumn("message", some_non_udf_function("name"))
else:
    df = df.withColumn("message", send_complex_message("name"))
```

However, this sacrifices fidelity with production code.
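One way such a fallback could look, with the non-UDF path built entirely from native column functions so nothing has to be pickled for the workers (a sketch; some_non_udf_function and local_use are the illustrative names from the snippet above, while df and send_complex_message come from the question):

```python
from pyspark.sql import functions as F

def some_non_udf_function(name_col: str):
    # Same greeting expressed with built-in Spark expressions instead of a Python UDF.
    return F.concat(F.lit("Howdy, "), F.col(name_col), F.lit("!"))

local_use = True  # e.g. toggled via an environment variable
if local_use:
    df = df.withColumn("message", some_non_udf_function("name"))
else:
    df = df.withColumn("message", send_complex_message("name"))
```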

Table: UDF Best Practices for Local Debugging

| Pattern | Works in Databricks Workspace | Works with databricks-connect locally | Notes |
|---|---|---|---|
| Single-file/inline UDF | Yes | Yes | All logic is visible to workers |
| UDF with helper import | Yes | No (unless packaged/distributed) | Workers may miss helper functions |
| Packaged module UDF | Yes | Yes (if installed in env) | Package must be visible to Spark workers |
| addPyFile UDF | Yes | Yes | Useful for small scripts/modules |
| Conditional local mode | N/A | Yes | Non-UDF path only for local debugging |
 
 

Key Takeaway

To reliably debug PySpark UDFs locally, package all helper functions/methods into installable modules and ensure your Spark worker (whether local or remote) has access to that package. Otherwise, keep UDFs purely self-contained.


For Databricks official references and troubleshooting serialization/UDF local development, see the following resources:
