
Python UDFs, Spark Connect, and included modules: compatibility issues with shared compute

thackman
New Contributor II

Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This was working for us until we tried to switch from single-user clusters to shared clusters. Shared clusters and serverless now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that allows the worker nodes to find the UDF source code. Pulling the UDF function directly into our notebook works, but we want to keep our code modular; it's a lot of bloat to copy all of these functions into each notebook.

In our top level notebook we are appending the subfolder that has our scrubber UDFs in it:

import os
import sys

sys.path.append(os.path.abspath('./scrubbers/'))
from UDFRegistry import UDFRegistry
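For anyone comparing notes: under Spark Connect, sys.path changes on the driver are not propagated to the executors, so the append above only fixes imports on the driver side. A minimal sketch of shipping the files explicitly instead (assuming PySpark 3.5+, where spark.addArtifacts is available on Spark Connect sessions; spark is the session Databricks predefines in notebooks):

import os

scrubbers_dir = os.path.abspath('./scrubbers/')
# Ship each scrubber module to the cluster so the executors can
# import it; pyfile=True adds the file to their import path.
for file_name in os.listdir(scrubbers_dir):
    if file_name.endswith('.py'):
        spark.addArtifacts(os.path.join(scrubbers_dir, file_name), pyfile=True)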

 

The scrubber functions are configurable per tenant, so they are registered dynamically using:

def get_known_udf(self, module_name, udf_function_name):
    # Import the module by name and return the requested function from it.
    udf_module = __import__(module_name)
    udf_function = getattr(udf_module, udf_function_name)
    return udf_function
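For context, the registry then wraps the looked-up function as a regular Spark UDF, roughly like this (EmailScrubber, scrub_email, and the StringType return type are illustrative, not our real names):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

registry = UDFRegistry()
# Look up the per-tenant scrubber function and wrap it as a Spark UDF;
# df is any DataFrame with an 'email' column.
scrub_fn = registry.get_known_udf('EmailScrubber', 'scrub_email')
scrub_email = udf(scrub_fn, StringType())
df = df.withColumn('email', scrub_email('email'))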

 

As far as we can tell, the modules are found when running on a single-user cluster, but on shared compute or serverless clusters the executors can't find the modules containing the UDFs.

Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and build a wheel package, but that adds a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged version branch in our repo.
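The closest middle ground we've found to a wheel, sketched below, is zipping the folder at run time and shipping the archive (this assumes addArtifacts accepts .zip archives with pyfile=True, the way SparkContext.addPyFile does):

import shutil

# Zip the scrubbers package from the checked-out repo; the archive root
# matches what sys.path.append('./scrubbers/') exposes locally, and there
# is no wheel build and no second version number to maintain.
archive_path = shutil.make_archive('/tmp/scrubbers', 'zip', './scrubbers/')
spark.addArtifacts(archive_path, pyfile=True)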

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @thackman, ensure the Python binary's location is correctly set to resolve runtime issues in Spark Connect, and use Databricks' performance profiling techniques to fine-tune UDFs for optimal execution.

thackman
New Contributor II

I'm not sure what you mean by "Ensure the Python binary's location is correctly set to resolve runtime issues". We aren't using any binaries; everything is just Databricks notebooks. In our case, if we define a Python UDF function in the root notebook, it works fine on both a single-user cluster and a shared cluster. If we put the Python UDF in a child notebook that is included with the %run magic command, the executor nodes can't resolve the UDF.
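One workaround that may sidestep the import problem entirely is asking cloudpickle to serialize the module by value, so the serialized UDF carries its own code and the executors never have to import it. A sketch, assuming the vendored pyspark.cloudpickle is what serializes UDFs on this runtime, using the UDFRegistry module from above (register every module whose functions end up inside a UDF):

from pyspark import cloudpickle  # the vendored copy PySpark uses to serialize UDFs

import UDFRegistry  # the module from ./scrubbers/ (needs the sys.path append above)

# Pickle the module's functions by value: the serialized UDF then
# carries its own code, so UDFRegistry no longer has to be importable
# on the executors' sys.path.
cloudpickle.register_pickle_by_value(UDFRegistry)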
