Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Python UDFs, Spark Connect, included modules: compatibility issues with shared compute

thackman
New Contributor

Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This worked for us until we tried to switch from single-user clusters to shared clusters. Shared clusters and serverless compute now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that allows the worker nodes to find the UDF source code. Pulling the UDF function directly into our notebook works, but we want to keep our code modular; it's a lot of bloat to copy all of these functions into each notebook.

In our top-level notebook we append the subfolder that contains our scrubber UDFs to the import path:

 

import os
import sys

sys.path.append(os.path.abspath('./scrubbers/'))
from UDFRegistry import UDFRegistry
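(Aside: on Spark Connect, `sys.path.append` only modifies the import path of the client process; the executor processes never see it, which is consistent with the symptoms described below. One hedged workaround, assuming PySpark 3.5+ where `SparkSession.addArtifacts` is available on Spark Connect sessions, is to ship each module file to the executors explicitly:)

```python
import glob
import os

def scrubber_sources(folder):
    """Collect the individual .py files the executors will need to import."""
    return sorted(glob.glob(os.path.join(os.path.abspath(folder), "*.py")))

# Hypothetical usage on a Spark Connect session (folder name is ours, not
# from the original post); requires PySpark 3.5+ / a recent DBR:
# for path in scrubber_sources("./scrubbers"):
#     spark.addArtifacts(path, pyfile=True)  # copies the file to executors
```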

 

The scrubber functions are configurable per tenant, so they are registered dynamically using:

 

import importlib

def get_known_udf(self, module_name, udf_function_name):
    # importlib.import_module resolves dotted module names correctly,
    # unlike the builtin __import__, which returns the top-level package
    udf_module = importlib.import_module(module_name)
    return getattr(udf_module, udf_function_name)
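(One subtlety with dynamic imports that can bite here: the builtin `__import__` returns the *top-level* package for dotted module names, so `importlib.import_module` is the safer choice for loading functions out of submodules. A self-contained demonstration using a stdlib module, since the original scrubber modules aren't available:)

```python
import importlib

# __import__ on a dotted name returns the top-level package...
top = __import__("email.utils")
assert top.__name__ == "email"          # not "email.utils"

# ...while importlib.import_module returns the submodule itself,
# so getattr lookups resolve as expected.
sub = importlib.import_module("email.utils")
assert sub.__name__ == "email.utils"
func = getattr(sub, "parseaddr")
```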

 

As far as we can tell, it finds the modules when running on a single-user cluster but can't find the modules containing the UDFs on shared compute or serverless clusters.

Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and create a wheel package, but that creates a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged version branch in our repo.
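(A possible middle ground between copy-pasting functions and a full wheel build: zip the module folder at session start and ship the zip, so the code still comes straight from the tagged repo branch. A minimal sketch, assuming PySpark 3.5+ exposes `spark.addArtifacts` on Spark Connect sessions and that it accepts zip archives of Python sources via `pyfile=True`:)

```python
import os
import zipfile

def package_modules(src_dir, zip_path):
    """Bundle a folder of .py modules into one zip Spark can distribute."""
    src_dir = os.path.abspath(src_dir)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.join(root, name)
                    # keep the folder name in the archive path so the
                    # folder stays importable as a package from the zip
                    zf.write(full, os.path.relpath(full, os.path.dirname(src_dir)))
    return zip_path

# Hypothetical usage on a Spark Connect session:
# spark.addArtifacts(package_modules("./scrubbers", "/tmp/scrubbers.zip"),
#                    pyfile=True)
```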

1 REPLY

Kaniz_Fatma
Community Manager

Hi @thackman, ensure the Python binary's location is correctly set to resolve runtime issues in Spark Connect, and use Databricks' performance profiling techniques to fine-tune UDFs for optimal execution.
