Python udfs, Spark Connect, included modules. Compatibility issues with shared compute

thackman — Wed, 10 Jul 2024 18:54:01 GMT

Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This worked for us until we tried to switch from single user clusters to shared clusters. Shared clusters and serverless now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that lets the worker nodes find the UDF source code. Pulling the UDF functions directly into our notebook works, but we want to keep our code modular. Copying all of these functions into each notebook is a lot of bloat.

In our top-level notebook we append the subfolder that contains our scrubber UDFs to the module search path:

    import os
    import sys
    sys.path.append(os.path.abspath('./scrubbers/'))
    from UDFRegistry import UDFRegistry

The scrubber functions are configurable per tenant so they are registered dynamically using:

    def get_known_udf(self, module_name, udf_function_name):
        udf_module = __import__(module_name)
        udf_function = getattr(udf_module, udf_function_name)
        return udf_function
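For readers following along, here is a minimal standalone sketch of the same lookup using `importlib.import_module`, with explicit errors when the module or function is missing. It is not the code from our registry: the `LookupError` messages are illustrative, and the standard-library `string` module stands in for a scrubber module. One practical difference from `__import__`: for a dotted name such as `a.b`, `__import__` returns the top-level package `a`, while `importlib.import_module` returns the submodule `a.b`.

```python
import importlib


def get_known_udf(module_name, udf_function_name):
    """Return the function named udf_function_name from module module_name."""
    try:
        udf_module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Raised when the module is not on sys.path of the current process.
        raise LookupError(f"module {module_name!r} is not importable here") from exc

    udf_function = getattr(udf_module, udf_function_name, None)
    if not callable(udf_function):
        raise LookupError(f"{module_name}.{udf_function_name} is not a callable")
    return udf_function


# Usage: a standard-library function stands in for a tenant scrubber.
capwords = get_known_udf("string", "capwords")
print(capwords("hello world"))  # prints "Hello World"
```

Note that this only checks importability in the process that runs the lookup (the driver or notebook process); it says nothing about whether the same module can be imported where the UDF later executes.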

As far as we can tell, the modules are found when running on a single user cluster, but the modules containing the UDFs can't be found on shared compute or a serverless cluster.
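One plausible reading of this symptom, which is an assumption and not something confirmed in this thread: a function that lives in an importable module is normally serialized as a reference (module name plus function name) and not as source code, so the process that deserializes it must be able to import that module itself. The sketch below shows this with the standard-library `pickle` and `json` modules only. PySpark serializes UDFs with cloudpickle, which additionally serializes functions defined directly in the notebook (`__main__`) by value, and that would be consistent with inline UDFs working while imported ones fail.

```python
import json
import pickle

# Serialize a function that lives in an importable module.
payload = pickle.dumps(json.dumps)

# The payload carries only the names "json" and "dumps", not the function body.
print(b"json" in payload, b"dumps" in payload)

# Deserializing re-imports the module and looks the name up again,
# so it only works where "json" is importable.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

If this is what is happening, appending to `sys.path` in the notebook would affect only the process running the notebook code, and the Python processes that execute the UDF would still need their own way to import the module.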

Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and build a wheel package, but that adds a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged version branch in our repo.

Re: Python udfs, Spark Connect, included modules. Compatibility issues with shared compute

thackman — Mon, 15 Jul 2024 21:50:36 GMT

I'm not sure what you mean by "Ensure the Python binary's location is correctly set to resolve runtime issues". We aren't using any binaries. Everything is just Databricks notebooks. In our case, if we define a Python UDF in the root notebook, it works fine on both a single user cluster and a shared cluster. If we put the Python UDF in a child notebook that is included with the %run magic command, the executor nodes can't resolve the UDF.
