Our current system uses Databricks notebooks, and we have some shared notebooks that define Python UDFs. This worked for us until we tried to switch from single-user clusters to shared clusters. Shared clusters and serverless compute now use Spark Connect, which introduces a lot of behavior changes. There are several ways to include resources in Spark, but we can't find a combination that lets the worker nodes find the UDF source code. Pulling the UDF functions directly into our notebook works, but we want to keep our code modular; copying all of these functions into each notebook is a lot of bloat.
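For context, these are the sorts of resource-inclusion hooks we mean (an illustrative sketch only, not exactly what we ran; the paths are placeholders):

# Classic clusters: ship a Python file to the executors via the SparkContext
sc.addPyFile("/Workspace/Repos/our-repo/scrubbers/UDFRegistry.py")

# Spark Connect / shared clusters (PySpark 3.5+): upload an artifact so the
# workers can import it; pyfile=True puts it on the Python search path
spark.addArtifact("/Workspace/Repos/our-repo/scrubbers/UDFRegistry.py", pyfile=True)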
In our top-level notebook we append the subfolder that contains our scrubber UDFs to the Python path:
import os
import sys

sys.path.append(os.path.abspath('./scrubbers/'))
from UDFRegistry import UDFRegistry
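For reference, the folder layout we're assuming looks roughly like this (everything below other than UDFRegistry.py is a made-up example name):

# ./                         <- folder containing the top-level notebook
#   scrubbers/
#     UDFRegistry.py         <- registry class shown below
#     email_scrubber.py      <- per-tenant scrubber module (hypothetical)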
The scrubber functions are configurable per tenant, so they are registered dynamically using:
def get_known_udf(self, module_name, udf_function_name):
    # Import the module by name; it is expected to be importable via the
    # scrubbers folder appended to sys.path above
    udf_module = __import__(module_name)
    # Look up the configured function on that module
    udf_function = getattr(udf_module, udf_function_name)
    return udf_function
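Downstream, the registration step looks roughly like this (the module, function, and column names are made-up examples, the no-arg constructor is an assumption, and the return type varies per scrubber):

from pyspark.sql.types import StringType

registry = UDFRegistry()  # assuming a no-arg constructor
scrub_fn = registry.get_known_udf("email_scrubber", "scrub_email")  # hypothetical names
spark.udf.register("scrub_email", scrub_fn, StringType())

# Tenant notebooks can then call it from SQL or DataFrame code, e.g.
# SELECT scrub_email(raw_email) FROM customer_data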
As far as we can tell, the registry finds the modules when running on a single-user cluster, but on shared compute or serverless clusters the workers can't find the modules containing the UDFs.
Does anyone have a better way to include UDFs from library modules in subfolders? We could move our code to .py files and build a wheel package, but that creates a ton of complexity and introduces a second versioning system. It's much cleaner for everything to pull from one tagged branch in our repo.