I have a custom Python package that provides a PySpark DataSource implementation. I'm using Databricks Connect (16.4.10) and need to understand package installation options for serverless compute.
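For context, the class I'm registering looks roughly like this; the names and schema are placeholders, and this is a minimal sketch of the PySpark Python DataSource API rather than my actual code:

from pyspark.sql.datasource import DataSource, DataSourceReader

class MyCustomReader(DataSourceReader):
    def __init__(self, schema, options):
        self.schema = schema
        self.options = options

    def read(self, partition):
        # Yield rows matching the declared schema (placeholder data)
        yield (1, "example")

class MyCustomDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format("my_format")
        return "my_format"

    def schema(self):
        return "id int, value string"

    def reader(self, schema):
        return MyCustomReader(schema, self.options)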
Works: Traditional Compute Cluster
The custom package is pre-installed on the cluster as a cluster library.

from databricks.connect import DatabricksSession

# Connect to a classic cluster that already has my-custom-package installed
spark = DatabricksSession.builder.clusterId("my-cluster-id").getOrCreate()

# Register the custom DataSource and read through it
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()

This works perfectly.
Doesn't Work: Serverless Compute
The custom package is not available on serverless compute.

from databricks.connect import DatabricksSession

# Connect to serverless compute; my-custom-package is not installed there
spark = DatabricksSession.builder.serverless().getOrCreate()

spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()

This fails with an error because the package is not available on the serverless side.
What I've Tried
I attempted to use DatabricksEnv().withDependencies():
from databricks.connect import DatabricksSession, DatabricksEnv

# Declare the package as a dependency of the serverless session, following the UDF dependency docs
env = DatabricksEnv().withDependencies(["my-custom-package==0.4.0"])
spark = DatabricksSession.builder.serverless().withEnvironment(env).getOrCreate()

However, based on the documentation, withDependencies() appears to apply only to the environment in which Python UDFs run, not to packages that need to be available at the driver or session level for registering a custom DataSource.
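One workaround I'm considering, but haven't been able to verify, is shipping the package to the Spark Connect session as an artifact before registering the DataSource. A rough sketch, assuming a zip of the package source and assuming serverless actually picks up session artifacts for Python DataSources:

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().getOrCreate()

# Hypothetical: upload a zip of the package source to the Spark Connect session
# so it lands on the Python path; unverified whether this is honored for
# Python DataSource registration on serverless.
spark.addArtifact("my_custom_package.zip", pyfile=True)

spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()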
Questions
1. Is there a way to install custom packages on serverless compute when using Databricks Connect?
2. Is support for custom package installation on serverless compute (similar to cluster libraries) on the roadmap?
3. Are there any workarounds to make custom DataSources work with serverless compute?
Environment
Databricks Connect: 16.4.10
Python: 3.12
Custom package: Installed locally via pip; provides the PySpark DataSource implementation (registered via spark.dataSource.register)
Additional Context
The custom package works perfectly on serverless compute when it is installed in a notebook environment.
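For comparison, this is roughly the notebook flow that works on serverless (the module name my_custom_package is a placeholder for my actual package):

# Notebook cell 1: install into the serverless notebook environment
%pip install my-custom-package==0.4.0

# Notebook cell 2: restart Python so the new package is importable (if needed)
dbutils.library.restartPython()

# Notebook cell 3: register and read, same code as above
from my_custom_package import MyCustomDataSource  # placeholder module name

spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()
display(df)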
Links
https://docs.databricks.com/aws/en/dev-tools/databricks-connect/cluster-config#remote-meth
https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/udf#base-env