Friday - last edited Friday
I have a custom Python package that provides a PySpark DataSource implementation. I'm using Databricks Connect (16.4.10) and need to understand package installation options for serverless compute.
Works: Traditional Compute Cluster
Custom package pre-installed on cluster
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.clusterId("my-cluster-id").getOrCreate()
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()
Works perfectly
Doesn't Work: Serverless Compute
Custom package not available
spark = DatabricksSession.builder.serverless().getOrCreate()
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()
Error
What I've Tried
I attempted to use DatabricksEnv().withDependencies():
from databricks.connect import DatabricksSession, DatabricksEnv

env = DatabricksEnv().withDependencies(["my-custom-package==0.4.0"])
spark = DatabricksSession.builder.serverless().withEnvironment(env).getOrCreate()
However, based on the documentation, withDependencies() appears to only work for Python UDFs, not for packages that need to be available at the driver or session level for custom DataSource registration.
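To make the distinction concrete, my understanding of the documented pattern is roughly the following (sketch only; my_custom_package stands in for the package's import name):

from pyspark.sql.functions import udf

# This kind of usage is what withDependencies() is documented for: the package
# only has to be importable inside the UDF execution environment on serverless.
@udf("string")
def pkg_version(_):
    import my_custom_package  # placeholder for the package's import name
    return my_custom_package.__version__

# My case is different: registering the custom DataSource needs the package
# importable at the session level, before any UDF runs.
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()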
Questions
Is there a way to install custom packages on serverless compute when using Databricks Connect?
Is support for custom package installation on serverless compute (similar to cluster libraries) on the roadmap?
Are there any workarounds to make custom DataSources work with serverless compute?
Environment
Databricks Connect: 16.4.10
Python: 3.12
Custom package: Installed locally via pip, provides PySpark DataSource V2 API implementation
Additional Context
The custom package works perfectly in a serverless environment when used from a notebook.
Links
https://docs.databricks.com/aws/en/dev-tools/databricks-connect/cluster-config#remote-meth
https://docs.databricks.com/aws/en/dev-tools/databricks-connect/python/udf#base-env
Saturday
Just put the wheel on a volume and add it to the environment?
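Roughly like this (untested sketch; the volume path and wheel file name are placeholders):

from databricks.connect import DatabricksSession, DatabricksEnv

# Wheel uploaded beforehand to a Unity Catalog volume (placeholder path).
whl_path = "/Volumes/my_catalog/my_schema/libs/my_custom_package-0.4.0-py3-none-any.whl"

env = DatabricksEnv().withDependencies([whl_path])
spark = DatabricksSession.builder.serverless().withEnvironment(env).getOrCreate()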
Saturday - last edited Saturday
Hi @ganesh_raskar - Can you try running pip install in a notebook shell first and then using the library? Also, which package do you want to install? If you share that detail, I will give it a try.
Regards - San
yesterday
@Sanjeeb2024 It works perfectly fine in a notebook, either by installing it with !pip install or by pre-installing it in the serverless environment the notebook is attached to.
It's just that with Spark Connect on serverless compute, I don't see an option to install it.
I also tried configuring the default workspace serverless environment, but that applies to notebooks and jobs; it does not apply to Spark Connect sessions.
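For comparison, the notebook flow that works looks like this (package name and import path are illustrative):

# Notebook cell on the serverless environment:
%pip install my-custom-package==0.4.0

# Next cell:
from my_custom_package import MyCustomDataSource  # illustrative import path
spark.dataSource.register(MyCustomDataSource)
df = spark.read.format("my_format").load()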
yesterday
Hi @ganesh_raskar - If you can share which custom package you're using, along with the exact code and error, I can try to replicate it at my end and explore a suitable option.