How does Databricks handle registration and discovery of custom PySpark data sources in SDPs?

mnissen1337
Contributor

I'm working with Databricks declarative pipelines and have defined a custom PySpark data source (CDS) in its own standalone Python module. I include this module as part of the pipeline resources. 

What I find interesting is that, even without explicitly importing this module in my pipeline code, the custom data source is registered and available when I reference it with spark.read.format("my_custom_source").

I’m trying to understand how Databricks manages the registration and discovery of custom data sources in this scenario. Specifically:

  • When a custom format is referenced, does Databricks automatically scan the modules included as pipeline resources and execute their code to register data sources?
  • Is there any documentation or explanation for this behavior?

Any insights or pointers to relevant documentation would be greatly appreciated!

Thanks in advance!