Databricks Community

mnissen1337 · yesterday

I'm working with Databricks declarative pipelines and have defined a custom PySpark data source (CDS) in its own standalone Python module. I include this module as part of the pipeline resources.

What I find interesting is that, even without explicitly importing this module in my pipeline code, the custom data source is registered and available when I reference it with spark.read.format("my_custom_source").

I’m trying to understand how Databricks manages the registration and discovery of custom data sources in this scenario. Specifically:

1. Does Databricks automatically scan and execute code from modules included as pipeline resources for data source registration when a custom format is referenced?
2. Is there any documentation or explanation for this behavior?

Any insights or pointers to relevant documentation would be greatly appreciated!

Thanks in advance!

aliyasingh · yesterday

That is a great observation! You aren't actually triggering a hidden "auto-discovery" feature for custom data sources. Instead, what you are seeing is a byproduct of how Spark Declarative Pipelines (SDPs) evaluate pipeline resources.

To answer your specific questions:

1. Does Databricks automatically scan for data source registration? No, it doesn't actively scan for custom data sources specifically. However, when Databricks builds and plans the pipeline graph, it has to evaluate the top-level code of every Python file configured as a pipeline source.

Because your custom module is included as a source file, Databricks runs its top-level code during this graph-planning phase. Assuming your spark.dataSource.register(...) call is at the top level of that module, it gets executed automatically as a side effect of this evaluation. Therefore, by the time your main pipeline code runs, the format is already registered, making the explicit import unnecessary.

(Note: This only happens for configured source files. If your module were packaged as a standard wheel dependency or a utility module in your environment, its top-level code wouldn't run until a source file explicitly imported it.)

2. Is there documentation for this? There isn't specific documentation for "implicit custom data source discovery" because it technically isn't a standalone feature. The official docs for PySpark Custom Data Sources assume the standard, explicit path of importing and calling spark.dataSource.register() before reading.

A quick tip for best practice: Because pipeline planning can evaluate source files multiple times, your top-level register() call might run multiple times. While this is usually harmless, it's generally safer to keep registration explicit (e.g., importing the module and registering it in your main pipeline file) rather than relying on the side effects of file evaluation.
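
A minimal sketch of that explicit path, assuming a PySpark version that ships pyspark.sql.datasource. The names MyCustomSource, MyCustomReader, and register_sources are illustrative and not taken from the original post, and the two hard-coded rows are placeholder data:

```python
# Assumption: all class and function names here are hypothetical examples.
try:
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Fallback so the classes stay importable in a plain unit test
    # without PySpark; in a pipeline the real base classes are used.
    DataSource = DataSourceReader = object


class MyCustomReader(DataSourceReader):
    def read(self, partition):
        # Yield rows as tuples matching the schema declared below.
        yield (1, "alpha")
        yield (2, "beta")


class MyCustomSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used in spark.read.format(...).
        return "my_custom_source"

    def schema(self):
        return "id INT, label STRING"

    def reader(self, schema):
        return MyCustomReader()


def register_sources(spark):
    """Register the data source explicitly on the given session."""
    spark.dataSource.register(MyCustomSource)
```

With this layout nothing is registered as an import side effect. The main pipeline file would import register_sources, call register_sources(spark) once, and only then read with spark.read.format("my_custom_source").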

anagilla · yesterday

You're not hitting a hidden Databricks feature that scans your pipeline resources for data sources. This is just how Lakeflow Spark Declarative Pipelines runs your code.

When a pipeline plans its graph, it evaluates every file you've configured as a pipeline source file, and it does that more than once across planning and the run. So if the module with your custom data source is one of those source files and it calls spark.dataSource.register(...) at the top level, that line runs whenever the file gets evaluated. The format ends up registered without you importing anything, because the registration already happened as a side effect of the pipeline evaluating your source files.

The line that matters is configured source file vs. everything else:

A configured source file gets evaluated, so top-level code in it (your register() call) runs on its own.
A utility module, a wheel, or an env dependency only lands on sys.path so you can import it. It won't run until a source file imports it, so the registration won't happen on its own.
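
That distinction can be simulated with the standard library alone. This is only a rough model, not how the pipeline runtime is implemented: runpy.run_path stands in for "the pipeline evaluates a configured source file", a list in a throwaway fake_registry module stands in for Spark's data source registry, and every file name is made up for the demo.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in for the session's data source registry.
        (root / "fake_registry.py").write_text("registered = []\n")
        # Pretend this is a configured pipeline source file.
        (root / "source_file.py").write_text(
            "import fake_registry\n"
            "fake_registry.registered.append('from_source_file')\n"
        )
        # Pretend this is a utility module that only lands on sys.path.
        (root / "util_module.py").write_text(
            "import fake_registry\n"
            "fake_registry.registered.append('from_util_module')\n"
        )
        sys.path.insert(0, str(root))
        importlib.invalidate_caches()
        try:
            registry = importlib.import_module("fake_registry")
            # Evaluating the source file runs its top-level code.
            runpy.run_path(str(root / "source_file.py"))
            after_evaluation = list(registry.registered)
            # The utility module has not run yet; importing it triggers it.
            importlib.import_module("util_module")
            after_import = list(registry.registered)
        finally:
            # Undo the sys.path and module-cache changes made for the demo.
            sys.path.remove(str(root))
            for name in ("fake_registry", "util_module"):
                sys.modules.pop(name, None)
    return after_evaluation, after_import


after_evaluation, after_import = demo()
print(after_evaluation)  # ['from_source_file']
print(after_import)      # ['from_source_file', 'from_util_module']
```

The utility module's entry shows up only after the explicit import, which mirrors why a register() call inside a wheel or helper module does nothing until a source file imports it.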

On your two questions:

1. There's no "discover and auto-register data sources" mechanism. What you're seeing falls out of the pipeline evaluating your source files, and your registration module happens to be one of them.
2. The docs describe the explicit path, not this implicit one. The "Load data in pipelines" page assumes the source "has been registered using spark.dataSource.register" before you read it, and the PySpark custom data sources page shows the same explicit register() call.

Two things worth planning for:

Since the pipeline evaluates your code more than once, a top-level register() can run several times. That's fine in practice; just keep top-level side effects cheap and idempotent.
For a data source spread across several modules, the pattern SDP is happiest with is a single source file. The Databricks Labs lakeflow-community-connectors project does this for you: it inlines the data source code into one source file and appends the spark.dataSource.register(...) call. If you're importing across modules today, consolidating into one source file (or borrowing that project's approach) is the dependable route.
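
To make the "cheap and idempotent" point concrete, here is a small stdlib-only simulation. It is a sketch under assumptions, not the pipeline's real evaluation loop: exec with fresh globals stands in for one evaluation of a source file, and a plain dict stands in for the registry that Spark keeps on the session.

```python
# Stand-ins that outlive each evaluation, like session-level state would.
registry = {}
evaluations = []

# Hypothetical source file body; names are illustrative only.
SOURCE_FILE = '''
evaluations.append("evaluated")
# Keyed assignment is idempotent: re-running replaces the same entry.
registry["my_custom_source"] = "MyCustomSource"
'''


def evaluate_source_file(times):
    code = compile(SOURCE_FILE, "source_file.py", "exec")
    for _ in range(times):
        # Fresh globals each time, as if the file were evaluated from scratch.
        exec(code, {"registry": registry, "evaluations": evaluations})


evaluate_source_file(3)
print(len(evaluations), len(registry))  # 3 1
```

The top-level code ran three times, yet the registry holds a single entry. Anything at the top level that appends, opens connections, or does slow work would repeat on every evaluation, which is the part worth keeping out of source files.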

What I'd do: keep the registration explicit so it doesn't ride on which file the pipeline happens to evaluate. Put spark.dataSource.register(MyDataSource) in a configured source file, or import the module from one. For anything beyond a single module, the merged single-file pattern the connectors project uses is the way to go.

How does Databricks handle registration and discovery of custom PySpark data sources in SDPs?
