Re: How does Databricks handle registration and di...

anagilla · yesterday

You're not hitting a hidden Databricks feature that scans your pipeline resources for data sources. This is just how Lakeflow Spark Declarative Pipelines runs your code.

When a pipeline plans its graph, it evaluates every file you've configured as pipeline source, and it does that more than once across planning and the run. So if the module with your custom data source is one of those source files and it calls spark.dataSource.register(...) at the top level, that line runs whenever the file gets evaluated. The format ends up registered without you importing anything, because the registration already happened as a side effect of the pipeline reading your source files.

The distinction that matters is between a configured source file and everything else:

A configured source file gets evaluated, so top-level code in it (your register() call) runs on its own.
A utility module, a wheel, or an env dependency only lands on sys.path so you can import it. It won't run until a source file imports it, so the registration won't happen on its own.
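If it helps to see the difference, here is a plain-Python analogy you can run anywhere. It is not Databricks code: `fake_registry`, `source_file.py`, and `util_module.py` are made-up names, and `runpy.run_path` only stands in for "the pipeline evaluates this file". The point is just that top-level code runs when a file is evaluated, while a module that is merely importable stays dormant until something imports it.

```python
import importlib
import runpy
import sys
import tempfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())

# Hypothetical stand-in for the session's data source registry.
(workdir / "fake_registry.py").write_text("REGISTERED = []\n")

# Stand-in for a configured source file with a top-level registration call.
(workdir / "source_file.py").write_text(
    "import fake_registry\n"
    "fake_registry.REGISTERED.append('from_source_file')\n"
)

# Stand-in for a utility module: same top-level call, but only importable.
(workdir / "util_module.py").write_text(
    "import fake_registry\n"
    "fake_registry.REGISTERED.append('from_util_module')\n"
)

# Everything in workdir is now importable, but nothing has run yet.
sys.path.insert(0, str(workdir))
import fake_registry

# "Evaluating" the source file runs its top-level code.
runpy.run_path(str(workdir / "source_file.py"))
after_evaluation = list(fake_registry.REGISTERED)

# The utility module's top-level code runs only once something imports it.
importlib.import_module("util_module")
after_import = list(fake_registry.REGISTERED)

print(after_evaluation)  # ['from_source_file']
print(after_import)      # ['from_source_file', 'from_util_module']
```

Being on `sys.path` was not enough for `util_module` to register anything; it took an explicit import.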

On your two questions:

There's no "discover and auto-register data sources" mechanism. What you're seeing falls out of the pipeline evaluating your source files, and your registration module happens to be one of them.
The docs describe the explicit path, not this implicit one. The "Load data in pipelines" page assumes the source "has been registered using spark.dataSource.register" before you read it, and the PySpark custom data sources page shows the same explicit register() call.

Two things worth planning for:

Since the pipeline evaluates your code more than once, a top-level register() can run several times. That's fine in practice; just keep top-level side effects cheap and idempotent.
For a data source spread across several modules, the pattern that works most reliably in these pipelines is a single source file. The Databricks Labs lakeflow-community-connectors project does this for you: it inlines the data source code into one source file and appends the spark.dataSource.register(...) call. If you're importing across modules today, consolidating into one source file (or borrowing that project's approach) is the dependable route.
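On the first point, here is a small sketch of what "idempotent" buys you when the same top-level code runs several times. Everything in it is a stand-in: `registry` and `audit_log` are plain Python objects I made up, not Databricks or PySpark APIs, and the dict only models a registry keyed by format name.

```python
# Hypothetical stand-ins: neither object is a Databricks or PySpark API.
registry = {}    # name -> class, so registering again overwrites the same key
audit_log = []   # append-only, so repeated evaluation accumulates entries


class MyDataSource:
    """Placeholder for a custom data source class."""

    @classmethod
    def name(cls):
        return "my_format"


def evaluate_source_file():
    """Simulates the top-level body of a pipeline source file."""
    registry[MyDataSource.name()] = MyDataSource  # idempotent side effect
    audit_log.append("registered my_format")      # non-idempotent side effect


# The pipeline may evaluate the file more than once; three times here.
for _ in range(3):
    evaluate_source_file()

print(len(registry))   # 1
print(len(audit_log))  # 3
```

The registration ends in the same state no matter how many times it runs, while the append-style side effect grows with every evaluation. That second kind is what you want to keep out of top-level code.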

What I'd do: keep the registration explicit so it doesn't ride on which file the pipeline happens to evaluate. Put spark.dataSource.register(MyDataSource) in a configured source file, or import the module from one. For anything beyond a single module, the merged single-file pattern the connectors project uses is the way to go.
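One way to keep it explicit, sketched so it runs without a cluster: put the registration behind a small function and call it from a configured source file. `register_data_sources`, `StubRegistration`, and `stub_spark` are names I made up for illustration. In a real pipeline `MyDataSource` would subclass `pyspark.sql.datasource.DataSource`, and you would pass the pipeline's own `spark` session instead of the stub.

```python
from types import SimpleNamespace


class MyDataSource:
    """Placeholder; a real one would subclass pyspark.sql.datasource.DataSource."""

    @classmethod
    def name(cls):
        return "my_format"


def register_data_sources(spark):
    """Explicit entry point, meant to be called from a configured source file."""
    spark.dataSource.register(MyDataSource)


# Stub session so the sketch runs without Spark (hypothetical, not a real API).
class StubRegistration:
    def __init__(self):
        self.registered = {}

    def register(self, data_source):
        # Keyed by name, so calling register() again is harmless here.
        self.registered[data_source.name()] = data_source


stub_spark = SimpleNamespace(dataSource=StubRegistration())

# In the configured source file this would be: register_data_sources(spark)
register_data_sources(stub_spark)
registered_names = sorted(stub_spark.dataSource.registered)
print(registered_names)  # ['my_format']
```

With this shape the registration no longer depends on which file the pipeline happens to evaluate: whichever source file reads the format calls the function first.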