That is a great observation! You aren't actually triggering a hidden "auto-discovery" feature for custom data sources. Instead, what you are seeing is a byproduct of how Spark Declarative Pipelines (SDPs) evaluate pipeline resources.
To answer your specific questions:
1. Does Databricks automatically scan for data source registration? No, it doesn't actively scan for custom data sources specifically. However, when Databricks builds and plans the pipeline graph, it has to evaluate the top-level code of every Python file configured as a pipeline source.
Because your custom module is included as a source file, Databricks runs its top-level code during this graph-planning phase. Assuming your spark.dataSource.register(...) call is at the top level of that module, it gets executed automatically as a side effect of this evaluation. Therefore, by the time your main pipeline code runs, the format is already registered, making the explicit import unnecessary.
(Note: This applies only to files configured as pipeline sources. If your module were instead packaged as a standard wheel dependency or utility module in your environment, its top-level code wouldn't run until the module was explicitly imported.)
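To make the side effect concrete, here is a toy simulation in plain Python. It is not how Databricks implements graph planning: `REGISTRY`, `MODULE_SOURCE`, and `evaluate_source_file` are hypothetical stand-ins, and the `register` function inside the module text only imitates the role of spark.dataSource.register(...). It shows only that evaluating a file's top-level code is enough to trigger registration, with no import in the "main" code.

```python
import types

# Hypothetical stand-in for the session's data source registry (not a Spark API).
# Maps a format name to the number of times it has been registered.
REGISTRY = {}

# Text of a hypothetical pipeline source file with a top-level register call.
MODULE_SOURCE = '''
def register(name):
    REGISTRY[name] = REGISTRY.get(name, 0) + 1

register("my_format")  # top-level call: runs whenever the file is evaluated
'''


def evaluate_source_file(source, module_name):
    """Run a file's top-level code, roughly as a planner evaluating sources might."""
    module = types.ModuleType(module_name)
    module.__dict__["REGISTRY"] = REGISTRY  # share the registry with the module
    exec(compile(source, module_name, "exec"), module.__dict__)
    return module


# First evaluation: the format is registered even though nothing imported it.
evaluate_source_file(MODULE_SOURCE, "my_source")
print(REGISTRY)  # {'my_format': 1}

# A second evaluation runs the top-level call again.
evaluate_source_file(MODULE_SOURCE, "my_source")
print(REGISTRY)  # {'my_format': 2}
```

The second call also illustrates why a top-level registration can fire more than once if the same file is evaluated repeatedly.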
2. Is there documentation for this? There isn't specific documentation for "implicit custom data source discovery" because it technically isn't a standalone feature. The official docs for PySpark Custom Data Sources assume the standard, explicit path of importing and calling spark.dataSource.register() before reading.
A quick tip for best practice: Because pipeline planning can evaluate source files more than once, your top-level register() call may run repeatedly. While this is usually harmless, it's generally safer to keep registration explicit (e.g., importing the module and registering the data source in your main pipeline file) rather than relying on the side effects of file evaluation.
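As a minimal sketch of the explicit approach, you could move registration out of the module's top level and into a small helper that your main pipeline file calls. The helper name `register_sources`, the module `my_sources`, the class `MyDataSource`, and the format name `my_format` are all hypothetical. The sketch assumes each class is a PySpark `DataSource` subclass, which exposes a `name()` classmethod.

```python
def register_sources(spark, source_classes):
    """Register each custom data source class on the session and return their names.

    Assumes `spark` is an active SparkSession and every class in
    `source_classes` subclasses pyspark.sql.datasource.DataSource.
    """
    names = []
    for cls in source_classes:
        spark.dataSource.register(cls)  # explicit, happens only where this is called
        names.append(cls.name())        # short name used with spark.read.format(...)
    return names


# In the main pipeline file (hypothetical module, class, and format names):
#
#   from my_sources import MyDataSource
#
#   register_sources(spark, [MyDataSource])
#   df = spark.read.format("my_format").load()
```

With this layout the registration is visible where the format is first used, so the pipeline no longer depends on whether or how often the source file happens to be evaluated.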