Hi @ajay_wavicle,
There are several approaches you can use to extract read/source paths from notebooks, including Auto Loader (cloudFiles) paths. The right choice depends on whether you want runtime lineage data or static code analysis.
APPROACH 1: UNITY CATALOG LINEAGE SYSTEM TABLES (RECOMMENDED)
If your workloads run against Unity Catalog, the lineage system tables automatically capture source and target paths at runtime, including cloud storage paths used by Auto Loader.
The two key tables are:
system.access.table_lineage
system.access.column_lineage
Both tables include a source_path column that captures cloud storage URIs (s3://, abfss://, gs://, etc.) and a source_type column whose values include TABLE and PATH. They also include entity_type (NOTEBOOK, JOB, PIPELINE, etc.) and entity_id, so you can trace which notebook performed the read.
Example query to find all source paths read by notebooks:
SELECT
  entity_type,
  entity_id,
  source_table_full_name,
  source_path,
  source_type,
  event_time
FROM system.access.table_lineage
WHERE source_type IS NOT NULL
  AND target_type IS NULL
  AND entity_type = 'NOTEBOOK'
ORDER BY event_time DESC
To find Auto Loader reads specifically, filter for PATH source types with cloud storage prefixes:
SELECT
  entity_id,
  source_path,
  event_time
FROM system.access.table_lineage
WHERE source_type = 'PATH'
  AND entity_type IN ('NOTEBOOK', 'PIPELINE')
  AND (source_path LIKE 's3://%'
       OR source_path LIKE 'abfss://%'
       OR source_path LIKE 'gs://%')
ORDER BY event_time DESC
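If you would rather run these lineage queries programmatically (for example, to feed the results into an inventory report), the SDK's Statement Execution API works for that; here is a minimal sketch, where the warehouse ID is a placeholder you need to fill in:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# "<your-warehouse-id>" is a placeholder -- use any SQL warehouse you can access
response = w.statement_execution.execute_statement(
    warehouse_id="<your-warehouse-id>",
    statement="""
        SELECT entity_id, source_path, event_time
        FROM system.access.table_lineage
        WHERE source_type = 'PATH'
          AND entity_type IN ('NOTEBOOK', 'PIPELINE')
        ORDER BY event_time DESC
        LIMIT 100
    """,
)

# Short queries return results inline; long-running statements may still be
# pending and need polling via w.statement_execution.get_statement()
if response.result and response.result.data_array:
    for entity_id, source_path, event_time in response.result.data_array:
        print(entity_id, source_path, event_time)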
Notes on lineage system tables:
- Lineage data is retained for one year on a rolling basis
- Lineage is captured across all workspaces attached to a Unity Catalog metastore
- You need to enable system tables if you have not already (a minimal SDK sketch follows the documentation links below)
Documentation: https://docs.databricks.com/aws/en/admin/system-tables/lineage
Documentation: https://docs.databricks.com/aws/en/data-governance/unity-catalog/data-lineage
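If system tables are not enabled yet, you can turn on the access schema (which holds the lineage tables) through the SDK's system_schemas API. A minimal sketch, assuming you have metastore admin privileges:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Look up the metastore attached to the current workspace
metastore_id = w.metastores.current().metastore_id

# Enable the "access" schema, which contains the lineage tables
# (this call requires metastore admin privileges)
w.system_schemas.enable(metastore_id=metastore_id, schema_name="access")

# Verify which system schemas are now enabled
for info in w.system_schemas.list(metastore_id=metastore_id):
    print(info.schema, info.state)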
APPROACH 2: EXPORT AND PARSE NOTEBOOK SOURCE CODE WITH THE DATABRICKS SDK
If you need to extract paths from the notebook code itself (static analysis), you can use the Databricks SDK for Python to export notebooks and then parse the source code for read statements and cloudFiles paths.
Step 1: Install the SDK
pip install databricks-sdk
Step 2: List and export notebooks
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ExportFormat
import base64
import re

w = WorkspaceClient()

# List all notebooks in a folder (recursive=True walks subfolders)
for item in w.workspace.list("/Users/your_folder/", recursive=True):
    if item.object_type.name != "NOTEBOOK":
        continue

    # Export the notebook source (the content comes back base64-encoded)
    export_response = w.workspace.export(
        path=item.path,
        format=ExportFormat.SOURCE,
    )
    source_code = base64.b64decode(export_response.content).decode("utf-8")

    # Auto Loader reads: .format("cloudFiles")...load("path")
    autoloader_paths = re.findall(
        r'format\(\s*["\']cloudFiles["\'].*?\.load\(\s*["\']([^"\']+)["\']',
        source_code,
        re.DOTALL,
    )

    # Auto Loader path passed as an option: .option("cloudFiles.path", "path")
    option_paths = re.findall(
        r'\.option\(\s*["\']cloudFiles\.path["\']\s*,\s*["\']([^"\']+)["\']',
        source_code,
    )

    # General reads via format().load(): spark.read or spark.readStream
    read_paths = re.findall(
        r'spark\.read(?:Stream)?.*?\.format\(\s*["\'](\w+)["\'].*?\.load\(\s*["\']([^"\']+)["\']',
        source_code,
        re.DOTALL,
    )

    # Shorthand reads: spark.read.csv("path"), spark.read.parquet("path"), etc.
    shorthand_paths = re.findall(
        r'spark\.read\.(csv|parquet|json|orc|text)\(\s*["\']([^"\']+)["\']',
        source_code,
    )

    if autoloader_paths or option_paths or read_paths or shorthand_paths:
        print(f"Notebook: {item.path}")
        for p in autoloader_paths + option_paths:
            print(f"  Auto Loader path: {p}")
        for fmt, p in read_paths + shorthand_paths:
            print(f"  Read path ({fmt}): {p}")
Step 3: For more robust parsing of Auto Loader specifically, look for these common patterns:
# Pattern 1: readStream with cloudFiles format and load path
# spark.readStream.format("cloudFiles").option(...).load("s3://bucket/path")
# Pattern 2: cloudFiles.path option
# .option("cloudFiles.path", "s3://bucket/path")
# Pattern 3: in DLT/SDP notebooks using dlt.read_stream or @dlt.table
# spark.readStream.format("cloudFiles").load("/mnt/data/input")
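If you want a single reusable helper for these patterns, here is a minimal sketch; the function name and pattern list are my own, not an official API, and keep in mind that regex will still miss paths built from variables or widgets:

import re

# Hypothetical helper consolidating the three patterns above
AUTOLOADER_PATTERNS = [
    # Patterns 1 and 3: .format("cloudFiles") ... .load("<path>")
    re.compile(
        r'format\(\s*["\']cloudFiles["\'].*?\.load\(\s*["\']([^"\']+)["\']',
        re.DOTALL,
    ),
    # Pattern 2: .option("cloudFiles.path", "<path>")
    re.compile(r'\.option\(\s*["\']cloudFiles\.path["\']\s*,\s*["\']([^"\']+)["\']'),
]

def extract_autoloader_paths(source_code: str) -> set:
    """Return literal Auto Loader source paths found in notebook source code."""
    paths = set()
    for pattern in AUTOLOADER_PATTERNS:
        paths.update(pattern.findall(source_code))
    return paths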
Documentation: https://docs.databricks.com/aws/en/dev-tools/sdk-python.html
APPROACH 3: QUERY THE PIPELINES API FOR LAKEFLOW SPARK DECLARATIVE PIPELINES (SDP)
If your Auto Loader code runs inside Lakeflow Spark Declarative Pipelines (SDP), the pipeline definition and settings contain configuration that may reference source paths. You can retrieve this via the SDK:
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List all pipelines and inspect each one's spec
for pipeline in w.pipelines.list_pipelines():
    detail = w.pipelines.get(pipeline_id=pipeline.pipeline_id)
    print(f"Pipeline: {detail.name}")
    print(f"  Libraries: {detail.spec.libraries}")
    print(f"  Configuration: {detail.spec.configuration}")
The configuration dictionary often contains cloudFiles.path or other source path settings that are passed into the pipeline notebooks.
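Building on that, here is a minimal sketch that flags configuration values that look like storage paths; the URI prefixes are my assumptions, so extend them to match your environment:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# The prefixes below are assumptions -- extend them for your environment
PATH_PREFIXES = ("s3://", "abfss://", "gs://", "dbfs:/", "/mnt/", "/Volumes/")

for pipeline in w.pipelines.list_pipelines():
    detail = w.pipelines.get(pipeline_id=pipeline.pipeline_id)
    config = (detail.spec.configuration or {}) if detail.spec else {}
    for key, value in config.items():
        if isinstance(value, str) and value.startswith(PATH_PREFIXES):
            print(f"{detail.name}: {key} = {value}")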
APPROACH 4: COMBINE WITH THE DATABRICKS CLI
You can also use the Databricks CLI to export notebooks in bulk for parsing:
databricks workspace export-dir /Users/ ./exported_notebooks --overwrite
Then use standard text search tools (grep, ripgrep, etc.) to find read paths:
grep -rn "cloudFiles" ./exported_notebooks/
grep -rn "\.load(" ./exported_notebooks/
grep -rn "spark\.read" ./exported_notebooks/
SUMMARY
For runtime lineage (what paths were actually read): use system.access.table_lineage, which captures cloud storage paths including Auto Loader sources.
For static code analysis (what paths are in the code): use the Databricks SDK to export notebook source and parse with regex or AST tools.
For pipeline-specific paths: query the Pipelines API for configuration values.
The lineage system tables approach is typically the most reliable because it captures what actually executed, rather than what the code text contains (which may use variables, widgets, or dynamic path construction).
* This reply was drafted with an agent system I built, which researches and drafts responses from the wide set of documentation I have available and from previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.