Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to extract read paths from notebooks, especially from Auto Loader

ajay_wavicle
Databricks Partner

I am trying to figure out how to extract source paths from read statements or Auto Loader paths. I need to know my source locations across many notebooks. How can I extract this from Databricks? Can the Databricks SDK do this?

3 REPLIES

pradeep_singh
Contributor III

You can get metadata information for input files with the _metadata column. For the file path, use _metadata.file_path.

df = spark.read \
  .format("csv") \
  .schema(schema) \
  .load("dbfs:/tmp/*") \
  .select("*", "_metadata.file_path")

More details - https://docs.databricks.com/aws/en/ingestion/file-metadata-column

Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

pradeep_singh
Contributor III

If you want to crawl a path and read notebooks, you can also do that with the Databricks SDK: list all workspace objects, export each notebook, and use pattern matching to extract the file path for each Auto Loader stream.
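For example, the pattern-matching step might look like this (a minimal sketch; the regex is an illustrative assumption that only catches literal paths passed to .load(), not dynamically built ones):

```python
import re

def extract_load_paths(source_code: str) -> list[str]:
    # Match string literals passed to .load("...") calls, which covers
    # both spark.read batch reads and Auto Loader (cloudFiles) readStream usage.
    return re.findall(r'\.load\(\s*["\']([^"\']+)["\']', source_code)

sample = 'spark.readStream.format("cloudFiles").load("s3://bucket/raw/")'
print(extract_load_paths(sample))  # → ['s3://bucket/raw/']
```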


Thank You
Pradeep Singh - https://www.linkedin.com/in/dbxdev

SteveOstrowski
Databricks Employee

Hi @ajay_wavicle,

There are several approaches you can use to extract read/source paths from notebooks, including Auto Loader (cloudFiles) paths. The right choice depends on whether you want runtime lineage data or static code analysis.


APPROACH 1: UNITY CATALOG LINEAGE SYSTEM TABLES (RECOMMENDED)

If your workloads run against Unity Catalog, the lineage system tables automatically capture source and target paths at runtime, including cloud storage paths used by Auto Loader.

The two key tables are:

system.access.table_lineage
system.access.column_lineage

Both tables include a source_path column that captures cloud storage URIs (s3://, abfss://, gs://, etc.) and a source_type column that can be TABLE or PATH. They also include entity_type (NOTEBOOK, JOB, PIPELINE, etc.) and entity_id so you can trace which notebook performed the read.

Example query to find all source paths read by notebooks:

SELECT
  entity_type,
  entity_id,
  source_table_full_name,
  source_path,
  source_type,
  event_time
FROM system.access.table_lineage
WHERE source_type IS NOT NULL
  AND target_type IS NULL
  AND entity_type = 'NOTEBOOK'
ORDER BY event_time DESC

To find Auto Loader reads specifically, filter for PATH source types with cloud storage prefixes:

SELECT
  entity_id,
  source_path,
  event_time
FROM system.access.table_lineage
WHERE source_type = 'PATH'
  AND entity_type IN ('NOTEBOOK', 'PIPELINE')
ORDER BY event_time DESC

Notes on lineage system tables:
- Lineage data is retained for one year on a rolling basis
- Lineage is captured across all workspaces attached to a Unity Catalog metastore
- You need to enable system tables if you have not already

Documentation: https://docs.databricks.com/aws/en/admin/system-tables/lineage
Documentation: https://docs.databricks.com/aws/en/data-governance/unity-catalog/data-lineage


APPROACH 2: EXPORT AND PARSE NOTEBOOK SOURCE CODE WITH THE DATABRICKS SDK

If you need to extract paths from the notebook code itself (static analysis), you can use the Databricks SDK for Python to export notebooks and then parse the source code for read statements and cloudFiles paths.

Step 1: Install the SDK

pip install databricks-sdk

Step 2: List and export notebooks

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ExportFormat, ObjectType
import base64
import re

w = WorkspaceClient()

# List all objects in a folder (recursive=True walks subfolders too)
notebooks = w.workspace.list("/Users/your_folder/", recursive=True)

for item in notebooks:
    if item.object_type == ObjectType.NOTEBOOK:
        # Export the notebook source (returned base64-encoded)
        export_response = w.workspace.export(
            path=item.path,
            format=ExportFormat.SOURCE,
        )
        source_code = base64.b64decode(export_response.content).decode("utf-8")

        # Paths passed directly to .load("...") — covers Auto Loader
        # (cloudFiles) streams as well as batch reads
        autoloader_paths = re.findall(
            r'\.load\(\s*["\']([^"\']+)["\']',
            source_code,
        )
        # Paths set via .option("cloudFiles.path", "...")
        option_paths = re.findall(
            r'\.option\(\s*["\']cloudFiles\.path["\']\s*,\s*["\']([^"\']+)["\']',
            source_code,
        )
        # General spark.read ... .load("...") chains, capturing the format
        read_paths = re.findall(
            r'spark\.read[^)]*\.(csv|parquet|json|orc|delta|format)\([^)]*\)\s*\.load\(\s*["\']([^"\']+)["\']',
            source_code,
            re.DOTALL,
        )

        if autoloader_paths or option_paths or read_paths:
            print(f"Notebook: {item.path}")
            for p in autoloader_paths + option_paths:
                print(f"  Path found: {p}")
            for fmt, p in read_paths:
                print(f"  Read path ({fmt}): {p}")

Step 3: For more robust parsing of Auto Loader specifically, look for the common patterns:

# Pattern 1: readStream with cloudFiles format and load path
# spark.readStream.format("cloudFiles").option(...).load("s3://bucket/path")

# Pattern 2: cloudFiles.path option
# .option("cloudFiles.path", "s3://bucket/path")

# Pattern 3: In DLT/SDP notebooks using dlt.read_stream or @dlt.table
# spark.readStream.format("cloudFiles").load("/mnt/data/input")

Documentation: https://docs.databricks.com/aws/en/dev-tools/sdk-python.html


APPROACH 3: QUERY THE PIPELINES API FOR LAKEFLOW SPARK DECLARATIVE PIPELINES (SDP)

If your Auto Loader code runs inside Lakeflow Spark Declarative Pipelines (SDP), the pipeline definition and settings contain configuration that may reference source paths. You can retrieve this via the SDK:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# List all pipelines in the workspace
pipelines = w.pipelines.list_pipelines()

for pipeline in pipelines:
    detail = w.pipelines.get(pipeline_id=pipeline.pipeline_id)
    print(f"Pipeline: {detail.name}")
    print(f"  Libraries: {detail.spec.libraries}")
    print(f"  Configuration: {detail.spec.configuration}")

The configuration dictionary often contains cloudFiles.path or other source path settings that are passed into the pipeline notebooks.
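Once you have the configuration dictionary, you can filter it down to likely source paths. A sketch (the key/prefix heuristics are assumptions, not an official convention):

```python
def find_path_settings(configuration: dict) -> dict:
    # Keep only configuration entries whose key or value looks like a
    # source path: cloudFiles.* keys, or values with cloud-storage prefixes.
    prefixes = ("s3://", "abfss://", "gs://", "dbfs:/", "/mnt/")
    return {
        k: v
        for k, v in (configuration or {}).items()
        if k.startswith("cloudFiles.") or str(v).startswith(prefixes)
    }

conf = {
    "cloudFiles.path": "s3://bucket/landing/",
    "spark.sql.shuffle.partitions": "8",
}
print(find_path_settings(conf))  # → {'cloudFiles.path': 's3://bucket/landing/'}
```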


APPROACH 4: COMBINE WITH THE DATABRICKS CLI

You can also use the Databricks CLI to export notebooks in bulk for parsing:

databricks workspace export-dir /Users/ ./exported_notebooks --overwrite

Then use standard text search tools (grep, ripgrep, etc.) to find read paths:

grep -rn "cloudFiles" ./exported_notebooks/
grep -rn "\.load(" ./exported_notebooks/
grep -rn "spark\.read" ./exported_notebooks/


SUMMARY

For runtime lineage (what paths were actually read): use system.access.table_lineage, which captures cloud storage paths including Auto Loader sources.

For static code analysis (what paths are in the code): use the Databricks SDK to export notebook source and parse with regex or AST tools.

For pipeline-specific paths: query the Pipelines API for configuration values.

The lineage system tables approach is typically the most reliable because it captures what actually executed, rather than what the code text contains (which may use variables, widgets, or dynamic path construction).
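To see why text-based approaches fall short, consider a hypothetical notebook that builds its Auto Loader path at runtime (bucket name and widget are illustrative assumptions); no literal path ever appears in the source, but runtime lineage still records what was read:

```python
# Static text search can miss paths that are constructed at runtime.
env = "prod"  # in a real notebook this might come from dbutils.widgets.get("env")
source_path = f"s3://my-bucket/{env}/events/"  # bucket name is an assumption
# spark.readStream.format("cloudFiles").load(source_path)
print(source_path)  # → s3://my-bucket/prod/events/
```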

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.