DLT Auto Loader Reading from Parent S3 Folder not Sub Folders

FAHADURREHMAN
New Contributor II

Hi All, I am trying to read CSV files from one folder of an S3 bucket. For this particular use case, I do not intend to read from subfolders. I am using the code below, but it is reading all CSVs in the subfolders as well. How can I avoid that?
I have tried many different versions of the code below with help from ChatGPT, but none of them seems to work. Any help?

def source_config():
    src_path = BASE_S3_URI.rstrip("/")

    options = {
        "cloudFiles.format": "csv",
        "cloudFiles.schemaLocation": SCHEMA_LOCATION,
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        "cloudFiles.includeExistingFiles": "true",
        "cloudFiles.useNotifications": "false",
        "pathGlobFilter": "*.csv",
        "header": "true",
        "delimiter": ",",
        "quote": "\"",
        "multiLine": "false",
        # optional (can keep during debugging)
        "badRecordsPath": f"{SCHEMA_LOCATION}/bad_records",
        "columnNameOfCorruptRecord": "_corrupt_record",

        "cloudFiles.rescuedDataColumn": "_rescued_data",
    }

    return src_path, options

Saritha_S
Databricks Employee

Hi @FAHADURREHMAN 

Please find my findings below.

Since you're using Auto Loader (cloudFiles), this behavior is expected.

By default, when you provide a path like:

s3://bucket/folder/

Spark recursively reads all subfolders.
pathGlobFilter="*.csv" only filters file names — it does NOT prevent recursive directory traversal.
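This file-name-only matching can be mimicked locally with Python's fnmatch (the object keys below are made-up examples, not real bucket contents):

```python
from fnmatch import fnmatch
import posixpath

# Hypothetical object keys: one top-level CSV, one in a subfolder
keys = [
    "folder/a.csv",
    "folder/sub/b.csv",
]

# pathGlobFilter is applied to the file-name portion only, so a
# nested file still matches "*.csv" and is not filtered out
matches = [k for k in keys if fnmatch(posixpath.basename(k), "*.csv")]
```

Both keys survive the filter, which is why the glob alone does not stop subfolder reads.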

To overcome the issue, use one of the options below.

Use recursiveFileLookup = false:

.option("recursiveFileLookup", "false")

or

Use an explicit wildcard instead of the folder path:

s3://bucket/folder/*.csv
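Applied to the code in the question, the wildcard version might look like this (the bucket URI is a placeholder; substitute your own):

```python
# Hypothetical base URI; replace with your actual bucket/prefix
BASE_S3_URI = "s3://your-bucket/your-folder/"

# Append the wildcard so only top-level CSVs are matched
src_path = BASE_S3_URI.rstrip("/") + "/*.csv"
```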


FAHADURREHMAN
New Contributor II

Thanks @Saritha_S for your prompt feedback and support. The suggested option worked for me.

SteveOstrowski
Databricks Employee

Hi @FAHADURREHMAN,

This is expected behavior with Auto Loader. By default, when you point it at a directory path like s3://bucket/folder/, it will recursively traverse all subdirectories and pick up matching files. The pathGlobFilter option only filters by file name pattern; it does not prevent Auto Loader from descending into subfolders.

You have two options to restrict reading to only the top-level folder:

OPTION 1: SET recursiveFileLookup TO FALSE

Add this option to your configuration dictionary:

"recursiveFileLookup": "false"

So your options dict would include:

options = {
  "cloudFiles.format": "csv",
  "cloudFiles.schemaLocation": SCHEMA_LOCATION,
  "cloudFiles.inferColumnTypes": "true",
  "cloudFiles.schemaEvolutionMode": "addNewColumns",
  "cloudFiles.includeExistingFiles": "true",
  "cloudFiles.useNotifications": "false",
  "recursiveFileLookup": "false",
  "pathGlobFilter": "*.csv",
  "header": "true",
  "delimiter": ",",
  "quote": "\"",
  "multiLine": "false",
  "badRecordsPath": f"{SCHEMA_LOCATION}/bad_records",
  "columnNameOfCorruptRecord": "_corrupt_record",
  "cloudFiles.rescuedDataColumn": "_rescued_data",
}

When recursiveFileLookup is set to false, Auto Loader will only discover files in the immediate directory you specify, ignoring any subdirectories.

OPTION 2: USE A WILDCARD PATH INSTEAD OF A DIRECTORY

Instead of pointing to the folder:

src_path = "s3://your-bucket/your-folder/"

Use a wildcard that matches only top-level CSV files:

src_path = "s3://your-bucket/your-folder/*.csv"

This tells Auto Loader to only pick up files matching *.csv directly under that path, without descending into subfolders.
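As a local analogy (using pathlib on a throwaway directory rather than S3 itself), a non-recursive *.csv glob picks up only the top-level file, which mirrors what the wildcard path selects:

```python
import tempfile
import pathlib

# Build a throwaway layout: one top-level CSV, one in a subfolder
root = pathlib.Path(tempfile.mkdtemp())
(root / "top.csv").write_text("a,b\n1,2\n")
(root / "sub").mkdir()
(root / "sub" / "nested.csv").write_text("a,b\n3,4\n")

# Non-recursive glob, analogous to s3://bucket/folder/*.csv
top_only = sorted(p.name for p in root.glob("*.csv"))

# Recursive lookup, analogous to the default directory-path behavior
everything = sorted(p.name for p in root.rglob("*.csv"))
```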

Either approach will work. Option 1 is generally the cleaner solution when using Lakeflow Spark Declarative Pipelines (SDP, formerly known as DLT), since it keeps path handling simple and the behavior is controlled explicitly through configuration.

For reference, the full list of Auto Loader options is documented here:
https://docs.databricks.com/aws/ingestion/cloud-object-storage/auto-loader/options

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.