03-22-2023 09:41 AM
When I try setting the `pathGlobFilter` on my Autoloader job, it appears to filter out everything.
The bucket/directory setup is like
`s3a://my_bucket/level_1_dir/level_2_dir/<some_name>/one/two/<the_files_i_want_to_load>`
So what I want is to be able to provide a list of the names from which to load the data. Those directories all share the same subdirectory structure, and all the files (which may have arbitrary extensions and naming conventions) sit two directories down.
The following is my current best attempt at loading the contents of these directories. I just want to load the entire contents of each file into a single column in my dataframe -- and that part works fine without the filter.
```
from pyspark.sql.types import StructType, StructField, StringType

MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"
names = ["alice", "bob", "mallory"]
include_patterns = f"{{{','.join(names)}}}/one/two/*"

stream = (
    spark.readStream.format("cloudFiles")
    .schema(StructType([StructField("value", StringType(), True)]))
    .option("cloudFiles.format", "text")
    .option("wholeText", True)
    .option("cloudFiles.fetchParallelism", 8)
    .option("pathGlobFilter", include_patterns)
)

(
    stream.load(MY_S3_PATH)
    .writeStream.option("queryName", "my_loader_query")
    .trigger(availableNow=True)
    .toTable(my_table)
)
```
When I run this, the stream initializes and runs, but no data are processed. It appears to be filtering out everything (when I remove the filter, files get loaded like I expect).
I'm looking for a fix, but also to understand where I can look for information about what is actually being found/filtered.
Accepted Solutions
03-22-2023 02:44 PM
The thing that actually worked for me was to skip the `pathGlobFilter` option and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.
This portion of the docs could use some editing, imo.
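To make the fix concrete, here is a minimal sketch of how the glob ends up embedded in the load path rather than in `pathGlobFilter`. The bucket layout and names are the hypothetical examples from the question; only the path construction below is actually exercised here, and the Auto Loader calls are shown as comments.

```python
# Build the brace-alternation glob and append it to the base path.
# Hadoop's glob support expands {a,b,c} when load() lists the path.
MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"
names = ["alice", "bob", "mallory"]

include_patterns = f"{{{','.join(names)}}}/one/two/*"
load_path = f"{MY_S3_PATH}{include_patterns}"
print(load_path)
# s3a://my_bucket/level_1_dir/level_2_dir/{alice,bob,mallory}/one/two/*

# Then, instead of .option("pathGlobFilter", ...), pass the glob to load():
#   stream.load(load_path)
```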

