Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

How to debug Autoloader with `pathGlobFilter` option producing empty dataframe

bd
New Contributor III

When I try setting the `pathGlobFilter` on my Autoloader job, it appears to filter out everything.

The bucket/directory setup is like

`s3a://my_bucket/level_1_dir/level_2_dir/<some_name>/one/two/<the_files_i_want_to_load>`

What I want is to provide a list of the names from which to load the data. Those directories all share the same subdirectory structure, and all the files (which may have arbitrary extensions and naming conventions) sit two directories down.
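For concreteness, Hadoop-style globs express "one of these names" with brace alternation, so a pattern over a hypothetical list of names can be built like this (pure string formatting, no Spark required):

```python
# Hypothetical names list, as in the post; Hadoop glob alternation
# syntax is {a,b,c}, so joining the names with commas inside braces
# yields the directory-matching pattern.
names = ["alice", "bob", "mallory"]
include_patterns = "{" + ",".join(names) + "}/one/two/*"
print(include_patterns)  # {alice,bob,mallory}/one/two/*
```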

The following is my current best attempt at loading the contents of these directories. I just want to load the entire contents of each file into a single column in my dataframe -- and that part works fine without the filter.

```
from pyspark.sql.types import StructType, StructField, StringType

MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"

names = ["alice", "bob", "mallory"]
include_patterns = f"/{{{','.join(names)}}}/one/two/*"

stream = (
    spark.readStream.format("cloudFiles")
    .schema(StructType([StructField("value", StringType(), True)]))
    .option("cloudFiles.format", "text")
    .option("wholeText", True)
    .option("cloudFiles.fetchParallelism", 8)
    .option("pathGlobFilter", include_patterns)
)

(
    stream.load(MY_S3_PATH)
    .writeStream
    .option("queryName", "my_loader_query")
    .trigger(availableNow=True)
    .toTable(my_table)
)
```

When I run this, the stream initializes and runs, but no data are processed. It appears to be filtering out everything (when I remove the filter, files get loaded like I expect).

I'm looking for a fix, but also to understand where I can look for information about what is actually being found/filtered.
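One plausible explanation (worth checking against the Spark generic file source options docs): `pathGlobFilter` is matched against the file name alone, not the full path, so any pattern containing directory segments can never match. Python's `fnmatch` is only a stand-in for Hadoop's `GlobFilter` here (it lacks brace alternation), but it illustrates the filename-only matching problem:

```python
from fnmatch import fnmatch

# If the filter is applied to the bare file name, a pattern that
# contains '/' segments matches nothing, while a name-only pattern works.
filename = "some_file.txt"       # hypothetical file name
dir_pattern = "*/one/two/*"      # pattern with directory segments
name_pattern = "*.txt"           # pattern over the name only

print(fnmatch(filename, dir_pattern))   # False: no '/' in a bare name
print(fnmatch(filename, name_pattern))  # True
```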

1 ACCEPTED SOLUTION


bd
New Contributor III

The thing that actually worked for me was to skip `pathGlobFilter` and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.

This portion of the docs could use some editing, imo.
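The path the accepted solution passes to `load` can be previewed with plain string formatting (the bucket name and glob below are the hypothetical ones from the post):

```python
MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"
include_patterns = "{alice,bob,mallory}/one/two/*"

# The glob becomes part of the load path itself, so directory-level
# filtering happens during file discovery rather than via pathGlobFilter.
load_path = f"{MY_S3_PATH}{include_patterns}"
print(load_path)
# s3a://my_bucket/level_1_dir/level_2_dir/{alice,bob,mallory}/one/two/*
```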


