03-22-2023 09:41 AM
When I try setting the `pathGlobFilter` on my Autoloader job, it appears to filter out everything.
The bucket/directory setup is like
`s3a://my_bucket/level_1_dir/level_2_dir/<some_name>/one/two/<the_files_i_want_to_load>`
So what I want is to be able to provide a list of the names from which to load the data. Those directories all share the same subdirectory structure, and all the files (which may have arbitrary extensions and naming conventions) sit two directories down.
The following is my current best attempt at loading the contents of these directories. I just want to load the entire contents of each file into a single column in my dataframe -- and that part works fine without the filter.
```
from pyspark.sql.types import StructType, StructField, StringType

MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"
names = ["alice", "bob", "mallory"]
include_patterns = f"{{{','.join(names)}}}/one/two/*"

stream = (
    spark.readStream.format("cloudFiles")
    .schema(StructType([StructField("value", StringType(), True)]))
    .option("cloudFiles.format", "text")
    .option("wholeText", True)
    .option("cloudFiles.fetchParallelism", 8)
    .option("pathGlobFilter", include_patterns)
)

(
    stream.load(MY_S3_PATH)
    .writeStream.option("queryName", "my_loader_query")
    .trigger(availableNow=True)
    .toTable(my_table)
)
```
When I run this, the stream initializes and runs, but no data are processed. It appears to be filtering out everything (when I remove the filter, files get loaded like I expect).
I'm looking for a fix, but also to understand where I can look for information about what is actually being found/filtered.
Accepted Solutions
03-22-2023 02:44 PM
The thing that actually worked for me was to skip the `pathGlobFilter` option and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.
This portion of the docs could use some editing, imo.
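To make the fix concrete, here is a minimal sketch of how the glob ends up embedded in the load path rather than in `pathGlobFilter`. The bucket layout and names are the hypothetical examples from the question; only the path construction below is actually exercised here, and the Auto Loader calls are shown as comments.

```python
# Build the brace-alternation glob and append it to the base path.
# Hadoop's glob support expands {a,b,c} when load() lists the path.
MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"
names = ["alice", "bob", "mallory"]

include_patterns = f"{{{','.join(names)}}}/one/two/*"
load_path = f"{MY_S3_PATH}{include_patterns}"
print(load_path)
# s3a://my_bucket/level_1_dir/level_2_dir/{alice,bob,mallory}/one/two/*

# Then, instead of .option("pathGlobFilter", ...), pass the glob to load():
#   stream.load(load_path)
```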

