<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to debug Autoloader with `pathGlobFilter` option producing empty dataframe in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</link>
    <description>&lt;P&gt;The thing that actually worked for me was to skip `pathGlobFilter` and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This portion of the docs could use some editing, imo.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 22 Mar 2023 21:44:23 GMT</pubDate>
    <dc:creator>bd</dc:creator>
    <dc:date>2023-03-22T21:44:23Z</dc:date>
    <item>
      <title>How to debug Autoloader with `pathGlobFilter` option producing empty dataframe</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7244#M116</link>
      <description>&lt;P&gt;When I try setting the `pathGlobFilter` option on my Autoloader job, it appears to filter out everything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The bucket/directory setup looks like&lt;/P&gt;&lt;P&gt;`s3a://my_bucket/level_1_dir/level_2_dir/&amp;lt;some_name&amp;gt;/one/two/&amp;lt;the_files_i_want_to_load&amp;gt;`&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I want is to provide a list of the names from which to load the data. Those directories all share the same subdirectory structure, and all the files (which may have arbitrary extensions and naming conventions) sit two directories down.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The following is my current best attempt at loading the contents of these directories. I just want to load the entire contents of each file into a single column in my dataframe -- and that part works fine without the filter.&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"&lt;/P&gt;&lt;P&gt;names = ["alice", "bob", "mallory"]&lt;/P&gt;&lt;P&gt;# produces the glob "/{alice,bob,mallory}/one/two/*"&lt;/P&gt;&lt;P&gt;include_patterns = f"/{{{','.join(names)}}}/one/two/*"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;stream = (&lt;/P&gt;&lt;P&gt;    spark.readStream.format("cloudFiles")&lt;/P&gt;&lt;P&gt;    .schema(StructType([StructField("value", StringType(), True)]))&lt;/P&gt;&lt;P&gt;    .option("cloudFiles.format", "text")&lt;/P&gt;&lt;P&gt;    .option("wholeText", True)&lt;/P&gt;&lt;P&gt;    .option("cloudFiles.fetchParallelism", 8)&lt;/P&gt;&lt;P&gt;    .option("pathGlobFilter", include_patterns)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;    stream.load(MY_S3_PATH)&lt;/P&gt;&lt;P&gt;    .writeStream.option("queryName", "my_loader_query")&lt;/P&gt;&lt;P&gt;    .trigger(availableNow=True)&lt;/P&gt;&lt;P&gt;    .toTable(my_table)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When I run this, the stream initializes and runs, but no data are processed. It appears to be filtering out everything (when I remove the filter, files load as I expect).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm looking for a fix, but also to understand where I can look for information about what is actually being found/filtered.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 16:41:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7244#M116</guid>
      <dc:creator>bd</dc:creator>
      <dc:date>2023-03-22T16:41:22Z</dc:date>
    </item>
    <item>
      <title>Re: How to debug Autoloader with `pathGlobFilter` option producing empty dataframe</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</link>
      <description>&lt;P&gt;The thing that actually worked for me was to skip `pathGlobFilter` and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This portion of the docs could use some editing, imo.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 21:44:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</guid>
      <dc:creator>bd</dc:creator>
      <dc:date>2023-03-22T21:44:23Z</dc:date>
    </item>
  </channel>
</rss>