I'm working in the Google Cloud environment. I have an Auto Loader job that uses cloud file notifications (`cloudFiles.useNotifications`) to load data into a Delta table. I want to filter the files coming off the PubSub subscription based on the path in GCS where they are located, not just the file name. I can successfully filter on the file name, but as soon as I try to filter on the path, I get an empty DataFrame.
path_of_file = "gs://my_bucket/dir1/dir2/test_data.json"
glob_filter1 = "*.json"
glob_filter2 = "*dir2*.json"
glob_filter3 = "**dir2**.json"
glob_filter4 = "*/dir2/*.json"
df = (spark
    .readStream
    .schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.projectId", "<MY PROJECT ID>")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.subscription", "<MY SUBSCRIPTION ID>")
    .option("checkpointLocation", check_point_location)
    .option("pathGlobFilter", <GLOB FILTER>)
    .load()
)
When I use `glob_filter1` as the `pathGlobFilter` option, Auto Loader runs successfully and loads the expected file. When I use `glob_filter2`, `glob_filter3`, or `glob_filter4`, Auto Loader runs but filters out the expected file. Before each test I confirm that the expected notification is on the PubSub subscription, and after the test I confirm that it has been acked.
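For reference, this is roughly how I confirm the notification is there before each run; a minimal sketch using the `google-cloud-pubsub` client (the project and subscription IDs are the same placeholders as above), pulling without acking so the messages are redelivered after the ack deadline:

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("<MY PROJECT ID>", "<MY SUBSCRIPTION ID>")

# Pull a few messages without acking them; GCS notifications carry the
# object path in the "objectId" attribute.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 10},
    timeout=10.0,
)
for received in response.received_messages:
    attrs = received.message.attributes
    print(attrs.get("bucketId"), attrs.get("objectId"), attrs.get("eventType"))
```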
The documentation calls the option a glob filter, and everywhere else in the documentation glob filters can match against the full path. Am I doing something wrong? Does `pathGlobFilter` only match the file name and not the full path?
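For what it's worth, the patterns themselves do match the full path under plain glob semantics. This is the sanity check I ran with Python's `fnmatch` (its dialect lets `*` cross `/`, which Hadoop-style globs may not, so it only shows the patterns are plausible, not how Auto Loader applies them):

```python
from fnmatch import fnmatch

path_of_file = "gs://my_bucket/dir1/dir2/test_data.json"
file_name = "test_data.json"

for pattern in ["*.json", "*dir2*.json", "**dir2**.json", "*/dir2/*.json"]:
    # All four patterns match the full path, but only "*.json" matches
    # the bare file name -- consistent with what I observe.
    print(pattern, fnmatch(path_of_file, pattern), fnmatch(file_name, pattern))
```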