<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Does the `pathGlobFilter` option work on the entire file path or just the file name? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4596#M1277</link>
    <description>&lt;P&gt;Thank you for confirming what I observed that differed from the documentation.&lt;/P&gt;</description>
    <pubDate>Wed, 10 May 2023 13:37:34 GMT</pubDate>
    <dc:creator>Ryan512</dc:creator>
    <dc:date>2023-05-10T13:37:34Z</dc:date>
    <item>
      <title>Does the `pathGlobFilter` option work on the entire file path or just the file name?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4594#M1275</link>
      <description>&lt;P&gt;I'm working in the Google Cloud environment. I have an Auto Loader job that uses cloud file notifications to load data into a Delta table. I want to filter the files from the Pub/Sub topic based on their path in GCS, not just the file name. Filtering on the file name works, but if I try to filter on the path, I get an empty Dataset.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;path_of_file = "gs://my_bucket/dir1/dir2/test_data.json"
# Candidate values for the pathGlobFilter option
glob_filter1 = "*.json"
glob_filter2 = "*dir2*.json"
glob_filter3 = "**dir2**.json"
glob_filter4 = "*/dir2/*.json"
# Auto Loader stream in file notification mode
df = (
    spark.readStream.schema(schema)
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.projectId", "&amp;lt;MY PROJECT ID&amp;gt;")
    .option("cloudFiles.useNotifications", "true")
    .option("checkpointLocation", check_point_location)
    .option("cloudFiles.includeExistingFiles", "true")
    .option("cloudFiles.subscription", "&amp;lt;MY SUBSCRIPTION ID&amp;gt;")
    .option("pathGlobFilter", &amp;lt;GLOB FILTER&amp;gt;)  # one of the glob filters above
    .load()
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;When I use `glob_filter1` as the `pathGlobFilter` option, Auto Loader runs and loads the expected file. When I use `glob_filter2`, `glob_filter3`, or `glob_filter4`, Auto Loader runs but filters out the expected file. Before each test I confirm that the expected notification is in the Pub/Sub topic, and after the test I confirm that it has been acked on the topic.&lt;/P&gt;&lt;P&gt;The &lt;A href="https://docs.gcp.databricks.com/ingestion/auto-loader/options.html#generic-options" alt="https://docs.gcp.databricks.com/ingestion/auto-loader/options.html#generic-options" target="_blank"&gt;documentation&lt;/A&gt; refers to it as a glob filter, and everywhere else in the documentation a glob filter can match on the full path. Am I doing something wrong? Does `pathGlobFilter` only work on the file name and not the full path?&lt;/P&gt;</description>
      <pubDate>Tue, 09 May 2023 20:19:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4594#M1275</guid>
      <dc:creator>Ryan512</dc:creator>
      <dc:date>2023-05-09T20:19:45Z</dc:date>
    </item>
    <item>
      <title>Re: Does the `pathGlobFilter` option work on the entire file path or just the file name?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4595#M1276</link>
      <description>&lt;P&gt;pathGlobFilter includes only files whose file names match the pattern; it is not applied to the full path. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.&lt;/P&gt;&lt;P&gt;To load files with paths matching a given glob pattern while keeping the behavior of partition discovery, you can use:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val testGlobFilterDF = spark.read.format("parquet")
  .option("pathGlobFilter", "*.parquet") // json file should be filtered out
  .load("examples/src/main/resources/dir1")
testGlobFilterDF.show()
// +-------------+
// |         file|
// +-------------+
// |file1.parquet|
// +-------------+&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 10 May 2023 11:04:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4595#M1276</guid>
      <dc:creator>padmajaa</dc:creator>
      <dc:date>2023-05-10T11:04:39Z</dc:date>
    </item>
    <item>
      <title>Re: Does the `pathGlobFilter` option work on the entire file path or just the file name?</title>
      <link>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4596#M1277</link>
      <description>&lt;P&gt;Thank you for confirming what I observed that differed from the documentation.&lt;/P&gt;</description>
      <pubDate>Wed, 10 May 2023 13:37:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/does-the-pathglobfilter-option-work-on-the-entire-file-path-or/m-p/4596#M1277</guid>
      <dc:creator>Ryan512</dc:creator>
      <dc:date>2023-05-10T13:37:34Z</dc:date>
    </item>
  </channel>
</rss>

