<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to debug Autoloader with `pathGlobFilter` option producing empty dataframe in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</link>
    <description>&lt;P&gt;The thing that actually worked for me was to skip `pathGlobFilter` and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This portion of the docs could use some editing, imo.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 22 Mar 2023 21:44:23 GMT</pubDate>
    <dc:creator>bd</dc:creator>
    <dc:date>2023-03-22T21:44:23Z</dc:date>
    <item>
      <title>How to debug Autoloader with `pathGlobFilter` option producing empty dataframe</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7244#M116</link>
      <description>&lt;P&gt;When I try setting the `pathGlobFilter` option on my Autoloader job, it appears to filter out everything.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The bucket/directory setup looks like&lt;/P&gt;&lt;P&gt;`s3a://my_bucket/level_1_dir/level_2_dir/&amp;lt;some_name&amp;gt;/one/two/&amp;lt;the_files_i_want_to_load&amp;gt;`&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I want is to provide a list of the names from which to load the data. Those directories all share the same subdirectory structure, and all the files (which may have arbitrary extensions and naming conventions) sit two directories down.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The following is my current best attempt at loading the contents of these directories. I just want to load the entire contents of each file into a single column in my dataframe -- and that part works fine without the filter.&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;MY_S3_PATH = "s3a://my_bucket/level_1_dir/level_2_dir/"&lt;/P&gt;&lt;P&gt;names = ["alice", "bob", "mallory"]&lt;/P&gt;&lt;P&gt;# produces the glob "/{alice,bob,mallory}/one/two/*"&lt;/P&gt;&lt;P&gt;include_patterns = f"/{{{','.join(names)}}}/one/two/*"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;stream = (&lt;/P&gt;&lt;P&gt;    spark.readStream.format("cloudFiles")&lt;/P&gt;&lt;P&gt;    .schema(StructType([StructField("value", StringType(), True)]))&lt;/P&gt;&lt;P&gt;    .option("cloudFiles.format", "text")&lt;/P&gt;&lt;P&gt;    .option("wholeText", True)&lt;/P&gt;&lt;P&gt;    .option("cloudFiles.fetchParallelism", 8)&lt;/P&gt;&lt;P&gt;    .option("pathGlobFilter", include_patterns)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;    stream.load(MY_S3_PATH)&lt;/P&gt;&lt;P&gt;    .writeStream.option("queryName", "my_loader_query")&lt;/P&gt;&lt;P&gt;    .trigger(availableNow=True)&lt;/P&gt;&lt;P&gt;    .toTable(my_table)&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When I run this, the stream initializes and runs, but no data are processed. It appears to be filtering out everything (when I remove the filter, files load as I expect).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm looking for a fix, but also to understand where I can look for information about what is actually being found/filtered.&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 16:41:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7244#M116</guid>
      <dc:creator>bd</dc:creator>
      <dc:date>2023-03-22T16:41:22Z</dc:date>
    </item>
    <item>
      <title>Re: How to debug Autoloader with `pathGlobFilter` option producing empty dataframe</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</link>
      <description>&lt;P&gt;The thing that actually worked for me was to skip `pathGlobFilter` and do the filtering in the `load` invocation instead: `stream.load(f"{MY_S3_PATH}{include_patterns}")`.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This portion of the docs could use some editing, imo.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 21:44:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/how-to-debug-autoloader-with-pathglobfilter-option-producing/m-p/7245#M117</guid>
      <dc:creator>bd</dc:creator>
      <dc:date>2023-03-22T21:44:23Z</dc:date>
    </item>
  </channel>
</rss>