<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to filter files in Databricks Autoloader stream in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12089#M6948</link>
    <description>&lt;P&gt;Strange. Maybe it is because of this:&lt;/P&gt;&lt;P&gt;&lt;I&gt;"The glob pattern will have * appended to it"&lt;/I&gt; (for the file path).&lt;/P&gt;&lt;P&gt;Or use *_INPUT* as the file filter.&lt;/P&gt;</description>
    <pubDate>Fri, 29 Oct 2021 11:22:21 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-10-29T11:22:21Z</dc:date>
    <item>
      <title>How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12085#M6944</link>
      <description>&lt;P&gt;I want to set up an S3 stream using &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#configuration-1" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#configuration-1" target="_blank"&gt;Databricks Auto Loader&lt;/A&gt;. I have managed to set up the stream, but my S3 bucket contains different types of JSON files. I want to filter them, preferably in the stream source itself rather than with a separate filter operation.&lt;/P&gt;&lt;P&gt;According to &lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#use-cloudfiles-source" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#use-cloudfiles-source" target="_blank"&gt;the docs&lt;/A&gt;, I should be able to filter using a glob pattern. However, I can't get this to work: it loads everything anyway.&lt;/P&gt;&lt;P&gt;This is what I have:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import functions as F

df = (
  spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.inferColumnTypes", "true")
  .option("cloudFiles.schemaInference.sampleSize.numFiles", 1000)
  .option("cloudFiles.schemaLocation", "dbfs:/auto-loader/schemas/")
  .option("cloudFiles.includeExistingFiles", "true")
  .option("multiLine", "true")
  .option("inferSchema", "true")
#   .option("cloudFiles.schemaHints", schemaHints)
#  .load("s3://&amp;lt;BUCKET&amp;gt;/qualifier/**/*_INPUT")
  .load("s3://&amp;lt;BUCKET&amp;gt;/qualifier")
  .withColumn("filePath", F.input_file_name())
  .withColumn("date_ingested", F.current_timestamp())
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;My file keys are structured as&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;qualifier/version/YYYY-MM/DD/&amp;lt;NAME&amp;gt;_INPUT.json&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;so I want to keep only the files whose names contain &lt;I&gt;_INPUT&lt;/I&gt;. This loads everything:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.load("s3://&amp;lt;BUCKET&amp;gt;/qualifier")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;and this is what I want to do, but it doesn't work:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.load("s3://&amp;lt;BUCKET&amp;gt;/qualifier/**/*_INPUT")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Is my glob pattern incorrect, or is there something else I am missing?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 06:46:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12085#M6944</guid>
      <dc:creator>kaslan</dc:creator>
      <dc:date>2021-10-29T06:46:45Z</dc:date>
    </item>
    <item>
      <title>Re: How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12087#M6946</link>
      <description>&lt;P&gt;According to the docs you linked, the glob pattern on the input path only applies to directories, not to the files themselves.&lt;/P&gt;&lt;P&gt;So if you want to select only certain files within those directories, you can add a filter through the pathGlobFilter option:&lt;/P&gt;&lt;P&gt;.option("pathGlobFilter", "*_INPUT")&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#use-cloudfiles-source" alt="https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#use-cloudfiles-source" target="_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/auto-loader-s3.html#use-cloudfiles-source&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 08:29:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12087#M6946</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-29T08:29:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12088#M6947</link>
      <description>&lt;P&gt;Ah yeah, I forgot to mention that I tried that as well. It still picks up other files when I do that.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 09:08:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12088#M6947</guid>
      <dc:creator>kaslan</dc:creator>
      <dc:date>2021-10-29T09:08:05Z</dc:date>
    </item>
    <item>
      <title>Re: How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12089#M6948</link>
      <description>&lt;P&gt;Strange. Maybe it is because of this:&lt;/P&gt;&lt;P&gt;&lt;I&gt;"The glob pattern will have * appended to it"&lt;/I&gt; (for the file path).&lt;/P&gt;&lt;P&gt;Or use *_INPUT* as the file filter.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Oct 2021 11:22:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12089#M6948</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-29T11:22:21Z</dc:date>
    </item>
    <item>
      <title>Re: How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12090#M6949</link>
      <description>&lt;P&gt;Yeah, maybe. But that would mean that all files that contain &lt;I&gt;INPUT&lt;/I&gt; would still be included, right?&lt;/P&gt;</description>
      <pubDate>Tue, 02 Nov 2021 06:18:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12090#M6949</guid>
      <dc:creator>kaslan</dc:creator>
      <dc:date>2021-11-02T06:18:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to filter files in Databricks Autoloader stream</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12091#M6950</link>
      <description>&lt;P&gt;No, if you explicitly include the underscore, a plain INPUT will not be selected.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Nov 2021 08:37:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-filter-files-in-databricks-autoloader-stream/m-p/12091#M6950</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-02T08:37:55Z</dc:date>
    </item>
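The pattern behaviour discussed in this thread can be checked locally before touching the stream. The sketch below uses Python's standard fnmatch module as a rough stand-in for the glob matching applied to file names. That is an assumption: pathGlobFilter in Spark follows Hadoop glob rules, which should agree with fnmatch for simple patterns like these but are not guaranteed to agree in general. The file names are hypothetical; only the _INPUT.json suffix comes from the thread.

```python
from fnmatch import fnmatchcase

# Hypothetical file names; only the "_INPUT.json" suffix shape comes from the thread.
names = ["orders_INPUT.json", "orders_OUTPUT.json", "INPUT.json", "orders_INPUT"]


def matching(pattern, candidates):
    """Return the candidates whose whole name matches the glob pattern."""
    return [n for n in candidates if fnmatchcase(n, pattern)]


# "*_INPUT" requires the name to END in "_INPUT", so "*_INPUT.json" files are missed.
exact_suffix = matching("*_INPUT", names)

# "*_INPUT*" allows anything after "_INPUT", but still requires the underscore.
with_trailing_star = matching("*_INPUT*", names)

# Spelling out the extension is the strictest of the three.
with_extension = matching("*_INPUT.json", names)

print(exact_suffix)        # ['orders_INPUT']
print(with_trailing_star)  # ['orders_INPUT.json', 'orders_INPUT']
print(with_extension)      # ['orders_INPUT.json']
```

Under these assumptions, a file named plain INPUT.json is excluded by every pattern that contains the underscore, which matches the last reply above, and a pattern ending in _INPUT with no trailing wildcard or extension would not match the .json files at all.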
  </channel>
</rss>

