<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Spark streaming auto loader wildcard not working in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-auto-loader-wildcard-not-working/m-p/45725#M27971</link>
    <description>&lt;P&gt;I need some help with an issue loading a subdirectory from an S3 bucket using Auto Loader. For example:&lt;/P&gt;&lt;P&gt;S3://path1/path2/databases*/paths/&lt;/P&gt;&lt;P&gt;There are several versions of the database under that path. For example:&lt;/P&gt;&lt;P&gt;path1/path2/database_v1/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;path1/path2/database_v2/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;path1/path2/database_v3/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;What's happening? Somehow Auto Loader takes "database*" literally as a directory name. When it does not find that path, it moves back up the path:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;"Listing s3://path1..."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;It then stays stuck in that listing, because between path1 and sub_path/*.parquet there are many different schemas to explore.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I already tried "cloudFiles.recursiveFileLookup": "true".&lt;/P&gt;&lt;P&gt;I also tried passing a list of paths, but Databricks does not support a list of directories.&lt;/P&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;autoloader_options = {
"cloudFiles.format": "parquet",
"cloudFiles.schemaLocation":f'{defs["schema_checkpoint_name"]}'
}

# AutoLoader
readstream_dataframe_autoloader = (
    spark.readStream
    .format("cloudFiles")
    .options(**autoloader_options)
    .load(
        'S3://path1/path2/databases*/sub_path/bank_fee'
    )
)

# Without Auto Loader this works perfectly, but the project requires using the Auto Loader feature.

df_transaction = (
   spark.readStream
   .format("parquet")
   .option("rowsPerSecond", 100)
   .schema(&amp;lt;someschema&amp;gt;)
   .load("S3://path1/path2/databases*/paths/") 
   )&lt;/LI-CODE&gt;</description>
    <pubDate>Fri, 22 Sep 2023 17:56:00 GMT</pubDate>
    <dc:creator>Jozhua</dc:creator>
    <dc:date>2023-09-22T17:56:00Z</dc:date>
    <item>
      <title>Spark streaming auto loader wildcard not working</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-auto-loader-wildcard-not-working/m-p/45725#M27971</link>
      <description>&lt;P&gt;I need some help with an issue loading a subdirectory from an S3 bucket using Auto Loader. For example:&lt;/P&gt;&lt;P&gt;S3://path1/path2/databases*/paths/&lt;/P&gt;&lt;P&gt;There are several versions of the database under that path. For example:&lt;/P&gt;&lt;P&gt;path1/path2/database_v1/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;path1/path2/database_v2/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;path1/path2/database_v3/sub_path/*.parquet&lt;/P&gt;&lt;P&gt;What's happening? Somehow Auto Loader takes "database*" literally as a directory name. When it does not find that path, it moves back up the path:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;"Listing s3://path1..."&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;It then stays stuck in that listing, because between path1 and sub_path/*.parquet there are many different schemas to explore.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I already tried "cloudFiles.recursiveFileLookup": "true".&lt;/P&gt;&lt;P&gt;I also tried passing a list of paths, but Databricks does not support a list of directories.&lt;/P&gt;&lt;P&gt;Code:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;autoloader_options = {
"cloudFiles.format": "parquet",
"cloudFiles.schemaLocation":f'{defs["schema_checkpoint_name"]}'
}

# AutoLoader
readstream_dataframe_autoloader = (
    spark.readStream
    .format("cloudFiles")
    .options(**autoloader_options)
    .load(
        'S3://path1/path2/databases*/sub_path/bank_fee'
    )
)
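
# Hedged side check, assuming the directory names and the pattern are exactly as
# written in this post. Python's fnmatch stands in here for glob matching on a
# single path segment; it is not Auto Loader's own matcher, only an illustration.
from fnmatch import fnmatchcase

example_dirs = ["database_v1", "database_v2", "database_v3"]  # names from the post


def matching_dirs(pattern, names):
    """Return the directory names that a single-segment glob pattern matches."""
    return [name for name in names if fnmatchcase(name, pattern)]


# "databases*" requires a literal "s" right after "database", so it matches none.
dirs_for_posted_pattern = matching_dirs("databases*", example_dirs)
# "database*" (a hypothetical alternative) matches all three version directories.
dirs_for_alternative_pattern = matching_dirs("database*", example_dirs)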

# Without Auto Loader this works perfectly, but the project requires using the Auto Loader feature.

df_transaction = (
   spark.readStream
   .format("parquet")
   .option("rowsPerSecond", 100)
   .schema(&amp;lt;someschema&amp;gt;)
   .load("S3://path1/path2/databases*/paths/") 
   )&lt;/LI-CODE&gt;</description>
      <pubDate>Fri, 22 Sep 2023 17:56:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-auto-loader-wildcard-not-working/m-p/45725#M27971</guid>
      <dc:creator>Jozhua</dc:creator>
      <dc:date>2023-09-22T17:56:00Z</dc:date>
    </item>
  </channel>
</rss>

