<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Passing multiple paths to .load in autoloader in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69088#M33805</link>
    <description>&lt;P&gt;Can I create multiple streams that run at the same time? Or do I have to wait for one stream to finish before starting another? So if I have 10 different storage containers, can I create 10 streams that run at the same time?&lt;/P&gt;</description>
    <pubDate>Wed, 15 May 2024 13:37:40 GMT</pubDate>
    <dc:creator>TimB</dc:creator>
    <dc:date>2024-05-15T13:37:40Z</dc:date>
    <item>
      <title>Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/68874#M33758</link>
      <description>&lt;P&gt;I am trying to use Auto Loader to load data from two different blobs within the same account so that Spark will discover the data asynchronously. However, when I try this, it fails with the error shown below. Can anyone point out where I am going wrong, or suggest an alternative way to achieve this?&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;To use 'cloudFiles' as a streaming source, please provide the file format with the option 'cloudFiles.format', and use .load() to create your DataFrame.&lt;/LI-CODE&gt;&lt;LI-CODE lang="python"&gt;df = (spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("cloudFiles.backfillInterval", "1 day")
  .load("wasbs://customer1@container_name.*csv", "wasbs://customer2@container_name.*csv")
)&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 13 May 2024 11:55:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/68874#M33758</guid>
      <dc:creator>TimB</dc:creator>
      <dc:date>2024-05-13T11:55:38Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69082#M33802</link>
      <description>&lt;P&gt;Hello Tim, as you will note from the &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss/api/pyspark.sql.streaming.DataStreamReader.load.html" target="_self"&gt;Spark Streaming&lt;/A&gt; docs, the load function accepts only one string for the path argument; a second positional argument is interpreted as the format, not as another path. This means that all files need to be detectable from the same base path if you wish to do this in a single stream. You can then use &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/auto-loader/patterns#filtering-directories-or-files-using-glob-patterns" target="_self"&gt;glob patterns&lt;/A&gt; to pick up the files from that base path.&lt;/P&gt;
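&lt;P&gt;As a minimal sketch of the single-path approach described above, the snippet below builds one glob string that covers several customer directories under a shared base path and passes it to a single load call. It assumes the files are laid out as one directory per customer under one base path. The helper names (build_glob_path, read_all_customers), the container and storage account names, and the schema location parameter are hypothetical and do not come from this thread.&lt;/P&gt;

```python
def build_glob_path(base_path, customers, file_glob="*.csv"):
    """Return one glob string covering every customer directory under base_path."""
    if not customers:
        raise ValueError("customers must not be empty")
    # {a,b,c} is glob alternation; a single name needs no braces.
    if len(customers) == 1:
        names = customers[0]
    else:
        names = "{" + ",".join(customers) + "}"
    return f"{base_path.rstrip('/')}/{names}/{file_glob}"


def read_all_customers(spark, base_path, customers, schema_location):
    """Build one Auto Loader stream over all customer directories.

    Not called in this sketch: it needs a Spark session on Databricks.
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", schema_location)
        .load(build_glob_path(base_path, customers))  # exactly one path string
    )


# Hypothetical container and storage account names.
base = "abfss://ingest@myaccount.dfs.core.windows.net/customers"
print(build_glob_path(base, ["customer1", "customer2"]))
```

&lt;P&gt;Only build_glob_path runs here. A single glob like this does not help when each customer lives in a separate container, since those locations do not share a base path.&lt;/P&gt;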
&lt;P&gt;You have 2 options here:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;[Recommended] Move the upstream data to ADLSg2 so that you have a hierarchical namespace (also, &lt;SPAN&gt;Microsoft has deprecated the Windows Azure Storage Blob driver (WASB) for Azure Blob Storage &lt;/SPAN&gt;&lt;SPAN&gt;in favor of the Azure Blob Filesystem driver (ABFS), so you should move off it anyway -- &lt;A href="https://docs.databricks.com/en/archive/storage/wasb-blob.html" target="_self"&gt;docs&lt;/A&gt;). Then use one container in one ADLSg2 storage account, with a subdirectory for each customer. Now you can easily glob all customers together in a single string from the same base path.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Your only other option is to create a separate stream for each customer, which won't scale as well as a single stream for all customers, though it is still a workable solution.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 15 May 2024 13:12:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69082#M33802</guid>
      <dc:creator>Corbin</dc:creator>
      <dc:date>2024-05-15T13:12:29Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69083#M33803</link>
      <description>&lt;P&gt;Thanks for your feedback, Corbin. I am aware of the deprecation, but the type of blob storage is beyond my control in this case, I'm afraid. Furthermore, the business logic dictates that there be a single blob storage per customer, so this cannot be changed either.&lt;BR /&gt;&lt;BR /&gt;Therefore, it looks like option 2 is my only choice. Is there a way to create separate streams dynamically from a list of customers and still utilise the asynchronous nature of Spark?&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 13:17:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69083#M33803</guid>
      <dc:creator>TimB</dc:creator>
      <dc:date>2024-05-15T13:17:13Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69084#M33804</link>
      <description>&lt;P&gt;What do you mean by "&lt;SPAN&gt;asynchronous nature of spark"? What behavior are you trying to maintain?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 13:19:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69084#M33804</guid>
      <dc:creator>Corbin</dc:creator>
      <dc:date>2024-05-15T13:19:42Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69088#M33805</link>
      <description>&lt;P&gt;Can I create multiple streams that run at the same time? Or do I have to wait for one stream to finish before starting another? So if I have 10 different storage containers, can I create 10 streams that run at the same time?&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 13:37:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69088#M33805</guid>
      <dc:creator>TimB</dc:creator>
      <dc:date>2024-05-15T13:37:40Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69091#M33807</link>
      <description>&lt;P&gt;Yes, each stream will run independently of one another and all can run together at the same time.&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 13:49:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69091#M33807</guid>
      <dc:creator>Corbin</dc:creator>
      <dc:date>2024-05-15T13:49:22Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69092#M33808</link>
      <description>&lt;P&gt;That's good news. So would this be the correct sort of setup, or should I create all the streams first before writing them to a table?&lt;/P&gt;&lt;LI-CODE lang="python"&gt;customer_list = ['customer1', 'customer2', 'customer3', ...]

table_name = "bronze.customer_data_table"

for customer in customer_list:
    file_path = f"wasbs://{customer}@container.blob.core.windows.net/*/*.csv"

    checkpoint_path = f"/tmp/checkpoints/{customer}/_checkpoints"

    cloudFile = {
        "cloudFiles.format": "csv",
        "cloudFiles.backfillInterval": "1 day",
        "cloudFiles.schemaLocation": checkpoint_path,
        "cloudFiles.schemaEvolutionMode": "rescue",
    }

    df = (
        spark.readStream
        .format("cloudFiles")
        .options(**cloudFile)
        .load(file_path)
    )

    streamQuery = (
        df.writeStream.format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .toTable(table_name)
    )&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 15 May 2024 14:01:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69092#M33808</guid>
      <dc:creator>TimB</dc:creator>
      <dc:date>2024-05-15T14:01:59Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69094#M33810</link>
      <description>&lt;P&gt;LGTM, but use .trigger(availableNow=True) instead of .trigger(once=True), since &lt;A href="https://docs.databricks.com/en/structured-streaming/triggers.html#configuring-incremental-batch-processing" target="_self"&gt;once is now deprecated&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 14:31:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69094#M33810</guid>
      <dc:creator>Corbin</dc:creator>
      <dc:date>2024-05-15T14:31:42Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69096#M33811</link>
      <description>&lt;P&gt;If we were to upgrade to &lt;SPAN&gt;ADLSg2 but retain the same structure, would there be scope to improve the method above (besides moving to notification mode)?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 May 2024 15:15:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/69096#M33811</guid>
      <dc:creator>TimB</dc:creator>
      <dc:date>2024-05-15T15:15:48Z</dc:date>
    </item>
    <item>
      <title>Re: Passing multiple paths to .load in autoloader</title>
      <link>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/82405#M36639</link>
      <description>&lt;P&gt;I'm trying to do something similar, either in a DLT pipeline or in a standard streaming-query Auto Loader job. This &lt;A href="https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584" target="_self"&gt;thread&lt;/A&gt; implies there is a way to pass multiple source directories via options to the readStream query while maintaining state through a single checkpoint. This would greatly simplify my process if true, but I have been unable to get it to work. I've also considered your approach, which is to parameterize my file_path using subfolders as sources.&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 14:33:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/passing-multiple-paths-to-load-in-autoloader/m-p/82405#M36639</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-08T14:33:41Z</dc:date>
    </item>
  </channel>
</rss>

