<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: AutoLoader - problem with adding new source location in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-problem-with-adding-new-source-location/m-p/61488#M31808</link>
    <description>&lt;P&gt;Thanks for the reply &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;. I have some questions related to your answer.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Checkpoint Location&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Does deleting the checkpoint folder (or only the files inside it?) mean that the next run of AutoLoader will load all files from the provided source locations again? That would duplicate data which was already loaded to the target delta table.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Configure Auto Loader&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Do I understand correctly that InMemoryFileIndex is used for listing files and directories more efficiently, but that it cannot be used together with AutoLoader and cloudFiles?&lt;/LI&gt;&lt;LI&gt;What about implementing a process which moves (for backup purposes) or deletes files already processed by AutoLoader? It could resolve the problem of long file listings. Is there any feature like this in AutoLoader? In fact, I have found an "archive_timestamp" column in "cloud_file_state", but it contains only nulls.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Consider Using Wildcards&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;It looks like wildcards could solve my problem. Can you confirm that using wildcards creates only one source in the "checkpoint/sources" directory?&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Marcin_U_0-1708616776547.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6340i474D8F3CE9960D8A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Marcin_U_0-1708616776547.png" alt="Marcin_U_0-1708616776547.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;LI&gt;I also wonder why my implementation creates a new source in the "checkpoint/sources" folder after new source locations are added to AutoLoader. Is it because readStream is started as many times as there are source locations in the &lt;PRE&gt;_al_readStream_from_paths&lt;/PRE&gt; method?&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Thu, 22 Feb 2024 15:47:05 GMT</pubDate>
    <dc:creator>Marcin_U</dc:creator>
    <dc:date>2024-02-22T15:47:05Z</dc:date>
    <item>
      <title>AutoLoader - problem with adding new source location</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-problem-with-adding-new-source-location/m-p/61365#M31772</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have some trouble with AutoLoader. Currently we use many different source locations on ADLS to read parquet files and write them to a delta table using AutoLoader. Files in all the locations have the same schema.&lt;/P&gt;&lt;P&gt;Everything works fine until we have to add a new source location for an existing table. In that case the following error is thrown:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;There are [2] sources in the checkpoint offsets and now there are [3] sources requested by the query&lt;/LI-CODE&gt;&lt;P&gt;Our implementation of AutoLoader:&lt;/P&gt;&lt;P&gt;cloudFiles config:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;        cloudFile = {
                 "cloudFiles.useNotifications": False
                , "cloudFiles.format": self.source_format
                , "cloudFiles.schemaLocation": self.al_output_path_schema
                , "cloudFiles.inferColumnTypes": True
                , "cloudFiles.validateOptions": True
            }&lt;/LI-CODE&gt;&lt;P&gt;AutoLoader run:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;        df = self._al_readStream_from_paths()

        df.writeStream\
            .foreachBatch(lambda df, epoch_id: self._al_NAV_augment_base_stream(epoch_id, df))\
            .option( "checkpointLocation", self.al_output_path_checkpoint )\
            .option("mergeSchema", "true")\
            .queryName(f'_process_{self.source_mnt}_{self.target_name}')\
            .trigger(availableNow=True)\
            .start().awaitTermination()&lt;/LI-CODE&gt;&lt;P&gt;_al_readStream_from_paths definition:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;def _al_readStream_from_paths(self) -&amp;gt; DataFrame:

        list_of_paths = self._get_list_of_paths()
        if list_of_paths == []:
            df_single_path = self._al_readStream( source_path=self.source_path )
            return df_single_path
        else:
            df_unioned = None
            for path in list_of_paths:

                df_single_path = self._al_readStream( source_path=path )
                if df_unioned is None:
                    df_unioned = df_single_path
                else:
                    df_unioned = df_unioned.union(df_single_path)
            return df_unioned&lt;/LI-CODE&gt;&lt;P&gt;_al_readStream definition:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;    def _al_readStream(self, source_path) -&amp;gt; DataFrame:

        cloudFile = self._al_get_cloudFile_options()

        df = spark.readStream\
          .format("cloudFiles")\
          .options(**cloudFile)\
          .load(source_path)
        return df&lt;/LI-CODE&gt;&lt;P&gt;The _al_NAV_augment_base_stream method used in writeStream augments the DataFrame, e.g. by adding lineage columns.&lt;/P&gt;&lt;P&gt;My question is: how do I add a new source location in the proper way, so that it does not cause "There are [2] sources in the checkpoint offsets and now there are [3] sources requested by the query"?&lt;/P&gt;</description>
      <pubDate>Wed, 21 Feb 2024 14:18:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-problem-with-adding-new-source-location/m-p/61365#M31772</guid>
      <dc:creator>Marcin_U</dc:creator>
      <dc:date>2024-02-21T14:18:25Z</dc:date>
    </item>
    <item>
      <title>Re: AutoLoader - problem with adding new source location</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-problem-with-adding-new-source-location/m-p/61488#M31808</link>
      <description>&lt;P&gt;Thanks for the reply &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;. I have some questions related to your answer.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;STRONG&gt;Checkpoint Location&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Does deleting the checkpoint folder (or only the files inside it?) mean that the next run of AutoLoader will load all files from the provided source locations again? That would duplicate data which was already loaded to the target delta table.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Configure Auto Loader&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;Do I understand correctly that InMemoryFileIndex is used for listing files and directories more efficiently, but that it cannot be used together with AutoLoader and cloudFiles?&lt;/LI&gt;&lt;LI&gt;What about implementing a process which moves (for backup purposes) or deletes files already processed by AutoLoader? It could resolve the problem of long file listings. Is there any feature like this in AutoLoader? In fact, I have found an "archive_timestamp" column in "cloud_file_state", but it contains only nulls.&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Consider Using Wildcards&lt;/STRONG&gt;:&lt;UL&gt;&lt;LI&gt;It looks like wildcards could solve my problem. Can you confirm that using wildcards creates only one source in the "checkpoint/sources" directory?&lt;BR /&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Marcin_U_0-1708616776547.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/6340i474D8F3CE9960D8A/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Marcin_U_0-1708616776547.png" alt="Marcin_U_0-1708616776547.png" /&gt;&lt;/span&gt;&lt;/LI&gt;&lt;LI&gt;I also wonder why my implementation creates a new source in the "checkpoint/sources" folder after new source locations are added to AutoLoader. Is it because readStream is started as many times as there are source locations in the &lt;PRE&gt;_al_readStream_from_paths&lt;/PRE&gt; method?&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 22 Feb 2024 15:47:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-problem-with-adding-new-source-location/m-p/61488#M31808</guid>
      <dc:creator>Marcin_U</dc:creator>
      <dc:date>2024-02-22T15:47:05Z</dc:date>
    </item>
  </channel>
</rss>