<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Large/complex Incremental Autoloader Job -- Seeking Experience on approach in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82233#M36570</link>
    <description>&lt;P&gt;I'm experimenting with several approaches to implementing an incremental Auto Loader query, either in DLT or in a pipeline job. The complexities:&lt;/P&gt;&lt;P&gt;- Moving approximately 30B records from a nasty set of nested folders on S3, spread across several thousand CSV files. The table structures are very simple, however (two tables of 2 and 3 columns). This CSV "swamp" grows very slowly, with a new subfolder each month. The nested folders contain many types of files, including my files of interest and others that I don't want in my data lake.&lt;/P&gt;&lt;P&gt;- Through experimentation, I have found the magic combination of glob paths in the load statement and the pathGlobFilter option to target the files I want.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Body of the function that defines the stream; BASE, schemas and tblname are defined elsewhere.
crawl_path = BASE + "*/*/edges/"
return (spark.readStream.format("cloudFiles")
     .option("cloudFiles.format", "csv")
     .option("cloudFiles.maxBytesPerTrigger", "50g")  # soft limit on new data per micro-batch
     .option("pathGlobFilter", "*.gz")  # only pick up the gzipped files
     .schema(schemas[tblname])
     .option("sep", "\t")  # the files are tab-separated
     .load(crawl_path)
    )&lt;/LI-CODE&gt;&lt;P&gt;My questions revolve around how to take this bite in chunks, but I am struggling with how to do so given the limits of the readStream statement (&lt;STRONG&gt;it won't take a file list the way others such as AWS Glue do, or will it? See below&lt;/STRONG&gt;). The underlying files are unsplittable .gz files ranging in size from about 300 MB to over 3 GB, which makes for very lumpy and skewed job processing.&lt;/P&gt;&lt;P&gt;Here is what I'm considering, based on considerable experimentation:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Big Bite: Just let it go, crawl everything, and take the resulting cost hit on what is probably a 24-hour run.&lt;/LI&gt;&lt;LI&gt;Symlinks -- I wonder whether Auto Loader will crawl and checkpoint a directory full of symlinks? That way, I could add symlinks to a folder incrementally and let Auto Loader process them.&lt;/LI&gt;&lt;LI&gt;Incremental load path -- I've found through experimentation that I cannot filter the above statement down to a subpath (readStream ... .where("filter statement")), because Auto Loader's checkpoint records all files from the previous read and does not read them on a subsequent read, even though they were not all copied. Unless I'm missing something, I've not found much documentation on how Auto Loader does checkpointing, nor is there much control over it. So I've considered incrementally extending the readStream load statement with incremental paths (e.g. readStream ... load(crawl_path + "/filterfolder/")). I could pass "filterfolder" as a job parameter and run multiple jobs, each adding to the checkpoint until all are complete.&lt;/LI&gt;&lt;LI&gt;Multiple streams -- While researching this question, I found the solution &lt;A title="multiple source paths" href="https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584" target="_self"&gt;https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584&lt;/A&gt; that seems to indicate I can give Auto Loader a list of sources. I assume this JSON is just input options to the readStream statement?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Would appreciate any voices of experience and wisdom on this one. I have burned up a lot of time pondering it. As my Dad used to say, "Experience is the best teacher of them all, but fools will learn no other way!"&lt;/P&gt;</description>
    <pubDate>Wed, 07 Aug 2024 13:39:05 GMT</pubDate>
    <dc:creator>lprevost</dc:creator>
    <dc:date>2024-08-07T13:39:05Z</dc:date>
    <item>
      <title>Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82233#M36570</link>
      <description>&lt;P&gt;I'm experimenting with several approaches to implementing an incremental Auto Loader query, either in DLT or in a pipeline job. The complexities:&lt;/P&gt;&lt;P&gt;- Moving approximately 30B records from a nasty set of nested folders on S3, spread across several thousand CSV files. The table structures are very simple, however (two tables of 2 and 3 columns). This CSV "swamp" grows very slowly, with a new subfolder each month. The nested folders contain many types of files, including my files of interest and others that I don't want in my data lake.&lt;/P&gt;&lt;P&gt;- Through experimentation, I have found the magic combination of glob paths in the load statement and the pathGlobFilter option to target the files I want.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Body of the function that defines the stream; BASE, schemas and tblname are defined elsewhere.
crawl_path = BASE + "*/*/edges/"
return (spark.readStream.format("cloudFiles")
     .option("cloudFiles.format", "csv")
     .option("cloudFiles.maxBytesPerTrigger", "50g")  # soft limit on new data per micro-batch
     .option("pathGlobFilter", "*.gz")  # only pick up the gzipped files
     .schema(schemas[tblname])
     .option("sep", "\t")  # the files are tab-separated
     .load(crawl_path)
    )&lt;/LI-CODE&gt;&lt;P&gt;My questions revolve around how to take this bite in chunks, but I am struggling with how to do so given the limits of the readStream statement (&lt;STRONG&gt;it won't take a file list the way others such as AWS Glue do, or will it? See below&lt;/STRONG&gt;). The underlying files are unsplittable .gz files ranging in size from about 300 MB to over 3 GB, which makes for very lumpy and skewed job processing.&lt;/P&gt;&lt;P&gt;Here is what I'm considering, based on considerable experimentation:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Big Bite: Just let it go, crawl everything, and take the resulting cost hit on what is probably a 24-hour run.&lt;/LI&gt;&lt;LI&gt;Symlinks -- I wonder whether Auto Loader will crawl and checkpoint a directory full of symlinks? That way, I could add symlinks to a folder incrementally and let Auto Loader process them.&lt;/LI&gt;&lt;LI&gt;Incremental load path -- I've found through experimentation that I cannot filter the above statement down to a subpath (readStream ... .where("filter statement")), because Auto Loader's checkpoint records all files from the previous read and does not read them on a subsequent read, even though they were not all copied. Unless I'm missing something, I've not found much documentation on how Auto Loader does checkpointing, nor is there much control over it. So I've considered incrementally extending the readStream load statement with incremental paths (e.g. readStream ... load(crawl_path + "/filterfolder/")). I could pass "filterfolder" as a job parameter and run multiple jobs, each adding to the checkpoint until all are complete.&lt;/LI&gt;&lt;LI&gt;Multiple streams -- While researching this question, I found the solution &lt;A title="multiple source paths" href="https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584" target="_self"&gt;https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584&lt;/A&gt; that seems to indicate I can give Auto Loader a list of sources. I assume this JSON is just input options to the readStream statement?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Would appreciate any voices of experience and wisdom on this one. I have burned up a lot of time pondering it. As my Dad used to say, "Experience is the best teacher of them all, but fools will learn no other way!"&lt;/P&gt;</description>
      <pubDate>Wed, 07 Aug 2024 13:39:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82233#M36570</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-07T13:39:05Z</dc:date>
    </item>
    <item>
      <title>Re: Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82240#M36574</link>
      <description>&lt;P&gt;This seems relevant to option 4: &lt;A href="https://docs.databricks.com/en/ingestion/auto-loader/directory-listing-mode.html#change-source-path-for-auto-loader" target="_self"&gt;change source path for Auto Loader&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 07 Aug 2024 13:50:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82240#M36574</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-07T13:50:41Z</dc:date>
    </item>
    <item>
      <title>Re: Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82258#M36583</link>
      <description>&lt;P&gt;A potential option for #4, "multiple streams"? If this works, it could be a game changer for me.&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;reader_args = {'cloudFiles': {'format': 'csv',
   'maxBytesPerTrigger': '50g',
   'source': [{'path': '/[mypath]/[subpath1]/*/edges/',
     'globPattern': '*.gz',
     'recursive': True},
    {'path': '/[mypath]/[subpath2]/*/edges/',
     'globPattern': '*.gz',
     'recursive': True}]}}
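
# A hedged sketch, not a confirmed fix for this thread: as far as I know,
# Auto Loader load paths accept glob patterns, including {a,b} alternation,
# so several subfolders can be folded into one path string. The helper name
# and the example folder names below are hypothetical. Changing the source
# path of a stream that already has a checkpoint carries caveats (see the
# "change source path" documentation linked earlier in this thread).
def brace_glob_path(base, subpaths, tail="*/edges/"):
    """Build base/{sub1,sub2,...}/tail from a list of subfolder names."""
    if not subpaths:
        raise ValueError("need at least one subpath")
    # Strip stray slashes so the joined pattern has exactly one per level.
    alternatives = ",".join(s.strip("/") for s in subpaths)
    return base.rstrip("/") + "/{" + alternatives + "}/" + tail

# One path string covering both (hypothetical) subfolders; it could be
# passed to .load(...) in place of a single-folder path.
multi_path = brace_glob_path("/mypath", ["subpath1", "subpath2"])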

   
(spark.readStream
     .options(reader_args)
     .schema([myschema]) 
     .option("sep", "\t") 
     .load()
    )&lt;/LI-CODE&gt;</description>
      <pubDate>Wed, 07 Aug 2024 15:50:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82258#M36583</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-07T15:50:39Z</dc:date>
    </item>
    <item>
      <title>Re: Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82432#M36649</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt; - any ideas here? Specifically, is there a way to pass multiple folders to a load statement, similar to the thread linked in item 4 of my original question?&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;SPAN&gt;Multiple streams -- While researching this question, I found the solution &lt;/SPAN&gt;&lt;A title="multiple source paths" href="https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584" target="_self"&gt;https://community.databricks.com/t5/data-engineering/configure-multiple-source-paths-for-auto-loader/m-p/5059#M1584&lt;/A&gt;&lt;SPAN&gt; that seems to indicate I can give Auto Loader a list of sources. I assume this JSON is just input options to the readStream statement?&lt;/SPAN&gt;&lt;P&gt;Would appreciate any voices of experience and wisdom on this one.&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;BR /&gt;seems to indicate?&lt;/P&gt;</description>
      <pubDate>Thu, 08 Aug 2024 17:12:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82432#M36649</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-08T17:12:07Z</dc:date>
    </item>
    <item>
      <title>Re: Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82655#M36708</link>
      <description>&lt;P&gt;Status update:&lt;/P&gt;&lt;P&gt;I have been unsuccessful at getting anything to work with approaches 3 and 4. I did not try 2. On 3, I don't understand why it won't work, but when I changed subfolders, Auto Loader would not incrementally load them.&lt;/P&gt;&lt;P&gt;I have been successful with approach 1, using a slightly narrower glob pattern so as to avoid the really ugly, large unsplittable files.&lt;/P&gt;&lt;P&gt;I'm still wondering whether 3 or 4 is possible.&lt;/P&gt;</description>
      <pubDate>Sun, 11 Aug 2024 13:41:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/82655#M36708</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-08-11T13:41:06Z</dc:date>
    </item>
    <item>
      <title>Re: Large/complex Incremental Autoloader Job -- Seeking Experience on approach</title>
      <link>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/91296#M38135</link>
      <description>&lt;P&gt;Crickets....&lt;/P&gt;</description>
      <pubDate>Sat, 21 Sep 2024 18:27:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/large-complex-incremental-autoloader-job-seeking-experience-on/m-p/91296#M38135</guid>
      <dc:creator>lprevost</dc:creator>
      <dc:date>2024-09-21T18:27:05Z</dc:date>
    </item>
  </channel>
</rss>

