<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic copy file structure including files from one storage to another incrementally using pyspark in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/copy-file-structure-including-files-from-one-storage-to-another/m-p/68429#M33676</link>
    <description>&lt;P&gt;I have a storage account, dexflex, with two containers, source and destination. The source container has the following directories and files:&lt;/P&gt;&lt;PRE&gt;results
    search
        03
            Module19111.json
            Module19126.json
        04
            Module11291.json
            Module19222.json
    product
        03
            Module18867.json
            Module182625.json
        04
            Module122251.json
            Module192287.json&lt;/PRE&gt;&lt;P&gt;I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; datetime &lt;SPAN class=""&gt;import&lt;/SPAN&gt; datetime, timedelta
&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class=""&gt;import&lt;/SPAN&gt; SparkSession


&lt;SPAN class=""&gt;# Set up the source and destination storage account configurations&lt;/SPAN&gt;
source_account_name = &lt;SPAN class=""&gt;"dev-stor"&lt;/SPAN&gt;
source_container_name = &lt;SPAN class=""&gt;"results"&lt;/SPAN&gt;
destination_account_name = &lt;SPAN class=""&gt;"dev-stor"&lt;/SPAN&gt;
destination_container_name = &lt;SPAN class=""&gt;"results"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;# Set up the source and destination paths&lt;/SPAN&gt;
source_path = &lt;SPAN class=""&gt;f"abfss://&lt;SPAN class=""&gt;{source_container_name}&lt;/SPAN&gt;@&lt;SPAN class=""&gt;{source_account_name}&lt;/SPAN&gt;.dfs.core.windows.net/&lt;SPAN class=""&gt;{{search,product}}&lt;/SPAN&gt;/"&lt;/SPAN&gt;
destination_path = &lt;SPAN class=""&gt;f"abfss://&lt;SPAN class=""&gt;{destination_container_name}&lt;/SPAN&gt;@&lt;SPAN class=""&gt;{destination_account_name}&lt;/SPAN&gt;.dfs.core.windows.net/copy-data-2024"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;# Set up the date range for incremental copy&lt;/SPAN&gt;
start_date = datetime(&lt;SPAN class=""&gt;2024&lt;/SPAN&gt;, &lt;SPAN class=""&gt;3&lt;/SPAN&gt;, &lt;SPAN class=""&gt;1&lt;/SPAN&gt;)
end_date = datetime(&lt;SPAN class=""&gt;2999&lt;/SPAN&gt;, &lt;SPAN class=""&gt;12&lt;/SPAN&gt;, &lt;SPAN class=""&gt;12&lt;/SPAN&gt;)

dbutils.fs.cp(source_path, destination_path, recurse=&lt;SPAN class=""&gt;True&lt;/SPAN&gt;)&lt;/PRE&gt;&lt;P&gt;The code above performs a full copy, but I am looking for an incremental copy, i.e. on the next run only the new files should be copied.&lt;/P&gt;&lt;P&gt;PS: the directory hierarchy must stay the same.&lt;/P&gt;&lt;P&gt;I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.&lt;/P&gt;&lt;P&gt;Can I get some expert advice, please?&lt;/P&gt;</description>
    <pubDate>Tue, 07 May 2024 14:13:08 GMT</pubDate>
    <dc:creator>shreya_20202</dc:creator>
    <dc:date>2024-05-07T14:13:08Z</dc:date>
    <item>
      <title>copy file structure including files from one storage to another incrementally using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/copy-file-structure-including-files-from-one-storage-to-another/m-p/68429#M33676</link>
      <description>&lt;P&gt;I have a storage account, dexflex, with two containers, source and destination. The source container has the following directories and files:&lt;/P&gt;&lt;PRE&gt;results
    search
        03
            Module19111.json
            Module19126.json
        04
            Module11291.json
            Module19222.json
    product
        03
            Module18867.json
            Module182625.json
        04
            Module122251.json
            Module192287.json&lt;/PRE&gt;&lt;P&gt;I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:&lt;/P&gt;&lt;PRE&gt;&lt;SPAN class=""&gt;from&lt;/SPAN&gt; datetime &lt;SPAN class=""&gt;import&lt;/SPAN&gt; datetime, timedelta
&lt;SPAN class=""&gt;from&lt;/SPAN&gt; pyspark.sql &lt;SPAN class=""&gt;import&lt;/SPAN&gt; SparkSession


&lt;SPAN class=""&gt;# Set up the source and destination storage account configurations&lt;/SPAN&gt;
source_account_name = &lt;SPAN class=""&gt;"dev-stor"&lt;/SPAN&gt;
source_container_name = &lt;SPAN class=""&gt;"results"&lt;/SPAN&gt;
destination_account_name = &lt;SPAN class=""&gt;"dev-stor"&lt;/SPAN&gt;
destination_container_name = &lt;SPAN class=""&gt;"results"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;# Set up the source and destination paths&lt;/SPAN&gt;
source_path = &lt;SPAN class=""&gt;f"abfss://&lt;SPAN class=""&gt;{source_container_name}&lt;/SPAN&gt;@&lt;SPAN class=""&gt;{source_account_name}&lt;/SPAN&gt;.dfs.core.windows.net/&lt;SPAN class=""&gt;{{search,product}}&lt;/SPAN&gt;/"&lt;/SPAN&gt;
destination_path = &lt;SPAN class=""&gt;f"abfss://&lt;SPAN class=""&gt;{destination_container_name}&lt;/SPAN&gt;@&lt;SPAN class=""&gt;{destination_account_name}&lt;/SPAN&gt;.dfs.core.windows.net/copy-data-2024"&lt;/SPAN&gt;

&lt;SPAN class=""&gt;# Set up the date range for incremental copy&lt;/SPAN&gt;
start_date = datetime(&lt;SPAN class=""&gt;2024&lt;/SPAN&gt;, &lt;SPAN class=""&gt;3&lt;/SPAN&gt;, &lt;SPAN class=""&gt;1&lt;/SPAN&gt;)
end_date = datetime(&lt;SPAN class=""&gt;2999&lt;/SPAN&gt;, &lt;SPAN class=""&gt;12&lt;/SPAN&gt;, &lt;SPAN class=""&gt;12&lt;/SPAN&gt;)

dbutils.fs.cp(source_path, destination_path, recurse=&lt;SPAN class=""&gt;True&lt;/SPAN&gt;)&lt;/PRE&gt;&lt;P&gt;The code above performs a full copy, but I am looking for an incremental copy, i.e. on the next run only the new files should be copied.&lt;/P&gt;&lt;P&gt;PS: the directory hierarchy must stay the same.&lt;/P&gt;&lt;P&gt;I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.&lt;/P&gt;&lt;P&gt;Can I get some expert advice, please?&lt;/P&gt;</description>
      <pubDate>Tue, 07 May 2024 14:13:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/copy-file-structure-including-files-from-one-storage-to-another/m-p/68429#M33676</guid>
      <dc:creator>shreya_20202</dc:creator>
      <dc:date>2024-05-07T14:13:08Z</dc:date>
    </item>
    <item>
      <title>Re: copy file structure including files from one storage to another incrementally using pyspark</title>
      <link>https://community.databricks.com/t5/data-engineering/copy-file-structure-including-files-from-one-storage-to-another/m-p/108275#M43017</link>
      <description>&lt;P&gt;Is this directory structure a partitioned table?&lt;/P&gt;</description>
      <pubDate>Sat, 01 Feb 2025 07:37:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/copy-file-structure-including-files-from-one-storage-to-another/m-p/108275#M43017</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-02-01T07:37:29Z</dc:date>
    </item>
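    <item>
      <title>Re: copy file structure including files from one storage to another incrementally using pyspark</title>
      <description>&lt;P&gt;One possible way to get the incremental behaviour asked about in the original post is to list both sides, compare relative paths, and copy only the files missing at the destination. The sketch below is a minimal illustration under stated assumptions, not a confirmed Databricks recipe: the helper names (relative_to, list_files, plan_incremental_copy) are hypothetical, files are compared by relative path only, and the dbutils wiring in the trailing comment assumes that dbutils.fs.ls returns entries exposing path and isDir().&lt;/P&gt;

```python
def relative_to(path, root):
    """Return path relative to root, e.g. 'search/03/Module19111.json'."""
    prefix = root.rstrip("/") + "/"
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not under {root!r}")
    return path[len(prefix):]


def list_files(ls, root):
    """Recursively collect file paths under root.

    ls is any callable returning (path, is_dir) tuples for the direct
    children of a directory (a hypothetical adapter, see the comment below).
    """
    files, stack = [], [root.rstrip("/")]
    while stack:
        for path, is_dir in ls(stack.pop()):
            if is_dir:
                stack.append(path.rstrip("/"))
            else:
                files.append(path)
    return sorted(files)


def plan_incremental_copy(source_files, dest_files, source_root, dest_root):
    """Return (src, dst) pairs for source files not yet present at the destination."""
    already_copied = {relative_to(p, dest_root) for p in dest_files}
    plan = []
    for src in sorted(source_files):
        rel = relative_to(src, source_root)
        if rel not in already_copied:
            # Re-using the relative path keeps the directory hierarchy identical.
            plan.append((src, dest_root.rstrip("/") + "/" + rel))
    return plan


# Assumed Databricks wiring (only meaningful inside a notebook):
#   def ls(path):
#       return [(f.path, f.isDir()) for f in dbutils.fs.ls(path)]
#   plan = plan_incremental_copy(list_files(ls, src_root),
#                                list_files(ls, dst_root), src_root, dst_root)
#   for src, dst in plan:
#       dbutils.fs.cp(src, dst)
```

&lt;P&gt;Because only relative paths are compared, a file that is modified in place at the source would not be picked up again; comparing modification times or sizes as well would be one way to cover that case. The destination root is also assumed to exist before it is listed.&lt;/P&gt;</description>
    </item>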
  </channel>
</rss>

