copy file structure including files from one storage to another incrementally using pyspark

shreya_20202 — Tue, 07 May 2024 14:13:08 GMT

I have a storage account, dexflex, with two containers: source and destination. The source container has the following directories and files:

results  
    search
        03
            Module19111.json
            Module19126.json
        04
            Module11291.json
            Module19222.json
    product
        03
            Module18867.json
            Module182625.json
        04
            Module122251.json
            Module192287.json

I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession


# Set up the source and destination storage account configurations
source_account_name = "dev-stor"
source_container_name = "results"
destination_account_name = "dev-stor"
destination_container_name = "results"

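# Set up the source root and destination path.
# The folders are listed explicitly: "{search,product}" inside an f-string is
# evaluated as a Python expression and raises NameError.
source_root = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net"
source_folders = ["search", "product"]
destination_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024"

# Set up the date range for incremental copy (not yet used by the copy below)
start_date = datetime(2024, 3, 1)
end_date = datetime(2999, 12, 12)

# Full recursive copy of each folder, keeping the folder name under the destination
for folder in source_folders:
    dbutils.fs.cp(f"{source_root}/{folder}/", f"{destination_path}/{folder}/", recurse=True)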

The above code performs a full copy. However, I am looking for an incremental copy, i.e. on the next run only the new files should be copied.

P.S. The directory hierarchy must stay the same in the destination.

I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.

Can I get some expert advice, please?

Re: copy file structure including files from one storage to another incrementally using pyspark

NandiniN — Sat, 01 Feb 2025 07:37:29 GMT

Is this directory structure a partitioned table?
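
If these are plain files rather than a table, one possible approach is to list both sides and copy only the files that are missing from the destination, reusing the relative path so the hierarchy is preserved. The following is a minimal sketch under stated assumptions, not a tested solution for your workspace: `list_files` and `plan_incremental_copy` are hypothetical helper names, `ls` is assumed to behave like `dbutils.fs.ls` (entries exposing `.path` and `.isDir()`), and "new" is decided by relative path only, so files modified in place would not be detected.

```python
from typing import Callable, Iterable, List, Tuple


def list_files(ls: Callable, root: str) -> List[str]:
    """Recursively collect file paths under `root`, relative to `root`.

    `ls` is assumed to behave like dbutils.fs.ls: it returns entries
    that expose `.path` and `.isDir()`.
    """
    root = root.rstrip("/") + "/"
    relative_paths = []
    stack = [root]
    while stack:
        current = stack.pop()
        for entry in ls(current):
            if entry.isDir():
                stack.append(entry.path)
            else:
                # Strip the root prefix so both sides can be compared.
                relative_paths.append(entry.path[len(root):])
    return sorted(relative_paths)


def plan_incremental_copy(
    source_files: Iterable[str],
    dest_files: Iterable[str],
    source_root: str,
    dest_root: str,
) -> List[Tuple[str, str]]:
    """Return (src, dst) pairs for files present in source but not in destination."""
    already_copied = set(dest_files)
    src_root = source_root.rstrip("/")
    dst_root = dest_root.rstrip("/")
    return [
        (f"{src_root}/{rel}", f"{dst_root}/{rel}")
        for rel in source_files
        if rel not in already_copied
    ]


# On Databricks, where `dbutils` is predefined in notebooks (assumed usage):
# for folder in ["search", "product"]:
#     src_root = f"{source_root}/{folder}"
#     dst_root = f"{destination_path}/{folder}"
#     src_files = list_files(dbutils.fs.ls, src_root)
#     dst_files = list_files(dbutils.fs.ls, dst_root)  # fails if dst_root does not exist yet
#     for src, dst in plan_incremental_copy(src_files, dst_files, src_root, dst_root):
#         dbutils.fs.cp(src, dst)
```

On the first run the destination folder may not exist, so the destination listing would need to be wrapped in a try/except (or the folder created beforehand with `dbutils.fs.mkdirs`). If changed files also need to be picked up, comparing file size or modification time from the listing could replace the path-only check, provided your runtime exposes those fields.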
