I have a storage account dexflex and two containers, source and destination. The source container has the following directories and files:
results
  search
    03
      Module19111.json
      Module19126.json
    04
      Module11291.json
      Module19222.json
  product
    03
      Module18867.json
      Module182625.json
    04
      Module122251.json
      Module192287.json
I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:
from datetime import datetime

# Set up the source and destination storage account configurations
source_account_name = "dev-stor"
source_container_name = "results"
destination_account_name = "dev-stor"
destination_container_name = "results"

# Top-level directories to copy
source_directories = ["search", "product"]

# Set up the date range for incremental copy (defined but not used yet)
start_date = datetime(2024, 3, 1)
end_date = datetime(2999, 12, 12)

# Copy each directory recursively; note this is a full copy on every run
for directory in source_directories:
    source_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/{directory}/"
    destination_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/{directory}"
    dbutils.fs.cp(source_path, destination_path, recurse=True)
The above code does a full copy; what I am looking for is an incremental copy, i.e. on the next run only the new files should be copied.
PS: the directory hierarchy must stay the same.
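What I have in mind is roughly the sketch below (untested; list_files_recursively is just a helper name I made up, not a built-in): list every file under the source directories, compute its path relative to the container root, and copy only the files whose relative path does not already exist under the destination, which would keep the hierarchy intact.

# Untested sketch: copy only files not yet present at the destination
def list_files_recursively(path):
    # Yield the full path of every file under `path`
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            yield from list_files_recursively(entry.path)
        else:
            yield entry.path

source_root = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/"
destination_root = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/"

# Relative paths of the files already copied on earlier runs
try:
    already_copied = {p[len(destination_root):] for p in list_files_recursively(destination_root)}
except Exception:
    already_copied = set()  # first run: destination folder does not exist yet

for directory in source_directories:
    for source_file in list_files_recursively(source_root + directory):
        relative_path = source_file[len(source_root):]
        if relative_path not in already_copied:
            # Copying to the same relative path preserves the hierarchy
            dbutils.fs.cp(source_file, destination_root + relative_path)

I am not sure whether re-listing the whole destination on every run will scale as the number of files grows, which is part of why I am asking.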
I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.
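For reference, my attempt followed the standard Auto Loader pattern, something along these lines (simplified; the paths are placeholders, not my exact code):

# Auto Loader detects new files via the checkpoint, but the output is
# written as rows into a single table/directory, so the per-file
# search/product/03/04 hierarchy of the source is lost.
source_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/search/"
output_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/search"
checkpoint_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/_autoloader"

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
        .load(source_path))

(df.writeStream
    .option("checkpointLocation", f"{checkpoint_path}/checkpoint")
    .trigger(availableNow=True)
    .start(output_path))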
Can I get some expert advice, please?