I have a storage account dexflex with two containers, source and destination. The source container has directories and files as below:
results
  search
    03
      Module19111.json
      Module19126.json
    04
      Module11291.json
      Module19222.json
  product
    03
      Module18867.json
      Module182625.json
    04
      Module122251.json
      Module192287.json

I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

# Source and destination storage account configuration
# (matching the dexflex account and the source/destination containers above)
source_account_name = "dexflex"
source_container_name = "source"
destination_account_name = "dexflex"
destination_container_name = "destination"

# Source and destination paths. Note: dbutils.fs.cp does not expand
# shell-style {search,product} globs (and inside an f-string that brace
# pair is even parsed as a Python expression and raises a NameError),
# so copy from the common parent directory "results" instead.
source_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/results/"
destination_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024"

# Intended date range for the incremental copy (not used anywhere yet)
start_date = datetime(2024, 3, 1)
end_date = datetime(2999, 12, 12)

dbutils.fs.cp(source_path, destination_path, recurse=True)
The above code performs a full copy; what I actually want is an incremental copy, i.e. on the next run only the files that are new since the previous run should be copied.
PS: the directory hierarchy must stay the same in the destination. To make both requirements concrete, the sketch below shows roughly the behaviour I am after.
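This is only a minimal sketch, assuming the modificationTime field that dbutils.fs.ls reports on recent Databricks runtimes is a reliable change marker; the _last_run.txt checkpoint file name is just a placeholder I made up. It reuses source_path and destination_path from the snippet above.

def list_files_recursively(path):
    # Walk the source tree; descend into directories, collect files.
    files = []
    for info in dbutils.fs.ls(path):
        if info.isDir():
            files.extend(list_files_recursively(info.path))
        else:
            files.append(info)
    return files

# Watermark: modification time of the newest file copied in the previous
# run, stored in a placeholder checkpoint file I made up for this sketch.
checkpoint_path = f"{destination_path}/_last_run.txt"
try:
    last_run_ms = int(dbutils.fs.head(checkpoint_path))
except Exception:
    last_run_ms = 0  # first run: everything counts as new

max_seen_ms = last_run_ms
for f in list_files_recursively(source_path):
    if f.modificationTime > last_run_ms:
        # Reuse the path relative to the source root so the
        # search/product and month subdirectories are preserved.
        relative = f.path[len(source_path):]
        dbutils.fs.cp(f.path, f"{destination_path}/{relative}")
        max_seen_ms = max(max_seen_ms, f.modificationTime)

dbutils.fs.put(checkpoint_path, str(max_seen_ms), overwrite=True)

Copying file by file like this keeps the relative paths, but I am not sure it scales well or handles files that share the exact watermark timestamp.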
I also tried Auto Loader but was unable to maintain the same hierarchical directory structure; my attempt looked roughly like the sketch below.
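A rough reconstruction of that attempt (the _schema and _checkpoint locations are placeholders, and spark is the notebook's built-in session). Everything lands flat under one output directory, which is where the search/product and month subfolders get lost:

# Auto Loader ingests the JSON records as a stream...
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", f"{destination_path}/_schema")
          .load(source_path))

# ...but the sink writes all records under a single output directory,
# so the original directory hierarchy is not preserved.
(stream.writeStream
       .format("json")
       .option("checkpointLocation", f"{destination_path}/_checkpoint")
       .trigger(availableNow=True)
       .start(f"{destination_path}/flattened"))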
Can I get some expert advice, please?