Data Engineering

Copy file structure including files from one storage to another incrementally using PySpark

shreya_20202
New Contributor II

I have a storage account dexflex and two containers, source and destination. The source container has directories and files as below:

results  
    search
        03
            Module19111.json
            Module19126.json
        04
            Module11291.json
            Module19222.json
    product
        03
            Module18867.json
            Module182625.json
        04
            Module122251.json
            Module192287.json

I am trying to copy the data incrementally from the source to the destination container using the code snippet below:

from datetime import datetime, timedelta
from pyspark.sql import SparkSession


# Set up the source and destination storage account configurations
source_account_name = "dev-stor"
source_container_name = "results"
destination_account_name = "dev-stor"
destination_container_name = "results"

# Set up the source and destination base paths
source_base_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net"
destination_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024"

# Set up the date range for incremental copy (not used yet)
start_date = datetime(2024, 3, 1)
end_date = datetime(2999, 12, 12)

# Recursively copy each top-level folder, preserving the directory hierarchy
for folder in ["search", "product"]:
    dbutils.fs.cp(f"{source_base_path}/{folder}", f"{destination_path}/{folder}", recurse=True)

The above code does a full copy; however, I am looking for an incremental copy, i.e., on the next run only the new files should be copied.

P.S. The directory hierarchy must remain the same.

I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.

Can I get some expert advice, please?

1 REPLY

Kaniz
Community Manager

Hi @shreya_20202, it looks like you’re trying to incrementally copy data from the source container to the destination container in Azure Databricks. To achieve this, you’ll need to compare the files in the source and destination directories and copy only the new or modified files.

Here’s an approach you can follow for incremental copying:

  1. List Files in Source and Destination: First, list the files in both the source and destination directories. You can use the dbutils.fs.ls() function to retrieve the file paths and metadata (see the short FileInfo sketch after this list).

  2. Compare Timestamps: Compare the timestamps (last modified time) of files in the source and destination directories. Identify files that are newer in the source than in the destination.

  3. Copy New Files: Copy the new files from the source to the destination. You can use the dbutils.fs.cp() function to copy individual files.

  4. Maintain Directory Hierarchy: To maintain the same hierarchical directory structure, create the corresponding directories in the destination if they don’t exist.
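
For reference, each entry returned by dbutils.fs.ls() is a FileInfo object exposing path, name, size, and (on recent Databricks Runtime versions) modificationTime in epoch milliseconds, which is what the timestamp comparison relies on. Here is a quick sketch; the folder path is just an example taken from your layout:

# Each FileInfo returned by dbutils.fs.ls() carries the metadata needed for the comparison
for f in dbutils.fs.ls("abfss://results@dev-stor.dfs.core.windows.net/search/03"):
    # path, name, size, and modificationTime (epoch milliseconds on recent runtimes)
    print(f.path, f.size, f.modificationTime)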

Below, I’ve modified your code snippet to perform an incremental copy based on timestamps while preserving the directory hierarchy. Note that this sketch relies on the file modification time reported by dbutils.fs.ls().

# Set up the source and destination storage account configurations
source_account_name = "dev-stor"
source_container_name = "results"
destination_account_name = "dev-stor"
destination_container_name = "results"

# Set up the source and destination base paths
source_base_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net"
destination_base_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024"

def list_files_recursively(path):
    """Return every file (not directory) under the given path."""
    files = []
    for item in dbutils.fs.ls(path):
        if item.isDir():
            files.extend(list_files_recursively(item.path))
        else:
            files.append(item)
    return files

def needs_copy(source_file, destination_file_path):
    """Return True if the destination file is missing or older than the source file."""
    try:
        destination_info = dbutils.fs.ls(destination_file_path)[0]
        # modificationTime is in epoch milliseconds (available on recent Databricks Runtimes)
        return destination_info.modificationTime < source_file.modificationTime
    except Exception:
        return True  # destination file does not exist yet

# Compare timestamps and copy only new or modified files, recreating the
# same directory hierarchy under the destination path
for source_folder in ["search", "product"]:
    for file_info in list_files_recursively(f"{source_base_path}/{source_folder}"):
        destination_file_path = file_info.path.replace(source_base_path, destination_base_path)
        if needs_copy(file_info, destination_file_path):
            print(f"Copying new file: {file_info.path} -> {destination_file_path}")
            dbutils.fs.cp(file_info.path, destination_file_path)

print("Incremental copy completed successfully!")

Remember to replace the placeholders with your actual storage account names and container names. Additionally, ensure that you have the necessary permissions to read from the source and write to the destination.
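
If you access the containers directly over abfss:// with a service principal, the storage credentials can be set in the notebook before running the copy. Here is a minimal sketch, assuming the secret scope, key, application (client) ID, and directory (tenant) ID below are placeholders you replace with your own values:

# Minimal sketch: authenticate to the storage account with a service principal (OAuth).
# The secret scope/key and the <application-id>/<directory-id> values are placeholders.
service_credential = dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.dev-stor.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.dev-stor.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.dev-stor.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.dev-stor.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.dev-stor.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")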

Feel free to adjust the code snippet according to your specific requirements. If you encounter any issues or need further assistance, feel free to ask! 😊