I have a storage account dexflex and two containers, source and destination. The source container has the following directories and files:
results
  search
    03
      Module19111.json
      Module19126.json
    04
      Module11291.json
      Module19222.json
  product
    03
      Module18867.json
      Module182625.json
    04
      Module122251.json
      Module192287.json
I am trying to copy the data incrementally from the source container to the destination container using the code snippet below:
from datetime import datetime

# Set up the source and destination storage account configurations
source_account_name = "dev-stor"
source_container_name = "results"
destination_account_name = "dev-stor"
destination_container_name = "results"

# Top-level directories to copy
source_directories = ["search", "product"]

# Set up the date range for incremental copy (defined but not used yet)
start_date = datetime(2024, 3, 1)
end_date = datetime(2999, 12, 12)

# Copy each directory recursively; note this is a full copy on every run
for directory in source_directories:
    source_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/{directory}/"
    destination_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/{directory}"
    dbutils.fs.cp(source_path, destination_path, recurse=True)
The above code does a full copy; what I am looking for is an incremental copy, i.e. on the next run only the new files should be copied.
PS: the directory hierarchy must stay the same.
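What I have in mind is roughly the sketch below (untested; list_files_recursively is just a helper name I made up, not a built-in): list every file under the source directories, compute its path relative to the container root, and copy only the files whose relative path does not already exist under the destination, which would keep the hierarchy intact.

# Untested sketch: copy only files not yet present at the destination
def list_files_recursively(path):
    # Yield the full path of every file under `path`
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            yield from list_files_recursively(entry.path)
        else:
            yield entry.path

source_root = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/"
destination_root = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/"

# Relative paths of the files already copied on earlier runs
try:
    already_copied = {p[len(destination_root):] for p in list_files_recursively(destination_root)}
except Exception:
    already_copied = set()  # first run: destination folder does not exist yet

for directory in source_directories:
    for source_file in list_files_recursively(source_root + directory):
        relative_path = source_file[len(source_root):]
        if relative_path not in already_copied:
            # Copying to the same relative path preserves the hierarchy
            dbutils.fs.cp(source_file, destination_root + relative_path)

I am not sure whether re-listing the whole destination on every run will scale as the number of files grows, which is part of why I am asking.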
I also tried Auto Loader but was unable to maintain the same hierarchical directory structure.
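For reference, my attempt followed the standard Auto Loader pattern, something along these lines (simplified; the paths are placeholders, not my exact code):

# Auto Loader detects new files via the checkpoint, but the output is
# written as rows into a single table/directory, so the per-file
# search/product/03/04 hierarchy of the source is lost.
source_path = f"abfss://{source_container_name}@{source_account_name}.dfs.core.windows.net/search/"
output_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/copy-data-2024/search"
checkpoint_path = f"abfss://{destination_container_name}@{destination_account_name}.dfs.core.windows.net/_autoloader"

df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
        .load(source_path))

(df.writeStream
    .option("checkpointLocation", f"{checkpoint_path}/checkpoint")
    .trigger(availableNow=True)
    .start(output_path))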
Can I get some expert advice, please?