Data Engineering

How to quickly sync large files (>100GB)

PeSe
New Contributor

I want to sync large files (>100GB) from my local system to a DBX Volume. I see two options, each with a different problem. Do you have any suggestions?

Option 1: Reads the entire file into memory -> memory issues

    import io
    from databricks.sdk import WorkspaceClient

    workspace = WorkspaceClient()  # Databricks SDK workspace client

    # Reads the whole file into memory before uploading it via the Files API
    with open(local_file_path, 'rb') as file:
        file_bytes = file.read()
        binary_data = io.BytesIO(file_bytes)
        response = workspace.files.upload(dbx_file_path, binary_data, overwrite=True)
        if response:
            print(response)

Option 2: Only 1MB chunks allowed on the DBX side -> very slow

    import base64
    import os
    from databricks.sdk import WorkspaceClient
    from tqdm import tqdm

    workspace = WorkspaceClient()  # Databricks SDK workspace client

    # Open a streaming handle for the target DBFS path
    create_response = workspace.dbfs.create(dbx_file_path, overwrite=True)
    handle = create_response.handle

    file_size = os.path.getsize(local_file_path)
    with open(local_file_path, 'rb') as file:
        with tqdm(total=file_size, unit='B', unit_scale=True, desc="Uploading") as pbar:
            # The DBFS API accepts at most 1 MB of data per add_block call
            while chunk := file.read(1024 * 1024):
                encoded_chunk = base64.b64encode(chunk).decode('utf-8')
                response = workspace.dbfs.add_block(handle, encoded_chunk)
                if response:
                    print("Add block response:", response)
                pbar.update(len(chunk))

    close_response = workspace.dbfs.close(handle)
    print("Close response:", close_response)

 

2 REPLIES

Sharanya13
Contributor

How about uploading the large file to S3 and linking that S3 location to a UC Volume?
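
For example, here is a minimal sketch of that setup, assuming the S3 path is already covered by a Unity Catalog external location with a valid storage credential (catalog, schema, volume, and bucket names are placeholders):

    # Register the S3 prefix as an external UC Volume (run in a Databricks notebook)
    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.default.big_files
        LOCATION 's3://my-bucket/big-files/'
    """)

    # Files uploaded to the bucket (e.g. with the AWS CLI or a boto3 multipart upload)
    # then appear under the volume path without a separate copy into Databricks
    print(dbutils.fs.ls("/Volumes/main/default/big_files/"))

The upload from the local machine then happens with standard S3 tooling, which handles multipart transfers of 100GB+ files well.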

Brahmareddy
Honored Contributor III

Hi PeSe,

How are you doing today? As per my understanding, you're absolutely right to think through both options carefully. Option 1 runs into memory issues because it tries to read the whole large file into memory at once, which doesn't work well for files over 100GB. Option 2 is technically correct, but it's painfully slow because the Databricks API only allows small chunk sizes (1MB), so uploading big files takes a lot of time.

A better and much faster way would be to upload your large files to cloud storage first, like AWS S3, Azure Blob Storage, or GCP Cloud Storage, and then copy them into Databricks using dbutils.fs.cp, or set up Auto Loader if you plan to do this regularly. Cloud platforms are designed for high-throughput data transfer, so this method will save you time and avoid memory or timeout issues. Let me know your cloud provider and I'd be happy to share exact steps.
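
For reference, here's a rough sketch of that flow for AWS, assuming the file has already been staged in S3, the cluster has read access to the bucket, and this runs in a Databricks notebook where dbutils is available (bucket, catalog, schema, volume, and file names are placeholders):

    # Copy the staged file from S3 into a Unity Catalog Volume in a single call;
    # the data moves cloud-side, so nothing is streamed through your local machine.
    dbutils.fs.cp(
        "s3://my-bucket/staging/big_file.bin",
        "/Volumes/main/default/landing/big_file.bin"
    )

For a recurring feed, Auto Loader (spark.readStream with the cloudFiles source) could instead pick up new files straight from the bucket without any copy step.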

Regards,

Brahma
