Data Engineering

How to quickly sync large files (>100GB)

PeSe
New Contributor

I want to sync large files (>100GB) from my local system to a DBX Volume. I see two options, each with a different problem. Do you have any suggestions?

Option 1: Reads the entire file into memory -> memory issues

    import io
    from databricks.sdk import WorkspaceClient

    workspace = WorkspaceClient()  # Databricks SDK workspace client

    # Reads the whole file into memory before uploading it via the Files API
    with open(local_file_path, 'rb') as file:
        file_bytes = file.read()
        binary_data = io.BytesIO(file_bytes)
        response = workspace.files.upload(dbx_file_path, binary_data, overwrite=True)
        if response:
            print(response)

Option 2: Only 1MB chunks allowed on the DBX side -> very slow

    import base64
    import os
    from databricks.sdk import WorkspaceClient
    from tqdm import tqdm

    workspace = WorkspaceClient()  # Databricks SDK workspace client

    # Open a streaming handle for the target DBFS path
    create_response = workspace.dbfs.create(dbx_file_path, overwrite=True)
    handle = create_response.handle

    file_size = os.path.getsize(local_file_path)
    with open(local_file_path, 'rb') as file:
        with tqdm(total=file_size, unit='B', unit_scale=True, desc="Uploading") as pbar:
            # The DBFS API accepts at most 1 MB of data per add_block call
            while chunk := file.read(1024 * 1024):
                encoded_chunk = base64.b64encode(chunk).decode('utf-8')
                response = workspace.dbfs.add_block(handle, encoded_chunk)
                if response:
                    print("Add block response:", response)
                pbar.update(len(chunk))

    close_response = workspace.dbfs.close(handle)
    print("Close response:", close_response)

 

2 REPLIES

Sharanya13
Contributor

How about uploading the large file to S3 and linking that S3 location to a UC Volume?
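
For example, here is a minimal sketch of that setup, assuming the S3 path is already covered by a Unity Catalog external location with a valid storage credential (catalog, schema, volume, and bucket names are placeholders):

    # Register the S3 prefix as an external UC Volume (run in a Databricks notebook)
    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.default.big_files
        LOCATION 's3://my-bucket/big-files/'
    """)

    # Files uploaded to the bucket (e.g. with the AWS CLI or a boto3 multipart upload)
    # then appear under the volume path without a separate copy into Databricks
    print(dbutils.fs.ls("/Volumes/main/default/big_files/"))

The upload from the local machine then happens with standard S3 tooling, which handles multipart transfers of 100GB+ files well.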

Brahmareddy
Honored Contributor III

Hi PeSe,

How are you doing today? As per my understanding, you're absolutely right to think through both options carefully. Option 1 runs into memory issues because it tries to read the whole large file into memory at once, which doesn't work well for files over 100GB. Option 2 is technically correct, but it's painfully slow because the Databricks API only allows small chunk sizes (1MB), so uploading big files takes a lot of time.

A better and much faster way would be to upload your large files to cloud storage first, like AWS S3, Azure Blob Storage, or GCP Cloud Storage, and then copy them into Databricks using dbutils.fs.cp, or set up Auto Loader if you plan to do this regularly. Cloud platforms are designed for high-throughput data transfer, so this method will save you time and avoid memory or timeout issues. Let me know your cloud provider and I'd be happy to share exact steps.
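
For reference, here's a rough sketch of that flow for AWS, assuming the file has already been staged in S3, the cluster has read access to the bucket, and this runs in a Databricks notebook where dbutils is available (bucket, catalog, schema, volume, and file names are placeholders):

    # Copy the staged file from S3 into a Unity Catalog Volume in a single call;
    # the data moves cloud-side, so nothing is streamed through your local machine.
    dbutils.fs.cp(
        "s3://my-bucket/staging/big_file.bin",
        "/Volumes/main/default/landing/big_file.bin"
    )

For a recurring feed, Auto Loader (spark.readStream with the cloudFiles source) could instead pick up new files straight from the bucket without any copy step.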

Regards,

Brahma
