Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

need urgent help

rockybhai
New Contributor

I am bringing 13,000 GB of data from Redshift to Databricks by reading it through Spark and then writing it as a Delta table. What cluster configuration and how many worker nodes would you suggest if I need this done in 1 hour?

1 REPLY

filipniziol
Contributor

Hi @rockybhai, transferring 13 TB of data from Amazon Redshift to Databricks and writing it as a Delta table within 1 hour is a significant task.

Key Considerations

Network Bandwidth:

  • Data Transfer Rate: To move 13 TB in 1 hour, you need a sustained transfer rate of approximately 3.6 GB/s (see the quick check below).
  • Network Infrastructure: Ensure that your network can handle this throughput. To minimize latency, make sure your Databricks workspace is in the same region as your Redshift cluster.
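
A quick back-of-the-envelope check of that rate (decimal units, ignoring compression and protocol overhead):

    # Required sustained throughput to move 13 TB within a 1-hour window
    data_gb = 13_000          # total volume in GB
    window_s = 60 * 60        # 1 hour in seconds
    print(f"~{data_gb / window_s:.1f} GB/s")   # ~3.6 GB/s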

Redshift Export Strategy:

UNLOAD Command: Use the UNLOAD command to export data from Redshift to Amazon S3 in parallel, which allows for faster data extraction.

Databricks Cluster Configuration:

  • High-Performance Instances: Use instances with optimized I/O and network capabilities.
  • Parallelism: Maximize the number of worker nodes to distribute the workload efficiently.
  • Recommended Cluster Configuration
    • Choose the Right Instance Type (AWS):
      • Compute-Optimized: c5d.9xlarge or c5d.18xlarge (36 to 72 vCPUs, high network bandwidth).
      • Storage-Optimized: i3.8xlarge or i3en.12xlarge (large NVMe SSDs for high I/O performance).
      Reasoning:
      • Compute-Optimized instances provide high CPU performance.
      • Storage-Optimized instances offer high disk I/O, which benefits shuffle-intensive operations.
    • Configure the Number of Worker Nodes
      • Estimate Resources:
        • CPU Cores: Aim for at least 1,000 vCPUs in total across the cluster.
        • Memory: Ensure sufficient memory per executor to handle data partitions.
    • Example Configuration:
      • Instance Type: i3.8xlarge (32 vCPUs, 244 GB RAM).
      • Number of Workers: 32 workers to achieve 1,024 vCPUs.
    • Spark Configuration Settings (see the sketch after this list):
      • Parallelism: Set spark.sql.shuffle.partitions to a high value (e.g., 8000) to utilize all cores.
      • Optimized Writes: Enable optimized writes for Delta Lake.
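
A minimal sketch of those settings as Databricks notebook commands (the values are starting points to tune, not prescriptions; the two Delta flags are the Databricks workspace-level optimized-write and auto-compaction settings):

    # Increase shuffle parallelism so all ~1,000 cores stay busy
    spark.conf.set("spark.sql.shuffle.partitions", "8000")

    # Enable Delta Lake optimized writes and auto compaction on Databricks
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")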

Data Transfer Strategy

  • Export Data from Redshift
    • Export the data using the UNLOAD command with parallelism enabled.
    • UNLOAD to S3 (adjust the query, IAM role, and partition column(s) to your table):
        UNLOAD ('SELECT * FROM your_table')
        TO 's3://your-bucket/your-prefix/'
        IAM_ROLE 'your-iam-role'
        PARTITION BY (your_partition_column)
        PARALLEL ON
        ALLOWOVERWRITE
        GZIP;
  • Read Data into Databricks
    • Use S3 as the Data Source: Reading from S3 is typically faster and avoids overloading Redshift.
    • Parallel Reading: Read the data in parallel by pointing Spark at the S3 location, which holds many files.
  • Write Data as Delta Table
    • Partitioning: Partition the Delta table on a logical column to improve write performance (see the sketch after this list).
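
A rough sketch of the read-and-write step in PySpark, assuming the UNLOAD above produced gzip-compressed, pipe-delimited text files (Redshift's default UNLOAD output); the bucket, table, and column names below are placeholders:

    from pyspark.sql.types import StructType, StructField, LongType, StringType

    # Supply an explicit schema: inferring one would force an extra pass over 13 TB
    schema = StructType([
        StructField("id", LongType()),
        StructField("your_partition_column", StringType()),
        # ... remaining columns of your_table
    ])

    df = (
        spark.read
        .schema(schema)
        .option("sep", "|")                     # Redshift UNLOAD's default delimiter
        .csv("s3://your-bucket/your-prefix/")   # .gz files are decompressed automatically
    )

    (
        df.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("your_partition_column")
        .saveAsTable("your_table")
    )

If the UNLOAD wrote Hive-style key=value folders via PARTITION BY, Spark's partition discovery should pick that column up from the S3 paths rather than from the file contents.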

Hope it helps.

By the way, Databricks has prepared a Redshift-Databricks migration guide you may want to check:
https://www.databricks.com/resources/guide/redshift-databricks-migration-guide/thank-you
