Hi @rockybhai,
Transferring 13 TB of data from Amazon Redshift to Databricks and writing it as a Delta table within 1 hour is a significant task.
Key Considerations
Network Bandwidth:
- Data Transfer Rate: To move 13 TB in 1 hour, you need a sustained transfer rate of approximately 3.6 GB/s (see the quick calculation below).
- Network Infrastructure: Ensure that your network can handle this throughput. To minimize latency, make sure your Databricks workspace is in the same AWS region as your Redshift cluster.
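As a quick sanity check on that bandwidth figure, here is the back-of-the-envelope arithmetic; the 32-worker number only mirrors the example cluster further down and is illustrative, not a requirement:

```python
# Back-of-the-envelope throughput check for moving 13 TB in 1 hour.
data_gb = 13 * 1000                 # 13 TB expressed in GB (decimal units)
window_s = 60 * 60                  # one hour in seconds

required_gb_per_s = data_gb / window_s                 # ~3.6 GB/s sustained, cluster-wide
per_worker_mb_per_s = required_gb_per_s * 1000 / 32    # ~113 MB/s per worker with 32 workers

print(f"{required_gb_per_s:.2f} GB/s total, {per_worker_mb_per_s:.0f} MB/s per worker")
```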
Redshift Export Strategy:
UNLOAD Command: Use the UNLOAD command to export data from Redshift to Amazon S3 in parallel, which allows for faster data extraction.
Databricks Cluster Configuration:
- High-Performance Instances: Use instances with optimized I/O and network capabilities.
- Parallelism: Maximize the number of worker nodes to distribute the workload efficiently.
Recommended Cluster Configuration
- Choose the right instance types (AWS):
  - Compute-Optimized: c5d.9xlarge or c5d.18xlarge (36 to 72 vCPUs, high network bandwidth).
  - Storage-Optimized: i3.8xlarge or i3en.12xlarge (large NVMe SSDs for high I/O performance).
- Reasoning: Compute-Optimized instances provide high CPU performance, while Storage-Optimized instances offer high disk I/O, which benefits shuffle-intensive operations.
Configure the Number of Worker Nodes
- Estimate resources:
  - CPU cores: Aim for at least 1,000 vCPUs in total across the cluster.
  - Memory: Ensure sufficient memory per executor to handle the data partitions.
- Example configuration (a concrete spec sketch follows below):
  - Instance type: i3.8xlarge (32 vCPUs, 244 GB RAM).
  - Number of workers: 32 workers, giving 32 × 32 = 1,024 vCPUs.
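For concreteness, that example sizing could be expressed roughly as a Databricks Clusters API payload like the sketch below; the cluster name and runtime version string are placeholders, and you should adjust everything to your environment:

```python
# Illustrative Clusters API payload matching the example sizing above.
cluster_spec = {
    "cluster_name": "redshift-migration",
    "spark_version": "14.3.x-scala2.12",   # placeholder DBR LTS version
    "node_type_id": "i3.8xlarge",          # 32 vCPUs, 244 GB RAM per worker
    "num_workers": 32,                     # 32 x 32 = 1,024 vCPUs in total
    "aws_attributes": {
        "availability": "ON_DEMAND"        # avoid spot interruptions mid-transfer
    },
}
```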
Spark Configuration Settings
- Parallelism: Set spark.sql.shuffle.partitions to a high value (e.g., 8000) to utilize all cores.
- Optimized writes: Enable optimized writes for Delta Lake (see the sketch below).
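A minimal sketch of how those settings could be applied from a notebook (where the spark session is already available); the values are starting points to benchmark rather than prescriptions:

```python
# Session-level settings for a large parallel load; tune the values for your workload.
spark.conf.set("spark.sql.shuffle.partitions", "8000")

# Delta Lake optimized writes and auto compaction (Databricks configuration keys;
# these can also be set as table properties instead).
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```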
Data Transfer Strategy
- Export data from Redshift using the UNLOAD command, which writes to S3 in parallel:

  UNLOAD ('SELECT * FROM your_table')
  TO 's3://your-bucket/your-prefix/'
  IAM_ROLE 'your-iam-role'
  PARALLEL ON
  PARTITION BY (your_partition_column)
  ALLOWOVERWRITE
  GZIP;
Read Data into Databricks
- Use S3 as the data source: Reading the unloaded files from S3 is faster and avoids overloading Redshift.
- Parallel reading: Point Spark at the S3 prefix; since UNLOAD produces many files, Spark reads them in parallel across all workers (see the sketch below).
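For illustration, a minimal PySpark read of the UNLOAD output, assuming pipe-delimited GZIP text files under the same bucket/prefix used in the UNLOAD example above (switch the format and options if you unload Parquet instead):

```python
# Read the UNLOAD output from S3 in parallel; the path and options are placeholders.
source_path = "s3://your-bucket/your-prefix/"

df = (
    spark.read
        .format("csv")              # UNLOAD text output is pipe-delimited by default
        .option("sep", "|")
        .option("header", "false")
        .load(source_path)          # .gz files are decompressed automatically, and
)                                   # partition folders (col=value/) are discovered by Spark

# Without an explicit schema every column arrives as a string; supply .schema(...)
# or cast the columns before writing the Delta table.
```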
Write Data as Delta Table
- Partitioning: Partition the Delta table on a logical column to improve write performance (see the sketch below).
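And a sketch of the final write step; the target path and partition column are placeholders to replace with your own:

```python
# Write the DataFrame out as a partitioned Delta table.
target_path = "s3://your-bucket/delta/your_table/"

(
    df.write
      .format("delta")
      .mode("overwrite")
      .partitionBy("partition_col")   # choose a low-cardinality, logical column
      .save(target_path)
)

# Optionally register the table in the metastore so it can be queried by name:
# spark.sql(f"CREATE TABLE your_table USING DELTA LOCATION '{target_path}'")
```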
Hope it helps.
By the way, Databricks has prepared a Redshift-Databricks migration guide you may want to check:
https://www.databricks.com/resources/guide/redshift-databricks-migration-guide/thank-you