Re: need urgent help

filipniziol · ‎09-27-2024

Hi @rockybhai ,Transferring 13 TB of data from Amazon Redshift to Databricks and writing it as a Delta table within 1 hour is a significant task.

Key Considerations

Network Bandwidth:

Data Transfer Rate: To move 13 TB in 1 hour, you need a sustained data transfer rate of approximately 3.6 GB/s.
Network Infrastructure: Ensure that your network can handle this throughput. To minimize latency make sure your Databricks workspace is in the same region as Redshift.

Redshift Export Strategy:

UNLOAD Command: Use the UNLOAD command to export data from Redshift to Amazon S3 in parallel, which allows for faster data extraction.

Databricks Cluster Configuration:

High-Performance Instances: Use instances with optimized I/O and network capabilities.
Parallelism: Maximize the number of worker nodes to distribute the workload efficiently.
Recommended Cluster Configuration
- Choose the Right Instance Types. AWS Instance Types:
  - Compute-Optimized: c5d.9xlarge or c5d.18xlarge (36 to 72 vCPUs, high network bandwidth).
  - Storage-Optimized: i3.8xlarge or i3en.12xlarge (large NVMe SSDs for high I/O performance).
    Reasoning:
  - Compute-Optimized instances provide high CPU performance.
  - Storage-Optimized instances offer high disk I/O, which can benefit shuffle-intensive operations.
- Configure the Number of Worker Nodes
  - Estimate Resources:
    - CPU Cores: Aim for at least 1,000 vCPUs in total across the cluster.
    - Memory: Ensure sufficient memory per executor to handle data partitions.
- Example Configuration:
  - Instance Type: i3.8xlarge (32 vCPUs, 244 GB RAM).
  - Number of Workers: 32 workers to achieve 1,024 vCPUs.
- Spark Configuration Settings
  - Parallelism: Set spark.sql.shuffle.partitions to a high value (e.g., 8000) to utilize all cores.
  - Optimized Writes: Enable optimized writes for Delta Lake

Data Transfer Strategy

Export Data from Redshift.
- UNLOAD to S3:UNLOAD ('SELECT * FROM your_table') TO 's3://your-bucket/your-prefix/' IAM_ROLE 'your-iam-role' PARALLEL ON PARTITION BY 'your partitioning column(s)' ALLOWOVERWRITE GZIP;
- Export data using the UNLOAD command with parallelism
Read Data into Databricks

Use S3 as the Data Source: Reading from S3 can be faster and avoids overloading Redshift.
Parallel Reading:
- Read data in parallel by pointing Spark to the S3 location with multiple files.
Write Data as Delta Table
- Partitioning:
  - Partition the Delta table on a logical column to improve write performance.

Hope it helps.

By the way, Databricks has prepared a Redshift-Databricks migration guide you may want to check:
https://www.databricks.com/resources/guide/redshift-databricks-migration-guide/thank-you