filipniziol
Esteemed Contributor

Hi  @rockybhai  ,Transferring 13 TB of data from Amazon Redshift to Databricks and writing it as a Delta table within 1 hour is a significant task.

Key Considerations

Network Bandwidth:

  • Data Transfer Rate: To move 13 TB in 1 hour, you need a sustained data transfer rate of approximately 3.6 GB/s.
  • Network Infrastructure: Ensure that your network can handle this throughput. To minimize latency make sure your Databricks workspace is in the same region as Redshift.

Redshift Export Strategy:

UNLOAD Command: Use the UNLOAD command to export data from Redshift to Amazon S3 in parallel, which allows for faster data extraction.

Databricks Cluster Configuration:

  • High-Performance Instances: Use instances with optimized I/O and network capabilities.
  • Parallelism: Maximize the number of worker nodes to distribute the workload efficiently.
  • Recommended Cluster Configuration
    • Choose the Right Instance Types. AWS Instance Types:
      • Compute-Optimized: c5d.9xlarge or c5d.18xlarge (36 to 72 vCPUs, high network bandwidth).
      • Storage-Optimized: i3.8xlarge or i3en.12xlarge (large NVMe SSDs for high I/O performance).
        Reasoning:
      • Compute-Optimized instances provide high CPU performance.
      • Storage-Optimized instances offer high disk I/O, which can benefit shuffle-intensive operations.
    • Configure the Number of Worker Nodes
      • Estimate Resources:
        • CPU Cores: Aim for at least 1,000 vCPUs in total across the cluster.
        • Memory: Ensure sufficient memory per executor to handle data partitions.
    • Example Configuration:
      • Instance Type: i3.8xlarge (32 vCPUs, 244 GB RAM).
      • Number of Workers: 32 workers to achieve 1,024 vCPUs.
    • Spark Configuration Settings
      • Parallelism: Set spark.sql.shuffle.partitions to a high value (e.g., 8000) to utilize all cores.
      • Optimized Writes: Enable optimized writes for Delta Lake

Data Transfer Strategy

  • Export Data from Redshift.
    • UNLOAD to S3:UNLOAD ('SELECT * FROM your_table') TO 's3://your-bucket/your-prefix/' IAM_ROLE 'your-iam-role' PARALLEL ON PARTITION BY 'your partitioning column(s)' ALLOWOVERWRITE GZIP;​ 
    • Export data using the UNLOAD command with parallelism
  • Read Data into Databricks
    • Use S3 as the Data Source: Reading from S3 can be faster and avoids overloading Redshift.
    • Parallel Reading:
      • Read data in parallel by pointing Spark to the S3 location with multiple files.
    • Write Data as Delta Table
      • Partitioning:
        • Partition the Delta table on a logical column to improve write performance.

Hope it helps.

By the way, Databricks has prepared a Redshift-Databricks migration guide you may want to check:
https://www.databricks.com/resources/guide/redshift-databricks-migration-guide/thank-you