Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

need urgent help

rockybhai
New Contributor

I am bringing 13,000 GB of data from Redshift to Databricks by reading it through Spark and then writing it as a Delta table. What cluster configuration and how many worker nodes would you suggest if I need this done in 1 hour?

1 REPLY

filipniziol
Contributor

Hi @rockybhai, transferring 13 TB of data from Amazon Redshift to Databricks and writing it as a Delta table within 1 hour is a significant task.

Key Considerations

Network Bandwidth:

  • Data Transfer Rate: To move 13 TB in 1 hour, you need a sustained transfer rate of approximately 3.6 GB/s (see the quick check below).
  • Network Infrastructure: Ensure that your network can handle this throughput. To minimize latency, make sure your Databricks workspace is in the same region as your Redshift cluster.
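
A quick back-of-the-envelope check of that rate (decimal units, ignoring compression and protocol overhead):

    # Required sustained throughput to move 13 TB within a 1-hour window
    data_gb = 13_000          # total volume in GB
    window_s = 60 * 60        # 1 hour in seconds
    print(f"~{data_gb / window_s:.1f} GB/s")   # ~3.6 GB/s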

Redshift Export Strategy:

UNLOAD Command: Use the UNLOAD command to export data from Redshift to Amazon S3 in parallel, which allows for faster data extraction.

Databricks Cluster Configuration:

  • High-Performance Instances: Use instances with optimized I/O and network capabilities.
  • Parallelism: Maximize the number of worker nodes to distribute the workload efficiently.
  • Recommended Cluster Configuration
    • Choose the Right Instance Type (AWS):
      • Compute-Optimized: c5d.9xlarge or c5d.18xlarge (36 to 72 vCPUs, high network bandwidth).
      • Storage-Optimized: i3.8xlarge or i3en.12xlarge (large NVMe SSDs for high I/O performance).
      Reasoning:
      • Compute-Optimized instances provide high CPU performance.
      • Storage-Optimized instances offer high disk I/O, which benefits shuffle-intensive operations.
    • Configure the Number of Worker Nodes
      • Estimate Resources:
        • CPU Cores: Aim for at least 1,000 vCPUs in total across the cluster.
        • Memory: Ensure sufficient memory per executor to handle data partitions.
    • Example Configuration:
      • Instance Type: i3.8xlarge (32 vCPUs, 244 GB RAM).
      • Number of Workers: 32 workers to achieve 1,024 vCPUs.
    • Spark Configuration Settings (see the sketch after this list):
      • Parallelism: Set spark.sql.shuffle.partitions to a high value (e.g., 8000) to utilize all cores.
      • Optimized Writes: Enable optimized writes for Delta Lake.
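
A minimal sketch of those settings as Databricks notebook commands (the values are starting points to tune, not prescriptions; the two Delta flags are the Databricks workspace-level optimized-write and auto-compaction settings):

    # Increase shuffle parallelism so all ~1,000 cores stay busy
    spark.conf.set("spark.sql.shuffle.partitions", "8000")

    # Enable Delta Lake optimized writes and auto compaction on Databricks
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")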

Data Transfer Strategy

  • Export Data from Redshift
    • Export the data using the UNLOAD command with parallelism enabled.
    • UNLOAD to S3 (adjust the query, IAM role, and partition column(s) to your table):
        UNLOAD ('SELECT * FROM your_table')
        TO 's3://your-bucket/your-prefix/'
        IAM_ROLE 'your-iam-role'
        PARTITION BY (your_partition_column)
        PARALLEL ON
        ALLOWOVERWRITE
        GZIP;
  • Read Data into Databricks
    • Use S3 as the Data Source: Reading from S3 is typically faster and avoids overloading Redshift.
    • Parallel Reading: Read the data in parallel by pointing Spark at the S3 location, which holds many files.
  • Write Data as Delta Table
    • Partitioning: Partition the Delta table on a logical column to improve write performance (see the sketch after this list).
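
A rough sketch of the read-and-write step in PySpark, assuming the UNLOAD above produced gzip-compressed, pipe-delimited text files (Redshift's default UNLOAD output); the bucket, table, and column names below are placeholders:

    from pyspark.sql.types import StructType, StructField, LongType, StringType

    # Supply an explicit schema: inferring one would force an extra pass over 13 TB
    schema = StructType([
        StructField("id", LongType()),
        StructField("your_partition_column", StringType()),
        # ... remaining columns of your_table
    ])

    df = (
        spark.read
        .schema(schema)
        .option("sep", "|")                     # Redshift UNLOAD's default delimiter
        .csv("s3://your-bucket/your-prefix/")   # .gz files are decompressed automatically
    )

    (
        df.write
        .format("delta")
        .mode("overwrite")
        .partitionBy("your_partition_column")
        .saveAsTable("your_table")
    )

If the UNLOAD wrote Hive-style key=value folders via PARTITION BY, Spark's partition discovery should pick that column up from the S3 paths rather than from the file contents.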

Hope it helps.

By the way, Databricks has prepared a Redshift-Databricks migration guide you may want to check:
https://www.databricks.com/resources/guide/redshift-databricks-migration-guide/thank-you
