
Merge Delta tables with more than 200 million rows

Mohammad_Younus
New Contributor

Hi everyone,

I'm trying to merge two Delta tables, each containing more than 200 million rows. The tables are properly optimized, but the job takes a very long time to execute and records huge memory spills (1 TB-3 TB), and it is still running. I'm working with 5 executor nodes on the Standard_DS5_V2 configuration. Can someone help me optimize this?

[Attached screenshot: Mohammad_Younus_0-1698373999153.png]

Any help would be greatly appreciated.


Kaniz
Community Manager

Hi @Mohammad_Younus,

When dealing with Delta tables of over 200 million rows each, optimizing the merge operation becomes crucial to limit memory spills and reduce execution time.

Here are some effective strategies to enhance the efficiency of your merge operation; an illustrative code sketch for each of points 1-5 follows the list:

  1. Utilize the MERGE Operation: Delta Lake's MERGE is purpose-built for upserting large numbers of rows and is generally more efficient than manually joining the tables and rewriting the target.

  2. Shuffle Optimization: Fine-tune shuffle settings to minimize memory usage during the merge operation. Adjust parameters like spark.sql.shuffle.partitions and spark.sql.autoBroadcastJoinThreshold for optimal memory utilization.

  3. Table Partitioning: Consider partitioning your Delta tables by a low-cardinality column (for example, a date) that appears in the merge predicate. Partition pruning then limits how much data must be read and shuffled during the merge. Avoid partitioning by high-cardinality columns, which creates many small partitions and files.

  4. Z-Ordering: Implement Z-ordering on the merge key (typically a high-cardinality column). Z-ordering co-locates rows with similar key values in the same files, so Delta's data skipping can avoid reading files that cannot match, reducing the data loaded into memory for the join.

  5. Batch Size Reduction: If memory-intensive joins are causing issues, reduce the batch size for the merging operation. This breaks down the merge into smaller, more manageable batches, lowering the memory footprint and preventing memory spills.

  6. Cluster Configuration: If spills persist after the above, consider scaling up the cluster. Standard_DS5_v2 workers have 56 GB of RAM each; with 1-3 TB of recorded spill, adding workers or moving to memory-optimized instance types can substantially reduce disk spill.
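
For point 1, here is a minimal sketch using the Python Delta Lake API, assuming a Databricks notebook where `spark` is predefined. The paths and the `id` merge key are placeholders for illustration:

```python
from delta.tables import DeltaTable

# Hypothetical paths and merge key; replace with your own.
target = DeltaTable.forPath(spark, "/mnt/delta/target")
updates = spark.read.format("delta").load("/mnt/delta/updates")

(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")  # join on the merge key
    .whenMatchedUpdateAll()                    # update rows that already exist
    .whenNotMatchedInsertAll()                 # insert rows that are new
    .execute())
```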
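
For point 2, shuffle settings can be tuned per job. The values below are starting points to experiment with, not universal recommendations:

```python
# More shuffle partitions means smaller partitions per task, so less spill per task.
spark.conf.set("spark.sql.shuffle.partitions", "1200")

# Neither side of a 200M-row merge fits in a broadcast; disabling auto-broadcast
# avoids accidental memory pressure from a mis-estimated plan.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

# Adaptive query execution can coalesce or split shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```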
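
For point 3, a sketch assuming the data has a low-cardinality `event_date` column that can also be added to the merge predicate so untouched partitions are pruned:

```python
# One-time rewrite of the target table, partitioned by a hypothetical event_date column.
(spark.read.format("delta").load("/mnt/delta/target")
    .write.format("delta")
    .partitionBy("event_date")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save("/mnt/delta/target_partitioned"))

# Including the partition column in the merge condition enables partition pruning.
condition = "t.id = s.id AND t.event_date = s.event_date"
```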
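
For point 4, Z-ordering is applied with the OPTIMIZE command, here via Spark SQL on the hypothetical `id` merge key:

```python
# Cluster the target's data files by the merge key so data skipping can
# prune files during the merge's join phase.
spark.sql("OPTIMIZE delta.`/mnt/delta/target` ZORDER BY (id)")
```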
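
For point 5, one simple batching scheme is to split the source by the partition column and merge one slice at a time; the `event_date` column and paths are again illustrative:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/delta/target")
source = spark.read.format("delta").load("/mnt/delta/updates")

# Merge one event_date slice at a time to cap the working set of each merge.
dates = [r["event_date"] for r in source.select("event_date").distinct().collect()]
for d in dates:
    batch = source.where(source.event_date == d)
    (target.alias("t")
        .merge(batch.alias("s"), f"t.id = s.id AND t.event_date = '{d}'")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```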

By applying these optimization techniques, you can execute merge operations on large Delta tables with minimal memory spills and faster completion times. These strategies are essential for managing substantial datasets effectively.
