
Joining huge Delta tables in Databricks

AlokThampi
New Contributor III

Hello,

I am trying to join a few Delta tables as per the code below.

SQL

SELECT <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

The PurchaseOrder table contains approximately 2 billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) takes a huge performance hit and the query runs forever. There are duplicate EBELN and PO_NO values in both tables, which makes the join even heavier.

I have run OPTIMIZE with ZORDER on both tables on the join keys as below, but it still doesn't seem to help. Partitioning the tables isn't ideal since the join keys are high-cardinality columns.

EKBETable: OPTIMIZE EKBETable ZORDER BY (BELNR)

PurchaseOrder: OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)

What would be the best way to optimize this join? I am using the cluster configuration below.

[screenshot: cluster configuration]

Thanks,

Alok


7 REPLIES

-werners-
Esteemed Contributor III

Have you tried liquid clustering on the source tables?

AlokThampi
New Contributor III

Not yet, but I will try that now


Mo
Databricks Employee

Hey @AlokThampi,

It's difficult to tell what's going on here (without access to the Spark UI, a query profile, or any idea about the dimensions of these tables), but you can give these a try:

  1. Make sure the join keys have the same data type (e.g., all of them LONG or INT).
  2. You are using three tables, while I see you optimized only two of them and reorganized the data layout using Z-order. I highly suggest using liquid clustering on all three (see the sketch after this list):
    1. ReportTable should be clustered by ORDER_ID.
    2. EKBETable should be clustered by BELNR.
    3. PurchaseOrder should be clustered by PO_NO.
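
A minimal sketch of what that could look like, assuming these are unpartitioned Delta tables on a runtime that supports liquid clustering (table and column names are taken from the query above):

SQL

-- Enable liquid clustering on each table, keyed by its join column.
ALTER TABLE ReportTable CLUSTER BY (ORDER_ID);
ALTER TABLE EKBETable CLUSTER BY (BELNR);
ALTER TABLE PurchaseOrder CLUSTER BY (PO_NO);

-- For liquid-clustered tables, OPTIMIZE (without ZORDER) incrementally
-- reclusters the existing data.
OPTIMIZE ReportTable;
OPTIMIZE EKBETable;
OPTIMIZE PurchaseOrder;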

Let me know if these have any effect, or please provide more details so we can find the bottleneck of this join 😉


AlokThampi
New Contributor III

Thanks Mo, I am yet to try liquid clustering; I will do that now.

Also, could you please advise what other details you would need to help me out (I have not used the query profiler very extensively yet 😑)?

What cluster size would you suggest to handle this workload? I am using standard 16 GB, 4-core worker nodes autoscaling from 4 to 8 workers. I feel that might be a little underpowered for the job.

noorbasha534
New Contributor II

Hi Alok, try gathering statistics for the important columns. Databricks gathers stats for the first 32 columns of a table by default, which helps with data skipping. Check that all your important/frequently used columns are within the first 32 positions of the Delta table. If not, gather stats for the important columns manually and see if it helps.
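
A sketch of how you might check and adjust this, assuming the default 32-column limit is in play (delta.dataSkippingNumIndexedCols is the Delta table property that controls it; the value 40 is just an example):

SQL

-- Check where the join key sits in the column order.
DESCRIBE TABLE PurchaseOrder;

-- Example: widen the range of columns Delta collects file-level stats for.
-- Only files written after this change will carry stats for the extra columns.
ALTER TABLE PurchaseOrder
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');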

AlokThampi
New Contributor III

Hello @-werners- @Mo,

I tried the liquid clustering option as suggested but it still doesn't seem to work. 😶

I am assuming it is an issue with the small cluster size that I am using. Or do you suggest any other options?

@noorbasha534, the columns that I use are within the first 32, and I also gathered statistics, but unfortunately it didn't work:

ANALYZE TABLE <table_name> COMPUTE STATISTICS;


@AlokThampi, please gather stats on specific columns. The command you used only gathers high-level table statistics. Check the documentation for the syntax to include columns; see the sketch below.
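
A minimal sketch, using the join keys from this thread (substitute whichever columns you actually filter and join on):

SQL

-- Gather column-level statistics (min/max, distinct counts, null counts)
-- for the join keys so the optimizer can plan the join better.
ANALYZE TABLE EKBETable COMPUTE STATISTICS FOR COLUMNS BELNR, EBELN;
ANALYZE TABLE PurchaseOrder COMPUTE STATISTICS FOR COLUMNS PO_NO;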
