10-08-2024 06:10 AM
Hello,
I am trying to join few delta tables as per the code below.
select <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO
The PurchaseOrder table contains approximately 2 Billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) has a huge performance hit and the code keeps running for ever. There are duplicate EBELN and PO_NO values in both tables adding more heaviness to the join.
I have run the optimize / zorder on both the tables based on the joining keys as below but still it does't seem to work. Paritioning table isnt ideal as the joining keys are high cardinality columns.
EKBETable : OPTIMIZE EKBETable ZORDER BY (BELNR)
PurchaseOrder : OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)
What would be the best way to optmize this join? I am using the below cluster configuration.
Thanks,
Alok
10-08-2024 06:27 AM
have you tried liquid clustering on the source tables?
10-08-2024 07:20 AM
Not yet, but I will try that now
10-08-2024 06:55 AM
hey @AlokThampi,
it's difficult to understand what's going on here (not having access to spark UI, query profile or any idea about the dimensions of these tables) but you can give these a try:
Let me know if these have any effect or please provide more details to be able to find the bottleneck of this join 😉
10-08-2024 07:24 AM
Thanks Mo, I am yet to try liquid clustering, will do that now.
Also, can you please advise on what other details would you require to help me out (I have not used query profiler very extensively yet 😑).
What cluster size would you suggest to handle this worklaod? I am using a standard 16GB, 4 Core worker nodes which scales from 4 to 8. I feel that that might be a little under powered to do the job.
10-08-2024 08:02 AM
Hi Alok, try to gather statistics for the important columns. Databricks gathers stats for the first 32 columns of the table by default. This helps in data skipping. Check all your important/frequently used columns are in first 32 positions of the delta table. If not, gather stats of all important columns manually and see if it helps.
10-08-2024 11:57 AM
Hello @-werners-, @Mo ,
I tried the liquid clustering option as suggested but it still doesn't seem to work. 😶
I am assuming it to be an issue with the small cluster size that I am using.Or do you suggest any other options?
@noorbasha534 , the columns that I use are in the first 32 columns and also checked for statistics but didn't work unfortunately
10-08-2024 12:03 PM
@AlokThampi please use specific columns to gather stats. The command you used just gathers high level info. Check documentation for syntax to incorporate columns.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group