
Joining huge Delta tables in Databricks

AlokThampi
New Contributor III

Hello,

I am trying to join a few Delta tables as per the code below.

SQL

SELECT <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

The PurchaseOrder table contains approximately 2 billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) takes a huge performance hit and the query runs forever. There are duplicate EBELN and PO_NO values in both tables, which makes the join even heavier.

I have run OPTIMIZE with ZORDER on both tables on the join keys as below, but it still doesn't seem to help. Partitioning the tables isn't ideal since the join keys are high-cardinality columns.

EKBETable: OPTIMIZE EKBETable ZORDER BY (BELNR)

PurchaseOrder: OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)

What would be the best way to optimize this join? I am using the cluster configuration below.

[screenshot: cluster configuration]

Thanks,

Alok


7 REPLIES

-werners-
Esteemed Contributor III

Have you tried liquid clustering on the source tables?

AlokThampi
New Contributor III

Not yet, but I will try that now


Mo
Databricks Employee

Hey @AlokThampi,

It's difficult to tell what's going on here (without access to the Spark UI, a query profile, or any idea about the dimensions of these tables), but you can give these a try:

  1. Make sure the join keys have the same data type (e.g., all of them LONG or INT).
  2. You are using three tables, while I see you optimized only two of them and reorganized the data layout using Z-order. I highly suggest using liquid clustering on all three (see the sketch after this list):
    1. ReportTable should be clustered by ORDER_ID.
    2. EKBETable should be clustered by BELNR.
    3. PurchaseOrder should be clustered by PO_NO.
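
A minimal sketch of what that could look like, assuming these are unpartitioned Delta tables on a runtime that supports liquid clustering (table and column names are taken from the query above):

SQL

-- Enable liquid clustering on each table, keyed by its join column.
ALTER TABLE ReportTable CLUSTER BY (ORDER_ID);
ALTER TABLE EKBETable CLUSTER BY (BELNR);
ALTER TABLE PurchaseOrder CLUSTER BY (PO_NO);

-- For liquid-clustered tables, OPTIMIZE (without ZORDER) incrementally
-- reclusters the existing data.
OPTIMIZE ReportTable;
OPTIMIZE EKBETable;
OPTIMIZE PurchaseOrder;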

Let me know if these have any effect, or please provide more details so we can find the bottleneck of this join 😉


AlokThampi
New Contributor III

Thanks Mo, I am yet to try liquid clustering; I will do that now.

Also, could you please advise what other details you would need to help me out (I have not used the query profiler very extensively yet 😑)?

What cluster size would you suggest to handle this workload? I am using standard 16 GB, 4-core worker nodes autoscaling from 4 to 8 workers. I feel that might be a little underpowered for the job.

noorbasha534
New Contributor II

Hi Alok, try gathering statistics for the important columns. Databricks gathers stats for the first 32 columns of a table by default, which helps with data skipping. Check that all your important/frequently used columns are within the first 32 positions of the Delta table. If not, gather stats for the important columns manually and see if it helps.
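
A sketch of how you might check and adjust this, assuming the default 32-column limit is in play (delta.dataSkippingNumIndexedCols is the Delta table property that controls it; the value 40 is just an example):

SQL

-- Check where the join key sits in the column order.
DESCRIBE TABLE PurchaseOrder;

-- Example: widen the range of columns Delta collects file-level stats for.
-- Only files written after this change will carry stats for the extra columns.
ALTER TABLE PurchaseOrder
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');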

AlokThampi
New Contributor III

Hello @-werners- @Mo,

I tried the liquid clustering option as suggested but it still doesn't seem to work. 😶

I am assuming it is an issue with the small cluster size that I am using. Or do you suggest any other options?

@noorbasha534, the columns that I use are within the first 32, and I also gathered statistics, but unfortunately it didn't work:

ANALYZE TABLE <table_name> COMPUTE STATISTICS;


@AlokThampi, please gather stats on specific columns. The command you used only gathers high-level table statistics. Check the documentation for the syntax to include columns; see the sketch below.
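
A minimal sketch, using the join keys from this thread (substitute whichever columns you actually filter and join on):

SQL

-- Gather column-level statistics (min/max, distinct counts, null counts)
-- for the join keys so the optimizer can plan the join better.
ANALYZE TABLE EKBETable COMPUTE STATISTICS FOR COLUMNS BELNR, EBELN;
ANALYZE TABLE PurchaseOrder COMPUTE STATISTICS FOR COLUMNS PO_NO;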
