topic Re: Joining huge delta tables in Databricks in Data Engineering

Joining huge delta tables in Databricks

AlokThampi — Tue, 08 Oct 2024 13:10:19 GMT

Hello,

I am trying to join few delta tables as per the code below.

SQLCopy

select <applicable columns>
FROM ReportTable G
LEFT JOIN EKBETable EKBE ON EKBE.BELNR = G.ORDER_ID
LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO

The PurchaseOrder table contains approximately 2 Billion records and the EKBE table contains ~500 million records. The last join (LEFT JOIN PurchaseOrder POL ON EKBE.EBELN = POL.PO_NO) has a huge performance hit and the code keeps running for ever. There are duplicate EBELN and PO_NO values in both tables adding more heaviness to the join.

I have run the optimize / zorder on both the tables based on the joining keys as below but still it does't seem to work. Paritioning table isnt ideal as the joining keys are high cardinality columns.

EKBETable : OPTIMIZE EKBETable ZORDER BY (BELNR)

PurchaseOrder : OPTIMIZE PurchaseOrder ZORDER BY (PO_NO)

What would be the best way to optmize this join? I am using the below cluster configuration.

Thanks,

Alok

Re: Joining huge delta tables in Databricks

-werners- — Tue, 08 Oct 2024 13:27:13 GMT

have you tried liquid clustering on the source tables?

Re: Joining huge delta tables in Databricks

Mo — Tue, 08 Oct 2024 13:55:56 GMT

hey @AlokThampi,

it's difficult to understand what's going on here (not having access to spark UI, query profile or any idea about the dimensions of these tables) but you can give these a try:

make sure the keys have the same data type (i.e. all three are Long or INT)
you are using 3 tables, while I see you optimized only two of them, and reorganized the data layout using zorder.I highly suggest to use liquid clustering on all three
1. the ReportTable should be clustered by ORDER_ID.
2. the EKBETable should be clustered by BELNR
3. the PurchaseOrder should be clustered by PO_NO.

Let me know if these have any effect or please provide more details to be able to find the bottleneck of this join 😉

Re: Joining huge delta tables in Databricks

AlokThampi — Tue, 08 Oct 2024 14:20:59 GMT

Not yet, but I will try that now

Re: Joining huge delta tables in Databricks

AlokThampi — Tue, 08 Oct 2024 14:24:48 GMT

Thanks Mo, I am yet to try liquid clustering, will do that now.

Also, can you please advise on what other details would you require to help me out (I have not used query profiler very extensively yet 😑).

What cluster size would you suggest to handle this worklaod? I am using a standard 16GB, 4 Core worker nodes which scales from 4 to 8. I feel that that might be a little under powered to do the job.

Re: Joining huge delta tables in Databricks

noorbasha534 — Tue, 08 Oct 2024 15:02:23 GMT

Hi Alok, try to gather statistics for the important columns. Databricks gathers stats for the first 32 columns of the table by default. This helps in data skipping. Check all your important/frequently used columns are in first 32 positions of the delta table. If not, gather stats of all important columns manually and see if it helps.

Re: Joining huge delta tables in Databricks

AlokThampi — Tue, 08 Oct 2024 18:57:47 GMT

Hello @-werners-, @Mo ,

I tried the liquid clustering option as suggested but it still doesn't seem to work. 😶

I am assuming it to be an issue with the small cluster size that I am using.Or do you suggest any other options?

@noorbasha534 , the columns that I use are in the first 32 columns and also checked for statistics but didn't work unfortunately

ANALYZE TABLE <table_name> COMPUTE STATISTICS;

Re: Joining huge delta tables in Databricks

noorbasha534 — Tue, 08 Oct 2024 19:03:14 GMT

@AlokThampi please use specific columns to gather stats. The command you used just gathers high level info. Check documentation for syntax to incorporate columns.