How to speed-up Azure Databricks processing

Devsql — Fri, 31 May 2024 09:03:11 GMT

Hi Team,

My team has designed Azure Databricks solution and we are looking for solution to speed-up process.

Below are details of project:

1- Data is copied from SAP to ADLS-Gen-2 based External location.

2- Project follows medallion architecture i.e. we have BRONZE/Silver/Gold layer based tables.

3- Single database and Single schema holds all tables of BRONZE/Silver/Gold layer based tables.

4- Data ( parquet files ) are ingested into Bronze layer via Auto-Loader process. This process runs on Continuous mode.

5- Bronze-Layer tables are named as MyTable_1_raw and MyTable_1_append_raw.

6- From Bronze-Layer tables( MyTable_1_raw JOIN MyTable_2_raw JOIN MyTable_3_raw ), we are populating Silver layer tables, i.e. Silver_MyTable.

7- As we don't have any major transformation, hence Gold-Layer-tables ( Gold_MyTable ) are replica of Silver-Layer-tables.

8- All these Silver/Gold tables are based on Delta-Live-Tables.

9- Every 1 Hr Job runs which execute Delta-Live-Tables-based-Pipeline, which joins several RAW tables and then populates Silver and Gold tables.

Now what we saw and what client found that populating Silver-Layer-tables (Silver_MyTable) is taking too much time, almost 6 mins to populate 6 million records.

What Client is saying that you guys are NOT using approach of DELTA-Approach (i.e. Change Data Capture ).

Client says that when job runs every 1 hr, there are 5000 or 6000 records to be insert/update. But as per Delta-Live-Table job's screenshots, pipeline says Written-Record = 6 million records which is NOT true.

It means client is expecting us to use something like below logic:

CREATE OR ~~REPLACE~~ REFRESH LIVE TABLE SILVER_MyTable1
AS
SELECT *
FROM MyTable_1_raw
JOIN MyTable_2_raw
ON Table_1.Key = Table_2.Key
WHERE Table_1.Invoice_Create_Date >= TODAY
OR Table_1.Invoice_Update_Date >= TODAY

This will take only changed/updated records (5000 or 6000 records), ( DELTA-Approach ).

But my teammates say that since above table (SILVER_MyTable1) is Delta-Live-Table hence executing above code will DROP all records from above table and will insert only new records.

so lets say on Monday 9 AM, we loaded (SILVER_MyTable1) with 10,000 records then at 12 NOON when process tries to insert new 100 records then above code, will first wipe-out 10,000 records and then insert 100 records. So eventually Gold-Layer-tables ( Gold_MyTable ) will also have 100 records. But Client don't want that, Client want all 10,000 + 100 records into Gold-Layer-tables ( Gold_MyTable ).

So i hope you got idea of my issue.

So can you guys suggest me any option to solve this process.

If you feel that some points are NOT clear (WRONG) then please suggest with your thoughts, related article links.

Thanks in advance.
Devsql

Re: How to speed-up Azure Databricks processing

daniel_sahal — Fri, 31 May 2024 08:07:53 GMT

@Devsql
Could you share a code snippet that you're using to ingest the data to silver layer?
If you're doing

CREATE OR REPLACE STREAMING TABLE

then you should switch to

CREATE OR REFRESH STREAMING TABLE

to ingest only incremental data

Re: How to speed-up Azure Databricks processing

Devsql — Fri, 31 May 2024 09:06:40 GMT

Hi @daniel_sahal , thank you for quick update.

Below is line used to populate Silver-Layer-Tables:

CREATE OR REFRESH LIVE TABLE Silver_

So Danilel, as this moment i would like to clear 1 doubt that...

Above statement will TRUNCATE whole table and Re-Insert records into Silver_ table ?

Above statement will do UPSERT with Silver_ table ?

Thanks

Re: How to speed-up Azure Databricks processing

Devsql — Fri, 31 May 2024 09:10:47 GMT

Hi @Retired_mod , @raphaelblg , would you like to throw some light on this issue.

topic Re: How to speed-up Azure Databricks processing in Data Engineering

How to speed-up Azure Databricks processing

Re: How to speed-up Azure Databricks processing

Re: How to speed-up Azure Databricks processing

Re: How to speed-up Azure Databricks processing