Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
How to speed up Azure Databricks processing

Devsql
New Contributor III

Hi Team,

My team has designed an Azure Databricks solution and we are looking for ways to speed up the process.

Below are the details of the project:

1- Data is copied from SAP to an ADLS Gen2-based external location.

2- The project follows the medallion architecture, i.e. we have Bronze/Silver/Gold layer tables.

3- A single database and a single schema hold all tables across the Bronze/Silver/Gold layers.

4- Data (Parquet files) is ingested into the Bronze layer via an Auto Loader process, which runs in continuous mode (a rough sketch of this ingestion follows the list).

5- Bronze-layer tables are named like MyTable_1_raw and MyTable_1_append_raw.

6- From the Bronze-layer tables (MyTable_1_raw JOIN MyTable_2_raw JOIN MyTable_3_raw), we populate the Silver-layer tables, e.g. Silver_MyTable.

7- As we don't have any major transformations, the Gold-layer tables (Gold_MyTable) are replicas of the Silver-layer tables.

8- All these Silver/Gold tables are built with Delta Live Tables.

9- A job runs every hour and executes the Delta Live Tables pipeline, which joins several raw tables and then populates the Silver and Gold tables.
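
For reference, the Bronze ingestion from point 4 looks roughly like the sketch below (a minimal sketch only; the storage path is a placeholder, not our real location):

CREATE OR REFRESH STREAMING TABLE MyTable_1_raw
AS SELECT *
FROM cloud_files(
  'abfss://<container>@<storage-account>.dfs.core.windows.net/sap/MyTable_1/',  -- placeholder path
  'parquet'
);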

What we saw, and what the client found, is that populating the Silver-layer tables (Silver_MyTable) is taking too much time: almost 6 minutes to populate 6 million records.

The client is saying that we are NOT using a delta approach (i.e. Change Data Capture).

The client says that when the job runs every hour, there are only 5,000 or 6,000 records to insert/update. But per the Delta Live Tables job screenshots, the pipeline reports Written-Records = 6 million, which should NOT be the case.

This means the client expects us to use something like the logic below:

CREATE OR REFRESH LIVE TABLE SILVER_MyTable1
AS
SELECT t1.*
FROM MyTable_1_raw AS t1
JOIN MyTable_2_raw AS t2
  ON t1.Key = t2.Key
WHERE t1.Invoice_Create_Date >= current_date()
   OR t1.Invoice_Update_Date >= current_date()

This would pick up only the changed/updated records (5,000 or 6,000 records), i.e. the delta approach.

But my teammates say that since the above table (SILVER_MyTable1) is a Delta Live Table, executing the above code will DROP all records from the table and insert only the new records.

So let's say we loaded SILVER_MyTable1 with 10,000 records at 9 AM on Monday; then at 12 noon, when the process tries to insert 100 new records, the above code would first wipe out the 10,000 records and then insert the 100 records. Eventually the Gold-layer tables (Gold_MyTable) would also have only 100 records. But the client doesn't want that; the client wants all 10,000 + 100 records in the Gold-layer tables (Gold_MyTable).
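
In other words, the client expects upsert semantics on the Silver table. Just to illustrate that expectation (a sketch in plain Delta SQL, NOT our actual code, assuming Key uniquely identifies a record):

MERGE INTO Silver_MyTable1 AS tgt
USING (
  -- only today's new/changed records from the joined raw tables
  SELECT t1.*
  FROM MyTable_1_raw AS t1
  JOIN MyTable_2_raw AS t2
    ON t1.Key = t2.Key
  WHERE t1.Invoice_Create_Date >= current_date()
     OR t1.Invoice_Update_Date >= current_date()
) AS src
ON tgt.Key = src.Key
WHEN MATCHED THEN UPDATE SET *   -- refresh the rows that changed
WHEN NOT MATCHED THEN INSERT *;  -- add the ~100 new rows; the existing 10,000 stay intact

This keeps the existing 10,000 records and applies only the 100 changes.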

I hope this gives you a good idea of my issue.

Can you suggest any options to solve this?

If you feel that some points are not clear (or wrong), please share your thoughts and links to related articles.

Thanks in advance.
Devsql

3 REPLIES

daniel_sahal
Esteemed Contributor

@Devsql 
Could you share the code snippet that you're using to ingest the data into the silver layer?
If you're doing

CREATE OR REPLACE STREAMING TABLE

then you should switch to

CREATE OR REFRESH STREAMING TABLE

to ingest only the incremental data.
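
For example (a minimal sketch, assuming MyTable_1_raw can be read as a stream):

CREATE OR REFRESH STREAMING TABLE Silver_MyTable1
AS SELECT *
FROM STREAM(LIVE.MyTable_1_raw);  -- streaming read: each update processes only newly arrived rows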

Devsql
New Contributor III

Hi @daniel_sahal, thank you for the quick reply.

Below is the statement used to populate the Silver-layer tables:

CREATE OR REFRESH LIVE TABLE Silver_

So Daniel, at this moment I would like to clear up one doubt:

Will the above statement TRUNCATE the whole table and re-insert records into the Silver_ table?

OR

Will the above statement UPSERT into the Silver_ table?

Thanks

Devsql
New Contributor III

Hi @Retired_mod, @raphaelblg, could you shed some light on this issue?
