Hi Team,
My team has designed Azure Databricks solution and we are looking for solution to speed-up process.
Below are details of project:
1- Data is copied from SAP to ADLS-Gen-2 based External location.
2- Project follows medallion architecture i.e. we have BRONZE/Silver/Gold layer based tables.
3- Single database and Single schema holds all tables of BRONZE/Silver/Gold layer based tables.
4- Data ( parquet files ) are ingested into Bronze layer via Auto-Loader process. This process runs on Continuous mode.
5- Bronze-Layer tables are named as MyTable_1_raw and MyTable_1_append_raw.
6- From Bronze-Layer tables( MyTable_1_raw JOIN MyTable_2_raw JOIN MyTable_3_raw ), we are populating Silver layer tables, i.e. Silver_MyTable.
7- As we don't have any major transformation, hence Gold-Layer-tables ( Gold_MyTable ) are replica of Silver-Layer-tables.
8- All these Silver/Gold tables are based on Delta-Live-Tables.
9- Every 1 Hr Job runs which execute Delta-Live-Tables-based-Pipeline, which joins several RAW tables and then populates Silver and Gold tables.
Now what we saw and what client found that populating Silver-Layer-tables (Silver_MyTable) is taking too much time, almost 6 mins to populate 6 million records.
What Client is saying that you guys are NOT using approach of DELTA-Approach (i.e. Change Data Capture ).
Client says that when job runs every 1 hr, there are 5000 or 6000 records to be insert/update. But as per Delta-Live-Table job's screenshots, pipeline says Written-Record = 6 million records which is NOT true.
It means client is expecting us to use something like below logic:
CREATE OR REPLACE REFRESH LIVE TABLE SILVER_MyTable1
AS
SELECT *
FROM MyTable_1_raw
JOIN MyTable_2_raw
ON Table_1.Key = Table_2.Key
WHERE Table_1.Invoice_Create_Date >= TODAY
OR Table_1.Invoice_Update_Date >= TODAY
This will take only changed/updated records (5000 or 6000 records), ( DELTA-Approach ).
But my teammates say that since above table (SILVER_MyTable1) is Delta-Live-Table hence executing above code will DROP all records from above table and will insert only new records.
so lets say on Monday 9 AM, we loaded (SILVER_MyTable1) with 10,000 records then at 12 NOON when process tries to insert new 100 records then above code, will first wipe-out 10,000 records and then insert 100 records. So eventually Gold-Layer-tables ( Gold_MyTable ) will also have 100 records. But Client don't want that, Client want all 10,000 + 100 records into Gold-Layer-tables ( Gold_MyTable ).
So i hope you got idea of my issue.
So can you guys suggest me any option to solve this process.
If you feel that some points are NOT clear (WRONG) then please suggest with your thoughts, related article links.
Thanks in advance.
Devsql