<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to speed-up Azure Databricks processing in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71170#M34264</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97998"&gt;@raphaelblg&lt;/a&gt;, could you shed some light on this issue?&lt;/P&gt;</description>
    <pubDate>Fri, 31 May 2024 09:10:47 GMT</pubDate>
    <dc:creator>Devsql</dc:creator>
    <dc:date>2024-05-31T09:10:47Z</dc:date>
    <item>
      <title>How to speed-up Azure Databricks processing</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71135#M34254</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;My team has designed an Azure Databricks solution, and we are looking for a way to speed up its processing.&lt;/P&gt;&lt;P&gt;Below are the details of the project:&lt;/P&gt;&lt;P&gt;1- Data is copied from SAP to an ADLS Gen2 based external location.&lt;/P&gt;&lt;P&gt;2- The project follows the medallion architecture, i.e. we have Bronze/Silver/Gold layer tables.&lt;/P&gt;&lt;P&gt;3- A single database and a single schema hold all of the Bronze/Silver/Gold layer tables.&lt;/P&gt;&lt;P&gt;4- Data (parquet files) is ingested into the Bronze layer via an Auto Loader process. This process runs in continuous mode.&lt;/P&gt;&lt;P&gt;5- Bronze-layer tables are named like MyTable_1_raw and MyTable_1_append_raw.&lt;/P&gt;&lt;P&gt;6- From the Bronze-layer tables (MyTable_1_raw JOIN MyTable_2_raw JOIN MyTable_3_raw), we populate the Silver-layer tables, i.e. Silver_MyTable.&lt;/P&gt;&lt;P&gt;7- As we don't have any major transformations, the Gold-layer tables (Gold_MyTable) are replicas of the Silver-layer tables.&lt;/P&gt;&lt;P&gt;8- All of these Silver/Gold tables are Delta Live Tables.&lt;/P&gt;&lt;P&gt;9- Every hour, a job runs the Delta Live Tables pipeline, which joins several RAW tables and then populates the Silver and Gold tables.&lt;/P&gt;&lt;P&gt;What we and the client found is that populating the Silver-layer table (Silver_MyTable) takes too much time: almost 6 minutes to populate 6 million records.&lt;/P&gt;&lt;P&gt;The client says that we are NOT using a delta approach (i.e. Change Data Capture).&lt;/P&gt;&lt;P&gt;The client says that when the job runs every hour, there are only 5,000 or 6,000 records to insert/update. But according to the Delta Live Tables job's screenshots, the pipeline reports Written-Record = 6 million records, which is NOT the true number of changed records.&lt;/P&gt;&lt;P&gt;It means the client expects us to use something like the logic below:&lt;/P&gt;&lt;P&gt;CREATE OR &lt;STRIKE&gt;REPLACE&lt;/STRIKE&gt; REFRESH LIVE TABLE SILVER_MyTable1&lt;BR /&gt;AS&lt;BR /&gt;SELECT *&lt;BR /&gt;FROM MyTable_1_raw AS Table_1&lt;BR /&gt;JOIN MyTable_2_raw AS Table_2&lt;BR /&gt;ON Table_1.Key = Table_2.Key&lt;BR /&gt;WHERE Table_1.Invoice_Create_Date &amp;gt;= current_date()&lt;BR /&gt;OR Table_1.Invoice_Update_Date &amp;gt;= current_date()&lt;/P&gt;&lt;P&gt;This would pick up only the changed/updated records (5,000 or 6,000 records), i.e. the delta approach.&lt;/P&gt;&lt;P&gt;But my teammates say that since the above table (SILVER_MyTable1) is a Delta Live Table, executing the above code will DROP all records from the table and insert only the new records.&lt;/P&gt;&lt;P&gt;So let's say on Monday at 9 AM we loaded SILVER_MyTable1 with 10,000 records. At 12 noon, when the process tries to insert 100 new records, the above code will first wipe out the 10,000 records and then insert the 100 records. So eventually the Gold-layer table (Gold_MyTable) will also have only 100 records. But the client doesn't want that; the client wants all 10,000 + 100 records in the Gold-layer table (Gold_MyTable).&lt;/P&gt;&lt;P&gt;I hope that gives you an idea of my issue.&lt;/P&gt;&lt;P&gt;Can you suggest any option to solve this?&lt;/P&gt;&lt;P&gt;If you feel that some points are unclear or wrong, please share your thoughts and any related article links.&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;BR /&gt;Devsql&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 09:03:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71135#M34254</guid>
      <dc:creator>Devsql</dc:creator>
      <dc:date>2024-05-31T09:03:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to speed-up Azure Databricks processing</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71158#M34259</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/104457"&gt;@Devsql&lt;/a&gt;&lt;BR /&gt;Could you share the code snippet that you're using to ingest the data into the silver layer?&lt;BR /&gt;If you're doing&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;CREATE OR REPLACE STREAMING TABLE&lt;/LI-CODE&gt;&lt;P&gt;then you should switch to&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;CREATE OR REFRESH STREAMING TABLE&lt;/LI-CODE&gt;&lt;P&gt;to ingest only incremental data.&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 08:07:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71158#M34259</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2024-05-31T08:07:53Z</dc:date>
    </item>
    <item>
      <title>Re: How to speed-up Azure Databricks processing</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71169#M34263</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/79106"&gt;@daniel_sahal&lt;/a&gt;, thank you for the quick reply.&lt;/P&gt;&lt;P&gt;Below is the statement used to populate the Silver-layer tables:&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;CREATE&lt;/SPAN&gt; &lt;SPAN&gt;OR&lt;/SPAN&gt;&lt;SPAN&gt; REFRESH LIVE &lt;/SPAN&gt;&lt;SPAN&gt;TABLE&lt;/SPAN&gt;&lt;SPAN&gt; Silver_&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;So Daniel, at this point I would like to clear up one doubt:&lt;/P&gt;&lt;P&gt;Will the above statement TRUNCATE the whole table and re-insert the records into the &lt;STRONG&gt;Silver_&lt;/STRONG&gt; table?&lt;/P&gt;&lt;P&gt;OR&lt;/P&gt;&lt;P&gt;Will the above statement do an UPSERT into the &lt;STRONG&gt;Silver_&lt;/STRONG&gt; table?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 09:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71169#M34263</guid>
      <dc:creator>Devsql</dc:creator>
      <dc:date>2024-05-31T09:06:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to speed-up Azure Databricks processing</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71170#M34264</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97998"&gt;@raphaelblg&lt;/a&gt;, could you shed some light on this issue?&lt;/P&gt;</description>
      <pubDate>Fri, 31 May 2024 09:10:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-speed-up-azure-databricks-processing/m-p/71170#M34264</guid>
      <dc:creator>Devsql</dc:creator>
      <dc:date>2024-05-31T09:10:47Z</dc:date>
    </item>
  </channel>
</rss>

