<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Incrementally load SQL Server table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36918#M26218</link>
    <description>&lt;P&gt;As I said, there is no unique identifier in the table that would allow me to do any sort of Join between my source table and my bronze table.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 04 Jul 2023 09:14:14 GMT</pubDate>
    <dc:creator>erigaud</dc:creator>
    <dc:date>2023-07-04T09:14:14Z</dc:date>
    <item>
      <title>Incrementally load SQL Server table</title>
      <link>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36897#M26209</link>
      <description>&lt;P&gt;I am accessing an on premise SQL Server table. The table is relatively small (10 000 rows), and I access it using&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;STRONG&gt;spark.read.jdbc(url=jdbcUrl, table = query)&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;Every day there are new records in the on prem table that I would like to append in my bronze table on the lakehouse. However there are no "InsertedOn" column or anything, and there are no obvious keys in the data that I could use to MERGE to my bronze table. So currently I am overwriting all the data every day, which does not seem like a good approach.&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Is there a better way to incrementally load the data from SQL server ? Perhaps something using the Streaming Structure ?&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Thank you !&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 04 Jul 2023 07:01:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36897#M26209</guid>
      <dc:creator>erigaud</dc:creator>
      <dc:date>2023-07-04T07:01:50Z</dc:date>
    </item>
    <item>
      <title>Re: Incrementally load SQL Server table</title>
      <link>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36911#M26215</link>
      <description>&lt;P&gt;If there are no explicit "InsertedOn" column or other obvious keys to use for incremental loading, you can still achieve incremental loading from the on-premises SQL Server table to your Bronze table in the lakehouse by using a combination of Spark and some additional logic. While streaming might not be necessary in this case, you can still use Spark's capabilities to manage incremental loading efficiently. Here's a suggested approach:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P&gt;Load the Bronze table into a DataFrame: Before performing incremental loading, load the existing data from your Bronze table into a DataFrame.&lt;/P&gt;
&lt;PRE&gt;bronze_df = spark.read.format("delta").load("path_to_bronze_table")&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;2.&amp;nbsp;Load the entire source table from SQL Server into another DataFrame:&lt;/P&gt;
&lt;PRE&gt;source_df = spark.read.jdbc(url=jdbcUrl, table=query)&lt;/PRE&gt;
&lt;DIV class=""&gt;
&lt;DIV class=""&gt;&lt;SPAN&gt;&lt;SPAN&gt;&amp;nbsp;3.&amp;nbsp;Identify new records: Use DataFrame operations like &lt;CODE&gt;exceptAll or &lt;CODE&gt;anti-join to identify the new records in the source DataFrame compared to the existing data in the Bronze table.&lt;/CODE&gt;&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;DIV class=""&gt;
&lt;PRE&gt;from pyspark.sql import functions as F&lt;BR /&gt;&lt;BR /&gt;# Assuming there's a unique identifier column 'id', you can use it for comparison&lt;BR /&gt;new_records_df = source_df.join(bronze_df, "id", "left_anti")&lt;/PRE&gt;
&lt;DIV class=""&gt;&amp;nbsp;4.&amp;nbsp;&lt;SPAN&gt;&lt;SPAN&gt;Append new records to the Bronze table: Once you have the DataFrame containing new records, you can append it to the Bronze table.&lt;/SPAN&gt;&lt;/SPAN&gt;
&lt;DIV class=""&gt;&amp;nbsp;
&lt;PRE class=""&gt;&lt;SPAN&gt;new_records_df.write.format("delta").mode("append").save("path_to_bronze_table"&lt;BR /&gt;&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;SPAN&gt;&lt;CODE&gt;&lt;CODE&gt;&lt;/CODE&gt;&lt;/CODE&gt;&lt;/SPAN&gt;&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Tue, 04 Jul 2023 08:54:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36911#M26215</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-07-04T08:54:06Z</dc:date>
    </item>
    <item>
      <title>Re: Incrementally load SQL Server table</title>
      <link>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36918#M26218</link>
      <description>&lt;P&gt;As I said, there is no unique identifier in the table that would allow me to do any sort of Join between my source table and my bronze table.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 04 Jul 2023 09:14:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/incrementally-load-sql-server-table/m-p/36918#M26218</guid>
      <dc:creator>erigaud</dc:creator>
      <dc:date>2023-07-04T09:14:14Z</dc:date>
    </item>
  </channel>
</rss>

