cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to capture/Identify Incremental rows when No primary key columns in tables

Govind3331
New Contributor

Q1. My source is SQL server tables, I want to identify only latest records(incremental rows) and load those into BRNZE layer. Instead of full load to ADLS, we want to capture only incremental rows and load into ADLS for further processing. 

NOTE: Problem here is table has NO PRIMARY KEY column or DATE columns for identification. Could someone please help on this?

2 REPLIES 2

Slaw
New Contributor II

Hi, 

what kind of SQL source is it? MS SQL, MySQL, PostgreSQL?

Kaniz
Community Manager
Community Manager

Hi @Govind3331, Handling incremental data without a primary key or date column can be challenging, but there are some strategies you can consider. 

 

Row Number Approach:

  • You can use the ROW_NUMBER() function to assign a unique number to each row based on a specific order (e.g., timestamp or any other column). Then, select the rows with the highest row number to get the latest records.
  • Here’s an example query:WITH CTE AS (    SELECT *,           ROW_NUMBER() OVER (ORDER BY YourTimestampColumn DESC) AS rn    FROM YourTableName ) SELECT * FROM CTE WHERE rn = 1;
  • Replace YourTimestampColumn with the appropriate column name that represents the order of data arrival (e.g., timestamp, sequence number, etc.).

Comparison with Previous Load:

  • If you have a previous load of data in your BRONZE layer, you can compare it with the new data.
  • For each record in the new data, check if it already exists in the BRONZE layer. If not, it’s an incremental record.
  • This approach requires maintaining a copy of the previous load.

Hashing Approach:

  • Calculate a hash value for each row based on the column values (excluding any unique identifiers).
  • Store the hash values for the existing records in the BRONZE layer.
  • When new data arrives, calculate the hash for each row and compare it with the stored hashes. If it’s a new hash, it’s an incremental record.
  • Note that this approach may have performance implications.

Change Data Capture (CDC):

  • CDC is a feature in SQL Server that captures changes made to tables.
  • Even without a primary key, you can enable CDC on the table and track changes (inserts, updates, deletes).
  • The captured changes can be used to identify incremental records.
Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.