<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to Implement Incremental Loading in Azure Databricks for ETL in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/119688#M45944</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I'm currently working on an ETL process using Azure Databricks (Standard Tier) where I load data from Azure SQL Database into Databricks. I run a notebook daily to extract, transform, and load the data for Power BI reports.&lt;/P&gt;&lt;P&gt;Right now, the notebook loads all data from the beginning every time it runs, which is inefficient and causes unnecessary processing time. I want to switch to incremental loading, so the job only fetches new or changed records since the last successful run.&lt;/P&gt;&lt;P&gt;My setup:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Source: Azure SQL Database&lt;/LI&gt;&lt;LI&gt;Target: Databricks Delta Table&lt;/LI&gt;&lt;LI&gt;Scheduler: Daily Databricks job&lt;/LI&gt;&lt;LI&gt;Purpose: Power BI dashboards using processed data&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What I'm looking for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A standard or recommended approach to implement incremental loading in Databricks&lt;/LI&gt;&lt;LI&gt;Best practices for tracking the last load timestamp (e.g., using a watermark)&lt;/LI&gt;&lt;LI&gt;Example code or a step-by-step tutorial&lt;/LI&gt;&lt;LI&gt;Any built-in Databricks utilities or patterns to support this on the Standard Tier&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you've set this up before or know of any good resources, I’d really appreciate your help!&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
    <pubDate>Tue, 20 May 2025 05:15:43 GMT</pubDate>
    <dc:creator>chexa_Wee</dc:creator>
    <dc:date>2025-05-20T05:15:43Z</dc:date>
    <item>
      <title>How to Implement Incremental Loading in Azure Databricks for ETL</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/119688#M45944</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I'm currently working on an ETL process using Azure Databricks (Standard Tier) where I load data from Azure SQL Database into Databricks. I run a notebook daily to extract, transform, and load the data for Power BI reports.&lt;/P&gt;&lt;P&gt;Right now, the notebook loads all data from the beginning every time it runs, which is inefficient and causes unnecessary processing time. I want to switch to incremental loading, so the job only fetches new or changed records since the last successful run.&lt;/P&gt;&lt;P&gt;My setup:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Source: Azure SQL Database&lt;/LI&gt;&lt;LI&gt;Target: Databricks Delta Table&lt;/LI&gt;&lt;LI&gt;Scheduler: Daily Databricks job&lt;/LI&gt;&lt;LI&gt;Purpose: Power BI dashboards using processed data&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;What I'm looking for:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;A standard or recommended approach to implement incremental loading in Databricks&lt;/LI&gt;&lt;LI&gt;Best practices for tracking the last load timestamp (e.g., using a watermark)&lt;/LI&gt;&lt;LI&gt;Example code or a step-by-step tutorial&lt;/LI&gt;&lt;LI&gt;Any built-in Databricks utilities or patterns to support this on the Standard Tier&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;If you've set this up before or know of any good resources, I’d really appreciate your help!&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 05:15:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/119688#M45944</guid>
      <dc:creator>chexa_Wee</dc:creator>
      <dc:date>2025-05-20T05:15:43Z</dc:date>
    </item>
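A common answer to the question above is a watermark-based pull: keep the timestamp of the last successful load in a small control table, read only newer rows from Azure SQL via JDBC, and MERGE them into the Delta target. The sketch below is a rough illustration, not an official Databricks utility; the names (`etl_control`, `dbo.Orders`, `ModifiedDate`, `main.silver.orders`, `jdbc_url`) are assumptions you would replace with your own.

```python
def build_incremental_query(source_table: str, watermark_col: str, last_watermark: str) -> str:
    """Build a JDBC pushdown subquery that selects only rows changed since the last run.

    Passing this as the `dbtable` option makes Azure SQL do the filtering,
    so only new/changed rows cross the wire.
    """
    return (
        f"(SELECT * FROM {source_table} "
        f"WHERE {watermark_col} > '{last_watermark}') AS src"
    )

# --- Databricks-only part (requires a cluster; names are illustrative) ---
# last_wm = spark.sql(
#     "SELECT MAX(watermark) FROM etl_control WHERE job = 'daily_load'"
# ).first()[0]
# df = (spark.read.format("jdbc")
#       .option("url", jdbc_url)  # your Azure SQL JDBC URL
#       .option("dbtable", build_incremental_query("dbo.Orders", "ModifiedDate", str(last_wm)))
#       .load())
#
# from delta.tables import DeltaTable
# tgt = DeltaTable.forName(spark, "main.silver.orders")
# (tgt.alias("t")
#     .merge(df.alias("s"), "t.OrderID = s.OrderID")  # upsert on the business key
#     .whenMatchedUpdateAll()
#     .whenNotMatchedInsertAll()
#     .execute())
#
# # Advance the watermark only after the MERGE succeeds.
# spark.sql("UPDATE etl_control SET watermark = current_timestamp() WHERE job = 'daily_load'")
```

Updating the watermark as the last step matters: if the job fails mid-run, the next run simply re-reads the same window, and the MERGE keeps the load idempotent.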
    <item>
      <title>Re: How to Implement Incremental Loading in Azure Databricks for ETL</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120024#M46030</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155027"&gt;@chexa_Wee&lt;/a&gt;, this was answered in a recent post:&amp;nbsp;&lt;A href="https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120020#M46027" target="_blank"&gt;https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120020#M46027&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 May 2025 05:24:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120024#M46030</guid>
      <dc:creator>nikhilj0421</dc:creator>
      <dc:date>2025-05-23T05:24:50Z</dc:date>
    </item>
    <item>
      <title>Re: How to Implement Incremental Loading in Azure Databricks for ETL</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120036#M46037</link>
      <description>&lt;P&gt;In case you do not want to use DLT (and there are reasons not to), you can also check the docs for &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/" target="_self"&gt;Auto Loader&lt;/A&gt; and &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/delta/merge" target="_self"&gt;MERGE&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;These two do essentially the same as DLT, but without the extra cost and with more control; you do have to write more code, though.&lt;BR /&gt;For ingesting the SQL Server data I would use Data Factory, which lands the data in your bronze layer (ADLS Gen2).&lt;BR /&gt;Alternatively, use the Azure SQL connector of Databricks, but that runs on DLT and is more expensive than ADF; it is easier to use, but gives you less control and visibility.&lt;BR /&gt;So you see, many choices.&lt;/P&gt;</description>
      <pubDate>Fri, 23 May 2025 08:02:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-implement-incremental-loading-in-azure-databricks-for-etl/m-p/120036#M46037</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-05-23T08:02:46Z</dc:date>
    </item>
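For the file-based route suggested in the reply above (ADF lands extracts in ADLS Gen2, Databricks picks up only new files), Auto Loader tracks ingested files for you. A minimal sketch, assuming parquet landing files; the paths (`abfss://bronze@...`) and table name are placeholders, and `cloudFiles.format` / `cloudFiles.schemaLocation` are real Auto Loader options.

```python
def autoloader_options(fmt: str, schema_path: str) -> dict:
    """Minimal Auto Loader reader options: input format plus a schema-tracking location."""
    return {
        "cloudFiles.format": fmt,
        "cloudFiles.schemaLocation": schema_path,
    }

# --- Databricks-only part (requires a cluster; paths are illustrative) ---
# opts = autoloader_options("parquet", "abfss://bronze@yourlake.dfs.core.windows.net/_schemas/orders")
# (spark.readStream.format("cloudFiles")
#     .options(**opts)
#     .load("abfss://bronze@yourlake.dfs.core.windows.net/orders/")
#     .writeStream
#     .option("checkpointLocation",
#             "abfss://bronze@yourlake.dfs.core.windows.net/_checkpoints/orders")
#     .trigger(availableNow=True)  # run as a batch-style daily job, then stop
#     .toTable("main.silver.orders_raw"))
```

With `trigger(availableNow=True)` the stream processes whatever files are new since the last checkpoint and exits, which fits the daily-job setup described in the question.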
  </channel>
</rss>

