Hi everyone,
I'm currently working on an ETL process using Azure Databricks (Standard Tier) where I load data from Azure SQL Database into Databricks. I run a notebook daily to extract, transform, and load the data for Power BI reports.
Right now, the notebook does a full reload — it re-extracts the entire source table every run — which is inefficient and wastes processing time. I want to switch to incremental loading, so each run only fetches records that are new or changed since the last successful run.
My setup:
- Source: Azure SQL Database
- Target: Databricks Delta Table
- Scheduler: Daily Databricks job
- Purpose: Power BI dashboards using processed data
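For context, my current notebook logic is roughly the sketch below (the table name `dbo.Orders`, `jdbc_url`, and the target table `bronze.orders` are simplified placeholders, not my real schema):

```python
def build_full_query(table: str) -> str:
    """Full extract -- this is what runs today, every single day."""
    return f"(SELECT * FROM {table}) AS src"

# In the daily notebook (sketch, placeholder names):
# df = (spark.read.format("jdbc")
#       .option("url", jdbc_url)  # Azure SQL connection string
#       .option("dbtable", build_full_query("dbo.Orders"))
#       .load())
# df.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")
```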
What I'm looking for:
- A standard or recommended approach to implement incremental loading in Databricks
- Best practices for tracking the last load timestamp (e.g., using a watermark)
- Example code or a step-by-step tutorial
- Any built-in Databricks utilities or patterns to support this on the Standard Tier
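To make the question concrete, here's roughly the watermark pattern I *think* I'm after — a pushdown query filtered on a modified-timestamp column, with the last watermark persisted between runs. All names here (`dbo.Orders`, `ModifiedDate`, the control table) are made-up placeholders, and I haven't tested the Spark part. Is this on the right track?

```python
from datetime import datetime

def build_incremental_query(table: str, watermark_col: str,
                            last_watermark: datetime) -> str:
    """Build a pushdown query that fetches only rows changed since the
    last successful run (watermark pattern). Assumes the source table
    has a reliably-updated timestamp column like ModifiedDate."""
    ts = last_watermark.strftime("%Y-%m-%d %H:%M:%S")
    return (f"(SELECT * FROM {table} "
            f"WHERE {watermark_col} > '{ts}') AS src")

# In the daily job (sketch, placeholder names):
# 1. Read the last successful watermark from a small control Delta table.
# last_wm = (spark.table("etl_control.watermarks")
#            .filter("source = 'dbo.Orders'")
#            .agg({"last_loaded": "max"}).collect()[0][0])
#
# 2. Pull only new/changed rows from Azure SQL via JDBC pushdown.
# df = (spark.read.format("jdbc")
#       .option("url", jdbc_url)
#       .option("dbtable", build_incremental_query(
#           "dbo.Orders", "ModifiedDate", last_wm))
#       .load())
#
# 3. Append (or MERGE, to handle updates) into the Delta target,
#    then advance the watermark only after the write succeeds.
# df.write.format("delta").mode("append").saveAsTable("bronze.orders")
```

I'm especially unsure whether appending is enough or whether I need a Delta `MERGE` to handle updated rows, and where best to store the watermark.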
If you've set this up before or know of any good resources, I’d really appreciate your help!
Thanks in advance!