Data Engineering

Why ETL Pipelines and Jobs

john77
New Contributor

I notice that ETL Pipelines let you run declarative SQL syntax, such as DLT table definitions, but you can do the same with Jobs if you use SQL as your task type. So why and when should I use ETL Pipelines?

5 REPLIES

K_Anudeep
Databricks Employee

Hello @john77,

Lakeflow ETL Pipelines give you a managed, declarative engine that understands your tables/flows and runs them with automatic dependency resolution, retries, and incremental semantics. Jobs are the general-purpose orchestrator: they can run SQL files (and many other task types), but they don't add the pipeline-smart behaviours by themselves.

Running SQL via a SQL task executes statements, but it doesn't add features like automatic DAG building from table dependencies, streaming, incremental MV refresh logic, AUTO CDC, or task-to-flow retry semantics. Those come from Lakeflow Declarative Pipelines.
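To make that concrete, here is a minimal Lakeflow Declarative Pipelines SQL sketch (the table names and source path are assumptions, not anything from this thread): the engine infers the bronze-to-silver dependency and the incremental semantics purely from the declarations.

-- Bronze: incremental ingestion with Auto Loader (hypothetical volume path)
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT *
FROM STREAM read_files('/Volumes/main/raw/orders/', format => 'json');

-- Silver: the reference to orders_bronze is enough for the engine to build and order the DAG
CREATE OR REFRESH MATERIALIZED VIEW orders_daily
AS SELECT order_date, SUM(amount) AS total_amount
FROM orders_bronze
GROUP BY order_date;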

Use ETL Pipelines when you want:

  • Automatic orchestration of a data DAG: Pipelines analyse dependencies between flows/tables and run them in the right order with maximum parallelism, with no hand-built task graphs. (Databricks documentation)

  • Declarative, incremental processing: Write simple SQL/Python; the engine does incremental MV refreshes and streaming ingestion without you coding watermarking/checkpointing logic. (Databricks documentation)

  • Native CDC & SCD: The AUTO CDC flow handles out-of-order events and supports SCD1/SCD2 with a few lines of code; see the sketch after this list. (Databricks documentation)

  • SQL-first ETL inside or outside pipelines: You can define streaming tables/materialised views directly in Databricks SQL.
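As a rough sketch of the CDC point above, using the APPLY CHANGES INTO syntax (the form the newer AUTO CDC flow supersedes); the table and column names are illustrative:

-- Target streaming table that will hold the SCD2 history
CREATE OR REFRESH STREAMING TABLE customers_scd2;

-- Out-of-order events are resolved via SEQUENCE BY; full history is kept as SCD TYPE 2
APPLY CHANGES INTO customers_scd2
FROM STREAM(customers_cdc_feed)
KEYS (customer_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_ts
STORED AS SCD TYPE 2;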

Use Jobs when you want:

  • General workflow orchestration across any task type: notebooks, Python scripts, dbt, SQL files, REST calls, and even "pipeline" tasks. It's the scheduler/automation layer for diverse workloads. (Databricks documentation)

  • Running plain SQL files (queries, dashboards, alerts) against a SQL warehouse, with Git-versioned assets; useful for reports or one-off DDL/DML that doesn't need pipeline semantics (see the sketch after this list).
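For contrast, the kind of plain SQL file a Jobs SQL file task might run against a SQL warehouse (catalog/schema/table names are made up): there are no pipeline semantics here, the statements simply execute in order.

-- One-off reporting DDL/DML executed top to bottom, with no dependency analysis
CREATE TABLE IF NOT EXISTS main.reporting.daily_revenue (
  order_date DATE,
  total_amount DECIMAL(18, 2)
);

INSERT OVERWRITE main.reporting.daily_revenue
SELECT order_date, SUM(amount)
FROM main.sales.orders
GROUP BY order_date;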

 

Please let me know if you have any further questions.

 

Sounds fair, but that said, I don't think it prevents us from putting DLT code in a SQL task in a Job. Once the job (which has a SQL task with a DLT definition) is evaluated/dry-run/run, I immediately see an automatically created pipeline in the UI. (Streaming tables created in Databricks SQL show up with a type of MV/ST.)

So don't I get the benefits you outlined in this case too, but without the effort of creating an ETL pipeline? Which benefits am I missing?
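(For concreteness, roughly the kind of statement being described here; the table name and path are just placeholders. Running it as a SQL task, or in Databricks SQL, provisions an implicit pipeline behind the table.)

-- A streaming table defined outside an explicit ETL pipeline; an implicit pipeline manages its refresh
CREATE OR REFRESH STREAMING TABLE raw_events
AS SELECT *
FROM STREAM read_files('/Volumes/main/landing/events/', format => 'json');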

K_Anudeep
Databricks Employee

Hi @john77,

When you have a SQL task creating an ST/MV, it works fine for a few independent tables. You do get incremental refresh, retries, and an auto-created (implicit) pipeline per object.
But what you miss is that it is harder to mix SQL + Python or to add things like AUTO CDC and expectations across the whole pipeline (see the sketch below).

There is also no single DAG that guarantees bronze -> silver -> gold (medallion) ordering as one unit; each object has its own run/metrics, and there is no single "pipeline run" view.

All of the above are only possible with an ETL pipeline.
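As an illustration of the expectations point (names are made up), this is the kind of data-quality constraint you declare as part of a pipeline definition, with failed rows tracked in the pipeline's quality metrics:

-- Rows failing the expectation are dropped and counted in the pipeline's event log / quality metrics
CREATE OR REFRESH STREAMING TABLE orders_silver (
  CONSTRAINT valid_order EXPECT (order_id IS NOT NULL AND amount >= 0) ON VIOLATION DROP ROW
)
AS SELECT *
FROM STREAM(orders_bronze);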

-werners-
Esteemed Contributor III

Basically, pipelines can be seen as a replacement for notebooks. You define all your logic in DLT functions and call these.
You get a nice lineage view, etc. But it comes at a price: less flexibility, and you are pushed into a rather strict way of working.

It is definitely not for everybody, especially if you have already been using Databricks for years.

Perhaps Databricks should write an actual technical blog post, without the marketing, about what is really possible and what is not.

saurabh18cs
Honored Contributor II

Hi @john77,

SQL task type: simple, one-off SQL operations or batch jobs, or when you need to orchestrate a mix of notebooks, Python/Scala code, and SQL in a single workflow.

Lakeflow Declarative Pipelines: complex, production ETL jobs that require lineage, monitoring, event logs, data quality rules, CDC, and incremental processing, with execution plans optimized automatically.
