<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Why ETL Pipelines and Jobs in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133689#M49901</link>
    <description>&lt;P&gt;Basically the pipelines can be seen as a replacement for notebooks.&amp;nbsp; You define all your logic in DLT functions and call these.&lt;BR /&gt;You get a nice lineage view etc., but it comes at a price: less flexibility, and you are pushed into a rather strict way of working.&lt;/P&gt;&lt;P&gt;It is definitely not for everybody, especially if you have already been using Databricks for years.&lt;/P&gt;&lt;P&gt;Perhaps Databricks should write an actual technical blog, without marketing, about what is really possible and what is not.&lt;/P&gt;</description>
    <pubDate>Fri, 03 Oct 2025 12:51:28 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2025-10-03T12:51:28Z</dc:date>
    <item>
      <title>Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133410#M49836</link>
      <description>&lt;P&gt;I notice that ETL Pipelines let you run declarative SQL syntax such as DLT tables, but you can do the same with Jobs if you use SQL as your task type. So why and when should I use ETL Pipelines?&lt;/P&gt;</description>
      <pubDate>Tue, 30 Sep 2025 22:15:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133410#M49836</guid>
      <dc:creator>john77</dc:creator>
      <dc:date>2025-09-30T22:15:10Z</dc:date>
    </item>
    <item>
      <title>Re: Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133418#M49838</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/187826"&gt;@john77&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;Lakeflow ETL Pipelines give you a managed, declarative engine that understands your tables/flows and runs them with automatic dependency resolution, retries, and incremental semantics. Jobs are the general-purpose orchestrator: they can run SQL files (and many other task types), but they don’t add the pipeline-smart behaviours by themselves.&lt;/P&gt;
&lt;P&gt;Running SQL via a &lt;STRONG&gt;SQL task&lt;/STRONG&gt; executes statements, but &lt;STRONG&gt;doesn’t add&lt;/STRONG&gt; features like automatic DAG building from table dependencies, streaming, incremental MV refresh logic, AUTO CDC, and task→flow retry semantics. Those &lt;EM&gt;come from&lt;/EM&gt; Lakeflow Declarative Pipelines.&lt;/P&gt;
&lt;H3 data-start="432" data-end="472"&gt;Use &lt;STRONG data-start="440" data-end="457"&gt;ETL Pipelines&lt;/STRONG&gt; when you want&lt;/H3&gt;
&lt;UL data-start="473" data-end="1603"&gt;
&lt;LI data-start="473" data-end="773"&gt;
&lt;P data-start="475" data-end="773"&gt;&lt;STRONG data-start="475" data-end="516"&gt;Automatic orchestration of a data DAG&lt;/STRONG&gt;: Pipelines analyse dependencies between flows/tables and run them in the right order with max parallelism—no hand-built task graphs. &lt;SPAN class="" data-state="closed"&gt;&lt;SPAN class="ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]" data-testid="webpage-citation-pill"&gt;&lt;A class="flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!" href="https://docs.databricks.com/aws/en/dlt/concepts" target="_blank" rel="noopener"&gt;&lt;SPAN class="relative start-0 bottom-0 flex h-full w-full items-center"&gt;&lt;SPAN class="flex h-4 w-full items-center justify-between overflow-hidden"&gt;&lt;SPAN class="max-w-[15ch] grow truncate overflow-hidden text-center"&gt;Databricks Documentation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI data-start="774" data-end="999"&gt;
&lt;P data-start="776" data-end="999"&gt;&lt;STRONG data-start="776" data-end="815"&gt;Declarative, incremental processing&lt;/STRONG&gt;: Write simple SQL/Python; the engine does incremental MV refreshes and streaming ingestion without you coding watermarking/checkpointing logic. &lt;SPAN class="" data-state="closed"&gt;&lt;SPAN class="ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]" data-testid="webpage-citation-pill"&gt;&lt;A class="flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!" href="https://docs.databricks.com/aws/en/dlt/concepts" target="_blank" rel="noopener"&gt;&lt;SPAN class="relative start-0 bottom-0 flex h-full w-full items-center"&gt;&lt;SPAN class="flex h-4 w-full items-center justify-between overflow-hidden"&gt;&lt;SPAN class="max-w-[15ch] grow truncate overflow-hidden text-center"&gt;Databricks Documentation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI data-start="1000" data-end="1162"&gt;
&lt;P data-start="1002" data-end="1162"&gt;&lt;STRONG data-start="1002" data-end="1022"&gt;Native CDC &amp;amp; SCD&lt;/STRONG&gt;: The &lt;STRONG data-start="1028" data-end="1040"&gt;AUTO CDC&lt;/STRONG&gt; flow handles out-of-order events and supports SCD1/SCD2 with a few lines of code. &lt;SPAN class="" data-state="closed"&gt;&lt;SPAN class="ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]" data-testid="webpage-citation-pill"&gt;&lt;A class="flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!" href="https://docs.databricks.com/aws/en/dlt/concepts" target="_blank" rel="noopener"&gt;&lt;SPAN class="relative start-0 bottom-0 flex h-full w-full items-center"&gt;&lt;SPAN class="flex h-4 w-full items-center justify-between overflow-hidden"&gt;&lt;SPAN class="max-w-[15ch] grow truncate overflow-hidden text-center"&gt;Databricks Documentation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI data-start="1000" data-end="1162"&gt;&lt;SPAN class="" data-state="closed"&gt;&lt;SPAN class="ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]" data-testid="webpage-citation-pill"&gt;&lt;SPAN class="relative start-0 bottom-0 flex h-full w-full items-center"&gt;&lt;SPAN class="flex h-4 w-full items-center justify-between overflow-hidden"&gt;&lt;SPAN class="max-w-[15ch] grow truncate overflow-hidden text-center"&gt;&lt;STRONG data-start="1371" data-end="1416"&gt;SQL-first ETL inside or outside pipelines&lt;/STRONG&gt;: You can define streaming tables/materialised views directly in Databricks SQL&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/UL&gt;
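&lt;P&gt;As a rough sketch (all table, column, and path names here are made up), a declarative pipeline in SQL might look like this; the engine infers the bronze-to-silver ordering from the table reference, so there is no hand-built task graph:&lt;/P&gt;
&lt;PRE&gt;-- Bronze: incremental file ingestion via Auto Loader
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM STREAM read_files('/Volumes/demo/raw/orders', format =&gt; 'json');

-- Silver: the engine sees the dependency on bronze_orders and schedules it after bronze;
-- the expectation drops rows that fail the data quality rule
CREATE OR REFRESH STREAMING TABLE silver_orders (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT order_id, CAST(amount AS DECIMAL(10,2)) AS amount
FROM STREAM bronze_orders;&lt;/PRE&gt;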
&lt;H3 data-start="1605" data-end="1636"&gt;Use &lt;STRONG data-start="1613" data-end="1621"&gt;Jobs&lt;/STRONG&gt; when you want…&lt;/H3&gt;
&lt;UL data-start="1637" data-end="2274"&gt;
&lt;LI data-start="1637" data-end="1882"&gt;
&lt;P data-start="1639" data-end="1882"&gt;&lt;STRONG data-start="1639" data-end="1673"&gt;General workflow orchestration&lt;/STRONG&gt; across &lt;EM data-start="1681" data-end="1686"&gt;any&lt;/EM&gt; task type: notebooks, Python scripts, dbt, SQL files, REST calls, &lt;STRONG data-start="1753" data-end="1760"&gt;and&lt;/STRONG&gt; even “pipeline” tasks. It’s the scheduler/automation layer for diverse workloads. &lt;SPAN class="" data-state="closed"&gt;&lt;SPAN class="ms-1 inline-flex max-w-full items-center relative top-[-0.094rem] animate-[show_150ms_ease-in]" data-testid="webpage-citation-pill"&gt;&lt;A class="flex h-4.5 overflow-hidden rounded-xl px-2 text-[9px] font-medium transition-colors duration-150 ease-in-out text-token-text-secondary! bg-[#F4F4F4]! dark:bg-[#303030]!" href="https://docs.databricks.com/aws/en/jobs/" target="_blank" rel="noopener"&gt;&lt;SPAN class="relative start-0 bottom-0 flex h-full w-full items-center"&gt;&lt;SPAN class="flex h-4 w-full items-center justify-between overflow-hidden"&gt;&lt;SPAN class="max-w-[15ch] grow truncate overflow-hidden text-center"&gt;Databricks Documentation&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/A&gt;&lt;/SPAN&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI data-start="2054" data-end="2274"&gt;
&lt;P data-start="2056" data-end="2274"&gt;&lt;STRONG data-start="2056" data-end="2079"&gt;Run plain SQL files&lt;/STRONG&gt; (queries, dashboards, alerts) against a SQL warehouse, with Git-versioned assets—useful for reports or one-off DDL/DML that doesn’t need pipeline semantics.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
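&lt;P&gt;For the second bullet, a minimal sketch of what a Jobs SQL task might run against a SQL warehouse (schema and table names are hypothetical); note this is plain DDL/DML with no pipeline semantics:&lt;/P&gt;
&lt;PRE&gt;-- One-off reporting SQL, versioned in Git and scheduled by a Job
CREATE TABLE IF NOT EXISTS reporting.daily_summary AS
SELECT order_date, COUNT(*) AS orders, SUM(amount) AS revenue
FROM sales.orders
GROUP BY order_date;&lt;/PRE&gt;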
&lt;P&gt;Please let me know if you have any further questions.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Oct 2025 08:53:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133418#M49838</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-10-01T08:53:45Z</dc:date>
    </item>
    <item>
      <title>Re: Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133423#M49841</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/187826"&gt;@john77&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;SQL task type: simple, one-off SQL operations&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;or batch jobs, or when you need to orchestrate a mix of notebooks, Python/Scala code, and SQL in a single workflow.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Lakeflow Declarative Pipelines: complex, production ETL jobs that require lineage,&lt;/STRONG&gt; monitoring, event logs, data quality rules, CDC, incremental processing, and automatic plan optimisation.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Oct 2025 09:28:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133423#M49841</guid>
      <dc:creator>saurabh18cs</dc:creator>
      <dc:date>2025-10-01T09:28:10Z</dc:date>
    </item>
    <item>
      <title>Re: Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133594#M49889</link>
      <description>&lt;P&gt;Sounds fair, but that being said, I do not think it stops us from writing DLT code as a SQL task in a Job. Once the job is evaluated/dry-run/run (with a SQL task containing a DLT definition), I immediately see an automatically created pipeline in the UI. (Streaming tables&amp;nbsp;created in&amp;nbsp;Databricks SQL&amp;nbsp;have a type of&amp;nbsp;MV/ST.)&lt;BR /&gt;&lt;BR /&gt;So don't I get the benefits you outlined in this case too, but without the effort of creating an ETL pipeline? Which benefits do I miss?&lt;/P&gt;</description>
      <pubDate>Fri, 03 Oct 2025 05:03:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133594#M49889</guid>
      <dc:creator>john77</dc:creator>
      <dc:date>2025-10-03T05:03:46Z</dc:date>
    </item>
    <item>
      <title>Re: Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133620#M49893</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/187826"&gt;@john77&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;When a &lt;STRONG&gt;SQL task creates an ST/MV&lt;/STRONG&gt;, it works fine for a few independent tables. You &lt;EM&gt;do&lt;/EM&gt; get incremental refresh, retries, and an auto-created (implicit) pipeline per object.&lt;BR /&gt;What you miss: it is harder to mix SQL + Python, or to add things like AUTO CDC and expectations across the &lt;EM&gt;whole&lt;/EM&gt; pipeline.&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;There is no single DAG that guarantees bronze → silver → gold (medallion) ordering as &lt;/SPAN&gt;&lt;EM&gt;one&lt;/EM&gt;&lt;SPAN&gt; unit; each object has its own run/metrics, and there is no single “pipeline run” view.&lt;/SPAN&gt;&lt;/P&gt;
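&lt;P&gt;For illustration (a rough sketch; the source/target names and columns are invented), the AUTO CDC flow is declared inside a pipeline roughly like this, which is not expressible as a plain SQL task:&lt;/P&gt;
&lt;PRE&gt;-- Target streaming table, kept as SCD Type 2 history
CREATE OR REFRESH STREAMING TABLE customers;

-- AUTO CDC handles out-of-order events by ordering on SEQUENCE BY
CREATE FLOW customers_cdc AS AUTO CDC INTO customers
FROM STREAM(raw.customers_updates)
KEYS (customer_id)
SEQUENCE BY event_ts
STORED AS SCD TYPE 2;&lt;/PRE&gt;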
&lt;P&gt;&lt;SPAN&gt;All of the above are only possible with an ETL pipeline.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 03 Oct 2025 06:38:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133620#M49893</guid>
      <dc:creator>K_Anudeep</dc:creator>
      <dc:date>2025-10-03T06:38:23Z</dc:date>
    </item>
    <item>
      <title>Re: Why ETL Pipelines and Jobs</title>
      <link>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133689#M49901</link>
      <description>&lt;P&gt;Basically the pipelines can be seen as a replacement for notebooks.&amp;nbsp; You define all your logic in DLT functions and call these.&lt;BR /&gt;You get a nice lineage view etc., but it comes at a price: less flexibility, and you are pushed into a rather strict way of working.&lt;/P&gt;&lt;P&gt;It is definitely not for everybody, especially if you have already been using Databricks for years.&lt;/P&gt;&lt;P&gt;Perhaps Databricks should write an actual technical blog, without marketing, about what is really possible and what is not.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Oct 2025 12:51:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-etl-pipelines-and-jobs/m-p/133689#M49901</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2025-10-03T12:51:28Z</dc:date>
    </item>
  </channel>
</rss>

