Tuesday
I notice that ETL Pipelines let you run declarative SQL syntax, such as DLT table definitions, but you can do the same with Jobs if you use SQL as the task type. So why and when should I use ETL Pipelines?
Wednesday - last edited Wednesday
Hello @john77 ,
Lakeflow ETL Pipelines give you a managed, declarative engine that understands your tables/flows and runs them with automatic dependency resolution, retries, and incremental semantics. Jobs are the general-purpose orchestrator: they can run SQL files (and many other task types), but they don't add the pipeline-smart behaviours by themselves.
Running SQL via a SQL task executes the statements, but doesn't add features like automatic DAG building from table dependencies, streaming, incremental MV refresh logic, AUTO CDC, or task/flow retry semantics. Those come from Lakeflow Declarative Pipelines.
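For example, a minimal declarative pipeline source might look roughly like the sketch below; the catalog, table, and path names are placeholders, and the exact ingestion syntax (read_files vs. cloud_files, the LIVE. prefix) can vary by runtime version:

```sql
-- Bronze: incrementally ingest raw files (placeholder volume path).
CREATE OR REFRESH STREAMING TABLE bronze_orders
AS SELECT * FROM STREAM read_files('/Volumes/main/raw/orders/', format => 'json');

-- Silver: the engine sees this reads bronze_orders and schedules it afterwards;
-- older runtimes may require FROM LIVE.bronze_orders here.
CREATE OR REFRESH MATERIALIZED VIEW silver_orders
AS SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount
FROM bronze_orders
WHERE order_id IS NOT NULL;
```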
Automatic orchestration of a data DAG: Pipelines analyse dependencies between flows/tables and run them in the right order with maximum parallelism, with no hand-built task graphs.
Declarative, incremental processing: Write simple SQL/Python; the engine does incremental MV refreshes and streaming ingestion without you coding watermarking/checkpointing logic.
Native CDC & SCD: The AUTO CDC flow handles out-of-order events and supports SCD1/SCD2 with a few lines of code (a sketch follows after this list).
Jobs, on the other hand, give you general workflow orchestration across any task type: notebooks, Python scripts, dbt, SQL files, REST calls, and even "pipeline" tasks. They are the scheduler/automation layer for diverse workloads.
They can also run plain SQL files (queries, dashboards, alerts) against a SQL warehouse, with Git-versioned assets; useful for reports or one-off DDL/DML that doesn't need pipeline semantics.
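To illustrate the AUTO CDC point above, a rough sketch (all table/column names are made up; older runtimes use the equivalent APPLY CHANGES INTO syntax, so treat the exact keywords as version-dependent):

```sql
-- Target table maintained as SCD Type 2 from a change feed.
CREATE OR REFRESH STREAMING TABLE silver_customers;

CREATE FLOW customers_cdc AS AUTO CDC INTO silver_customers
FROM STREAM(bronze_customers)
KEYS (customer_id)
SEQUENCE BY event_ts       -- resolves late / out-of-order change events
STORED AS SCD TYPE 2;
```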
Please let me know if you have any further questions.
yesterday - last edited yesterday
Sounds fair, but that said, I don't think it stops us from putting DLT code in a SQL task in a Job. Once the job (which has a SQL task with the DLT definition) is evaluated/dry-run/run, I immediately see an automatically created pipeline in the UI. (Streaming tables created in Databricks SQL have a type of MV/ST.)
So don't I get the benefits you outlined in this case too, without the effort of creating an ETL pipeline? Which benefits do I miss?
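For illustration, this is roughly the kind of statement I put in the SQL task (names and path are placeholders); when it runs against a SQL warehouse, Databricks creates the streaming table and an implicit pipeline to maintain it:

```sql
-- Plain SQL task against a SQL warehouse; an implicit pipeline is created behind the scenes.
CREATE OR REFRESH STREAMING TABLE main.sales.raw_orders
AS SELECT * FROM STREAM read_files('/Volumes/main/landing/orders/', format => 'csv');
```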
yesterday
Hi @john77 ,
When you have a SQL task creating an ST/MV, it works fine for a few independent tables. You do get incremental refresh, retries, and an auto-created (implicit) pipeline per object.
But here is what you miss: it is harder to mix SQL + Python or to add things like AUTO CDC and expectations across the whole pipeline.
There is no single DAG that guarantees bronze → silver → gold (medallion) ordering as one unit; each object has its own run/metrics, and there is no single "pipeline run" view.
All of the above are only possible with an ETL pipeline (see the sketch below).
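A rough sketch of what that buys you inside a single pipeline source (all names are placeholders; older runtimes may need the LIVE. prefix for intra-pipeline references): the three layers refresh as one DAG in one pipeline run, and expectations are enforced by the engine.

```sql
-- One pipeline source = one DAG = one "pipeline run" across all layers.
CREATE OR REFRESH STREAMING TABLE bronze_events
AS SELECT * FROM STREAM read_files('/Volumes/main/raw/events/', format => 'json');

-- Expectation (data quality rule) enforced by the pipeline engine.
CREATE OR REFRESH MATERIALIZED VIEW silver_events (
  CONSTRAINT valid_user EXPECT (user_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT user_id, event_type, event_ts FROM bronze_events;

CREATE OR REFRESH MATERIALIZED VIEW gold_daily_events
AS SELECT event_type, DATE(event_ts) AS event_date, COUNT(*) AS event_count
FROM silver_events
GROUP BY event_type, DATE(event_ts);
```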
15 hours ago
Basically, pipelines can be seen as a replacement for notebooks. You define all your logic in dlt functions and call those.
You get a nice lineage view, etc. But it comes at a price: less flexibility, and you are pushed into a rather strict way of working.
It is definitely not for everybody, especially if you have already been using Databricks for years.
Perhaps Databricks should write an actual technical blog post, without the marketing, about what really is and isn't possible.
Wednesday
Hi @john77
SQL task type: simple, one-off SQL operations or batch jobs, or when you need to orchestrate a mix of notebooks, Python/Scala code, and SQL in a single workflow.
Lakeflow Declarative Pipelines: complex, production ETL jobs that require lineage, monitoring, event logs, data quality rules, CDC, incremental processing, and automatically optimized plans.