Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Declarative Pipelines use in All-purpose compute?

ChristianRRL
Honored Contributor

Hi there, I know Spark Declarative Pipelines (previously DLT) has undergone some changes since last year and is even now open source (announcement). For a long while, I know that SDP/DLT was locked to only working with job compute. I'm wondering, with recent changes, will SDP now (or in the near future) work with All-purpose compute?

3 REPLIES

MoJaMa
Databricks Employee

Not today, and likely not in the future either.

1. If you create a pipeline yourself, the pipeline creates its own compute. This can be classic (i.e., you choose the nodes/node types) or serverless (recommended).

2. If you define a Streaming Table or Materialized View in DBSQL, it creates a serverless DLT/LSDP pipeline behind the scenes.

3. There is a plan to support it from Serverless Interactive (but not classic All Purpose) during the development lifecycle.

The main reason is that it needs a custom runtime (forked from DBR with various configs tuned/set), so the "normal" DBR that you use for classic all-purpose (or jobs) compute will not be sufficient.

ChristianRRL
Honored Contributor

Since Spark Declarative Pipelines are now open source, is it conceivable that a team may be able to leverage the open-source version to have a working version of SDP that works with All-purpose compute (assuming a future DBR LTS that has 4.1.x and above)? Is there something that I'm not seeing that may impede the open-source version from working in All-purpose compute?


SteveOstrowski
Databricks Employee

Hi @ChristianRRL,

To address both your original question and your follow-up about the open-source angle:

CURRENT STATE ON DATABRICKS

Lakeflow Spark Declarative Pipelines (SDP), the current name for what was previously known as DLT, runs on its own managed pipeline compute on Databricks. The two supported options are:

1. Serverless compute (recommended, and the default for new pipelines)
2. Classic pipeline compute (you configure worker/driver instance types, autoscaling, etc.)

SDP pipelines do not run on all-purpose (interactive) compute today. When you create or update a pipeline, the system provisions dedicated compute for that pipeline run. This is by design, as the SDP runtime includes specialized configurations and optimizations on top of the Databricks Runtime that are not present on standard all-purpose clusters.

For the development workflow, you use the Lakeflow Pipelines Editor in the workspace to iteratively develop and validate your pipeline code. Running in "development mode" speeds up iteration: on classic pipeline compute the cluster is reused between updates instead of being restarted, and automatic retries are disabled so that errors surface immediately.
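
To make the two compute options concrete, here is a minimal sketch of what the pipeline settings might look like in each case. The field names (serverless, development, clusters, autoscale) follow my understanding of the pipeline settings JSON and should be checked against the documentation linked below; the helper pipeline_settings, the pipeline names, and the node type i3.xlarge are hypothetical placeholders.

```python
import json


def pipeline_settings(name, serverless=True, development=True,
                      node_type_id=None, min_workers=1, max_workers=4):
    """Build a settings dict for serverless or classic pipeline compute.

    Hypothetical helper: field names are assumptions to verify against
    the pipeline configuration documentation.
    """
    settings = {
        "name": name,
        "development": development,  # development mode for faster iteration
        "serverless": serverless,
    }
    if not serverless:
        if node_type_id is None:
            raise ValueError("classic pipeline compute needs a node_type_id")
        # Classic pipeline compute: you pick instance types and autoscaling.
        settings["clusters"] = [{
            "label": "default",
            "node_type_id": node_type_id,
            "autoscale": {
                "min_workers": min_workers,
                "max_workers": max_workers,
                "mode": "ENHANCED",
            },
        }]
    return settings


if __name__ == "__main__":
    # Serverless: no cluster block at all.
    print(json.dumps(pipeline_settings("orders_pipeline"), indent=2))
    # Classic: the pipeline still describes its own compute.
    print(json.dumps(
        pipeline_settings("orders_pipeline_classic", serverless=False,
                          node_type_id="i3.xlarge"),
        indent=2,
    ))
```

In both shapes the compute is described inside the pipeline's own settings, which matches the point above: the pipeline provisions dedicated compute rather than attaching to an existing all-purpose cluster.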

Documentation for pipeline compute configuration:
https://docs.databricks.com/aws/en/ldp/configure-pipeline
https://docs.databricks.com/aws/en/ldp/develop

REGARDING THE OPEN-SOURCE VERSION

You are correct that Databricks contributed Spark Declarative Pipelines to the Apache Spark open-source project. The sql/pipelines module is available in the Apache Spark repository:
https://github.com/apache/spark/tree/master/sql/pipelines

There are some important distinctions to understand:

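1. The open-source Apache Spark Declarative Pipelines provides the core declarative programming model (defining datasets in Python with decorators such as @dp.table and @dp.materialized_view from the pyspark.pipelines module, or in SQL) and can run on Spark clusters outside Databricks, provided the Spark version ships the module.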

2. The Databricks version (Lakeflow SDP) adds significant platform integrations on top of that core, including Unity Catalog integration, managed pipeline compute orchestration, the Lakeflow Pipelines Editor, monitoring and observability, expectations/data quality enforcement at scale, enhanced autoscaling, and Photon acceleration.

3. When a future Databricks Runtime LTS ships with Spark 4.x that includes the open-source declarative pipelines module, it would theoretically be possible to use the core open-source APIs on all-purpose compute. However, you would not get the managed pipeline orchestration, automatic compute lifecycle, or the deeper platform integrations that the Databricks-managed SDP experience provides.
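
If you want to check whether a given environment could even load the open-source module, a small sketch like the following may help. It assumes the module is importable as pyspark.pipelines and that it first shipped with Spark 4.1 (the version mentioned in the question); both function names are hypothetical. Passing this check only tells you the core APIs are present, not that a pipeline will run on all-purpose compute.

```python
import importlib.util
import re


def version_supports_pipelines(spark_version, minimum=(4, 1)):
    """Return True if a Spark version string is at or above `minimum`.

    The (4, 1) default is an assumption about where the module first shipped.
    """
    match = re.match(r"(\d+)\.(\d+)", spark_version)
    if match is None:
        raise ValueError(f"unrecognized Spark version: {spark_version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def pipelines_module_available():
    """Return True if the pyspark.pipelines module can be located."""
    try:
        return importlib.util.find_spec("pyspark.pipelines") is not None
    except ImportError:
        # pyspark itself is missing or cannot be imported here.
        return False


if __name__ == "__main__":
    print(version_supports_pipelines("4.1.0"))
    print(pipelines_module_available())
```

On a cluster you could pass spark.version to the first function; the second only inspects the local Python environment.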

In short, the open-source contribution means the declarative programming model becomes portable across Spark environments, but the full managed experience on Databricks will continue to run on dedicated pipeline compute. For production workloads on Databricks, the recommended path remains using SDP pipelines with serverless or classic pipeline compute.
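
For readers new to the idea, here is a toy, pure-Python illustration of what "declarative" means in this context: you declare datasets as functions, and a small runner works out the execution order from their dependencies. This is not the Spark or Databricks API. The dataset decorator, the run_pipeline function, and the sample data are all hypothetical, and dependencies are listed explicitly here to keep the sketch short.

```python
from graphlib import TopologicalSorter

_REGISTRY = {}  # dataset name -> (function, upstream dataset names)


def dataset(*, depends_on=()):
    """Register a function as a dataset definition (toy decorator)."""
    def register(fn):
        _REGISTRY[fn.__name__] = (fn, tuple(depends_on))
        return fn
    return register


@dataset()
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": -5},
            {"id": 3, "amount": 30}]


@dataset(depends_on=["raw_orders"])
def valid_orders(raw_orders):
    # Drop rows that fail a simple quality rule.
    return [row for row in raw_orders if row["amount"] > 0]


@dataset(depends_on=["valid_orders"])
def revenue(valid_orders):
    return sum(row["amount"] for row in valid_orders)


def run_pipeline():
    """Materialize every registered dataset in dependency order."""
    graph = {name: set(deps) for name, (_, deps) in _REGISTRY.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = _REGISTRY[name]
        results[name] = fn(*(results[d] for d in deps))
    return results


if __name__ == "__main__":
    print(run_pipeline()["revenue"])
```

The definitions say what each dataset is, and the runner decides when to compute it. The managed experience on Databricks layers compute provisioning, monitoring, and the other integrations listed above on top of that core idea.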

Keep an eye on the Databricks release notes for any updates as the Spark 4.x line matures:
https://docs.databricks.com/aws/en/release-notes/

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.