Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakeflow Spark Declarative Pipeline

Digvijay_11
Databricks Partner
  1. How can we run an SDP pipeline in parallel with dynamic parameters passed at the pipeline level?
  2. How can we consume job-level parameters in a pipeline? If parameters with the same name are defined at the pipeline level, the job-level parameters get overwritten.
  3. Do we always have to create a Delta Live Table?
3 REPLIES

osingh
Contributor

To run an SDP (Spark Declarative Pipeline) in parallel with dynamic parameters, you need to understand that SDP is "smart": it builds a dependency graph and runs everything it can at the same time by default.

Here is a simple breakdown of how to handle your specific questions:

1. Running in Parallel
You don't actually need to write "parallel code." Because SDP is declarative, the engine looks at your @sdp.table definitions. If Table A and Table B don't depend on each other, SDP will automatically trigger them in parallel to save time.
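To see why no explicit parallel code is needed, here is a minimal, hypothetical sketch (plain Python with the standard library, not the actual SDP engine) of how a dependency graph naturally yields parallel stages. The table names are made up:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each table maps to the tables it reads from.
# table_a and table_b share no edge, so they land in the same parallel stage.
deps = {
    "table_a": set(),                   # no upstream dependencies
    "table_b": set(),                   # independent of table_a
    "final":   {"table_a", "table_b"},  # joins the two
}

ts = TopologicalSorter(deps)
ts.prepare()

stages = []
while ts.is_active():
    ready = list(ts.get_ready())  # everything runnable right now, in parallel
    stages.append(sorted(ready))
    ts.done(*ready)

print(stages)  # table_a and table_b appear together in the first stage
```

The key point: parallelism falls out of the graph structure, which is exactly what a declarative engine exploits for you.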

2. Handling Dynamic Parameters
To pass parameters into your pipeline without them getting messy or overwritten, the best way is to use Spark Configurations.

The Overwrite Issue: You're right that there is a precedence rule. When the same name is defined at both the Job and Pipeline levels, the Pipeline-level value wins, which is why your job-level parameters appear to be overwritten.
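A quick way to reason about the overwrite: conceptually, the pipeline's own configuration is merged on top of whatever the job passes in. This is only a plain-Python illustration of that merge order (the keys are made up), not Databricks internals:

```python
# Hypothetical illustration: pipeline config is applied on top of job params.
job_params = {"env": "dev", "run_date": "2025-01-01"}
pipeline_config = {"env": "prod"}  # same key defined at the pipeline level

effective = {**job_params, **pipeline_config}  # later dict wins on key clashes

print(effective["env"])       # the pipeline-level value survives
print(effective["run_date"])  # keys unique to the job still come through
```

So the safest pattern is simply to avoid key clashes between the two levels.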

3. Do you always need a Delta Live Table?
Nope! While SDP is the engine behind Delta Live Tables (DLT) in Databricks, the open-source version is flexible. You can write to Parquet, Iceberg, or even just Temporary Views if you don't want to save the data permanently.

However, using Delta is usually recommended because it supports "Time Travel" and "Z-Ordering," which makes your queries much faster later on.

You can refer to the official documentation for more details.

https://docs.databricks.com/aws/en/ldp/parameters

https://spark.apache.org/docs/latest/declarative-pipelines-programming-guide.html

Thank you!

Om Singh

JacekLaskowski
Databricks MVP

Just FYI, as of Jan 16th (the time I'm writing this answer), SDP and Delta Lake in their OSS versions don't work together yet.

SDP is part of Apache Spark 4.1, but Delta Lake does not support it at the moment. It's coming. No idea when it's gonna be available, though.

SteveOstrowski
Databricks Employee

Hi @Digvijay_11,

Here are answers to each of your three questions about Lakeflow Spark Declarative Pipelines (SDP):

1. RUNNING SDP PIPELINES IN PARALLEL WITH DYNAMIC PARAMETERS

SDP automatically determines the dependency graph across your table and view definitions. If two datasets do not depend on each other, the engine will execute them in parallel without any extra configuration on your part. You do not need to write explicit parallel logic.

For dynamic parameters, you define key-value pairs in the pipeline configuration (either through the UI or in your pipeline JSON definition). In Python, you reference them with spark.conf.get():

source_catalog = spark.conf.get("my_pipeline.source_catalog")

In SQL, you use the ${} syntax:

SELECT * FROM ${my_pipeline.source_catalog}.schema.table

You can set different values per environment. For example, in a JSON pipeline definition:

{
  "name": "My Pipeline - DEV",
  "configuration": {
    "my_pipeline.source_catalog": "dev_catalog",
    "my_pipeline.start_date": "2025-01-01"
  }
}

Note: parameter keys can contain underscores, hyphens, periods, and alphanumeric characters. Values are always strings. Avoid using reserved Spark configuration keys.
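Because values always come back as strings, cast them in code before using them. A small sketch where a plain dict stands in for spark.conf (the parameter names are illustrative):

```python
from datetime import date

# Stand-in for spark.conf: pipeline parameters always arrive as strings.
conf = {
    "my_pipeline.start_date": "2025-01-01",
    "my_pipeline.lookback_days": "7",
}

# Cast at the edge, once, so the rest of the pipeline works with typed values.
start_date = date.fromisoformat(conf["my_pipeline.start_date"])
lookback = int(conf["my_pipeline.lookback_days"])

print(start_date, lookback)
```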

Docs: https://docs.databricks.com/aws/en/ldp/parameters

2. JOB-LEVEL VS PIPELINE-LEVEL PARAMETERS

You are correct that when the same parameter name is defined at both the job level and the pipeline level, the pipeline-level configuration takes precedence and the job-level value gets overwritten. This is by design: the pipeline configuration is the authoritative source for pipeline parameters.

The recommended approach is to use distinct naming conventions to avoid collisions. For example, prefix your pipeline parameters with a namespace like "mypipeline." (e.g., mypipeline.env, mypipeline.source_path). This keeps them separate from any job-level parameters. If you need to pass values from a job into a pipeline task, use unique parameter names that do not overlap with the pipeline's own configuration keys.
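One lightweight way to enforce such a convention is a tiny helper that namespaces keys before they go into the pipeline configuration. This is illustrative only; the prefix and keys are made up:

```python
def namespaced(prefix: str, params: dict) -> dict:
    """Prefix every key so pipeline params cannot collide with job-level names."""
    return {f"{prefix}.{k}": v for k, v in params.items()}

pipeline_conf = namespaced("mypipeline", {"env": "prod", "source_path": "/mnt/raw"})
print(pipeline_conf)
# keys come out as 'mypipeline.env' and 'mypipeline.source_path'
```

With every pipeline key carrying the prefix, a job-level parameter named plain "env" can never shadow or be shadowed by the pipeline's own "mypipeline.env".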

Alternatively, if your pipeline is a task within a multi-task job and you need to propagate values from upstream tasks, consider using task values (dbutils.jobs.taskValues) in a notebook task that runs before the pipeline task, then reference those values through a separate mechanism rather than relying on overlapping parameter names.

Docs: https://docs.databricks.com/aws/en/jobs/parameters

3. DO YOU ALWAYS HAVE TO CREATE A DELTA LIVE TABLE?

No, you do not always have to create a Delta table. SDP supports three dataset types:

- Streaming tables: for incremental, append-only workloads (e.g., ingesting from cloud storage or message buses).
- Materialized views: for batch transformations that are recomputed on each pipeline update.
- Temporary views: for intermediate transformations that do not persist data. These are useful when you need a transformation step but do not want to store the result as a table.

That said, streaming tables and materialized views are backed by Delta and provide benefits like ACID transactions, time travel, and schema enforcement. Temporary views are only available within the pipeline run and are not stored. Choose the dataset type based on whether you need the data to persist and be queryable outside the pipeline.

In Python:

import dlt

@dlt.view
def my_temp_view():
    return spark.read.table("source_table").filter("status = 'active'")

@dlt.table
def my_final_table():
    return dlt.read("my_temp_view").groupBy("category").count()

In SQL:

CREATE TEMPORARY LIVE VIEW my_temp_view AS
SELECT * FROM source_table WHERE status = 'active';

CREATE OR REFRESH LIVE TABLE my_final_table AS
SELECT category, count(*) as cnt FROM LIVE.my_temp_view GROUP BY category;

Docs: https://docs.databricks.com/aws/en/ldp/index

* This reply was drafted with an agent system I built, which researches responses against the documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.