Lakeflow Spark Declarative Pipelines (SDP) is a framework designed for building scalable, maintainable ETL pipelines. Moving from traditional imperative ETL—standard Spark or pandas scripts—to a declarative approach saves time and effort and removes a whole class of operational headaches. This series of blog posts makes the transition easy by providing the solutions and best practices you need to get started.
By adopting a declarative approach, you focus exclusively on business logic. SDP automatically handles the heavy lifting—recomputing, retries, persistence, and state management—eliminating the redundant boilerplate code.
Let’s walk through an example.
Consider a common task: ingesting Change Data Capture (CDC) logs to maintain a Slowly Changing Dimension (SCD) Type 1 table. You need to upsert records based on a key, handle deletes, and ensure only the latest version of a record is kept.
| Spark Imperative code | Spark Declarative Pipelines |
| --- | --- |
In an imperative approach, this would require a complex, multi-line MERGE statement and manual state management. In SDP, the AUTO CDC flow handles the mechanics automatically. You focus only on the business logic. Bingo!
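To make the comparison concrete, here is a minimal sketch of both sides. The names (`customers`, `cdc_logs`, `customer_id`, `ts`, `operation`) are hypothetical, and the declarative half assumes the Lakeflow Python API (`create_streaming_table`, `create_auto_cdc_flow`); check the exact signatures in the docs for your runtime, since this code only runs inside a pipeline:

```python
from pyspark import pipelines as dp

# --- Imperative (sketch): hand-written MERGE you must run on every batch ---
# spark.sql("""
#   MERGE INTO customers t
#   USING latest_cdc_batch s
#   ON t.customer_id = s.customer_id
#   WHEN MATCHED AND s.operation = 'DELETE' THEN DELETE
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED THEN INSERT *
# """)

# --- Declarative: define the target once and describe the CDC flow ---
dp.create_streaming_table("customers")

dp.create_auto_cdc_flow(
    target="customers",          # SCD Type 1 table maintained for you
    source="cdc_logs",           # raw CDC feed (hypothetical name)
    keys=["customer_id"],        # upsert key
    sequence_by="ts",            # ordering column; latest record wins
    apply_as_deletes="operation = 'DELETE'",
    stored_as_scd_type=1,
)
```

Note how the declarative version carries no MERGE logic of its own: keys, ordering, and delete handling are declared once, and the framework applies them correctly across batches and restarts.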
Of course, AUTO CDC is just one building block of the framework—there are a few other simple patterns to keep in mind to get the most out of your pipelines. Let’s start with the fundamentals: how to save results into a table in SDP.
In the traditional Spark world, we typically end every script with a "command." We transform our data, and then we tell Spark exactly what to do with it: df.write.saveAsTable("sales").
When you first switch to Lakeflow SDP, you might hit a wall. You've written your transformation, you're ready to persist the data, and you reach for your trusty .write.save(). Except it isn't there. You have the data, but how do you actually make it land in a table?
In fact, in SDP, explicitly calling a "write" action isn't needed.
The shift is simple: in an imperative world, you send data to a table. In a declarative world, you define a table as the result of a function or a SQL query (SDP supports both Python and SQL).
Instead of manually coding steps to follow, you simply use a decorator. Note that, in Python SDP, every table declaration is a Python function that returns a Spark DataFrame:
| The Imperative Way | The SDP Way |
| --- | --- |
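As a sketch of what those two columns contain (the `sales_raw` source name is hypothetical, and the `spark` session is provided by the pipeline runtime):

```python
# The Imperative Way: transform, then explicitly command Spark to write.
# df = spark.read.table("sales_raw").filter("amount > 0")
# df.write.mode("overwrite").saveAsTable("sales")

# The SDP Way: declare a table as the result of a function.
from pyspark import pipelines as dp

@dp.table  # the table name "sales" is inferred from the function name
def sales():
    return spark.read.table("sales_raw").filter("amount > 0")
```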
Consider a more complex example to illustrate a daily aggregation over sales data:
| The Imperative Way | The SDP Way |
| --- | --- |
| This approach forces you to step outside the main logic flow into a "helper" function, manually managing the sink and checkpoints. | The framework handles the "how." You only define the "what." |
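A hedged sketch of the declarative side of this comparison, assuming a hypothetical upstream `sales` table with `order_ts` and `amount` columns:

```python
from pyspark import pipelines as dp
from pyspark.sql import functions as F

# Imperatively, this aggregation would also require choosing a save mode,
# a checkpoint location, and possibly foreachBatch plumbing. Declaratively,
# you only state the result you want:

@dp.materialized_view  # the framework keeps this view's table up to date
def daily_sales():
    return (
        spark.read.table("sales")  # upstream table (hypothetical name)
        .groupBy(F.to_date("order_ts").alias("order_date"))
        .agg(F.sum("amount").alias("total_amount"))
    )
```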
By adding @dp.table or @dp.materialized_view, you are telling the framework: "This function represents a physical table. Make it so." Note that, by default, the name of the table is inferred from the function name.
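When the inferred name is not what you want, the decorator accepts an explicit `name` argument (the table, function, and column names below are hypothetical):

```python
from pyspark import pipelines as dp

@dp.table(name="sales_clean")  # overrides the inferred name ("deduplicated")
def deduplicated():
    return spark.read.table("sales_raw").dropDuplicates(["order_id"])
```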
If you’re wondering, "Wait, when do I use a @dp.table versus a @dp.materialized_view?"—don't worry! I've used both in the examples above, and it’s perfectly normal if the distinction feels blurry right now. They look similar, but they handle data quite differently under the hood. We will dive into the differences in the next post.
As you start your journey with SDP, perform a regular "self-check" on your code:
"Am I focusing on the business logic, or am I getting into implementation details?"
If you find yourself figuring out save modes or `foreachBatch` semantics, you're likely still thinking imperatively. When you let the framework handle the "how," you know you're using SDP correctly.
SDP provides a standardized, managed framework for ETL pipelines. It automatically handles the checkpoints, storage paths, and boilerplate code, giving more room to focus on business logic and outcomes.
In the next post, we'll dive deep into the differences between streaming tables and materialized views, breaking down exactly when to use each one to keep your pipelines efficient. Stay tuned!