Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Conditional Execution in DLT Pipeline based on the output

rpilli
New Contributor

Hello,


I'm working on a Delta Live Tables (DLT) pipeline where I need to implement a conditional step that only triggers under specific conditions. Here's the challenge I'm facing:

  • I have a function that checks whether the data meets certain thresholds. If the data passes, the function returns the original DataFrame; if the thresholds are breached, it returns a different DataFrame containing logs or failure statistics (a simplified, hypothetical sketch of such a function is shown after this list).
  • The main challenge is that these two DataFrames have different schemas, which is causing difficulties in the pipeline.
  • I only want to initiate the logs-related steps in DLT when the threshold check fails. Currently, my pipeline writes two outputs regardless: one for the logs and one for the passed data, even if the passed data count is zero. This isn't the behavior I want.
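
For illustration, a hypothetical, simplified version of such a check function might look like this (the name check_thresholds, the metric column, and the min_rows parameter are just placeholders, not my actual code):

python

from pyspark.sql import DataFrame

def check_thresholds(df: DataFrame, min_rows: int = 100) -> DataFrame:
    passed_count = df.filter(df.metric > 0).count()
    if passed_count >= min_rows:
        # Threshold met: hand back the original DataFrame (original schema)
        return df
    # Threshold breached: return failure statistics (a different schema)
    return spark.createDataFrame(
        [(passed_count, min_rows, "threshold check failed")],
        "passed_count LONG, required LONG, logMessage STRING",
    )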

Question:
Is there a way to structure the DLT pipeline so that the logs path (graph or process) is only initiated when the threshold check fails? Ideally, I'm looking for a nested or conditional DLT step that only runs when the threshold validation fails.

  • The fact that DLT doesn't have built-in flow control mechanisms like traditional ETL tools is challenging.


Any guidance or best practices for achieving this would be greatly appreciated!
Thanks in advance for your help!

1 REPLY

mark_ott
Databricks Employee

In Delta Live Tables (DLT), native conditional or branch-based control flow is limited: all table/stream definitions declared in your pipeline will execute, and dependencies are handled via the @dlt.table and @dlt.view decorators. You can't dynamically skip or instantiate pipeline steps based purely on DataFrame content as you might in classic ETL tools, but there are best-practice patterns that approximate conditional logic for cases like threshold-based validation, especially with differing schemas.

Core Recommendation

You should separate your validation check from your log/error handling steps, and use filtering and table dependencies to "gate" data to each path, rather than trying to switch between DataFrames with differing schemas in the same step.

Pattern: Split Tables and Data Filters

  • Validation Table:
    Create a DLT table or view that applies your threshold check and only outputs rows passing the check.

  • Failure Logs Table:
    Create a separate DLT table or view that selects only rows failing your check, and transforms them into your failure schema. You can derive this from the same base input (or from the validation table, applying the inverse filter).

  • Downstream Consumption:
    Downstream steps will only see non-empty tables if the corresponding condition occurs. You can avoid writing zero-row tables by adding checks or limits in your final write logic, though DLT will still instantiate both tables as part of the pipeline graph.

    Example:

    python

    import dlt

    threshold = 0.5  # example threshold; replace with the value your check requires

    @dlt.table
    def raw_data():
        return spark.read.table("source_table")

    @dlt.table
    def passed_data():
        df = dlt.read("raw_data")
        return df.filter(df.metric > threshold)

    @dlt.table
    def failed_logs():
        df = dlt.read("raw_data")
        failed = df.filter(df.metric <= threshold)
        # Map the failed rows into the failure/log schema
        return failed.selectExpr("id", "metric", "'threshold breached' AS logMessage")

    This way, only the appropriate table receives data per run, but both destinations are defined in your graph.

Avoiding Zero-Row Tables

  • Optional: Use a final step to avoid writing downstream if a table has no data. DLT will still materialize the step, but nothing will be written or processed downstream if the table/view is empty (see the sketch after this list).

  • This does not prevent DLT from managing the table, but can keep your metrics cleaner.
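
A minimal sketch of this idea, assuming the failed_logs table from the example above: a downstream summary table only produces rows when failures exist, so nothing flows further on passing runs.

python

import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Failure summary; empty whenever the threshold check passes")
def failure_summary():
    failed = dlt.read("failed_logs")
    # An aggregation over zero rows yields zero rows, so consumers of
    # failure_summary only see data on runs where the check failed.
    return failed.groupBy("logMessage").agg(F.count("*").alias("failed_rows"))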

Why Not Dynamic Branching?

DLT's declarative framework builds the pipeline graph at start time, so you can't declare pipeline steps dynamically. However, by using DataFrame filters and data-dependent table population, you can mimic conditional behavior within the limitations of the framework.

Handling Different Schemas

  • Never return DataFrames with different schemas from the same output step. Instead, output to separately defined tables/views, each with a fixed schema, so DLT's schema expectations are satisfied. This keeps the pipeline reliable and clear (a sketch with explicit schemas follows below).
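
As a sketch of this (column names and types carried over from the example above are assumptions), each table can declare its own fixed schema explicitly via the schema argument of @dlt.table, so the contract of each output never changes between runs:

python

import dlt
from pyspark.sql.functions import col, lit

threshold = 0.5  # illustrative threshold, mirroring the example above

@dlt.table(schema="id STRING, metric DOUBLE")
def passed_data():
    return dlt.read("raw_data").filter(col("metric") > threshold).select("id", "metric")

@dlt.table(schema="id STRING, metric DOUBLE, logMessage STRING")
def failed_logs():
    return (
        dlt.read("raw_data")
        .filter(col("metric") <= threshold)
        .select("id", "metric", lit("threshold breached").alias("logMessage"))
    )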

Summary Table

| Requirement | DLT Solution | Notes |
| --- | --- | --- |
| Conditional logic | Data filters | Split via filtered tables/views |
| Different schemas | Distinct tables/views | Never change schema mid-pipeline |
| Output only logs on fail | Logs table populated only if failed rows exist | DLT materializes the table regardless; avoid zero-result processing downstream |

References

  • Databricks forums and official docs recommend this pattern for validations, error handling, and conditional branches in DLT.


If you need nested conditional logic or true "branching," you might need to orchestrate pipelines via external tools (Databricks Jobs, Airflow, etc.) or preprocess thresholds before ingesting to DLT. For most operational use cases, filtering/gating via tables as described above is the validated approach.
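
If you do go the orchestration route, one possible sketch (the task value key "status" and the threshold logic are assumptions) is to run a small validation task first, publish its result as a task value, and let an If/else condition task in the Databricks Job decide whether the logs-handling pipeline or task runs at all:

python

# First task of a Databricks Job: evaluate the threshold outside DLT.
passed = spark.read.table("source_table").filter("metric > 0.5").count() > 0

# Publish the outcome; an If/else condition task in the same Job can branch on
# this value and trigger the logs-handling task only when it equals "failed".
dbutils.jobs.taskValues.set(key="status", value="passed" if passed else "failed")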