Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Workflow Fail safe query

Skully
New Contributor

I have a large SQL query that includes multiple Common Table Expressions (CTEs) and joins across various tables, totaling approximately 2,500 lines. I want to ensure that if any part of the query or a specific CTE fails—due to a missing table or column—the entire workflow or query does not fail. Instead, I would like a mechanism to handle such failures by automatically substituting any missing attributes (columns) with NULL values. This would allow the query to continue executing without interruption.

How can I achieve this, and are there best practices for implementing a fail-safe mechanism for such a large and complex query?

1 REPLY

LingeshK
Databricks Employee

There are a few options you can try. Based on the information shared, I am assuming a skeleton for your complicated query as follows:

WITH cte_one AS (
    SELECT *
    FROM view_one
),
-- Other CTEs
...
-- Your main query logic
SELECT *
FROM cte_one
-- Joins and other clauses

In general, if a particular column has missing values and you want to substitute a default for them, you can try the COALESCE function. If you are already using that and want a more Databricks-specific approach, you can read about materialized views, streaming tables, views, and expectations on tables, which the Delta Live Tables (DLT) framework in Databricks offers.
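
As a minimal sketch of the COALESCE idea, using the PySpark DataFrame API (the table orders and the columns order_id, amount, and discount are hypothetical placeholders, not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source table; substitute your own table or view
orders = spark.table("orders")

# COALESCE returns the first non-NULL argument, so missing discounts
# fall back to 0.0 here (equivalent to COALESCE(discount, 0.0) in SQL)
orders_safe = orders.select(
    F.col("order_id"),
    F.col("amount"),
    F.coalesce(F.col("discount"), F.lit(0.0)).alias("discount"),
)

The same COALESCE expression can of course be used directly inside your SQL CTEs.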

A change of approach can be considered here as well, if you are flexible enough to use the Spark DataFrame API instead of pure SQL. That way you can check for the columns' existence first and remediate with an alternate query (or by adding the missing columns) prior to running the large query.
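
As a rough sketch of that column-existence check (the table sales and the required column names are hypothetical assumptions), you could add any missing columns as NULLs before the large query runs:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical table and required columns; adjust to your schema
df = spark.table("sales")
required_columns = ["customer_id", "region", "segment"]

# Add any column the query expects but the table no longer has,
# filled with NULLs so downstream CTEs and joins keep working
for column_name in required_columns:
    if column_name not in df.columns:
        df = df.withColumn(column_name, F.lit(None).cast(StringType()))

# Expose the remediated DataFrame to your SQL query as a temp view
df.createOrReplaceTempView("sales_safe")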

A final piece of advice from me is to try a more controlled approach, as follows:

  • STEP 1: Identify all required columns
  • STEP 2: Create views that supply default (NULL) values for any columns that might be missing (a sketch follows below)
  • STEP 3: Use those views in your query (either keeping the same SQL or refactoring the code to use a managed framework like DLT within Databricks)
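
A minimal sketch combining these steps (the table customers and the required columns and types are hypothetical placeholders) would wrap each source table in a view that always exposes the full set of required columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# STEP 1: hypothetical table and required columns with their types
table_name = "customers"
required_columns = {
    "customer_id": "BIGINT",
    "customer_name": "STRING",
    "customer_segment": "STRING",
}

existing = set(spark.table(table_name).columns)

# STEP 2: select the real column when it exists, otherwise a typed NULL
select_items = [
    name if name in existing else f"CAST(NULL AS {col_type}) AS {name}"
    for name, col_type in required_columns.items()
]
spark.sql(
    f"CREATE OR REPLACE TEMP VIEW {table_name}_safe AS "
    f"SELECT {', '.join(select_items)} FROM {table_name}"
)

# STEP 3: point the large query at customers_safe instead of customers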

If you have any additional information that might help us help you, please feel free to comment.

LK
