Workflow fail-safe query
11-20-2024 02:21 AM
I have a large SQL query that includes multiple Common Table Expressions (CTEs) and joins across various tables, totaling approximately 2,500 lines. I want to ensure that if any part of the query or a specific CTE fails (due to a missing table or column), the entire workflow or query does not fail. Instead, I would like a mechanism to handle such failures by automatically substituting any missing attributes (columns) with NULL values. This would allow the query to continue executing without interruption.
How can I achieve this, and are there best practices for implementing a fail-safe mechanism for such a large and complex query?
11-24-2024 10:25 AM
There are a few options you can try. Based on the information shared, I am assuming a skeleton for your complicated query as follows:
WITH cte_one AS (
SELECT *
FROM view_one
),
-- Other CTEs
...
-- Your main query logic
SELECT *  -- replace with the columns you actually need
FROM cte_one
-- Joins and other clauses
In general, if you have missing values in a particular column and want to fall back to a default (or an explicit NULL), you can use the COALESCE function. If you are already doing that and want a Databricks-native approach, you can read about materialized views, streaming tables, regular views, and table expectations offered by the Delta Live Tables (DLT) framework in Databricks.
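For illustration, here is a minimal PySpark sketch of the COALESCE idea. The table name view_one comes from the skeleton above; the columns id and amount are placeholders, not from your actual query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical columns: `amount` may contain NULLs that should default to 0.
df = spark.sql("""
    SELECT
        id,
        COALESCE(amount, 0) AS amount
    FROM view_one
""")
df.show()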
A change of approach can be considered here as well, if you are flexible enough to use the DataFrame API in Spark instead of pure SQL. That way you can check for the columns' existence first and remediate with an alternate query before running the large query, as sketched below.
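A minimal sketch of that idea, assuming a hypothetical list of required columns (the names required_columns, view_one, and cte_one_safe are illustrative, not from your query):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Hypothetical list of columns the large query expects to find.
required_columns = ["id", "amount", "customer_name"]

df = spark.table("view_one")

# Substitute a NULL literal for any column that is missing, so the
# large query can keep referencing it. The STRING cast is a placeholder;
# match it to the type the query actually expects.
for col_name in required_columns:
    if col_name not in df.columns:
        df = df.withColumn(col_name, lit(None).cast("string"))

# Register the patched result so the existing SQL can use it unchanged.
df.createOrReplaceTempView("cte_one_safe")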
My final advice is to try a more controlled approach, as follows:
- STEP 1: Identify all required columns
- STEP 2: Create views with default columns (see the sketch after this list)
- STEP 3: Use the views in your query (either using the same SQL or refactoring the code to utilize a managed framework like DLT within Databricks)
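As a sketch of STEP 2, assume a source table raw_table that may or may not contain an optional customer_name column (both names are placeholders). A wrapper view can supply the default so that STEP 3 can reuse the original SQL unchanged:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder names; substitute your own tables and columns.
source_table = "raw_table"
optional_column = "customer_name"

existing = spark.table(source_table).columns

# Expose the optional column as NULL when the source does not provide it.
if optional_column in existing:
    select_expr = f"SELECT * FROM {source_table}"
else:
    select_expr = (
        f"SELECT *, CAST(NULL AS STRING) AS {optional_column} "
        f"FROM {source_table}"
    )

spark.sql(f"CREATE OR REPLACE TEMP VIEW safe_source AS {select_expr}")

The large query then reads from safe_source instead of raw_table, so a missing column surfaces as NULL values rather than a failed workflow.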
If you have any additional information that might help us help you, please feel free to comment.

