Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Comparing Methods for Scheduling Streaming updates via dbt

bobmclaren
New Contributor II

We are trying to schedule updates to streaming tables and materialized views in Azure Databricks that we have defined in dbt.

Two options we are considering are `SCHEDULE CRON` and just scheduling `dbt run` commands via CI/CD. 

The `SCHEDULE CRON` option seems attractive at first because it uses the *significantly cheaper* jobs compute SKUs. However, I cannot find any provision for orchestrating the refreshes so that dependencies are respected (i.e., refreshing the dependent MV only after the ST has refreshed). This hurts the recency of data in MVs that depend on upstream STs, because a time gap must be built into the schedules to give each upstream refresh time to finish.
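For concreteness, here is a minimal sketch of what that option looks like; all names, hosts, and credentials are placeholders, and the one-off DDL is issued through the Databricks SQL connector (the scheduled refreshes themselves then run on Databricks-managed compute):

```python
# Minimal sketch of the SCHEDULE CRON approach: each object gets its own
# Quartz cron, with a guessed buffer between upstream and downstream.
# Table/view names and connection details are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # Refresh the upstream streaming table at the top of every hour.
        cur.execute(
            "ALTER STREAMING TABLE raw_events "
            "ADD SCHEDULE CRON '0 0 * * * ?'"
        )
        # Refresh the dependent MV 15 minutes later. The buffer is a guess;
        # nothing guarantees the upstream refresh has finished by then.
        cur.execute(
            "ALTER MATERIALIZED VIEW events_summary "
            "ADD SCHEDULE CRON '0 15 * * * ?'"
        )
```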

The `dbt run` approach handles this elegantly, multithreading where possible and refreshing MVs and STs in order according to their dependencies. Unfortunately, it seems that dbt must connect to a SQL warehouse and thus cannot use the more cost-efficient jobs compute SKUs.

Is my understanding of the pros/cons laid out here correct?  Are there other approaches that would provide a more cost effective use of resources?

 

1 REPLY

mark_ott
Databricks Employee

Your understanding of the trade-offs between using SCHEDULE CRON on streaming tables/materialized views and the dbt run orchestration in Azure Databricks is largely correct, and you’ve identified the two main pain points: orchestration of dependencies and cost efficiency of compute resources.

Comparing SCHEDULE CRON and dbt run Approaches

| Approach | Dependency orchestration | Cost efficiency | Parallelism/threading | Suitability for ST/MV refreshes |
|---|---|---|---|---|
| SCHEDULE CRON | Manual; must stagger or sequence jobs | Uses jobs compute SKUs (cheap) | Limited; each cron is discrete | OK for simple, independent jobs, but complex for dependencies |
| dbt run via CI/CD | Automatic; handles dependencies per the DAG | Requires SQL warehouse compute (expensive) | Efficient; parallelizes within the dependency graph | Ideal for dependency-respecting batch refreshes |
  • SCHEDULE CRON is attractive for its access to cheaper jobs compute SKUs, but its orchestration is rudimentary: you must manually coordinate refresh sequences and insert buffer time between updates so a downstream refresh does not start before its upstream finishes, and that buffer is exactly what degrades freshness.

  • dbt run leverages dbt’s DAG to refresh dependent tables/views in order (with parallelism where possible), so dependent MVs are fresh as soon as their upstreams are. However, dbt requires a SQL warehouse endpoint, which is more expensive.

Other Approaches to Improve Cost Efficiency

  • Orchestrate with Databricks Workflows: Use Databricks Jobs or Workflows, which can sequence tasks, respect dependencies, and leverage jobs compute. You can configure a Workflow to refresh streaming tables first, then dependent MVs, and even insert dbt calls where needed, sometimes using the jobs compute SKUs for Python/SQL tasks (see the first sketch after this list).

  • Hybrid Scheduling: For critical dependencies, use dbt only for the parts needing DAG orchestration, and schedule simpler, less dependent jobs with CRON/jobs compute. This reduces the number of expensive dbt runs.

  • dbt Artifacts for Orchestration: Use dbt’s manifest file to plan the dependency sequence and trigger Databricks jobs in a dependency-respecting order (e.g., parse the manifest, refresh STs, then trigger MV refreshes only after their upstream STs complete). This requires some custom scripting but can automate orchestration while using cheaper compute (see the second sketch after this list).
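As a rough illustration of the Workflows option (a sketch, not a drop-in solution), the Databricks Python SDK can define a job whose MV-refresh task waits on the ST-refresh task. The warehouse ID and saved-query IDs below are placeholders, and the REFRESH statements themselves would live in those saved queries:

```python
# Hedged sketch: a Workflow with two dependent SQL tasks, defined via the
# Databricks Python SDK (databricks-sdk). All IDs below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment/config

w.jobs.create(
    name="refresh-st-then-mv",
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # hourly, Quartz syntax
        timezone_id="UTC",
    ),
    tasks=[
        # Task 1: refresh the upstream streaming table.
        jobs.Task(
            task_key="refresh_st",
            sql_task=jobs.SqlTask(
                warehouse_id="<warehouse-id>",
                query=jobs.SqlTaskQuery(query_id="<query-refreshing-the-st>"),
            ),
        ),
        # Task 2: runs only after task 1 succeeds, so the MV never refreshes
        # against a half-finished upstream.
        jobs.Task(
            task_key="refresh_mv",
            depends_on=[jobs.TaskDependency(task_key="refresh_st")],
            sql_task=jobs.SqlTask(
                warehouse_id="<warehouse-id>",
                query=jobs.SqlTaskQuery(query_id="<query-refreshing-the-mv>"),
            ),
        ),
    ],
)
```

Note that the SQL tasks here still execute on a warehouse; swapping them for notebook or Python tasks on jobs compute, where the workload allows it, is what recovers the cheaper SKUs.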
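And for the dbt-artifacts route, a minimal sketch of deriving a dependency-respecting order from the manifest, using only the standard library and assuming dbt compile or dbt run has already written target/manifest.json:

```python
# Hedged sketch: topologically sort dbt models from target/manifest.json so
# refreshes can be triggered upstream-first on cheaper compute.
import json
from graphlib import TopologicalSorter

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Keep only models; each node lists its upstream node IDs in depends_on.nodes.
models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}
graph = {
    node_id: [dep for dep in node["depends_on"]["nodes"] if dep in models]
    for node_id, node in models.items()
}

# static_order() yields each model only after all of its upstreams, so issuing
# refreshes (or triggering Databricks jobs) in this order respects the DAG.
for node_id in TopologicalSorter(graph).static_order():
    print(models[node_id]["name"])  # placeholder: trigger the real refresh here
```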

Summary Table

| Pros/Cons | SCHEDULE CRON | dbt run | Other approaches (Workflows, hybrid) |
|---|---|---|---|
| Handles dependencies | Must DIY + buffer times | Handled in the DAG | Workflows can orchestrate a DAG |
| Compute cost | Jobs compute (cheap) | SQL warehouse (expensive) | Jobs compute if orchestrated |
| Parallelism | Manual / multiple crons | Native multithreading | Workflows can parallelize |
| Automation | Limited | Full dependency DAG | Workflows + dbt artifacts |

Recommendations

  • For dependency-respecting, resource-efficient orchestration, investigate Databricks Workflows with custom task sequencing; this may give you dbt-like ordering with jobs compute cost benefits.

  • A hybrid approach (mixing CRON for simple jobs, dbt for dependent refreshes, and workflows for orchestration) often maximizes cost efficiency and freshness but does require some extra configuration.

  • dbt is ideal for complex dependency graphs, but where practical, try to minimize the scope or frequency of SQL warehouse-powered dbt runs for cost savings.

Your analysis is accurate, and alternative orchestration via Databricks Workflows or hybrid strategies may provide better balance between cost and data freshness.