Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Comparing Methods for Scheduling Streaming updates via dbt

bobmclaren
New Contributor II

We are trying to schedule updates to streaming tables and materialized views in Azure Databricks that we have defined in dbt.

Two options we are considering are `SCHEDULE CRON` and just scheduling `dbt run` commands via CI/CD. 

The `SCHEDULE CRON` option seems attractive at first because it uses the *significantly cheaper* jobs compute SKUs. However, I cannot find any provision for orchestrating the refreshes so that dependencies are respected (i.e., refreshing the dependent MV only after the ST has refreshed). This hurts the recency of data in MVs that depend on upstream STs, because a time gap must be built into the schedules to give each upstream refresh time to finish.
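For concreteness, here is a minimal sketch of what that option looks like; all names, hosts, and credentials are placeholders, and the one-off DDL is issued through the Databricks SQL connector (the scheduled refreshes themselves then run on Databricks-managed compute):

```python
# Minimal sketch of the SCHEDULE CRON approach: each object gets its own
# Quartz cron, with a guessed buffer between upstream and downstream.
# Table/view names and connection details are placeholders.
from databricks import sql

with sql.connect(
    server_hostname="<workspace-host>",
    http_path="<warehouse-http-path>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        # Refresh the upstream streaming table at the top of every hour.
        cur.execute(
            "ALTER STREAMING TABLE raw_events "
            "ADD SCHEDULE CRON '0 0 * * * ?'"
        )
        # Refresh the dependent MV 15 minutes later. The buffer is a guess;
        # nothing guarantees the upstream refresh has finished by then.
        cur.execute(
            "ALTER MATERIALIZED VIEW events_summary "
            "ADD SCHEDULE CRON '0 15 * * * ?'"
        )
```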

The `dbt run` approach handles this elegantly, multithreading where possible and refreshing MVs and STs in order according to their dependencies. Unfortunately, it seems that dbt must connect to a SQL warehouse and thus cannot use the more cost-efficient jobs compute SKUs.

Is my understanding of the pros/cons laid out here correct?  Are there other approaches that would provide a more cost effective use of resources?

 

1 REPLY

mark_ott
Databricks Employee

Your understanding of the trade-offs between using SCHEDULE CRON on streaming tables/materialized views and the dbt run orchestration in Azure Databricks is largely correct, and you’ve identified the two main pain points: orchestration of dependencies and cost efficiency of compute resources.

Comparing SCHEDULE CRON and dbt run Approaches

| Approach | Dependency orchestration | Cost efficiency | Parallelism/threading | Suitability for ST/MV refreshes |
|---|---|---|---|---|
| SCHEDULE CRON | Manual; must stagger or sequence jobs | Uses jobs compute SKUs (cheap) | Limited; each cron is discrete | OK for simple, independent jobs, but complex for dependencies |
| dbt run via CI/CD | Automatic; handles dependencies per the DAG | Requires SQL warehouse compute (expensive) | Efficient; parallelizes within the dependency graph | Ideal for dependency-respecting batch refreshes |
  • SCHEDULE CRON is attractive for its access to cheaper jobs compute SKUs, but its orchestration is rudimentary: you must manually coordinate refresh sequences and insert buffer time between updates so a downstream refresh does not start before its upstream finishes, and that buffer is exactly what degrades freshness.

  • dbt run leverages dbt’s DAG to refresh dependent tables/views in order (with parallelism where possible), so dependent MVs are fresh as soon as their upstreams are. However, dbt requires a SQL warehouse endpoint, which is more expensive.

Other Approaches to Improve Cost Efficiency

  • Orchestrate with Databricks Workflows: Use Databricks Jobs or Workflows, which can sequence tasks, respect dependencies, and leverage jobs compute. You can configure a Workflow to refresh streaming tables first, then dependent MVs, and even insert dbt calls where needed, sometimes using the jobs compute SKUs for Python/SQL tasks (see the first sketch after this list).

  • Hybrid Scheduling: For critical dependencies, use dbt only for the parts needing DAG orchestration, and schedule simpler, less dependent jobs with CRON/jobs compute. This reduces the number of expensive dbt runs.

  • dbt Artifacts for Orchestration: Use dbt’s manifest file to plan the dependency sequence and trigger Databricks jobs in a dependency-respecting order (e.g., parse the manifest, refresh STs, then trigger MV refreshes only after their upstream STs complete). This requires some custom scripting but can automate orchestration while using cheaper compute (see the second sketch after this list).
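As a rough illustration of the Workflows option (a sketch, not a drop-in solution), the Databricks Python SDK can define a job whose MV-refresh task waits on the ST-refresh task. The warehouse ID and saved-query IDs below are placeholders, and the REFRESH statements themselves would live in those saved queries:

```python
# Hedged sketch: a Workflow with two dependent SQL tasks, defined via the
# Databricks Python SDK (databricks-sdk). All IDs below are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads credentials from the environment/config

w.jobs.create(
    name="refresh-st-then-mv",
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 * * * ?",  # hourly, Quartz syntax
        timezone_id="UTC",
    ),
    tasks=[
        # Task 1: refresh the upstream streaming table.
        jobs.Task(
            task_key="refresh_st",
            sql_task=jobs.SqlTask(
                warehouse_id="<warehouse-id>",
                query=jobs.SqlTaskQuery(query_id="<query-refreshing-the-st>"),
            ),
        ),
        # Task 2: runs only after task 1 succeeds, so the MV never refreshes
        # against a half-finished upstream.
        jobs.Task(
            task_key="refresh_mv",
            depends_on=[jobs.TaskDependency(task_key="refresh_st")],
            sql_task=jobs.SqlTask(
                warehouse_id="<warehouse-id>",
                query=jobs.SqlTaskQuery(query_id="<query-refreshing-the-mv>"),
            ),
        ),
    ],
)
```

Note that the SQL tasks here still execute on a warehouse; swapping them for notebook or Python tasks on jobs compute, where the workload allows it, is what recovers the cheaper SKUs.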
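And for the dbt-artifacts route, a minimal sketch of deriving a dependency-respecting order from the manifest, using only the standard library and assuming dbt compile or dbt run has already written target/manifest.json:

```python
# Hedged sketch: topologically sort dbt models from target/manifest.json so
# refreshes can be triggered upstream-first on cheaper compute.
import json
from graphlib import TopologicalSorter

with open("target/manifest.json") as f:
    manifest = json.load(f)

# Keep only models; each node lists its upstream node IDs in depends_on.nodes.
models = {
    node_id: node
    for node_id, node in manifest["nodes"].items()
    if node["resource_type"] == "model"
}
graph = {
    node_id: [dep for dep in node["depends_on"]["nodes"] if dep in models]
    for node_id, node in models.items()
}

# static_order() yields each model only after all of its upstreams, so issuing
# refreshes (or triggering Databricks jobs) in this order respects the DAG.
for node_id in TopologicalSorter(graph).static_order():
    print(models[node_id]["name"])  # placeholder: trigger the real refresh here
```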

Summary Table

| Pros/Cons | SCHEDULE CRON | dbt run | Other approaches (Workflows, hybrid) |
|---|---|---|---|
| Handles dependencies | Must DIY + buffer times | Handled in the DAG | Workflows can orchestrate a DAG |
| Compute cost | Jobs compute (cheap) | SQL warehouse (expensive) | Jobs compute if orchestrated |
| Parallelism | Manual / multiple crons | Native multithreading | Workflows can parallelize |
| Automation | Limited | Full dependency DAG | Workflows + dbt artifacts |

Recommendations

  • For dependency-respecting, resource-efficient orchestration, investigate Databricks Workflows with custom task sequencing; this may give you dbt-like ordering with jobs compute cost benefits.

  • A hybrid approach (mixing CRON for simple jobs, dbt for dependent refreshes, and workflows for orchestration) often maximizes cost efficiency and freshness but does require some extra configuration.

  • dbt is ideal for complex dependency graphs, but where practical, try to minimize the scope or frequency of SQL warehouse-powered dbt runs for cost savings.

Your analysis is accurate, and alternative orchestration via Databricks Workflows or hybrid strategies may provide better balance between cost and data freshness.