Comparing Methods for Scheduling Streaming Updates via dbt

We have defined streaming tables (STs) and materialized views (MVs) in dbt, and we are trying to schedule their updates in Azure Databricks.

Two options we are considering are `SCHEDULE CRON` and just scheduling `dbt run` commands via CI/CD.

The `SCHEDULE CRON` option seems attractive at first because it uses the *significantly cheaper* jobs compute SKUs. However, I cannot find any provision for orchestrating the refreshes so that dependencies are respected (e.g. refreshing a dependent MV only after its upstream ST has been refreshed). Without that, the schedules need a time gap between the ST refresh and the MV refresh, which hurts the recency of the data in any MV that depends on an upstream ST.
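To make the recency cost concrete, here is a rough back-of-the-envelope sketch in Python. The numbers (hourly refresh, a 4-minute ST refresh, a 15-minute safety gap) are made up for illustration and are not measurements from our workspace; the MV's own refresh time is ignored.

```python
from datetime import timedelta


def mv_staleness_bounds(period, lag_after_st_start):
    """Return (freshest, stalest) age of the MV's data, measured from the
    start of the ST refresh it was built from.

    lag_after_st_start: how long after the ST refresh starts the MV refresh runs.
    """
    return lag_after_st_start, period + lag_after_st_start


# Hypothetical values, for illustration only.
period = timedelta(hours=1)            # both objects refresh hourly
st_duration = timedelta(minutes=4)     # typical ST refresh time
schedule_gap = timedelta(minutes=15)   # safety gap baked into the MV's cron

# Independent cron schedules: the MV waits for the full safety gap.
staggered = mv_staleness_bounds(period, schedule_gap)

# Dependency-aware orchestration: the MV starts as soon as the ST finishes.
chained = mv_staleness_bounds(period, st_duration)

extra_staleness = staggered[0] - chained[0]
print(f"extra staleness from the fixed gap: {extra_staleness}")
```

Under these assumptions the fixed gap makes the MV data 11 minutes older at every point in the cycle, and the gap has to be sized for the slowest ST refresh, not the typical one.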

The `dbt run` approach handles this elegantly, running models in parallel where possible and refreshing the STs and MVs in dependency order. Unfortunately, it seems that dbt must connect to a SQL warehouse and thus cannot use the more cost-efficient jobs compute SKUs.
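For anyone unfamiliar with what "dependency order" means here, the idea can be sketched with Python's standard `graphlib` module. This is a simplified illustration, not dbt's actual scheduler, and the model names are hypothetical: objects with no unfinished upstream dependencies are grouped into a batch that could run in parallel, and the next batch only starts once its parents are done.

```python
from graphlib import TopologicalSorter

# Hypothetical model graph: each key depends on the models in its set.
dependencies = {
    "st_orders": set(),
    "st_customers": set(),
    "mv_daily_revenue": {"st_orders"},
    "mv_customer_ltv": {"st_orders", "st_customers"},
}


def refresh_batches(deps):
    """Group models into batches; every model in a batch has all of its
    upstream dependencies in earlier batches, so a batch can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished, unblocking dependents
    return batches


batches = refresh_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"batch {number}: {batch}")
```

With this graph both STs land in the first batch and both MVs in the second, which is the ordering that independent cron schedules can only approximate with a time gap.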

Is my understanding of the pros and cons laid out here correct? Are there other approaches that would make more cost-effective use of resources?