Over the past few years, the variety of tools available to data teams has surged, with dbt emerging as a popular solution for data transformation. It empowers SQL-proficient users to build flexible data pipelines with data validation, lineage, and documentation. Complementing dbt's capabilities, Databricks offers a native dbt task type in Databricks Workflows, making it possible to schedule dbt runs without an external orchestrator.

Typically, dbt projects are executed using the build or run commands. However, executing the entire project in one go can pose challenges, especially when a run fails partway through. To mitigate this, dbt introduced the retry command, which resumes execution from the point of failure of a previous run.
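
For example (a minimal sketch, with the failure itself purely illustrative), a typical sequence looks like this:

dbt build    # one model fails; its downstream models are skipped
# ...investigate and fix the underlying code or data issue...
dbt retry    # re-runs only the failed model and the nodes skipped after it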

In this blog post, we'll dive deeper into the dbt retry command and explore how to use it together with Unity Catalog volumes and the repair run functionality in Databricks Workflows. These features, all natively available in the Databricks Data Intelligence Platform, help streamline data engineering processes and keep data pipelines running continuously.

About the dbt Retry Command

The dbt retry command is designed to re-run only the failed or skipped models from a previous dbt build, eliminating the need to re-run the entire project. This approach is more efficient than manually checking and re-running individual models. The retry command is applicable to models, tests, seeds, and snapshots, offering versatility in retrying various components of a dbt project.

This command relies on the state stored in “run_results.json” and reuses any selectors from that execution. To execute a retry, reference the folder where state files are stored as follows:

dbt retry --state /path/to/state_dir/
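
Because retry reuses the selectors from the original invocation, retrying a selective build stays within that selection. A small sketch (the tag name is illustrative, and target/ is dbt's default artifact directory):

dbt build --select tag:nightly
dbt retry --state target/    # retries only the failed and skipped nodes, still limited to tag:nightly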

To customize the storage location of the run_results.json file, which by default is written to the /target folder in your dbt project's root directory, you can specify an alternative path using the --target-path option on dbt commands such as build, run, test, and seed.

For example:

dbt build --target-path /path/to/state_dir/

For more details, visit the dbt documentation.

Leveraging Volumes in Unity Catalog for dbt Retries

Unity Catalog volumes make it much easier to store and access the state produced by a dbt run. Volumes are logical storage locations backed by cloud object storage services such as S3, ADLS, or GCS and governed within Unity Catalog.

The primary benefit of volumes is that they can be accessed like a local filesystem, abstracting away the complexities of the underlying cloud storage services. This is particularly useful for managing the state required by the dbt retry command.
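
For example, from a notebook or job running on Unity Catalog-enabled compute, the state written by a dbt run can be inspected with ordinary filesystem commands (the catalog, schema, and volume names below are placeholders):

ls /Volumes/main/analytics/dbt_state/
head /Volumes/main/analytics/dbt_state/run_results.json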

For instance, if the 'run_results.json' file is stored in a volume, the dbt retry command simplifies to:

dbt retry --state /Volumes/catalog/schema/volume_name/path_to_state

Note: The workflow runner, user, or service principal running the job needs read and write permissions on the volume.
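
One way to grant these privileges is through the Unity Catalog grants API, for example via the Databricks CLI. The sketch below assumes the CLI's grants command is available in your environment; the volume name and principal are placeholders:

databricks grants update volume main.analytics.dbt_state --json '{"changes": [{"principal": "<service-principal-or-user>", "add": ["READ_VOLUME", "WRITE_VOLUME"]}]}'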


How to Use dbt Retry in a Databricks Workflow

With volumes, storing and using the state from a dbt execution becomes straightforward. The example workflow below pairs an initial dbt build task with a dbt retry task.

[Workflow diagram: a Databricks Workflow with a dbt_build task followed by a dbt_retry task]

The steps below describe the example workflow in more detail (a sketch of a complete job definition follows the list):

  1. Initial dbt Build Task: This task runs dbt build with the --target-path option to write the state files to the volume. Using dynamic value references in Databricks Workflows keeps each run's state in its own folder, so it is not overwritten by subsequent runs. This setup can be tailored to your job's specific needs:
    dbt build --target-path /Volumes/catalog/schema/my_volume/{{job.id}}/{{job.run_id}}
  2. dbt Retry Task: This task executes the dbt retry command, pointing at the path of the state files written by the previous task:
    dbt retry --state /Volumes/catalog/schema/my_volume/{{job.id}}/{{job.run_id}}
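
For reference, here is a rough sketch of how such a two-task job could be created with the Databricks CLI and a Jobs API payload. The Git URL, SQL warehouse ID, cluster ID, and catalog/schema/volume names are placeholders, and the retry task is assumed to be configured to run even when the build task fails (Run if: All done), which matches the failure scenario described next:

databricks jobs create --json '{
  "name": "dbt_build_with_retry",
  "git_source": {
    "git_url": "https://github.com/<org>/<dbt-project>",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "tasks": [
    {
      "task_key": "dbt_build",
      "dbt_task": {
        "commands": [
          "dbt deps",
          "dbt build --target-path /Volumes/catalog/schema/my_volume/{{job.id}}/{{job.run_id}}"
        ],
        "warehouse_id": "<sql-warehouse-id>"
      },
      "libraries": [{"pypi": {"package": "dbt-databricks"}}],
      "existing_cluster_id": "<cluster-id>"
    },
    {
      "task_key": "dbt_retry",
      "depends_on": [{"task_key": "dbt_build"}],
      "run_if": "ALL_DONE",
      "dbt_task": {
        "commands": [
          "dbt deps",
          "dbt retry --state /Volumes/catalog/schema/my_volume/{{job.id}}/{{job.run_id}}"
        ],
        "warehouse_id": "<sql-warehouse-id>"
      },
      "libraries": [{"pypi": {"package": "dbt-databricks"}}],
      "existing_cluster_id": "<cluster-id>"
    }
  ]
}'

If you manage jobs with Databricks Asset Bundles instead, the task definitions take the same shape in the bundle's YAML.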

Now let's assume a code or data issue affects one of the tables in our dbt pipeline, causing both the dbt_build and dbt_retry tasks to fail. In these circumstances, users can investigate and resolve the underlying issue before repairing the run and re-executing the dbt_retry task.

[Screenshots: selecting Repair run for the failed job run and re-executing the dbt_retry task]

This subsequent run will proceed from the point of failure, keeping the pipeline operating efficiently and continuously. The Repair run button is found in the top right corner of the Databricks Workflows UI.

Note: A path in the workspace filesystem can also be used instead of a volume. This can be helpful in a development environment where you make many experimental runs, and it is simple to configure with environment overrides in Databricks Asset Bundles. However, we encourage using volumes for simpler governance and interoperability. For example:

dbt build --target-path /Workspace/Users/user@mail.com/dbt_runs/{{job.id}}/{{job.run_id}}


Summary

In conclusion, the integration of dbt within Databricks Workflows offers a powerful and streamlined approach to data transformation and pipeline management. The dbt retry command saves time and resources by eliminating the need to re-run entire projects from scratch. Combined with Unity Catalog volumes, it simplifies the management of the state files required for retries, making it easier to build and operate resilient data pipelines.

This post walked through an example of implementing the dbt retry command within a Databricks Workflow, showing how state files can be stored in volumes and referenced for retries. The setup uses dynamic value references to prevent state from being overwritten on each run and can be adapted to the needs of a specific job.

Overall, the combination of dbt's retry capabilities and Databricks Workflows’ native support for dbt tasks represents a robust solution for data engineers seeking to optimize their dbt pipelines.