Welcome to the fifth instalment of our blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform.
In our previous blog, Maximizing Resource Utilisation with Cluster Reuse, we explored the cluster reuse functionality that improves resource utilization and streamlines workflow execution. In this blog, we turn to schedules and triggers in Databricks Workflows. Workflows provide a robust scheduling system, enabling you to run jobs immediately, periodically, continuously, or based on events.
Let us dive into it.
In today's data-driven landscape, the ability to automate data pipelines is essential for organizations seeking to streamline their workflows and drive insights at scale. Databricks offers powerful features for orchestrating these pipelines. Databricks Workflows provides various ways to trigger your job. You can run your jobs immediately, periodically, based on events, or continuously.
Databricks Workflows lets you trigger jobs in several ways and provides multiple options for configuring those triggers:
Let us look at each of them in more detail.
You can define jobs to run periodically at specified times on a schedule, continuously, or in an event-driven manner.
Scheduled triggers play a central role in automating job execution based on predefined schedules. This is useful for tasks that need to be executed at specific times or intervals, providing a high degree of flexibility and control over your data workflows.
Below is how you can create a scheduled trigger in Databricks:
You can specify a time zone for your schedule, ensuring that your jobs run precisely when you need them to. You can also pause a scheduled job at any time, giving you the flexibility to manage your jobs as your needs and priorities change.
In Databricks, job scheduling plays a crucial role in orchestrating and automating various data processing tasks. However, there are important considerations to keep in mind when setting up and managing job schedules.
Note: Databricks uses Quartz cron syntax to describe schedules. More complex schedules that cannot be expressed in the UI can be defined directly with a Quartz cron expression.
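To make this concrete, here is a minimal sketch of attaching a Quartz cron schedule when creating a job with the Databricks Python SDK. The job name, notebook path, and cluster ID below are hypothetical placeholders, and the snippet assumes the databricks-sdk package is installed and authentication is configured.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up host/token from the environment or a config profile

# Create a job that runs every day at 06:00 UTC
# (Quartz cron fields: second minute hour day-of-month month day-of-week).
job = w.jobs.create(
    name="daily-sales-load",  # hypothetical job name
    tasks=[
        jobs.Task(
            task_key="load",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/load_sales"),  # hypothetical path
            existing_cluster_id="1234-567890-abcde123",  # hypothetical cluster ID
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 6 * * ?",
        timezone_id="UTC",
        pause_status=jobs.PauseStatus.UNPAUSED,  # use PAUSED to create the schedule in a paused state
    ),
)
print(f"Created job {job.job_id}")
```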
Event-based triggers offer a mechanism for initiating data processing jobs in response to specific events. In Databricks, these triggers enable users to automate data workflows, streamline processes, and respond dynamically to changes in their data environment, thus enhancing the efficiency and responsiveness of data processing systems.
Currently, Databricks Workflows offers two types of event triggers: File Arrival Triggers and Table Triggers (coming soon).
File Arrival Triggers initiate a job run when new files arrive in a specified cloud storage location governed by Unity Catalog. This is particularly useful when data arrives irregularly, making scheduled or continuous jobs inefficient. These triggers check for new files every minute without incurring additional costs beyond the cloud provider fees. They can be set up through the UI or API. Notifications can be set up to alert when a file arrival trigger fails to evaluate. As this is an important feature, we will cover it in detail in the next blog post in this series.
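As a rough sketch of what this looks like programmatically, the snippet below attaches a file arrival trigger to an existing job with the Python SDK. The job ID and the Unity Catalog-governed storage URL are hypothetical.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Attach a file arrival trigger to an existing job (job ID and storage URL are hypothetical).
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        trigger=jobs.TriggerSettings(
            file_arrival=jobs.FileArrivalTriggerConfiguration(
                url="abfss://landing@mystorageaccount.dfs.core.windows.net/raw/",  # UC external location
                min_time_between_triggers_seconds=60,  # optional: throttle how often the job can fire
            ),
            pause_status=jobs.PauseStatus.UNPAUSED,
        )
    ),
)
```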
Table triggers in Databricks are designed to manage job runs based on updates to specified Delta tables. They are particularly useful when data is written unpredictably or in bursts, which could otherwise lead to frequent job triggers, causing unnecessary costs and latency.
Note: File Arrival Triggers and Table Triggers will be covered in detail in separate blog posts later in this series.
Real-time data processing has become increasingly crucial for businesses aiming to gain insights and make informed decisions swiftly. Continuous triggers in data pipelines offer a seamless solution, ensuring that the job always has an active run and that a new run starts automatically, even after failures.
In Databricks you can create a continuous trigger like below:
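Beyond the UI, a continuous trigger can also be set programmatically. Here is a minimal sketch with the Python SDK, assuming an existing job (the job ID is hypothetical):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Switch an existing job (hypothetical ID) to continuous mode: as soon as one run
# finishes or fails, Databricks starts a new one.
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        continuous=jobs.Continuous(pause_status=jobs.PauseStatus.UNPAUSED),
    ),
)
```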
With continuous triggers, there are a few key considerations to keep in mind:
Manual triggers allow you to initiate a job directly from the Workflows UI. This is particularly useful for ad-hoc tasks, testing, debugging, or handling unexpected data processing needs when a job has to run immediately. The UI provides two ways to manually trigger a job, as shown below.
The "Run Now" button allows you to initiate a job immediately as is. If you have job-level parameters, the job will be submitted with the default values.
If you have job-level parameters, you can choose to change them before submitting. A typical example is when the job needs to be re-run for the past several days. In the screenshot below, the job-level parameter ‘load_past_n_days’ is set to 1 by default; the user can change it when submitting to reload data for multiple days.
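The same re-run scenario can also be scripted. Below is a hedged sketch using the Python SDK's run_now call with an overridden job-level parameter; the job ID is hypothetical, and the snippet assumes an SDK version that supports job-level parameters.

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Equivalent of "Run now" with an overridden job-level parameter:
# reload the past 7 days instead of the default 1 (job ID is hypothetical).
run = w.jobs.run_now(
    job_id=123,
    job_parameters={"load_past_n_days": "7"},
).result()  # .result() blocks until the run finishes

print(run.state.result_state)
```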
Note: Databricks provides a Run Submit API, which triggers a one-time run. This endpoint allows you to submit a workload directly without creating a job: you pass the run's configuration and a new run is submitted in Databricks. Runs submitted using this endpoint don't display in the UI.
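For completeness, here is a rough sketch of such a one-time run using the SDK's submit call; the run name, notebook path, and cluster ID are hypothetical.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# One-time run without creating a job; the run does not appear in the Jobs list.
run = w.jobs.submit(
    run_name="adhoc-backfill",  # hypothetical run name
    tasks=[
        jobs.SubmitTask(
            task_key="backfill",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/backfill"),  # hypothetical path
            existing_cluster_id="1234-567890-abcde123",  # hypothetical cluster ID
        )
    ],
).result()  # wait for the run to finish
```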
You can set your Workflow triggers through the Databricks UI or programmatically via APIs.
The Databricks Workflows UI offers a user-friendly and intuitive interface that simplifies the creation, management, and monitoring of multi-task workflows for ETL, analytics, and machine learning pipelines.
You can set these triggers on the left hand pane under ‘Schedules & Triggers’ of your Databricks job.
API-based approaches enable you to efficiently manage, scale, and automate your data workflows. You can programmatically create jobs by specifying various configurations, including the cluster on which the job should run, and define tasks as notebook tasks, Spark JAR tasks, or Python scripts, among others. You can also define the trigger so the job runs periodically at specified times, continuously, or on events, as explained in the earlier sections of this blog.
You can programmatically author and set the triggers of your Workflows through the following methods:
The choice between these options often depends on your specific use case, the complexity of your workflows, and your team's familiarity with these tools.
| Option | Why Choose | When to Choose |
| --- | --- | --- |
| Databricks Python SDK | The Python SDK provides a Pythonic way to interact with Databricks. It covers all public Databricks REST API operations and is particularly useful for managing resources such as clusters, jobs, and notebooks dynamically. | Choose the Python SDK when your team wants a Pythonic way to automate operations across accounts, workspaces, and related resources, for example setting triggers. |
| Databricks REST API | The REST API enables programmatic access to Databricks through HTTP requests and is a good choice if you want to interact with the API directly via curl or a library like 'requests'. | Choose the REST API when you want to automate operations directly without an SDK, or when you want to integrate Databricks operations into other systems or services that can make HTTP requests. |
| Databricks CLI | The Databricks CLI wraps the Databricks REST API, providing a command-line interface to automate Databricks, and is useful for integrating Databricks workflows into broader data processing and analysis pipelines. | Choose the CLI when you want a command-line interface to automate Databricks accounts, workspace resources, and data operations. |
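For example, attaching a schedule to an existing job through the REST API directly might look like the sketch below, which uses Python's requests library; the job ID is hypothetical, and the host and token are assumed to be available as environment variables.

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]  # personal access token

# Attach (or change) a Quartz cron schedule on an existing job via the Jobs API 2.1.
resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # hypothetical job ID
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 2 * * ?",  # every day at 02:00
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",
            }
        },
    },
)
resp.raise_for_status()
```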
Databricks Asset Bundles (DABs) and the Databricks Terraform Provider are both popular approaches for streamlining the deployment and management of Databricks Workflows. They allow you to package your data pipelines as code, enabling automated, repeatable deployments, simplifying the promotion of complex workflows across environments, and reducing the possibility of human error.
For a deeper dive into setting up your Databricks Workflows using these tools, refer to the Databricks Asset Bundles and the Databricks Terraform Provider documentation.
Stay tuned for future blogs that will explore these topics in greater detail!
In the realm of data processing and analytics, the ability to automate and precisely control the execution of workflows is paramount. Databricks Workflows emerges as a pivotal solution, offering a sophisticated scheduling system and versatile triggering mechanisms that cater to a wide array of operational needs. From the convenience of scheduled triggers to the dynamic responsiveness of event and continuous triggers, Databricks provides the tools necessary for the seamless automation of data pipelines. The introduction of innovative features such as File Arrival Triggers and Table Triggers further underscores Databricks' commitment to minimizing latency and optimizing resource utilization. Databricks Workflows equips organizations with the capability to efficiently manage complex data tasks. This not only enhances productivity but also empowers teams to derive valuable insights from their data with higher efficiency.