Welcome to the fifth instalment of our blog series exploring Databricks Workflows, a powerful product for orchestrating data processing, machine learning, and analytics pipelines on the Databricks Data Intelligence Platform.

In our previous blog, Maximizing Resource Utilisation with Cluster Reuse, we explored the cluster reuse functionality that enhances resource utilization and streamlines workflow execution. In this blog, we focus on the intricacies of schedules and triggers in Databricks Workflows. Workflows provides a robust scheduling system, enabling you to run jobs immediately, periodically, continuously, or based on events.

Let's dive in.

Overview of Job Schedules and Triggers

In today's data-driven landscape, the ability to automate data pipelines is essential for organizations seeking to streamline their workflows and drive insights at scale. Databricks Workflows provides several ways to trigger a job: you can run it immediately, on a schedule, in response to events, or continuously.

Jobs in Databricks Workflows can be triggered in several ways (a configuration sketch follows the list below):

[Image: the four ways to trigger a job in Databricks Workflows]

  1. Scheduled trigger: Runs the job on a schedule you define in the UI or via the API. Databricks Workflows uses Quartz cron syntax, so advanced patterns are supported.
  2. Event trigger: Runs the job when a specific occurrence or condition is detected, responding to external events in real time. This eliminates the need for external services (AWS Lambda, Azure Functions, etc.) to trigger Databricks jobs.
  3. Continuous trigger: Keeps a real-time streaming job running efficiently; if the active run fails, a new run starts automatically.
  4. Run Job/Manual trigger: Runs an existing job on demand, typically from the UI. Databricks also provides a Run Submit API that submits a one-time run; external orchestrators such as Azure Data Factory or Airflow use it to submit workloads to Databricks.
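To make these options concrete, the sketch below shows, in an illustrative and hypothetical form, where each trigger type lives in a job's settings when working with the Jobs API 2.1. A given job uses only one of these mechanisms at a time.

```python
# Illustrative Jobs API 2.1 settings fragments (paths and values are hypothetical);
# a given job is configured with only ONE of these trigger mechanisms.
scheduled = {"schedule": {"quartz_cron_expression": "0 0 6 * * ?",  # daily at 06:00
                          "timezone_id": "UTC"}}

event_based = {"trigger": {"file_arrival": {"url": "s3://my-bucket/landing/"}}}

continuous = {"continuous": {"pause_status": "UNPAUSED"}}

# Manual runs need no trigger configuration: use "Run now" in the UI,
# the run-now endpoint for an existing job, or runs/submit for a one-time run.
```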

Databricks provides several ways to configure these triggers for a job:

[Image: options for setting job triggers (API-based and UI-based)]

  1. API-based: The Databricks Jobs API lets you programmatically create, manage, and monitor job runs, so jobs can respond to specific events or conditions without manual intervention and integrate seamlessly with external systems and tools for comprehensive workflow orchestration. The available options are:
    1. REST API
    2. Python SDK
    3. Command Line Interface (CLI)
  2. UI-based: You can also set triggers from the Workflows UI.

Let us look at each of them in more detail.

Triggers

You can configure jobs to run on a schedule at specified times, continuously, or in response to events.

Scheduled Triggers 

Scheduled triggers play a central role in automating job execution based on predefined schedules. This is useful for tasks that need to be executed at specific times or intervals, providing a high degree of flexibility and control over your data workflows. 

Below is how you can create a scheduled trigger in Databricks:

[GIF: creating a scheduled trigger in the Databricks Workflows UI]

You can specify a time zone for your schedule, ensuring that your jobs run precisely when you need them to. It also allows you to pause a scheduled job at any time, thus giving you flexibility to manage your jobs according to your changing needs and priorities.

In Databricks, job scheduling plays a crucial role in orchestrating and automating various data processing tasks. However, there are important considerations to keep in mind when setting up and managing job schedules.

  • Minimum Interval Requirement: Databricks enforces a minimum interval of 10 seconds between consecutive runs triggered by a job's schedule. This interval remains constant regardless of the specified seconds configuration in the cron expression. Adhering to this requirement helps maintain a consistent execution cadence and prevents overloading the system with frequent job runs.
  • Time Zone Selection: When configuring job schedules, users have the option to choose between time zones that observe daylight saving time or UTC. To ensure runs occur reliably at every hour, especially for absolute time-based schedules, opting for UTC is recommended.
  • Latency Considerations: Delays in job execution, ranging from a few seconds to several minutes, may occur due to network or cloud infrastructure issues. In such scenarios, scheduled jobs will execute promptly once the service becomes available, ensuring minimal disruption to data processing workflows.

Note: Databricks uses Quartz cron syntax to describe the schedule. For complex schedules that cannot be expressed through the UI, you can set them directly with a Quartz cron expression (for example, via the API).
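As a concrete, hedged example, the sketch below sets a daily 06:00 UTC schedule on an existing job through the Jobs API 2.1 update endpoint using Python's requests library; the job ID is hypothetical, and the workspace host and token are read from environment variables purely for illustration.

```python
# Minimal sketch: set a Quartz cron schedule on an existing job via Jobs API 2.1.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,  # hypothetical job ID
        "new_settings": {
            "schedule": {
                "quartz_cron_expression": "0 0 6 * * ?",  # sec min hour day-of-month month day-of-week
                "timezone_id": "UTC",
                "pause_status": "UNPAUSED",  # "PAUSED" suspends the schedule
            }
        },
    },
)
resp.raise_for_status()
```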


Event Triggers

Event-based triggers offer a mechanism for initiating data processing jobs in response to specific events. In Databricks, these triggers enable users to automate data workflows, streamline processes, and respond dynamically to changes in their data environment, thus enhancing the efficiency and responsiveness of data processing systems.

  • Efficient Resource Utilization: Event-based triggers allow jobs to run only when necessary, such as when a new file arrives in cloud storage or when a table load finishes. This is particularly beneficial when a continuous streaming job is not economical, for example because updates are infrequent or spread across many locations; resources are not wasted on checking for updates when none have occurred.
  • Improved Data Processing Turnaround Time: Many businesses receive external data into their storage systems and need to process it as soon as possible. Triggering the job on arrival reduces turnaround time and allows for quicker data analysis and decision-making.
  • Integration with Other Systems: Such triggers can integrate smoothly with other systems like Auto Loader and Delta Live Tables. For instance, customers with jobs that update their Delta Live Tables (DLT) pipelines can attach a trigger on data arrival to their DLT pipeline update job, instead of scheduling updates based on time.

Databricks Workflows currently offers two types of event triggers: File Arrival Triggers and Table Triggers (coming soon).

File Arrival Triggers

File Arrival Triggers initiate a job run when new files arrive in a cloud storage location governed by Unity Catalog. This is particularly useful when data arrives irregularly, making scheduled or continuous jobs inefficient. These triggers check for new files every minute without incurring additional costs beyond the cloud provider fees, and they can be set up through the UI or the API. Notifications can also be configured to alert you when a file arrival trigger fails to evaluate.

[GIF: configuring a file arrival trigger in the Databricks Workflows UI]
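For reference, a file arrival trigger can also be expressed in the job's settings through the API. The fragment below is a minimal sketch of the Jobs API 2.1 trigger block with a hypothetical storage path and optional throttling fields; it would be passed as new_settings in the same jobs/update call shown earlier.

```python
# Minimal sketch of a file arrival trigger in Jobs API 2.1 job settings
# (the storage path and throttling values are hypothetical).
new_settings = {
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            "url": "s3://my-bucket/landing/",          # monitored storage location
            "min_time_between_triggers_seconds": 60,   # optional: throttle how often the job can fire
            "wait_after_last_change_seconds": 60,      # optional: wait for uploads to settle
        },
    }
}
```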

Table Triggers (Coming Soon)

Table triggers in Databricks are designed to manage job runs based on updates to specified Delta tables. They are particularly useful when data is written unpredictably or in bursts, which could otherwise lead to frequent job triggers, causing unnecessary costs and latency.

Note: The File Arrival Triggers and the Table Triggers will be covered in detail in separate blog posts later in this series.


Continuous Triggers

Real-time data processing has become increasingly crucial for businesses aiming to gain insights and make informed decisions swiftly. Continuous triggers in data pipelines offer a seamless solution, ensuring that a job is always running even in the case of failures.

In Databricks, you can create a continuous trigger as shown below:

[GIF: configuring a continuous trigger in the Databricks Workflows UI]

With continuous triggers, there are a few key considerations to keep in mind (a minimal configuration sketch follows this list):

  • Concurrency Control: Only one instance of a continuous job can be active at any given time, ensuring seamless execution without interference.
  • Minimal Delay: While transitioning between runs, there's a brief pause, typically less than 60 seconds, to allow for a smooth handover from one run to the next.
  • Task Dependencies and Retry Policies: Unlike traditional job setups, continuous jobs do not support task dependencies or retry policies. Instead, they utilize exponential backoff to manage run failures effectively.
  • Immediate Action on Paused Jobs: Initiating a "Run now" action on a paused continuous job promptly triggers a new run, facilitating rapid response to changes in job status.
  • Updating Job Configurations: To apply updated configurations to a continuous job, cancel the current run; a new run starts automatically with the revised settings. Alternatively, use the "Restart run" option to ensure the job runs with the latest configuration.
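As a minimal sketch, enabling continuous mode amounts to a single block in the job's settings (Jobs API 2.1 field names); as noted above, cancelling the active run starts a fresh run with any updated settings.

```python
# Minimal sketch: continuous mode in Jobs API 2.1 job settings.
# Only one run is active at a time; failed runs are replaced automatically,
# with exponential backoff between restarts.
new_settings = {
    "continuous": {"pause_status": "UNPAUSED"}  # set to "PAUSED" to stop launching new runs
}
```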

Run Job/Manual Trigger

Manual triggers allow you to initiate a job directly from the Workflows UI. This is handy for ad-hoc tasks or when you need to run a job immediately, and it is particularly useful for testing, debugging, or handling unexpected data processing needs. The UI provides two ways to manually trigger a job, as shown below.

Run Now

The "Run Now" button initiates a job immediately with its current configuration. If the job has job-level parameters, the run is submitted with their default values.

[GIF: triggering a job with the "Run Now" button]

 

Run now with different parameters

If the job has job-level parameters, you can change their values before submitting the run. A common example is re-running a job for the past several days. In the screenshot below, the job-level parameter 'load_past_n_days' is set to 1 by default; users can change it at submission time to reload data for multiple days.

[GIF: using "Run now with different parameters" to override the load_past_n_days parameter]
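The same override can be applied programmatically. The sketch below calls the Jobs API 2.1 run-now endpoint with Python's requests library; the job ID and the override value are hypothetical.

```python
# Minimal sketch: trigger an existing job and override a job-level parameter.
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "job_id": 123,                                # hypothetical job ID
        "job_parameters": {"load_past_n_days": "7"},  # override the default of 1
    },
)
resp.raise_for_status()
print(resp.json()["run_id"])  # ID of the run that was just triggered
```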

 

Note: Databricks also provides a Run Submit API, which triggers a one-time run. This endpoint allows you to submit a workload directly without creating a job: you pass the run's configuration in the request, and a new run is submitted in Databricks. Runs submitted using this endpoint don't display in the UI.
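Below is a minimal sketch of such a one-time submission through the runs/submit endpoint; the run name, notebook path, and cluster ID are hypothetical placeholders.

```python
# Minimal sketch: submit a one-time run without creating a job (Jobs API 2.1).
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

payload = {
    "run_name": "one-time-adhoc-run",                            # hypothetical run name
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Shared/ingest"},  # hypothetical notebook
            "existing_cluster_id": "1234-567890-abcde123",          # hypothetical cluster
        }
    ],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/runs/submit",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["run_id"])  # one-time run ID
```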


Setting Triggers

You can set your Workflow triggers through the Databricks UI or programmatically via APIs.  

UI Based

The Databricks Workflows UI offers a user-friendly and intuitive interface that simplifies the creation, management, and monitoring of multi-task workflows for ETL, analytics, and machine learning pipelines. 

You can set these triggers on the left hand pane under ‘Schedules & Triggers’ of your Databricks job. 

[Screenshot: the "Schedules & Triggers" pane of a Databricks job]


API Based

API-based approaches enable you to efficiently manage, scale, and automate your data workflows. You can programmatically create jobs by specifying various configurations, including the cluster on which the job should run, and define tasks as notebook tasks, Spark JAR tasks, Python scripts, and more. You can also define the trigger so the job runs periodically at specified times, continuously, or on events, as explained in the earlier sections of this blog.

You can programmatically author and set the triggers of your Workflows through the following methods:

The choice between these options often depends on your specific use case, the complexity of your workflows, and your team's familiarity with these tools; a short Python SDK sketch follows the overview below.

  • Databricks Python SDK
    • Why choose: The Python SDK provides a Pythonic way to interact with Databricks. It covers all public Databricks REST API operations and is particularly useful for managing resources such as clusters, jobs, and notebooks dynamically.
    • When to choose: Choose the Python SDK when your team wants a Pythonic way to automate operations across accounts, workspaces, and related resources, for example setting triggers.
  • Databricks REST API
    • Why choose: The REST API enables programmatic access to Databricks through HTTP requests and is a good choice if you want to interact with the API directly via curl or a library like 'requests'.
    • When to choose: Choose the REST API when you want to automate operations directly without an SDK, or when you want to integrate Databricks operations into other systems or services that can make HTTP requests.
  • Databricks CLI
    • Why choose: The Databricks CLI wraps the Databricks REST API, providing a command-line interface to automate Databricks, and is useful for integrating Databricks workflows into broader data processing and analysis pipelines.
    • When to choose: Choose the CLI when you want a command-line interface to automate Databricks accounts, workspace resources, and data operations.
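As an illustration of the first option, the sketch below uses the Databricks Python SDK (databricks-sdk) to set a 30-minute Quartz schedule on an existing job; the job ID and cron expression are assumptions, and WorkspaceClient picks up authentication from the environment or a Databricks configuration profile.

```python
# Minimal sketch using the Databricks Python SDK (pip install databricks-sdk).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # auth from env vars or ~/.databrickscfg

w.jobs.update(
    job_id=123,  # hypothetical job ID
    new_settings=jobs.JobSettings(
        schedule=jobs.CronSchedule(
            quartz_cron_expression="0 0/30 * * * ?",  # every 30 minutes
            timezone_id="UTC",
            pause_status=jobs.PauseStatus.UNPAUSED,
        )
    ),
)
```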

 

Packaging Databricks Workflows

Databricks Asset Bundles (DABs) and the Databricks Terraform Provider are both popular approaches for streamlining the deployment and management of Databricks Workflows. They allow you to package your data pipelines as code, enabling automated, repeatable deployments, simplifying the promotion of complex workflows across environments, and reducing the possibility of human error.

For a deeper dive into setting up your Databricks Workflows using these tools, refer to the Databricks Asset Bundles and the Databricks Terraform Provider documentation. 

Stay tuned for future blogs that will explore these topics in greater detail!

Conclusion

In the realm of data processing and analytics, the ability to automate and precisely control the execution of workflows is paramount. Databricks Workflows emerges as a pivotal solution, offering a sophisticated scheduling system and versatile triggering mechanisms that cater to a wide array of operational needs. From the convenience of scheduled triggers to the dynamic responsiveness of event and continuous triggers, Databricks provides the tools necessary for the seamless automation of data pipelines. The introduction of innovative features such as File Arrival Triggers and Table Triggers further underscores Databricks' commitment to minimizing latency and optimizing resource utilization. Databricks Workflows equips organizations with the capability to efficiently manage complex data tasks. This not only enhances productivity but also empowers teams to derive valuable insights from their data with higher efficiency.