Modern data pipelines can be complex, especially when dealing with massive volumes of data from diverse sources. Managing the processing of this data is not too dissimilar to the responsibilities of a conductor in an orchestra, coordinating each element of the pipeline to streamline the flow of data in harmony.
Databricks Workflows offers a unified and streamlined approach to managing your Data, BI, and AI workloads. You can define data workflows through the user interface or programmatically, making the platform accessible to both technical and non-technical teams. Its simplicity, reliability, ease of authoring, and price point (it is FREE!) empower organisations of all sizes to tackle the challenges of data orchestration efficiently.
If you want to learn more about the pivotal role of orchestration in achieving success with your data pipelines, I recommend diving into the blog: ‘Why orchestration is your key to success for modern data pipelines.’
In this 2-part blog series, I will guide you through the core orchestration, operations and monitoring concepts of Databricks Workflows. I aim to illustrate the fundamentals of Databricks Workflows from the perspective of a Retail customer looking to run their product recommendation data pipeline.
Although the underlying narrative of this blog is centred around orchestrating a product recommendation pipeline, the core concepts discussed are applicable across all industry verticals and use cases.
You are part of a large retailer looking to provide more personalised customer experiences. The goal is to construct a cutting-edge product recommender system that can provide online and offline AI-driven insights. This new data capability will allow the retailer to create a holistic and data-driven ecosystem that personalises online interactions and informs strategic initiatives to elevate and improve the customer journey.
Your Data Engineering team has built the components of the data pipeline responsible for:
Your Data Analyst teams want to extract insights to understand customer behaviour better. The team has built:
Your Data Science and Machine Learning team want to build on past customer behaviour patterns and provide tailored recommendations to better personalise and improve customer experience.
Your task is to integrate these components into a sophisticated, end-to-end data pipeline capable of orchestrating the Data Engineering, Data Warehousing & BI, and Machine Learning components - enabling your business to enhance the shopping experience and foster customer satisfaction and loyalty.
Putting it all together, our Databricks Workflow might look like the following:
A Databricks job allows you to execute data processing and analysis tasks within a Databricks workspace.
To create your first Databricks job:
A Databricks job comprises one or more Databricks tasks – a specific unit of work representing an individual step or action. Jobs can consist of just a single task, or they can be an intricate workflow of multiple tasks chained together by dependencies.
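The UI is often the quickest way to get started, but jobs can also be created programmatically. As an illustration, a minimal sketch using the Databricks SDK for Python might look like the following; the job name, notebook path, and cluster ID are placeholders rather than parts of our actual pipeline.

```python
# Minimal sketch (Databricks SDK for Python): create a job with a single
# Notebook task. All names, paths, and IDs below are illustrative placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # authenticates via environment variables or a config profile

created = w.jobs.create(
    name="product-recommendation-pipeline",
    tasks=[
        jobs.Task(
            task_key="ingest_sales_data",  # unique task identifier within the job
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Shared/etl/ingest_sales_data"  # hypothetical notebook
            ),
            existing_cluster_id="1234-567890-abcde123",  # placeholder cluster ID
        )
    ],
)
print(f"Created job {created.job_id}")
```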
Our product recommender pipeline will be a multi-task job composed of many parts that span data engineering, data warehousing and machine learning.
Databricks Workflows supports the execution of many different types of tasks, including but not limited to: Notebooks, JARs, Python scripts and wheels, and SQL files. You can also define a Delta Live Tables (DLT) pipeline as a task or orchestrate dbt jobs natively. If you follow a modular, parameter-driven design approach, you can “run a job as a task”, executing one job from within another.
You can read more about the different supported task types in our documentation.
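To illustrate the “run a job as a task” pattern mentioned above, a hedged sketch using the Databricks SDK for Python could look like this; the child job ID and task key are hypothetical.

```python
# Illustrative sketch: trigger an existing job as a task inside another job.
# The child job ID below is a placeholder.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="parent-orchestration-job",
    tasks=[
        jobs.Task(
            task_key="run_feature_engineering_job",
            run_job_task=jobs.RunJobTask(job_id=123456789),  # placeholder ID of the child job
        )
    ],
)
```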
As the Data Engineering team built their ETL pipelines in Databricks Notebooks, our first task will be of type Notebook. These Notebooks can reside either in the Workspace or can be sourced from a remote Git repository.
To create our Notebook task:
You can apply additional task configurations that provide fine-grained control and flexibility over the execution and orchestration requirements of each task.
| Configuration | Description |
| --- | --- |
| Dependent Libraries | Specify additional libraries required for your task's execution, ensuring the necessary resources are installed and available when the task runs. |
| Parameters | Depending on the task type, you can pass parameters as key-value pairs, JSON, or positional arguments. For example, Notebooks accept JSON or key-value pairs via the UI, whereas Python scripts take positional arguments as a list. Parameters let you customise the behaviour of your tasks without creating multiple copies of the same task with different settings. You can also dynamically set and retrieve parameter values across tasks to build more mature and sophisticated data pipelines. |
| Depends On | Task dependencies allow you to establish control flows within a job, ensuring a task only runs once its dependencies have finished executing. |
| Notifications | Receive updates when your task starts, succeeds, fails, or breaches defined duration thresholds. |
| Task retries | Specify how many times a task should be retried in case of failure. This enhances the reliability of your workflows by automatically recovering from transient issues. |
| Duration threshold | Define execution time limits for a task, either to warn you if the task runs longer than expected, or to alert you and terminate the task if it exceeds the maximum set completion time. |
Once you have set these, you can click Create Task!
Repeat Steps #2 and #3 for the rest of the tasks in your orchestration workflow, selecting the appropriate task type.
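For reference, the task-level configurations described above can also be expressed programmatically. The sketch below uses the Databricks SDK for Python; the task keys, notebook path, parameter, and PyPI package are illustrative assumptions rather than parts of the actual pipeline.

```python
# Illustrative sketch: a task with parameters, a dependency, a dependent
# library, retries, and a maximum duration. All names and values are placeholders.
from databricks.sdk.service import compute, jobs

build_features = jobs.Task(
    task_key="build_customer_features",
    notebook_task=jobs.NotebookTask(
        notebook_path="/Workspace/Shared/ml/build_customer_features",  # hypothetical notebook
        base_parameters={"lookback_days": "30"},                       # key-value parameters
    ),
    depends_on=[jobs.TaskDependency(task_key="ingest_sales_data")],    # runs after ingestion
    libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="scikit-learn==1.4.2"))],
    max_retries=2,                     # retry up to twice on failure
    min_retry_interval_millis=60_000,  # wait one minute between retries
    timeout_seconds=3600,              # terminate the task if it runs longer than an hour
)
# The Task would then be included in the `tasks` list passed to w.jobs.create(...).
```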
Databricks Workflows also allows you to orchestrate your SQL-based workloads, such as SQL files, queries, dashboards, and alerts, via the SQL task type.
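As a hedged illustration of a SQL task, the sketch below attaches a SQL file to a SQL warehouse; the warehouse ID and file path are placeholders.

```python
# Illustrative sketch: orchestrate a SQL file as a task running on a SQL warehouse.
# The warehouse ID and file path are placeholders.
from databricks.sdk.service import jobs

load_dim_customers = jobs.Task(
    task_key="load_dim_customers",
    sql_task=jobs.SqlTask(
        warehouse_id="abcdef1234567890",  # placeholder SQL warehouse ID
        file=jobs.SqlTaskFile(path="/Workspace/Shared/sql/load_dim_customers.sql"),
    ),
)
```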
Machine Learning workloads are also typically written in Databricks Notebooks. From within a Notebook, you can perform feature engineering, model retraining, and batch scoring. However, if your pipeline is written as Python files or wheels, these code assets do not need to be converted to Notebooks and can be orchestrated natively as their own task types.
To run model retraining on a weekly basis even though the pipeline itself runs on a different cadence (say, daily), we can add an If/else condition task and specify the boolean expression below:
{{job.start_time.[iso_weekday]}} == 7
This will evaluate to “true” only when the job starts on a Sunday (ISO weekday 7).
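Defined programmatically, the condition task and the retraining task that hangs off its “true” branch might look roughly like the sketch below (Databricks SDK for Python); the task keys and notebook path are hypothetical.

```python
# Illustrative sketch: an If/else condition task comparing the ISO weekday of
# the job start time against 7 (Sunday), plus a retraining task gated on "true".
from databricks.sdk.service import jobs

is_sunday = jobs.Task(
    task_key="is_it_sunday",
    condition_task=jobs.ConditionTask(
        left="{{job.start_time.[iso_weekday]}}",  # dynamic value resolved at run time
        op=jobs.ConditionTaskOp.EQUAL_TO,
        right="7",
    ),
)

retrain_model = jobs.Task(
    task_key="retrain_recommender",
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Shared/ml/retrain_recommender"),
    depends_on=[jobs.TaskDependency(task_key="is_it_sunday", outcome="true")],  # "true" branch only
)
```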
For tasks dependent on another task's outcome, you can specify these relationships in the ‘Depends on’ field by selecting the desired task name. Based on these configurations, Databricks Workflows will automatically schedule and trigger tasks as and when their dependencies are met.
Example: The dimension tables in our product recommendation pipeline are not dependent on one another and can therefore be orchestrated and loaded in parallel. However, we only want our fact table to run once all of the dependent dimension tables have been successfully loaded. Based on the provided dependencies, Databricks Workflows automatically determines the correct and efficient order in which to execute the tasks.
By default, a Databricks Workflows task will only run if all of its dependencies complete successfully. However, tasks can also fail or be excluded. If a failure occurs, the remaining dependent tasks do not run and the errors are raised. Under such a scenario, you might need to catch these errors and perform certain automated activities.
With the Run if dependencies field, you can define tasks that are triggered only when specific dependency conditions are met.
Example: We want to perform custom logging and clean-up if even one of our dependencies fails. We can do so by changing the Run if dependencies setting from “All succeeded” to “At least one failed” for our designated “Log and Clean Up” task.
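Programmatically, this clean-up behaviour corresponds to the task's Run if setting; a hedged sketch is shown below, with placeholder task keys and notebook path.

```python
# Illustrative sketch: a clean-up task that runs when at least one of its
# dependencies fails. Task keys and the notebook path are placeholders.
from databricks.sdk.service import jobs

log_and_clean_up = jobs.Task(
    task_key="log_and_clean_up",
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Shared/ops/log_and_clean_up"),
    depends_on=[
        jobs.TaskDependency(task_key="load_fact_sales"),
        jobs.TaskDependency(task_key="load_dim_customers"),
    ],
    run_if=jobs.RunIf.AT_LEAST_ONE_FAILED,  # equivalent to "At least one failed" in the UI
)
```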
For more dynamic, parameter-driven control flows, you can add an “If/else condition” task to build branching logic based on the result of a boolean expression. The expression can be evaluated against a job or task parameter, or against a task value set by an upstream task.
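Task values are one way to pass such dynamic results between tasks. A minimal sketch, assuming the code runs inside Databricks notebook tasks (where dbutils is available automatically), is shown below; the task and key names are illustrative.

```python
# Inside an upstream notebook task: publish a value for downstream tasks.
# (dbutils is provided by the Databricks notebook environment; key names are placeholders.)
dbutils.jobs.taskValues.set(key="model_quality_passed", value="true")

# Inside a downstream notebook task: read the value set by the upstream task.
passed = dbutils.jobs.taskValues.get(
    taskKey="evaluate_model",  # task_key of the upstream task (placeholder)
    key="model_quality_passed",
    default="false",
    debugValue="false",        # used when the notebook is run outside of a job
)
```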
As we did in our first task, we will need to define the compute cluster to execute our tasks.
| Compute | Benefits | Challenges |
| --- | --- | --- |
| Re-use an existing job cluster | No start-up time for subsequent tasks after initialisation; very cost-efficient | Potential resource contention on the cluster if many parallel tasks with demanding compute needs are running |
| Instantiate a different job cluster | Tailor instance types and cluster sizes to each workload; improved resource isolation across parallel tasks | Each new cluster needs time to start up |
| Leverage an all-purpose, interactive cluster (not recommended for production workloads!) | Can be re-used across multiple jobs to reduce start-up times | More expensive relative to job clusters; resource contention issues; does not automatically terminate when the job ends |
| Job Pools | Reduced start-up times within and across jobs, as VMs can be pulled from a cluster pool | Higher TCO |
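As a hedged illustration of re-using a job cluster across tasks, the sketch below defines a shared job cluster and references it from a task; the Databricks Runtime version, instance type, and cluster size are placeholders.

```python
# Illustrative sketch: define a job cluster once and share it across tasks.
# The runtime version, node type, and cluster size are placeholders and vary by cloud.
from databricks.sdk.service import compute, jobs

shared_cluster = jobs.JobCluster(
    job_cluster_key="shared_etl_cluster",
    new_cluster=compute.ClusterSpec(
        spark_version="15.4.x-scala2.12",  # placeholder Databricks Runtime version
        node_type_id="i3.xlarge",          # placeholder (AWS) instance type
        num_workers=4,
    ),
)

ingest = jobs.Task(
    task_key="ingest_sales_data",
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Shared/etl/ingest_sales_data"),
    job_cluster_key="shared_etl_cluster",  # re-use the shared job cluster
)
# The job cluster is passed via `job_clusters=[shared_cluster]` in w.jobs.create(...).
```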
Serverless Workflows is a fully managed service that brings substantial advantages to your data operations. With Serverless Workflows, you'll experience rapid start-up times, cost-efficient yet high-performance data processing, and automatic optimisations that can adapt seamlessly to your specific workloads. These benefits translate to a lower TCO, better reliability, and an improved user experience for more efficient, simpler, and cost-effective operations.
In the first part of our blog series, we've comprehensively examined the fundamentals of creating a Databricks Workflow. We've explored the Databricks Workflow UI, learning how to craft jobs, integrate tasks, configure dependencies, and address different compute scenarios to meet your data processing needs.
In Part 2 of our series, we'll dive even deeper into the intricacies of Databricks Workflows. We'll explore additional job-level configurations that can enhance the efficiency and reliability of your operations. We'll also delve into the monitoring and alerting capabilities that Workflows offers, helping you keep a close eye on your pipelines and detect issues early. Furthermore, we'll discuss various deployment options, ensuring that your data pipeline is not only functional but also seamlessly integrated into your organisation's infrastructure.