Introduction

Modern data pipelines can be complex, especially when dealing with massive volumes of data from diverse sources. Managing the processing of this data is not too dissimilar to the responsibilities of a conductor in an orchestra, coordinating each element of the pipeline to streamline the flow of data in harmony. 

Databricks Workflows offers a unified and streamlined approach to managing your Data, BI, and AI workloads. You can define data workflows through the user interface or programmatically – making it accessible to technical and non-technical teams. The platform's simplicity, reliability, ease of authoring, and price point (it is FREE!) empowers organisations of all sizes to tackle the challenges of data orchestration efficiently.

If you want to learn more about the pivotal role of orchestration in achieving success with your data pipelines, I recommend diving into the blog: ‘Why orchestration is your key to success for modern data pipelines.’

In this 2-part blog series, I will guide you through the core orchestration, operations and monitoring concepts of Databricks Workflows. I aim to illustrate the fundamentals of Databricks Workflows from the perspective of a Retail customer looking to run their product recommendation data pipeline.

  • Part 1: Creating your pipeline will focus on the basics of creating a data pipeline in Databricks Workflows. We will look at how to create jobs and tasks, establish control flows and dependencies, and address the different compute scenarios to meet your data processing needs.

  • Part 2: Monitoring your pipeline explores the additional job-level configurations that can improve your data pipeline's reliability and monitoring capabilities. We will also understand how to programmatically author and deploy Workflows and integrate with your CI/CD pipelines.

Although the underlying narrative of this blog is centred around orchestrating a product recommendation pipeline, the core concepts discussed are applicable across all industry verticals and use cases. 


Setting the Scene

You are part of a large retailer looking to provide more personalised customer experiences. The goal is to construct a cutting-edge product recommender system that can provide online and offline AI-driven insights. This new data capability will allow the retailer to create a holistic and data-driven ecosystem that personalises online interactions and informs strategic initiatives to elevate and improve the customer journey.


Your Data Engineering team has built the components of the data pipeline responsible for:

  • Ingesting data from external sources like ERPs and CRMs into the Bronze Layer
  • Cleansing and refining the data sets to ensure strong data quality in the Silver Layer
  • Aggregating and transforming data for formal business consumption in the Gold Layer

Your Data Analyst team wants to extract insights to better understand customer behaviour. The team has built:

  • Comprehensive dashboards for intuitive analysis of customer preferences and purchasing patterns
  • SQL reports and queries that provide valuable insights into specific customer groups, popular product categories, and the performance of the recommender system over time.
  • SQL alerts that trigger when a specific product category experiences a notable surge in popularity among a particular demographic, enabling the marketing team to respond quickly.

Your Data Science and Machine Learning team wants to build on past customer behaviour patterns and provide tailored recommendations that personalise and improve the customer experience. To do so, the team needs to:

  • Update and append customer and product feature tables for dynamic and evolving profiles.
  • Schedule regular model retraining sessions to incorporate the latest data, ensuring the recommender system adapts to changing patterns.
  • Perform batch scoring to generate inferences at scale for offline insights to shape strategic marketing campaigns better.
  • Provide a model serving endpoint that can operate in real-time to offer customers tailored product recommendations as they navigate digital platforms.

Your task is to integrate these components into a sophisticated, end-to-end data pipeline capable of orchestrating the Data Engineering, Data Warehousing & BI, and Machine Learning components - enabling your business to enhance the shopping experience and foster customer satisfaction and loyalty.

Example Databricks Workflow: Product Recommender

Putting it all together, our Databricks Workflow might look like the following:

[Image: example product recommendation Workflow spanning the Data Engineering, Data Warehousing & BI, and Machine Learning tasks]

  • Data Engineering: Our ETL processes begin with Bronze tables loading in parallel, followed by Silver and Gold tables only when their dependencies are successfully completed.

  • Data Warehousing and BI: Our SQL Dashboards will refresh once the Gold layer completes its load, providing analysts with up-to-date and clean data. Analysts can schedule SQL Reports to perform additional business transformations in parallel, with proactive alerting if the defined thresholds are breached.

  • Machine Learning and AI: Independently, as soon as the Gold layer completes, the Customer and Product Feature tables are updated or appended with new data. If both feature tables are successfully updated, model retraining, batch scoring and model serving can execute in parallel; otherwise, a clean-up and logging action will occur.
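To make the control flow above concrete, here is a minimal structural sketch of the same DAG expressed as a Databricks Jobs API (2.1) payload. The task keys and the exact task list are illustrative assumptions rather than the real pipeline, and task types and compute are omitted; the point is simply how task_key and depends_on encode the dependencies described above.

    # Hypothetical skeleton of the product recommender DAG (task types and compute omitted).
    # Task keys are made up for illustration.
    job_definition = {
        "name": "product-recommendation-pipeline",
        "tasks": [
            # Bronze loads run in parallel (no depends_on)
            {"task_key": "bronze_sales"},
            {"task_key": "bronze_customers"},
            # Silver tables wait on their Bronze dependencies
            {"task_key": "silver_sales", "depends_on": [{"task_key": "bronze_sales"}]},
            {"task_key": "silver_customers", "depends_on": [{"task_key": "bronze_customers"}]},
            # Gold waits on all Silver tables
            {"task_key": "gold_recommendations", "depends_on": [
                {"task_key": "silver_sales"},
                {"task_key": "silver_customers"},
            ]},
            # BI and ML branches fan out from Gold independently
            {"task_key": "refresh_dashboard", "depends_on": [{"task_key": "gold_recommendations"}]},
            {"task_key": "update_customer_features", "depends_on": [{"task_key": "gold_recommendations"}]},
            {"task_key": "update_product_features", "depends_on": [{"task_key": "gold_recommendations"}]},
            # Retraining only runs when both feature tables have been updated
            {"task_key": "model_retraining", "depends_on": [
                {"task_key": "update_customer_features"},
                {"task_key": "update_product_features"},
            ]},
        ],
    }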

Create your first Databricks Workflow

Step #1 - Create a Databricks job

A Databricks job allows you to execute data processing and analysis tasks within a Databricks workspace. 

To create your first Databricks job:

  • Navigate to Databricks Workflows by clicking on ‘Workflows’ on the left sidebar
  • Click the ‘Create job’ button on the right-hand side of the window.
  • Input a name for your Databricks job in the top left corner in place of the “Add a name for your job…” prompt.

    In my case, I have chosen to exercise my creative side with ‘product-recommendation-pipeline’.
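If you would rather script this step than click through the UI, the same job can be created with a single call to the Jobs API. The sketch below is an assumption-laden illustration: the environment variables, notebook path, runtime version, and node type are placeholders you would swap for your own values.

    import os
    import requests

    # Placeholders: a workspace URL such as https://<workspace>.cloud.databricks.com
    # and a personal access token, read from environment variables.
    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]

    payload = {
        "name": "product-recommendation-pipeline",
        "tasks": [
            {
                "task_key": "bronze_sales",
                "notebook_task": {"notebook_path": "/Workspace/pipelines/bronze_sales"},
                "new_cluster": {                      # a small job cluster for this task
                    "spark_version": "13.3.x-scala2.12",
                    "node_type_id": "i3.xlarge",      # cloud-specific instance type
                    "num_workers": 2,
                },
            }
        ],
    }

    # POST /api/2.1/jobs/create returns the new job's job_id
    resp = requests.post(
        f"{host}/api/2.1/jobs/create",
        headers={"Authorization": f"Bearer {token}"},
        json=payload,
    )
    resp.raise_for_status()
    print("Created job_id:", resp.json()["job_id"])

Part 2 of this series covers programmatic authoring and CI/CD integration in more depth.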



Step #2 - Create your first task

A Databricks job comprises one or more Databricks tasks – a specific unit of work representing an individual step or action. Jobs can consist of just a single task, or they can be an intricate workflow of multiple tasks chained together by dependencies. 

Our product recommender pipeline will be a multi-task job composed of many parts that span data engineering, data warehousing and machine learning. 

Databricks Workflows supports the execution of many different types of tasks, including but not limited to: Notebooks, JARs, Python Scripts and Wheels, and SQL Files. You can also define a Delta Live Tables (DLT) pipeline as a task or orchestrate dbt jobs natively. If you follow a modular, parameter-driven design approach, you can also “run a job as a task”, executing one job from within another.

You can read more about the different supported task types in our documentation.  

As the Data Engineering team built their ETL pipelines in Databricks Notebooks, our first task will be of type Notebook. These Notebooks can reside either in the Workspace or can be sourced from a remote Git repository.



To create our Notebook task:

  • Provide the task name in the ‘Task name’ field.
  • In the Type dropdown menu, select Notebook.
  • In the Source field, select where the Notebook code resides:
    • Workspace for a notebook located in a Databricks workspace folder
    • Git provider for a notebook located in a remote Git repository
  • In the Cluster field, we will provide the details of the compute cluster responsible for the execution of the task.
    • Job clusters - these clusters are designed for executing tasks and jobs in a production environment. They are recommended for production use cases over Interactive Clusters as they are approximately 50% cheaper. Additionally, they terminate when the job ends, reducing resource usage and costs.
    • Interactive clusters - these are tailored for data exploration, ad hoc queries, and development tasks. They provide an interactive environment for data practitioners to perform ad hoc analysis, data exploration, or development. Interactive should ideally not be used in Production as they are not cost-efficient.
  • As this is our first task, we will create a new job cluster by clicking “Add new job cluster”.
    • We recommend you select the latest Long Term Support (LTS) Databricks Runtime (DBR) for your Production workloads.
    • We also recommend you Enable Photon Acceleration. Photon is a C++ vectorised execution engine that provides faster performance for your Spark and SQL workloads.

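For reference, these UI choices (a Notebook task sourced from the Workspace, running on a new job cluster with an LTS runtime and Photon enabled) roughly map to the Jobs API fields sketched below. The runtime version, node type, and notebook path are assumptions for illustration; choose values appropriate to your cloud and workspace.

    # Hypothetical sketch: a shared job cluster plus a Notebook task that runs on it.
    job_definition = {
        "name": "product-recommendation-pipeline",
        "job_clusters": [
            {
                "job_cluster_key": "etl_cluster",
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # an LTS Databricks Runtime
                    "node_type_id": "i3.xlarge",          # cloud-specific instance type
                    "num_workers": 4,
                    "runtime_engine": "PHOTON",           # Photon acceleration
                },
            }
        ],
        "tasks": [
            {
                "task_key": "bronze_sales",
                "job_cluster_key": "etl_cluster",         # run on the shared job cluster
                "notebook_task": {
                    "notebook_path": "/Workspace/pipelines/bronze_sales",
                    "source": "WORKSPACE",                # or "GIT" with a git_source block on the job
                },
            }
        ],
    }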

Step #3 - Additional task configurations

You can apply additional task-level configurations that provide fine-grained control and flexibility to meet the execution and orchestration requirements of each task.

  • Dependent Libraries: Specify additional libraries required for your task's execution, ensuring the necessary resources are installed and available when the task runs.

  • Parameters: Depending on the task type, you can pass parameters as key-value pairs, JSON, or positional arguments. For example, Notebooks accept JSON or key-value pairs via the UI, while Python scripts take positional arguments as a list. Parameters let you customise the behaviour of your tasks without creating multiple copies of the same task with different settings. You can also dynamically set and retrieve parameter values across tasks to build more mature and sophisticated data pipelines (see the sketch after this list).

  • Depends On and Run Ifs: Task dependencies allow you to establish control flows within a job, ensuring a task only runs when its dependencies have finished executing. With Run Ifs, you can configure tasks to run under specific conditions; for example, you can execute a task even if some or all of its dependencies have failed, enabling your job to recover from failures and maintain uninterrupted execution.

  • Notifications: Receive updates when your task starts, succeeds, fails, or breaches defined duration thresholds. You can configure alerts to be delivered via Email or other communication channels like Slack, Teams, PagerDuty, and more, providing real-time observability into your job's execution.

  • Task retries: Specify how many times a task should be retried in case of a failure. This enhances the reliability of your workflows by automatically attempting to recover from transient issues.

  • Duration threshold: Define execution time limits for a task, either to warn you if the task runs longer than expected or to alert and terminate the task if it runs beyond the maximum set completion time.
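To illustrate the "dynamically set and retrieve parameter values across tasks" point above, here is a small sketch using task values inside Databricks Notebook tasks. The key names, values, and upstream task key are made up for illustration; dbutils is only available inside Databricks notebooks.

    # In an upstream Notebook task (e.g. the Gold aggregation step):
    # publish a value that downstream tasks in the same job run can read.
    dbutils.jobs.taskValues.set(key="rows_loaded", value=125000)

    # In a downstream Notebook task: read the value by upstream task key.
    # debugValue is used when the notebook runs interactively, outside a job.
    rows_loaded = dbutils.jobs.taskValues.get(
        taskKey="gold_recommendations",
        key="rows_loaded",
        default=0,
        debugValue=0,
    )
    print(f"Upstream task loaded {rows_loaded} rows")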

Once you have set these, you can click Create Task!

Step #4 - Repeat for your other tasks

Repeat Steps #2 and #3 for the rest of the tasks in your orchestration workflow, selecting the appropriate task type.

SQL Workloads

Databricks Workflows allows you to orchestrate different capabilities under SQL-based workloads.

  • To enable our firm’s business users and decision-makers to have access to the most up-to-date data, we will select SQL > Dashboard and provide the name of the Dashboard built by our Data Analyst that needs to be updated or refreshed as soon as the reporting objects are loaded.

  • We can embed our Data Analysts' SQL-based reports by adding a SQL > Query task, or a SQL > File task versioned in a Git repository.

    For example, a report might calculate the percentage growth of different product categories over time. 

  • Building on that, we can add a SQL > Alert task type to evaluate the results and proactively notify us when a product category has grown by more than 50%.
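In API terms, each of these SQL workloads maps to a sql_task that runs against a Databricks SQL warehouse. A hedged sketch follows; all IDs and the warehouse ID are placeholders you would look up from your own dashboard, query, and alert objects.

    # Hypothetical sketches of the three SQL task flavours described above.
    sql_tasks = [
        {   # Refresh the analyst dashboard once the Gold layer has loaded
            "task_key": "refresh_sales_dashboard",
            "depends_on": [{"task_key": "gold_recommendations"}],
            "sql_task": {
                "warehouse_id": "<sql-warehouse-id>",
                "dashboard": {"dashboard_id": "<dashboard-id>"},
            },
        },
        {   # Run the category growth report as a saved SQL query
            "task_key": "category_growth_report",
            "depends_on": [{"task_key": "gold_recommendations"}],
            "sql_task": {
                "warehouse_id": "<sql-warehouse-id>",
                "query": {"query_id": "<query-id>"},
            },
        },
        {   # Evaluate the alert after the report has refreshed
            "task_key": "category_surge_alert",
            "depends_on": [{"task_key": "category_growth_report"}],
            "sql_task": {
                "warehouse_id": "<sql-warehouse-id>",
                "alert": {"alert_id": "<alert-id>"},
            },
        },
    ]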

Machine Learning Workloads

Machine Learning workloads are also typically written in Databricks Notebooks. From within a Notebook, you can perform feature engineering, model retraining, and batch scoring. However, if your pipeline is written as Python files or wheels, these code assets do not need to be converted to Notebooks and can be orchestrated natively as their own task types.

To schedule your model retraining on a weekly basis even if your pipeline runs on a different cadence (let’s say daily), we can add an If/Else Condition task and specify the boolean expression below:

{{job.start_time.[iso_weekday]}} == 7 

This will evaluate to “True” only if the day of the week is Sunday (ISO weekday 7).
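As a sketch, the same weekly gate could be expressed in a job definition as a condition task, with the retraining task depending on its “true” outcome. The field names below reflect my understanding of the Jobs API If/Else Condition task and should be treated as assumptions.

    # Hypothetical sketch: gate weekly model retraining behind an If/Else Condition task.
    weekly_retraining_gate = [
        {
            "task_key": "is_it_sunday",
            "condition_task": {
                "op": "EQUAL_TO",
                "left": "{{job.start_time.[iso_weekday]}}",  # same expression as above
                "right": "7",
            },
        },
        {
            "task_key": "model_retraining",
            # Runs only when the condition task evaluated to "true"
            "depends_on": [{"task_key": "is_it_sunday", "outcome": "true"}],
            "notebook_task": {"notebook_path": "/Workspace/pipelines/retrain_model"},
        },
    ]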

Step #5 - Define dependencies and control flows

For tasks dependent on another task's outcome, you can specify these relationships in the ‘Depends on’ field by selecting the desired task name. Based on these configurations, Databricks Workflows will automatically schedule and trigger tasks as soon as their dependencies are met.

  • If a task has no dependencies on another task, Workflows will schedule and execute these tasks independently and in parallel.

  • If a task has one or more dependencies, it will be executed only after the dependent tasks have met the desired outcome.

Example: Our dimension tables in our product recommendation pipeline are not dependent on one another and can therefore be orchestrated and loaded in parallel. However, we only want our Fact table to run once all the dependent Dimension tables have been successfully loaded.


Based on the provided dependencies, Databricks Workflows will be able to automatically determine the correct and efficient order to execute tasks.


By default, a Databricks Workflow task will only run if all of its dependencies complete successfully. However, tasks can also fail or be excluded. If a failure occurs, the workflow stops immediately and the errors are raised. In such a scenario, you might need to catch these errors and perform certain automated activities.

With the Run If dependencies field, you can define processes that are triggered only when specific conditions are met.

 

Example: We want to perform custom logging and clean-up if even one of our dependencies fails. We can do so by changing the Run If dependencies from “All succeeded” to “At least one failed” for our designated “Log and Clean Up” Databricks task.
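In a job definition, this behaviour corresponds to setting the task's run_if field alongside its dependencies. A minimal sketch, with task keys assumed from the example:

    # Hypothetical sketch: run the clean-up task only if at least one feature-table task failed.
    log_and_clean_up = {
        "task_key": "log_and_clean_up",
        "depends_on": [
            {"task_key": "update_customer_features"},
            {"task_key": "update_product_features"},
        ],
        "run_if": "AT_LEAST_ONE_FAILED",   # the default is "ALL_SUCCESS"
        "notebook_task": {"notebook_path": "/Workspace/pipelines/log_and_clean_up"},
    }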


For more dynamic, parameter-driven control flows, you can create an “If/else condition” task type to create branching logic based on the results of a Boolean expression. You can evaluate based on a job, task parameter variable, or task value.

Step #6 - Define compute clusters for your tasks

As we did in our first task, we will need to define the compute cluster to execute our tasks. 

  • Re-use an existing job cluster
    • Benefits: no start-up time for subsequent tasks after initialisation; very cost-efficient.
    • Challenges: potential resource contention on the cluster if many parallel tasks with demanding compute needs are running.

  • Instantiate a different job cluster
    • Benefits: tailor instance types and cluster sizes to your workload; improve resource isolation across parallel tasks.
    • Challenges: need to wait for a new cluster to start up.

  • Leverage an all-purpose, interactive cluster (not recommended for production workloads!)
    • Benefits: can be re-used across multiple jobs to reduce start-up times.
    • Challenges: more expensive relative to job clusters; resource contention issues; does not automatically terminate when the job ends.

  • Job Pools
    • Benefits: reduce start-up times within and across jobs, as VMs can be pulled from a Cluster Pool.
    • Challenges: higher TCO, as VM costs are incurred to keep instances warm.
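The first three options above map to mutually exclusive compute fields on a task: job_cluster_key to re-use a shared job cluster defined on the job, new_cluster to spin up a dedicated job cluster for that task, or existing_cluster_id to attach to an all-purpose cluster. A brief, hypothetical sketch with placeholder values:

    # Hypothetical sketch of the compute choices, one per task. Values are placeholders.
    tasks_compute_options = [
        {   # 1. Re-use a shared job cluster declared under the job's "job_clusters" list
            "task_key": "silver_sales",
            "job_cluster_key": "etl_cluster",
            "notebook_task": {"notebook_path": "/Workspace/pipelines/silver_sales"},
        },
        {   # 2. Instantiate a dedicated job cluster sized for this task only
            "task_key": "model_retraining",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.2xlarge",
                "num_workers": 8,
                # For Job Pools, set "instance_pool_id" here to draw warm VMs from a Cluster Pool
            },
            "notebook_task": {"notebook_path": "/Workspace/pipelines/retrain_model"},
        },
        {   # 3. Attach to an all-purpose (interactive) cluster - not recommended in production
            "task_key": "ad_hoc_analysis",
            "existing_cluster_id": "<all-purpose-cluster-id>",
            "notebook_task": {"notebook_path": "/Workspace/analysis/ad_hoc"},
        },
    ]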

Coming soon: Serverless Workflows!

Serverless Workflows is a fully managed service that brings substantial advantages to your data operations. With Serverless Workflows, you'll experience rapid start-up times, cost-efficient yet high-performance data processing, and automatic optimisations that can adapt seamlessly to your specific workloads. These benefits translate to a lower TCO, better reliability, and an improved user experience for more efficient, simpler, and cost-effective operations.

Conclusion

In the first part of our blog series, we've comprehensively examined the fundamentals of creating a Databricks Workflow. We've explored the Databricks Workflow UI, learning how to craft jobs, integrate tasks, configure dependencies, and address different compute scenarios to meet your data processing needs. 

In Part 2 of our series, we'll dive even deeper into the intricacies of Databricks Workflows. We'll explore additional job-level configurations that can enhance the efficiency and reliability of your operations. We'll also delve into the monitoring and alerting capabilities that Workflows offers, helping you keep a close eye on your pipelines and detect issues early. Furthermore, we'll discuss various deployment options, ensuring that your data pipeline is not only functional but also seamlessly integrated into your organisation's infrastructure.