Authors: Shasidhar Eranti (@Shasidhar_ES) & Kiran Sreekumar (@Kiran_Sreekumar)
In today's business landscape, data stands as the primary differentiator for companies seeking a competitive edge through analytics and AI. To keep pace with demand, organizations must take a modern approach to data engineering, which includes a modern approach to data orchestration. As data usage becomes increasingly widespread and sophisticated, the size, volume, and complexity of data pipelines are on the rise. According to a Gartner report, this heightened complexity in data processing pipelines can be attributed to several key factors:
Figure 1: Complexities of modern data orchestration
Given the complexity arising from these factors, the demand for modern data orchestration has never been more pronounced. Modern orchestration is essential to efficiently manage and leverage the full potential of these intricate data pipelines.
Modern data orchestration empowers data teams to efficiently create, execute, oversee, and analyze data-driven workflows that transform data into valuable insights. It offers seamless support for different workloads, including batch processing, streaming, data engineering, and ML/AI pipelines, delivering improved performance while reducing costs, all without compromising on reliability, simplicity, and accessibility within your data platform.
Many companies currently rely on third-party, open-source, or custom in-house solutions for their data orchestration needs. However, these approaches introduce their own set of challenges, which include:
Additionally, a recent IDC survey revealed that many organizations employ at least 10 different tools for data engineering, highlighting the emerging challenge of managing this complexity.
Databricks Workflows addresses these issues by serving as the Lakehouse Orchestrator. It is specifically designed to simplify and streamline the orchestration of data pipelines in the evolving data processing landscape.
Databricks Workflows is the orchestration solution built natively into the Databricks Lakehouse platform, and it directly addresses the shortcomings of current orchestration practices in the industry.
Figure 2: Databricks Lakehouse platform architecture
The key advantages of Databricks Workflows are:
Simple authoring: Workflows comes out of the box with every Databricks deployment. Any Databricks user can start building data pipelines and scheduling jobs through the Workflows UI in just a few clicks. For more advanced users, Workflows also supports end-to-end CI/CD via Databricks' built-in IDE integration (the VS Code extension).
Figure 3: Authoring Workflows in the Databricks workspace
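For users who prefer to author jobs as code rather than through the UI, the same job can be defined programmatically. Below is a minimal sketch using the Databricks SDK for Python; the job name, notebook path, and cluster settings are hypothetical placeholders, not part of the original example.

```python
# A minimal sketch: creating a single-task job with the Databricks SDK for Python.
# The notebook path, cluster settings, and job name are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()  # reads credentials from the environment or ~/.databrickscfg

job = w.jobs.create(
    name="fraud-detection-ingestion",
    tasks=[
        jobs.Task(
            task_key="ingest_transactions",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/ingest"),
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",  # placeholder runtime
                node_type_id="i3.xlarge",          # placeholder node type
                num_workers=2,
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```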
Actionable insights: Because Databricks Workflows is deeply integrated into the platform, it provides much deeper monitoring and observability than external orchestration tools such as Apache Airflow. Users get job metrics and operational metadata for every job they execute, and out-of-the-box visual monitoring helps you quickly see the health of your jobs and effectively troubleshoot failures.
Figure 4: Monitoring job run metrics in the Workflows dashboard
Proven reliability: Databricks Workflows is a fully managed orchestration service with a proven 99.95% uptime, provided as part of the Databricks Lakehouse Platform at no additional cost. It is already trusted by thousands of customers and runs millions of production workloads every day. We are also introducing Serverless Jobs Compute for Workflows, which will further improve orchestration by removing the overhead of managing compute resources for you.
To demonstrate the power of some of the key features included in Databricks Workflows, let’s look at a fraud detection use case.
Figure 5: Fraud detection pipeline orchestrated by Databricks Workflows
The pipeline above shows a real-time fraud detection use case built on the Databricks Lakehouse platform. In this example, a financial institution collects transactional data from multiple source applications and ingests it into the Bronze layer of the medallion architecture. To facilitate real-time fraud detection, a Delta Live Tables pipeline ingests data in near real time from the different source applications. The data is then transformed and cleansed by applying data quality rules as it moves to the Silver and Gold layers. A machine learning model trained on this data is used to detect fraudulent transactions in the real-time data flowing through the pipeline.
The use case is built as three separate Jobs:
The Workflow needs to be orchestrated based on the following requirements.
Let's dive deeper into how we can implement these orchestration requirements using Databricks Workflows.
Continuous stream processing and transformation
Streaming data ingestion is built using Streaming Tables and Delta Live Tables, a declarative framework for building reliable, maintainable, and testable data processing pipelines. In a real-time fraud detection use case, data must be ingested as soon as it arrives. To do this, Databricks Workflows uses File Arrival Triggers, which automatically initiate the ingestion process when new datasets land in the designated storage locations.
Figure 6: Setting up a File Arrival Trigger in Workflows
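As a rough sketch of the equivalent API configuration, the snippet below attaches a file arrival trigger to an existing job using the Databricks SDK for Python; the job id and storage URL are placeholder assumptions.

```python
# A hedged sketch: adding a file arrival trigger to an existing job.
# The job id and monitored storage location are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.update(
    job_id=123,  # hypothetical job id
    new_settings=jobs.JobSettings(
        trigger=jobs.TriggerSettings(
            pause_status=jobs.PauseStatus.UNPAUSED,
            file_arrival=jobs.FileArrivalTriggerConfiguration(
                url="s3://fraud-landing-zone/transactions/",  # placeholder landing path
                min_time_between_triggers_seconds=60,         # debounce between triggers
            ),
        )
    ),
)
```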
Databricks Workflows also supports Continuous Jobs to process incoming transactions in near real-time, ensuring that potential fraud is detected promptly.
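As a rough illustration of the configuration involved, a job can be switched to continuous mode with a single settings update; the sketch below uses the Databricks SDK for Python and a hypothetical job id.

```python
# A minimal sketch: running a job in continuous mode so the processing task is
# restarted automatically. The job id is a placeholder.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        continuous=jobs.Continuous(pause_status=jobs.PauseStatus.UNPAUSED)
    ),
)
```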
Any Delta Live Tables pipeline can be orchestrated using Workflows, and its compute resources scale with data volume and processing requirements through enhanced autoscaling.
Data transformations, including aggregations, are managed using dbt. Databricks Workflows integrates with dbt and effectively orchestrates the ingestion and transformation pipeline.
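Below is a hedged sketch of what a dbt task in a job definition can look like, using the Databricks SDK for Python; the warehouse id, cluster id, and dbt commands are placeholder assumptions, and in practice the dbt project would typically come from the job's Git source.

```python
# A hedged sketch of a dbt task orchestrated by Workflows.
# Warehouse id, cluster id, and commands are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

dbt_task = jobs.Task(
    task_key="dbt_transformations",
    dbt_task=jobs.DbtTask(
        commands=["dbt deps", "dbt run"],   # dbt commands run in order
        warehouse_id="<sql-warehouse-id>",  # SQL warehouse that executes the models
    ),
    libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="dbt-databricks"))],
    existing_cluster_id="<cluster-id>",     # cluster that runs the dbt CLI
)

w.jobs.create(name="fraud-dbt-transformations", tasks=[dbt_task])
```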
Workflows supports all major task types, shown below:
The pipeline includes data quality checks and machine learning stages. Conditional execution ensures that the machine learning step is executed only if the data quality check passes.
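As an illustration of how such a branch can be expressed, the hedged sketch below (Databricks SDK for Python) defines a condition task that checks a data quality result published as a task value and gates the ML scoring task on its outcome; the task names, value key, and notebook path are hypothetical.

```python
# A hedged sketch of conditional execution: the ML task runs only when the
# condition task evaluates to true. Task names and the value key are placeholders.
from databricks.sdk.service import jobs

quality_gate = jobs.Task(
    task_key="quality_gate",
    condition_task=jobs.ConditionTask(
        op=jobs.ConditionTaskOp.EQUAL_TO,
        left="{{tasks.data_quality_check.values.status}}",  # task value set upstream
        right="PASSED",
    ),
    depends_on=[jobs.TaskDependency(task_key="data_quality_check")],
)

score_transactions = jobs.Task(
    task_key="score_transactions",
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/score"),
    # Runs only when the condition task's outcome is "true"; compute settings omitted.
    depends_on=[jobs.TaskDependency(task_key="quality_gate", outcome="true")],
)
```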
Databricks Workflows allows values to be passed between tasks using the task parameterization feature, and parameters can also be shared across all tasks of a job using job parameters.
Figure 7: Adding a Job parameter in Workflows
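Below is a minimal sketch of the task-value mechanics inside the notebooks themselves; the task name and key are hypothetical. Job-level parameters, as configured in the figure above, are referenced in task code with the {{job.parameters.<name>}} syntax.

```python
# A minimal sketch of passing values between tasks with task values.
# Inside the upstream "data_quality_check" notebook task:
dbutils.jobs.taskValues.set(key="status", value="PASSED")

# Inside a downstream notebook task, read the value back. debugValue is used
# when the notebook is run interactively outside a job run.
status = dbutils.jobs.taskValues.get(
    taskKey="data_quality_check", key="status", default="FAILED", debugValue="PASSED"
)
```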
Improved performance and cost-efficiency
Databricks Workflows optimizes resource usage by reusing clusters. When multiple tasks in a job are configured to reuse a cluster, the cluster is started only when the first task runs. The same cluster is then used across the tasks, eliminating the startup time of spinning up an individual cluster for each task. This reduces both overall run time and cost.
Figure 8: Reusing clusters across tasks within a Workflow
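A hedged sketch of this configuration with the Databricks SDK for Python: the job declares one job cluster and both tasks reference it via job_cluster_key, so the cluster starts once and is shared. The cluster settings, notebook paths, and job name are placeholders.

```python
# A hedged sketch of cluster reuse across tasks within a job.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

shared_cluster = jobs.JobCluster(
    job_cluster_key="shared_cluster",
    new_cluster=compute.ClusterSpec(
        spark_version="14.3.x-scala2.12",  # placeholder runtime
        node_type_id="i3.xlarge",          # placeholder node type
        num_workers=4,
    ),
)

w.jobs.create(
    name="fraud-feature-pipeline",
    job_clusters=[shared_cluster],
    tasks=[
        jobs.Task(
            task_key="build_features",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/features"),
            job_cluster_key="shared_cluster",
        ),
        jobs.Task(
            task_key="train_model",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/train"),
            depends_on=[jobs.TaskDependency(task_key="build_features")],
            job_cluster_key="shared_cluster",  # same cluster, no second startup
        ),
    ],
)
```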
Monitoring and Alerting
Job run metrics are captured and visualized in the Workflows user interface, making it easier for support teams to monitor Workflow status.
Figure 9: Job run metrics in the workflows UI
Job queueing ensures jobs are not skipped when concurrency limits are hit for the Databricks workspace.
Duration thresholds can optionally be configured for each individual task as well as for the Workflow as a whole, with both warning and timeout levels, so that alerts are triggered when a task or job run exceeds its pre-configured time threshold.
Figure 10: Defining job duration thresholds in a Workflow
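A hedged sketch of these controls through the Jobs API, using the Databricks SDK for Python: a health rule warns when a run exceeds one hour, timeout_seconds cancels it after two, and queueing from the earlier paragraph is enabled. The job id and threshold values are placeholders.

```python
# A hedged sketch: duration warning, hard timeout, and job queueing on one job.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.update(
    job_id=123,  # hypothetical job id
    new_settings=jobs.JobSettings(
        health=jobs.JobsHealthRules(
            rules=[
                jobs.JobsHealthRule(
                    metric=jobs.JobsHealthMetric.RUN_DURATION_SECONDS,
                    op=jobs.JobsHealthOperator.GREATER_THAN,
                    value=3600,  # send a duration warning after 1 hour
                )
            ]
        ),
        timeout_seconds=7200,                    # cancel the run after 2 hours
        queue=jobs.QueueSettings(enabled=True),  # queue runs instead of skipping them
    ),
)
```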
In the event of a job failure due to data issues or resource constraints, the repair-and-rerun feature allows rerunning only the failed tasks, eliminating the need to rerun the entire pipeline. Using the notification feature, alerts can be sent to support teams via email, PagerDuty, Slack, Microsoft Teams, or webhooks to keep the team informed of any failures that require manual intervention.
Figure 11: Setting up alerts in a Workflow
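Below is a hedged sketch of both features via the Databricks SDK for Python; the run id, job id, task key, and email address are placeholders.

```python
# A hedged sketch of repair-and-rerun plus failure email alerts.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Rerun only the failed task(s) of an existing run; successful tasks are not repeated.
w.jobs.repair_run(run_id=456, rerun_tasks=["score_transactions"])

# Notify the support team by email whenever a run of this job fails.
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        email_notifications=jobs.JobEmailNotifications(on_failure=["fraud-ops@example.com"])
    ),
)
```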
Job Modularity
The ingestion, transformation, and machine learning model training and evaluation stages are each encapsulated as a separate job for ease of maintenance and logical separation. With the Run Job task type, the entire fraud detection Workflow can be modularized while still being tracked and monitored as a single, unified Workflow, as shown in the sketch below.
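Here is a hedged sketch of such a parent job built from Run Job tasks with the Databricks SDK for Python; the child job ids, task keys, and job name are placeholders.

```python
# A hedged sketch of job modularity: a parent job chains three child jobs by id.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="fraud-detection-e2e",
    tasks=[
        jobs.Task(
            task_key="ingestion",
            run_job_task=jobs.RunJobTask(job_id=101),  # hypothetical child job id
        ),
        jobs.Task(
            task_key="transformation",
            run_job_task=jobs.RunJobTask(job_id=102),
            depends_on=[jobs.TaskDependency(task_key="ingestion")],
        ),
        jobs.Task(
            task_key="model_training",
            run_job_task=jobs.RunJobTask(job_id=103),
            depends_on=[jobs.TaskDependency(task_key="transformation")],
        ),
    ],
)
```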
Upon detecting potential fraud, the pipeline triggers reporting and alerting tasks. These tasks generate reports for investigators and send alerts for immediate action. The entire reporting process can be orchestrated within the Workflow.
Streamlining deployments using CI/CD and Automation
Pipeline configurations, code, and transformations are version-controlled using the Workflows Git integration, which supports the following Git providers:
Figure 12: Setting up git integration in Workflows
The entire Workflow, including infrastructure provisioning, can be defined as code using Terraform, enabling consistent and reproducible deployments across environments. Databricks Asset Bundles encapsulate the entire pipeline, including notebooks, libraries, and configurations, which simplifies packaging, sharing, and deploying the fraud detection system. For platforms where external tools such as Airflow or ADF handle orchestration, Workflows integrates with them through its APIs to provide a seamless experience.
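For the external-orchestration scenario, here is a hedged sketch of what an Airflow or ADF step might do: trigger a Workflows job through the Jobs API with the Databricks SDK for Python and wait for the result. The job id and parameter are placeholders, and in practice the Apache Airflow Databricks provider or ADF's native Databricks activities wrap this same API.

```python
# A minimal sketch of triggering a Workflows job from an external orchestrator.
# Assumes the job defines a "run_date" job parameter; the job id is hypothetical.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Start the run, optionally overriding job-level parameters, and block until it finishes.
run = w.jobs.run_now(job_id=123, job_parameters={"run_date": "2024-01-01"}).result()
print(run.run_id, run.state.result_state)
```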
In summary, Databricks Workflows plays a pivotal role in orchestrating an end-to-end data pipeline for multiple use cases. From ingesting transaction data to real-time processing, model training, reporting, and alerting, every step is seamlessly coordinated. The use of features such as file arrival triggers, conditional execution, and Git integration ensures efficiency, reliability, and scalability. Be sure to continue with the next blog post of the series, in which we will discuss and deep dive into the basics of Workflows.