Authors: Shasidhar Eranti (@Shasidhar_ES) & Kiran Sreekumar (@Kiran_Sreekumar)
In today's business landscape, data stands as the primary differentiator for companies seeking a competitive edge through analytics and AI. To keep pace with demand, organizations must take a modern approach to data engineering, which includes a modern approach to data orchestration. As data usage becomes increasingly widespread and sophisticated, the size, volume, and complexity of data pipelines are on the rise. According to a Gartner report, this heightened complexity in data processing pipelines can be attributed to several key factors:
Figure 1: Complexities of modern data orchestration
Given the complexity arising from these factors, the demand for modern data orchestration has never been more pronounced. Modern orchestration is essential to efficiently manage and leverage the full potential of these intricate data pipelines.
Modern data orchestration empowers data teams to efficiently create, execute, oversee, and analyze data-driven workflows that transform data into valuable insights. It offers seamless support for different workloads, including batch processing, streaming, data engineering, and ML/AI pipelines, delivering improved performance while reducing costs, all without compromising on reliability, simplicity, and accessibility within your data platform.
Many companies currently rely on third-party, open-source, or custom in-house solutions for their data orchestration needs. However, these approaches introduce their own set of challenges, which include:
Additionally, a recent IDC survey revealed that many organizations employ at least 10 different tools for data engineering, highlighting the emerging challenge of managing this complexity.
Databricks Workflows addresses these issues by serving as the Lakehouse Orchestrator. It is specifically designed to simplify and streamline the orchestration of data pipelines in the evolving data processing landscape.
Databricks Workflows is the orchestration solution built natively into the Databricks Lakehouse platform, and it directly addresses the shortcomings of current orchestration practices in the industry.
Figure 2: Databricks Lakehouse platform architecture
The key advantages of Databricks Workflows are:
Simple authoring: Workflows comes out of the box with every Databricks deployment. Any Databricks user can start building data pipelines and scheduling jobs through the Workflows UI in just a few clicks. For more advanced users, Workflows also supports end-to-end CI/CD via Databricks' built-in IDE integration (the VS Code extension).
Figure 3: Authoring Workflows in the Databricks workspace
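For users who prefer to author jobs as code rather than through the UI, the same job can be defined programmatically. Below is a minimal sketch using the Databricks SDK for Python; the job name, notebook path, and cluster settings are hypothetical placeholders, not part of the original example.

```python
# A minimal sketch: creating a single-task job with the Databricks SDK for Python.
# The notebook path, cluster settings, and job name are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()  # reads credentials from the environment or ~/.databrickscfg

job = w.jobs.create(
    name="fraud-detection-ingestion",
    tasks=[
        jobs.Task(
            task_key="ingest_transactions",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/ingest"),
            new_cluster=compute.ClusterSpec(
                spark_version="14.3.x-scala2.12",  # placeholder runtime
                node_type_id="i3.xlarge",          # placeholder node type
                num_workers=2,
            ),
        )
    ],
)
print(f"Created job {job.job_id}")
```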
Actionable insights: Because Databricks Workflows is deeply integrated into the platform, it provides much deeper monitoring and observability than external orchestration tools such as Apache Airflow. Users get job metrics and operational metadata for every job they execute, and out-of-the-box visual monitoring helps you quickly see the health of your jobs and effectively troubleshoot failures.
Figure 4: Monitoring job run metrics in the Workflows dashboard
Proven reliability: Databricks Workflows is a fully managed orchestration service with a proven 99.95% uptime, provided as part of the Databricks Lakehouse Platform at no additional cost. It is already trusted by thousands of customers and runs millions of production workloads every day. We are also introducing Serverless Jobs Compute for Workflows, which will further improve orchestration by removing the overhead of managing compute resources for you.
To demonstrate the power of some of the key features included in Databricks Workflows, let’s look at a fraud detection use case.
Figure 5: Fraud detection pipeline orchestrated by Databricks Workflows
The pipeline above shows a real-time fraud detection use case built on the Databricks Lakehouse platform. In this example, a financial institution collects transactional data from multiple source applications and ingests it into the Bronze layer of the medallion architecture. To facilitate real-time fraud detection, a Delta Live Tables pipeline ingests data in near real time from the different source applications. The data is then transformed and cleansed by applying data quality rules as it moves to the Silver and Gold layers. A machine learning model trained on this data is used to detect fraudulent transactions in the real-time data flowing through the pipeline.
The use case is built as three separate Jobs:
The Workflow needs to be orchestrated based on the following requirements.
Let's dive deeper into how we can implement these orchestration requirements using Databricks Workflows.
Continuous stream processing and transformation
Streaming data ingestion is built using Streaming Tables and Delta Live Tables, a declarative framework for building reliable, maintainable, and testable data processing pipelines. In a real-time fraud detection use case, data must be ingested as soon as it arrives. To do this, Databricks Workflows uses File Arrival Triggers, which automatically initiate the ingestion process when new datasets land in the designated storage locations.
Figure 6: Setting up a File Arrival Trigger in Workflows
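As a rough sketch of the equivalent API configuration, the snippet below attaches a file arrival trigger to an existing job using the Databricks SDK for Python; the job id and storage URL are placeholder assumptions.

```python
# A hedged sketch: adding a file arrival trigger to an existing job.
# The job id and monitored storage location are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.update(
    job_id=123,  # hypothetical job id
    new_settings=jobs.JobSettings(
        trigger=jobs.TriggerSettings(
            pause_status=jobs.PauseStatus.UNPAUSED,
            file_arrival=jobs.FileArrivalTriggerConfiguration(
                url="s3://fraud-landing-zone/transactions/",  # placeholder landing path
                min_time_between_triggers_seconds=60,         # debounce between triggers
            ),
        )
    ),
)
```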
Databricks Workflows also supports Continuous Jobs to process incoming transactions in near real-time, ensuring that potential fraud is detected promptly.
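As a rough illustration of the configuration involved, a job can be switched to continuous mode with a single settings update; the sketch below uses the Databricks SDK for Python and a hypothetical job id.

```python
# A minimal sketch: running a job in continuous mode so the processing task is
# restarted automatically. The job id is a placeholder.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        continuous=jobs.Continuous(pause_status=jobs.PauseStatus.UNPAUSED)
    ),
)
```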
Any Delta Live Tables pipeline can be orchestrated using Workflows, and its compute resources scale with data volume and processing requirements through enhanced autoscaling.
Data transformations, including aggregations, are managed using dbt. Databricks Workflows integrates with dbt and effectively orchestrates the ingestion and transformation pipeline.
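Below is a hedged sketch of what a dbt task in a job definition can look like, using the Databricks SDK for Python; the warehouse id, cluster id, and dbt commands are placeholder assumptions, and in practice the dbt project would typically come from the job's Git source.

```python
# A hedged sketch of a dbt task orchestrated by Workflows.
# Warehouse id, cluster id, and commands are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

dbt_task = jobs.Task(
    task_key="dbt_transformations",
    dbt_task=jobs.DbtTask(
        commands=["dbt deps", "dbt run"],   # dbt commands run in order
        warehouse_id="<sql-warehouse-id>",  # SQL warehouse that executes the models
    ),
    libraries=[compute.Library(pypi=compute.PythonPyPiLibrary(package="dbt-databricks"))],
    existing_cluster_id="<cluster-id>",     # cluster that runs the dbt CLI
)

w.jobs.create(name="fraud-dbt-transformations", tasks=[dbt_task])
```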
Workflows supports all major task types, shown below:
The pipeline includes data quality checks and machine learning stages. Conditional execution ensures that the machine learning step is executed only if the data quality check passes.
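As an illustration of how such a branch can be expressed, the hedged sketch below (Databricks SDK for Python) defines a condition task that checks a data quality result published as a task value and gates the ML scoring task on its outcome; the task names, value key, and notebook path are hypothetical.

```python
# A hedged sketch of conditional execution: the ML task runs only when the
# condition task evaluates to true. Task names and the value key are placeholders.
from databricks.sdk.service import jobs

quality_gate = jobs.Task(
    task_key="quality_gate",
    condition_task=jobs.ConditionTask(
        op=jobs.ConditionTaskOp.EQUAL_TO,
        left="{{tasks.data_quality_check.values.status}}",  # task value set upstream
        right="PASSED",
    ),
    depends_on=[jobs.TaskDependency(task_key="data_quality_check")],
)

score_transactions = jobs.Task(
    task_key="score_transactions",
    notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/score"),
    # Runs only when the condition task's outcome is "true"; compute settings omitted.
    depends_on=[jobs.TaskDependency(task_key="quality_gate", outcome="true")],
)
```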
Databricks Workflows allows values to be passed between tasks using the task parameterization feature, and parameters can also be shared across all tasks of a job using job parameters.
Figure 7: Adding a Job parameter in Workflows
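Below is a minimal sketch of the task-value mechanics inside the notebooks themselves; the task name and key are hypothetical. Job-level parameters, as configured in the figure above, are referenced in task code with the {{job.parameters.<name>}} syntax.

```python
# A minimal sketch of passing values between tasks with task values.
# Inside the upstream "data_quality_check" notebook task:
dbutils.jobs.taskValues.set(key="status", value="PASSED")

# Inside a downstream notebook task, read the value back. debugValue is used
# when the notebook is run interactively outside a job run.
status = dbutils.jobs.taskValues.get(
    taskKey="data_quality_check", key="status", default="FAILED", debugValue="PASSED"
)
```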
Improved performance and cost-efficiency
Databricks Workflows optimizes resource usage by reusing clusters. When multiple tasks in a job are configured to reuse a cluster, the cluster is started only when the first task runs. The same cluster is then used across the tasks, eliminating the startup time of spinning up an individual cluster for each task. This reduces both overall run time and cost.
Figure 8: Reusing clusters across tasks within a Workflow
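A hedged sketch of this configuration with the Databricks SDK for Python: the job declares one job cluster and both tasks reference it via job_cluster_key, so the cluster starts once and is shared. The cluster settings, notebook paths, and job name are placeholders.

```python
# A hedged sketch of cluster reuse across tasks within a job.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs, compute

w = WorkspaceClient()

shared_cluster = jobs.JobCluster(
    job_cluster_key="shared_cluster",
    new_cluster=compute.ClusterSpec(
        spark_version="14.3.x-scala2.12",  # placeholder runtime
        node_type_id="i3.xlarge",          # placeholder node type
        num_workers=4,
    ),
)

w.jobs.create(
    name="fraud-feature-pipeline",
    job_clusters=[shared_cluster],
    tasks=[
        jobs.Task(
            task_key="build_features",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/features"),
            job_cluster_key="shared_cluster",
        ),
        jobs.Task(
            task_key="train_model",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/fraud/train"),
            depends_on=[jobs.TaskDependency(task_key="build_features")],
            job_cluster_key="shared_cluster",  # same cluster, no second startup
        ),
    ],
)
```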
Monitoring and Alerting
Job run metrics are captured and visualized in the Workflows user interface, making it easier for support teams to monitor Workflow status.
Figure 9: Job run metrics in the workflows UI
Job queueing ensures jobs are not skipped when concurrency limits are hit for the Databricks workspace.
Duration thresholds can optionally be configured for each individual task as well as for the Workflow as a whole, with both warning and timeout levels, so that alerts are triggered when a task or job run exceeds its pre-configured time threshold.
Figure 10: Defining job duration thresholds in a Workflow
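A hedged sketch of these controls through the Jobs API, using the Databricks SDK for Python: a health rule warns when a run exceeds one hour, timeout_seconds cancels it after two, and queueing from the earlier paragraph is enabled. The job id and threshold values are placeholders.

```python
# A hedged sketch: duration warning, hard timeout, and job queueing on one job.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
w.jobs.update(
    job_id=123,  # hypothetical job id
    new_settings=jobs.JobSettings(
        health=jobs.JobsHealthRules(
            rules=[
                jobs.JobsHealthRule(
                    metric=jobs.JobsHealthMetric.RUN_DURATION_SECONDS,
                    op=jobs.JobsHealthOperator.GREATER_THAN,
                    value=3600,  # send a duration warning after 1 hour
                )
            ]
        ),
        timeout_seconds=7200,                    # cancel the run after 2 hours
        queue=jobs.QueueSettings(enabled=True),  # queue runs instead of skipping them
    ),
)
```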
In the event of a job failure due to data issues or resource constraints, the repair-and-rerun feature allows rerunning only the failed tasks, eliminating the need to rerun the entire pipeline. Using the notification feature, alerts can be sent to support teams via email, PagerDuty, Slack, Microsoft Teams, or webhooks to keep the team informed of any failures that require manual intervention.
Figure 11: Setting up alerts in a Workflow
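Below is a hedged sketch of both features via the Databricks SDK for Python; the run id, job id, task key, and email address are placeholders.

```python
# A hedged sketch of repair-and-rerun plus failure email alerts.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

# Rerun only the failed task(s) of an existing run; successful tasks are not repeated.
w.jobs.repair_run(run_id=456, rerun_tasks=["score_transactions"])

# Notify the support team by email whenever a run of this job fails.
w.jobs.update(
    job_id=123,
    new_settings=jobs.JobSettings(
        email_notifications=jobs.JobEmailNotifications(on_failure=["fraud-ops@example.com"])
    ),
)
```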
Job Modularity
The ingestion, transformation, and machine learning model training and evaluation stages are each encapsulated as a separate job for ease of maintenance and logical separation. With the Run Job task type, the entire fraud detection Workflow can be modularized while still being tracked and monitored as a single, unified Workflow, as shown in the sketch below.
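Here is a hedged sketch of such a parent job built from Run Job tasks with the Databricks SDK for Python; the child job ids, task keys, and job name are placeholders.

```python
# A hedged sketch of job modularity: a parent job chains three child jobs by id.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="fraud-detection-e2e",
    tasks=[
        jobs.Task(
            task_key="ingestion",
            run_job_task=jobs.RunJobTask(job_id=101),  # hypothetical child job id
        ),
        jobs.Task(
            task_key="transformation",
            run_job_task=jobs.RunJobTask(job_id=102),
            depends_on=[jobs.TaskDependency(task_key="ingestion")],
        ),
        jobs.Task(
            task_key="model_training",
            run_job_task=jobs.RunJobTask(job_id=103),
            depends_on=[jobs.TaskDependency(task_key="transformation")],
        ),
    ],
)
```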
Upon detecting potential fraud, the pipeline triggers reporting and alerting tasks. These tasks generate reports for investigators and send alerts for immediate action. The entire reporting process can be orchestrated within the Workflow.
Streamlining deployments using CI/CD and Automation
Pipeline configurations, code, and transformations are version-controlled using the Workflows Git integration, which supports the following Git providers:
Figure 12: Setting up git integration in Workflows
The entire Workflow, including infrastructure provisioning, can be defined as code using Terraform, enabling consistent and reproducible deployments across environments. Databricks Asset Bundles encapsulate the entire pipeline, including notebooks, libraries, and configurations, which simplifies packaging, sharing, and deploying the fraud detection system. For platforms where external tools such as Airflow or ADF handle orchestration, Workflows integrates with them through its APIs to provide a seamless experience.
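For the external-orchestration scenario, here is a hedged sketch of what an Airflow or ADF step might do: trigger a Workflows job through the Jobs API with the Databricks SDK for Python and wait for the result. The job id and parameter are placeholders, and in practice the Apache Airflow Databricks provider or ADF's native Databricks activities wrap this same API.

```python
# A minimal sketch of triggering a Workflows job from an external orchestrator.
# Assumes the job defines a "run_date" job parameter; the job id is hypothetical.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# Start the run, optionally overriding job-level parameters, and block until it finishes.
run = w.jobs.run_now(job_id=123, job_parameters={"run_date": "2024-01-01"}).result()
print(run.run_id, run.state.result_state)
```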
In summary, Databricks Workflows plays a pivotal role in orchestrating an end-to-end data pipeline for multiple use cases. From ingesting transaction data to real-time processing, model training, reporting, and alerting, every step is seamlessly coordinated. The use of features such as file arrival triggers, conditional execution, and Git integration ensures efficiency, reliability, and scalability. Be sure to continue with the next blog post of the series, in which we will discuss and deep dive into the basics of Workflows.