osamam
Databricks Employee

Introduction

In today's data-driven world, organisations are constantly seeking ways to optimise their data integration and serving patterns. This blog post delves into the powerful combination of Azure Cosmos DB and Azure Databricks, exploring advanced integration techniques that can revolutionise data synchronisation, processing, and analytics workflows. By leveraging the strengths of these two robust Azure services, businesses can create scalable, high-performance data solutions capable of handling diverse data types and workloads, from transactional processing to advanced analytics and AI. This technical deep dive guides you through the intricacies of integrating Azure Cosmos DB and Azure Databricks, providing practical insights into efficient data management strategies that can drive your organisation's data intelligence to new heights.

Below is a high-level description of both components:

Azure Databricks

Azure Databricks is the Data Intelligence Platform that provides a collaborative, scalable environment for data engineering, data science, and business analytics. Built on Apache Spark™, it offers:

  • High-performance computing for big data processing
  • Interactive notebooks for data exploration and visualisation
  • Seamless integration with other Azure services
  • Built-in support for machine learning and AI workloads

Azure Cosmos DB

Azure Cosmos DB is a fully managed, globally distributed NoSQL database service designed for high availability, elastic scalability, and low latency performance. Key features include:

  • Multi-model support (document, key-value, graph, and column-family data)
  • Global distribution with multi-region writes
  • Guaranteed single-digit millisecond response times
  • Flexible consistency models

Integration Benefits

When used together, Azure Databricks and Cosmos DB create a powerful integration for modern data architectures:

  1. Data sharing and compliance: Regulations such as the Consumer Data Right (CDR) legislation in Australia and Open Banking mandate that some industry verticals, such as FSI, Energy, and Telecoms, make data available to a customer or an external third-party broker upon the customer's specific consent.
  2. Seamless data pipeline: Easily ingest, process, and analyse data from Cosmos DB using Databricks’ Spark-based analytics engine.
  3. Real-time analytics: Combine Cosmos DB's low-latency data access with Databricks’ streaming capabilities for real-time insights.
  4. Advanced querying: Leverage Databricks’ SQL capabilities to perform complex queries on Cosmos DB data.
  5. Unified data lake and warehouse: Implement a Lakehouse architecture by combining Cosmos DB's NoSQL capabilities with Databricks’ Delta Lake technology.

This integration allows organisations to build scalable, high-performance data solutions that handle diverse data types and workloads, from transactional processing to advanced analytics and AI.

Reference Architecture

The reference architecture below illustrates the different components and the integration pattern. In this blog post, we’ll focus on the integration between Azure Databricks and Cosmos DB.

 

[Figure: Azure Databricks integration with Cosmos DB reference architecture]

Cosmos DB Spark Connector

The Azure Cosmos DB Spark Connector is the foundation of the integration between Cosmos DB and Azure Databricks. It enables seamless data transfer and processing between the two services.

Setup and Configuration

  • Install the Cosmos DB connector for Spark 3.x.
  • In the Compute tab, open your Databricks cluster, click Libraries, and then Install new.


Select Maven and click on Search Packages:


  • Select Maven Central and start typing com.azure.cosmos.spark


  • Select the library that matches your Spark and Scala versions (for example, azure-cosmos-spark_3-4_2-12 for Spark 3.4 and Scala 2.12), and click Select.
  • Configure the Spark connector in your Databricks Notebook:
readCfg = {
    "spark.cosmos.accountEndpoint": cosmosEndpoint,
    "spark.cosmos.accountKey": cosmosMasterKey,
    "spark.cosmos.database": cosmosDatabaseName,
    "spark.cosmos.container": cosmosContainerName,
    "spark.cosmos.read.inferSchema.enabled": "true",
    "spark.cosmos.write.strategy": "ItemOverwrite"
}
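
The connection variables referenced above need to be defined earlier in the notebook. A minimal sketch is shown below; the endpoint, database, container, and secret scope/key names are hypothetical placeholders, and the account key should come from a Databricks secret scope rather than being hard-coded:

# Hypothetical values -- adjust to your account and workspace.
cosmosEndpoint = "https://<your-cosmos-account>.documents.azure.com:443/"
cosmosMasterKey = dbutils.secrets.get(scope="cosmos-scope", key="cosmos-master-key")  # hypothetical scope/key
cosmosDatabaseName = "SampleDB"          # hypothetical database name
cosmosContainerName = "SampleContainer"  # hypothetical container name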

Integration Patterns

1. Batch Read and Write

This pattern is suitable for periodic data synchronisation or large-scale data processing.

# Read the data into a Spark DataFrame and print the row count
cosmos_df = spark.read.format("cosmos.oltp").options(**readCfg).load()
print(cosmos_df.count())

# Process data
processed_df = cosmos_df.filter(cosmos_df.age > 30).select("id", "name", "age")

# Write back to Cosmos DB
processed_df.write.format("cosmos.oltp").options(**readCfg).mode("append").save()
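
In practice, the processed results often land in a different container from the one they were read from. One way to do this is to reuse the read configuration and override only the target container; processedContainer below is a hypothetical container name:

# Reuse the read configuration, overriding only the target container.
# "processedContainer" is a hypothetical container name.
writeCfg = {**readCfg, "spark.cosmos.container": "processedContainer"}

processed_df.write.format("cosmos.oltp").options(**writeCfg).mode("append").save()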

2. Streaming with Change Feed

Leverage Cosmos DB's change feed for real-time data processing in Databricks.

# Configure change feed options
changeFeedConfig = {
  "spark.cosmos.changeFeed.startFrom": "Beginning",
  "spark.cosmos.changeFeed.mode": "Incremental",
  "spark.cosmos.changeFeed.checkpointLocation": "/tmp/checkpoint",
  "spark.cosmos.accountEndpoint": cosmosEndpoint,
  "spark.cosmos.accountKey": cosmosMasterKey,
  "spark.cosmos.database": cosmosDatabaseName,
  "spark.cosmos.container": cosmosContainerName,
  "spark.cosmos.read.partitioning.strategy": "Default",
  "spark.cosmos.read.inferSchema.enabled" : "true",
}

# Read from change feed
change_feed_df = spark.readStream.format("cosmos.oltp.changeFeed").options(**changeFeedConfig).load()

# Write the change feed stream to a Delta table
streaming_output = (
  change_feed_df
  .writeStream
  .trigger(once=True)
  .format('delta')
  .outputMode('append')
  .option("checkpointLocation", "/cosmos/checkpoint_changefeed")
  .table('cosmos_changefeed')
)
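
The trigger(once=True) setting processes whatever is currently available in the change feed as a single incremental batch and then stops, which suits scheduled jobs. For a continuously running stream, a processing-time trigger can be swapped in; a minimal sketch:

# Continuously poll the change feed, micro-batching every 10 seconds.
streaming_output = (
  change_feed_df
  .writeStream
  .trigger(processingTime="10 seconds")
  .format('delta')
  .outputMode('append')
  .option("checkpointLocation", "/cosmos/checkpoint_changefeed")
  .table('cosmos_changefeed')
)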

3. Bulk Insert and Upsert

For high-throughput data ingestion from Databricks to Cosmos DB:

bulk_data = spark.createDataFrame([
    ("1", "John Doe", 30),
    ("2", "Jane Smith", 28)
], ["id", "name", "age"])

bulk_data.write.format("cosmos.oltp").mode("append").options(**readCfg).save()
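
Because readCfg sets spark.cosmos.write.strategy to ItemOverwrite, this write behaves as an upsert: documents whose id (and partition key) match an existing item are replaced rather than duplicated. The connector's bulk mode is on by default, but both settings can be made explicit; a sketch reusing the earlier configuration:

# Explicit upsert and bulk settings (bulk ingestion is enabled by default).
upsertCfg = {
    **readCfg,
    "spark.cosmos.write.strategy": "ItemOverwrite",  # upsert semantics
    "spark.cosmos.write.bulk.enabled": "true"        # high-throughput bulk writes
}

bulk_data.write.format("cosmos.oltp").mode("append").options(**upsertCfg).save()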

4. Analytics and Aggregations

Perform complex analytics on Cosmos DB data using Databricks:

cosmos_df = spark.read.format("cosmos.oltp").options(**readCfg).load()
analytics_result = cosmos_df.groupBy("category").agg({"price": "avg", "quantity": "sum"})
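
To round out the Lakehouse flow, the aggregated results can be persisted as a Delta table for downstream SQL and BI workloads; cosmos_category_stats below is a hypothetical table name:

# Persist the aggregation as a Delta table (hypothetical table name).
analytics_result.write.format("delta").mode("overwrite").saveAsTable("cosmos_category_stats")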


Advanced Considerations

  1. Optimising RU Consumption:
    • Disable schema inference (set spark.cosmos.read.inferSchema.enabled to false) and supply an explicit schema to avoid the sampling reads, and therefore the RU consumption, that inference incurs.
    • Implement retry logic for rate-limited requests.

  2. Partition Key Optimisation:
    • Align Spark partitions with Cosmos DB physical partitions for better performance.
    • Use the spark.cosmos.read.partitioning.strategy option to control data distribution.

  3. Security and Networking:
    • Implement Azure Databricks in a virtual network with a Service Endpoint enabled for Azure Cosmos DB to enhance security.

  4. Change Feed Checkpointing:
    • Use ADLS Gen2 for reliable checkpointing in production scenarios.

  5. Schema Handling:
    • Supply a custom schema instead of relying on inference for better control and performance, as shown in the sketch after this list.
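
Putting points 1 and 5 together, the sketch below disables inference and passes an explicit schema through the standard DataFrameReader schema() method; the field names simply mirror the earlier examples:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema matching the fields used in the earlier examples.
customSchema = StructType([
    StructField("id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Disable inference so the supplied schema is used as-is.
noInferCfg = {**readCfg, "spark.cosmos.read.inferSchema.enabled": "false"}

cosmos_df = (
    spark.read.format("cosmos.oltp")
    .options(**noInferCfg)
    .schema(customSchema)
    .load()
)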

Conclusion

The integration between Azure Cosmos DB and Azure Databricks offers powerful capabilities for building scalable, real-time data processing and analytics solutions. By leveraging the Cosmos DB Spark connector and implementing these advanced integration patterns, data engineers can create efficient data pipelines that combine the strengths of both platforms.

Consider data volume, latency requirements, and cost optimisation when implementing these patterns. Proper use of these integration techniques can significantly enhance the performance and scalability of your data architecture, enabling you to build sophisticated analytics and data processing workflows that span both Azure Cosmos DB and Azure Databricks.
