Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are the prerequisites for connecting Confluent Kafka with Databricks?

shan-databricks
New Contributor III

Please provide the prerequisites for connecting Confluent Kafka with Databricks, the available connection options with their respective advantages and disadvantages, and a recommendation on the best option for our deliverable.

Thanks

Shanmugam

 

1 ACCEPTED SOLUTION


lingareddy_Alva
Honored Contributor III

Hi @shan-databricks 

Connecting Confluent Kafka with Databricks creates a powerful "data in motion" to "data at rest" architecture.
Below are the prerequisites, connection methods, and strategic recommendations for your deliverable.

1. Prerequisites
Before establishing a connection, ensure the following are in place:
Confluent Cloud/Platform Details:
* Bootstrap Server: The URL of your Kafka brokers (e.g., pkc-xxxx.us-east-1.aws.confluent.cloud:9092).
* API Keys: A cluster-level API Key and Secret for authentication.
* Schema Registry (Optional): If using Avro/Protobuf, you also need the Schema Registry URL and its own API Key/Secret.

Databricks Workspace:
* Network Connectivity: Ensure your Databricks cluster has egress access to Confluent. For production, VNet Injection or Private Link is recommended so traffic does not route over the public internet.
* Libraries: The spark-sql-kafka connector is built into the Databricks Runtime; install confluent-kafka only if you need Python-based Schema Registry handling.
* Secrets Management: Store your API Secrets in Databricks Secret Scopes rather than hardcoding them in notebooks.
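As a minimal sketch of the secrets-management point: credentials are read at runtime from a secret scope rather than appearing in notebook source. The scope and key names ("confluent", "api-key", "api-secret") are hypothetical placeholders, and `dbutils` is the global predefined inside a Databricks notebook.

```python
# Sketch: fetch Confluent credentials from a Databricks Secret Scope.
# Hypothetical scope/key names -- create them first, e.g. with the CLI:
#   databricks secrets create-scope confluent
#   databricks secrets put-secret confluent api-key

def confluent_credentials(scope: str = "confluent") -> tuple[str, str]:
    # `dbutils` is predefined inside a Databricks notebook; the secret
    # values are redacted if you try to print them in cell output.
    return (
        dbutils.secrets.get(scope, "api-key"),
        dbutils.secrets.get(scope, "api-secret"),
    )
```

The function only resolves the secrets when called on a cluster, so the notebook source stays free of credentials.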

2. Connection Options
There are three primary ways to bridge these platforms, each suited for different use cases.

Option A: Spark Structured Streaming (Native Integration)
This is the most common "Pull" method where Databricks acts as the consumer.
Pros:
* Granular Control: Complete control over transformations (PySpark/SQL) within Databricks.
* Exactly-Once Semantics: Built-in fault tolerance using Spark checkpointing.
* Unified Batch/Streaming: The same code serves real-time streams and historical batch processing.

Cons:
* Compute Costs: Requires a running Databricks cluster (always-on or job cluster).
* Management: You are responsible for the Spark code and scaling logic.
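A minimal sketch of Option A, assuming a Databricks notebook (the `spark` and `dbutils` globals exist) and the bootstrap server, secret scope, and topic names shown here are placeholders. Note the `kafkashaded.` prefix on the JAAS login module, which the Databricks Runtime's shaded Kafka client requires.

```python
# Sketch: consume a Confluent Cloud topic with Structured Streaming.

def jaas_config(api_key: str, api_secret: str) -> str:
    """Build the SASL/PLAIN JAAS line Confluent Cloud expects."""
    return (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="{api_key}" password="{api_secret}";'
    )

def kafka_options(bootstrap: str, api_key: str, api_secret: str, topic: str) -> dict:
    """Assemble the reader options for a SASL_SSL Confluent connection."""
    return {
        "kafka.bootstrap.servers": bootstrap,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas_config(api_key, api_secret),
        "subscribe": topic,
        "startingOffsets": "earliest",
    }

def start_stream(topic: str):
    # Runs only on a cluster: credentials come from a Secret Scope
    # (scope/key names are hypothetical).
    key = dbutils.secrets.get("confluent", "api-key")
    secret = dbutils.secrets.get("confluent", "api-secret")
    opts = kafka_options(
        "pkc-xxxx.us-east-1.aws.confluent.cloud:9092", key, secret, topic
    )
    return (
        spark.readStream.format("kafka")
        .options(**opts)
        .load()
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
    )
```

The returned DataFrame can then be written with `writeStream` plus a `checkpointLocation` option, which is what provides the exactly-once fault tolerance mentioned above.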

Option B: Confluent Delta Lake Sink Connector
A "Push" method in which a Confluent-managed connector writes directly to your cloud storage (S3/ADLS).
Pros:
* No-Code Ingestion: Managed by Confluent; no Spark code is required for the initial landing.
* Offloads Compute: Does not consume Databricks cluster resources during ingestion.
* Simplicity: Best for simple "mirroring" of Kafka topics to Delta tables.

Cons:
* Latency: Often involves landing data in object storage first before Databricks picks it up (metadata-discovery overhead).
* Limited Transformation: Only supports basic Single Message Transforms (SMTs).
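When the connector lands files in object storage first, Auto Loader is the usual Databricks-side pickup. A sketch under stated assumptions: all bucket paths, schema/checkpoint locations, and table names below are hypothetical, and `spark` is the ambient Databricks session.

```python
# Sketch: incrementally ingest connector-landed files with Auto Loader.

def landing_paths(topic: str, bucket: str = "s3://my-bucket/confluent-landing") -> dict:
    """Derive input, schema, and checkpoint locations for one topic."""
    return {
        "input": f"{bucket}/{topic}/",
        "schema": f"/Volumes/main/bronze/_schemas/{topic}",
        "checkpoint": f"/Volumes/main/bronze/_checkpoints/{topic}",
    }

def ingest_topic(topic: str):
    # Runs only on a cluster: `spark` is the ambient Databricks session.
    p = landing_paths(topic)
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", p["schema"])
        .load(p["input"])
        .writeStream
        .option("checkpointLocation", p["checkpoint"])
        .trigger(availableNow=True)
        .toTable(f"main.bronze.{topic}")
    )
```

The `availableNow` trigger processes whatever has landed and then stops, which keeps the latency/cost trade-off of this option explicit: ingestion runs on a schedule rather than continuously.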

Option C: Confluent Tableflow (The New "Zero-Copy" Way)
Confluent's newest feature, which materializes Kafka topics as Delta Lake or Iceberg tables automatically.
Pros:
* Lowest Overhead: Data is stored once but accessible by both platforms.
* Performance: Eliminates the need for custom ETL/connectors.

Cons:
* Maturity: A newer feature, still limited to specific regions and cloud providers.

My Recommendation: For a robust enterprise deliverable, start with Structured Streaming as it demonstrates the highest technical proficiency with your Databricks/Spark skill set and provides the most flexibility for future business requirements.

LR


2 REPLIES


Thank you for your response. I will try the integration and options and will reach out if I need further assistance.
 
Shanmugam