Databricks on GCP offers several native ETL services and integration options to ingest data from SAP (ECC, HANA), Oracle, and Kafka streams into the Lakehouse. Comparing Databricks-native solutions with GCP-native ETL services such as Data Fusion and Dataflow reveals tradeoffs in integration, scalability, and workflow preference.
Databricks Native ETL Services
Databricks provides dedicated connectors, pipeline frameworks, and automation tools for ingesting data from major enterprise systems:
- SAP (ECC, HANA) Integration: Databricks offers SAP connectors and integration flows, allowing the Lakehouse platform to ingest SAP data via JDBC or OData, often using Databricks notebooks or tools like Lakeflow Declarative Pipelines for streamlined ETL.
- Oracle Integration: You can pull data from Oracle databases using Databricks JDBC connectors, notebooks, or Lakeflow Declarative Pipelines. Delta Live Tables (DLT) further automates CDC and batch ingestion from Oracle into the Lakehouse (a minimal JDBC sketch follows this list).
- Kafka Stream Integration: Databricks supports both batch and streaming ingestion from Apache Kafka using built-in connectors (such as the read_kafka SQL function in Databricks SQL) and the Structured Streaming APIs, allowing ingestion directly into Delta Lake tables (see the Structured Streaming sketch after this list).
- Lakeflow Declarative Pipelines: Databricks Lakeflow is an orchestration and data pipeline service enabling low-code ETL from multiple data sources, including Oracle, SAP, and Kafka.
- Delta Live Tables (DLT): Provides declarative pipeline management and automation for ETL, including support for incremental and streaming sources.
- Databricks Notebooks & Auto Loader: SQL, Python, and Scala notebooks can leverage Auto Loader for scalable file ingestion, as well as custom code for JDBC or OData connectors (an Auto Loader sketch is included below).
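As a concrete illustration, here is a minimal Structured Streaming sketch for ingesting a Kafka topic into a Delta table on Databricks. The broker address, topic, checkpoint path, and target table name are placeholders, and `spark` refers to the session that Databricks notebooks provide automatically.

```python
from pyspark.sql.functions import col

# Read the Kafka topic as a streaming DataFrame (placeholder broker/topic).
raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings for the bronze layer.
events = raw.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("payload"),
    col("timestamp"),
)

# Continuously append into a Delta table, tracking progress via a checkpoint.
(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders")
    .toTable("bronze.kafka_orders")
)
```

A batch pull from Oracle over JDBC follows the same pattern. The hostname, service name, secret scope, and table names below are illustrative assumptions, and the Oracle JDBC driver must be available on the cluster.

```python
# Read an Oracle table over JDBC (placeholder connection details and secrets).
oracle_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB1")
    .option("dbtable", "SALES.ORDERS")
    .option("user", dbutils.secrets.get("etl-scope", "oracle-user"))
    .option("password", dbutils.secrets.get("etl-scope", "oracle-password"))
    .option("driver", "oracle.jdbc.OracleDriver")
    .load()
)

# Land the snapshot as a Delta table in the bronze layer.
oracle_df.write.format("delta").mode("overwrite").saveAsTable("bronze.oracle_orders")
```

Finally, a minimal Auto Loader sketch for file-based ingestion from a GCS landing zone; the bucket path, file format, and table name are placeholders.

```python
# Incrementally discover and load new JSON files from a landing bucket.
(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/tmp/schemas/landing_events")
    .load("gs://example-landing-bucket/events/")
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/landing_events")
    .toTable("bronze.landing_events")
)
```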
GCP Native ETL Services vs. Databricks ETL
GCP's core ETL products, Data Fusion and Dataflow, offer strong integration with GCP-native Lakehouse architectures (BigQuery, Dataplex), automated schema management, and robust scalability:
- Data Fusion: A fully managed, drag-and-drop ETL tool supporting connections to enterprise databases and streaming sources, with easy loading into BigQuery, data lake storage, or Databricks tables.
- Dataflow: Built for scalable batch and stream processing pipelines (based on Apache Beam), with native integration with GCP storage, analytics, and real-time data applications (a minimal Beam sketch follows this list).
- Data Fusion and Dataflow excel in mixed-cloud, heavily GCP-centric architectures, especially when users prefer BigQuery for analytics or strong GCP governance features.
- Databricks ETL is highly optimized for Spark-native workflows and advanced AI/ML with unified data management via Delta Lake, and supports low-code/no-code pipelines in Lakehouse environments, minimizing data movement and latency when working directly on Databricks.
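For comparison, here is a minimal Apache Beam (Python) sketch of the Dataflow-style approach: reading from Kafka with the cross-language KafkaIO transform and streaming rows into BigQuery. The broker, topic, and table references are placeholders, and running ReadFromKafka requires Beam's Java expansion service to be available to the runner.

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(kv):
    # KafkaIO yields (key, value) pairs as bytes; decode them for BigQuery.
    key, value = kv
    return {"key": key.decode("utf-8") if key else None,
            "payload": value.decode("utf-8")}

options = PipelineOptions(streaming=True)  # add Dataflow runner options as needed

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadKafka" >> ReadFromKafka(
            consumer_config={"bootstrap.servers": "broker1:9092"},
            topics=["orders"],
        )
        | "ToRow" >> beam.Map(to_row)
        | "WriteBQ" >> beam.io.WriteToBigQuery(
            "my-project:lake.kafka_orders",
            schema="key:STRING,payload:STRING",
        )
    )
```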
Summary Table
| Source | Databricks Native ETL | GCP Native ETL Options |
|--------|-----------------------|------------------------|
| SAP | Lakeflow, DLT, JDBC/OData notebooks | Data Fusion (SAP connectors) |
| Oracle | Lakeflow, DLT, JDBC, notebooks | Data Fusion, Dataflow (JDBC/CDC) |
| Kafka | Structured Streaming, read_kafka | Dataflow (Beam KafkaIO), Data Fusion |
Key Considerations
- Databricks ETL is preferred for Spark-native, advanced ML/AI, and unified Lakehouse operations.
- GCP ETL services are robust for multi-cloud, GCP-centric, and BigQuery-focused analytics.
- Both Databricks and GCP ETL offer connectors for major enterprise sources, but workflow, governance, and cost models differ substantially.
For unified Lakehouse architectures directly on Databricks, native Databricks ETL services (Lakeflow, DLT, and built-in connectors) are usually optimal unless there is a strong requirement to integrate with non-Databricks GCP analytics or governance tools.