Hi @Pratikmsbsvm,
Happy to help with this one. SAP S/4HANA to Databricks is one of the most common enterprise data migration scenarios, and there are several well-proven approaches depending on your requirements for data freshness, volume, and budget. Let me walk through your options.
OPTION 1: PARTNER INGESTION TOOLS (RECOMMENDED FOR MOST SAP MIGRATIONS)
This is the most common and production-proven approach. Databricks has validated integrations with several ingestion partners that have purpose-built SAP connectors. These handle the complexity of SAP extraction protocols (ODP, BAPI, RFC, CDS Views) so you do not have to.
Fivetran
- Databricks-validated partner, available directly through Partner Connect in your workspace
- Has dedicated SAP connectors that support S/4HANA, ECC, and BW
- Supports SAP ODP (Operational Data Provisioning) for delta/incremental extraction
- Can extract from SAP tables, CDS Views, and ODP sources
- Writes directly to Delta Lake in your Databricks Lakehouse
- Setup: Workspace > Partner Connect > Fivetran, or configure manually
- Docs: https://docs.databricks.com/partners/ingestion/fivetran
Informatica Cloud Data Integration
- Also a Databricks-validated ingestion partner with Unity Catalog support
- Informatica has deep SAP expertise with connectors for S/4HANA, ECC, BW, and HANA
- Supports BAPI, RFC, IDoc, and ODP extraction methods
- Mass ingestion capabilities for large-scale SAP migrations
- Docs: https://docs.databricks.com/integrations/
Other partners to evaluate:
- Qlik Replicate (formerly Attunity) -- strong SAP CDC (Change Data Capture) capabilities
- Precisely (Connect) -- SAP-certified, supports real-time replication from S/4HANA
- Theobald Software (Xtract Universal) -- lightweight, SAP-certified extraction tool popular in the Databricks community
- SNP Glue -- purpose-built for SAP data extraction to cloud data platforms
OPTION 2: LAKEFLOW CONNECT (MANAGED CONNECTORS)
Databricks Lakeflow Connect provides fully managed, native connectors for popular data sources. As of today, the managed connector list includes Salesforce, ServiceNow, Workday, SQL Server, MySQL, PostgreSQL, Google Analytics, and several others.
SAP is NOT currently available as a Lakeflow Connect managed connector. However, the Lakeflow Connect connector catalog continues to expand, so keep an eye on the release notes for future SAP support.
Current managed connectors: https://docs.databricks.com/ingestion/lakeflow-connect/
OPTION 3: JDBC CONNECTION TO SAP HANA DATABASE
Since S/4HANA runs exclusively on SAP HANA, you can connect directly to the underlying HANA database using JDBC from Databricks, provided your network and SAP team allow database-level access.
How it works:
1. Download the SAP HANA JDBC driver (ngdbc.jar) from SAP
2. Upload the driver JAR to your Databricks cluster or Unity Catalog volumes
3. Use spark.read.format("jdbc") to query HANA tables directly
Example code:
df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:sap://your-hana-host:30015")  # port is 3<nn>15, where nn is your HANA instance number
    .option("dbtable", "your_schema.your_table")
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "com.sap.db.jdbc.Driver")
    .load())

df.write.format("delta").saveAsTable("catalog.schema.sap_table")
Important caveats with the JDBC approach:
- This reads directly from HANA, which puts load on your SAP production system
- No built-in CDC/delta extraction -- you need to build your own incremental logic
- You are reading raw database tables, not SAP business objects (no BAPI/RFC)
- Consider pointing this at a read replica or SAP HANA sidecar to avoid production impact
- Best for smaller tables or initial bulk loads, not ongoing replication
Docs: https://docs.databricks.com/integrations/jdbc-oss/
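To illustrate the "build your own incremental logic" caveat: one common pattern is to push a watermark filter down to HANA as a JDBC subquery so only new/changed rows cross the wire. This is a minimal sketch -- the table, the AEDAT (last-changed date) column, and the watermark value are placeholders you would adapt to your schema and store between runs (e.g., in a control table).

```python
def incremental_subquery(table, watermark_col, last_value):
    """Build a JDBC pushdown subquery so HANA filters new rows
    server-side instead of shipping the whole table to Spark."""
    return (f"(SELECT * FROM {table} "
            f"WHERE {watermark_col} > '{last_value}') t")

# Hypothetical usage in a Databricks notebook:
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:sap://your-hana-host:30015")
#       .option("dbtable", incremental_subquery("your_schema.VBAK", "AEDAT", "20240101"))
#       .option("user", "your_user")
#       .option("password", "your_password")
#       .option("driver", "com.sap.db.jdbc.Driver")
#       .load())
```

Note that date-based watermarks miss hard deletes; if deletes matter, you still need a periodic full reconciliation or a proper CDC tool.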
OPTION 4: SAP DATA INTELLIGENCE / SAP BTP INTEGRATION
If your organization already uses SAP Data Intelligence (now part of SAP Datasphere) or SAP BTP, you can:
- Use SAP Data Intelligence pipelines to extract and push data to cloud storage (S3, ADLS, GCS)
- Then use Databricks Auto Loader to incrementally pick up those files into Delta tables
- This keeps SAP as the extraction engine and Databricks as the processing/analytics platform
Auto Loader docs: https://docs.databricks.com/ingestion/cloud-object-storage/auto-loader/
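As a rough sketch of the Databricks side of this pattern, assuming SAP Data Intelligence drops Parquet files into a landing path (the storage path, volume paths, and table name below are all placeholders):

```python
# Hypothetical Auto Loader configuration for picking up SAP DI exports.
landing_path = "abfss://sap-landing@yourstorage.dfs.core.windows.net/vbak/"  # placeholder

autoloader_opts = {
    "cloudFiles.format": "parquet",                               # format SAP DI writes
    "cloudFiles.schemaLocation": "/Volumes/main/sap/_schemas/vbak",  # schema tracking
    "cloudFiles.inferColumnTypes": "true",
}

# In a Databricks notebook:
# df = spark.readStream.format("cloudFiles").options(**autoloader_opts).load(landing_path)
# (df.writeStream
#    .option("checkpointLocation", "/Volumes/main/sap/_checkpoints/vbak")
#    .trigger(availableNow=True)   # batch-style: process new files, then stop
#    .toTable("main.bronze.vbak"))
```

The availableNow trigger makes this behave like an incremental batch job you can schedule, which fits the file-drop cadence of an SAP-side extraction pipeline.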
OPTION 5: CUSTOM EXTRACTION WITH PYRFC OR SAP ODATA
For advanced users who want full control:
PyRFC approach:
- Use the SAP PyRFC library to call SAP RFCs/BAPIs from Python
- Run this in a Databricks notebook to extract business objects natively
- Good for complex extractions where you need SAP business logic
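For a concrete flavor of the PyRFC route: RFC_READ_TABLE (a commonly used, if dated, RFC) returns column metadata in FIELDS and each row packed into a single delimited WA string in DATA, so you typically need a small unpacking step. The connection parameters below are placeholders, and the helper is just a sketch:

```python
# Hypothetical extraction via pyrfc (requires the SAP NW RFC SDK installed):
# from pyrfc import Connection
# conn = Connection(ashost="your-app-server", sysnr="00", client="100",
#                   user="your_user", passwd="your_password")
# result = conn.call("RFC_READ_TABLE", QUERY_TABLE="VBAK",
#                    DELIMITER="|", ROWCOUNT=1000)

def parse_rfc_read_table(fields, data, delimiter="|"):
    """Unpack RFC_READ_TABLE output: FIELDS carries column names,
    DATA carries each row as one delimited WA string."""
    names = [f["FIELDNAME"] for f in fields]
    return [dict(zip(names, row["WA"].split(delimiter))) for row in data]

# Mocked RFC_READ_TABLE result for illustration:
fields = [{"FIELDNAME": "VBELN"}, {"FIELDNAME": "AUDAT"}]
data = [{"WA": "0000012345|20240301"}]
rows = parse_rfc_read_table(fields, data)
```

From there you can hand the list of dicts to spark.createDataFrame and write it to a Bronze Delta table.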
OData approach:
- S/4HANA exposes CDS Views as OData services
- Use Python requests or PySpark to call OData endpoints
- Good for entity-level extraction with built-in filtering
- Works well with S/4HANA Cloud edition
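A minimal OData v2 paging sketch with requests -- the host, service path, entity name, and credentials are placeholders (API_SALES_ORDER_SRV is just an example of a standard S/4HANA API service):

```python
import requests

BASE = "https://your-s4-host/sap/opu/odata/sap/API_SALES_ORDER_SRV"  # placeholder

def odata_params(skip=0, top=1000, filter_expr=None):
    """Build OData v2 query options for server-side paging and filtering."""
    params = {"$format": "json", "$skip": str(skip), "$top": str(top)}
    if filter_expr:
        params["$filter"] = filter_expr
    return params

def fetch_page(entity, **kwargs):
    resp = requests.get(f"{BASE}/{entity}", params=odata_params(**kwargs),
                        auth=("your_user", "your_password"), timeout=60)
    resp.raise_for_status()
    return resp.json()["d"]["results"]   # OData v2 wraps rows in d.results

# Page through an entity set 1000 rows at a time:
# rows, skip = [], 0
# while True:
#     page = fetch_page("A_SalesOrder", skip=skip, top=1000)
#     if not page:
#         break
#     rows.extend(page)
#     skip += 1000
```

Pushing a $filter expression (e.g., on a change date) gives you simple incremental pulls without dragging full entity sets over HTTP.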
RECOMMENDED ARCHITECTURE FOR SAP DATA MIGRATION
For a production-grade SAP to Databricks migration, I recommend a medallion architecture:
Bronze layer: Raw SAP data landed as-is into Delta tables (from any extraction method above)
Silver layer: Cleansed and conformed SAP data. This is where you handle SAP-specific transformations like:
- Converting SAP date formats (YYYYMMDD strings to proper dates)
- Resolving SAP domain values and text tables, and joining header/item tables (e.g., VBAK + VBAP)
- Handling SAP deletion flags and change documents
- Currency and unit conversions using SAP reference tables (TCURR, T006)
Gold layer: Business-ready datasets and aggregations for reporting and ML
Use Lakeflow Spark Declarative Pipelines (SDP) to orchestrate Bronze-to-Silver-to-Gold transformations with built-in data quality expectations.
Docs: https://docs.databricks.com/delta-live-tables/
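To make the Silver-layer date handling concrete: SAP stores dates as YYYYMMDD character fields and uses '00000000' as the initial (empty) value, so a naive cast produces garbage. A small sketch (the AUDAT column name is just an example):

```python
from datetime import date, datetime

def parse_sap_date(raw):
    """Convert an SAP YYYYMMDD date string to a Python date,
    treating SAP's '00000000' initial value (and empty/None) as missing."""
    if not raw or raw == "00000000":
        return None
    return datetime.strptime(raw, "%Y%m%d").date()

# The PySpark equivalent for a Silver pipeline might look like:
# from pyspark.sql import functions as F
# silver = bronze.withColumn(
#     "order_date",
#     F.when(F.col("AUDAT") == "00000000", None)
#      .otherwise(F.to_date("AUDAT", "yyyyMMdd")))
```

The same "initial value is not NULL" pattern applies to many SAP fields (blank CHAR keys, zero-filled NUMC), so it is worth centralizing these conversions in shared Silver-layer functions.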
HUB-AND-SATELLITE (HUB & SPOKE) ARCHITECTURE
For your Hub & Spoke requirement with a central hub for harmonized SAP data and satellites for domain-specific analytics, Unity Catalog is the key enabler in Databricks.
Central Hub (shared catalog):
- Create a shared Unity Catalog catalog (e.g., "sap_harmonized") that contains your cleansed, conformed SAP data models (Silver layer)
- This catalog contains cross-domain reference data: master data (customers, vendors, materials, GL accounts), currency conversion tables, org structure mappings
- Use Lakeflow Spark Declarative Pipelines (SDP) to maintain these harmonized tables with data quality checks
Satellite Domains (domain-specific catalogs):
- Create separate catalogs per business domain (e.g., "finance_analytics", "supply_chain_analytics", "sales_analytics")
- Each domain catalog contains Gold-layer tables with domain-specific aggregations, KPIs, and business logic
- Domain teams own their catalogs and build on top of the shared hub data
How Unity Catalog enables this:
- Cross-catalog queries: Domain teams can JOIN their satellite tables with the central hub tables seamlessly
- Fine-grained access control: Grant each domain team read access to the hub and full access to their own satellite catalog
- Data lineage: Track how data flows from SAP through the hub into each satellite
- Data sharing: Use Delta Sharing if satellites need to share data across organizational boundaries
Example structure:
sap_harmonized (Hub catalog)
    raw_sap (Bronze schema) -- raw SAP extracts
    conformed_sap (Silver schema) -- cleansed, joined SAP data
    master_data (schema) -- shared reference/master data
finance_analytics (Satellite catalog)
    gl_reporting (schema) -- GL aggregations, trial balance
    ap_ar_aging (schema) -- payables/receivables analytics
supply_chain_analytics (Satellite catalog)
    inventory_kpis (schema) -- stock levels, turnover
    procurement (schema) -- PO analysis, vendor performance
Docs on Unity Catalog: https://docs.databricks.com/data-governance/unity-catalog/
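The access pattern described above (read-only on the hub, full control of the satellite) boils down to a couple of Unity Catalog GRANT statements per domain. A sketch -- catalog and group names are placeholders:

```python
def hub_spoke_grants(hub_catalog, satellite_catalog, domain_group):
    """Generate Unity Catalog GRANTs for one satellite domain:
    read-only on the shared hub catalog, full privileges on its own."""
    return [
        f"GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG {hub_catalog} TO `{domain_group}`",
        f"GRANT ALL PRIVILEGES ON CATALOG {satellite_catalog} TO `{domain_group}`",
    ]

# In a Databricks notebook or SQL editor:
# for stmt in hub_spoke_grants("sap_harmonized", "finance_analytics", "finance-analysts"):
#     spark.sql(stmt)
```

Running this per domain keeps the hub/satellite permission model consistent as you add new business domains.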
WHICH OPTION SHOULD YOU CHOOSE?
If you want fastest time-to-value: Use a partner tool like Fivetran or Informatica via Partner Connect. They handle all the SAP extraction complexity.
If you need real-time or near-real-time: Look at Precisely Connect or Qlik for SAP CDC replication to Databricks.
If you have a simple one-time migration: JDBC to SAP HANA may be sufficient for bulk extraction.
If you already have SAP Data Intelligence: Use the SAP-side extraction pipeline with Auto Loader on the Databricks side.
If you need custom SAP business logic: Use PyRFC in Databricks notebooks for RFC/BAPI-based extraction.
Let me know which approach best fits your scenario and I can help with more specific guidance on implementation.
Helpful links:
- Lakeflow Connect overview: https://docs.databricks.com/ingestion/lakeflow-connect/
- Ingestion overview: https://docs.databricks.com/ingestion/overview
- Partner Connect: https://docs.databricks.com/partner-connect/
- Fivetran setup: https://docs.databricks.com/partners/ingestion/fivetran
- JDBC connectivity: https://docs.databricks.com/integrations/jdbc-oss/
- Auto Loader: https://docs.databricks.com/ingestion/cloud-object-storage/auto-loader/
- Lakeflow Spark Declarative Pipelines (SDP): https://docs.databricks.com/delta-live-tables/
- Unity Catalog: https://docs.databricks.com/data-governance/unity-catalog/
* This reply was drafted with an agent system I built, which researches and drafts responses from the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update the reply when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.