Hey everyone,
I’ve been building a data pipeline to bring SAP S/4HANA data into Databricks for real-time analytics and reporting. The setup relies on Delta Lake for storage, and while batch ingestion works fine, I’m running into issues with latency and micro-batch handling once I increase the refresh frequency.
It feels like the more frequently I pull data, the heavier the load becomes, especially on transactional tables with frequent updates. I’ve experimented with auto-scaling clusters and adjusting checkpoint/trigger intervals, but performance is still inconsistent.
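For context, here is a simplified sketch of what the micro-batch leg looks like right now. It assumes the SAP extracts already land as Parquet files in cloud storage (our actual extraction step is more involved), and the paths, table names, and trigger interval are placeholders:

```python
# Simplified sketch of the ingestion micro-batch leg (PySpark on Databricks).
# Assumes SAP extracts land as Parquet files in cloud storage;
# all paths and table names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (
    spark.readStream
    .format("cloudFiles")                       # Auto Loader for incremental file discovery
    .option("cloudFiles.format", "parquet")
    .load("/mnt/landing/s4hana/acdoca/")        # hypothetical landing path
)

query = (
    raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/acdoca_bronze/")
    .trigger(processingTime="1 minute")         # this is the interval I keep lowering
    .toTable("bronze.s4hana_acdoca")            # hypothetical bronze table
)
```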
Since this topic overlaps with what’s covered in the C_C4H51_2405 certification exam, I’ve been going through Pass4Future SAP practice resources to brush up on the data integration side, but I’d really value some hands-on advice from engineers who’ve done this at scale.
How do you manage incremental data loads from SAP into Databricks without hurting performance or reliability? Any architecture patterns, caching strategies, or configuration tweaks that made a noticeable difference?
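In case it helps frame the question, the update-heavy transactional tables are currently applied with a foreachBatch MERGE along these lines (again simplified; the merge keys and table names are hypothetical):

```python
# Sketch of how update-heavy transactional tables are applied today;
# merge keys and table names are hypothetical placeholders.
from delta.tables import DeltaTable

def upsert_to_silver(microbatch_df, batch_id):
    target = DeltaTable.forName(spark, "silver.s4hana_acdoca")
    (
        target.alias("t")
        .merge(
            microbatch_df.alias("s"),
            "t.rclnt = s.rclnt AND t.rbukrs = s.rbukrs AND t.belnr = s.belnr",  # hypothetical keys
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.table("bronze.s4hana_acdoca")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/mnt/checkpoints/acdoca_silver/")
    .trigger(processingTime="1 minute")
    .start()
)
```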
Appreciate any insights or lessons learned.
Britanney