As a global software-as-a-service (SaaS) company specializing in providing intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks has built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute with a 30-minute SLA—while ensuring tenant-level data isolation in a multi-tenant setup.
Freshworks’ legacy pipeline was built with Python consumers, where user actions triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed those events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded the batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth, but it soon hit limits as event volume surged.
Rapid growth exposed core challenges: as scale and complexity increased, the fragility and operational overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support business growth and analytics needs.
The solution? A foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.
As Freshworks grew, it became clear that we needed a more powerful, flexible, and optimized data pipeline—and that’s exactly what we set out to build. We designed a single, streamlined architecture where Spark Structured Streaming directly consumes from Kafka, transforms data, and writes it into Delta Lake—all in one job, running entirely within Databricks.
This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time to insight.
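A minimal sketch of the job's wiring is shown below. The brokers, topic, paths, and the `transform` and `upsert_batch` helpers (sketched in the following sections) are illustrative assumptions rather than Freshworks' actual code:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("events-ingestion").getOrCreate()

# Read raw product events from Kafka; brokers and topic are placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "product-events")
    .option("startingOffsets", "latest")
    .load()
    .select(
        F.col("key").cast("string").alias("event_key"),
        F.col("value").cast("string").alias("payload"),
        F.col("timestamp").alias("kafka_ts"),
    )
)

# Transformation and upsert happen inside one streaming job: no intermediate
# Kafka topics, CSV batches, or Airflow loads.
query = (
    raw.writeStream
    .foreachBatch(lambda df, batch_id: upsert_batch(transform(df), batch_id))
    .option("checkpointLocation", "s3://bucket/checkpoints/product_events")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```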
Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost efficiency.
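The individual steps aren't enumerated here, but a typical sequence of JSON parsing, flattening, and deduplication on an event ID (with a hypothetical schema and field names) might look like this:

```python
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical event schema; real schemas vary per product and evolve over time.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("tenant_id", StringType()),
    StructField("event_type", StringType()),
    StructField("occurred_at", TimestampType()),
    StructField("attributes", StringType()),
])

def transform(batch_df: DataFrame) -> DataFrame:
    """Parse, flatten, and deduplicate one micro-batch of raw Kafka payloads."""
    parsed = batch_df.withColumn("event", F.from_json("payload", event_schema))
    flattened = parsed.select("kafka_ts", "event.*")
    # Drop duplicates introduced by at-least-once Kafka delivery, keyed on the event ID.
    return flattened.dropDuplicates(["event_id"])
```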
Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations.
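The specific optimizations aren't listed above, but a common Databricks pattern is to upsert each micro-batch with MERGE and enable optimized writes and auto-compaction on the target table. The table name, merge keys, and properties below are illustrative assumptions:

```python
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    """Merge a transformed micro-batch into the target Delta table (idempotent upsert)."""
    target = DeltaTable.forName(spark, "analytics.product_events")
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.tenant_id = s.tenant_id AND t.event_id = s.event_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

# One-time table setup: optimized writes and auto-compaction keep file sizes healthy
# without manual OPTIMIZE runs (illustrative settings; actual tuning will differ).
spark.sql("""
    ALTER TABLE analytics.product_events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```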
To optimize costs while meeting business SLAs, the pipeline incorporates customized autoscaling that dynamically adjusts system capacity, scaling up or down to efficiently handle workload volume without sacrificing performance.
Autoscaling is driven by batch lag and execution time, monitored in real time. The required resizing is triggered through the Databricks Jobs API from the onQueryProgress callback of Spark’s StreamingQueryListener after each batch, ensuring in-flight processing isn’t disrupted. This keeps the system responsive, resilient, and efficient.
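A sketch of this pattern is below. The thresholds, credentials, and endpoint are placeholders, and the Clusters API resize call stands in for whatever Jobs API call the pipeline actually makes:

```python
import requests
from pyspark.sql.streaming import StreamingQueryListener

def request_resize(num_workers: int) -> None:
    """Hypothetical helper: ask the Databricks REST API to resize the cluster."""
    requests.post(
        "https://<workspace-url>/api/2.0/clusters/resize",
        headers={"Authorization": "Bearer <token>"},
        json={"cluster_id": "<cluster-id>", "num_workers": num_workers},
        timeout=30,
    )

class AutoscaleListener(StreamingQueryListener):
    """Adjusts cluster size after each micro-batch based on how long the batch took."""

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        batch_seconds = event.progress.batchDuration / 1000.0  # reported in milliseconds
        # Illustrative thresholds: scale out when batches run long, scale in when there is headroom.
        if batch_seconds > 120:
            request_resize(num_workers=20)
        elif batch_seconds < 30:
            request_resize(num_workers=5)

    def onQueryTerminated(self, event):
        pass

spark.streams.addListener(AutoscaleListener())
```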
To maintain data integrity and availability, the architecture includes robust fault tolerance.
This design preserves data integrity without human intervention, even during peak loads or schema changes, and allows failed events to be republished later.
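One way to make failed events replayable (a sketch under assumptions, reusing the hypothetical event_schema and upsert_batch from the earlier sketches, not necessarily the exact mechanism Freshworks uses) is to quarantine records that fail parsing into their own Delta table within the same micro-batch:

```python
from pyspark.sql import functions as F

def merge_with_quarantine(batch_df, batch_id):
    """Split a micro-batch into valid and failed records, quarantining the failures."""
    parsed = batch_df.withColumn("event", F.from_json("payload", event_schema))

    valid = parsed.filter(F.col("event").isNotNull()).select("kafka_ts", "event.*")
    failed = parsed.filter(F.col("event").isNull())

    upsert_batch(valid, batch_id)  # merge valid events into the main Delta table

    # Keep the raw payloads of failed events so they can be republished later.
    (failed.select("payload", "kafka_ts")
        .withColumn("batch_id", F.lit(batch_id))
        .write.format("delta").mode("append")
        .saveAsTable("analytics.failed_events"))
```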
Complementing the observability built into Databricks, our enterprise monitoring stack of Prometheus, Grafana, and Elasticsearch integrates seamlessly with the platform, giving us end-to-end visibility.
Metrics from this stack are used to track transformation health, identify issues, and trigger alerts for quick investigation.
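As one example of how batch-level metrics can reach this stack, per-batch progress statistics can be pushed to a Prometheus Pushgateway from the listener's onQueryProgress callback shown earlier. The gateway address and job name below are assumptions, not Freshworks' actual setup:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_batch_metrics(progress):
    """Push per-micro-batch streaming metrics to a Prometheus Pushgateway."""
    registry = CollectorRegistry()
    Gauge("batch_duration_ms", "Micro-batch duration in ms", registry=registry).set(progress.batchDuration)
    Gauge("input_rows", "Rows read in this micro-batch", registry=registry).set(progress.numInputRows)
    Gauge("processed_rows_per_second", "Processing rate", registry=registry).set(
        progress.processedRowsPerSecond or 0.0
    )
    # Hypothetical gateway address and job label; Grafana dashboards and alerts read from Prometheus.
    push_to_gateway("pushgateway.internal:9091", job="events_ingestion", registry=registry)
```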
Perhaps the most transformative shift has been in simplicity.
What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We’ve eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Fewer moving parts means fewer surprises and more confidence.
By reimagining the data stack around Spark Structured Streaming and Delta Lake, we’ve built a system that not only meets today’s scale but is ready for tomorrow’s growth.
As we reimagined our data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that we were confident would continue to meet the evolving needs of Freshworks.
Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development.
Key capabilities like automated resource allocation, unified batch and streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allow us to maintain consistent throughput, even during traffic spikes or infrastructure hiccups.
Databricks’ disk caching has proven to be the key factor in meeting the required data latency, as most merges are served from hot data stored in the disk cache.
Because it automatically detects changes in underlying data files and keeps the cache up to date, batch processing intervals consistently meet the SLA.
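Disk caching is typically enabled through a cluster-level Spark setting; the flag below is the standard Databricks configuration key, while any sizing limits are defined in the cluster's Spark config at creation time:

```python
# Enable Databricks disk caching so frequently merged Parquet/Delta files are served
# from fast local storage instead of being re-fetched from S3 on every batch.
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```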
Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale.
| Delta Lake Feature | SaaS Pipeline Benefit |
| --- | --- |
| ACID Transactions | Freshworks leverages Delta Lake’s ACID compliance to guarantee data consistency during high-frequency streaming from multiple sources and concurrent write operations. |
| Schema Evolution | Product schemas are constantly evolving. Delta Lake’s schema evolution adapts to changing requirements, with new schemas seamlessly applied to Delta tables and automatically picked up by the Spark Structured Streaming applications. |
| Time Travel | With millions of transactions, the ability to go back to a snapshot of the data in Delta Lake supports auditing and point-in-time rollback. |
| Scalable Change Handling & Deletion Vectors | Delta Lake enables efficient insert/update/delete operations through transaction logs and deletion vectors, without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to a few minutes in our pipelines. |
| Open Format | Because Freshworks is a multi-tenant SaaS system, the open Delta format is critical: it ensures interoperability with a wide range of analytics tools on top of the lakehouse and supports multi-tenant read operations. |
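As a concrete illustration of two of these features (with hypothetical table names, version numbers, and a placeholder new_events_df DataFrame), schema evolution and time travel look roughly like this:

```python
# Schema evolution: new columns arriving from upstream products are merged into the table schema.
(new_events_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("analytics.product_events"))

# Time travel: read the table as of an earlier version or timestamp for audits and rollback checks.
snapshot_by_version = spark.read.option("versionAsOf", 42).table("analytics.product_events")
snapshot_by_time = spark.read.option("timestampAsOf", "2024-06-01 00:00:00").table("analytics.product_events")
```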
By combining the speed of Apache Spark on Databricks, Delta Lake’s reliability, and Databricks’ integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks’ real-time analytics.
No transformation is without its challenges. Along the way, we encountered a few issues and surprises that taught us valuable lessons:
Fix: Switching to Delta-based caching for deduplication drastically improved memory efficiency and stability. The overall S3 list cost and memory footprint were vastly reduced, helping to lower the time and cost of data deduplication.
Fix: Clustering on a single primary column led to better file organization and significantly faster queries by optimizing data scans.
Fix: We had to introduce weekly job restarts to mitigate prolonged GC cycles and performance degradation.
Fix: Repartitioning before transformations ensured a balanced data distribution, evening out the data processing load and improving throughput.
Fix: We implemented an anti-join before merges and early discarding of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded.
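The last fix, for example, can be sketched roughly as follows; the key columns, watermark column, and function name are hypothetical:

```python
from pyspark.sql import functions as F

def prefilter_batch(batch_df, existing_keys_df, watermark_ts):
    """Drop late-arriving and already-ingested records before the Delta merge runs."""
    fresh = batch_df.filter(F.col("occurred_at") >= F.lit(watermark_ts))
    # Left anti-join removes events whose keys are already present in the target table.
    return fresh.join(existing_keys_df, on=["tenant_id", "event_id"], how="left_anti")
```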
By using Databricks and Delta Lake, Freshworks has redefined its data architecture, moving from fragmented, manual workflows to a modern, unified, real-time platform.
The impact?
This transformation empowers every Freshworks customer, from IT to support teams, to make faster, data-driven decisions without worrying about whether the data volumes behind their business needs can be processed and delivered on time.