AnandChiddarwar

Freshworks is a global software-as-a-service (SaaS) company that provides intuitive, AI-powered business solutions designed to enhance customer and employee experiences. It depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute within a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.

Legacy Architecture and the Case for Change

Freshworks’ legacy pipeline was built with Python consumers, where user actions triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed those events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded the batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth, but it soon hit limits as event volume surged.

Rapid growth exposed core challenges, including:

  • Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
  • Operational complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
  • Cost inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and lack of optimization.
  • Responsiveness: The legacy setup couldn’t meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays impaired data freshness and impacted customer insights.

As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support business growth and analytics needs.

[Figure: legacy ingestion pipeline architecture]

New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake

The solution? A foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.

As Freshworks grew, it became clear that we needed a more powerful, flexible, and optimized data pipeline—and that’s exactly what we set out to build. We designed a single, streamlined architecture where Spark Structured Streaming directly consumes from Kafka, transforms data, and writes it into Delta Lake—all in one job, running entirely within Databricks.

[Figure: new streaming architecture with Spark Structured Streaming and Delta Lake on Databricks]

This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time to insight.

Key Components of the New Architecture

The Streaming Component: Spark Structured Streaming

Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost efficiency:

  • Efficient deduplication. Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter out duplicates across streaming batches (see the sketch after this list).
  • Data validation. Schema and business rules filter malformed records, ensure required fields are present, and handle nulls.
  • Custom transformations with JSON-e. The JSON-e templating engine supports conditionals, loops, and Python user-defined functions, allowing product teams to define dynamic, reusable transformation logic tailored to each product.
  • Flattening to tabular form. JSON events are transformed and flattened into thousands of structured tables. An internal schema management tool, which handles over 20,000 tables and 5 million columns, enables product teams to manage table schemas. This tool also automates the promotion of schema changes to production by registering them in Delta Lake; the changes are then seamlessly integrated by Spark Structured Streaming.
  • Flattened data deduplication. A hash of stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs.
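
Below is a minimal sketch of the UUID-based deduplication step, written for a foreachBatch handler. The table name, column names, and helper function are illustrative assumptions, not Freshworks' production code:

```python
from pyspark.sql import functions as F

# Illustrative names; the real pipeline uses its own catalog and schema layout.
PROCESSED_UUIDS_TABLE = "events.processed_uuids"

def dedupe_batch(spark, batch_df):
    """Drop events whose UUID was already seen in earlier streaming batches."""
    seen = spark.read.table(PROCESSED_UUIDS_TABLE).select("event_uuid")

    # Keep only events whose UUID is absent from the processed-UUID Delta table.
    fresh = batch_df.join(seen, on="event_uuid", how="left_anti")

    # Record this batch's UUIDs so subsequent batches can filter against them.
    (fresh.select("event_uuid", F.current_timestamp().alias("processed_at"))
          .write.format("delta").mode("append")
          .saveAsTable(PROCESSED_UUIDS_TABLE))

    return fresh
```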

The Storage Component: Lakehouse

Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:

  • Parallel writes with multiprocessing. A single Spark job typically writes to ~250 Delta tables, each with its own transformation logic. Delta merges are executed in parallel using Python multiprocessing, maximizing cluster utilization and reducing latency (a simplified sketch follows this list).
  • Efficient updates with deletion vectors. Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage deletion vectors to enable soft deletes. This improves update performance by 3x, making real-time updates practical even at terabyte scale.
  • Accelerated merges with disk caching. Disk caching ensures that frequently accessed (hot) data remains in memory. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Today, 95% of merge reads are served directly from the cache.
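
The sketch below illustrates the parallel-merge pattern. The blog describes Python multiprocessing; this version uses a thread pool, which is one common way to issue concurrent Delta merges from a single Spark driver. Table names, the merge key, the pool size, and the table property call are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable

def merge_one_table(spark, table_name, updates_df, key_col="id"):
    """Upsert one flattened DataFrame into its Delta table."""
    # Deletion vectors let updates/deletes mark rows instead of rewriting files;
    # in practice this property would be set once per table, not on every merge.
    spark.sql(f"ALTER TABLE {table_name} "
              "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(updates_df.alias("s"), f"t.{key_col} = s.{key_col}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

def merge_all_tables(spark, per_table_updates, max_workers=16):
    """Run the per-table merges concurrently to keep the cluster busy."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(merge_one_table, spark, name, df)
                   for name, df in per_table_updates.items()]
        for future in futures:
            future.result()  # Surface any merge failure instead of swallowing it.
```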

Autoscaling and Adapting in Real Time

To optimize costs while meeting business SLAs, the pipeline incorporates customized autoscaling that dynamically adjusts system capacity, scaling up or down to efficiently handle workload volume without sacrificing performance.

Autoscaling is driven by batch lag and execution time, monitored in real time. The required resizing is triggered through the Jobs APIs from the onQueryProgress callback of Spark's StreamingQueryListener after each batch, ensuring in-flight processing isn't disrupted. This keeps the system responsive, resilient, and efficient.
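
A hedged sketch of that hook is shown below, using PySpark's StreamingQueryListener and the Clusters resize endpoint as one concrete way to apply the scaling decision. The thresholds, worker counts, and workspace placeholders are illustrative assumptions, not the production logic:

```python
import requests
from pyspark.sql.streaming import StreamingQueryListener

# Placeholders; the real pipeline derives these from its SLA and workload profile.
DATABRICKS_HOST = "https://<workspace-url>"
DATABRICKS_TOKEN = "<api-token>"
CLUSTER_ID = "<cluster-id>"
TARGET_BATCH_SECONDS = 120

def resize_cluster(num_workers):
    """Resize the cluster via the Databricks Clusters API (illustrative wrapper)."""
    requests.post(f"{DATABRICKS_HOST}/api/2.0/clusters/resize",
                  headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
                  json={"cluster_id": CLUSTER_ID, "num_workers": num_workers})

class AutoscaleListener(StreamingQueryListener):
    """Check execution time after every micro-batch and scale up or down."""

    def __init__(self, min_workers=4, max_workers=40):
        self.workers = min_workers
        self.min_workers, self.max_workers = min_workers, max_workers

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        batch_seconds = event.progress.durationMs.get("triggerExecution", 0) / 1000.0
        if batch_seconds > TARGET_BATCH_SECONDS and self.workers < self.max_workers:
            self.workers += 2
            resize_cluster(self.workers)
        elif batch_seconds < TARGET_BATCH_SECONDS / 2 and self.workers > self.min_workers:
            self.workers -= 2
            resize_cluster(self.workers)

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

# Registered once per application: spark.streams.addListener(AutoscaleListener())
```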

Built-in Resilience: Handling Failures Gracefully

To maintain data integrity and availability, the architecture includes robust fault tolerance:

  • Events that fail transformation are retried via Kafka with backoff logic.
  • Permanently failed records are stored in a Delta table for offline review and reprocessing, ensuring no data is lost.

This design preserves data integrity without human intervention, even during peak loads or schema changes, and allows failed events to be republished later (a minimal sketch follows).
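
Here is a minimal sketch of that dead-letter path, written as part of a foreachBatch handler. The retryable flag, table name, retry topic, and broker address are illustrative assumptions:

```python
from pyspark.sql import functions as F

FAILED_EVENTS_TABLE = "events.failed_events"   # illustrative dead-letter Delta table
RETRY_TOPIC = "events-retry"                   # illustrative Kafka retry topic
KAFKA_BROKERS = "<broker:9092>"                # placeholder

def handle_failures(failed_df, batch_id):
    """Route failed records: retryable ones back to Kafka, the rest to Delta."""
    # Retryable failures are republished to Kafka; the consumer applies backoff.
    (failed_df.filter(F.col("retryable"))
              .selectExpr("CAST(event_uuid AS STRING) AS key",
                          "CAST(raw_payload AS STRING) AS value")
              .write.format("kafka")
              .option("kafka.bootstrap.servers", KAFKA_BROKERS)
              .option("topic", RETRY_TOPIC)
              .save())

    # Permanently failed records are parked in Delta for offline review and replay.
    (failed_df.filter(~F.col("retryable"))
              .withColumn("batch_id", F.lit(batch_id))
              .withColumn("failed_at", F.current_timestamp())
              .write.format("delta").mode("append")
              .saveAsTable(FAILED_EVENTS_TABLE))
```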

Observability and Monitoring at Every Step

Complementing the observability built into Databricks, our enterprise monitoring stack of Prometheus, Grafana, and Elasticsearch integrates with the platform to give end-to-end visibility:

  • Every batch in Databricks logs key metrics, such as input record count, transformed records, and error rates, to provide real-time alerts to the support team (a sketch of the metric export follows this list).
  • Event statuses are logged to enable fine-grained debugging, allowing both product teams (the producers) and analytics teams (the consumers) to trace issues.
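
The sketch below shows one way to export per-batch metrics, assuming the prometheus_client library and a Pushgateway; the metric names, job label, and gateway address are illustrative, not the production setup:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY_ADDR = "pushgateway.internal:9091"   # placeholder address

def publish_batch_metrics(batch_id, input_count, transformed_count, error_count):
    """Push per-batch counters so Grafana dashboards and alert rules stay current."""
    registry = CollectorRegistry()
    for name, value in [("ingestion_batch_id", batch_id),
                        ("ingestion_input_records", input_count),
                        ("ingestion_transformed_records", transformed_count),
                        ("ingestion_error_records", error_count)]:
        Gauge(name, f"{name} for the last completed micro-batch",
              registry=registry).set(value)

    push_to_gateway(PUSHGATEWAY_ADDR, job="spark-streaming-ingestion",
                    registry=registry)
```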

[Monitoring dashboards: batch-level ingestion and transformation metrics]
Transformation health can be tracked using the above metrics to identify issues and trigger alerts for quick investigations.

From Complexity to Confidence

Perhaps the most transformative shift has been in simplicity.

What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We’ve eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Fewer moving parts mean fewer surprises and more confidence.

By reimagining the data stack around Spark Structured Streaming and Delta Lake, we’ve built a system that not only meets today’s scale but is ready for tomorrow’s growth.

Why Databricks?

As we reimagined our data architecture, we evaluated several technologies, including Amazon EMR with Spark, Apache Flink, and Databricks. After rigorous benchmarking, Databricks emerged as the clear choice, offering a unique blend of performance, simplicity, and ecosystem alignment that we were confident would continue to meet the evolving needs of Freshworks.

A Unified Ecosystem for Data Processing

Rather than stitching together multiple tools, Databricks offers an end-to-end platform that spans job orchestration, data governance, and CI/CD integration, reducing complexity and accelerating development:

  • Unity Catalog acts as the single source of truth for data governance. With granular access control, lineage tracking, and centralized schema management, it ensures our teams are:
    • Able to secure all data assets, organize data access for each tenant, and preserve strict access boundaries (a minimal example follows this list)
    • Compliant with regulatory needs, with all events and actions being audited in the audit tables, including information on who has access to which assets
  • Databricks Jobs provide built-in orchestration, ending our reliance on external orchestrators like Airflow. Native scheduling and pipeline execution reduce operational friction and improve reliability.
  • CI/CD and REST APIs helped Freshworks’ teams automate everything, from job creation and cluster scaling to schema updates. This automation has accelerated releases, improved consistency, and minimized manual errors, allowing us to experiment fast and learn fast.
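
As one minimal example of the tenant-level boundaries mentioned above, Unity Catalog access can be granted and revoked with SQL issued from a notebook or job (where spark is predefined). The catalog, schema, and group names here are illustrative:

```python
# Grant a tenant-scoped reader group access to exactly one schema.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `tenant_42_readers`")
spark.sql("GRANT USE SCHEMA ON SCHEMA analytics.tenant_42 TO `tenant_42_readers`")
spark.sql("GRANT SELECT ON SCHEMA analytics.tenant_42 TO `tenant_42_readers`")

# Revocation is symmetrical, and every grant/revoke is recorded in the audit logs.
spark.sql("REVOKE SELECT ON SCHEMA analytics.tenant_42 FROM `tenant_42_readers`")
```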

Optimized Spark Platform

Key capabilities like automated resource allocation, unified batch and streaming architecture, executor fault recovery, and dynamic scaling to process millions of records allow us to maintain consistent throughput, even during traffic spikes or infrastructure hiccups.

High-Performance Caching

Databricks’ disk caching has proven to be the key factor in meeting the required data latency, as most merges are served from hot data stored in the disk cache.

Its capability to automatically detect changes in underlying data files and keep the cache updated ensures that the batch processing intervals consistently meet the SLA.
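
For reference, the disk cache is controlled through Spark configuration; the sizing values below are illustrative and are normally set in the cluster's Spark config rather than at runtime:

```python
# Enable the Databricks disk (IO) cache for the current session.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Cluster-level Spark config entries (set at cluster creation), illustrative sizes:
#   spark.databricks.io.cache.maxDiskUsage      100g
#   spark.databricks.io.cache.maxMetaDataCache  1g
```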

Delta Lake: The Foundation for Real-Time and Reliable Ingestion

Delta Lake plays a critical role in the pipeline, enabling low-latency, ACID-compliant, high-integrity data processing at scale.

 

Delta Lake features and the benefits they bring to the SaaS pipeline:

  • ACID Transactions. Freshworks leverages Delta Lake’s ACID compliance to guarantee data consistency during high-frequency streaming from multiple sources and concurrent write operations.
  • Schema Evolution. The schemas of our products are constantly evolving. Delta Lake’s schema evolution capability adapts to changing requirements, with new schemas seamlessly applied to Delta tables and automatically picked up by Spark Structured Streaming applications.
  • Time Travel. With millions of transactions, the ability to go back to a snapshot of the data in Delta Lake supports auditing and point-in-time rollback needs (a brief example follows this list).
  • Scalable Change Handling & Deletion Vectors. Delta Lake enables efficient insert, update, and delete operations through transaction logs without rewriting large data files. This proved crucial in reducing ingestion latencies from hours to a few minutes in our pipelines.
  • Open Format. Because Freshworks is a multi-tenant SaaS system, the open Delta format is critical: it ensures interoperability with a wide range of analytics tools on top of the lakehouse and supports multi-tenant read operations.
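
As a brief illustration of two of these features, the snippet below shows a time travel query and merge-time schema evolution; the table name, version number, and config scope are illustrative assumptions:

```python
# Time travel: read a past snapshot of a table for auditing or point-in-time rollback.
snapshot = spark.sql("SELECT * FROM analytics.tickets VERSION AS OF 1024")
yesterday = spark.sql(
    "SELECT * FROM analytics.tickets TIMESTAMP AS OF date_sub(current_date(), 1)")

# Schema evolution: allow new source columns to flow into Delta tables during merges.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
```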

By combining the speed of Apache Spark on Databricks, the reliability of Delta Lake, and Databricks’ integrated platform, we built a scalable, robust, cost-effective, and future-ready foundation for Freshworks’ real-time analytics.

What We Learned: Key Insights

No transformation is without its challenges. Along the way, we encountered a few issues and surprises that taught us valuable lessons:

  • State store overhead—high memory footprint and stability issues. Using Spark’s dropDuplicatesWithinWatermark caused high memory use and instability, especially during autoscaling, and led to increased S3 list costs due to many small files.

Fix: Switching to Delta-based caching for deduplication drastically improved memory efficiency and stability. The overall S3 list cost and memory footprint were vastly reduced, helping to lower the time and cost of data deduplication.

  • Liquid clustering—common challenges. All the queries had a primary predicate with several other secondary predicates. Clustering on multiple columns resulted in sparse data distributions and increased data scans, reducing query performance.

Fix: Clustering on a single primary column led to better file organization and significantly faster queries by optimizing data scans.
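
A hedged sketch of the change, assuming Databricks’ liquid clustering SQL; the table and column names are illustrative:

```python
# Re-cluster on the single primary predicate column instead of several columns.
spark.sql("ALTER TABLE analytics.tickets CLUSTER BY (account_id)")

# Clustering is applied incrementally on write and can also be triggered explicitly.
spark.sql("OPTIMIZE analytics.tickets")
```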

  • Garbage collection (GC) issues—job restarts needed. Long-running jobs (7+ days) started experiencing performance slowness and more frequent garbage collection cycles.

Fix: We had to introduce weekly job restarts to mitigate prolonged GC cycles and performance degradation.

  • Data skew—handling Kafka topic imbalance. Different Kafka topics carried widely varying data volumes, leading to uneven data distribution across processing nodes, skewed task workloads, and nonuniform resource utilization.

Fix: Repartitioning before transformations ensured a balanced data distribution, evening out the data processing load and improving throughput.
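
A minimal sketch of that rebalancing step; the topic list, broker address, and partition count are illustrative:

```python
# Read from multiple Kafka topics with very different volumes, then repartition
# so heavy topics don't pin work to a handful of tasks downstream.
raw_events = (spark.readStream.format("kafka")
                   .option("kafka.bootstrap.servers", "<broker:9092>")
                   .option("subscribe", "product_a_events,product_b_events")
                   .load())

balanced_events = raw_events.repartition(512)  # even out per-task load before transforms
```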

  • Conditional merge—optimizing merge performance. Even if only a few columns were needed, the merge operations were loading all columns from the target table, which led to high merge times and I/O costs. 

Fix: We implemented an anti-join before merges and early discarding of late-arriving or irrelevant records, significantly speeding up merges by preventing unnecessary data from being loaded.
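
A simplified sketch of that optimization; the merge key, the row_hash column, and the table name are illustrative assumptions rather than the production implementation:

```python
from delta.tables import DeltaTable

def merge_changed_rows(spark, table_name, updates_df, key_col="id"):
    """Discard no-op rows with an anti-join before running the Delta merge."""
    existing = spark.read.table(table_name).select(key_col, "row_hash")

    # Rows whose key and content hash already match the target would not change
    # anything; dropping them up front avoids loading extra data into the merge.
    changed = updates_df.join(existing, on=[key_col, "row_hash"], how="left_anti")

    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(changed.alias("s"), f"t.{key_col} = s.{key_col}")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```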

Conclusion

By using Databricks and Delta Lake, Freshworks has redefined its data architecture, moving from fragmented, manual workflows to a modern, unified, real-time platform.

The impact?

  • 4x improvement in data sync time during traffic surges
  • ~25% cost saving due to scalable, cost-efficient operations with zero downtime
  • 50% reduction in maintenance effort
  • High availability and SLA-compliant performance, even during peak loads
  • Improved customer experience via real-time insights

This transformation empowers every Freshworks customer, from IT to support, to make faster, data-driven decisions without worrying about whether the data volumes behind their business needs will be processed and delivered on time.
