IvoEverts, Databricks Employee

Time series for OT/IT convergence in energy and manufacturing
Comparing Spark, Tempo and ADX


Introduction

Time series data is instrumental to the convergence of Operational Technology (OT) and Information Technology (IT) because it can represent a detailed, chronological view of physical and digital processes in industrial environments. OT systems, which monitor and control physical assets like machines and sensors, generate vast amounts of time-stamped data reflecting operational performance, anomalies, and environmental conditions. Integrating this data with IT systems, which handle data analytics, business intelligence, and decision-making, enables organizations to achieve real-time insights and predictive capabilities. Time series data is therefore a key driver of Industry 4.0, enabling manufacturers to create digital twins of entire production lines and implement predictive maintenance strategies. By bridging the gap between physical operations and digital systems, OT/IT convergence facilitates the transformation towards smart assets, where real-time data exchange enhances operations and decision-making.

In the following, we present different options for time series manipulation and compare them in terms of performance and cost. The code can be found at https://github.com/ivoeverts/ts-benchmark.

Scenario


With most IT running in the cloud these days, we consider a scenario in which sensor measurements are already ingested into a cloud platform for monitoring and analytics, e.g. by sending them over an event bus.
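For concreteness, such an ingestion stream could be consumed with Spark Structured Streaming through the Kafka-compatible endpoint of Azure Event Hubs, along the lines of the sketch below. The namespace, topic name, and connection string are placeholders, and the shaded JAAS module name assumes a Databricks runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read sensor events from an Azure Event Hub through its Kafka-compatible
# endpoint; <namespace>, the topic, and <connection-string> are placeholders.
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
    .option("subscribe", "sensor-measurements")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule '
        'required username="$ConnectionString" password="<connection-string>";',
    )
    .load()
)
```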

In this scenario it is very typical for values to be missing and/or lagging; hence, any pipeline contains steps for interpolation and resampling. With interpolation, missing values are inferred from adjacent measurements so as to ensure a fixed frequency. Resampling ensures a fixed time interval over which aggregated values are observed, for meaningful analyses. Finally, time series are usually presented and analysed in some aggregated form, typically as a rolling average over some period.



Note
This is not to say that time series data should always be persisted in a bronze-silver-gold scheme; best practices differ per use case.


Tech stack

A common platform choice in the industry is Azure, with its popular Event Hubs service for streaming data ingestion and Azure Data Explorer (ADX) for subsequent time series processing. As an alternative to a managed service such as ADX, open source implementations are available in Spark and its ecosystem, of which we consider the following:

  • Vanilla Spark, with custom implementations for interpolation, resampling, and aggregation using the PySpark API.
  • Tempo (https://databrickslabs.github.io/tempo/), a Databricks Labs project that simplifies time series manipulation in Spark.

Each of these offers an approach to handling time series data, but they vary in computational complexity, cost, and optimal use cases.

Results

For this benchmark, we consider time series data of the form:

<TagID, Timestamp, Value1, Value2, Value3, Value4, Value5>

where the values can represent, for example, inlet and outlet temperature, inlet and outlet pressure, and humidity.
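For reference, this record layout could be expressed as the following PySpark schema (a minimal sketch; the column names simply follow the tuple above):

```python
from pyspark.sql.types import (
    DoubleType, StringType, StructField, StructType, TimestampType
)

# Schema matching <TagID, Timestamp, Value1..Value5>
schema = StructType(
    [
        StructField("TagID", StringType(), nullable=False),
        StructField("Timestamp", TimestampType(), nullable=False),
    ]
    + [StructField(f"Value{i}", DoubleType(), nullable=True) for i in range(1, 6)]
)
```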

We generate such a datapoint every second over a period of 5 minutes, with a probability of 95% (i.e. 5% of the data is ‘missing’), for a total of 50k unique tags. Every reported result is the average over 3 runs, where we typically observe a standard deviation of around 10% of the average due to infrastructure fluctuations. For the resampling and aggregation steps, we consider 30- and 60-second windows and report the average; we did not observe noteworthy variations there. More details on the experimental setup can be found below; the respective code snippets are available on demand. For compute, we use 2 Standard_E8ads_v5 instances (8 cores, 64 GB each) on both ADX and Databricks (for Spark and Tempo).
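As an illustration, a dataset with these characteristics could be synthesized in PySpark roughly as follows. This is a sketch, not the benchmark's exact generator; the tag naming and the base timestamp are arbitrary choices.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

N_TAGS, DURATION_S, P_PRESENT = 50_000, 300, 0.95
BASE_TS = 1_700_000_000  # arbitrary epoch-second starting point

df = (
    # One row per tag
    spark.range(N_TAGS)
    .select(F.concat(F.lit("tag-"), F.col("id").cast("string")).alias("TagID"))
    # One candidate measurement per tag per second
    .crossJoin(
        spark.range(DURATION_S)
        .select((F.lit(BASE_TS) + F.col("id")).cast("timestamp").alias("Timestamp"))
    )
    # Drop ~5% of rows to simulate missing measurements
    .where(F.rand() < P_PRESENT)
    # Five random sensor values per measurement
    .select("TagID", "Timestamp", *[F.rand().alias(f"Value{i}") for i in range(1, 6)])
)
```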

Interpolation and Resampling

For interpolation we consider a simple piecewise-constant implementation, i.e. filling missing values with the closest available measurement. The resampling step then boils down to sampling the time series on a regular temporal grid. A sketch of a possible vanilla Spark implementation is shown below, followed by the processing times measured for each scenario.
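The following is a minimal sketch of these two steps in vanilla PySpark, under simplifying assumptions: df holds the raw measurements in the schema above, the 1-second grid is derived from the observed time range, and "closest available" is read as carrying the last observation forward.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()
value_cols = [f"Value{i}" for i in range(1, 6)]

# Regular 1-second grid per tag, spanning the observed time range
bounds = df.agg(F.min("Timestamp").alias("t0"), F.max("Timestamp").alias("t1")).first()
grid = (
    df.select("TagID").distinct()
    .crossJoin(
        spark.range(int(bounds.t0.timestamp()), int(bounds.t1.timestamp()) + 1)
        .select(F.col("id").cast("timestamp").alias("Timestamp"))
    )
)

# Piecewise-constant interpolation: left-join the observations onto the grid
# and carry the last observed value forward within each tag
w = (
    Window.partitionBy("TagID")
    .orderBy("Timestamp")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
interpolated = grid.join(df, ["TagID", "Timestamp"], "left").select(
    "TagID",
    "Timestamp",
    *[F.last(c, ignorenulls=True).over(w).alias(c) for c in value_cols],
)

# Resampling then amounts to sampling the filled series on a coarser
# regular grid, e.g. every 30 seconds
resampled = interpolated.where(F.unix_timestamp("Timestamp") % 30 == 0)
```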

[Figure: processing times for interpolation and resampling with ADX, Spark, and Tempo]

Spark vs ADX: the main difference here can be attributed to the fact that ADX uses local SSDs as a hot cache, including for writes. Most of Spark’s compute time is spent on writing the results; hence Spark is slower for simpler operations such as resampling, but faster when operations become more involved, as with interpolation.

Spark vs Tempo: as can be expected from an abstraction layer such as Tempo, there is some computational overhead, which shows in the ~20% higher processing time for resampling. In the case of interpolation, the Tempo implementation performs a resampling step as well, which we think is the main reason for the large increase in processing time.
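For comparison, the same steps in Tempo reduce to one-liners on a TSDF. The sketch below is based on the Tempo documentation (argument values mirror the 30-second scenario; df is the raw measurements DataFrame from above).

```python
from tempo import TSDF

value_cols = [f"Value{i}" for i in range(1, 6)]

# Wrap the raw measurements in a time-series DataFrame
tsdf = TSDF(df, ts_col="Timestamp", partition_cols=["TagID"])

# Resample to 30-second buckets, taking the mean per bucket
resampled = tsdf.resample(freq="30 seconds", func="mean")

# Resample and fill empty buckets by carrying values forward
interpolated = tsdf.interpolate(
    freq="30 seconds", func="mean", target_cols=value_cols, method="ffill"
)
```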

Aggregation

As a final processing step we consider a rolling average, and add a scenario in which there are 10k unique tags in addition to the 50k scenario. The table below shows processing time and associated cost for ADX and Spark.
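As an illustration, such a rolling average can be expressed in vanilla Spark as follows. This sketch assumes the resampled DataFrame from the interpolation sketch above and shows the 60-second window variant.

```python
from pyspark.sql import functions as F, Window

value_cols = [f"Value{i}" for i in range(1, 6)]

# 60-second rolling average per tag, over event time
w = (
    Window.partitionBy("TagID")
    .orderBy(F.unix_timestamp("Timestamp"))
    .rangeBetween(-59, 0)
)
rolling = resampled.select(
    "TagID",
    "Timestamp",
    *[F.avg(c).over(w).alias(f"{c}_avg60") for c in value_cols],
)
```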

 

| Unique tags | Method | Duration (sec) | Cost ($) |
|-------------|--------|----------------|----------|
| 10k         | ADX    | 7.4            | 0.0014   |
| 10k         | Spark  | 7.8            | 0.0028   |
| 50k         | ADX    | 30.7           | 0.0056   |
| 50k         | Spark  | 11.0           | 0.0037   |

Again, the results show that Spark is especially capable when the size and complexity of the workload increase, whereas ADX shows a favorable cost/performance ratio in the 10k scenario.

Conclusion

We conclude that the main differences in cost/performance between ADX and Spark relate to the size and complexity of the workload. Spark is of course particularly well known to excel at large, complex workloads, and this is no different for time series data and the associated operations.

An abstraction layer on top of Spark such as Tempo provides simplicity in terms of development, and naturally comes with some computational overhead.

Next steps

In the next iteration of benchmarking time series processing methods, we aim for a pure streaming scenario, as opposed to the multi-hop batch scenario considered here. We will also look into other proprietary and open source alternatives.

Acknowledgments

Appreciation goes out to the team at Celebal Technologies: Manu Singh, Bharat Singh, Mayank Jhamb, Khushi Garg, Akash Chakraborty and Deepanker Panchal, and Romy Li from Databricks.
