Einstein, Time, and CEP

Ever since Einstein came up with special relativity over a hundred years ago, people have come to realize that the notion of time is not nearly as straightforward as it seems. Well, Einstein never had to build a CEP engine, or he would have learned a few more things about time!

It turns out that handling time correctly inside a CEP engine is a very complex issue, with many implications. Some approaches to time handling in CEP systems are discussed in depth in the 2004 paper by U. Srivastava and J. Widom: Flexible Time Management in Data Stream Systems, but the paper is very technical. There is a much more approachable paper on Coral8's web site written by our very own Bob Hagmann: Fundamentals of Time in Coral8. I am not going to repeat here what these papers say, but rather I want to talk about some practical implications, relevant to anybody using a CEP engine.

Before I go there, however, I want to explain why the whole issue is so intricate. The first reaction of some folks new to CEP, when first exposed to time management issues, is predictable: why are you making this so complex??? I'm sure Einstein has heard the same objection. But in this case, the objection seems quite logical. Most non-CEP applications just don't worry about time too much. Every computer has a system clock, which reliably keeps the time. Every operating system has a number of system calls to access this clock. So what's so hard about this?

The reason this problem is so hard is that CEP applications are typically highly distributed. Events are often generated far away from where they are processed. Think of sensor networks, or of trading applications that subscribe to data feeds from multiple exchanges. Even if the data source and the CEP engine are physically close to each other, there may be non-trivial latency in getting events in.

So by the time the events from multiple sources reach the CEP engine, they may be delayed, typically by different time deltas, and may even be out of order! If you want to analyze precisely what has happened, the system clock has very little relevance to the times when events were generated! So rather than using the system clock, the CEP engine must use a virtual stream clock, which is a clock driven by the arrival of events on one or more input streams.

If events arrive fast and in order, this virtual stream clock may just run slightly behind the system clock. But if events are delayed, and have to be synchronized across multiple event sources, and have to be pre-sorted to handle out-of-order issues, then the virtual stream clock is quite different from the system clock.
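To make the idea concrete, here is a minimal Python sketch of a virtual stream clock that merges several sources. It is an illustration only, not how the Coral8 Engine does it: the class name `VirtualStreamClock` and its `push` method are made up for this example, and it assumes that each individual source delivers its own events in timestamp order. The clock is simply the smallest "latest timestamp seen" across all sources, and buffered events are released only once the clock has reached them.

```python
import heapq


class VirtualStreamClock:
    """Hypothetical helper: merges events from several sources into timestamp order.

    Assumption: each source sends its own events in non-decreasing timestamp order.
    """

    def __init__(self, sources):
        # Latest timestamp seen per source; None until that source has sent something.
        self.latest = {name: None for name in sources}
        self.buffer = []  # min-heap of (timestamp, seq, source, payload)
        self.seq = 0      # tie-breaker so payloads are never compared
        self.now = None   # current virtual stream time

    def push(self, source, timestamp, payload):
        """Accept one event; return the events that are now safe to release."""
        self.latest[source] = timestamp
        heapq.heappush(self.buffer, (timestamp, self.seq, source, payload))
        self.seq += 1
        if any(t is None for t in self.latest.values()):
            return []  # cannot advance until every source has reported
        # No source can still send anything older than this.
        self.now = min(self.latest.values())
        released = []
        while self.buffer and self.buffer[0][0] <= self.now:
            ts, _, src, data = heapq.heappop(self.buffer)
            released.append((ts, src, data))
        return released


clock = VirtualStreamClock(["exchange_a", "exchange_b"])
clock.push("exchange_a", 10, "trade 1")         # held back: exchange_b is silent
clock.push("exchange_a", 20, "trade 2")         # still held back
out = clock.push("exchange_b", 15, "trade 3")   # clock moves to 15
print(out)
```

Notice that a slow source holds the clock back for everybody. That is exactly why the virtual stream clock can lag well behind the system clock.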

Therefore, the Coral8 Engine provides two modes for analyzing data: one according to the system clock, the other according to the virtual stream clock or, as we call it in our documentation, according to event timestamps. In this latter mode our engine can automatically synchronize and pre-sort events coming from multiple data sources. This takes the deep magic described in the Stanford paper by Srivastava and Widom.

If one is analyzing real-time data, the virtual stream clock is normally somewhat behind the system clock, due to transport layer delay. There are also use cases where the virtual stream clock runs much faster than the system clock! For example, one of the common use cases on Wall Street is back-testing of trading strategies. Once somebody comes up with a new strategy, they need to test it on historical data.

Playing historical data back to the CEP engine is not a problem, but people typically want to speed up this process as much as possible. So now the time is compressed! Einstein would appreciate some of the issues this causes. What in reality took 1 hour may take 1 minute in the accelerated playback mode. A jumping 1-hour window will be emptied every minute of wall-clock time, not every hour. Of course, the Coral8 Engine implements the accelerated mode correctly: no matter how fast you go, the results are guaranteed to be exactly the same.
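Here is a small Python sketch of why the results do not depend on playback speed. It is a toy, not Coral8 code: the function names `jumping_window_sums` and `replay` are invented for this example, and the windows are assumed to be aligned to multiples of the window length. The point is that window boundaries are computed from event timestamps alone, so the speedup factor changes only how long the replay sleeps between events.

```python
import time


def jumping_window_sums(events, window_seconds):
    """Sum values per jumping window, keyed by window start time.

    Window boundaries come from event timestamps only, never from the wall clock.
    """
    sums = {}
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] = sums.get(window_start, 0) + value
    return sums


def replay(events, speedup, sleep=time.sleep):
    """Yield events in order, pausing gap/speedup wall-clock seconds between them.

    speedup=None means "as fast as possible" (no pauses at all).
    """
    previous = None
    for timestamp, value in events:
        if previous is not None and speedup is not None:
            sleep((timestamp - previous) / speedup)
        previous = timestamp
        yield timestamp, value


# Two hours of (timestamp_in_seconds, value) history.
history = [(0, 1), (1800, 2), (3600, 5), (7199, 1), (7200, 4)]

pauses = []  # record the pauses instead of really sleeping
at_60x = jumping_window_sums(replay(history, 60, sleep=pauses.append), 3600)
flat_out = jumping_window_sums(replay(history, None), 3600)
print(at_60x == flat_out, at_60x)
```

At 60x the replay spends about two minutes of wall-clock time on two hours of data, yet both runs produce the same per-window sums.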

I've touched upon some of these notions in my post on Determinism in CEP, but I did not quite explain why some CEP engines are deterministic, and some less so. Hopefully now this is a bit clearer. Some engines handle virtual stream time, and some do not. Some handle virtual stream time for one stream, but not for many. Yet handling virtual stream time for multiple streams is the only way to address the issues we have talked about here. It may be non-trivial, but there is no way around it. As Einstein said, everything should be made as simple as possible, but not simpler.

Mark Tsimelzon, President & CTO, Coral8

question about a time use case

Hi Mark, what about this scenario: In production, the application needs to time out windows based on system time, having nothing to do with a timestamp that comes in with the data. We are not worried about data being delayed in this case. However, when playing back testing data at 5x speed, those windows need to stay open for 0.2 of the amount of time that they did in production. Or even better, when playing data in as fast as possible, the windows need to depend on the speed of the playback. How would you handle this situation? Hans

Answer

Ah, great question! Ok, here is a little secret about how the two modes are implemented in our engine: there are no two modes, there is only one! Inside the engine, time is always, always driven by event timestamps. The "use the system clock" mode is implemented by assigning the current timestamp (based on the system clock) to incoming messages before they hit the core query processor module.

Inside the core processing module, everything is exactly the same, regardless of what mode you are in. You don't create windows differently depending on whether you want to use event timestamps or the system clock. That's why it's easy to switch between the modes, do the acceleration/deceleration, or even implement the "play data as fast as possible" option. Time is only driven by event timestamps.
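To illustrate the "there is only one mode" idea, here is a rough Python sketch. Everything in it is hypothetical (the `stamp` and `core_count_per_window` names, the message layout as a dict with `timestamp` and `payload` keys); it is not the engine's actual code. The only place the two modes differ is the thin stamping step in front of the core, and the core itself looks at nothing but the timestamp it was handed.

```python
import time


def stamp(messages, use_system_clock, clock=time.time):
    """Hypothetical ingest step: attach the timestamp the core will see.

    In "system clock" mode the arrival time replaces whatever the message carried.
    """
    for message in messages:
        if use_system_clock:
            yield clock(), message["payload"]
        else:
            yield message["timestamp"], message["payload"]


def core_count_per_window(stamped, window_seconds):
    """Hypothetical core: counts events per window, driven by timestamps only."""
    counts = {}
    for timestamp, _payload in stamped:
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts


messages = [
    {"timestamp": 5, "payload": "a"},
    {"timestamp": 65, "payload": "b"},
    {"timestamp": 70, "payload": "c"},
]

# Mode 1: event timestamps, as carried by the data.
by_event_time = core_count_per_window(stamp(messages, False), 60)

# Mode 2: "system clock", using a fake clock here so the run is repeatable.
ticks = iter([1000.0, 1001.0, 1030.0])
by_arrival = core_count_per_window(stamp(messages, True, clock=lambda: next(ticks)), 60)

print(by_event_time, by_arrival)
```

The same `core_count_per_window` function serves both runs. Swapping in a different clock at the stamping step, for example one that runs faster than real time, would be one way to experiment with accelerated playback of system-time windows without touching the core.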