Determinism in CEP

I spent the last week with a bunch of Coral8 customers and prospects in New York, and one question came up several times: determinism. Specifically, there seems to be a lot of confusion about what makes a CEP engine deterministic, and about how one can easily test whether an engine is deterministic or not. I promised someone I would write a post on the subject, so here goes:

What is determinism? For the purpose of this post, we'll use a simplified definition:

A deterministic CEP engine will always produce the same results on the same input.

For example, if one stores streaming data in one or more files, and plays it back to the engine, then the engine will always produce the same results.

Why is determinism important?

Building complex CEP applications is hard enough, but it's pretty much impossible to build them without determinism. It's very hard to test an app or its parts if the engine is non-deterministic. How do you know that every small change in the app, in the algorithm or in the engine itself does not break anything? You don't.

Moreover, as you come up with new algorithms or strategies, you will often want to back-test them on existing data. How can you do this if the engine is non-deterministic, or if the results of running the algorithm on live data are different from the results of running it on historical data, not to mention running with acceleration to speed up your testing? You can't.

Interestingly enough, one can talk about degrees of determinism. Let's coin some terms here: an engine may be non-deterministic, single-stream deterministic, or multi-stream deterministic. Let's consider them in order:

Non-determinism

A non-deterministic engine does not produce reproducible results. By far the most common reason is that the engine does not process events according to event timestamps, and instead uses the time of arrival as the timestamp for the event. It probably also measures its windows against the system clock. This is a big no-no if one wants to achieve determinism, because arrival times change from one run to the next even when the data does not.

A good engine will typically have both options: to process events according to event timestamps, and to process events according to the time of arrival (the system clock). Both have advantages and disadvantages, but only the first option can guarantee determinism.
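To make the difference concrete, here is a minimal Python sketch (not Coral8 code) of a one-second sliding count driven first by event timestamps and then by arrival times. The function name `sliding_counts`, the sample timestamps, and the two simulated replays are all assumptions made up for illustration; times are integer milliseconds to keep the arithmetic exact.

```python
from collections import deque

def sliding_counts(times_ms, window_ms=1000):
    """For each time t, count the times that fall in (t - window_ms, t]."""
    live = deque()
    counts = []
    for t in times_ms:
        live.append(t)
        # Expire everything that is a full window (or more) older than t.
        while live[0] <= t - window_ms:
            live.popleft()
        counts.append(len(live))
    return counts

# Hypothetical recorded data: the timestamps carried inside the events.
event_times = [0, 400, 900, 1200, 2500]

# The same five events replayed twice. Arrival times differ between runs:
# run 1 is played at original speed, run 2 is played back accelerated.
arrivals_run1 = [10000, 10400, 10900, 11200, 12500]
arrivals_run2 = [20000, 20100, 20200, 20300, 20400]

# Driven by event timestamps: the answer depends only on the data.
by_event_time = sliding_counts(event_times)

# Driven by arrival time: the answer depends on how fast the data showed up.
by_arrival_run1 = sliding_counts(arrivals_run1)
by_arrival_run2 = sliding_counts(arrivals_run2)

print(by_event_time)
print(by_arrival_run1)
print(by_arrival_run2)
```

In this toy example the accelerated replay packs all five events into one window, so the arrival-time counts change while the event-time counts cannot.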

Single-stream determinism

There are engines on the market that can process events according to event timestamps, but only if the app has a single input stream. How can you test that the engine is deterministic for one stream? Here is a query you can run:

Insert Into Result
Select Count(*)
From S Keep 1 second;

This query should work in any good CEP engine, modulo trivial syntax changes. It creates a one-second sliding window on stream S, and for each incoming event it outputs the number of events currently in that window. If S has a reasonably high data rate (10,000+ events/sec), then one should be able to use this query as a test of single-stream determinism. Namely, if you run this query repeatedly on the same stored data, with the engine set to process events by event timestamp, it should produce the same results every time. Save the output of each run and compare the files (with 'diff', for example). If the results differ from run to run, then the engine is not single-stream deterministic.
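The test procedure itself can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `count_keep_1_second` is a stand-in I wrote to mimic the query above (including the assumption that the current event counts toward its own window), and the "recorded" stream is synthetic. In a real test you would replace the stand-in with a call that feeds the stored file to your engine and captures its output.

```python
import hashlib
import random
from collections import deque

def count_keep_1_second(events):
    """Stand-in for the query above: one output row per input event,
    holding the event timestamp and the size of the 1-second window."""
    window = deque()
    rows = []
    for ts_us, _payload in events:
        window.append(ts_us)
        while window[0] <= ts_us - 1_000_000:   # timestamps in microseconds
            window.popleft()
        rows.append(f"{ts_us},{len(window)}")
    return "\n".join(rows)

def digest(text):
    """Fingerprint of one run's output; equal digests mean identical output."""
    return hashlib.sha256(text.encode()).hexdigest()

# Synthetic "stored data": about 10,000 events/sec for roughly 3 seconds.
# The seed is fixed so that the input file is the same every time.
rng = random.Random(42)
recorded = []
ts = 0
for i in range(30_000):
    ts += rng.randint(50, 150)      # gap in microseconds, 100 on average
    recorded.append((ts, i))

# Play the same data twice and compare the results, as 'diff' would.
run1 = digest(count_keep_1_second(recorded))
run2 = digest(count_keep_1_second(recorded))
is_deterministic = (run1 == run2)
print("deterministic" if is_deterministic else "NOT deterministic")
```

Comparing digests is just a convenience for large outputs; diffing the raw output files tells you the same thing and also shows where the runs diverge.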

Multi-stream determinism

Most interesting applications take more than one data stream. For example, many financial applications look at multiple feeds from exchanges and multiple FIX order streams. In this case, the relative order of events as they happened at their sources is very important! Unfortunately, achieving multi-stream determinism is pretty hard. One has to be able to handle and synchronize delayed events, out-of-order events, and so on. Such stream synchronization is tricky, so here is how you can test if your engine is multi-stream deterministic. Create a simple query which takes data from more than one stream, such as:

Insert Into Result
Select *
From Orders as O,
     Ticks as T Keep Last;

This query pairs each order with the last tick (of course, you can use any other streams). Again, try this query at high enough data rates. If your engine is multi-stream deterministic, you'll always get the same results! If you don't, then the engine is not multi-stream deterministic, simple as that.
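Here is a small Python sketch of why this test is revealing. It is not how any particular engine works internally: the function names, the sample orders and ticks, and the tie-breaking rule (equal timestamps are ordered by stream name) are all assumptions for illustration, and it sidesteps the hard part by assuming the complete recorded data is available up front. It joins each order with the last tick in two ways: once after merging the streams by event timestamp, and once in whatever order the events happened to arrive.

```python
def merge_by_event_time(streams):
    """Merge named streams into one sequence ordered by event timestamp.
    Ties are broken by stream name, then by position within the stream
    (an assumed rule; any fixed rule gives a reproducible order)."""
    tagged = []
    for name, events in streams.items():
        for seq, (ts, payload) in enumerate(events):
            tagged.append((ts, name, seq, payload))
    return sorted(tagged, key=lambda e: e[:3])

def pair_with_last_tick(sequence):
    """Pair each order with the most recent tick seen so far in `sequence`.
    Each item is (timestamp, stream name, seq, payload)."""
    last_tick = None
    result = []
    for ts, name, _seq, payload in sequence:
        if name == "Ticks":
            last_tick = payload
        elif last_tick is not None:
            result.append((ts, payload, last_tick))
    return result

# Hypothetical recorded streams: (event timestamp in ms, payload).
streams = {
    "Orders": [(105, "buy IBM"), (230, "sell IBM")],
    "Ticks": [(100, 91.0), (200, 91.5), (250, 92.0)],
}

# Event-time processing: one possible answer, whatever the arrival order.
by_event_time = pair_with_last_tick(merge_by_event_time(streams))

# Arrival-order processing: here the tick stamped 200 is delayed and shows
# up after the order stamped 230, so that order sees a stale tick.
delayed_arrival = [
    (100, "Ticks", 0, 91.0),
    (105, "Orders", 0, "buy IBM"),
    (230, "Orders", 1, "sell IBM"),
    (200, "Ticks", 1, 91.5),
    (250, "Ticks", 2, 92.0),
]
by_arrival = pair_with_last_tick(delayed_arrival)

print(by_event_time)
print(by_arrival)
```

With the delayed tick, the second order is paired with 91.0 instead of 91.5. At high data rates such small delays between streams happen all the time, which is why repeated runs expose an engine that joins in arrival order.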

Conclusion

I hope this post has demystified the notion of determinism a little. Of course, if you don't want to run all these tests, you can just ask me which engines are deterministic and which ones are not. I won't mention any names here, but as far as I know, the Coral8 Engine is the only commercial multi-stream deterministic CEP engine on the market today. If your tests prove otherwise, please let me know, and I'll gladly report so here!

Mark Tsimelzon, President & CTO, Coral8