Are we related? Correlation and Causation in Microservices


Architecture, Distributed Systems, Events, Microservices, Software, Technology / Saturday, March 24th, 2018

In a synchronous blocking call such as RPC, a request is followed by a response and there are no other requests or responses interleaved with this outstanding request. So there is absolutely no ambiguity in matching the response to the request.

Enter the free-for-all world of distributed asynchronous messaging, where at any point in time several concurrent requests and responses are being processed across different services. In such a world it becomes essential to have a definite and unambiguous way of matching responses to requests. In fact, in most cases the correct behavior of the system depends on it. For example, retransmitting a request or triggering an action after a successful response both depend on identifying the right response to a request when multiple responses are interleaved. If this matching is wrong, it leads to erroneous behavior and can leave the system in an inconsistent state.

This is indeed a well-known issue in the distributed world and is addressed in many messaging protocols. The example I am most familiar with comes from the Unified Communications world. The Session Initiation Protocol (SIP) is used to establish multimedia sessions between users (voice, video, chat, collaboration etc.). A user wanting to establish a voice call sends an INVITE request to the user they wish to talk to. This request is propagated through multiple entities and is typically serviced by an Application Server sitting in the middle. The server handles several hundred thousand such requests per second. Any single request could fan out to many devices, or even to other users, when the called user’s answering service is set up to do so. In such a case, the original request could result in multiple sub-requests and receive multiple responses from different entities. The Application Server handles all these messages, for all calls, simultaneously. So it needs a way to tie the sub-requests and responses back to a specific original request and converge them, for purposes such as call features, billing, logging and monitoring.

In the event-based microservices world, a request by a client will trigger numerous other events as part of the workflow. Here one of the main concerns is logging and monitoring. Consider the example of order placement in an e-commerce application. Placing an order will trigger a cascade of events in many services such as Order, Payment, Shipping, Warehouse, Recommendations, Loyalty etc. Say a failure occurs at some point along the workflow while processing the order. If you have to refer to the logs to debug it, how do you track the events triggered by this particular order? Numerous other events will be interleaved in the logs along with those of this one order. There should be an easy way to search such that the results contain only the events related to this order.

We can generalize this problem to say that we need a way of grouping the set of all related messages/events arising from a single request in a distributed system.

Correlation-ID

Both of the above issues, grouping all the related parts of a distributed transaction and grouping all the events that are part of a workflow, are solved by a simple concept known as a Correlation-ID (CID).

A CID is a globally unique identifier, or simply a unique token carried in the messages, that can unambiguously answer the question for a set of messages: are we related? The way this works is that when a client originates a new request, it adds a unique id (usually a UUID) as part of the message or event metadata. As the message flows through different entities/services, they copy the CID to any downstream sub-requests and all upstream responses. In the events world it works the same way: all events that are triggered in response to the external event preserve the CID and pass it along in any newly created events. The CID is also included in all logging and monitoring entries. Thus the CID becomes the single token that ties all the related messages/events together, be it actively in the workflow or in logs and monitoring tools.
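To make the propagation rule concrete, here is a minimal sketch in Python. The `Event` class, the helper names (`new_request`, `follow_up`, `log_line`) and the event names are all hypothetical and exist only for illustration; a real system would carry the CID in whatever metadata its messaging layer provides.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    name: str
    correlation_id: str
    payload: dict = field(default_factory=dict)


def new_request(name: str, payload: Optional[dict] = None) -> Event:
    """Originate a request: the only place where a fresh CID is minted."""
    return Event(name, correlation_id=str(uuid.uuid4()), payload=payload or {})


def follow_up(trigger: Event, name: str, payload: Optional[dict] = None) -> Event:
    """Create a downstream event that preserves the CID of its trigger."""
    return Event(name, correlation_id=trigger.correlation_id, payload=payload or {})


def log_line(service: str, event: Event) -> str:
    """Format a log entry that always carries the CID."""
    return f"cid={event.correlation_id} service={service} event={event.name}"


# Two orders processed concurrently, with their log entries interleaved.
order_a = new_request("OrderPlaced", {"order": "A"})
order_b = new_request("OrderPlaced", {"order": "B"})
payment_a = follow_up(order_a, "PaymentAuthorized")
payment_b = follow_up(order_b, "PaymentDeclined")
shipment_a = follow_up(payment_a, "ShipmentScheduled")

log = [
    log_line("order", order_a),
    log_line("order", order_b),
    log_line("payment", payment_b),
    log_line("payment", payment_a),
    log_line("shipping", shipment_a),
]

# Searching the log by CID yields only the entries that belong to order A.
related_to_a = [line for line in log if f"cid={order_a.correlation_id} " in line]
```

The design choice worth noting is that only `new_request` generates an id. Every other code path copies it, which is what keeps the grouping intact across services.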

There are several existing protocols and systems that support CIDs. The syntax and name might vary between them, but semantically they all serve the same purpose. SIP has the Session-ID header. HTTP services commonly use the X-Request-ID and X-Correlation-ID headers, which are widely adopted conventions rather than part of the HTTP standard. Java Message Service (JMS) has a JMSCorrelationID header that is included in replies. A common usage with the request/reply pattern is to copy the unique Request-ID from the request into the reply as the CID. Note that for this pattern to work, every entity in the path must support the CID. As part of the protocol or service interface, the syntax and semantics of the CID must be well defined and must be supported by all involved.
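The request/reply usage can be sketched as follows. This is an illustrative Python example, not the API of JMS or any other library: the `Requester` class, the `make_reply` helper and the dictionary message format are assumptions made for the sketch.

```python
import uuid
from typing import Callable, Dict


class Requester:
    """Tracks outstanding requests and matches replies by correlation id."""

    def __init__(self) -> None:
        self._pending: Dict[str, Callable[[dict], None]] = {}

    def send(self, body: str, on_reply: Callable[[dict], None]) -> dict:
        """Build a request with a unique id and remember its reply handler."""
        request_id = str(uuid.uuid4())
        self._pending[request_id] = on_reply
        return {"request_id": request_id, "body": body}

    def receive(self, reply: dict) -> bool:
        """Dispatch a reply to its request; return False if nothing matches."""
        handler = self._pending.pop(reply.get("correlation_id"), None)
        if handler is None:
            return False  # unknown, late or duplicate reply
        handler(reply)
        return True


def make_reply(request: dict, body: str) -> dict:
    """Responder side: copy the request id into the reply as its CID."""
    return {"correlation_id": request["request_id"], "body": body}


results: Dict[str, str] = {}
requester = Requester()
req1 = requester.send("price of A?", lambda r: results.update(A=r["body"]))
req2 = requester.send("price of B?", lambda r: results.update(B=r["body"]))

# Replies arrive out of order, yet each one reaches the right handler.
requester.receive(make_reply(req2, "20"))
requester.receive(make_reply(req1, "10"))

# A retransmitted reply no longer matches an outstanding request.
duplicate_accepted = requester.receive(make_reply(req1, "10"))
```

Because the pending table is keyed by the id, interleaving does not matter, and a duplicate reply is detected simply because its request is no longer outstanding.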

An interesting case is when an external request comes in with no CID. In such a case the entity or service that first encounters the request or event must add a CID to any ensuing events. A related case arises when interacting with external entities that do not understand or support the CID: how do you preserve the relatedness then? I don’t have a good answer to this yet, except that nodes are typically expected to live by Postel’s law, in that even if they don’t understand the CID, they should be generous in accepting it. If they are generous enough, they will hopefully also pass it along. This is the common practice in the SIP world at least.
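For the first case, a service at the edge might apply a rule like the one sketched below: keep the caller's CID if there is one, otherwise mint a new one. The function name `ensure_correlation_id` is hypothetical, and the choice of the X-Correlation-ID header and of treating a blank value as missing are assumptions for this example.

```python
import uuid
from typing import Dict, Mapping

CID_HEADER = "X-Correlation-ID"  # assumed header name for this sketch


def ensure_correlation_id(headers: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of the headers that is guaranteed to carry a CID.

    An existing non-blank CID is preserved untouched. Header names are
    compared case-insensitively, as HTTP header names are.
    """
    wanted = CID_HEADER.lower()
    for name, value in headers.items():
        if name.lower() == wanted and value.strip():
            return dict(headers)  # caller already supplied a CID

    # No usable CID: drop any blank variant and mint a fresh one.
    result = {k: v for k, v in headers.items() if k.lower() != wanted}
    result[CID_HEADER] = str(uuid.uuid4())
    return result


with_cid = ensure_correlation_id({"x-correlation-id": "abc-123", "Accept": "*/*"})
without_cid = ensure_correlation_id({"Accept": "*/*"})
blank_cid = ensure_correlation_id({"X-Correlation-ID": "  "})
```

Preserving an incoming CID rather than overwriting it is what lets the correlation extend beyond your own service boundary when the caller does support it.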

Causation-ID

Another beneficial pattern when it comes to relationships in the eventful world of microservices is identifying causation. This is again a really simple concept: when one event directly causes another event, the second event is said to have been caused by the first. This relationship is captured in the second event, typically by using two ids. Each event has a unique id, a message-id, that identifies it. Each event also has a causation-id, which is set to the message-id of the event that caused it.
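A small Python sketch of the two ids working together with the CID follows. The `Event` class, the `caused_by` and `causal_chain` helpers and the event names are hypothetical. Leaving the causation-id of the very first event empty is an assumption of this sketch; some systems instead set it to the event's own message-id.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    name: str
    correlation_id: str
    causation_id: Optional[str] = None  # None marks the root event here
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def caused_by(cause: Event, name: str) -> Event:
    """Create an event directly caused by `cause`."""
    return Event(
        name,
        correlation_id=cause.correlation_id,  # same workflow
        causation_id=cause.message_id,        # direct parent
    )


def causal_chain(event: Event, by_id: Dict[str, Event]) -> List[str]:
    """Walk causation-ids back to the root and return names, root first."""
    chain = [event.name]
    while event.causation_id is not None:
        event = by_id[event.causation_id]
        chain.append(event.name)
    return list(reversed(chain))


order = Event("OrderPlaced", correlation_id=str(uuid.uuid4()))
payment = caused_by(order, "PaymentAuthorized")
shipment = caused_by(payment, "ShipmentScheduled")
loyalty = caused_by(payment, "LoyaltyPointsAwarded")

by_id = {e.message_id: e for e in (order, payment, shipment, loyalty)}
chain = causal_chain(shipment, by_id)
```

Here all four events share one correlation-id, which answers "are we related?", while the causation-ids answer the narrower question of which event led directly to which.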

No more relationship issues

The set of three ids (the message-id, the correlation-id and the causation-id) is extremely beneficial to microservices, especially given their distributed nature. Most frameworks and related books recommend using them as a must to make life easier. I absolutely agree. We have enough tough problems to deal with in a distributed world already; we can do with one or two fewer. Add the ids even if you can’t enumerate all the benefits right away. You will thank yourself later.

One last point: sometimes you might find that an existing piece of metadata, already being used for other purposes, serves the needs of a CID as well. In such cases it may be easier to reuse it for correlation, but be aware of the pitfalls of reuse, namely the overloading and coupling it might cause. If it’s not hard to add an explicit id for correlation, I would recommend that over overloading an existing id.

Further Reading

Note that this article is part of the Microservices series. You can read the previous ones here: Prelude, Introduction, Evolution, Guiding Principles, Ubiquitous Language, Bounded Contexts, Communication Part 1, Communication Part 2, Communication Part 3, Communication Part 4, Communication Part 5, Kafka

Designing Event Sourced Microservices

Distributed Tracing

Another blog post on Correlation-id

One more Blog post
