In the previous post, I discussed how we used an Event-driven architecture with Finite State Machines to simplify complex problems and as a means of communication between different services. However, there are several issues to consider when using Events, and they require specific design choices and trade-offs that affect the system's behavior.
One of the basic tenets of Microservices is that each service maintains its data privately. The private data store can take any form: relational, NoSQL, etc. This private data is not accessible to any other service. The services talk to each other only through public APIs, in this case through Events. The Events are published by producers to a Message Broker, and the Message Broker delivers them to interested Consumers. So far so good. If a service just consumed external events and updated its state internally without having to do anything further, there would be no issues. However, in real applications there will always be a need for different services to interact on a certain subset of events and make changes to their internal state in correlation with those events. This is when things start to get hairy.
Did you say 2-Phase Commit?
Relational databases have spoiled us and made life really easy over the last several decades. ACID guarantees are mostly taken for granted, and any time we need to make several related changes atomically, we just use a transaction. Simple: no second thoughts about analyzing the failures that might happen or how we would recover from them. The database takes care of it all, unbeknownst to us.
Back to the reality of microservices: what happens when we want to, say, update our private data store and publish a correlating event from a microservice? How do we do that? Say we first update the private data and then publish the event. Several things can go wrong in this scenario. The event might fail to publish, for any number of reasons, after we persist data in our private store. Or the event is published but never reaches the consumer. Whatever the failure path, the result is that our internal data and its external reflection are no longer consistent. Depending on the application, this may or may not be acceptable. More often than not, it is not acceptable.
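The dual-write hazard described above can be made concrete with a small sketch. All of the names here (the in-memory "store", "broker", and failure flag) are hypothetical stand-ins, not any particular framework's API:

```python
# A minimal sketch of the dual-write problem: the local store and the broker
# are updated in two separate steps, so a failure between them leaves the
# system inconsistent.

class BrokerDown(Exception):
    pass

local_store = {}   # stands in for the service's private database
published = []     # stands in for the message broker's log

def publish(event, broker_up=True):
    if not broker_up:
        raise BrokerDown("event lost after the local write succeeded")
    published.append(event)

def place_order(order_id, amount, broker_up=True):
    local_store[order_id] = {"amount": amount, "status": "OPEN"}       # step 1: local write
    publish({"type": "OrderPlaced", "order_id": order_id}, broker_up)  # step 2: publish

place_order("o-1", 40)             # happy path: both steps succeed
try:
    place_order("o-2", 75, broker_up=False)
except BrokerDown:
    pass                           # "o-2" exists locally but was never announced

assert "o-2" in local_store and len(published) == 1  # inconsistent state
```

The order `o-2` is visible in the private store but no consumer will ever hear about it, which is exactly the inconsistency the rest of this post works to avoid.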
Similarly, consider the case where a workflow requires several services to collaborate to accomplish a business function. I'll use the classic example of an e-retailer website that everyone is familiar with (I was told that the telephony example in the previous post was confusing).
A user places an order for an item. The Order service creates an open order, but before it can be confirmed as a valid order, the Payment service may impose some invariants. One such invariant could be that the total amount of the order stay below the user's credit limit. Further processing of the order can only happen after the Payment service confirms it as such. If the Payment service rejects the order because the total exceeds the credit limit, the Order service has to cancel the order and proceed no further with it. The services' local states should not be left with half-updated results. These two services should handle this workflow atomically; if they don't, the system will end up in an inconsistent state, violating the invariants.
So how do we deal with these issues? The classic textbook answer is to use a 2-phase commit, i.e. distributed transactions. Technically that is the correct answer, but practically speaking, no one in the distributed-systems world recommends it. A detailed coverage of data consistency in distributed systems (the CAP theorem) is available in one of my previous posts. In short, 2-phase commit hurts performance and causes severe operational problems: if the distributed transaction coordinator fails, recovery is very expensive and hard to do. Many modern data stores do not even support it, and cloud-based architectures are also hesitant to support distributed transactions.
So we can rule out distributed transactions. The only viable option then is to accept the reality that with Microservices, as is common in any distributed system, we may have to settle for Eventual Consistency and design our services to account for it.
A few alternative approaches to distributed transactions are:
- Change Data Capture, or CDC as it is known, is a method that observes and captures all changes occurring in a database so that these changes can be applied, in order of occurrence, to other systems to replicate the DB. This concept has been extended to make the changes available as streams, which conceptually can be seen as an event stream of state changes. Several tools support this: MongoRiver, Databus, GoldenGate, etc. Some databases also expose APIs that can be used to access CDC. I personally think that with this approach the events are tied to the low-level representation of the DB, which creates low-level coupling; one should be aware of this trade-off when using CDC.
- The writes can be handled as part of a local transaction spanning two tables: one that represents the state of the orders and another that represents events. The first write updates the local DB table corresponding to the order; another write is made to the events table, and both are committed atomically. The events table is read by a separate thread, which then posts each event to the message broker. Essentially the events table acts like a trigger on state updates to the other tables, continually publishing those changes as events to the broker. The application controls the granularity at which state changes map to events.
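The second approach, often called a transactional outbox, can be sketched with SQLite standing in for the service's private store. The table and column names here are illustrative, not from any particular framework:

```python
# A minimal sketch of the "events table" approach: the order row and the
# event row commit atomically in one local transaction, and a separate
# poller relays unpublished events to the broker.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE events (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
             "payload TEXT, published INTEGER DEFAULT 0)")

def create_order(order_id):
    # Both writes commit atomically in one local transaction.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, 'OPEN')", (order_id,))
        conn.execute("INSERT INTO events (payload) VALUES (?)",
                     (f"OrderCreated:{order_id}",))

def relay(broker):
    # A separate thread/poller reads unpublished events in order and
    # posts them to the message broker, then marks them published.
    rows = conn.execute("SELECT seq, payload FROM events "
                        "WHERE published = 0 ORDER BY seq").fetchall()
    for seq, payload in rows:
        broker.append(payload)  # deliver to the message broker
        conn.execute("UPDATE events SET published = 1 WHERE seq = ?", (seq,))
    conn.commit()

broker = []
create_order("o-1")
create_order("o-2")
relay(broker)
assert broker == ["OrderCreated:o-1", "OrderCreated:o-2"]
```

Because the state change and the event are committed together, a crash can never leave an order without its corresponding event; at worst the relay re-sends an event, so consumers should be idempotent.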
Event Sourcing
Event Sourcing has its origins in accounting ledgers and the financial industry. Think about your bank account statements: they list all the transactions for a time period and then the final state of the account, i.e. the balance. If there is an erroneous transaction, it is not deleted outright; instead, a compensating transaction that fixes the error is added to the list. This is exactly how Event Sourcing works too.
Event Sourcing is the radical idea that events are the single source of truth in the system and as such are persisted forever. The events represent immutable facts at discrete points in time, and the state of the application is derived from the persisted events. The events are written to an append-only event log; updates and deletions are prohibited. The most important and differentiating characteristic of an Event Source is that at any point in time, the state of the application can be rebuilt by replaying the events from the start up to that point.
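The bank-ledger analogy translates almost directly into code. A minimal sketch, with an invented two-field event shape:

```python
# A minimal event-sourced "account": events are appended to an immutable
# log, and the current state is derived by replaying them from the start.

log = []  # append-only event log; no updates or deletes, ever

def append(event):
    log.append(event)

def balance(events):
    # State is a pure function of the event history.
    total = 0
    for kind, amount in events:
        total += amount if kind == "deposited" else -amount
    return total

append(("deposited", 100))
append(("withdrew", 30))
append(("deposited", 5))      # suppose this deposit was a mistake
assert balance(log) == 75

# An erroneous entry is never deleted; a compensating event corrects it.
append(("withdrew", 5))
assert balance(log) == 70
```

Note that nothing stores the balance itself: any consumer holding the log can recompute the state at any point in time, which is exactly the replay property described above.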
If you think about it, there are parallels between the CDC concept and the event log of Event Sourcing. However, CDC and analogous systems do not provide state derivation, and they sit at a lower level of abstraction than the events in an Event Source, which are modeled at the application level.
Using an Event Source eliminates the 2-phase commit issue, since writing to the event log does both things for us atomically in one operation: it persists the event and publishes it to consumers.
Event Sourcing has several benefits:
- Audit trail
- Ability to time travel
- Multiple consumers can replay events and build their own derived state
- Consumers can control the rate of consumption, since the events are persisted forever
- Fault tolerant
- Extremely useful for debugging and tracking transient errors that occur in production
Replaying events from the beginning to the current time to build the application state can cause performance issues as the event log grows. For this reason, event logs are used to create point-in-time snapshots. Services then only have to replay events from their latest snapshot to the current time, reducing the replay to manageable chunks and improving performance.
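Continuing the account sketch above, snapshotting amounts to persisting a derived state together with the log position it covers. The structure here is illustrative:

```python
# A minimal sketch of snapshotting: instead of replaying the whole log,
# a service replays only the events recorded after its last snapshot.

log = [("deposited", 100), ("withdrew", 30), ("deposited", 50)]

def apply(state, event):
    kind, amount = event
    return state + amount if kind == "deposited" else state - amount

# Take a snapshot: the derived state plus the log position it covers.
snapshot_state = 0
for e in log:
    snapshot_state = apply(snapshot_state, e)
snapshot = {"state": snapshot_state, "position": len(log)}

# New events keep arriving after the snapshot was taken.
log += [("withdrew", 20), ("deposited", 10)]

# Recovery replays only the tail, not the full history.
state = snapshot["state"]
for e in log[snapshot["position"]:]:
    state = apply(state, e)
assert state == 110
```

The snapshot is purely an optimization: it can always be discarded and rebuilt from the log, so it never becomes a second source of truth.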
Command Query Responsibility Segregation (CQRS)
An event log is not the easiest format for executing queries. To help with queries, CQRS is usually combined with Event Sourcing, enabling services to create materialized views from the event stream. CQRS is a pattern that distinguishes between Commands, which change the state, and Queries, which are read-only and do not alter it. This is especially powerful in a microservices architecture because it decouples services that change state from services that need to interpret that state differently for querying purposes. Think of a report-generating service that needs to aggregate state across all the services. Such a service can consume events from all the other services off the event log and build its own private model of reporting data derived from those events. Another advantage of CQRS is the flexibility to scale reads and writes separately, which is beneficial in systems where the read/write ratio is highly disproportionate.
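The reporting-service idea can be sketched in a few lines. The event shape and the per-customer totals view are hypothetical examples, not a prescribed schema:

```python
# A minimal CQRS sketch: commands append events to the log on the write
# side, while a separate read side folds the same events into a
# query-friendly materialized view (here, order totals per customer).

events = []  # the shared event log (output of the write side)

def handle_place_order(customer, amount):
    # Command: changes state by appending an immutable event.
    events.append({"type": "OrderPlaced", "customer": customer, "amount": amount})

def build_report_view(event_stream):
    # Query side: derives its own private read model from the events.
    view = {}
    for e in event_stream:
        if e["type"] == "OrderPlaced":
            view[e["customer"]] = view.get(e["customer"], 0) + e["amount"]
    return view

handle_place_order("alice", 40)
handle_place_order("bob", 25)
handle_place_order("alice", 10)

assert build_report_view(events) == {"alice": 50, "bob": 25}
```

Because the view is derived, several read-side services can each fold the same log into different shapes (per-customer totals, daily summaries, etc.) and scale independently of the write side.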
Sagas
Sagas are another old pattern, predating microservices. A Saga essentially allows us to handle distributed workflows that involve several services, as in the Order service and Payment service example discussed earlier. Instead of using distributed transactions, a Saga uses compensating transactions to bring the system back to a consistent state when there are failures or aborts. Think of this as a local rollback in each service. Each service executes its local changes and notes the compensating change required. As the events flow through the services, if a downstream service aborts, every upstream service executes its compensating change to bring the system to a consistent state. Although this is a powerful pattern, a word of caution: in practice, implementing Sagas is rather complex and requires careful thought and design.
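For the order/payment workflow, the compensation idea can be sketched as follows. The statuses, credit limit, and single-function orchestration are illustrative; a real Saga coordinates separate services over events:

```python
# A minimal Saga sketch: each step records its compensating action, and a
# downstream failure triggers the compensations in reverse order.

def run_order_saga(order_total, credit_limit):
    state = {"order": None, "payment": None}
    compensations = []

    # Step 1: the Order service creates an open order and notes its rollback.
    state["order"] = "OPEN"
    compensations.append(lambda: state.update(order="CANCELLED"))

    # Step 2: the Payment service enforces the credit-limit invariant.
    if order_total > credit_limit:
        # Abort: run the upstream compensations, most recent first.
        for undo in reversed(compensations):
            undo()
        return state

    state["payment"] = "APPROVED"
    state["order"] = "CONFIRMED"
    return state

assert run_order_saga(80, credit_limit=100) == {"order": "CONFIRMED",
                                                "payment": "APPROVED"}
assert run_order_saga(150, credit_limit=100) == {"order": "CANCELLED",
                                                 "payment": None}
```

Note that the failed run ends with the order cancelled rather than deleted: like the compensating transaction in the ledger example, the Saga moves the system forward to a consistent state instead of pretending the attempt never happened.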
In summary, Event-driven systems are really the way to go with Microservices. Event Sourcing, CQRS, and Sagas help us take Microservices to a finer granularity than would otherwise be possible. Although these patterns are complex and come with challenges, the challenges are not insurmountable. Event Sourcing is an unfamiliar, new paradigm, and it is worth keeping in mind that Event Sourcing and CQRS are complex patterns; not everything in the system should use them. If some services are well served by simple CRUD mechanisms, then those services should use CRUD. The granularity of internal versus external events is another design factor to consider carefully; typically a relatively coarse granularity is sufficient for external events.
So, reiterating in conclusion: Event Sourcing, CQRS, and Sagas should only be applied to the parts of the system that really need and benefit from them. Event Sourcing has been gaining popularity of late, and several frameworks that support it are emerging, so I can only expect the frameworks, and our knowledge of these patterns, to improve in the coming years with experience. I have only covered the basics in this post; for further reading, check out the sources below.
Further Reading:
Note that this article is part of the Microservices series. You can read the previous ones here: Prelude, Introduction, Evolution, Guiding Principles, Ubiquitous Language, Bounded Contexts, Communication Part 1, Communication Part 2, Communication Part 3, Communication Part 4