There are some things one can only achieve by a deliberate leap in the opposite direction.
Weddings are on my mind today. Yesterday was one of my dearest cousin’s engagement and I’m still on that buzz. So forgive me for juxtaposing a sacred union of two hearts to the rather technical topic of this post – databases, message brokers and Kafka.
On second thought, the comparison is not as incongruous as it might seem at first. Let me explain.
Think of databases and the first word that pops to mind is persistence. This is rightly so because the first and foremost goal of a database is data persistence. All additional features of a database are offshoots, off the property of data persistence. The data in a DB is persisted forever, meaning that the only way to get rid of the data is by explicitly deleting it.
Now think of message brokers and the basic concept that looms large is that of transience. A message broker also deals with units of data, as messages. It has producers at one end that produce messages. At the other end are the consumers which consume these messages. In between is the message broker that primarily serves to ensure that the messages get to the consumers from the producers as quickly and efficiently as possible. Once the message is successfully handed over to the consumer, the message broker is done and it purges the message from its internals. Thus the messages are persisted by the message broker for only as long as needed to deliver it across and therefore are always transient – under normal conditions of operation typically running into milliseconds to under a second.
Conceptually there is nothing preventing a database or even a simple file from being seen as a message broker. The producers write to the DB or file. The consumers read from the DB or file. The only major difference being that the consumers have to pull the data by polling the DB or the file periodically. The issue with polling is that for applications that need access to data as it happens in real time, i.e streaming data, polling becomes very inefficient and cumbersome. So what is most efficient in this case, is for the data store to support a push notification mechanism. One could argue that a database can do it through the usage of Triggers. Triggers can be used to push notifications whenever a write operation on the DB is executed. In principle this sounds fine, but the issue is one of practicality in terms of efficiency and scalability. Triggers are a mere second class citizen in the DB world and it is said that they have been added only as an after thought and fail to scale beyond a small number of triggers. Streaming is typically characterized by very large volumes of data and hence triggering on such volume will pose major challenges in scaling.
Traditionally the application space of databases and message brokers have been isolated. The distributed nature of the internet and applications arising out it have changed the requirements for the communication middleware. Now there are a large set of applications that need the persistence capabilities of a typical database and also the streaming capabilities of a message broker, along with scalability, independence and flexibility.
So the persistent database world meets the event streaming world of message brokers. What’s not to like ? They complement each other perfectly. Out of their happy union came Log based message brokers, Kafka being the most popular of them all.
Log based message brokers like Kafka deliver messages as they occur, in a streaming manner with extremely low latencies to the consumers, but also persist the messages even after delivery to consumers. The combination of the two features allows for great flexibility in communication and data management. The persistence allows some consumers to read the event stream as and when required, long after the event has happened as in a typical DB, while the push notifications allows other consumers to react to events in near real time as in a typical message broker.
In short, the message broker is an append only log, the writes are appended to the end continually, which the consumers read in a FIFO manner. The messages are ordered within a particular scope (called partitions) and also associated with a sequential offset to mark the position of last read. Each message broker typically defines a set of QOS guarantees and applications can further fine tune them through configuration options.
Going into the details of kafka is a topic unto itself which deserves its own future post. For now, if you have ever wondered how Kafka got it’s name (I really couldn’t figure out what Kafka the writer had to do with this Kafka, the distributed commit log, until I came across this), you need not wonder any longer. Although a little disappointing here it is, as per Jay Kreps, one of the founders of Kafka :
I thought that since Kafka was a system optimized for writing, using a writer’s name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.
(Note that the beginnings of Kafka does not exactly align with the story I have here. It started out at LinkedIn to solve the issues they had with metrics collection and user activity tracking tools. The solution to those problems culminated in the architecture of Kafka.)
Confluent The makers of Kafka