Understanding How Apache Pulsar Works

I will be writing a series of blog posts about Apache Pulsar, including some Kafka vs Pulsar posts. First up though I will be running some chaos tests on a Pulsar cluster like I have done with RabbitMQ and Kafka to see what failure modes it has and its message loss scenarios.

I will try to do this by either exploiting design defects, implementation bugs or poor configuration on the part of the admin or developer.
In this post we’ll go through the Apache Pulsar design so that we can better design the failure scenarios. This post is not for people who want to understand how to use Apache Pulsar but who want to understand how it works. I have struggled to write a clear overview of its architecture in a way that is simple and easy to understand. I appreciate any feedback on this write-up.

Claims

The main claims that I am interested in are:
  • guarantees of no message loss (if recommended configuration applied and your whole data center doesn't burn to the ground)
  • strong ordering guarantees
  • predictable read and write latency
Apache Pulsar chooses consistency over availability as does its sister projects BookKeeper and ZooKeeper. Every effort is made to give strong consistency.
We'll be taking a look at Pulsar's design to see if those claims are valid. In the next post we'll put the implementation of that design to the test. I won’t cover geo-replication in this post, we’ll look at that another day, we’ll just focus on a single cluster.

Multiple layers of abstraction

Apache Pulsar has the high level concept of topics and subscriptions and at its lowest level data is stored in binary files which interleave data from multiple topics distributed across multiple servers. In between are a myriad of details and moving parts. I personally find it easier to understand the Pulsar architecture if I separate it out into different layers of abstraction, so that’s what I’ll do in this post.
Let's take a journey down the layers.
Fig 1. Layers of abstraction

Layer 1 - Topics, Subscriptions and Cursors

This is not a post about messaging architectures that you can build with Apache Pulsar. We’ll just cover the basics of what topics, subscriptions and cursors are but not any depth about the wider messaging patterns that Pulsar enables.
Fig 2. Topics and Subscriptions
Messages are stored in topics. A topic, logically, is a log structure with each message being at an offset. Apache Pulsar uses the term Cursor to describe the tracking of offsets. Producers send their messages to a given topic and Pulsar guarantees that once the message has been acknowledged it won’t be lost (bar some super bad catastrophe or poor configuration).


Comments