New token allocation algorithm in Cassandra 3.0


The central idea of the algorithm is to generate candidate tokens, and figure out what would be the effect of adding each of them to the ring as part of the new node. The new token will become primary for part of the range of the next one in the ring, but it will also affect the replication of preceding ones.
The algorithm is able to quickly assess the effects thanks to some observations which lead to a simplified but equivalent version of the replication topology2:

  • Replication is defined per datacentre and replicas for data for this datacentre are only picked from local nodes. That is, no matter how we change nodes in other datacentres, this cannot affect what replicates where in the local one. Therefore in analysing the effects of adding a new token to the ring, we can work with a local version of the ring that only contains the tokens belonging to local nodes.
  • If there are no defined racks (or the datacentre is a single rack), data must be replicated in distinct nodes. If racks are defined, data must be replicated in distinct racks3. In either case, there is a clearly defined separation of all token ranges in the local ring into groups where only one replica of any data can reside.
  • The <n> token ranges where a data item is replicated are allocated going onwards in the ring, skipping token ranges in the same replication group. "Skipping" is difficult to deal with efficiently, but this turns out to be equivalent to saying that a vnode has responsibility for a contiguous span of tokens that ends at its token, and begins at the nearest token of the <n>-th distinct replication group that precedes it, or at the nearest other member of the same replication group, whichever is closer.
  • The latter observation makes it relatively easy to assess the changes to the responsibility caused by adding a new token in the ring.
  • Under the assumptions1 of the algorithm, the size of that responsibility ("replicated ownership") is proportional to the load that the presence of this vnode causes. The load over the whole node is thus proportional to the sum of the replicated ownerships of its vnodes.
The latter, the sum of the replicated ownerships of each node's vnodes, is what the algorithm tries to distribute evenly. We do this by evaluating the standard deviation in the ownership of all nodes and the effect on this deviation of selecting a specific token and pick the best in a set of candidates. To keep complexity under control, the candidate tokens are chosen to be the midpoints between existing ones4. Doing this repeatedly for all requested vnodes plus some optimizations gives the allocation algorithm.

Details: https://issues.apache.org/jira/browse/CASSANDRA-7032

Comments