Progress for big data in Kubernetes


[Figure: shared data platform in a Kubernetes system]

Kubernetes is really cool because managing services as flocks of little containers is a great way to make computing happen. We can get away from the idea that a specific computer runs the program and get into the idea that a service happens because a lot of little bits of computing just happen. This idea is crucial to making reliable services that don’t require a ton of heroism to stand up or keep running.


But there is a dark side here. Containers want to be agile because that is the point of containers in the first place. We want containers because we want to make computing more like a gas made up of indistinguishable atoms instead of like a few billiard balls with colors and numbers on their sides. Stopping or restarting containers should be cheap so we can push flocks of containers around easily and upgrade processes incrementally. If ever a container becomes heavy enough that we start thinking about that specific container, the whole metaphor kind of dissolves.

So that metaphor depends on containers being lightweight. Or, at least, they have to be lightweight compared to the job they are doing. That doesn’t work out well if you have a lot of state in a few containers. The problem is that data lasts a long time and takes a long time to move. The life cycle of data is very different from the life cycle of applications. Upgrading an application is a common occurrence, but data has to survive many such upgrades.

