Why Kerberos? Simple Explanation in the HDFS Context

25 Jan 2016

For people without a security background, Kerberos can be very hard to understand. I recently had to fix a number of Kerberos issues in HDFS, and that turned out to be very helpful for my education.

From a high level, every process should represent someone who is entitled to access resources and request services (principle in Kerberos terminology). The way it proves to a resource / service provider is to show a token. A token is a chunk of bytes that is hard to generate but easy to verify. Kerberos KDC is in charge of distributing tokens of this sort.

As a concrete example, in a Kerberized cluster, Balancer acquires a token from KDC saying that it represents the cluster administrator. It then uses the token to check with NameNode about block locations, and work with DataNodes to move those blocks.