My HDFS Research Agenda

18 May 2015

HDFS is approaching its 10th or 4th year, depending on how you count, and its 9000th JIRA issue, yet it is still a young project. Scores of interesting technical problems await solutions.

HDFS touches many active research areas in academia: operating systems, data storage, and distributed computing, to name just the major ones. More joint research projects between universities and companies would benefit both sides in unique ways and advance the entire big data industry.

Almost a year into working on HDFS, I have put together the personal collection of open problems below, which I think should be studied more scientifically.

Storage policy / software defined storage

Heterogeneous storage management (HSM) was recently introduced to HDFS. While it improves storage cost-efficiency, it requires system administrators to manually set the storage tier for each file, a burden that grows with the amount of data in the system. A more advanced solution is to follow the software defined storage principle and build a policy engine that translates high-level, human-friendly policies into the storage mechanisms under the hood. IOFlow is an example in this direction.
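To make the policy engine idea a bit more concrete, here is a minimal sketch in Python. It is not how HDFS or IOFlow works; it only illustrates translating a few human-readable rules into a storage tier per file. The `FileInfo` fields, the `Rule` type, the thresholds, and `choose_policy` are all hypothetical names made up for this illustration. The tier names HOT, WARM, and COLD mirror the names of HDFS's built-in storage policies.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FileInfo:
    """Hypothetical per-file metadata a policy engine might look at."""
    path: str
    days_since_access: int
    size_bytes: int


@dataclass
class Rule:
    """One human-readable policy and the tier it maps to."""
    description: str
    matches: Callable[[FileInfo], bool]
    policy: str


# Ordered rules: the first matching rule wins.
# The paths and thresholds are assumptions chosen for illustration only.
RULES: List[Rule] = [
    Rule("scratch data stays on fast media",
         lambda f: f.path.startswith("/tmp/"), "HOT"),
    Rule("files untouched for 90+ days go to archival storage",
         lambda f: f.days_since_access >= 90, "COLD"),
    Rule("files untouched for 30+ days move to the middle tier",
         lambda f: f.days_since_access >= 30, "WARM"),
]

DEFAULT_POLICY = "HOT"


def choose_policy(f: FileInfo,
                  rules: List[Rule] = RULES,
                  default: str = DEFAULT_POLICY) -> str:
    """Return the tier of the first rule that matches, or the default."""
    for rule in rules:
        if rule.matches(f):
            return rule.policy
    return default
```

With these rules, a file under `/data/logs` that has not been read for 120 days would be assigned COLD, while anything under `/tmp/` stays HOT regardless of age. The administrator maintains a handful of rules instead of one setting per file, which is the point of the approach. A real engine would also need to decide when to re-evaluate files and how to move data between tiers.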

In-memory caching

A few directions have been explored to leverage DRAM on HDFS servers for better performance. HDFS read caching follows a philosophy similar to that of a traditional OS cache and provides similar semantics. In contrast, HDFS discardable distributed memory (DDM) treats memory as a type of storage medium, much like the Tachyon project does.

Smart block data retention policy

HDFS already has a Trash mechanism to protect against accidental deletions at the file level. However, a block-level solution is desirable for a few reasons. One fundamental problem is how to prioritize blocks for retention. Existing cache eviction policies might not work well here.
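As a rough illustration of what prioritizing blocks could look like, here is a small Python sketch. It is purely hypothetical: the `DeletedBlock` fields, the scoring formula, and the greedy budget-based selection are my own assumptions for the sake of discussion, not an existing HDFS feature or a recommended policy. The sketch ranks deleted blocks by a score and keeps the highest-ranked ones that fit in a fixed retention budget.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class DeletedBlock:
    """Hypothetical metadata for a block that is pending deletion."""
    block_id: int
    size_bytes: int
    seconds_since_delete: int
    surviving_replicas: int  # copies of this block still retained elsewhere


def retention_score(b: DeletedBlock) -> float:
    """Higher score means more worth keeping (assumed heuristic).

    Recently deleted blocks score higher, on the assumption that
    accidental deletions are noticed soon. Blocks with fewer surviving
    replicas score higher because losing them is harder to undo.
    """
    recency = 1.0 / (1.0 + b.seconds_since_delete / 3600.0)
    scarcity = 1.0 / (1.0 + b.surviving_replicas)
    return recency * scarcity


def plan_retention(
    blocks: Iterable[DeletedBlock], budget_bytes: int
) -> Tuple[List[DeletedBlock], List[DeletedBlock]]:
    """Greedily keep the highest-scoring blocks that fit in the budget."""
    keep: List[DeletedBlock] = []
    purge: List[DeletedBlock] = []
    used = 0
    for b in sorted(blocks, key=retention_score, reverse=True):
        if used + b.size_bytes <= budget_bytes:
            keep.append(b)
            used += b.size_bytes
        else:
            purge.append(b)
    return keep, purge
```

Note how this differs from a cache eviction policy such as LRU: the score depends on how recoverable the data is, not on how likely it is to be read again. Finding a scoring function that is actually justified by real deletion and recovery patterns is the open question.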
