Paper review of Tachyon (SoCC '14)
Summary
Tachyon is a distributed caching layer on top of disk-based file systems such as HDFS or GlusterFS. The core idea is to use lineage to provide lightweight, optimistic fault tolerance.
> More importantly, due to the inherent bandwidth limitations of replication, a lineage-based recovery model might be the only way to make cluster storage systems match the speed of in-memory computations in the future.
Extending Spark's approach, Tachyon assumes an elegant, lambda-calculus-flavored model in which new datasets are generated from existing ones by closed-form operators (e.g., map and group-by). This lets Spark and Tachyon persist the operator binaries, whose sizes are negligible, instead of replicating the datasets themselves. As quoted above, this might indeed be the only way to write data at local memory speed: from an information-theoretic perspective, how else could redundancy be achieved so cheaply?
Under this elegant assumption, Spark lets you program with PB-sized variables, and Tachyon provides a namespace in which those variables can be shared among applications. The programming model becomes much more flexible than MapReduce's.
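To make the idea concrete, here is a minimal, illustrative sketch (plain Java, not Tachyon's actual code; all names are mine) of what a lineage record conceptually stores: pointers to immutable inputs plus a small deterministic operator, rather than a copy of the output bytes.

```java
// Illustrative sketch only: the lineage record for a derived dataset stores
// *how* to recompute it, not the bytes themselves.
import java.util.List;
import java.util.function.Function;

final class LineageRecord<I, O> {
    final List<String> inputFiles;   // immutable inputs, e.g. paths in HDFS
    final Function<I, O> operator;   // deterministic, closed-form operator (e.g. a map)

    LineageRecord(List<String> inputFiles, Function<I, O> operator) {
        this.inputFiles = inputFiles;
        this.operator = operator;
    }

    // On failure, the output is regenerated by re-running the (tiny) operator
    // over the still-available inputs; no byte-level replication is needed.
    O recompute(I input) {
        return operator.apply(input);
    }
}
```

The point is that the recovery "redundancy" is the operator itself, which is a few kilobytes of code regardless of how large the derived dataset is.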
Details
Tachyon has a simple API:
| Signature | Return |
|---|---|
| `createDependency(inputFiles, outputFiles, binaryPrograms, config, dependencyType)` | `lineageID` |
| `getDependency(lineageId)` | Dependency Info |

Of course, the `binaryProgram` needs to be deterministic and the `inputFiles` need to be immutable.
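The table only gives the signatures at this level of detail, so here is a hedged sketch of what the API could look like as a plain Java interface. The concrete parameter and return types (e.g., `long` for the lineage ID, the `NARROW`/`WIDE` dependency types) are my assumptions, not Tachyon's exact definitions.

```java
// Sketch of the lineage API from the table above; types are guesses, not Tachyon's code.
import java.util.List;
import java.util.Map;

public interface LineageClient {
    // Registers how outputFiles will be derived from inputFiles, before the job runs.
    long createDependency(List<String> inputFiles,
                          List<String> outputFiles,
                          List<String> binaryPrograms,   // paths to small, deterministic job binaries
                          Map<String, String> config,    // configuration passed to the binaries
                          DependencyType dependencyType);

    // Returns the stored recipe so lost outputs can be recomputed.
    DependencyInfo getDependency(long lineageId);

    enum DependencyType { NARROW, WIDE }                 // assumed, Spark-style dependency kinds

    record DependencyInfo(long lineageId,
                          List<String> inputFiles,
                          List<String> outputFiles,
                          List<String> binaryPrograms) {}
}
```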
Files are first stored in the memory-based lineage layer and later checkpointed to the disk-based persistence layer. If the lineage layer fills up, files are also evicted to the persistence layer, following an LRU policy by default.
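As a rough illustration of that eviction path (again, not Tachyon's implementation), the sketch below pushes the least-recently-used files down to the persistence layer once the in-memory layer exceeds its capacity.

```java
// Illustrative LRU eviction from a memory layer to a disk-based persistence layer.
import java.util.LinkedHashMap;
import java.util.Map;

final class LruLineageLayer extends LinkedHashMap<String, byte[]> {
    private final long capacityBytes;
    private long usedBytes = 0;

    LruLineageLayer(long capacityBytes) {
        super(16, 0.75f, /* accessOrder = */ true);  // iteration order: least recently used first
        this.capacityBytes = capacityBytes;
    }

    void write(String path, byte[] data) {
        usedBytes += data.length;
        put(path, data);
        // Evict LRU entries to the persistence layer until we fit again.
        var it = entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
            Map.Entry<String, byte[]> victim = it.next();
            persistToDisk(victim.getKey(), victim.getValue());
            usedBytes -= victim.getValue().length;
            it.remove();
        }
    }

    private void persistToDisk(String path, byte[] data) {
        /* placeholder: write to the disk-based persistence layer, e.g. HDFS */
    }
}
```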
Checkpoints are created in the background. Edge files (currently the leaves of the lineage graph) and hot files are prioritized for checkpointing.
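A hedged sketch of that prioritization, assuming "edge" means a current leaf of the lineage graph and "hot" means frequently accessed; the names and ordering rule are illustrative, not the paper's Edge algorithm verbatim.

```java
// Illustrative ordering of checkpoint candidates: edge files first, then hotter files.
import java.util.Comparator;
import java.util.List;

record FileMeta(String path, boolean isEdge, long accessCount) {}

final class CheckpointScheduler {
    // Edge files before interior files; within each group, higher access count first.
    static List<FileMeta> order(List<FileMeta> candidates) {
        return candidates.stream()
                .sorted(Comparator.comparing((FileMeta f) -> !f.isEdge())
                        .thenComparing(Comparator.comparingLong(FileMeta::accessCount).reversed()))
                .toList();
    }
}
```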
Limitations
In principle, Tachyon should work well as long as the basic assumption holds: datasets are connected by closed-form jobs. This should be true for most analytical workloads. But what about transactional workloads, such as the ones supported by HBase? There, the job binary describing a newly inserted value would be as large as the data itself.
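A small illustration of the contrast (types and names are mine, purely for exposition): an analytic recipe is a tiny closure whose size is independent of the data, while a transactional "recipe" must embed the inserted value itself.

```java
// Illustrative only: why lineage saves little for transactional writes.
import java.util.function.Function;

final class RecipeSizes {
    // Analytic: a huge derived dataset is described by a small, data-independent operator.
    static Function<String, String> analyticRecipe() {
        return line -> line.toLowerCase();   // e.g. a map operator, a few KB of code
    }

    // Transactional: the only way to "recompute" put(key, value) is to store the value itself,
    // so the recipe is roughly as large as the data it produces.
    record PutRecipe(String key, byte[] value) {}
}
```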
As mentioned in the paper, data serialization is also a fundamental bottleneck. Here's a back-of-the-envelope calculation based on Figures 7 and 8 of the paper: if the theoretical upper bound of replication is about 3.6x faster than MemHDFS (0.5 GBps / 0.14 GBps), then replication should be able to catch up with Tachyon in the real-workload tests. Time for some sort of short-circuit writing?
> With Tachyon, the main overhead was serialization and de-serialization since we used the Hadoop TextInputFormat. With a more efficient serialization format, the performance gap is larger.
The lineage-based recovery model also conflicts with what file system users expect. For example, it would surprise many applications to see a 1-minute latency, even occasionally, when reading a small piece of a file.
My assessment
The initial idea of Spark was similar to MapReduce Online. However, the lineage-based, optimistic fault tolerance model greatly generalized and externalized intra-job intermediate results, turning them into programmable units for the first time. This is a breakthrough in distributed computing.
Tachyon itself makes two new contributions, as outlined in the abstract:
> The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers.
Both are solid, incremental contributions around better prioritization of fault tolerance work. They are certainly not as fundamental as Spark, but they are critical for adapting the main idea to the file system context.