Having worked on Hadoop for 2 years, this is my first Hadoop Summit – will be the only one as well, because the conference will be rebranded as Dataworks Summit next year. It took place in San Jose Convention Center from Tuesday 6/28 to Thursday 6/30. This post summarizes my key takeaways.
Hadoop and Beyond
The rebranding as well as the conference agenda suggest a trend to focus more on the application layer (what can be done with Hadoop) instead of the low level technologies.
Unfortunately I couldn’t attend many of the application-focused talks (conference has 8~9 parallel sessions). I was impressed by the “Unresonable effectiveness of ACID” keynote talk from Microsoft – ACID here stands for algorithms, cloud, IoT, and data. The talk highlights the social aspect of big data with an example of preventing school dropouts through large scale data collection and machine learning. Another keynote talk by Ben Hammersley, my personal favorite from the conference, delivers a similar message in a humorous tune.
HDFS and YARN
Above being said, a lot still remains to be done on Apache Hadoop, which has just turned 10 years old. From what I saw, scalability and cloud integration are the two focus areas this year.
Apache Hadoop was initially designed with a single-master architecture. Many global-scale companies are deploying clusters with 5k to 10k nodes, and scalability is becoming a severe constraint.
Interesting summary of different usage patterns of Hadoop on cloud: 1) all I/O happens to HDFS, and HDFS stages data with blob store; 2) input data from blob store, output data written to HDFS and eventually copied to blob store; 3) both input and output I/O on blob store, HDFS only has temporary data.
Interesting summary of pros and cons of blob store and HDFS, including scalability, consistency, and locality
Connectors like s3a bridge the semantics gap between blob stores and HDFS, to some degree
Enhances consistency via a “secondary metadata store”
Hadoop compatible file system (HCFS), combined with file system contract tests, is key for extending to more cloud blob stores
HBase is more closely coupled with HDFS (instead of HCFS) than other applications
Operationalizing YARN in the cloud talk from Qubole: [to be added]
A lot of jobs are poorly configured, causing resources wastage
Static overcommit (configure containers to use more than OS offers) doesn’t work just like “over-selling flight tickets”
NodeManager reports utlization to ResourceManager via heartbeats, to facilitate dynamic overcommitting
Presto is gaining a lot of tractions, especially in global-scale Internet firms which directly use Apache Hadoop releases (instead of vendor distributions). The joint talk from Facebook and Teradata includes many exciting new features. The presenter acknowledged the performance advantage of Impala and mentioned the major win with Presto is the capability to operate on multiple data sotres, including S3.
The Hive HBase metastore talk was also interesting. It is somewhat surprising that HDFS and YARN scalability solutions are based on federation, but Hive relies on a Key-Value store.
Streaming is quickly emerging as a major topic in big data analytics. Unfortunatelly I didn’t have chance to attend many streaming talks. Will add details after watching some video recordings.
For people without a security background, Kerberos can be very hard to understand. I recently had to fix a number of Kerberos issues in HDFS, and that turned out to be very helpful for my education.
From a high level, every process should represent someone who is entitled to access resources and request services (principle in Kerberos terminology). The way it proves to a resource / service provider is to show a token. A token is a chunk of bytes that is hard to generate but easy to verify. Kerberos KDC is in charge of distributing tokens of this sort.
As a concrete example, in a Kerberized cluster, Balancer acquires a token from KDC saying that it represents the cluster administrator. It then uses the token to check with NameNode about block locations, and work with DataNodes to move those blocks.
UC Berkeley AMPLab generously hosts annual retreats to present early-stage research work. It also serves as a venue for collaborators from the acdemia and industries to exchange ideas. This year’s winter retreat was held at the Granlibakken resort in Tahoe City. Very good ski resort for beginners and families with young kids.
This is my summary from attending the event:
The most interesting topic was the annoucement of a new lab to succeed the current AMPLab. From my understanding, this succession includes a new name (to be determined) and a new theme. Ion Stoica presented a number of visions under three highlighted points:
Ion used the example of stock trading and envisions a broader range of applications to match the msec-level data freshness and microsec-level decision latency in the future. Nagivation and auto-driving was used as a second example, where the improved data freshness and decision latency could make a huge impact.
Their goal is to create a general platform – can be seen as next Spark – to supplement these three properties to upper level applications. Ion pointed out a few challenging trade-offs between conflicting goals, and drew a spectrum to position current systems in these trade-offs. Hadoop-based solutions was categorized as good-latency, bad-freshness, bad-security, OK-functionality. BDAS-based solution got similar scores except for OK-freshness, and somewhat better functionalities. So security has been identified as a common weak point. Despite storage-level encryption, many places in the compute pipe are unprotected. There are some security solutions to support computation on encrypted data (e.g. CryptDB, Mylar) but they only support simple algorithms.
So their goal is to win along every dimension on the spetrum (“Spark-like functionality with 100x lower latency on 10x fresher data, with strong security”).
Several talks and posters were dedicated to Succint, an in-memory compression format. The project already has a published paper so I won’t go into too much details. The presentation was of high quality and worth watching. First time for me to learn that Succint optimizes for point queries while sacrificing scan-based queries. This means users have to determine their workloads before-hand and decide whether to use Succint. Succint-Spark package is already GA. The current plan is to enable more features, including regex search, graph search and so forth. They are also building an enryption-compression solution. The technique is mini-batching (compress and encrypt a number of rows) with some empirical parameter tuning.
I don’t have deep ML background and therefore received most ML talks conceptually. The Helio project aims to create a platform where a data owner can specify which portion (e.g. columns) of their data can be queries, and what degree of aggregation can be output. The CoCoA and Stumptown projects are optimizations which tunes the ratio of local computation and global communication. KeystoneML is a general platform for data scientists to easily write ML applications.
It was a good idea that dinner tables were divided as “interest groups”. Myself and David Alves sat in a “System” table. Two topics that I found most interesting:
What are good use cases of fresh data? In both stock trading and auto-driving, fresh data is very important. What properties in an application cause so? Can we discover more applications where fresh data makes big impact?
When should we push computation to the “edge”? Edge computing is faster and more secure. But how to balance the model accuracy?
This is a quick note comparing arrays and (direct / indirect) ByteBuffer. So basically, Buffer is an abstraction using which to prepare data for channel-based I/O. ByteBuffer is a subtype offering byte-oriented operations.
The ByteBuffer API can be implemented by either wrapping a byte array (non-direct ByteBuffer) or allocating from off-heap memory (direct ByteBuffer).
Here’s the easiest way for me to understand ByteBuffer: 1) we introduced a way to operate on off-heap memory called DirectByteBuffer; 2) to unify its handling with on-heap byte arrays we introduced an abstraction called ByteBuffer.
Approaching its 10th or 4th year, depending on how you count it, and 9000 JIRA issues, HDFS is still a young project. Scores of interesting technical problems await solutions.
HDFS touches many active research areas in the academia: operating system, data storage, distributed computing, just to name the major ones. More joint research projects between universities and companies would benefit both sides in unique ways, and advance the entire big data industry.
Almost a year into HDFS development, below is my personal collection of open problems that should be studied more scientifically.
Storage policy / software defined storage
Heterogeneous storage management (HSM) was recently introduced to HDFS. While enhancing storage cost-efficiency, it requires system administrators to manually set the storage tier for each file. This burden scales with the amount of data in the system. A more advanced solution is follow the software defined storage principle and build a policy engine to translate high level, human friendly policies to storage machineries under the hood. IOFlow is an example along the direction.
A few directions have been explored to leverage DRAM on HDFS servers for better performance. HDFS read caching follows a similar philosophy (and provides similar semantics) as traditional OS cache. In contrast, HDFS discardable distributed memory (DDM) uses memory as a type of storage medium, which is similar to the Tachyon project.
Smart block data retention policy
HDFS already has a Trash mechanism to protect against mis-deletions on the file level. However, a block-level solution is desirable for a few reasons. One fundamental problem is to sort out the priorities of blocks in the context of data retention. Existing caching eviction policies might not work well.
Tachyon is a distributed caching layer on top of disk-based file systems such as HDFS or GlusterFS. The core idea is to use lineage to provide lightweight, optimistic fault tolerance.
More importantly, due to the inherent bandwidth limitations of replication, a lineage-based recovery model might be the only way to make cluster storage systems match the speed of in-memory computations in the future.
Extending from Spark, Tachyon also assumes an elegant Lambda-calculus-flavored model, where new datasets are generated by closed-form operators (e.g., map and group-by) from existing ones. This empowers Spark and Tachyon to persist the operator binaries – whose sizes are negligible – instead of replicating the datasets themselves. As quoted above, this might indeed be the only way to write data with local memory speed – from an information theoretic perspective, how else could cheap redundancy be achieved?
Under this elegant assumption, Spark allows you to program with PB-sized variables; and Tachyon provides a name space for those variables to be shared among applications. The programming model becomes much more flexible than MapReduce.
Of course the binaryProgram needs to be deterministic and inputFiles need to be immutable.
Files are first stored in the memory-based lineage layer and then checkpointed in the disk-based persistence layer. If the lineage layer gets full, files are also evicted to the persistence layer following LRU policy by default.
Checkpoints are created in the background. Edge files and hot files are given priority to be checkpointed.
In principle, Tachyon should work well as long as the basic assumption holds: datasets are connected by closed-form jobs. In general, this should hold for most analytical workloads. But how about transactional workloads such as the ones supported by HBase? The job binary – describing a newly inserted value – will be as large as the data itself.
As mentioned in the paper, data serialization is also a fundamental bottleneck. Here’s a back-of-the-envelop calculation based on Figures 7 and 8 from the paper: if the theoretical upper bound of replication is 2.5x faster than MemHDFS (0.5GBps / 0.14GBps), then it should be able to catch up with Tachyon in the real workload tests. Time for some sort of short-circuit writing?
With Tachyon, the main overhead was serialization and de-serialization since we used the Hadoop TextInputFormat. With a more efficient serialization format, the performance gap is larger.
The lineage-based recovery model also contradicts with a file system user’s expectations. For example, it would surprise many applications to get a 1-minute latency (even occasionally) to read a small piece from a file.
The initial idea of Spark was similar to MapReduce Online. However, the lineage-based optimistic fault tolerance model greatly generalized and externalized intra-job intermediate results, bringing them to the unprecedented stage of programmable units. This is a breakthrough in distributed computing.
Tachyon itself makes 2 new contributions, as outlined in the abstract:
The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers.
Both are solid, incremental contributions around better prioritization of fault tolerance tasks. Certainly not as fundamental as Spark, but critical in adopting the main idea to the file system context.
Starting this August I’ll join Cloudera as a Software Engineer in the HDFS team. Waving good-bye to my 4.5 year tenure as a Researcher, first at Oak Ridge National Lab and then at IBM Watson Lab. At this point I’d like to log my thoughts behind this choice. Around every job change there are two factors: compensation and happiness. The second factor, which includes sense of accomplishment, recognition from other people, and pure interest, will be the covered in this post.
I didn’t actually get particularly bad big company politics at IBM. Most of my Research or non-Research colleagues are very reasonable. However, in a company with 400,000 employees, it is very easy to step on other people’s toes. You’ll find yourself spending more time than you want on planning rather than doing the real work.
Another consequence of the large size is that executives know relatively little about the technical details, including the state-of-the-art. Meanwhile your performance score depends on how they like your demo. Therefore, the incentive mechanism is sometimes misaligned with what you know as the right things to do.
Breadth vs. Depth
During my 3.5 years at IBM Research I have worked on VM provisioning, cloud BFT, VM caching, software license management, and a little bit of HDFS as a side project. From my observation that’s more or less the case for many researchers here. IBM’s business model decides that we do a lot of “integration of XXX with YYY”, and “XXX-as-a-service” here. I totally see the value-add and the technical challenges of those projects. However, I just prefer to go “crazy deep” on the XXX problem itself before moving to the next subject.
Things I will miss about IBM Research:
The “never-jam” Taconic Parkway. Seriously, I will miss Westchester – easy commute to NYC, (relatively) cheap housing, Bear Mountain, …
The “real scientists” running around the lab in white coats. Seriously, I will miss the resourcefulness – you can grab an expert in any area (you name it, material science, partial differential equations, …) if your project needs.
The close connections with academia.
The T. J. Watson Center. The remarkable building, the beautiful library, and neat offices.
Things I won’t miss about IBM Research:
“Welcome to the teleconference service, please enter your access code” … “There are 17 participants on the call, you are joining as a participant”. Seriously, there are just so many hour-long meetings where you only speak for 5 seconds.
Having to pay for coffee, having to use 4 year old laptop, having to try so hard to get a 23’’ monitor. The HR department just doesn’t consider it a priority to enhance employee morale and productivity.
Things I look forward to about Cloudera:
The “hacker mentality” and pride as a hardcore engineer.
A zoomed-in view of how people are using large distributed storage systems in production.