Zhe Zhang Systems Engineer and Researcher


Notes from Hadoop Summit 2016

Having worked on Hadoop for two years, this was my first Hadoop Summit. It will also be the last one under that name, because the conference will be rebranded as DataWorks Summit next year. It took place at the San Jose Convention Center from Tuesday 6/28 to Thursday 6/30. This post summarizes my key takeaways.

Hadoop and Beyond

The rebranding, as well as the conference agenda, suggests a shift in focus toward the application layer (what can be done with Hadoop) rather than the low-level technologies.

Unfortunately I couldn’t attend many of the application-focused talks (the conference ran 8–9 parallel sessions). I was impressed by the “Unreasonable effectiveness of ACID” keynote from Microsoft, where ACID stands for algorithms, cloud, IoT, and data. The talk highlighted the social aspect of big data with an example of preventing school dropouts through large-scale data collection and machine learning. Another keynote by Ben Hammersley, my personal favorite from the conference, delivered a similar message in a humorous tone.


That said, a lot remains to be done on Apache Hadoop, which has just turned 10 years old. From what I saw, scalability and cloud integration were the two focus areas this year.


Apache Hadoop was initially designed with a single-master architecture. Many global-scale companies are deploying clusters with 5k to 10k nodes, and scalability is becoming a severe constraint.

Multi-DC HDFS talk from Twitter:

  • Twitter’s production Hadoop environment has multiple logical clusters in each DC
  • Logical clusters are functionally-partitioned: ad hoc, stable production jobs, etc.
  • Each logical cluster has 3 nameservices: /tmp, /user, /log
  • Each DataNode belongs to all 3 nameservices
  • The replication protocol, Nfly, is only used by jobs with small data volumes; replicated paths carry a /nfly/ prefix
  • Nfly can leave orphan temporary files behind; they are cleaned up by a retention program

Multi-tenancy Support from HDFS talk from Hortonworks:

  • Mainly focusing on RPC scalability on NameNode
  • Interesting work on HADOOP-13128 about using “coupon”, or reservation, to achieve better SLAs.
  • Isolating applications from DataNode: HDFS-9239
  • FairCallQueue implements YARN-style fairness: HADOOP-9640
  • The community’s long-term vision is to consolidate HDFS requests under the control of YARN

Small file analysis talk from Expedia:

  • Built a tool, based on fsimage, to detect and categorize small files
  • Different approaches – compaction, deletion, archival – based on category

YARN federation talk from Microsoft:

  • MSFT has invested heavily in YARN, and most of that work is open source
  • YARN-2915 tracks the effort
  • The target is 100k nodes
  • MSFT has several 50k-node clusters and runs many short-lived jobs (a few seconds each)
  • Federation is not only for scalability, but also for cross-DC queries
  • Architecture is based on Router + StateStore
  • Every node has an AM-RM-Proxy
  • Different routing policies can be used, leading to interesting trade-offs

YARN timeline service v2 talk from Twitter:

  • Timeline Service v1 has a single server and a single LevelDB store
  • Timeline Service v2 uses HBase, adds metrics aggregation, and offers richer query APIs


Although on-premise datacenters still run the lion’s share of Hadoop deployments, it is an obvious trend to move big data workloads to public or private cloud platforms. HDInsight was mentioned a lot.

HDFS tiered storage talk from Microsoft (AKA HDFS-as-a-cache):

  • Emphasizes the problem of multiple clusters (even before moving to cloud)
  • Compared to other approaches (DistCp, application-managed multi-DC access), transparency is a big win
  • The community has long discussed approaches to “stage” or “page” part of the data / metadata to external store
  • The key idea here is to generalize the block concept and introduce a PROVIDED block type
  • DataNodes run a daemon for the actual data transfer
  • When used together with HDInsight, computation and data “jobs” will be co-scheduled to achieve just-in-time data staging
  • Many smart algorithms can be considered for eviction and prediction-based prefetching

HDFS and object storage talk from Hortonworks:

  • Interesting summary of different usage patterns of Hadoop on cloud: 1) all I/O happens to HDFS, and HDFS stages data with blob store; 2) input data from blob store, output data written to HDFS and eventually copied to blob store; 3) both input and output I/O on blob store, HDFS only has temporary data.
  • Interesting summary of pros and cons of blob store and HDFS, including scalability, consistency, and locality
  • Connectors like s3a bridge the semantic gap between blob stores and HDFS, to some degree
  • Consistency is enhanced via a “secondary metadata store”
  • Hadoop compatible file system (HCFS), combined with file system contract tests, is key for extending to more cloud blob stores
  • HBase is more closely coupled with HDFS (instead of HCFS) than other applications
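To make the connector point above concrete, here is a minimal core-site.xml fragment for the s3a connector. The two property names are the standard s3a credential settings; the bucket name and key values are placeholders, and production setups would typically use a credential provider instead of plaintext keys.

```xml
<!-- Minimal s3a credentials in core-site.xml; values are placeholders. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With this in place, a command of the form `hadoop fs -ls s3a://my-bucket/path` (bucket name hypothetical) reads the blob store through the HCFS interface.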

Operationalizing YARN in the cloud talk from Qubole: [to be added]

General Improvements

The HDFS Optimization, Stabilization and Supportability talk from Hortonworks gave a good summary of recent detailed work on HDFS.

Over-committing YARN resources talk from Yahoo:

  • A lot of jobs are poorly configured, wasting resources
  • Static overcommit (configuring containers to use more than the OS offers) does not work well, much like over-selling flight tickets
  • NodeManagers report utilization to the ResourceManager via heartbeats, to enable dynamic overcommitting
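The heartbeat-driven idea above can be sketched as a simple admission check: unlike static overcommit, the decision uses live utilization rather than the sum of allocations. The method name, thresholds, and units below are illustrative assumptions, not YARN’s actual API.

```java
// Sketch of a dynamic-overcommit check. A node may run "extra" containers
// as long as *measured* usage (reported via NodeManager heartbeats) plus
// the new request stays under a safety threshold.
public class OvercommitCheck {
    // nodeCapacityGb: physical memory on the node
    // reportedUsedGb: actual usage from the latest heartbeat
    // requestGb:      size of the candidate opportunistic container
    // safetyFraction: headroom to absorb usage spikes (e.g. 0.9)
    static boolean canOvercommit(double nodeCapacityGb,
                                 double reportedUsedGb,
                                 double requestGb,
                                 double safetyFraction) {
        return reportedUsedGb + requestGb <= nodeCapacityGb * safetyFraction;
    }

    public static void main(String[] args) {
        // 256 GB node; containers were allocated 240 GB but only use 100 GB:
        System.out.println(canOvercommit(256, 100, 8, 0.9)); // true
        // Same node under genuine memory pressure:
        System.out.println(canOvercommit(256, 225, 8, 0.9)); // false
    }
}
```

The key contrast with static overcommit is that the inputs here change with every heartbeat, so the scheduler backs off as soon as real usage climbs.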


Presto is gaining a lot of traction, especially in global-scale Internet firms that run Apache Hadoop releases directly (instead of vendor distributions). The joint talk from Facebook and Teradata covered many exciting new features. The presenter acknowledged Impala’s performance advantage and noted that Presto’s major win is its ability to operate on multiple data stores, including S3.

The Hive HBase metastore talk was also interesting. It is somewhat surprising that HDFS and YARN scalability solutions are based on federation, while Hive relies on a key-value store.


Streaming is quickly emerging as a major topic in big data analytics. Unfortunately I didn’t have a chance to attend many streaming talks; I will add details after watching some video recordings.

If you liked this post, you can share it with your followers or follow me on Twitter!

Why Kerberos? Simple Explanation in the HDFS Context

For people without a security background, Kerberos can be very hard to understand. I recently had to fix a number of Kerberos issues in HDFS, which turned out to be very educational.

At a high level, every process represents someone who is entitled to access resources and request services (a principal, in Kerberos terminology). The way it proves this to a resource or service provider is by presenting a token: a chunk of bytes that is hard to generate but easy to verify. The Kerberos KDC (Key Distribution Center) is in charge of distributing tokens of this sort.

As a concrete example, in a Kerberized cluster the Balancer acquires a token from the KDC saying that it represents the cluster administrator. It then uses the token to query the NameNode for block locations and to work with DataNodes to move those blocks.
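The “hard to generate, easy to verify” property can be illustrated with a toy sketch. This is not the real Kerberos protocol (which uses encrypted tickets and session keys); it only shows the shape of the idea, with a MAC standing in for a ticket and all names invented for illustration.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy analogy for a Kerberos token: a claim plus a MAC computed with a
// secret that only the "KDC" and the service share. Without the secret,
// the token cannot be forged; with it, verification is one cheap step.
public class ToyToken {
    static byte[] sign(byte[] key, String claim) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(claim.getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static boolean verify(byte[] key, String claim, byte[] token) {
        return Arrays.equals(sign(key, claim), token);
    }

    public static void main(String[] args) {
        byte[] sharedSecret = "kdc-service-secret".getBytes(StandardCharsets.UTF_8);
        // "KDC" issues a token asserting the Balancer acts as the admin.
        byte[] token = sign(sharedSecret, "balancer-as-hdfs-admin");
        // "NameNode" verifies it cheaply before serving block locations.
        System.out.println(verify(sharedSecret, "balancer-as-hdfs-admin", token)); // true
        System.out.println(verify(sharedSecret, "impostor", token));               // false
    }
}
```

The real protocol adds timestamps, encryption, and a multi-step exchange, but the asymmetry between forging and verifying is the core intuition.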

Notes from AMPLab Winter Retreat

UC Berkeley’s AMPLab generously hosts annual retreats to present early-stage research work; they also serve as a venue for collaborators from academia and industry to exchange ideas. This year’s winter retreat was held at the Granlibakken resort in Tahoe City, a very good ski resort for beginners and families with young kids.

This is my summary from attending the event:

New Lab

The most interesting topic was the announcement of a new lab to succeed the current AMPLab. From my understanding, the succession includes a new name (to be determined) and a new theme. Ion Stoica presented a number of visions under three highlighted points:

  1. Decision latency
  2. Data freshness
  3. Strong security

Ion used the example of stock trading and envisioned a broader range of applications matching its msec-level data freshness and microsec-level decision latency in the future. Navigation and self-driving were used as a second example, where improved data freshness and decision latency could make a huge impact.

Their goal is to create a general platform, which can be seen as the next Spark, to provide these three properties to upper-level applications. Ion pointed out a few challenging trade-offs between conflicting goals and drew a spectrum positioning current systems within those trade-offs. Hadoop-based solutions were categorized as good-latency, bad-freshness, bad-security, OK-functionality. BDAS-based solutions got similar scores except for OK-freshness and somewhat better functionality. So security has been identified as a common weak point: despite storage-level encryption, many places in the compute pipeline are unprotected. There are some security solutions that support computation on encrypted data (e.g., CryptDB, Mylar), but they only support simple algorithms.

So their goal is to win along every dimension of the spectrum (“Spark-like functionality with 100x lower latency on 10x fresher data, with strong security”).


Several talks and posters were dedicated to Succinct, an in-memory compression format. The project already has a published paper, so I won’t go into too much detail; the presentation was of high quality and worth watching. It was the first time I learned that Succinct optimizes for point queries while sacrificing scan-based queries, which means users have to determine their workloads beforehand and decide whether to use Succinct. The Succinct Spark package is already GA. The current plan is to enable more features, including regex search, graph search, and so forth. They are also building an encryption-compression solution; the technique is mini-batching (compressing and encrypting a number of rows together) with some empirical parameter tuning.

Machine Learning

I don’t have a deep ML background and therefore followed most ML talks at a conceptual level. The Helio project aims to create a platform where a data owner can specify which portions (e.g., columns) of their data can be queried, and what degree of aggregation can be output. The CoCoA and Stumptown projects are optimizations that tune the ratio of local computation to global communication. KeystoneML is a general platform for data scientists to easily write ML applications.

Dinner Discussion

It was a good idea that dinner tables were divided into “interest groups”. David Alves and I sat at a “Systems” table. Two topics I found most interesting:

  1. What are good use cases for fresh data? In both stock trading and self-driving, fresh data is very important. What properties of an application make this so? Can we discover more applications where fresh data makes a big impact?
  2. When should we push computation to the “edge”? Edge computing is faster and more secure, but how do we balance that against model accuracy?


Java byte array vs. ByteBuffer

This is a quick note comparing byte arrays with (direct and non-direct) ByteBuffers. In a nutshell, Buffer is an abstraction for preparing data for channel-based I/O, and ByteBuffer is a subtype offering byte-oriented operations.

The ByteBuffer API can be implemented by either wrapping a byte array (non-direct ByteBuffer) or allocating from off-heap memory (direct ByteBuffer).

Here’s the easiest way for me to understand ByteBuffer: 1) we needed a way to operate on off-heap memory, which became DirectByteBuffer; 2) to unify its handling with on-heap byte arrays, we introduced the ByteBuffer abstraction.
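The two flavors can be seen side by side in a short sketch: the same API works against both, and `hasArray()` reveals whether a backing byte array exists.

```java
import java.nio.ByteBuffer;

// Heap vs. direct ByteBuffer: identical API, different backing memory.
public class BufferDemo {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(16);         // wraps a byte[] on the heap
        ByteBuffer direct = ByteBuffer.allocateDirect(16); // off-heap memory, no backing array

        heap.putInt(42);
        direct.putInt(42);
        heap.flip();    // switch from writing to reading
        direct.flip();

        System.out.println(heap.hasArray());   // true  -> backed by byte[]
        System.out.println(direct.hasArray()); // false -> off-heap
        System.out.println(heap.getInt());     // 42
        System.out.println(direct.getInt());   // 42
    }
}
```

Direct buffers avoid a copy during channel I/O (the OS can read/write the off-heap region directly), at the cost of more expensive allocation; heap buffers are cheap to allocate and interoperate with plain byte arrays.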


My HDFS Research Agenda

Approaching its 10th (or 4th, depending on how you count) birthday and its 9000th JIRA issue, HDFS is still a young project. Scores of interesting technical problems await solutions.

HDFS touches many active research areas in academia: operating systems, data storage, and distributed computing, just to name the major ones. More joint research projects between universities and companies would benefit both sides in unique ways and advance the entire big data industry.

Almost a year into HDFS development, below is my personal collection of open problems that should be studied more scientifically.

Storage policy / software defined storage

Heterogeneous storage management (HSM) was recently introduced to HDFS. While it enhances storage cost-efficiency, it requires system administrators to manually set the storage tier for each file, a burden that scales with the amount of data in the system. A more advanced solution is to follow the software-defined storage principle and build a policy engine that translates high-level, human-friendly policies into storage machinery under the hood. IOFlow is an example along this direction.
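A minimal sketch of such a policy engine might map a high-level rule (“files untouched for N days go cold”) to per-file tiers. The tier names echo HDFS storage-policy vocabulary, but the rule shape and thresholds here are invented for illustration, not an actual HDFS or IOFlow API.

```java
// Toy policy engine: translate one declarative, human-friendly rule into
// a storage tier per file, instead of admins tagging every file by hand.
public class PolicyEngine {
    enum Tier { HOT, WARM, COLD }

    // Hypothetical rule: tier is a function of days since last access.
    static Tier tierFor(int daysSinceLastAccess) {
        if (daysSinceLastAccess <= 7)  return Tier.HOT;  // recent: fast media
        if (daysSinceLastAccess <= 90) return Tier.WARM; // mixed disk/archive
        return Tier.COLD;                                // archive only
    }

    public static void main(String[] args) {
        System.out.println(tierFor(1));   // HOT
        System.out.println(tierFor(30));  // WARM
        System.out.println(tierFor(365)); // COLD
    }
}
```

A real engine would scan metadata (e.g., the fsimage), evaluate such rules in bulk, and invoke the storage-tier machinery under the hood, which is exactly the translation layer the paragraph above argues for.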

In-memory caching

A few directions have been explored to leverage DRAM on HDFS servers for better performance. HDFS read caching follows a similar philosophy (and provides similar semantics) as traditional OS cache. In contrast, HDFS discardable distributed memory (DDM) uses memory as a type of storage medium, which is similar to the Tachyon project.

Smart block data retention policy

HDFS already has a Trash mechanism to protect against mis-deletions at the file level. However, a block-level solution is desirable for a few reasons. One fundamental problem is to sort out the priorities of blocks in the context of data retention; existing cache eviction policies might not work well.


Paper review of Tachyon (SoCC '14)


Tachyon is a distributed caching layer on top of disk-based file systems such as HDFS or GlusterFS. The core idea is to use lineage to provide lightweight, optimistic fault tolerance.

More importantly, as the paper argues, “due to the inherent bandwidth limitations of replication, a lineage-based recovery model might be the only way to make cluster storage systems match the speed of in-memory computations in the future.”

Extending from Spark, Tachyon also assumes an elegant Lambda-calculus-flavored model, where new datasets are generated by closed-form operators (e.g., map and group-by) from existing ones. This empowers Spark and Tachyon to persist the operator binaries – whose sizes are negligible – instead of replicating the datasets themselves. As quoted above, this might indeed be the only way to write data with local memory speed – from an information theoretic perspective, how else could cheap redundancy be achieved?

Under this elegant assumption, Spark allows you to program with PB-sized variables; and Tachyon provides a name space for those variables to be shared among applications. The programming model becomes much more flexible than MapReduce.


Tachyon has a simple API:

  Signature                                                                           Return
  createDependency(inputFiles, outputFiles, binaryPrograms, config, dependencyType)   lineageId
  getDependency(lineageId)                                                            Dependency info

Of course, the binaryPrograms need to be deterministic and the inputFiles need to be immutable.

Files are first stored in the memory-based lineage layer and later checkpointed to the disk-based persistence layer. If the lineage layer fills up, files are evicted to the persistence layer, following an LRU policy by default.
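The eviction behavior is easy to sketch with Java’s LinkedHashMap in access order. This is only an illustration of the default LRU policy described above; the class name and capacity model are made up, and real Tachyon would write the evicted file to the persistence layer rather than drop it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a bounded in-memory layer with LRU eviction, in the spirit of
// Tachyon's default policy. Capacity is counted in entries for simplicity.
public class LruMemoryLayer<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruMemoryLayer(int capacity) {
        super(16, 0.75f, true); // access-order: reads refresh an entry's recency
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // In Tachyon, the evicted file would be flushed to the disk-based
        // persistence layer; here we simply let the map drop it.
        return size() > capacity;
    }
}
```

For example, with capacity 2, inserting files a and b, reading a, then inserting c evicts b, the least recently used entry.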

Checkpoints are created in the background. Edge files and hot files are given priority to be checkpointed.


In principle, Tachyon should work well as long as the basic assumption holds: datasets are connected by closed-form jobs. In general, this should hold for most analytical workloads. But how about transactional workloads such as the ones supported by HBase? The job binary – describing a newly inserted value – will be as large as the data itself.

As mentioned in the paper, data serialization is also a fundamental bottleneck. Here’s a back-of-the-envelope calculation based on Figures 7 and 8 from the paper: if the theoretical upper bound of replication is 2.5x faster than MemHDFS (0.5 GBps / 0.14 GBps), then it should be able to catch up with Tachyon in the real workload tests. Time for some sort of short-circuit writing?

From the paper: “With Tachyon, the main overhead was serialization and de-serialization since we used the Hadoop TextInputFormat. With a more efficient serialization format, the performance gap is larger.”

The lineage-based recovery model also contradicts a file system user’s expectations. For example, it would surprise many applications to see a 1-minute latency (even occasionally) when reading a small piece of a file.

My assessment

The initial idea of Spark was similar to MapReduce Online. However, the lineage-based optimistic fault tolerance model greatly generalized and externalized intra-job intermediate results, elevating them to the unprecedented status of programmable units. This is a breakthrough in distributed computing.

Tachyon itself makes 2 new contributions, as outlined in the abstract:

The key challenge in making a long-running lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under commonly used resource schedulers.

Both are solid, incremental contributions around better prioritization of fault tolerance tasks. Certainly not as fundamental as Spark, but critical in adapting the main idea to the file system context.


Leaving IBM, joining Cloudera

Starting this August I’ll join Cloudera as a Software Engineer on the HDFS team, waving good-bye to my 4.5-year tenure as a researcher, first at Oak Ridge National Lab and then at IBM Watson Lab. At this point I’d like to log my thoughts behind this choice. Every job change revolves around two factors: compensation and happiness. The second factor, which includes a sense of accomplishment, recognition from other people, and pure interest, will be covered in this post.

Politics etc.

I didn’t actually experience particularly bad big-company politics at IBM; most of my Research and non-Research colleagues are very reasonable. However, in a company with 400,000 employees it is very easy to step on other people’s toes, and you’ll find yourself spending more time than you want on planning rather than doing the real work.

Another consequence of the large size is that executives know relatively little about the technical details, including the state of the art. Meanwhile, your performance score depends on how they like your demo. Therefore, the incentive mechanism is sometimes misaligned with what you know to be the right things to do.

Breadth vs. Depth

During my 3.5 years at IBM Research I worked on VM provisioning, cloud BFT, VM caching, software license management, and a little bit of HDFS as a side project. From my observation, that’s more or less the case for many researchers there. IBM’s business model dictates that we do a lot of “integration of XXX with YYY” and “XXX-as-a-service”. I totally see the value-add and the technical challenges of those projects; I just prefer to go “crazy deep” on the XXX problem itself before moving to the next subject.

Things I will miss about IBM Research:

  1. The “never-jam” Taconic Parkway. Seriously, I will miss Westchester – easy commute to NYC, (relatively) cheap housing, Bear Mountain, …
  2. The “real scientists” running around the lab in white coats. Seriously, I will miss the resourcefulness – you can grab an expert in any area (you name it: materials science, partial differential equations, …) if your project needs one.
  3. The close connections with academia.
  4. The T. J. Watson Center. The remarkable building, the beautiful library, and neat offices.

Things I won’t miss about IBM Research:

  1. “Welcome to the teleconference service, please enter your access code” … “There are 17 participants on the call, you are joining as a participant”. Seriously, there are just so many hour-long meetings where you only speak for 5 seconds.
  2. Having to pay for coffee, having to use a four-year-old laptop, having to try so hard to get a 23’’ monitor. The HR department just doesn’t consider enhancing employee morale and productivity a priority.

Things I look forward to about Cloudera:

  1. The “hacker mentality” and pride as a hardcore engineer.
  2. A zoomed-in view of how people are using large distributed storage systems in production.
