Notes from AMPLab Winter Retreat

19 Jan 2016

UC Berkeley AMPLab generously hosts annual retreats to present early-stage research work. It also serves as a venue for collaborators from the acdemia and industries to exchange ideas. This year’s winter retreat was held at the Granlibakken resort in Tahoe City. Very good ski resort for beginners and families with young kids.

This is my summary from attending the event:

New Lab

The most interesting topic was the annoucement of a new lab to succeed the current AMPLab. From my understanding, this succession includes a new name (to be determined) and a new theme. Ion Stoica presented a number of visions under three highlighted points:

Decision latency
Data freshness
Strong security

Ion used the example of stock trading and envisions a broader range of applications to match the msec-level data freshness and microsec-level decision latency in the future. Nagivation and auto-driving was used as a second example, where the improved data freshness and decision latency could make a huge impact.

Their goal is to create a general platform – can be seen as next Spark – to supplement these three properties to upper level applications. Ion pointed out a few challenging trade-offs between conflicting goals, and drew a spectrum to position current systems in these trade-offs. Hadoop-based solutions was categorized as good-latency, bad-freshness, bad-security, OK-functionality. BDAS-based solution got similar scores except for OK-freshness, and somewhat better functionalities. So security has been identified as a common weak point. Despite storage-level encryption, many places in the compute pipe are unprotected. There are some security solutions to support computation on encrypted data (e.g. CryptDB, Mylar) but they only support simple algorithms.

So their goal is to win along every dimension on the spetrum (“Spark-like functionality with 100x lower latency on 10x fresher data, with strong security”).

Succint

Several talks and posters were dedicated to Succint, an in-memory compression format. The project already has a published paper so I won’t go into too much details. The presentation was of high quality and worth watching. First time for me to learn that Succint optimizes for point queries while sacrificing scan-based queries. This means users have to determine their workloads before-hand and decide whether to use Succint. Succint-Spark package is already GA. The current plan is to enable more features, including regex search, graph search and so forth. They are also building an enryption-compression solution. The technique is mini-batching (compress and encrypt a number of rows) with some empirical parameter tuning.

Machine Learning

I don’t have deep ML background and therefore received most ML talks conceptually. The Helio project aims to create a platform where a data owner can specify which portion (e.g. columns) of their data can be queries, and what degree of aggregation can be output. The CoCoA and Stumptown projects are optimizations which tunes the ratio of local computation and global communication. KeystoneML is a general platform for data scientists to easily write ML applications.

Dinner Discussion

It was a good idea that dinner tables were divided as “interest groups”. Myself and David Alves sat in a “System” table. Two topics that I found most interesting:

What are good use cases of fresh data? In both stock trading and auto-driving, fresh data is very important. What properties in an application cause so? Can we discover more applications where fresh data makes big impact?
When should we push computation to the “edge”? Edge computing is faster and more secure. But how to balance the model accuracy?

If you liked this post, you can share it with your followers or follow me on Twitter!