Interview: Matei Zaharia on Spark and machine learning

Sometimes things happen in the lab that you wouldn’t expect. Back in 2009, when Matei Zaharia was a grad student at UC Berkeley’s AMPLab, he started a project called Spark to serve as a pilot workload for Mesos, an open-source project to manage clusters. Since then, Mesos has faded, while Spark has become the widely adopted successor to the Hadoop distributed processing framework: faster, smarter, and, unlike its predecessor, a robust platform for streaming analytics and machine learning.

Today Zaharia is CTO of Databricks, a cloud-based provider of Spark and machine learning as a service, though he still keeps one foot in academia as an assistant professor of computer science at Stanford. One testament to his ingenuity: According to Databricks CEO Ali Ghodsi, Zaharia once informed him that he had an interest in biology and was taking a class. Not long after, a project emerged that he created in collaboration with AMPLab colleagues: the Scalable Nucleotide Alignment Program (SNAP), a sequence aligner that is three to 20 times faster than competing sequencing solutions.

In this interview with IDG’s Eric Knorr, Zaharia expounds on the reasons Spark has become the big data framework of choice and, among other topics, why he thinks his company’s melding of Spark and machine learning delivers unique value. Zaharia, who holds a PhD in computer science from UC Berkeley and an ACM Doctoral Dissertation Award for his research on large-scale computer systems, serves as vice president of the Apache Spark project and has worked on key Spark components, including MLlib, Spark Streaming, and Spark SQL. Few have contributed so much in so short a time to the advancement of big data analytics and machine learning.

