Facebook generates high volumes of data at a rapid pace.
Dhruba Borthakur joined Facebook in 2008 to work on data infrastructure. His early projects at Facebook were around Hadoop, the distributed file system and MapReduce computation platform that laid the foundation for the “big data” movement.
At the time, Facebook was generating as much data as any other startup, and the company needed to stay at the leading edge of scalability techniques for its Hadoop Distributed File System (HDFS) cluster.
Traditionally, Hadoop managed its file system by synchronizing the coordination of the different data nodes with the help of a single master node. At Facebook, the scale of the data was such that the HDFS cluster had thousands of data nodes, which was too much volume for a single master node to handle. Dhruba helped implement redundancy at the master node to create a more resilient system.
The early days of the big data movement was focused on batch processing. A company like Facebook would gather large amounts of data into databases and HDFS, and run offline analytics workloads to gather reports on an hourly, daily, or weekly basis.
Over time, data infrastructure has moved closer to a “real time” processing model. Data infrastructure does not only support batch offline reporting–it also supports machine learning jobs that need to be run on a more frequent basis. These jobs have lower latency requirements, and have driven the adoption of in-memory stream processing systems like Spark and Flink.
Dhruba joins the show to discuss his time at Facebook building data infrastructure. He takes us through the major projects he worked on, including the early Hadoop infrastructure, the refactoring of online user workloads to be more “pull” based than “push” based, and the creation of RocksDB, a storage engine he helped create at Facebook.
Today, Dhruba is the CTO and co-founder of Rockset, a company that builds data infrastructure and database APIs on top of RocksDB. Rockset is building infrastructure for modern technology companies–many of which are facing problems that bear significant resemblance to the ones Facebook encountered as it scaled.
The post Facebook Data Infrastructure with Dhruba Borthakur appeared first on Software Engineering Daily.
Podchaser is the ultimate destination for podcast data, search, and discovery. Learn More