The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.
In this episode of the O'Reilly Data Show, I sat down with Jai Ranganathan
, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time
, and hardware trends:
Large-scale machine learning
This sounds a bit like this should already exist in really good form right now, but one of the things that I'm really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do do this, I think relative to what you can experience from a singular environment learning scikit-learn or R, the set of things you can do in a distributed fashion is limited. ... It's not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. ... And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that's the final answer, but we do need systems just to provide the kind of depth that you typically are used to in the singular environment. That's just a matter of time and investment because these are non-trivial problems, but they are things that people are working on.
Architecting data applications in the cloud
There are some fundamental design principles behind the original HDFS implementation, which don't actually work in the cloud. For example, this notion that data locality is fundamental to this system design; it starts changing in the cloud when you're looking at these large cloud providers — they are doing all these software-defined networking tricks and they can do bisectional bandwidth, like 40 gigs per second, across their data center ... suddenly, you're talking about moving hundreds of terabytes of data back and forth from a storage to a compute layer without any huge performance penalties. Suddenly, their performance is disadvantageous to this, but it's not as bad as you think. Some of the core design principles in Hadoop have to change when you think about this kind of new data center design. ... The cloud part is really interesting, but really what to me is interesting is there's a fundamental shift in the way data centers are being designed, which we have to make sure that Hadoop stays designed to capitalize on.
A lot of the work we do on the cloud is to optimize working with these object stores effectively. Obviously, you still need some local storage for things like spill, but that's not really the same as a distributed file system. Then, it's really a question of getting all the frameworks to run really effectively against an object store.
Paying attention to hardware trends
When I joined Cloudera, a customer who was going crazy and buying the most expensive hardware was buying 64 gigabytes of RAM. On that 64 gigabytes of RAM, they also had 12 disk spindles with two terabytes each and 24 terabytes of disk. At this point, today, many of my customers buy 246 gigabytes of RAM or even potentially 384 gigabytes to 512 gigabytes of RAM. The amount of disk is still exactly the same. Because disks don't spin faster and you still want a certain level of throughput, you're still looking at 24 terabytes of disk in your machine. Already in just two years, we have seen it go from 64 to 512, potentially. I don't think this trend is going to stop, and we are suddenly going to be looking at, within three years, one-terabyte RAM machines.
What we're finding is that in a lot of the things we do at Cloudera, like Kudu or Impala, fundamentally, we really care about wringing performance out of the CPU. A lot of this will be like, 'can I do vectorize operations?' and 'can I make sure to take advantage of my L2 cache mode effectively?' because that allows my CPU to spend more efficiently. It really changes the bottleneck from the I/O subsystem to the CPU subsystem, and everything you can do to eke out performance there really matters. ... Project Tungsten is basically in the Spark community to do more CPU-efficient things, whether that's vectorizing stuff, whether that's actually effectively moving away from managed memory to managing by buffers, so you can actually have much more efficient handling of memory, so you can get better CPU efficiency as well. Subscribe to the O'Reilly Data Show Podcast: Stitcher