
O'Reilly Data Show - O'Reilly Media Podcast

A Society, Culture and Business podcast

Best Episodes of O'Reilly Data Show

Building intelligent applications with deep learning and TensorFlow
The O’Reilly Data Show Podcast: Rajat Monga on the current state of TensorFlow and training large-scale deep neural networks.

In this episode of the O’Reilly Data Show, I spoke with Rajat Monga, who serves as a director of engineering at Google and manages the TensorFlow engineering team. We talked about how he ended up working on deep learning, the current state of TensorFlow, and the applications of deep learning to products at Google and other companies.

Here are some highlights from our conversation:

Deep learning at Google

There's not going to be too many areas left that run without machine learning that you can program. The data is too much; there's just too much for humans to handle. … Over the last few years, and this is something we've seen at Google, we've seen hundreds of products move to deep learning, and gain from that. In some cases, these are products that had been applying machine learning with traditional methods for a long time and had experts. For example, search: we had hundreds of signals in there, and then we applied deep learning. That was in the last two years or so.

For somebody who is not familiar with deep learning, my suggestion would be to start from an example that is closest to your problem, and then try to adapt it to your problem. Start simple; don't go to very complex things. There are many things you can do, even with simple models.

TensorFlow makes deep learning more accessible

At Google, I would say there are the machine learning researchers who are pushing machine learning research, then there are data scientists who are focusing on applying machine learning to their problems ... We have a mix of people—some are people applying TensorFlow to their actual problems. They don't always have a machine learning background. Some of them do, but a large number of them don't. They're usually developers who are good at writing software.
They know maybe a little bit of math so they can pick it up, in some cases not that much at all, but they can take these libraries if there are examples. They start from those examples, maybe ask a few questions on our internal boards, and then go from there. In some cases they may have a new problem and want some input on how to formulate that problem using deep learning, and we might guide them or point them to an example of how you might approach their problem. Largely, they've been able to take TensorFlow and do things on their own. Internally, we are definitely seeing these tools and techniques being used by people who have never done machine learning before.

Synchronous and asynchronous methods for training deep neural networks

When we started out back in 2011, everybody was using stochastic gradient descent. It's extremely efficient in what it does, but when you want to scale beyond 10 or 20 machines, it becomes hard to scale, so what do we do? At that time there were a couple of papers. One was on the HOGWILD! approach that people had done on a single machine … That was very interesting. We thought, can we make this work across the network, across many, many machines? We did some experiments and started tuning it, and it worked well. We were actually able to scale it to a large number of workers, hundreds of workers in some cases across thousands of machines, and that worked pretty well. Over time, we'd always had another question: is the asynchronous nature actually helping or making things worse? Finally, last year, we started to experiment and try to understand what's happening, and as part of that, we realized if we could do synchronous well, it actually is better. … With the asynchronous stuff, we had these workers and they would work completely independently of each other.
They would just update things on the parameter server: when they had gradients, they would send them back to the parameter server, it would update, and then they would fetch the next set of parameters. … From a systems perspective, it's nice, because it scales very, very well. It's okay if a few workers die; that's fine, all the others will continue to make progress. Now, with the synchronous approach, what we want to do is send parameters out to all the workers, have them compute gradients, send those back, combine them together, and then apply them. Across many machines, you can do this, but the issue is: if some of them start to slow down or fail, what happens then? That's always a tricky thing with the synchronous approach, and that's hard to scale. That's probably the biggest reason people hadn't pushed toward this earlier.

Related resources:

Hello, TensorFlow: Building and training your first TensorFlow graph from the ground up
TensorFlow for poets: How to build your own image classifier with no coding

In my conversation with Rajat Monga, I alluded to these recent papers on asynchronous and synchronous methods for training deep neural networks: (1) Revisiting Distributed Synchronous SGD, (2) Asynchrony begets Momentum, with an Application to Deep Learning, (3) Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs
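The synchronous-versus-asynchronous contrast Monga describes can be sketched in a few lines of plain Python. This is a toy single-process simulation, not TensorFlow code: the quadratic model, data shards, and learning rate are all invented for illustration, and true asynchrony would additionally involve stale parameter reads across the network.

```python
import random

def gradient(w, data):
    # Gradient of mean squared error for a 1-D linear model: loss = mean((w*x - y)^2)
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def synchronous_sgd(shards, w=0.0, lr=0.01, steps=50):
    # All workers compute gradients on the SAME parameters; the parameter
    # server combines (averages) them and applies one update per step.
    for _ in range(steps):
        grads = [gradient(w, shard) for shard in shards]
        w -= lr * sum(grads) / len(grads)
    return w

def asynchronous_sgd(shards, w=0.0, lr=0.01, steps=50, seed=0):
    # Workers update the parameter server independently, in whatever order
    # they finish; no step waits for the others.
    rng = random.Random(seed)
    for _ in range(steps):
        for shard in rng.sample(shards, len(shards)):
            w -= lr * gradient(w, shard)  # no coordination between workers
    return w

# Toy data: y = 3x, split across 4 "workers"
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
print(round(synchronous_sgd(shards), 3))   # prints 3.0
print(round(asynchronous_sgd(shards), 3))  # prints 3.0
```

Both variants recover the true slope here; the trade-off Monga raises (fault tolerance versus the cost of waiting for stragglers) only appears once real machines slow down or fail mid-step.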
Using AI to build a comprehensive database of knowledge
The O’Reilly Data Show Podcast: Mike Tung on large-scale structured data extraction, intelligent systems, and the importance of knowledge databases.

Extracting structured information from semi-structured or unstructured data sources (“dark data”) is an important problem. One can take it a step further by attempting to automatically build a knowledge graph from the same data sources. Knowledge databases and graphs are built using (semi-supervised) machine learning, and then subsequently used to power intelligent systems that form the basis of AI applications. The more advanced messaging and chat bots you’ve encountered rely on these knowledge stores to interact with users.

In this episode of the Data Show, I spoke with Mike Tung, founder and CEO of Diffbot, a company dedicated to building large-scale knowledge databases. Diffbot is at the heart of many web applications, and it’s starting to power a wide array of intelligent applications. We talked about the challenges of building a web-scale platform for doing highly accurate, semi-supervised, structured data extraction. We also took a tour through the AI landscape and the early days of self-driving cars.

Here are some highlights from our conversation:

Building the largest structured database of knowledge

If you think about the Web as a virtual world, there are more pixels on the surface area of the Web than there are square millimeters on the surface of the earth. As a surface for computer vision and parsing, it's amazing, and you don't have to actually build a physical robot in order to traverse the Web. It is pretty tricky, though. … For example, Google has a knowledge graph team—as I'm sure your listeners are aware, it came from a startup that was building something called Freebase, which is crowdsourced, kind of like a Wikipedia for data. They've continued to build upon that at Google, adding more and more human curators.
… It's a mix of software, but there are definitely thousands and thousands of people that actually contribute to their knowledge graph. Whereas in contrast, we are a team of 15 of the top AI people in the world. We don't have anyone curating the knowledge. All of the knowledge is completely synthesized by our AI system. When our customers use our service, they're directly using the output of the AI. There's no human involved in the loop of our business model. … Our high-level goal is to build the largest structured database of knowledge: the most comprehensive map of all of the entities and the facts about those entities. The way we're doing it is by combining multiple data sources. One of them is the Web, so we have this crawler that's crawling the entire surface area of the Web.

Knowledge component of an AI system

If you look at other groups doing AI research, a lot of them are focused on very much the same as the academic style of research, which is coming up with new algorithms and publishing to sort of the same conferences. If you look at some of these industrial AI labs, they're doing the same kind of work that they would be doing in academia—whereas what we're doing, in terms of building this large data set, would not have been created otherwise without starting this effort. … I think you need really good algorithms, and you also need really good data. … One of the key things we believe is that it might be possible to build a human-level reasoning system, if you just had enough structured information to do it on. … Basically, the semantic web vision never really got fully realized because of the chicken-and-egg problem. You need enough people to annotate data, and annotate it for the purpose of the semantic web—to build a comprehensiveness of knowledge—and not for the actual purpose, which is perhaps showing web pages to end users. Then, with this comprehensiveness of knowledge, people can build a lot of apps on top of it.
Then the idea would be this virtuous cycle where you have a bunch of killer apps for this data, and then that would prompt more people to tag more things. That virtuous cycle never really got going, in my view, and there have been a lot of efforts to do that over the years with RDF/RSS and things like that. … What we're trying to do is basically take the annotation aspect out of the hands of humans. The idea here is that these AI algorithms are good enough that we can actually have AI build the semantic web.

Leveraging open source projects: WebKit and Gigablast

… Roughly, what happens when our robot first encounters a page is we render the page in our own customized rendering engine, which is a fork of WebKit that's basically had its face ripped off. It doesn't have all the human niceties of a web browser, and it runs much faster than a browser because it doesn't need those human-facing components. … The other difference is we've instrumented the whole rendering process. We have access to all of the pixels on the page at each XY position. … [We identify many] features that feed into our semi-supervised learning system. Then, millions of lines of code later, out comes knowledge. … Our VP of search, Matt Wells, is the founder of the Gigablast search engine. Years ago, Gigablast competed against Google, Inktomi, AltaVista, and others. Gigablast actually had a larger real-time search index than Google at that time. Matt is a world expert in search and has been developing his C++ crawler, Gigablast, for, I would say, almost a decade. … Gigablast scales much, much better than Lucene; I know because I'm a former user of Lucene myself. It's a very elegant system: fully symmetric and masterless, with its own UDP-based communications protocol. It includes a full web crawler and indexer, and it has real-time search capability.

Editor’s note: Mike Tung is on the advisory committee for the upcoming O’Reilly Artificial Intelligence conference.
Related resources:

Hadoop co-founder Mike Cafarella on the Data Show: From search to distributed computing to large-scale information extraction
Up and running with deep learning: Tools, techniques, and workflows to train deep neural networks
Building practical AI systems
Using computer vision to understand big visual data
Data science for humans and data science for machines
The O’Reilly Data Show Podcast: Michael Li on the state of data engineering and data science training programs.

In this episode of the O’Reilly Data Show, I spoke with Michael Li, cofounder and CEO of the Data Incubator. We discussed the current state of data science and data engineering training programs, Apache Spark, quantitative finance, and the misunderstanding around the term “data science.”

Here are some highlights from our conversation:

Wall Street quants and data science

When I think about finance, I often think of it like data science 1.0 or maybe even data science 2.0, and what we call data science now is really more like data science 2.0 or 3.0. It's the next wave of data science, which means that when people were practicing data science on Wall Street, they had much more primitive tools in the ‘80s and the early ‘90s than what we're using now, so they were kind of scraping by. But because they've been practicing data science for so much longer, there's just so much more of a built-up understanding of how this works. ... A lot of what I was doing at Foursquare was taking basic things that I learned on Wall Street and applying them toward monetization, and it did pretty well. I think there's a lot that data science can learn from finance and vice versa.

Data science for humans and data science for machines

There is a distinction between data science for humans versus data science for machines. I think that a lot of people just think, ‘Oh, they're data scientists. They just look at data,’ but it really depends. The kind of person you're looking to hire really depends on whether the output of his or her analysis is meant to be given to human decision-makers or whether that output is meant to be handed to a machine that will then process everything. I did a little bit of both at Foursquare, but the two approaches required very different skill sets. For one of them, I have a metric, and I need to improve that metric.
Let me just turn this dial and make it as complex as possible. For the other one, you have to realize that a human has to understand this, so you have to make this model simple enough that humans can look at it and really wrap their minds around it. I think this distinction is very important.

Apache Spark training

We talk to a lot of hiring companies. We always want to understand what's interesting to them. Just to give you a few examples, when we started the Data Incubator, I think Spark still wasn't a very big thing, but now we're seeing this kind of huge demand for Spark, and that's one of the things that our corporate training partners are really asking for. It's one of our most popular modules. … Last year is about when we started building out the Spark courses, but we've really seen that take off in the past year. ... It's been great to see Spark evolve to the point where we're collaborating with Databricks to do trainings and see this huge demand in industry.

Related resources:

5 secrets for writing the perfect data scientist resume
3 ideas to add to your data science toolkit
Accelerating Spark workloads for GPUs
Structured streaming comes to Apache Spark 2.0
The key to building deep learning solutions for large enterprises
The O’Reilly Data Show Podcast: Adam Gibson on the importance of ROI, integration, and the JVM.

As data scientists add deep learning to their arsenals, they need tools that integrate with existing platforms and frameworks. This is particularly important for those who work in large enterprises. In this episode of the Data Show, I spoke with Adam Gibson, co-founder and CTO of Skymind, and co-creator of Deeplearning4j (DL4J). Gibson has spent the last few years developing the DL4J library and community, while simultaneously building deep learning solutions and products for large enterprises.

Here are some highlights:

DL4J in 2016

I would say our biggest thing was our C++ rewrite. Over the course of 2014 and 2015, we had tried to use existing matrix packages in Java, but ended up writing our own (called ND4J). At first, we had a Java implementation of the internals. We were doing a little bit of CUDA C back then. Then eventually, when we ported everything to one C++ code base, it sped up the code base by a factor of 10. From there, we've added things like a new user interface, among other things. I would also like to mention our “Keras for production” tool. This allows Python users to talk to their data engineering team and say: 'I know you code in Java. I did this Python thing, but you can use this import tool and take my model to production. You don't have to worry about trying to get Python code in there.' Our Apache Spark integration has been fairly stable for a while now. ... We have been using it in production, and it's been fine.

Adoption of machine learning and deep learning in large companies

Everything in the enterprise space is ROI driven. They don't know that the newest deep learning paper just came out from Google. They're not going to clone some random GitHub repository, try it out, and just put it in production. They don't do that. They want to understand ROI. They work a job, they have a goal, and they have a budget.
They need to figure out what to do with that budget as it relates to their job at their company. Their company is usually a for-profit corporation trying to make money, or trying to increase margins for shareholders. ... Frankly, they don't care if it's linear regression or random forest, either. ... Machine learning has barely penetrated the Fortune 2000. Despite all these tools existing, most of them don't have it in production because they don't see a point in adopting it. I think Intel said this right: as far as enterprise adoption is concerned, it's still fairly early for machine learning. I think what we're starting to see in deep learning is enough of a bump in accuracy for some problems, especially anything related to behavior, that it's worth finally considering. That's kind of the update we're seeing.

Deep learning for time-series analysis and anomaly detection

I wrote the first Word2vec implementation for Java, and in early 2013, a lot of our growth was from Word2vec and text analysis. I'd say about mid-2014 we started doing a lot more anomaly detection, and we've been doing mostly time-series analysis at this point. It turns out, a lot of people have time-series data. They still have a hard time doing feature engineering on that kind of thing. A lot of organizations are interested in applying deep learning to see if they can maybe just come up with a baseline feature vector, and then they don't have to worry about trying to come up with more advanced features. They can just use deep learning to learn patterns. That's been the bulk of our activity since then.
Related resources:

There are many interesting talks on the applications of deep learning at Strata + Hadoop World San Jose in March 2017, including a tutorial entitled “Scalable deep learning for the enterprise with DL4J”
Adam Gibson is the co-author of the upcoming book Deep Learning: A Practitioner's Approach
Why businesses should pay attention to deep learning
Use deep learning on data you already have
How big compute is powering the deep learning rocket ship
Creating large training data sets quickly
The O’Reilly Data Show Podcast: Alex Ratner on why weak supervision is the key to unlocking dark data.

In this episode of the Data Show, I spoke with Alex Ratner, a graduate student at Stanford and a member of Christopher Ré’s Hazy research group. Training data has always been important in building machine learning algorithms, and the rise of data-hungry deep learning models has heightened the need for labeled data sets. In fact, the challenge of creating training data is ongoing for many companies; specific applications change over time, and what were gold-standard data sets may no longer apply to changing situations.

Ré and his collaborators proposed a framework for quickly building large training data sets. In essence, they observed that high-quality models can be constructed from noisy training data. Some of these ideas were discussed in a previous episode featuring Mike Cafarella (jump to minute 24:16 for a description of an earlier project called DeepDive). By developing a framework for mining low-quality sources in order to build high-quality machine learning models, Ré and his collaborators help researchers extract information previously hidden in unstructured data sources (so-called “dark data” buried in text, images, charts, and so on).

Here are some highlights from my conversation with Ratner:

Weak supervision and transfer learning

Weak supervision is a term that people have used before, especially around Stanford, to talk about methods where we have lower-quality training data, or noisier training data. ... At a high level, machine learning models are meant to be robust to some noise in the distribution they're trained on. ... One of the really important trends we've seen is that more people than ever are using deep learning models. Deep learning models can automate the feature engineering process, but they are more complex and they need more training data to fit their parameters.
If you look at the very remarkable empirical successes that deep learning has had over the last few years, they have been mostly (or almost entirely) predicated on large labeled training sets that took years to create. ... Our motivation with weak supervision is really: how do we weaken this bottleneck? ... For weak supervision, our ultimate goal is to make it easier for the human to provide supervision to the model. That's where the human comes into the loop. This might be an iterative process. ... In the standard transfer learning paradigm, you'd take one nicely collected training set, and you'd train your model on that in the standard way. Then you just try to apply your model to a new data distribution.

Data programming

Data programming is a general, flexible framework for using weak supervision to train some end model that you want to train without necessarily having any hand-labeled training data. The basic way it works is, we actually have two modeling stages in this pipeline. The first is that we get input from the domain expert or user in the form of what we call labeling functions. Think of them as Python functions. ... The user writes a bunch of labeling functions, which are just black-box functions that take in a data point, take in one of these objects, and output a label, or they could abstain. These labeling functions can encode all the types of weak supervision, like distant supervision, or crowd labels, or various heuristics. There's a lot of flexibility because we don't make any assumptions about what is inside them. In our first modeling stage, we use a generative model to learn which of the labeling functions are more or less accurate by observing where they overlap, where they agree and disagree. Intuitively, if we have 20 labeling functions from a user and we see that one labeling function is always agreeing with its co-labelers on various data points, we think we should trust it.
When a labeling function is always disagreeing in a minority, then we downweight it. Basically, we learn a model that tells us how to weight the different labeling functions the user has provided. The output of this model is a set of probabilistic training labels, and we feed these into the end model we're trying to train. To give you some intuition on the probabilistic labels: all we're basically saying is that we want the end model to learn more from data points that got a lot of high-confidence votes, rather than the ones that were sort of in contention, from the labeling functions that the user provided. ... One goal is to generate data, but often our ultimate goal is to train some end discriminative model, say to do image classification. ... Snorkel is a system for using this data programming technique to quickly generate training data. A lot of the tooling and the use cases that are publicly part of Snorkel right now are around text extraction.

Data programming in Snorkel. Slide from Alex Ratner, used with permission.

Related resources:

From search to distributed computing to large-scale information extraction: a conversation with Mike Cafarella (jump to minute 24:16 for a description of an earlier project called DeepDive)
Data preparation in the age of deep learning
Adam Marcus: Building human-assisted AI applications
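The labeling-function idea Ratner describes can be sketched in plain Python. This is a simplified illustration, not the Snorkel API: the example functions, data, and weights are invented, and the votes are combined by a simple accuracy-weighted average, whereas Snorkel learns the weights with a generative model over agreements and disagreements.

```python
# Toy sketch of data programming: labeling functions vote on unlabeled
# examples, and their (weighted) votes become probabilistic labels.
SPAM, HAM, ABSTAIN = 1, 0, None

def lf_has_link(text):
    return SPAM if "http://" in text else ABSTAIN

def lf_shouting(text):
    return SPAM if text.isupper() else ABSTAIN

def lf_greeting(text):
    return HAM if text.lower().startswith("hi") else ABSTAIN

def probabilistic_label(text, lfs, weights):
    # Weighted vote over non-abstaining labeling functions; returns an
    # estimate of P(SPAM), or None if every function abstained.
    votes = [(lf(text), w) for lf, w in zip(lfs, weights)]
    votes = [(v, w) for v, w in votes if v is not ABSTAIN]
    if not votes:
        return None
    total = sum(w for _, w in votes)
    return sum(w for v, w in votes if v == SPAM) / total

lfs = [lf_has_link, lf_shouting, lf_greeting]
weights = [0.9, 0.6, 0.7]  # in Snorkel, learned from where the LFs agree/disagree
print(probabilistic_label("CLICK NOW http://x.example", lfs, weights))  # prints 1.0
print(probabilistic_label("hi, lunch tomorrow?", lfs, weights))         # prints 0.0
```

The resulting probabilistic labels would then be fed to the end discriminative model, which learns more from confidently labeled points than from contested ones.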
A scalable time-series database that supports SQL
The O’Reilly Data Show Podcast: Michael Freedman on TimescaleDB and scaling SQL for time-series data.

In this episode of the Data Show, I spoke with Michael Freedman, CTO of Timescale and professor of computer science at Princeton University. When I first heard that Freedman and his collaborators were building a time-series database, my immediate reaction was: “Don’t we have enough options already?” The early incarnation of Timescale was a startup focused on IoT, and it was while building tools for the IoT problem space that Freedman and the rest of the Timescale team came to realize that the database they needed wasn’t available (at least not in open source). Specifically, they wanted a database that could easily support complex queries and the sort of real-time applications many have come to associate with streaming platforms. Based on early reactions to TimescaleDB, many users concur.

Here are some highlights from our conversation:

The need for a time-series database

We initially were developing a platform to collect, store, and analyze IoT data, and certainly a lot of IoT data is time-series in nature. We found ourselves struggling. The reason a lot of people adopted NoSQL was they thought it offered scale in ways that more traditional relational databases did not—yet they often gave up the rich query language, optimized complex queries, joins, and ecosystem that you get in more traditional relational databases. Customers who were using our platform kept wanting all these ways to query the data, and we couldn't do it with the existing NoSQL database we were using. It just didn't support those types of queries. We ended up building one, in fact, on top of Postgres. Architecting Postgres in a very particular way for time-series workloads, we came to realize that this is not a problem limited to us.
We think there is an important space still in the market where people either use a vanilla relational database that does have scaling problems, or they go to something like NoSQL because a lot of time-series data came from one particular use case, things like server metrics. People's needs are much broader than just server metrics, so we actually thought there was an important area that's somewhat missing from what people had before. ... The interesting thing about a time-series database is that sometimes the data starts in one part of your organization, and then different parts of the organization quickly find a use for that data. ... In many cases, the people who are asking questions actually know SQL already; some of them may not, but they are using existing tools that support SQL. So, if you have a database that doesn't support SQL, those existing tools often can't work with it directly. You would have to integrate them; you'd have to build special connectors. That was one of the things we wanted when we set out to build Timescale. We wanted to give the appearance that this looks like Postgres. It just looks like a traditional relational database. If you have any of those existing tools and business applications, you can speak directly to it as if it's a traditional database. It just happens to be much more efficient and much more scalable for time-series data.

Column-oriented and row-oriented databases

In the beginning, we weren't setting out to build our own time-series database. ... A lot of the time-series databases on the market now are column-oriented, because that allows you to do very fast aggregations on a single column. TimescaleDB also allows you to define a schema, and different metrics can be in their own columns. There is a difference between what are known as column-oriented databases and traditional SQL databases, which are row-oriented.
This is related to the way they store data on disk—that is, are all of the values in a row stored contiguously on disk? In a column-oriented database, even if a bunch of metrics belong to the same row, they're actually going to be stored almost separately; it's as if every column becomes its own table. For example, columns make it really easy and fairly efficient to scan a single column. If all you want to do is take the average of the CPU, that's efficient. But suppose you want to ask a question with a richer predicate (a predicate is that WHERE clause in SQL), like: "Tell me the average temperature of all devices where the CPU is above a certain threshold, or the free memory is below something." Internally, with column-oriented databases, each of those WHERE clauses is a different column, almost a different table, that the database needs to scan and then do a JOIN on. While column-oriented databases might be very efficient for just rolling up a single column, if you want to ask anything richer, it becomes a lot more expensive. Some of these databases don't have indexes for these WHERE clauses, so any time you ask a question, it actually takes a full table scan. If only 1% of devices have a high CPU and you say, "Tell me all the statistics where the device has a high CPU," in some of these time-series databases that lack indexing on columns, you end up actually scanning all of the data, not just the 1%. If you have something like TimescaleDB, or anything that can build these efficient secondary indexes, you can quickly focus in on the important data, so the only thing we need to touch is that 1% of the data, not all of it.

Related resources:

Understanding anomaly detection
Architecting and building end-to-end streaming applications
Learning Path: Getting started with Kudu
Twitter's real-time data stack
Hybrid transactional/analytic systems and the quest for database nirvana
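Freedman's row-versus-column point can be illustrated with a toy sketch in Python. This models only the storage layouts, not how any real database executes queries; the metric names, values, and thresholds are invented for the example.

```python
# Toy illustration of row- vs. column-oriented layouts for time-series rows
# of the form (time, device, cpu, free_mem, temperature).
rows = [
    (1, "a", 95, 100, 70.0),
    (2, "b", 10, 900, 41.0),
    (3, "c", 97, 80, 75.0),
    (4, "d", 12, 850, 40.0),
]

# Column-oriented layout: each field becomes its own contiguous array.
cols = {name: [r[i] for r in rows]
        for i, name in enumerate(["time", "device", "cpu", "free_mem", "temp"])}

# Single-column rollup: the column store touches exactly one array. Cheap.
avg_cpu = sum(cols["cpu"]) / len(cols["cpu"])

# Rich predicate ("avg temp WHERE cpu > 90 OR free_mem < 90"): the column
# store must scan the cpu AND free_mem arrays, then fetch temp for the
# matching positions — effectively a join across per-column "tables".
match = [i for i in range(len(rows))
         if cols["cpu"][i] > 90 or cols["free_mem"][i] < 90]
avg_temp_col = sum(cols["temp"][i] for i in match) / len(match)

# Row-oriented layout with a secondary index on cpu: jump straight to the
# small fraction of high-CPU rows instead of scanning every value.
cpu_index = sorted(range(len(rows)), key=lambda i: rows[i][2])
hot = [i for i in cpu_index if rows[i][2] > 90]  # index scan, not full scan
avg_temp_row = sum(rows[i][4] for i in hot) / len(hot)
```

Both layouts return the same answers; the difference Freedman describes is in how much data each must touch to get there, which is why secondary indexes on predicate columns matter for rich time-series queries.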
Building a next-generation platform for deep learning
The O’Reilly Data Show Podcast: Naveen Rao on emerging hardware and software infrastructure for AI.

In this episode of the Data Show, I speak with Naveen Rao, VP and GM of the Artificial Intelligence Products Group at Intel. In an earlier episode, we learned that scaling current deep learning models requires innovations in both software and hardware. Through his startup Nervana (since acquired by Intel), Rao has been at the forefront of building a next-generation platform for deep learning and AI. I wanted to get his thoughts on what the future infrastructure for machine learning will look like. At least for now, we’re seeing a variety of approaches, and many companies are using heterogeneous processors (even specialized ones) and proprietary interconnects for deep learning. Nvidia and Intel Nervana are set to release processors that excel at both training and inference, but as Rao pointed out, at large scale there are many considerations—including utilization, power consumption, and convenience—that come into play.

Here is a partial list of the items we discussed:

Deep learning in comparison to other machine learning algorithms
Key features and the current status of Intel Nervana’s Lake Crest technology
Deep learning frameworks and related software tools, including Nervana Graph
Building next-generation hardware and software components for deep learning
An overview of the major AI initiatives within Intel (including the establishment of a new AI Research Lab that Rao is leading)

Related resources:

Deep learning at scale and use cases: Naveen Rao’s keynote at the inaugural O’Reilly Artificial Intelligence Conference
How big compute is powering the deep learning rocket ship
TensorFlow for Deep Learning
Fundamentals of Deep Learning
Bringing AI into the enterprise
The O’Reilly Data Show Podcast: Kris Hammond on business applications of AI technologies and educating future AI specialists.

In this episode of the Data Show, I spoke with Kristian Hammond, chief scientist of Narrative Science and professor of EECS at Northwestern University. He has been at the forefront of helping companies understand the power, limitations, and disruptive potential of AI technologies and tools. In a previous post on machine learning, I listed types of use cases (a taxonomy) for machine learning that could just as well apply to enterprise applications of AI. But how do you identify good use cases to begin with? A good place to start for most companies is by looking for AI technologies that can help automate routine tasks, particularly low-skill tasks that occupy the time of high-skilled workers. An initial list of candidate tasks can be gathered by applying the following series of simple questions:

Is the task data-driven?
Do you have the data to support the automation of the task?
Do you really need the scale that automation can provide?

We discussed other factors companies should consider when thinking through their AI strategies, education and training programs for AI specialists, and the importance of ethics and fairness in AI and data science.

Here are some highlights from our conversation:

It begins with finding use cases

I've been interacting more and more with companies that are thinking about AI solutions; they often won't have gotten to the place where they can talk about what they want to do. It's an odd thing, because there's so much data out there and there's so much hunger to derive something from that data. The starting point is often bringing an organization back down to: "So what do you want and need to do? What kind of decision-making do you want to support? What kinds of predictions would you like to be able to make?"
Identifying which tasks can be automated Sometimes, you see a decision being made and, from an organizational point of view, everyone agrees that this decision is really strongly data driven. But it's not strongly data driven. It's data driven based upon the historical information that two or three people are using. It looks like they're looking at data and then making a decision, but, in fact, what they're doing is, they're looking at data and they're remembering one of 2,000 past examples in their heads and coming out with a decision. ... There are sets of tasks in almost any organization that nobody likes to have anything to do with. In the legal profession, there are tasks around things like discovery where you actually need to be able to look through a corpus of documents, but you need to have also some idea of the semantic relationships between words. This is totally learnable using existing technologies. ... It's not as though tasks that can be automated don't exist. They do, and, in fact, they not only exist, but they're easily doable with current technologies. It's a matter of understanding where to draw the line. It's sometimes easy for organizations to look at the problem and sort of hallucinate that there is not a different kind of reasoning going on in the heads of the people who are solving the problem. ... You have to be willing to look at that and say, "Oh, I'm not going to replace the smartest person in the company, but, you know, I will free up the time of some of our smartest people by taking these tasks on and having the machine do them.” Related resources: Here and now - Bringing AI into the enterprise: Kris Hammond’s tutorial at the 2017 AI conference in San Francisco. Vertical AI - Solving full stack industry problems using subject-matter expertise, unique data, and AI to deliver a product's core value proposition: Bradford Cross at the 2017 AI conference in San Francisco. 
Demystifying the AI hype: Kathryn Hume at the 2017 AI conference in NYC. "6 practical guidelines for implementing conversational AI": Susan Etlinger on how organizations can create more fluid interactions between humans and machines. "How companies can navigate the age of machine learning": to become a “machine learning company,” you need tools and processes to overcome challenges in data, engineering, and models.
Enabling end-to-end machine learning pipelines in real-world applications
The O’Reilly Data Show Podcast: Nick Pentreath on overcoming challenges in productionizing machine learning models. In this episode of the Data Show, I spoke with Nick Pentreath, principal engineer at IBM. Pentreath was an early and avid user of Apache Spark, and he subsequently became a Spark committer and PMC member. Most recently, his focus has been on machine learning, particularly deep learning, and he is part of a group within IBM focused on building open source tools that enable end-to-end machine learning pipelines. We had a great conversation spanning many topics, including: AI Fairness 360 (AIF360), a set of fairness metrics for data sets and machine learning models. Adversarial Robustness Toolbox (ART), a Python library for adversarial attacks and defenses. Model Asset eXchange (MAX), a curated and standardized collection of free and open source deep learning models. Tools for model development, governance, and operations, including MLflow, Seldon Core, and Fabric for Deep Learning. Reinforcement learning in the enterprise, and the emergence of relevant open source tools like Ray. Related resources: "Modern Deep Learning: Tools and Techniques"—a new tutorial at the Artificial Intelligence conference in San Jose Harish Doddi on “Simplifying machine learning lifecycle management” Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models” “Managing risk in machine learning”: considerations for a world where ML models are becoming mission critical “The evolution and expanding utility of Ray” “Local Interpretable Model-Agnostic Explanations (LIME): An Introduction” Forough Poursabzi Sangdeh on why “It’s time for data scientists to collaborate with researchers in other disciplines”
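Several of the toolkits mentioned above, AIF360 in particular, package group fairness metrics that are, at heart, simple per-group rates. As a rough illustration in plain Python (not the AIF360 API itself, with hypothetical data), disparate impact is the ratio of favorable-outcome rates between an unprivileged and a privileged group:

```python
def favorable_rate(outcomes, groups, group):
    """Fraction of favorable (1) outcomes within one group."""
    selected = [o for o, g in zip(outcomes, groups) if g == group]
    return sum(selected) / len(selected)

def disparate_impact(outcomes, groups, privileged, unprivileged):
    """Ratio of favorable-outcome rates between groups. Values well
    below 1.0 suggest the unprivileged group receives the favorable
    outcome less often (0.8 is a common rule-of-thumb threshold)."""
    return (favorable_rate(outcomes, groups, unprivileged) /
            favorable_rate(outcomes, groups, privileged))

# Toy example: group "a" is privileged, group "b" is unprivileged.
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]
groups   = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(disparate_impact(outcomes, groups, "a", "b"))  # 0.25 / 0.75 ≈ 0.333
```

Libraries like AIF360 compute this and many related metrics over dataset abstractions; the arithmetic above is just the underlying idea.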
Why companies are in need of data lineage solutions
The O’Reilly Data Show Podcast: Neelesh Salian on data lineage, data governance, and evolving data platforms. In this episode of the Data Show, I spoke with Neelesh Salian, software engineer at Stitch Fix, a company that combines machine learning and human expertise to personalize shopping. As companies integrate machine learning into their products and systems, there are important foundational technologies that come into play. This shouldn’t come as a shock, as current machine learning and AI technologies require large amounts of data—specifically, labeled data for training models. There are also many other considerations—including security, privacy, reliability/safety—that are encouraging companies to invest in a suite of data technologies. In conversations with data engineers, data scientists, and AI researchers, the need for solutions that can help track data lineage and provenance keeps popping up. There are several San Francisco Bay Area companies that have embarked on building data lineage systems—including Salian and his colleagues at Stitch Fix. I wanted to find out how they arrived at the decision to build such a system and what capabilities they are building into it. Here are some highlights from our conversation: Data lineage Data lineage is not something new. It's something that is borne out of the necessity of understanding how data is being written and interacted with in the data warehouse. I like to tell this story when I'm describing data lineage: think of it as a journey for data. The data takes a journey entering into your warehouse. This can be transactional data, dashboards, or recommendations. What is lost in that collection of data is the information about how it came about. If you knew what journey and exactly what constituted that data to come into being into your data warehouse or any other storage appliance you use, that would be really useful. ... 
Think about data lineage as helping issues about quality of data, understanding if something is corrupted. On the security side, think of GDPR ... which was one of the hot topics I heard about at the Strata Data Conference in London in 2018. Why companies are suddenly building data lineage solutions A data lineage system becomes necessary as time progresses. It becomes easier for maintainability. You need it for audit trails, for security and compliance. But you also need to think of the benefit of managing the data sets you're working with. If you're working with 10 databases, you need to know what's going on in them. If I have to give you a vision of a data lineage system, think of it as a final graph or view of some data set, and it shows you a graph of what it's linked to. Then it gives you some metadata information so you can drill down. Let's say you have corrupted data, let's say you want to debug something. All these cases tie into the actual use cases for which we want to build it. Related resources: “Deep automation in machine learning” Vitaly Gordon on “Building tools for enterprise data science” “Managing risk in machine learning” Haoyuan Li explains why “In the age of AI, fundamental value resides in data” “What machine learning means for software development” Joe Hellerstein on how "Metadata services can lead to performance and organizational improvements"
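The "final graph or view of some data set" Salian describes can be sketched as a small directed graph. The names and API below are hypothetical illustrations, not Stitch Fix's actual system: each data set records the job that produced it plus its inputs, and debugging a corrupted table becomes a walk upstream through its ancestors:

```python
from collections import defaultdict

class LineageGraph:
    """Toy lineage store: each data set records the job that produced
    it, the input data sets it was derived from, and extra metadata."""
    def __init__(self):
        self.parents = defaultdict(set)  # data set -> direct inputs
        self.meta = {}                   # data set -> producing job + metadata

    def record(self, output, inputs, job, **metadata):
        self.meta[output] = {"job": job, **metadata}
        self.parents[output].update(inputs)

    def upstream(self, dataset):
        """All transitive ancestors: the data set's 'journey'."""
        seen, stack = set(), [dataset]
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

g = LineageGraph()
g.record("clean_orders", ["raw_orders"], job="nightly_clean")
g.record("features", ["clean_orders", "users"], job="feature_build")
print(g.upstream("features"))  # {'clean_orders', 'raw_orders', 'users'}
```

A production system would also track schema versions, timestamps, and job runs, but the core drill-down use case (which inputs could have corrupted this table?) reduces to this graph traversal.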
How to train and deploy deep learning at scale
The O’Reilly Data Show Podcast: Ameet Talwalkar on large-scale machine learning. In this episode of the Data Show, I spoke with Ameet Talwalkar, assistant professor of machine learning at CMU and co-founder of Determined AI. He was an early and key contributor to Spark MLlib and a member of AMPLab. Most recently, he helped conceive and organize the first edition of SysML, a new academic conference at the intersection of systems and machine learning (ML). We discussed using and deploying deep learning at scale. This is an empirical era for machine learning, and, as I noted in an earlier article, as successful as deep learning has been, our level of understanding of why it works so well is still lacking. In practice, machine learning engineers need to explore and experiment using different architectures and hyperparameters before they settle on a model that works for their specific use case. Training a single model usually involves big (labeled) data and big models; as such, exploring the space of possible model architectures and parameters can take days, weeks, or even months. Talwalkar has spent the last few years grappling with this problem as an academic researcher and as an entrepreneur. In this episode, he describes some of his related work on hyperparameter tuning, systems, and more. Here are some highlights from our conversation: Deep learning I would say that you hear a lot about the modeling of problems associated with deep learning. How do I frame my problem as a machine learning problem? How do I pick my architecture? How do I debug things when things go wrong? ... What we've seen in practice is that, maybe somewhat surprisingly, the biggest challenges that ML engineers face actually are due to the lack of tools and software for deep learning. These problems are sort of like hybrid systems/ML problems. Very similar to the sorts of research that came out of the AMPLab. ... 
Things like TensorFlow and Keras, and a lot of those other platforms that you mentioned, are great and they're a great step forward. They're really good at abstracting low-level details of a particular learning architecture. In five lines, you can describe how your architecture looks and then you can also specify what algorithms you want to use for training. There are a lot of other systems challenges associated with actually going end to end, from data to a deployed model. The existing software solutions don't really tackle a big set of these challenges. For example, regardless of the software you're using, it takes days to weeks to train a deep learning model. There's real open challenges of how to best use parallel and distributed computing both to train a particular model and in the context of tuning hyperparameters of different models. We also found out the vast majority of organizations that we’ve spoken to in the last year or so who are using deep learning for what I'd call mission-critical problems, are actually doing it with on-premise hardware. Managing this hardware is a huge challenge and something that folks like me, if I'm working at a company with machine learning engineers, have to figure out for themselves. It's kind of a mismatch between their interests and their skills, but it's something they have to take care of. Understanding distributed training To give a little bit more background, the idea behind this work started about four years ago. There was no deep learning in Spark MLlib at the time. We were trying to figure out how to perform distributed training of deep learning in Spark. Before actually getting our hands really dirty and trying to actually implement anything we wanted to just do some back-of-the-envelope calculations to see what speed-ups you could hope to get. ... The two main ingredients here are just computation and communication. ... 
We wanted to understand this landscape of distributed training, and, using Paleo, we've been able to get a good sense of this landscape without actually running experiments. The intuition is simple. The idea is that if we're very careful in our bookkeeping, we can write down the full set of computational operations that are required for a particular neural network architecture when it's performing training. [Full disclosure: I’m an advisor to Determined AI.] Related resources: “Introducing RLlib—A composable and scalable reinforcement learning library”: this new software makes the task of training RL models much more accessible “Scaling machine learning”: Reza Zadeh on deep learning, hardware/software interfaces, and why computer vision is so exciting “How big compute is powering the deep learning rocket ship”: Greg Diamos on building computer systems for deep learning and AI. “We need to build machine learning tools to augment machine learning engineers” “How machine learning will accelerate data management systems”: Tim Kraska on why ML will change how we build core algorithms and data
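The back-of-the-envelope bookkeeping Talwalkar describes, weighing computation against communication, can be sketched in a few lines. All the numbers below (per-sample FLOPs, device throughput, interconnect bandwidth, a ring all-reduce cost model) are illustrative assumptions, not Paleo's actual model:

```python
def step_time(flops_per_sample, batch, workers, device_flops,
              params, bandwidth_bytes):
    """Rough per-step time for synchronous data-parallel SGD: compute
    is split evenly across workers; a ring all-reduce moves roughly
    2*(W-1)/W copies of the gradient (4 bytes/param) per worker."""
    compute = flops_per_sample * batch / (workers * device_flops)
    comm = 2 * (workers - 1) / workers * params * 4 / bandwidth_bytes
    return compute + comm

# Assumed figures: 1 GFLOP/sample, batch 256, 10 TFLOPS devices,
# a 25M-parameter model, 25 GB/s interconnect.
t1 = step_time(1e9, 256, 1, 1e13, 25e6, 2.5e10)  # single device, no comm
t8 = step_time(1e9, 256, 8, 1e13, 25e6, 2.5e10)
print(f"8-worker speedup: {t1 / t8:.2f}x")  # well below the ideal 8x
```

Even this crude model captures the point of the episode: once communication cost is in the ledger, scaling is sublinear, and with a slower interconnect the communication term alone can erase the benefit of adding workers.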
The importance of transparency and user control in machine learning
The O’Reilly Data Show Podcast: Guillaume Chaslot on bias and extremism in content recommendations. In this episode of the Data Show, I spoke with Guillaume Chaslot, an ex-YouTube engineer and founder of AlgoTransparency, an organization dedicated to helping the public understand the profound impact algorithms have on our lives. We live in an age when many of our interactions with companies and services are governed by algorithms. At a time when their impact continues to grow, there are many settings where these algorithms are far from transparent. There is growing awareness about the vast amounts of data companies are collecting on their users and customers, and people are starting to demand control over their data. A similar conversation is starting to happen about algorithms—users want more control over what these models optimize for and an understanding of how they work. I first came across Chaslot through a series of articles about the power and impact of YouTube on politics and society. Many of the articles I read relied on data and analysis supplied by Chaslot. We talked about his work trying to decipher how YouTube’s recommendation system works, filter bubbles, transparency in machine learning, and data privacy. Here are some highlights from our conversation: Why YouTube’s impact is less understood My theory why people completely overlooked YouTube is because on Facebook and Twitter, if one of your friends posts something strange, you'll see it. Even if you have 1,000 friends, if one of them posts something really disturbing, you see it, so you're more aware of the problem. Whereas on YouTube, some people binge watch some very weird things that could be propaganda, but we won’t know about it because we don't see what other people see. So, YouTube is like a TV channel that doesn't show the same thing to everybody and when you ask YouTube, "What did you show to other people?" 
YouTube says, ‘I don't know, I don't remember, I don't want to tell you.’ Downsides of optimizing only for watch time When I was working on the YouTube algorithm and our goal was to optimize watch time, we were trying to make sure that the algorithm kept people online the longest. But what I realized was that we were so focused on this target of watch time that we were forgetting a lot of important things and we were seeing some very strange behavior of the algorithm. Each time we were seeing this strange behavior, we just blamed it on the user. It shows violent videos; it must be because users are violent, so it's not our fault; the algorithm is just a mirror of human society. But if I believe the algorithm is a mirror of human society, I think it's also not a flat mirror; it's a mirror that emphasizes some aspects of life and makes some other aspects overlooked. ... The algorithms that are behind YouTube and the Facebook news feeds are very complex deep learning systems that will take a lot into account, including user sessions, what they've watched. They will try to find the right content to show to users to get them to stay online the longest and interact as much as possible with the content. So, this can seem neutral at first, but it might not be neutral. For instance, if you have content that says ‘The media is lying,’ whether it's on Facebook or on YouTube, what will happen is that this content will naturally, if it manages to convince the user that the media is lying, the content will be very efficient at keeping the user online because the user won't go to other media and will spend more time on YouTube and more time on Facebook. ... In my personal opinion, the current goal of maximizing watch time means that any content that is really good at captivating your attention for a long time will perform really well. This means extreme content will actually perform really well. 
But say you had another goal—for instance, the goal to maximize likes and dislikes, or another system of rating like when you would be asked some question like, ‘Did you enjoy this video? Was it helping you in your life?’ Then this kind of extreme content will not perform as well. So, there are many other options, and it's not that YouTube is failing at exploring these other options; it's that they don't even try. Related resources: "We need to build machine learning tools to augment machine learning engineers" "Defining responsible data practices": Natalie Evans Harris discusses the community principles on ethical data practices (CPEDP), a code of ethics for data collection, sharing, and utilization. "Inclusivity for the greater good": Ajey Gore explains why GO-JEK is focusing its attention beyond urban Indonesia to help people across the country’s rural areas. "Ethics in data project design: It’s about planning" "On computational ethics" Ethics of Big Data: Doug Patterson and Kord Davis on balancing risk and innovation.
Tools for generating deep neural networks with efficient network architectures
The O’Reilly Data Show Podcast: Alex Wong on building human-in-the-loop automation solutions for enterprise machine learning. In this episode of the Data Show, I spoke with Alex Wong, associate professor at the University of Waterloo, and co-founder of DarwinAI, a startup that uses AI to address foundational challenges with deep learning in the enterprise. As the use of machine learning and analytics becomes more widespread, we’re beginning to see tools that enable data scientists and data engineers to scale and tackle many more problems and maintain more systems. This includes automation tools for the many stages involved in data science, including data preparation, feature engineering, model selection, and hyperparameter tuning, as well as tools for data engineering and data operations. Wong and his collaborators are building solutions for enterprises, including tools for generating efficient neural networks and for the performance analysis of networks deployed to edge devices. Here are some highlights from our conversation: Using AI to democratize deep learning Having worked in machine learning and deep learning for more than a decade, both in academia as well as industry, it really became very evident to me that there's a significant barrier to widespread adoption. One of the main things is that it is very difficult to design, build, and explain deep neural networks. I especially wanted to meet operational requirements. The process just involves way too much guesswork, trial and error, so it's hard to build systems that work in real-world industrial systems. One of the out-of-the-box moments we had—pretty much the only way we could actually do this—was to reinvent the way we think about building deep neural networks. Which is, can we actually leverage AI itself as a collaborative technology? Can we build something that works with people to design and build much better networks? 
And that led to the start of DarwinAI—our main vision is pretty much enabling deep learning for anyone, anywhere, anytime. Generative synthesis The general concept of generative synthesis is to find the best generative model that meets your particular operational requirements (which could be size, speed, accuracy, and so forth). So, the intuition behind that is that we treat it as a large constrained optimization problem where we try to identify the generative machine that will actually give you the highest performance. We have a unique way of having an interplay between a generator and an inquisitor where the generator will generate networks that the inquisitor probes and understands. Then it learns intuition about what makes a good network and what doesn't. Related resources: Vitaly Gordon on “Building tools for enterprise data science” “What machine learning means for software development” “We need to build machine learning tools to augment machine learning engineers” Tim Kraska on “How machine learning will accelerate data management systems” “Building tools for the AI applications of tomorrow”
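DarwinAI's generator/inquisitor machinery is proprietary, but the framing, searching for the highest-capacity network that still meets an operational requirement, can be illustrated with a toy: enumerate candidate layer widths, keep the architectures that fit a parameter budget, and use parameter count as a crude stand-in for capacity. Everything here is an illustrative assumption, not DarwinAI's method:

```python
from itertools import product

def param_count(widths, in_dim, out_dim):
    """Parameters in a dense network: (fan_in + 1 bias) * fan_out per layer."""
    dims = [in_dim, *widths, out_dim]
    return sum((a + 1) * b for a, b in zip(dims, dims[1:]))

def best_under_budget(candidates, in_dim, out_dim, budget):
    """Toy 'constrained search': among architectures that satisfy the
    size requirement, pick the one with the most capacity (here crudely
    proxied by parameter count)."""
    feasible = [w for w in candidates
                if param_count(w, in_dim, out_dim) <= budget]
    return max(feasible, key=lambda w: param_count(w, in_dim, out_dim))

candidates = list(product([8, 16, 32, 64], repeat=2))  # two hidden layers
print(best_under_budget(candidates, in_dim=10, out_dim=2, budget=2000))
```

A real system would score candidates by measured accuracy and latency rather than raw parameter count, and would learn which regions of the search space to generate next instead of enumerating them.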
Machine learning on encrypted data
The O’Reilly Data Show Podcast: Alon Kaufman on the interplay between machine learning, encryption, and security. In this episode of the Data Show, I spoke with Alon Kaufman, CEO and co-founder of Duality Technologies, a startup building tools that will allow companies to apply analytics and machine learning to encrypted data. In a recent talk, I described the importance of data, various methods for estimating the value of data, and emerging tools for incentivizing data sharing across organizations. As I noted, the main motivation for improving data liquidity is the growing importance of machine learning. We’re all familiar with the importance of data security and privacy, but probably not as many people are aware of the emerging set of tools at the intersection of machine learning and security. Kaufman and his stellar roster of co-founders are doing some of the most interesting work in this area. Here are some highlights from our conversation: Running machine learning models on encrypted data Four or five years ago, techniques for running machine learning models on data while it's encrypted were being discussed in the academic world. We did a few trials of this and although the results were fascinating, it still wasn't practical. ... There have been big breakthroughs that have led to it becoming feasible. A few years ago, it was more theoretical. Now it's becoming feasible. This is the right time to build a company. Not only because of the technology feasibility but definitely because of the need in the market. From inference to training A classical example would be model inference. I have data; you have some predictive model. I want to consume your model. I'm not willing to share my data with you, so I'll encrypt my data; you'll apply your model to the encrypted data, so you'll never see the data. I will never see your model. The result that comes out of this computation, which is encrypted as well, will be decrypted only by me, as I have the key. 
This means I can basically utilize your predictive insight, you can sell your model, and no one ever exchanged data or models between the parties. ... The next frontier of research is doing model training with these types of technologies. We have some great results, and there are others who are starting to do and implement some things in hardware. ... Some of our recent work around applying deep learning to encrypted data combines different methods. Homomorphic encryption has its pros and cons; secure multi-party computation has other advantages and disadvantages. We basically mash various methods together to derive very, very interesting results. ... For example, we have applied algorithms to genomic data at scale and we obtained impressive performance. Related resources: Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models” Chang Liu on “How privacy-preserving techniques can lead to more robust machine learning models” “How to build analytic products in an age when data privacy has become critical” “Data collection and data markets in the age of privacy and machine learning” “What machine learning means for software development” “Lessons learned turning machine learning models into real products and services”
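The inference scenario Kaufman walks through maps directly onto additively homomorphic encryption. Below is a toy sketch using the Paillier scheme with deliberately insecure demo primes; real deployments use 2048-bit moduli and handle negative and fixed-point weights, while this toy sticks to small nonnegative integers:

```python
import math
import random

# --- Toy Paillier keypair (insecure demo primes) ---
p, q = 101, 103
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)  # modular inverse of lambda mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    # L(x) = (x - 1) // n, then multiply by mu mod n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# --- Client side: encrypt features, keep the private key ---
features = [3, 5]
enc_features = [encrypt(x) for x in features]

# --- Server side: score w.x + b on ciphertexts only ---
# E(m)^k = E(k*m) and E(a)*E(b) = E(a+b), so a linear model can be
# evaluated without the server ever seeing the plaintext features.
weights, bias = [2, 4], 7
enc_score = encrypt(bias)
for c, w in zip(enc_features, weights):
    enc_score = (enc_score * pow(c, w, n2)) % n2

# --- Client decrypts the score: 2*3 + 4*5 + 7 = 33 ---
print(decrypt(enc_score))  # 33
```

Note the division of knowledge, exactly as described in the episode: the server never sees the features, the client never sees the weights, and only the key holder can read the encrypted score.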
Why your attention is like a piece of contested territory
The O’Reilly Data Show Podcast: P.W. Singer on how social media has changed war, politics, and business. In this episode of the Data Show, I spoke with P.W. Singer, strategist and senior fellow at the New America Foundation, and a contributing editor at Popular Science. He is co-author of an excellent new book, LikeWar: The Weaponization of Social Media, which explores how social media has changed war, politics, and business. The book is essential reading for anyone interested in how social media has become an important new battlefield in a diverse set of domains and settings. We had a great conversation spanning many topics, including: In light of the 10th anniversary of his earlier book Wired for War, we talked about progress in robotics over the past decade. The challenge posed by the fact that social networks reward virality, not veracity. How the internet has emerged as an important new battlefield. How this new online battlefield changes how conflicts are fought and unfold. How many of the ideas and techniques covered in LikeWar are trickling down from nation-state actors influencing global events, to consulting companies offering services that companies and individuals can use. Here are some highlights from our conversation: LikeWar We spent five years tracking how social media was being used all around the world. ... We looked at everything from how it was being used by militaries, by terrorist groups, by politicians, by teenagers—you name it. The finding of this project is sort of a two-fold play on words. The first is, if you think of cyberwar as the hacking of networks, LikeWar is its twin. It's the hacking of people on the networks by driving ideas viral through a mix of likes and lies. ... Social media began as a space for fun, for entertainment. It then became a communication space. It became a marketplace. It's also turned into a kind of battle space. 
It's simultaneously all of these things at once, and you can see, for example, Russian information warriors who are using digital marketing techniques and teenage jokes to influence the outcomes of elections. A different example would be ISIS' top recruiter, Junaid Hussain, mimicking how Taylor Swift built her fan army. A common set of tactics The second finding of the project was that when you look across all these wildly diverse actors, groups, and organizations, they turned out to be using very similar tactics, very similar approaches. To put it a different way: it's a mode of conflict. There's ways of “winning” that all the different groups are realizing. More importantly, the groups that understand these new rules of the game are the ones that are winning their online wars and having a real effect, whether that real effect is winning a political campaign, winning a corporate marketing campaign, winning a campaign to become a celebrity, or to become the most popular kid in school. Or “winning” might be to do the opposite—to sabotage someone else's campaign to become a leading political candidate. Related resources: Siwei Lyu on “The technical, societal, and cultural challenges that come with the rise of fake media” Supasorn Suwajanakorn on “Building artificial people: Endless possibilities and the dark side” Guillaume Chaslot on “The importance of transparency and user control in machine learning” “Overcoming barriers to AI adoption” Alon Kaufman on “Machine learning on encrypted data” Sharad Goel and Sam Corbett-Davies on “Why it’s hard to design fair machine learning models”
Data regulations and privacy discussions are still in the early stages
The O’Reilly Data Show Podcast: Aurélie Pols on GDPR, ethics, and ePrivacy. In this episode of the Data Show, I spoke with Aurélie Pols of Mind Your Privacy, one of my go-to resources when it comes to data privacy and data ethics. This interview took place at Strata Data London, a couple of days before the EU General Data Protection Regulation (GDPR) took effect. I wanted her perspective on this landmark regulation, as well as her take on trends in data privacy and growing interest in ethics among data professionals. Here are some highlights from our conversation: GDPR is just the starting point GDPR is not an end point. It's a starting point for a journey where a balance between companies and society and users of data needs to be redefined. Because when I look at my children, I look at how they use technology, I look at how smart my house might become or my car or my fridge, I know that in the long run this idea of giving consent to my fridge to share data is not totally viable. What are we going to build for the next generations? ... I've been teaching privacy and ethics in Madrid at the IE Business School, one of the top business schools in the world. I’ve been teaching in the big data and analytics graduate program. I see the evolution as well. Five years ago, they looked at me like, 'What is she talking about?' Three years ago, some of the people in the room started to understand. ... Last year it was like 'We get it.' Privacy by design It's defined as data protection by design and by default as well. The easy part is more the default settings: when you create systems, it's the question I ask 20 times a week: 'Great. I love your system. What data do you collect by default and what do you pass on by default?' Then you start turning things off and then we'll see who takes on the responsibility to turn things on again. That's a default. Privacy by design was pushed by Ann Cavoukian from Ontario in Canada more than 10 years ago. 
These principles are finding themselves within the legislation. Not only in GDPR—for example, Hong Kong is starting to talk about this and Japan as well. One of these principles is about positive-sum, not zero-sum. It's not 'I win and you lose.' It's 'we work together and we both win.' That's a very good principle. There are interesting challenges within privacy by design to translate these seven principles into technical requirements. I think there are opportunities as well. It talks about traceability, visibility, transparency. Which then comes back again to, we're sitting on so much data; how much data do we want to surface and are data subjects or citizens ready to understand what we have, and are they able to make decisions based on that? ... Hopefully this generation of more ethically minded engineers or data scientists will start thinking in that way as well. Related resources: "The data subject first?": Aurélie Pols draws a broad philosophical picture of the data ecosystem and then hones in on the right to data portability. “How to build analytic products in an age when data privacy has become critical” "Managing risk in machine learning models": Andrew Burt and Steven Touw on how companies can manage models they cannot fully explain. “Building tools for the AI applications of tomorrow” “Toward the Jet Age of machine learning” "The real value of data requires a holistic view of the end-to-end data pipeline": Ashok Srivastava on the emergence of machine learning and AI for enterprise applications "Bringing AI into the enterprise": Kris Hammond on business applications of AI technologies and educating future AI specialists.
Semi-supervised, unsupervised, and adaptive algorithms for large-scale time series
The O’Reilly Data Show Podcast: Ira Cohen on developing machine learning tools for a broad range of real-time applications. In this episode of the O’Reilly Data Show, I spoke with Ira Cohen, co-founder and chief data scientist at Anodot (full disclosure: I’m an advisor to Anodot). Since my days in quantitative finance, I’ve had a longstanding interest in time-series analysis. Back then, I used statistical (and data mining) techniques on relatively small volumes of financial time series. Today’s applications and use cases involve data volumes and speeds that require a new set of tools for data management, collection, and simple analysis. On the analytics side, applications are also beginning to require online machine learning algorithms that are able to scale, are adaptive, and are free of a rigid dependence on labeled data. I talked with Cohen about the challenges in building an advanced analytics system for intelligent applications at extremely large scale. Here are some highlights from our conversation: Surfacing anomalies A lot of systems have a concept called dashboarding, where you put your regular things that you look at—the total revenue, the total amount of traffic to my website. … We have a parallel concept that we called Anoboard, which is an anomaly board. An anomaly board is basically showing you only the things that right now have some strange patterns to them. … So, out of the millions, here are the top 20 things you should be looking at because they have a strange behavior to them. … The Anoboard is something that gets populated by machine learning algorithms. … We only highlight the things that you need to look at rather than the subset of things that you're used to looking at, but that might not be relevant for discovering anything that's happening right now. Adaptive, online, unsupervised algorithms at scale We are a generic platform that can take any time series into it, and we'll output anomalies. 
Like any machine learning system, we have success criteria. In our case, it's that the number of false positives should be minimal, and the number of true detections should be the highest possible. Given those constraints and given that we are agnostic to the data so we're generic enough, we have to have a set of algorithms that will fit almost any type of metrics, any type of time series signals that get sent to us. To do that, we had to observe and collect a lot of different types of time series data from various types of customers. … We have millions of metrics in our system today. … We have over a dozen different algorithms that fit different types of signals. We had to design them and implement them, and obviously because our system is completely unsupervised, we also had to design algorithms that know how to choose the right one for every signal that comes in. … When you have millions of time series and you're measuring a large ecosystem, there are relationships between the time series, and the relationships and anomalies between different signals do tell a story. … There are a set of learning algorithms behind the scene that do this correlation automatically. … All of our algorithms are adaptive, so they take in samples and basically adapt themselves over time to fit the samples. Let's say there is a regime change. It might trigger an anomaly, but if it stays in a different regime, it will learn that as the new normal. … All our algorithms are completely online, which means they adapt themselves as new samples come in. This actually addresses the second part of the first question, which was scale. We know we have to be adaptive. We want to track 100% of the metrics, so it's not a case where you can collect a month of data, learn some model, put it in production and then everything is great and you don't have to do anything. You don't have to relearn anything. … We assume that we have to relearn everything all the time because things change all the time. 
Discovering relationships among KPIs and semi-supervised learning We find relationships between different KPIs and show it to a user; it's often something they are not aware of and are surprised to see. … Then, when they think about it and go back, they realize, 'Oh, yeah. That's true.' That completely changes their way of thinking. … If you're measuring all sorts of business KPIs, nobody knows the relationships between things. They can only conjecture about them, but they don't really know it. … I came from a world of semi-supervised learning where you have some labels, but most of the data is unlabeled. I think this is the reality for us as well. We get some feedback from users, but it's a fraction of the feedback you need if you want to apply supervised learning methods. Getting that feedback is actually very, very helpful. … Because I'm from the semi-supervised learning world, I always try to see where I can get some inputs from users, or from some oracle, but I never want to rely on it being there. Editor’s note: Ira Cohen will present a talk entitled Analytics for large-scale time-series and event data at Strata + Hadoop World London 2016. Related resources: Building self-service tools to monitor high-volume time-series data (a previous episode of the Data Show) Introduction to Apache Kafka Introduction to time series with Team Apache How intelligent data platforms are powering smart cities
Jai Ranganathan on architecting big data applications in the cloud
The O’Reilly Data Show podcast: The Hadoop ecosystem, the recent surge in interest in all things real time, and developments in hardware.In this episode of the O'Reilly Data Show, I sat down with Jai Ranganathan, senior director of product management at Cloudera. We talked about the trends in the Hadoop ecosystem, cloud computing, the recent surge in interest in all things real time, and hardware trends: Large-scale machine learning This sounds a bit like this should already exist in really good form right now, but one of the things that I'm really interested in is expanding the set of capabilities for distributed machine learning. While there are systems out there today that do this, I think relative to what you can experience from a single-machine environment like scikit-learn or R, the set of things you can do in a distributed fashion is limited. ... It's not easy to distribute various algorithms and model-building techniques. I think there is still a lot of work for us to do to improve that experience. ... And I do want to have good open source options like MLlib. MLlib may be the right answer. I would be perfectly happy if that's the final answer, but we do need systems just to provide the kind of depth that you typically are used to in a single-machine environment. That's just a matter of time and investment because these are non-trivial problems, but they are things that people are working on. Architecting data applications in the cloud There are some fundamental design principles behind the original HDFS implementation, which don't actually work in the cloud. For example, this notion that data locality is fundamental to this system design; it starts changing in the cloud when you're looking at these large cloud providers — they are doing all these software-defined networking tricks and they can do bisectional bandwidth, like 40 gigs per second, across their data center ...
suddenly, you're talking about moving hundreds of terabytes of data back and forth from a storage layer to a compute layer without any huge performance penalties. Sure, there's a performance disadvantage to this, but it's not as bad as you think. Some of the core design principles in Hadoop have to change when you think about this kind of new data center design. ... The cloud part is really interesting, but what's really interesting to me is that there's a fundamental shift in the way data centers are being designed, which we have to make sure that Hadoop stays designed to capitalize on. ... A lot of the work we do on the cloud is to optimize working with these object stores effectively. Obviously, you still need some local storage for things like spill, but that's not really the same as a distributed file system. Then, it's really a question of getting all the frameworks to run really effectively against an object store. Paying attention to hardware trends When I joined Cloudera, a customer who was going crazy and buying the most expensive hardware was buying 64 gigabytes of RAM. On that 64 gigabytes of RAM, they also had 12 disk spindles with two terabytes each and 24 terabytes of disk. At this point, today, many of my customers buy 256 gigabytes of RAM or even potentially 384 gigabytes to 512 gigabytes of RAM. The amount of disk is still exactly the same. Because disks don't spin faster and you still want a certain level of throughput, you're still looking at 24 terabytes of disk in your machine. Already in just two years, we have seen it go from 64 to 512, potentially. I don't think this trend is going to stop, and we are suddenly going to be looking at, within three years, one-terabyte RAM machines. ... What we're finding is that in a lot of the things we do at Cloudera, like Kudu or Impala, fundamentally, we really care about wringing performance out of the CPU. A lot of this will be like, 'can I do vectorized operations?'
and 'can I make sure to take advantage of my L2 cache more effectively?' because that allows my CPU to spend its cycles more efficiently. It really changes the bottleneck from the I/O subsystem to the CPU subsystem, and everything you can do to eke out performance there really matters. ... Project Tungsten is basically an effort in the Spark community to do more CPU-efficient things, whether that's vectorizing stuff, or effectively moving away from managed memory to managing byte buffers, so you can actually have much more efficient handling of memory, and you can get better CPU efficiency as well. Subscribe to the O'Reilly Data Show Podcast: Stitcher, TuneIn, iTunes, SoundCloud, RSS Related resources: Jai Ranganathan will be speaking at Strata + Hadoop World Singapore: Hadoop in the cloud — an architectural how-to Why the data center needs an operating system by Benjamin Hindman, creator of Apache Mesos Showcasing the real-time processing revival: Tools and learning resources for building intelligent, real-time products (sessions at Strata + Hadoop World NYC) Apache Spark: Powering applications on-premise and in the cloud, a Data Show episode featuring Spark's release manager, Patrick Wendell.
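Ranganathan's point about vectorized operations is easy to demonstrate outside of Spark. The sketch below is illustrative only (it assumes NumPy is installed): it contrasts a per-element interpreted loop, where the CPU spends most of its time on dispatch overhead, with a single vectorized call over a contiguous buffer, the same shift of the bottleneck from overhead to raw CPU arithmetic that engines like Impala and Tungsten chase.

```python
import timeit

import numpy as np  # assumption: NumPy is available

n = 1_000_000
xs = np.random.rand(n)

def python_loop():
    # One interpreted multiply-add per element.
    total = 0.0
    for x in xs:
        total += x * 2.0
    return total

def vectorized():
    # One call over a contiguous buffer: a tight machine-code loop the
    # CPU can pipeline, auto-vectorize (SIMD), and keep in cache.
    return (xs * 2.0).sum()

loop_t = timeit.timeit(python_loop, number=1)
vec_t = timeit.timeit(vectorized, number=1)
```

On typical hardware the vectorized version is one to two orders of magnitude faster for the same result, which is why "can I do vectorized operations?" is the question worth asking once the CPU, not I/O, is the bottleneck.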
Building human-assisted AI applications
The O’Reilly Data Show Podcast: Adam Marcus on intelligent systems and human-in-the-loop computing.In this episode of the O’Reilly Data Show, I spoke with Adam Marcus, co-founder and CTO of B12, a startup focused on building human-in-the-loop intelligent applications. We talked about the open source platform Orchestra, for coordinating human-in-the-loop projects; the current wave of human-assisted AI applications; best practices for reviewing and scoring experts; and flash teams. Here are some highlights from our conversation: Orchestra: A platform for building human-assisted AI applications I spent a total of three years doing web-scale structured data extraction. Toward the end of that period, I started speaking with Nitesh Banta, my co-founder at B12, and we said, ‘Hey, it's really awesome that you can coordinate all of these experts all over the world and give them all of these human-assisted AIs to take a first pass at work so that a lot of the labor goes away and you can use humans where they're uniquely positioned.’ But we really only managed to make a dent in data extraction and data entry. We thought that an interesting work model was emerging here, where you had human-assisted AIs and they were able to help experts do way more interesting knowledge work tasks. We're interested, at B12, in pushing all of this work up the knowledge work stack. The first stage in this process is to build out the infrastructure to make this possible. This is where Orchestra comes in. It's completely open source, it's available for anyone to use on GitHub and contribute to. What Orchestra does is basically serve as the infrastructure for building all sorts of human-in-the-loop and human-assisted AI applications. It essentially helps coordinate teams of experts who are working on really challenging workflows and pairs them up with all sorts of automation, custom-user interfaces, and tools to make them a lot more effective at their jobs.
The first product that we built on top of Orchestra is an intelligent website product: a client will come to us and say that they'd like to get their web presence set up. Orchestra will quickly recruit the best designer, the best client executive, the best copywriter onto a team and it will follow a predefined workflow. The client executive will be scheduled to interview the client. Once an interview is completed, a designer is then staffed onto the project automatically. Human-assisted AI, essentially an algorithmic design, is run so that we can take some of the client's preferences and automatically generate a few initial passes at different websites for them, and then the designer is presented with those and gets to make the critical creative design decisions. Other folks are brought onto the project by Orchestra as needed. If we need a copywriter, if we need more expertise, then Orchestra can recruit the necessary staff. Essentially, Orchestra is a workflow management tool that brings together all sorts of experts, automates a lot of the really annoying project management functionality that you typically have to bring project managers onboard to do, and empowers the experts with all sorts of automation so they can focus on what they're uniquely positioned to do. Bots and data flow programming for human-in-the-loop projects Your readers are probably really familiar with things like data flow and workflow programming systems, and systems like that. In Orchestra, you declaratively describe a workflow, where various steps are either completed by humans or machines. It's Orchestra's job at that point, when it's time for a machine to jump in (and in our case its algorithmic design) to take a first pass at designing a website. 
It's also Orchestra's job to look at which steps in the workflow have been completed and when it should do things like staff a project, notice that the people executing the work are maybe falling off course on the project and that we need more active process management, bring in incentives, and so forth. The way we've accomplished all of this project automation in Orchestra is through bots, a super popular topic right now. The way it works for us is that Orchestra is pretty tightly integrated with Slack. At this point, probably everyone has used Slack for communicating with some kind of organization. Whenever an expert is brought into a project that Orchestra is working on, it will invite that expert to a Slack channel, where all of the other experts on his or her team are as well. Since the experts on our platform are using Orchestra and Slack together, we've created bots that help with process and project automation. All sorts of things like staffing, process management, incentives, and review hierarchies are managed through conversation. I'll give you an example in the world of staffing. Before we added staffing functionality to Orchestra, whenever we wanted to bring a designer onto a project, we'd have to send a bunch of messages over Slack, 'Hey, is anyone available to work on a project?' The designers didn't have a lot of context, so sometimes it would take about an hour of work for us to actually do the recruiting, and experts wouldn't get back to us for a day or two. We built a staffbot into Orchestra in response to this problem, and now the staffbot has a sense of how well experts have completed various tasks in the past, how much they already have on their plates, and the staffbot can create a ranking of the experts on the platform and reach out to the ones who are the best matches. ...Orchestra reaches out to the best expert matches over Slack and sends a message along the lines of, ‘Hey, here's a client brief for this particular project.
Would you like to accept the task and join the team?’ An expert who is interested just has to click a button, and then he or she is integrated into the Orchestra project and folded into the Slack group that's completing that task. We've reduced the time to staff a project from a few days down to a little less than five minutes. Related resources: Crowdsourcing at GoDaddy: How I Learned to Stop Worrying and Love the Crowd (a presentation by Adam Marcus) Why data preparation frameworks rely on human-in-the-loop systems Building a business that combines human experts and data science Metadata services can lead to performance and organizational improvements
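The declarative, human-and-machine workflow Marcus describes can be sketched in a few lines. The schema, step names, and function below are hypothetical, not Orchestra's actual API; the point is only that once steps and their dependencies are declared as data, an engine can decide what is runnable next and whether to hand each step to automation or to a staffing process.

```python
# Hypothetical declarative workflow in the spirit of Orchestra's model:
# each step is either completed by a human (with a role to staff) or by
# a machine step such as algorithmic design.
WEBSITE_WORKFLOW = [
    {"slug": "client_interview", "worker": "human", "role": "client_executive"},
    {"slug": "algorithmic_design", "worker": "machine",
     "depends_on": ["client_interview"]},
    {"slug": "design_review", "worker": "human", "role": "designer",
     "depends_on": ["algorithmic_design"]},
]

def next_runnable_steps(workflow, completed):
    """Return slugs of steps whose dependencies are all satisfied and
    that haven't run yet. An engine would route machine steps to
    automation and human steps to staffing (e.g., a staffbot)."""
    runnable = []
    for step in workflow:
        if step["slug"] in completed:
            continue
        if all(dep in completed for dep in step.get("depends_on", [])):
            runnable.append(step["slug"])
    return runnable
```

With an empty completion set, only the client interview is runnable; once it finishes, the machine design step unlocks, and the designer is staffed only after that pass exists, matching the sequence described above.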
Enabling enterprise adoption of AI technologies
The O’Reilly Data Show Podcast: Jana Eggers on building applications that rely on synaptic intelligence.In this episode of the O’Reilly Data Show, I spoke with Jana Eggers, CEO of Nara Logics. Eggers’ involvement with AI dates back to her days as a researcher at the Los Alamos National Laboratory. Most recently she has been helping companies across many industries adopt AI technologies as a way to enable a range of intelligent data applications.Here are some highlights from our conversation: Design and UX for AI and data products I remember the day, because I'm old enough, that UX was not really even part of the team for software developers. If you needed that, the pretty pictures could be layered on top later, right? It was really just about how the system worked. Obviously, we all think of the huge impact that UX should and does have these days and how UX is a critical part of the team. I see AI products the same way—that we need to start bringing more people to AI like we brought them to UX. I don't think we have integration with machine learning (ML) as much. We still have a lot of engineering teams that are the ML teams that just provide access to whatever ML algorithms they have, and they don't have a full team wrapped around ML, which you need. You even need it if you're just providing APIs. You need someone who's thinking about what this is, what is the UX? Even if the UX is exemplified as an API. You need someone thinking about how this is going to express itself, and what's the best way to help people use this very effectively and for the right problems. I think with our black boxes, sometimes we get people misapplying machine learning. Synaptic vs. artificial intelligence One of the things that we at Nara Logics say is that we're synaptic intelligence, and the reason why we chose that phrasing in particular is because the synapse is the connection between two neurons in our brain, or in our body, really. 
The synapse is the mechanism that allows those neurons to communicate. An important part of what we're doing is deciding the strength of things and how they are associated with each other via that strength, so to us, our artificial intelligence is based on some neuroscience research. That's its foundation, and thus we use the term ‘synaptic intelligence’ versus ‘artificial intelligence.’ However, we do have those other aspects, which people very understandably expect from an artificial intelligence company, which is you have to have the ability to learn, so we have to take in new data and be able to make those shifts. One of the questions with many AI companies is, how quickly do you shift? That really depends on the environment that you're in. Sometimes you want to learn, but you want to take a longer time period to learn. Sometimes you want to be very responsive. You really need to understand the problem. … We typically describe ourselves in the context of our specific features. … One of these features is that we associate information so that we can provide recommendations for...an action to take in a particular situation. The other big differentiator that we have is the ability to tell you why we do what we do, where a lot of traditional AI and machine learning platforms are more of a black box, and they don't have this ability to give you an idea of why a given answer is coming out. Those are two of our big differentiators. Interpretability I was using AI back in the early 1990s, and I was then trying to explain to chemists what they could do to take this information and produce a better material that was more conductive. When I’m working out of a black box, I have to make guesses. I'm trying to reverse engineer. Well, why did it come out this way? The answer? Well, I provided a different set of data and it's different. 
Then I'd make comparisons on that data and try to figure it out that way—it was really cumbersome and difficult. What I saw the team had done here was keep a keen focus on how we can produce a 'why' for these results. You'll notice when people give you recommendations—think about me recommending this restaurant to you. In general, are you going to ask me why? With the human brain, we just don't ask the question. Sometimes you ask the question of why, and people say, ‘Oh, you know, I don't really know, but let me think about it.’ Then they'll start telling you, ‘You know, when I walked into that restaurant, actually it smelled like my grandmother's cooking, and I love her, and I love her cooking, and it just made me instantly go back to my childhood. Even though the service was crappy, I'm just now realizing I couldn't get over the fact that it reminded me of my grandma's cooking.’ You get things like that and you don't even realize until you stop and think. Your brain is actually much less of a black box than we give it credit for. Related resources: When A.I. joins the team: Jana Eggers’ keynote at Strata + Hadoop World (Singapore, 2015) What is Artificial Intelligence? Commercial speech recognition systems in the age of big data and deep learning Deep learning in production at Google: a conversation with TensorFlow co-creator Rajat Monga
The technology behind self-driving vehicles
The O’Reilly Data Show Podcast: Shaoshan Liu on perception, knowledge, reasoning, and planning for autonomous cars.Ask a random person for an example of an AI system and chances are he or she will name self-driving vehicles. In this episode of the O’Reilly Data Show, I sat down with Shaoshan Liu, co-founder of PerceptIn and previously the senior architect (autonomous driving) at Baidu USA. We talked about the technology behind self-driving vehicles, their reliance on rule-based decision engines, and deploying large-scale deep learning systems.Here are some highlights from our conversation: Advanced sensors for mapping, localization, and obstacle avoidance The first part is sensing. How do you gather data about the environment? You have different types of sensors. The main type of sensor used in today's autonomous driving is LIDAR, a laser-based radar. A main problem with LIDAR is cost. However, there are startups that are working on low-cost LIDAR systems. Then, of course, there is GPS, and in addition there is a sensor called the inertial measurement unit (IMU). People today usually combine the data from GPS, IMU, and LIDAR to localize the vehicle to centimeter accuracy. There's one more sensor—a radar— used for obstacle avoidance. It's a reactive mechanism. If all of the above sensors fail to recognize that there's an object in front of you, then this sensor can detect objects five to 10 meters away from you. This radar is hooked up directly to the control system, such that when it detects there's an object in front of you it can drive the car away from the object autonomously. Sophisticated machine learning pipelines for perception To me, perception has three major components. The first component is how you localize your vehicle, and then based on localization information, you can make decisions about where to navigate. The second component is object recognition. 
Here, deep learning technology is commonly used to take camera data and recognize the objects around your vehicle. The third component is object tracking. You might be in a car on a highway, for example. You want to know what the car next to you is doing. … A deep learning-based object-tracking mechanism is what you would normally use to track the car or the objects next to you. Largely rule-based decision engines The decision pipeline normally includes a few major components. The first one is path planning. How do you want to go from point A to point B and plan your path? How do you issue instructions to the vehicle to go from point A to point B? There are many research papers and algorithms on route planning; the famous A* algorithm is often impractical. The second part is prediction. We discussed that as part of the perception pipeline—there's object tracking to track nearby objects. Then, we have a prediction algorithm based on the tracking results. The algorithm measures the likelihood of crashing into or avoiding nearby objects. Based on these predictions, we derive the object- or obstacle-avoidance decisions. How do we drive away from these obstacles or moving objects such that we don't get into an accident? Today, you’ll find largely rule-based engines, but there are many research projects on the use of reinforcement learning and deep learning networks to make autonomous decisions about prediction, obstacle avoidance, path bending, and so on. Related resources: What is Artificial Intelligence? The New Artificial Intelligence Market Data, Technology and the Future of Play: Understanding the Smart Toy landscape Accelerating big data workloads with Alluxio (Tachyon)
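For readers who haven't seen it, the A* search Liu mentions is a short algorithm. Here is a textbook grid version with a Manhattan-distance heuristic; production route planners work over road-network graphs with far more engineering, so treat this only as a reference point for what the path-planning step computes.

```python
import heapq

def astar(grid, start, goal):
    """Textbook A* on a 2-D grid (0 = free, 1 = blocked). Returns the
    shortest path as a list of (row, col) cells, or None if the goal
    is unreachable."""
    def h(cell):
        # Manhattan-distance heuristic: admissible for 4-way movement.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    open_set = [(h(start), start)]   # priority queue of (f-score, cell)
    came_from = {}
    g = {start: 0}                   # best known cost from start
    while open_set:
        _, cur = heapq.heappop(open_set)
        if cur == goal:
            path = [cur]             # walk back through the parent links
            while cur in came_from:
                cur = came_from[cur]
                path.append(cur)
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0):
                tentative = g[cur] + 1
                if tentative < g.get(nxt, float("inf")):
                    came_from[nxt] = cur
                    g[nxt] = tentative
                    heapq.heappush(open_set, (tentative + h(nxt), nxt))
    return None
```

On a small grid with a wall across the middle, the search expands cells in order of estimated total cost and reconstructs the detour around the obstacle.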
Data architectures for streaming applications
The O’Reilly Data Show Podcast: Dean Wampler on streaming data applications, Scala and Spark, and cloud computing.In this episode of the O’Reilly Data Show I sat down with O’Reilly author Dean Wampler, big data architect at Lightbend. We talked about new architectures for stream processing, Scala, and cloud computing.Our interview dovetailed with conversations I’ve had lately, where I've been emphasizing the distinction between streaming and real time. Streaming connotes an unbounded data set, whereas real time is mainly about low latency. The distinction can be blurry, but it’s something that seasoned solution architects understand. While most companies deal with problems that fall under the realm of “near real time” (end-to-end pipelines that run somewhere between five minutes to an hour), they still need to deal with data that is continuously arriving. Part of what’s interesting about the new Structured Streaming API in Apache Spark is that it opens up streaming (or unbounded) data processing to a much wider group of users (namely data scientists and business analysts). Here are some highlights from our conversation: The growing interest in streaming There are two reasons. One is that it's an emerging area, where people are struggling to figure out what to do and how to make sense of all these tools. It does also raise the bar in terms of production issues compared to a batch job that runs for a couple of hours. … But a streaming job is supposed to run reliably for months, or whatever. Suddenly, you're now out of the realm of the back office and into the realm of the bleeding edge, always-on, Internet that your distributed computing friends are fretting over all the time. The problems are harder. 
The last point I'll make is that streaming fits the model of stuff that Lightbend has traditionally worked on, which has been more highly available, highly reliable systems and tools to support that; so, it's a more natural fit than just the general data science and data engineering problems. Stream processing frameworks We're seeing this same sort of Cambrian explosion of different options like we saw with the NoSQL databases in the 2000s. A lot of these will fall by the wayside, I think. … What I encourage people to do is, first, make sure you're picking something that really has a vibrant community, like Spark, where it's clear that it's going to be around for a while, that people are going to keep moving forward. But then make sure you understand what the strengths and weaknesses of the system are. Just enough Scala for Spark In my view, no tool highlights the advantages of Scala and hides the disadvantages of Scala better than Spark does. I'm thinking more of the old RDD API than the newer Dataset API, but when you're writing, effectively what are data flows, in Spark, it's just a natural, very elegant way to express them when you use the Scala API. I think even more expressive and clean than the Python API, which is historically a very concise and great way to write data science apps. ... Certainly, at Lightbend, we've seen that big data and the fact that it's being used a lot for tools like Spark and Kafka has driven interest in Scala in general. There were a lot of people who were actually starting to use Scala for the first time because of Spark; they didn't really want to become Scala experts, but needed to know enough to be productive, and also wanted to learn some of the cool tricks that make it so elegant. 
That's really the genesis of the tutorial that I'm going to give at Strata + Hadoop World New York and Singapore, and we're also working on video training for O'Reilly on the same material, designed to help you avoid the dark corners, just give you the key things that are really so useful, and then you can take it from there. Related resources: Just Enough Scala for Spark: Dean Wampler is teaching this new three-hour tutorial at Strata + Hadoop World New York (Sept 27, 2016) and Singapore (Dec 6, 2016). Uber’s case for incremental processing on Hadoop Making sense of stream processing Analyzing data in the Internet of Things This post and podcast is part of a collaboration between O’Reilly and Lightbend. See our statement of editorial independence.
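The streaming-versus-real-time distinction drawn above comes down to processing an unbounded input incrementally: state is updated as each event arrives, rather than after the data set ends (because it never does). Here is a framework-free sketch of that idea, tumbling-window counts over an event stream. It is conceptual only and is not the Structured Streaming API; the function name and event shape are made up for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Incrementally aggregate an unbounded stream of (timestamp, key)
    events into per-window counts. Each event updates state as it
    arrives and emits the current count for its window; nothing here
    requires the input to ever end."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)   # bucket into a window
        counts[(window_start, key)] += 1
        yield (window_start, key, counts[(window_start, key)])
```

Because the function is a generator over its input, it works equally well on a finite batch or an endless feed, which is exactly the property that makes "streaming" about unboundedness rather than latency.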
Building a business that combines human experts and data science
The O’Reilly Data Show podcast: Eric Colson on algorithms, human computation, and building data science teams.I spoke with Eric Colson, chief algorithms officer at Stitch Fix, and former VP of data science and engineering at Netflix. We talked about building and deploying mission-critical, human-in-the-loop systems for consumer Internet companies. Knowing that many companies are grappling with incorporating data science, I also asked Colson to share his experiences building, managing, and nurturing large data science teams at both Netflix and Stitch Fix.Augmented systems: “Active learning,” “human-in-the-loop,” and “human computation” We use the term ‘human computation’ at Stitch Fix. We have a team dedicated to human computation. It's a little bit coarse to say it that way because we do have more than 2,000 stylists, and these are very much human beings who are very passionate about fashion styling. What we can do is abstract their talent into—you can think of it like an API; there are certain tasks that only a human can do or we're going to fail if we try this with machines, so we almost have programmatic access to human talent. We are allowed to route certain tasks to them, things that we could never get done with machines. ... We have some of our own proprietary software that blends together two resources: machine learning and expert human judgment. The way I talk about it is, we have an algorithm that's distributed across the resources. It's a single algorithm, but it does some of the work through machine resources, and other parts of the work get done through humans. ... You can think of even the classic recommender systems, collaborative filtering, which people recognize as, ‘people that bought this also bought that.’ Those things break down to nothing more than a series of rote calculations.
Being a human, you can actually do them by hand—it'll just take you a long time, and you'll make a lot of mistakes along the way, and you're not going to have much fun doing it—but machines can do this stuff in milliseconds. They can find these hidden relationships within the data that are going to help figure out what's relevant to certain consumer's preferences and be able to recommend things. Those are things that, again, a human could, in theory, do, but they're just not great at all the calculations, and every algorithmic technique breaks down to a series of rote calculations. ... What machines can't do are things around cognition, things that have to do with ambient information, or appreciation of aesthetics, or even the ability to relate to another human—those things are strictly in the purview of humans. Those types of tasks we route over to stylists. ... I would argue that our humans could not do their jobs without the machines. We keep our inventory very large so that there are always many things to pick from for any given customer. It's so large, in fact, that it would take a human too long to sift through it on her own, so what machines are doing is narrowing down the focus. Combining art and science Our business model is different. We are betting big on algorithms. We do not have the barriers to competition that other retailers have, like Wal-Mart has economies of scale that allow them to do amazing things; that's their big barrier. ... What is our protective barrier? It's [to be the] best in the world at algorithms. We have to be the very best. ... More than any other company, we are going to suffer if we're wrong. ... Our founder wanted to do this from the very beginning, combine empiricism with what can't be captured in data, call it intuition or judgment. But she really wanted to weave those two things together to produce something that was better than either can do on their own. She calls it art and science, combining art and science. 
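Colson's point that collaborative filtering "breaks down to a series of rote calculations" is easy to see in code. Below is a minimal item co-occurrence recommender, the "people who bought this also bought that" calculation in its simplest form; it is illustrative only and not Stitch Fix's system.

```python
from collections import Counter

def also_bought(purchases, item, top_n=3):
    """'People who bought this also bought that', reduced to its rote
    calculation: count how often other items co-occur in the same
    baskets as `item`, and rank the others by that count."""
    co_counts = Counter()
    for basket in purchases:
        if item in basket:
            for other in basket:
                if other != item:
                    co_counts[other] += 1
    return [i for i, _ in co_counts.most_common(top_n)]
```

A human could tally these co-occurrence counts by hand, slowly and with mistakes, exactly as Colson says; a machine does it in milliseconds over millions of baskets, which is why this side of the work is routed to machines while taste and aesthetics are routed to stylists.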
Defining roles in data science teams

[Job roles at Stitch Fix are] built on three premises that come from Dan Pink's book Drive. Autonomy, mastery, purpose—those are the fundamental things you need for high job satisfaction. With autonomy, that's why we dedicate people to a team. You're going to work on what's called ‘marketing algorithms.’ You may not know anything about marketing to begin with, but you're going to learn it pretty fast. You're going to pick up the domain expertise. By autonomy, we want you to do the whole thing so you have the full context. You're going to be the one sourcing the data and building pipelines. You're going to be applying the algorithmic routine. You're going to be the one who frames the problem, figures out what algorithms you need, and delivers the output, connecting it back to some action, whatever that action may be. Maybe it's adjusting our multi-channel strategy. Whatever that algorithmic output is, you're responsible for it. So, that's mastery. You're autonomous because you do all the pieces, and you're getting mastery over one domain, in that case, say, marketing algorithms. You're going to be looked to as the best person in the company to talk about how these things work; you know the end to end. Then, purpose—that's the impact you're going to make. In the case we gave, marketing algorithms, you want to be accountable. You want to be the one who can move the needle. What channels are more effective at acquiring new customers? Whatever it is, you're going to be held accountable for a real number, and that is motivating; that's what makes people love their jobs.

Editor’s note: Eric Colson will speak about augmenting machine learning with human computation for better personalization at Strata + Hadoop World in San Jose this March.
Related resources:
Minds and machines—Humans where they're best, robots for the rest: Adam Marcus’ presentation at Hardcore Data Science (Strata + Hadoop World NYC 2015)
Fashioning Data (a free O’Reilly Data report)
Marketing and Consumer Research (an O’Reilly Learning Path)
Building the next-generation big data analytics stack
The O’Reilly Data Show Podcast: Michael Franklin on the lasting legacy of AMPLab.

In this episode, I spoke with Michael Franklin, co-director of UC Berkeley’s AMPLab and chair of the Department of Computer Science at the University of Chicago. AMPLab is well known in the data community for having originated Apache Spark, Alluxio (formerly Tachyon), and many other open source tools. Today marks the start of a two-day symposium commemorating the end of AMPLab, and we took the opportunity to reflect on its impressive accomplishments. AMPLab is the latest in a series of UC Berkeley research labs, each designed with clear goals, a multidisciplinary faculty, and a fixed timeline (for more details, see David Patterson’s interesting design document for research labs). Many of AMPLab’s principals were involved in its precursor, the RAD Lab. As Franklin describes in our podcast episode:

The insight that Dave Patterson and the other folks who founded the RAD Lab had was that modern systems were so complex that you needed serious machine learning—cutting-edge machine learning—to be able to do that [to basically allow the systems to manage themselves]. You couldn't take a computer systems person, give them an intro to machine learning book, and hope to solve that problem. They actually built this team that included computer systems people sitting next to machine learning people. ... Traditionally, these two groups had very little to do with each other. That was a five-year project. The way I like to say it is, they spent at least four of those years learning how to talk to each other. Toward the end of the RAD Lab, we had probably the best group in the world of combined systems and machine learning people who actually could speak to each other. In fact, Spark grew out of that relationship, because there were machine learning people in the RAD Lab who were trying to run iterative algorithms on Hadoop and were just getting terrible performance. ...
AMPLab, in some sense, was a flip of that relationship. If you considered the RAD Lab as basically a setting where “machine learning people were consulting for the systems people,” in AMPLab we did the opposite: machine learning people got help from the systems people in how to make these things scale. That's one part of the story.

In the rest of this post, I’ll describe some of my interactions with the AMPLab team. These recollections are based on early meetups, retreats, and conferences.

The speed gains were addictive

I first tried Spark around the version 0.4 and 0.5 releases. At the time, I was using Hive and Pig for data processing, while evaluating Mahout for machine learning. Other than being a bit resistant to having to learn a new programming language—Scala—which I later came to love, I immediately became a user and fan of Spark. The speed gains were addictive! AMPLab was also starting to roll out useful examples and libraries at a steady pace, and I soon found myself finding reasons to use Spark on more tasks and projects. Interacting with and getting feedback from developers at local meetups was important to the students and professors of AMPLab. Around mid-2012, there was a San Francisco meetup where the audience got to see a preview of Spark Streaming. I remember the reaction to the presentation very clearly. There was immediate interest and enthusiasm, and it was clear to me that Spark Streaming was going to be popular. At the time, many in the audience used Storm, and the prospect of a simplified infrastructure (due to Spark’s ability to handle both batch and streaming) was attractive to many in attendance. It was at this meetup that I first broached the idea of a Spark book to Matei Zaharia (the creator of Spark). That initial conversation led to the popular O’Reilly title, Learning Spark.
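Part of Spark Streaming's appeal was its micro-batch model: a stream is discretized into small batches, and the same batch code (a word count, say) runs on each one, so batch and streaming workloads share one engine. A toy sketch of that idea in plain Python, illustrating the model rather than Spark's actual API:

```python
def micro_batches(events, batch_size):
    """Discretize a stream of events into fixed-size batches, mirroring
    how Spark Streaming chops a live stream into a sequence of small RDDs."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:              # emit the final partial batch, if any
        yield batch

def word_count(batch):
    """Ordinary batch logic, reused unchanged on every micro-batch."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "storm", "spark", "spark", "hive"]
for batch in micro_batches(stream, batch_size=2):
    print(word_count(batch))
```

The attraction for Storm users in that audience was exactly this reuse: one set of batch functions, applied to bounded datasets and to streams alike.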
Discoveries at AMP Camp

In the fall of 2012, I was fortunate enough to be invited to the first AMP Camp, and while I was en route to that event, I wrote my first post on Spark (“Seven reasons why I like Spark”). AMP Camps combined talks with hands-on tutorials, and in the early days of Spark they became the de facto community gathering for users. A few things stood out for me at that first AMP Camp. First, the tutorials were cloud-friendly from the beginning: AMP Camp tutorials provided tools to help users play with Spark on AWS. Second, the unveiling of PySpark came at a time when most of the early users had JVM (Java, Scala, Clojure) backgrounds. It opened up Spark to the large number of data scientists who use Python as their primary language. This has worked out extremely well: the most recent user survey suggests Python and Scala have the same number of users in the Spark community. Finally, machine learning was featured prominently at that first AMP Camp. From the early days of Spark, many users, including myself, were drawn to its potential for machine learning tasks.

While Spark is the project AMPLab is most identified with, the lab has always been about building the next-generation big data analytics stack. As Franklin noted in our conversation, prior to the establishment of AMPLab, both he and his co-director Ion Stoica spent time on separate startups. Their experiences helped inform the initial design of what became known as the Berkeley Data Analytics Stack (BDAS). I was fortunate enough to attend several AMPLab retreats where many BDAS components were first revealed. Following that first AMP Camp in 2012, I wrote about a few other projects that caught my attention:

Alluxio (formerly Tachyon) is a storage-backed, distributed, shared-memory system that cuts across compute frameworks. In recent months, I’ve come across several companies—here and in Asia—that are starting to use Alluxio in production.
BlinkDB was a query engine built on the idea that in many situations approximate answers suffice. It inspired the introduction of approximate algorithms in later versions of Spark.

KeystoneML was about reproducible and interpretable end-to-end machine learning pipelines, with some notion of auto-tuning (systems optimizations) and error bounds. The early results were promising, but the project never quite caught on with the external community. This is one of my favorite projects out of AMPLab, and it inspired ML Pipelines in Spark.

Succinct is a “compressed” data store that enables a wide range of point queries (search, count, range, random access) directly on a compressed representation of the input data.

What’s ahead

As I look to the future and to AMPLab’s successor (the RISE Lab), I’m thankful for having had a front-row seat to the projects at AMPLab. The model of a university research lab listening to and working with industrial partners, while continuing to produce highly cited academic papers, is something that other institutions should emulate. Apache Spark has emerged as the most popular open source project in big data, adopted and promoted by companies across many countries and industries. Many of the other AMPLab projects have influenced other aspects of Spark or other open source projects. In the case of Alluxio, we may have yet another AMPLab project that emerges to be a popular project in its own right.

Full disclosure: Michael Franklin and I are advisors to both Databricks and Alluxio, companies created by current and former members of the AMPLab.

Related resources:
The Spark video collection: 2016
Apache Spark 2.0: introduction to structured streaming
KeystoneML: Optimized large-scale machine learning pipelines on Apache Spark
Accelerating Spark workloads with GPUs
Running Spark on Alluxio with S3
Deep learning that's easy to implement and easy to scale
The O’Reilly Data Show Podcast: Anima Anandkumar on MXNet, tensor computations and deep learning, and techniques for scaling algorithms.

In this episode of the Data Show, I spoke with Anima Anandkumar, a leading machine learning researcher and currently a principal research scientist at Amazon. I took the opportunity to get an update on the latest developments in the use of tensors in machine learning. Most of our conversation centered on MXNet, an open source, efficient, scalable deep learning framework. I’ve been a fan of MXNet dating back to when it was a research project out of CMU and UW, and I wanted to hear Anandkumar’s perspective on its recent progress as a framework for enterprises and practicing data scientists. Here are some highlights from our conversation:

MXNet: An efficient, fast, and easy-to-use framework for deep learning

MXNet ships with many popular deep learning architectures that have been predefined and optimized to a great degree. If you look at benchmarks, and I'll be showing them at Strata, you get 90% efficiency on multiple GPUs and multiple instances. These scale up much better than the other packages. The idea is, if you are enabling deep learning on the cloud, efficiency becomes a very important criterion and will result in huge cost savings to the customer. In addition, MXNet is much easier to program in terms of giving users more flexibility. There is a range of different front-end languages the user can employ and still get the same performance. … For instance, in addition to Python, you can code in R, or even JavaScript if you want to run this in the browser. ... At the same time, there is also the mixed programming paradigm, which means you can have both declarative and imperative programming. The idea is, you need declarative programming if you want to do optimizations, because you need the computation graph to figure out how and where to do the optimizations.
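The trade-off Anandkumar describes can be illustrated with a toy example in plain Python. This is a sketch of the two styles, not MXNet's actual API: imperative code executes each step immediately, while declarative code first builds a computation graph that a framework could inspect and optimize before any data flows through.

```python
# Imperative style: each operation runs immediately, which makes the
# code easy to write and easy to debug step by step.
def imperative_square_sum(xs):
    squares = [x * x for x in xs]   # executes now
    return sum(squares)             # executes now

# Declarative style: first describe the computation as a graph of nodes,
# then run it. Because the whole graph exists before execution, a
# framework could fuse or reorder operations here.
class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

def square(node): return Node("square", node)
def reduce_sum(node): return Node("sum", node)

def run(node, value):
    """A trivial graph interpreter: walk the graph and evaluate each op."""
    if node.op == "input":
        return value
    (child,) = node.inputs
    x = run(child, value)
    if node.op == "square":
        return [v * v for v in x]
    if node.op == "sum":
        return sum(x)

graph = reduce_sum(square(Node("input")))   # nothing has executed yet
print(imperative_square_sum([1, 2, 3]))     # → 14
print(run(graph, [1, 2, 3]))                # → 14
```

Both styles compute the same result; the difference is when execution happens, which is exactly why the declarative form is the one amenable to whole-graph optimization.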
On the other hand, imperative programming is easier to write, easier to debug, and easier for the programmer to reason about sequentially. Because both options are available, the user can decide what best suits their needs: which parts of the program require optimization and which parts are amenable to imperative programming. In the benchmarks I'll show, it's not just about multiple GPUs on the same machine, but also multiple different instances. MXNet has parameter servers in the back end, which allow it to seamlessly distribute across either multiple GPUs or multiple machines.

Tensor computations, deep learning, and hardware

On one hand, if you think about tensor operations, what we call tensor contractions are extensions of matrix products. And if you look into deep learning computations, they involve tensor contractions. It becomes very important, then, to ask if you can go beyond the usual matrix computations and efficiently parallelize along different hardware architectures. For instance, if you think about BLAS operations, BLAS Level 1 covers vector operations. BLAS Level 2 covers matrix-vector operations. If you go to BLAS Level 3, you are looking at matrix-matrix operations. By going to higher-level BLAS, you're able to block operations together and get better efficiency. If you go to tensors, which are extensions of matrices, you need higher-level BLAS operations. In a recent paper, we defined such extensions to BLAS, which have been added to cuBLAS 8.0. To me, this is an exciting research area: how can we enable hardware optimizations for various tensor operations, and how would that improve the efficiency of deep learning and other machine learning algorithms?

Academia and industry

The opportunity here at AWS as a principal scientist has been a very timely and exciting one.
I've been given a lot of freedom to explore, to push ahead, and to make these algorithms available on the AWS cloud for everybody to use, and we'll be pushing ahead with many more such capabilities. At the same time, we're also, in a way, doing research here: asking how we can think about new algorithms, how we benchmark them with large-scale experiments, and how we talk about them at various conferences and other peer-reviewed venues. So, it's definitely a mix of research and development that excites me, and at the same time, I continue to advise students and continue to push the research agenda. Amazon is enabling me to do that and supporting me in that, so I see this as a joint partnership. I expect this to continue. I'll be joining Caltech as an endowed chair, and I'm looking forward to more such engagements between industry and academia.

Related resources:
A tensor renaissance in data science
Let’s build open source tensor libraries for data science
How big compute is powering the deep learning rocket ship
The Deep Learning Video Collection (Strata + Hadoop World 2016)
Podcast Details
First Episode: Sep 10th, 2015
Latest Episode: Oct 10th, 2019
Avg. Episode Length: 39 minutes