Software is eating the world.

This ruthless feasting is made possible by the fact that virtually everything in the world can be reduced to data, which, unfortunately, we humans really suck at, as the tech evangelist James A. Whittaker puts it. In other words, the availability of data is what lets software conquer the world. Within the software, what makes the data actionable is machine learning: algorithms that find patterns in the data, patterns that can then be used to take action. Until machines learn to train themselves, people who can harness the power of data with AI are in high demand. The rest of us remain in the passenger seat while the pedal is put to the metal.

Not all data is useful

I don’t necessarily agree with the often-heard claim that data is the new oil; data has no value unless it is cleverly utilized to gain knowledge. Nonetheless, AI is fueled by data, and here the oil analogy is actually quite apt: neither cars nor AI run on unrefined raw material. Presently, over 90% of AI solutions are based on supervised learning models that require labeled data for training. This means the algorithms need a large set of pre-interpreted examples that contain the “correct” answer, e.g., whether a picture shows a cat or a dog, whether a customer (in historical data) churned or not, or what the order volume was in the last month.
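To make the idea of labeled data concrete, here is a minimal, self-contained sketch of supervised learning in Python. The feature values and the tiny nearest-neighbour “model” are purely illustrative inventions for this post, not anyone’s actual pipeline; the point is only that every training example pairs raw features with a human-supplied answer.

```python
# Toy illustration of supervised learning: each training example pairs
# features with a human-provided label (the "correct" answer).
# The features (weight in kg, ear length in cm) are made up for illustration.

labeled_data = [
    ((4.0, 7.5), "cat"),
    ((3.5, 6.8), "cat"),
    ((30.0, 12.0), "dog"),
    ((25.0, 11.0), "dog"),
]

def predict(features):
    """Classify a new example by its nearest labeled neighbour."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(labeled_data, key=lambda ex: sq_dist(ex[0], features))
    return label

print(predict((3.8, 7.0)))    # near the cat examples -> "cat"
print(predict((28.0, 11.5)))  # near the dog examples -> "dog"
```

A real system would use far more examples and a far more capable model (e.g., a deep neural network), but the dependency is the same: no labels, no training.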

Fair enough, but have you ever thought about where this labeled data comes from? Is there a machine that can label the examples for us? Yes and no. For image classification and annotation, for example, Google and Microsoft offer APIs with pre-trained CNN models that anyone can use for generic problems, such as detecting which elements appear in a picture, each with a predicted probability.

However, if the business case is even slightly more specific than finding the people in a photo or differentiating between cats and dogs (or apples and oranges), e.g., finding faulty products on a conveyor belt, you’d better roll up your sleeves. The machines only handle cases they have been trained to handle.

Human-in-the-loop approach

Humans still need to refine the raw data into a form the machines can use. Great, we’re not useless after all! But wait: training a deep neural network takes a lot of training examples, a lot. Who’s going to go through all that data, especially if it is unstructured: pictures, video, audio, or text? During our recent trip to the San Francisco Bay Area, our last business meeting was with Figure Eight, a company that has come up with an ingenious solution to the data-labeling problem. In fact, the system is so good that several successful tech companies rely on it to make their virtual assistants happen.

This is how it works

Figure Eight provides a “human-in-the-loop” platform on which customers can post tasks and contributors can participate in them. The global contributor pool is ranked and leveled via in-house algorithms.

The customer provides detailed instructions on how the data needs to be handled and specifies the compensation per processed item (e.g., 5 cents). Here is the beauty of the platform: anyone can sign up as a contributor and earn money by performing labeling and annotation tasks. This creates a win-win situation: customers get huge data sets annotated at reasonable cost, and contributors earn extra income, which can be quite significant for many people around the world.

A great thing about this human CPU is that it scales rather freely with demand; there are, after all, over 7 billion people on the planet today. If a customer is in a hurry with their task, all they need to do is increase the reward per item, which is basically like spinning up more virtual machines in the cloud. The customer can also select the performance level of the contributors, again akin to choosing the performance tier of a virtual machine.

But wait, people make mistakes, right? Of course, by definition (errare humanum est). Figure Eight accounts for this by letting the customer choose how many times each piece of data is processed, say 10 times. The answers are then aggregated, which keeps the labeling as accurate as possible.
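The article doesn’t specify how Figure Eight aggregates the repeated judgments, but a simple, common scheme is a majority vote per item, sketched here:

```python
from collections import Counter

# One way to aggregate repeated judgments: majority vote per item.
# (The exact aggregation method used by the platform is not specified
# in this post; majority voting is just a common, simple choice.)

def majority_vote(judgments):
    """Return the most frequent label among the contributors' answers."""
    return Counter(judgments).most_common(1)[0][0]

# Ten contributors label the same image; two of them make a mistake:
answers = ["cat"] * 8 + ["dog"] * 2
print(majority_vote(answers))  # cat
```

With 10 independent judgments, a couple of individual errors get outvoted, which is why redundancy makes the aggregate label far more reliable than any single contributor.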


Human-in-the-loop approach for labeling/annotating unstructured data.

Bottom line

Again, refining unstructured data to make it useful for machines is no marginal business; software is eating the world, remember? Figure Eight counts Adobe, eBay, Oracle, SAP, and Spotify among its customers, to name just a few. Given the enormous expansion of embedded AI in all kinds of software within just one year, mostly related to image, text, and speech recognition, the need for data labeling is going to explode.

I can think of several applications for the Figure Eight service right away. It may be that at some point machines acquire “knowledge” of everything and can thereafter train themselves. That, however, is unlikely to happen in the immediate future, and until then, we need to do the work. So if you thought AI was going to put people out of work, think again 😉