There are a few things that make Artificial Intelligence, and more specifically Deep Learning, so successful these days: data, data and data. Big tech companies use tremendous amounts of data to feed their algorithms and provide optimal results for their users. For example, Google is able to feed its search algorithm with over 3.5 billion searches every day, reinforcing the algorithm with the clicks people perform after a query and optimizing the ranked results for users. Another example is Facebook, which is nearing human-level face recognition by using over 4 million face images annotated by people (remember when you told Facebook your friend Jeff was in the picture you uploaded?). They used this data to train their neural networks and are now able to tell you which individuals appear in an image (as long as they are in their verification set, of course). Time after time, the big tech companies prove that having a huge amount of data is the key to success in Artificial Intelligence. This justifies a question from the audience at the World Summit AI in Amsterdam last year: a Google chief was asked whether big tech companies should open source their (anonymised) data rather than their algorithms and computing power, a question that was answered with some mumbling about privacy and business models.
At Media Distillery we also have data. Lots of it. Around the clock we record somewhere between 250 and 300 radio and TV channels, most of which are TV channels. Doing the math for the video alone: if we extract 1 frame per second of roughly 400 kB (depending on the amount of JPEG compression), we get ~250 (channels) * 60 (s) * 60 (min) * 24 (h) = 21,600,000 frames per day. Multiplied by the 400 kB mentioned above, that is around 8,640,000,000 kB of video frames flowing through our system every day, or roughly 8.6 TB of data for the video frames alone! All these frames are not just captured from a broadcast stream; they are also processed by multiple Deep Networks that perform the analyses for us in a scalable way, which makes our platform pretty robust.
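For those who like to check the back-of-the-envelope math, it fits in a few lines of Python. The channel count, frame rate and frame size are the rough figures quoted above, not exact measurements:

```python
# Rough estimate of the daily frame volume described above.
channels = 250            # roughly 250-300 channels are recorded
frames_per_second = 1     # 1 frame extracted per second per channel
frame_size_kb = 400       # ~400 kB per JPEG frame, depending on compression

seconds_per_day = 60 * 60 * 24
frames_per_day = channels * frames_per_second * seconds_per_day
data_per_day_kb = frames_per_day * frame_size_kb

print(f"{frames_per_day:,} frames per day")                 # 21,600,000
print(f"~{data_per_day_kb / 1e9:.1f} TB of frames per day")  # ~8.6 TB
```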
You might think that, with so much data, we must have pretty accurate detection algorithms. While that is not untrue, it is only partly because of the amount of data we process every day. The reason is that all this video data does not contain any labels, or as we call it, it is unsupervised. The Deep Learning models (more specifically Convolutional Neural Networks) that we apply to the incoming video data are trained in a supervised fashion. This means that we have datasets containing labelled examples of the objects, faces or logos we want to detect, and using these examples we teach the algorithms which pixels to identify as which objects.
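To make the distinction concrete, here is a minimal sketch of what supervised training looks like, using Keras and dummy data purely as an illustration (these are not our actual models or datasets): every image comes paired with a label, and without those labels the network has nothing to learn from.

```python
import numpy as np
import tensorflow as tf

# Dummy labelled data standing in for a real dataset:
# each image comes with a label saying what it contains.
images = np.random.rand(100, 64, 64, 3).astype("float32")  # 100 RGB frames
labels = np.random.randint(0, 5, size=100)                  # 5 example classes (e.g. logos)

# A tiny convolutional network, purely for illustration.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Supervised training: the network only learns because every image has a label.
model.fit(images, labels, epochs=3)
```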
So if we want to use our huge amounts of data to train our networks, we need to tell the algorithms what is in the incoming images. To do so, we built a pipeline with a so-called human-in-the-loop, which, in our case, is a mechanical-turk-in-the-loop. For example, if we want to recognize a new logo, we gather a number of images from the Internet that we can use to train our algorithms. Based on these images a new model is trained, which now knows the new logo, and this new model is immediately deployed on our production servers so we quickly get feedback on its performance on our production data (the video frames I mentioned earlier). The frames it classifies are stored on our servers and fed to an Amazon service called ‘Mechanical Turk’, which sends them to people all over the world who check whether the decisions our algorithm made are actually correct: correct decisions are reinforced and wrong ones are penalised. The outcomes of these human-in-the-loop checks are then fed back into the algorithm, creating a better, updated version of our model. If we repeat this process, which we call an Iteration, a couple of times, we end up with a new, high-quality logo model that feeds our search engine with accurate logo detections, which our customers use to monitor their brands on television, for instance. The sketch below outlines what one such Iteration looks like.
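As a rough illustration, one Iteration of this loop could look like the following Python sketch. The helper functions (train_model, deploy, collect_detections, send_to_mechanical_turk, gather_seed_images) are hypothetical placeholders standing in for our internal services and the Mechanical Turk integration, not actual APIs:

```python
# Hypothetical sketch of one "Iteration" of the human-in-the-loop pipeline.
# All helper functions below are placeholders for illustration only.

def run_iteration(logo_name, labelled_examples):
    # 1. Train a new model that includes the new logo.
    model = train_model(labelled_examples)

    # 2. Deploy it to production so it immediately sees real broadcast frames.
    deploy(model)

    # 3. Collect the frames the model classified as containing the logo.
    candidate_frames = collect_detections(model, logo_name)

    # 4. Send those frames to Mechanical Turk workers for verification.
    verdicts = send_to_mechanical_turk(candidate_frames)

    # 5. Verified detections become new positive examples; rejected ones
    #    become hard negatives for the next round of training.
    for frame, is_correct in verdicts:
        labelled_examples.append((frame, logo_name if is_correct else "negative"))

    return labelled_examples

# Repeating this a few times refines the model with production data:
# examples = gather_seed_images("new_logo")
# for _ in range(5):
#     examples = run_iteration("new_logo", examples)
```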
Turning unsupervised data into supervised data is the holy grail of AI, and we are looking for it.
As we have seen, one of the limitations of the current state of the art in Deep Learning is the huge amount of data it needs. And while Media Distillery has huge amounts of data, our limitation is that most of it is unsupervised. If we can come up with strategies to turn this unsupervised data into supervised data, with semi-supervised learning strategies as described above, or by using things like ensembles of models (multiple models whose predictions reinforce each other, for example through majority voting), the world would be surprised by the amount of awesomeness we would get from our models. And, even better, we would bestow upon our customers and stakeholders the holy grail of Deep Learning!
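To give an idea of the majority-voting part, here is a minimal sketch of how an ensemble could pseudo-label an unlabelled frame. It assumes each model exposes a predict(frame) method returning a label, which is an assumption for this sketch rather than a description of our production code:

```python
from collections import Counter

def majority_vote_label(frame, models, min_agreement=2):
    """Pseudo-label an unlabelled frame using an ensemble of models.

    `models` is any collection of trained classifiers with a
    `predict(frame)` method (an assumed interface). The frame only
    receives a label when at least `min_agreement` models agree;
    otherwise it stays unlabelled (None).
    """
    votes = Counter(model.predict(frame) for model in models)
    label, count = votes.most_common(1)[0]
    return label if count >= min_agreement else None
```

Frames that receive a confident pseudo-label this way can then be added to the supervised training set, while the ambiguous ones can be routed to the human-in-the-loop pipeline described above.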
January 25, 2018