Bram Zijlstra, Machine Learning Engineer: Video Service Providers have a lot of video content to offer viewers, but they struggle to provide the user experience that viewers have come to expect nowadays. One of the challenges Video Service Providers face is that much of their content is represented by a generic stock image. This is a problem because viewers’ attention spans are shrinking every year. When browsing through a catalogue, users decide in a split second whether they want to watch a programme or not. To make their content more appealing, Video Service Providers should focus on improving their user experience – including the images exposed on their interface.
Joanna Krajewska, Machine Learning Engineer: Our Image Distillery™ solution solves this problem for TV operators. For each programme, Image Distillery™ provides relevant and appealing smart images taken from the video source. Thanks to fully automated technology, unique images are delivered in real time.
Bram: What makes something appealing is pretty subjective, as beauty is in the eye of the beholder. However, there are definitely general differences between pretty and ugly images.
A visually appealing image tends to be bright, sharp, and colourful, and it elicits an emotion. Our eyes are caught by other people’s faces, groups of people, animals, and bright colours – in that order. We use machine learning to quantify this, so we can find the frames that catch users’ attention.

But when an image does not reflect the actual content, viewers quickly become annoyed. Being “pretty” is not enough! Viewers want an image to be informative as well. For example, a movie should have the lead actress in the image, the news should show the main topic, and a nature documentary should show a beautiful landscape. The image also needs to convey an emotion that suits the show; an emotional documentary should not have a funny image.
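As a rough illustration of how such appeal signals could be quantified, the sketch below scores a frame on brightness, sharpness, colourfulness, and face count using OpenCV. The specific metrics and thresholds are illustrative assumptions, not the actual Image Distillery™ model.

```python
# A minimal sketch of quantifying frame "appeal" signals with OpenCV.
# The metrics are illustrative assumptions, not the actual model.
import cv2
import numpy as np

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def appeal_signals(frame_bgr: np.ndarray) -> dict:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)

    # Brightness: mean luminance, normalised to [0, 1].
    brightness = gray.mean() / 255.0

    # Sharpness: variance of the Laplacian (higher = crisper edges).
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()

    # Colourfulness: Hasler & Suesstrunk's metric on opponent colour axes.
    b, g, r = cv2.split(frame_bgr.astype("float32"))
    rg, yb = r - g, 0.5 * (r + g) - b
    colourfulness = (np.hypot(rg.std(), yb.std())
                     + 0.3 * np.hypot(rg.mean(), yb.mean()))

    # Faces draw the eye first; count them with a simple Haar cascade.
    faces = _face_cascade.detectMultiScale(gray, 1.1, 5)

    return {"brightness": brightness, "sharpness": sharpness,
            "colourfulness": colourfulness, "faces": len(faces)}
```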
Joanna: I agree with Bram’s description. I would only add that a good image should be easy to associate with the episode, programme, or film it represents. One quick glance should tell users whether they have already watched it or not. If you are a TV show addict like I am, you know why that’s important. When we talk about “catching users’ attention”, we do not mean selecting clickbait images; we do not want to lie to users, just make their lives easier.
Joanna: In order to find a good image, you first need to be able to separate frames that belong to the content from frames that do not. The EPG start and end times are rarely correct, so we use our EPG Correction Distillery™ technology to find the actual boundaries of a programme. We also don’t get information about ad breaks, so we need to detect when advertisements start and end to filter these out.
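A minimal sketch of that filtering step, assuming the corrected boundaries and ad-break intervals are already available from the boundary and ad detection (the names below are placeholders, not Image Distillery™ APIs):

```python
# Keep only frame timestamps inside the corrected programme boundaries
# and outside detected ad breaks. Timestamps are in seconds; the inputs
# are assumed outputs of upstream detection, named here for illustration.
def frames_in_programme(timestamps, corrected_start, corrected_end, ad_breaks):
    def in_ad(t):
        return any(start <= t < end for start, end in ad_breaks)
    return [t for t in timestamps
            if corrected_start <= t < corrected_end and not in_ad(t)]
```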
Then we detect channel overlays to remove each channel’s visual cues: we use logo detection to locate channel logos and text detection to find subtitles. We remove these parts from the image and rate the image on a scale from 1 to 10. This rating is based on the composition of the shot, distortions in the video source, the faces we detect, and other factors.
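A hypothetical sketch of those two steps: detected overlay regions are blanked out before scoring, and a weighted combination of per-factor scores is mapped onto the 1-to-10 scale. The box format, factor names, and weights are assumptions for illustration, not the product’s actual scoring function.

```python
import numpy as np

def mask_overlays(frame: np.ndarray, boxes: list) -> np.ndarray:
    """Blank out overlay regions; boxes are (x, y, w, h) from logo/text detection."""
    clean = frame.copy()
    for x, y, w, h in boxes:
        clean[y:y + h, x:x + w] = 0
    return clean

def rate_frame(factor_scores: dict, weights: dict) -> float:
    """Map a weighted mean of per-factor scores in [0, 1] to the 1-10 scale."""
    total = sum(weights.values())
    combined = sum(weights[k] * factor_scores[k] for k in weights) / total
    return 1.0 + 9.0 * combined
```

For example, `rate_frame({"composition": 0.8, "faces": 0.6}, {"composition": 1.0, "faces": 2.0})` would yield about 7.0 under this illustrative weighting.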
Bram: One of the challenges we have faced was combining several Machine Learning models. Our API performs a lot of different detections, and combining them all was not an easy task.
Another challenge was making sure we could run the detections in time. For each TV channel that uses Image Distillery™, we need to analyse one frame per second. Running a handful of machine learning models within that one-second budget was not easy to implement, so scalability was a big challenge for us.
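One common way to stay inside such a per-frame time budget is to run independent detectors concurrently. The sketch below illustrates the idea with placeholder detector functions; it makes no claim about how Image Distillery™ actually schedules its models.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse_frame(frame, detectors, budget_s=1.0):
    """Run all detectors on one frame, roughly within a one-second budget.

    detectors: mapping of name -> callable(frame); names are placeholders.
    Raises TimeoutError if a detector overruns its wait.
    """
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, frame)
                   for name, fn in detectors.items()}
        # Each wait is capped at the budget; a strict deadline would share
        # one clock across all futures rather than waiting per detector.
        return {name: fut.result(timeout=budget_s)
                for name, fut in futures.items()}
```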
Additionally, the sheer variety of content out there is a major challenge! It is hard to create a “one size fits all” solution when content differs so much. For example, for a movie you want the main actor, but for TV news that rule would mean you always pick the anchorwoman. And while we usually prefer images of faces, for a nature documentary you would rather see a nature shot.
Bram: Image Distillery™ is a unique solution for Video Service Providers because we produce our stills fully automatically. Companies that create images usually keep a human in the loop, but that becomes a lot of manual work when a client has over 50 channels.
Another thing that makes our solution unique is that we don’t gather user data. We provide the images we think work best, without tracking user engagement. Selecting images based on engagement has been a sensitive topic in the past.
Joanna: I find it important to highlight that we develop a solution for TV and not for video-on-demand, which makes it ten times harder to execute. When you compare our methodology and product to those of Netflix or YouTube, they don’t even have to think about our biggest challenges. For example, Netflix knows exactly which videos they have and which videos need images. They know exactly which frames belong to a programme, and with a limited number of new shows per day they can even annotate content manually; they also know exactly when a title will go online, so they can prepare beforehand. Additionally, they collect user data: they know which images you like and which images you are most likely to click on. They obviously use this information to create images “just for you”.
Video Service Providers deliver constant incoming streams of frames, with some metadata about what’s inside, but that metadata is usually not 100% correct and has missing information. On our end, we must find the exact beginning and end of a programme, locate the ads within it, and clear the remaining frames of additional “noise”. By noise, we mean things like the channel logo, programme-specific overlays (e.g. a graphic showing who is currently speaking in a talk show), subtitles, and much more.
Joanna: To create an effective image, the best practice is to combine as many analyses as possible. Every type of analysis can provide some meaningful information about whether a frame could make a good image: detecting people, analysing the colour composition, and so on. As a developer, it’s easy to fixate on one piece and try to perfect it. But what we have found is that combining all these factors and performing as many different operations as possible within a second works a lot better than trying to do everything perfectly. This means that sometimes we miss a perfect composition of a person, but as we iterate on our analyses, these misses happen less and less often.
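The selection logic this implies can be sketched as follows: run many cheap analyses on each candidate frame and pick the frame with the best combined score, rather than perfecting any single analysis. The analysis functions and the equal weighting here are illustrative assumptions.

```python
# Pick the best candidate frame by averaging the scores of many
# analyses, each a placeholder callable returning a value in [0, 1].
def best_frame(frames, analyses):
    def combined(frame):
        return sum(fn(frame) for fn in analyses.values()) / len(analyses)
    return max(frames, key=combined)
```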
Bram: We want to have a real understanding of the content and select images based on that information. Our Machine Learning team is currently working on Topic Detection, and we want to incorporate that into our current methodology and process. (So, a talk show that is 70% about football should show an image of football.)
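A hypothetical sketch of that topic-matching idea, assuming topic labels per segment and per candidate frame as inputs; the topic detection itself is still under development, so everything below is a placeholder.

```python
from collections import Counter

def dominant_topic(segment_topics: list) -> str:
    """Return the most frequent topic label across a programme's segments."""
    return Counter(segment_topics).most_common(1)[0][0]

def frames_matching_topic(frames_with_topics, segment_topics):
    """Keep candidate (frame, topic) pairs whose topic matches the dominant one."""
    topic = dominant_topic(segment_topics)
    return [frame for frame, t in frames_with_topics if t == topic]
```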
Joanna: Yes, having contextual images would be great. We would also like to develop an algorithm for enhancing the image quality and composition, reflecting all the best practices used by photographers.
October 1, 2020