Who wouldn’t want a palantír, the crystal ball from The Lord of the Rings? A device so powerful it can predict everything. If that is your dream, look no further. This “ball” works by anticipating the temporal occurrence of activities. In plain English: software predicts what the subject (a chef, in the case of this research) will do next. It looks into the future simply by analysing behavioural patterns, with an accuracy rate of over 40%.
The research group of Prof. Dr. Jürgen Gall at the University of Bonn wants to push artificial intelligence (AI) beyond pattern recognition. Imagine a program that learns the typical sequence of actions in a task like cooking from video material. Based on this knowledge, the program can predict what the chef will do next, which is impressive even at this early stage. The team’s findings were presented at the Conference on Computer Vision and Pattern Recognition (CVPR), the world’s largest computer vision conference, held June 19–21 in Salt Lake City, Utah, USA.
Yazan Abu Farha, Alexander Richard and Juergen Gall developed self-learning predictive software that estimates the timing and duration of future activities, with predictions stretching several minutes into the future. Accuracy depends on how far ahead the researchers want to look: the software achieves over 40% accuracy for short forecast horizons, but this drops to about 15% for predictions more than three minutes ahead.
The training data for the algorithm included 40 videos, in each of which a chef prepared a salad. Each recording was about 6 minutes long and contained an average of 20 different actions. The videos were also annotated with the start time and duration of every action. The predictive software “watched” all the videos, totalling around 4 hours, and learned sequences of actions: which actions typically follow each other and how long they last. Interestingly, each chef has his or her own style of preparing a meal, and the sequences may also vary depending on the recipe. After all, the same humble brownie can taste different in New York and in Santa Cruz, California.
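To make the setup concrete, here is a minimal sketch of how such per-video annotations might be represented. The field and action names are illustrative assumptions, not the authors’ actual data format:

```python
# Hypothetical sketch of the action annotations described above; the
# class, field and action names are illustrative, not the authors'
# real data format.
from dataclasses import dataclass

@dataclass
class ActionSegment:
    label: str        # e.g. "cut_tomato"
    start_sec: float  # when the action starts in the video
    end_sec: float    # when it ends

    @property
    def duration(self) -> float:
        return self.end_sec - self.start_sec

# One salad video annotated as a sequence of labelled, timed actions
video_annotation = [
    ActionSegment("cut_tomato", 0.0, 12.5),
    ActionSegment("place_tomato_into_bowl", 12.5, 18.0),
    ActionSegment("cut_cucumber", 18.0, 31.0),
]

total = sum(seg.duration for seg in video_annotation)
print(f"{len(video_annotation)} actions, {total:.1f} s annotated")
```

From a list like this, a learner can read off both the transition statistics (which action follows which) and the typical duration of each action.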
The team uses two systems for prediction: the first is based on a recurrent neural network (RNN), the second on a convolutional neural network (CNN). Much like its namesake, the Cable News Network, this CNN delivers the information; the rest is up to the audience.
With the RNN, the algorithm predicts the future as a recursive sequence. Given the observed segments as input, the RNN predicts the remainder of the last segment as well as the next segment, and this procedure is repeated until the desired number of future frames is reached. With CNN-based anticipation, the algorithm predicts all future actions in a single step; it does not rely on a recursive strategy like the RNN.
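The recursive strategy can be sketched in a few lines. This is a schematic toy, not the authors’ implementation: `predict_next` stands in for the trained RNN (here a hard-coded lookup so the loop is runnable), and all names and durations are made-up assumptions:

```python
# Schematic sketch of the recursive anticipation loop described above.
# `predict_next` is a toy stand-in for the trained RNN; the transition
# table and durations are invented for illustration only.

def predict_next(observed):
    """Return (label, duration_sec) of the predicted next segment,
    given all (label, duration_sec) segments seen so far."""
    transitions = {
        "cut_tomato": ("place_tomato_into_bowl", 5.0),
        "place_tomato_into_bowl": ("cut_cucumber", 13.0),
        "cut_cucumber": ("mix_ingredients", 20.0),
    }
    last_label = observed[-1][0]
    return transitions.get(last_label, ("end", 10.0))

def anticipate(observed, horizon_sec):
    """Recursively append predicted segments until the horizon is covered."""
    predicted, covered = [], 0.0
    sequence = list(observed)
    while covered < horizon_sec:
        label, dur = predict_next(sequence)
        predicted.append((label, dur))
        sequence.append((label, dur))  # feed the prediction back in
        covered += dur
    return predicted

future = anticipate([("cut_tomato", 12.5)], horizon_sec=30.0)
print(future)
```

The key design point is the feedback loop: each predicted segment is appended to the observed sequence and becomes input for the next prediction, which is also why errors can accumulate over longer horizons. The CNN variant avoids this by emitting the whole future in one shot.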
Although the training data consisted of 40 videos, the experimental phase also drew on a larger dataset of 1,712 videos showing 52 actors making breakfast. Overall, there were 47 fine-grained action classes, with an average of 6 action instances per video, and the average video duration was 2.3 minutes.
To evaluate the results objectively, the researchers confronted the software with videos it had not seen before, again showing salad preparation. For the test, the software was told what happens in the first 20% or 30% of a new video; the algorithm then had to predict what happens in the rest of it.
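The evaluation protocol can be illustrated with toy data: reveal the first 20% of a video’s frame labels, predict the remainder, and score frame-wise accuracy on the hidden part. The labels and the naive baseline here are assumptions for illustration, not the paper’s actual numbers or metric details:

```python
# Sketch of the evaluation protocol described above, on toy data:
# reveal the first 20% of frame labels, predict the rest, and score
# frame-wise accuracy on the unobserved part.

def frame_accuracy(predicted, truth):
    """Fraction of frames whose predicted label matches the ground truth."""
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)

ground_truth = ["cut"] * 40 + ["mix"] * 30 + ["serve"] * 30  # 100 "frames"
observed_part = ground_truth[:20]   # the first 20% is given to the model
hidden_part = ground_truth[20:]     # the model must predict this part

# A naive baseline: assume the last observed action simply continues.
naive_prediction = [observed_part[-1]] * len(hidden_part)
print(f"naive accuracy: {frame_accuracy(naive_prediction, hidden_part):.2f}")
```

Even this crude “the current action continues” baseline scores some frames correctly, which is why a learned model must clearly beat it to demonstrate genuine anticipation.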
In the presented research it is not the algorithm that is novel, but rather the area of application. From the AI’s point of view, it does not matter whether it is trained on sound, images or strings of words. Nevertheless, the spectrum of training data in contemporary AI is still quite limited.
From the research community, Dr. Gall and his colleagues expect nothing more than a basic understanding of the new field of ‘activity prediction’. There is a lot to do, and the field is still in its infancy. The researchers want the software to get better at recognizing on its own what is happening in the first part of a video; right now, that information is provided by humans.
The team from Bonn introduced two methods for predicting future actions in videos, an interesting idea that has barely been explored before. Most prediction approaches focus on early anticipation of an ongoing action or predict only the single next action, whereas the methods proposed by Gall’s team target content minutes ahead. The gathered data suggest the need to improve recognition during the early, observed part of a video; only then can the research move beyond an experiment with impressive results but probably limited practical use.
The software requires optimization, but it is already interesting. Recurrent and convolutional neural networks are used in robotics, for example for path planning, and algorithms that predict the future could be used in autonomous vehicles. Imagine a car accident on the motorway: by the time your Google map detects the traffic jam, it is usually already too late to find an alternative route. If the autonomous cars of the future detect accidents, they will be able to share that knowledge with other cars in real time, and the AI will be able to predict whether it is better to wait in the traffic jam or to choose an alternative route. In such an application, the AI would be trained on accident data and the traffic patterns that follow accidents, correlated with the time of day. That would give the software the “predictive edge” it needs.