Final version

After some more tweaking and setting parameters I thought I should never touch, I managed to create a prediction/detection class that calculates probabilities for different numbers of motifs and maximum lengths of these motifs.

Here are the results, for 14 timesteps, for number = 5 and length = 10 (top) and length = 20 (bottom), for a static scene (left) and a nonstatic scene (right).

See git for the pdf of the final version of my thesis!

Test results

I’ve created 10 reactions based on found optical flow patterns that anticipate (or should anticipate) a motif. To find the optical flow that each frame in a motif describes, I’ve used a very unorthodox method based on the probability of a word given a motif. I’m not sure if it’s correct, but it gets me results and that’s what counts.

I’ve tried penalty shootouts for combinations of N=5, N=10 (the number of motifs) and T=10, T=20 (the maximum length of these motifs). Of these combinations, N=5 and T=10 worked best. I would’ve tried more combinations, but training time remains high, even for very short motifs. Here are the motifs found for N=10, T=20:

It is clearly visible that the motifs found are much shorter than 20 timesteps. Penalty shootouts typically last 15 seconds or more, so I guess this was to be expected. It seems that, if this method is to be used by DNT, the fps should be even higher to gain more timesteps per motif. However, the comparison (event detection in real-life images, as a realtime application) still takes about 0.7 seconds.

Matlab vs Octave

It seems that my Python implementation of PLSM, based on the Matlab version I received earlier, does not behave as it should. It finds only 3 motifs when I ask for 5, whereas the Matlab implementation is able to find all 5. I’ve therefore decided to stick with Matlab, for now.

I’ve managed to create motifs that clearly occur in the temporal document (images will follow). However, the found motifs are still dependent on absolute location. I’m still not quite sure how to handle this lack of location invariance: each activity can occur at multiple locations in the image, but will be seen as a different activity because of the location words. Removing the location words entirely will not work, as an action on the left side of the image and the same action on the right side of the image ask for a different reaction. I’m thinking of reducing the number of cells to a much smaller number: instead of quantizing to every 10×10 pixels, dividing the entire image into about 16 regions seems just about right. This will not only greatly reduce the vocabulary size, but will also take care of the problem that occurs now, e.g. every time an activity occurs in the top left corner, it is seen as completely different from the SAME activity when it occurs slightly lower.
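
To get a feeling for what the coarser grid would do, here is a minimal sketch. It assumes the 320 x 240 images and 5 flow categories from the earlier post, and it assumes "about 16 regions" means a 4 x 4 layout; the function names are made up for this illustration and are not part of the PLSM code.

```python
def cell_index(x, y, width=320, height=240, cols=4, rows=4):
    """Map a pixel position to a coarse grid cell index (row-major).

    The 4 x 4 layout is an assumption; any cols/rows can be passed in.
    """
    col = min(x * cols // width, cols - 1)
    row = min(y * rows // height, rows - 1)
    return row * cols + col


def vocabulary_size(cols, rows, n_directions=5):
    """Number of visual words: one per (cell, flow category) pair."""
    return cols * rows * n_directions


fine = vocabulary_size(32, 24)   # current grid of 10x10 pixel cells
coarse = vocabulary_size(4, 4)   # assumed 4 x 4 grid

# Two positions that differ slightly in height now share a location word.
same_cell = cell_index(5, 5) == cell_index(5, 40)
```

With these assumptions the vocabulary shrinks from 3840 to 80 words, and an activity in the top left corner keeps the same location word when it shifts down by a few dozen pixels. Activities that straddle a cell border would still be split, so this reduces the problem rather than removing it.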


I’ve also noted, and this makes a huge difference, that Octave is about 200 times slower than Matlab on the same system when it comes to the PLSM implementation. Which probably means that I’ve been wasting some time. Oh well.

Connected component analysis

As a topic may consist of several individual activities (which perhaps co-occur by chance), these can be split using connected component analysis.

It surprises me that I have not seen this method of blob recognition before, as it is a very basic one – especially the two-pass algorithm, which is extremely easy to implement. Perhaps this method can be applied to enhance the existing vision functions of the Dutch Nao Team, especially when it comes to object classification.
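
For reference, a minimal sketch of the two-pass algorithm with 4-connectivity, using a small union-find table to record which provisional labels turned out to be the same blob. How a topic is turned into a binary mask (for example by thresholding its word probabilities per cell) is left out here and would be an assumption on my part; `label_components` is a name made up for this sketch.

```python
import numpy as np


def label_components(mask):
    """Two-pass connected component labeling (4-connectivity) of a 2D mask.

    Returns (labels, n): labels is an int array with 0 for background and
    1..n for the components.
    """
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    parent = [0]  # parent[i]: union-find parent of provisional label i

    def find(i):
        # Follow parents to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rows, cols = mask.shape
    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c]:
                continue
            up = int(labels[r - 1, c]) if r > 0 else 0
            left = int(labels[r, c - 1]) if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new label, its own root
                labels[r, c] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                root = min(root_up, root_left)
                parent[max(root_up, root_left)] = root
                labels[r, c] = root
            else:
                labels[r, c] = up or left

    # Second pass: replace provisional labels by consecutive final labels.
    final = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r, c]:
                root = find(int(labels[r, c]))
                labels[r, c] = final.setdefault(root, len(final) + 1)
    return labels, len(final)


u_shape = [[1, 0, 1],
           [1, 0, 1],
           [1, 1, 1]]
u_labels, u_count = label_components(u_shape)
```

The U shape is the classic case that needs the second pass: its two arms get different provisional labels and are only merged when the scan reaches the bottom row.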


The PLSM code I received from J. Varadarajan, R. Emonet and M. Odobez has been ported to Python. Many thanks to them; all credit for this method goes to the authors. If you’re interested in their (other) works, have a look at their respective homepages. The port took me longer than I expected, as Matlab and numpy are not very compatible with each other (off-by-1 errors, functions with the same name but different uses, etc.). I can say that this page was a great help.
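
To illustrate the kind of mismatch meant here, a few well-known differences between Matlab and numpy. These are generic examples and not necessarily the exact bugs I ran into during the port.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# Indexing starts at 0 in numpy; Matlab would write A(1, 1).
first = A[0, 0]

# Matlab's sum(A) sums each column of a matrix; numpy's sum() without
# an axis sums every element.
total = A.sum()
column_sums = A.sum(axis=0)  # the equivalent of Matlab's sum(A)

# Matlab's 1:3 includes the end point; np.arange(1, 3) does not.
matlab_like_range = np.arange(1, 3 + 1)

# reshape fills row by row in numpy, column by column in Matlab.
row_major = A.reshape(3, 2)
col_major = A.reshape(3, 2, order="F")  # matches Matlab's reshape(A, 3, 2)
```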

As the PLSM method can now be used, all that is left is to create temporal documents of size V x T (V = vocabulary size, T = number of timesteps) that will serve as input for the main function. The current matrices representing temporal documents are still filled with low-level features (categorized optical flow vectors quantized to 10×10 pixel cells in the image). As the images are of size 320 x 240, this gives us 32 x 24 x 5 = 3840 visual words (5 for the static, up, down, left and right categories). And that’s a lot.
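
As an illustration of how such a temporal document could be filled, here is a rough sketch. The ordering of the words (cell first, then flow category) and the `(t, x, y, direction)` observation format are assumptions for this example, not necessarily how my own code lays things out.

```python
import numpy as np

WIDTH, HEIGHT, CELL = 320, 240, 10
COLS, ROWS = WIDTH // CELL, HEIGHT // CELL  # 32 x 24 cells
DIRECTIONS = ("static", "up", "down", "left", "right")
V = COLS * ROWS * len(DIRECTIONS)           # vocabulary size


def word_index(x, y, direction):
    """Visual word for a flow vector at pixel (x, y) with a flow category."""
    col, row = x // CELL, y // CELL
    cell = row * COLS + col
    return cell * len(DIRECTIONS) + DIRECTIONS.index(direction)


def temporal_document(observations, n_timesteps):
    """Build a V x T count matrix from (t, x, y, direction) observations."""
    doc = np.zeros((V, n_timesteps), dtype=int)
    for t, x, y, direction in observations:
        doc[word_index(x, y, direction), t] += 1
    return doc


observations = [(0, 5, 5, "up"), (0, 7, 3, "up"), (2, 319, 239, "right")]
doc = temporal_document(observations, n_timesteps=3)
```

The first two observations fall in the same 10×10 cell with the same category, so they end up as a count of 2 for a single word at timestep 0.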

The solution to this is the use of PLSA combined with connected component analysis. Thanks to M. Blondel, I now have an implementation of the PLSA algorithm. To use it, I will have to map the current representation (an array of size V x T x D) to a usable format, namely an array of size V x D. If I’m correct, this is done by taking a certain number of frames f at an absolute time and using the words occurring in them to form a count vector of size V. However, I’m still not entirely sure how to keep track of the temporal information here.
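
One possible reading of this mapping, as a sketch only: cut a single V x T temporal document into consecutive windows of f frames and sum the word counts inside each window, so that every window becomes one PLSA document. The function name and the choice of non-overlapping windows are assumptions for this example.

```python
import numpy as np


def window_counts(doc, f):
    """Turn a V x T count matrix into a V x D matrix of per-window counts.

    Each of the D = T // f windows covers f consecutive frames; trailing
    frames that do not fill a whole window are dropped.
    """
    doc = np.asarray(doc)
    n_words, n_frames = doc.shape
    n_windows = n_frames // f
    trimmed = doc[:, :n_windows * f]
    # Group the frames per window, then sum the counts inside each window.
    return trimmed.reshape(n_words, n_windows, f).sum(axis=2)


example = np.arange(12).reshape(2, 6)  # V = 2 words, T = 6 frames
per_window = window_counts(example, f=3)
```

Summing inside a window throws away the order of the frames in that window, which is exactly the temporal information I am worried about losing; only the order of the windows themselves survives.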