# Breaking the 4th dimensional wall of object tracking


On my first Monday morning in Copenhagen, Konstantinos and I met the guys working at the Niels Bohr Institute and we had a delicious breakfast together. Seriously, Danish pastries are so good I can understand why the Danes named them after themselves!* Soon after that, my supervisor handed me a sheet of paper with some notes about my project, “Tracing in 4D data” (check out my introduction blog post if you haven’t read it!).

By the afternoon of the same day I had already completed 4 of the 6 points on the list: getting familiar with my project, with OpenCV and its implementations of various 2D object tracking algorithms, and having fun testing them with my laptop’s webcam. The next point was “Move to 3D”. That’s what I have been working on since then, and it’s what I am going to talk about in this blog post.

In the next two sections you’ll find a description of the object tracking algorithms I implemented during my project. If you’re not interested in the math, you can skip directly to the last section where you can see some nice visualizations of the testing results of the algorithm!

*This statement is not true and is made only for comic effect. Use this joke at your own risk. Danish pastries are actually called that everywhere except in Denmark, where they are called Vienna bread (wienerbrød), as they were first made in Denmark in 1840 by Viennese chefs!

## Median Flow

Forward-backward consistency assumption. [1]

Median Flow [1] is the algorithm I finally chose to generalise to 4D data, among the ones available in OpenCV, after reading some academic papers. It is mainly based on one idea: the algorithm should be able to track the object regardless of the direction of time flow. This concept is well explained by the image on the right. Point 1 is visible in both images and the tracker is able to track it correctly: tracking this point forward and backward results in identical trajectories. Point 2, on the other hand, is occluded in the second image, so the tracker locks onto a different point, and tracking that point backward leads to a location different from the original one. The algorithm measures the discrepancy between the forward and backward trajectories and, if they differ significantly, the forward trajectory is considered incorrect. This Forward-Backward error penalizes inconsistent trajectories and enables the algorithm to reliably detect tracking failures and select reliable trajectories in video sequences.
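The round trip described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: `track` is a hypothetical callable standing in for any point tracker, and `toy_track` is a made-up tracker that shifts every point by a fixed amount and lets one point drift on the way back, mimicking an occlusion.

```python
import numpy as np

def forward_backward_error(points, track):
    """Track points forward, then backward, and return the round-trip distance.

    `track(points, reverse)` is a hypothetical tracker interface: it returns
    the new positions of `points` in the next (or previous) frame.
    """
    points = np.asarray(points, dtype=float)
    forward = track(points, reverse=False)    # frame t   -> frame t+1
    backward = track(forward, reverse=True)   # frame t+1 -> frame t
    return np.linalg.norm(backward - points, axis=1)

def toy_track(points, reverse):
    """Made-up tracker: a constant shift, with point 1 drifting when going back."""
    shift = np.array([2.0, 1.0])
    moved = points - shift if reverse else points + shift
    if reverse:
        moved[1] += np.array([3.0, 4.0])      # simulated tracking failure
    return moved

points = np.array([[10.0, 10.0], [20.0, 15.0], [30.0, 5.0]])
errors = forward_backward_error(points, toy_track)
print(errors)  # the drifting point gets a large error, the others get zero
```

Points whose error is large are the ones a tracker like Median Flow would discard.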

Block diagram of Median Flow. [1]

But how do you actually track the objects? Well, the algorithm’s block diagram is shown in the image on the left. The tracker accepts a pair of images (the current frame and the next frame of a video) and a bounding box that locates the object to track in the first frame, and it outputs a new bounding box that locates the object in the next frame. First, a set of points is initialized on a rectangular grid within the bounding box. These points identify the object and they are the actual elements tracked in the video. To track these points, Median Flow relies on a sparse optical flow feature tracker called Lucas-Kanade [2], which you can read more about in the next section. This is the ‘Flow’ part in ‘Median Flow’. The quality of the point predictions is then estimated and each point is assigned an error (a combination of the Forward-Backward error and other measures). The worst 50% of the predictions are filtered out, and only the remaining ones are used to estimate the displacement of the bounding box and its scale change, using the median over each spatial dimension. This is the ‘Median’ part!
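The ‘Median’ step can be sketched as follows. This is a minimal illustration, not the code from the project: the function name `update_box` and its arguments are hypothetical, and the scale change is estimated here as the median ratio of pairwise point distances, which is one common choice and an assumption on my part. The sketch works for any number of spatial dimensions.

```python
import numpy as np

def update_box(box_min, box_max, old_pts, new_pts, errors):
    """Estimate the next bounding box from tracked points (hypothetical helper)."""
    keep = errors <= np.median(errors)        # filter out the worst half
    old, new = old_pts[keep], new_pts[keep]
    shift = np.median(new - old, axis=0)      # median displacement per dimension
    # Scale change (assumed method): median ratio of pairwise distances.
    i, j = np.triu_indices(len(old), k=1)
    d_old = np.linalg.norm(old[i] - old[j], axis=1)
    d_new = np.linalg.norm(new[i] - new[j], axis=1)
    scale = np.median(d_new / d_old)
    center = (box_min + box_max) / 2 + shift
    half = (box_max - box_min) / 2 * scale
    return center - half, center + half

# Points on a 3x3 grid inside the box, as in the first step of the algorithm.
box_min, box_max = np.array([0.0, 0.0]), np.array([4.0, 4.0])
axes = [np.linspace(lo, hi, 3) for lo, hi in zip(box_min, box_max)]
old_pts = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, len(box_min))

new_pts = old_pts + np.array([1.0, 2.0])      # true motion of the object
new_pts[5:] += 10.0                           # badly tracked points...
errors = np.arange(9, dtype=float)            # ...which also have the largest errors

new_min, new_max = update_box(box_min, box_max, old_pts, new_pts, errors)
print(new_min, new_max)
```

Because the median ignores the outliers that survive the filtering, a few bad points do not drag the box away from the object.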

Median Flow at work. Red points are correctly tracked and the ones chosen to update the bounding box for the next frame.

To improve the accuracy of Median Flow we can use other error measures such as normalized cross-correlation, which turns out to be complementary to the Forward-Backward error [1]! The name sounds fancy but again the concept is very simple: we compare a small region around each tracked point in the current frame with the corresponding region in the next frame, and measure how similar they are. If the points are tracked correctly, the regions should be nearly identical. This holds under the same brightness assumption that optical flow based tracking already makes.
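As a rough sketch, normalized cross-correlation between two patches can be computed like this. The function name `ncc` and the choice to return 0 for a constant patch are assumptions made for the example; the patches can have any number of dimensions.

```python
import numpy as np

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = patch_a.astype(float).ravel()
    b = patch_b.astype(float).ravel()
    a -= a.mean()                              # remove the mean brightness
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A constant patch has no structure to compare; returning 0 is a convention.
    return float(a @ b / denom) if denom > 0 else 0.0

patch = np.arange(25, dtype=float).reshape(5, 5) ** 2
print(ncc(patch, patch))           # identical patches
print(ncc(patch, 2 * patch + 3))   # same patch, different gain and offset
print(ncc(patch, -patch))          # inverted patch
```

A value close to 1 means the two regions look alike, so a point with a low score is a candidate for filtering.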

## Lucas-Kanade feature tracker

The Lucas-Kanade tracker is a widely used differential method for sparse optical flow estimation. The concept of optical flow is quite old and extends beyond the field of Computer Vision. In this blog post you can think of optical flow as a vector that contains, for every dimension, the number of pixels a point has moved between one frame and the next. To compute optical flow we first have to make an important assumption: the brightness of the region we are tracking in the video doesn’t change between consecutive frames. Mathematically this translates to:

$I(x,y,t)=I(x+\Delta x,y+\Delta y, t+\Delta t)$

Then, if the movement is small, we can expand the term $I(x+\Delta x,y+\Delta y, t+\Delta t)$ to first order in a Taylor series and we get:

$I(x+\Delta x,y+\Delta y, t+\Delta t) \approx I(x,y,t) + \frac{\partial I}{\partial x} \Delta x+ \frac{\partial I}{\partial y} \Delta y+ \frac{\partial I}{\partial t} \Delta t$

Now from the original equation it follows that:

$\frac{\partial I}{\partial x} \Delta x+ \frac{\partial I}{\partial y} \Delta y+ \frac{\partial I}{\partial t} \Delta t = 0$

Dividing by $\Delta t$ we finally obtain the optical flow equation:

$\frac{\partial I}{\partial x} V_{x}+ \frac{\partial I}{\partial y} V_{y}+ \frac{\partial I}{\partial t} = 0$

where $V_{x}$ and $V_{y}$ are the velocities, i.e. the components of the optical flow. We can also write this equation in a form that remains valid in 3 dimensions:

$\nabla I^{T} \cdot \vec{V} = - I_{t}$

The problem now is that this is a single equation in two (or three) unknowns and cannot be solved as such. This is known as the aperture problem of optical flow algorithms, which basically means that we can estimate optical flow only in the direction of the gradient and not in the direction perpendicular to it. To actually solve this equation we need some additional constraint, and there are various approaches and algorithms for estimating the actual flow.

The Lucas-Kanade tracker applies the least squares principle to solve the equation. Making the same assumptions as before (brightness constancy and small movement) for the neighborhood of the point to track, we can apply the optical flow equation to all the pixels $q_1, q_2, \dots, q_n$ in the window centered around the point. We obtain a system of equations that can be written in matrix form as $Av = b$, where $v$ is the optical flow and:

$A = \begin{bmatrix} \nabla I(q_1) \\ \nabla I(q_2) \\ \vdots \\ \nabla I(q_n) \end{bmatrix}$   and   $b = \begin{bmatrix} - I_t (q_1) \\ - I_t (q_2) \\ \vdots \\ - I_t (q_n) \end{bmatrix}$

And the solution is: $v = (A^T A)^{-1} A^T b$
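To make the least squares step concrete, here is a minimal NumPy sketch of one classical Lucas-Kanade step. It is an illustration under assumptions, not the project’s implementation: `lucas_kanade_step` and `blob` are hypothetical names, the spatial gradients come from `np.gradient`, the temporal derivative is a plain frame difference, and `np.linalg.lstsq` solves $Av = b$ instead of forming $(A^T A)^{-1}$ explicitly. The same code runs on 2D images and 3D volumes.

```python
import numpy as np

def lucas_kanade_step(frame0, frame1, center, half_window):
    """One classical Lucas-Kanade step at an integer point (N-dimensional)."""
    f0, f1 = frame0.astype(float), frame1.astype(float)
    grads = np.gradient(f0)                    # one gradient array per axis
    i_t = f1 - f0                              # temporal derivative
    window = tuple(slice(c - half_window, c + half_window + 1) for c in center)
    A = np.stack([g[window].ravel() for g in grads], axis=1)  # row i is grad I(q_i)
    b = -i_t[window].ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)  # least squares solution of Av = b
    return v                                   # flow in array-axis order

def blob(shape, center, sigma=5.0):
    """Smooth Gaussian blob used as test data (hypothetical helper)."""
    grids = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    r2 = sum((g - c) ** 2 for g, c in zip(grids, center))
    return np.exp(-r2 / (2 * sigma ** 2))

frame0 = blob((41, 41), (20, 20))
frame1 = blob((41, 41), (20.3, 19.8))          # the blob moved by (0.3, -0.2) pixels
flow = lucas_kanade_step(frame0, frame1, (20, 20), half_window=7)
print(flow)  # should be close to the true motion
```

The estimate is only approximate because of the first-order Taylor expansion and the finite-difference gradients, which is one reason to iterate.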

Finally! Now we understand the classical Lucas-Kanade algorithm. But when it comes to implementing it, there are still ways to improve the accuracy and robustness of this original version, for example by adding the words ‘iterative’ and ‘pyramidal’ to it [2].

Pyramidal implementation of Lucas-Kanade feature tracker.

By ‘iterative’ I simply mean running the classical Lucas-Kanade step multiple times, each time starting from the solution found in the previous iteration, until the algorithm converges. This is what we need to do in practice to obtain an accurate solution. The second trick is to build pyramidal representations of the images: this means resizing every image by halving its size multiple times. So, for example, for an image of size 640×480 and a pyramid with 3 levels above the original, we would also have images of size 320×240, 160×120 and 80×60. The reason behind this is to allow the algorithm to handle large pixel motion: the movement of a pixel is a lot smaller at the top level of the pyramid (it is divided by exactly $2^L$, where $L$ is the top level of the pyramid). We apply the iterative algorithm at every level of the pyramid, starting from the top, and use the solution at the current level, scaled by 2, as the initial guess of the optical flow for the next, finer level.
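A pyramid like the one described above can be sketched with plain NumPy. This is a simplified illustration: `build_pyramid` is a hypothetical helper that halves every axis by averaging blocks of 2 pixels, whereas a real implementation would normally apply a smoothing filter before subsampling. It works for images and volumes alike.

```python
import numpy as np

def build_pyramid(image, levels):
    """Return [level 0, ..., level `levels`], halving every axis at each step."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        prev = pyramid[-1]
        # Crop to even sizes, then average each block of 2 along every axis.
        even = prev[tuple(slice(0, (n // 2) * 2) for n in prev.shape)]
        shape = [s for n in even.shape for s in (n // 2, 2)]
        block_axes = tuple(range(1, 2 * prev.ndim, 2))
        pyramid.append(even.reshape(shape).mean(axis=block_axes))
    return pyramid

image = np.ones((480, 640))                   # a 640x480 image is 480 rows by 640 columns
pyramid = build_pyramid(image, levels=3)
print([level.shape for level in pyramid])

# A 16 pixel motion at level 0 shrinks by 2**L at level L.
print([16 / 2 ** level for level in range(4)])
```

At the coarsest level the motion is small enough for the small-movement assumption to hold, and the flow found there is doubled before being refined at the next level.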

Now the Lucas-Kanade feature tracker works quite well!

## Reinventing the wheel to move forward

If you read the previous sections and thought that there is nothing in them that cannot be extended to more dimensions, you are correct! Every step described up to now can be easily generalized to three dimensional volumes, Lucas-Kanade included. (The proof is left as an exercise for the reader.) Everything I described is already well implemented by the OpenCV library for typical 2D videos. That’s why at first I started diving into the code of the OpenCV implementations. But for me it was too heavily optimized to make something useful out of it. Let’s remember that the OpenCV project was started 18 years ago and the main contributors were a group of Russian optimization experts who worked at Intel, and they have had 18 years to optimize it even more! As an epigraph written and highlighted just outside my office says: “Optimization hinders evolution”, and I guess I experienced this firsthand. To be fair to the OpenCV project, I don’t think it was ever the intention of the developers to implement 3D versions of the algorithms.

But enough rambling! Long story short, I reinvented the wheel and wrote my own implementation of these algorithms in Python using NumPy, then generalized them to work on 3D movies. To test the new Median Flow I generated my own 4D data that makes the problem very easy for the algorithm: for example, a volume containing a ball, to which I then applied small translations.
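Test data of this kind is easy to generate. The sketch below is an assumption about how it could be done, not the script actually used: `ball_volume`, `ball_movie` and `center_of_mass` are hypothetical helpers, and the movie is stored as an array with axes (frame, z, y, x).

```python
import numpy as np

def ball_volume(shape, center, radius):
    """Binary volume that is 1 inside a ball and 0 outside."""
    grids = np.meshgrid(*[np.arange(n) for n in shape], indexing="ij")
    r2 = sum((g - c) ** 2 for g, c in zip(grids, center))
    return (r2 <= radius ** 2).astype(np.float32)

def ball_movie(shape, start, step, radius, n_frames):
    """Stack of volumes in which the ball moves by `step` voxels per frame."""
    start, step = np.asarray(start, float), np.asarray(step, float)
    frames = [ball_volume(shape, start + t * step, radius) for t in range(n_frames)]
    return np.stack(frames)                    # shape: (frame, z, y, x)

def center_of_mass(volume):
    """Intensity-weighted mean position, handy as ground truth for a tracker."""
    idx = np.indices(volume.shape).reshape(volume.ndim, -1)
    weights = volume.ravel()
    return (idx * weights).sum(axis=1) / weights.sum()

movie = ball_movie((32, 32, 32), start=(10, 16, 10), step=(1, 0, 2), radius=5, n_frames=5)
print(movie.shape)
print(center_of_mass(movie[0]), center_of_mass(movie[-1]))
```

A ball with hard edges has gradients only at its surface, so smoothing the volume slightly may help a gradient-based tracker; that step is left out here to keep the sketch short.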

3D Median Flow test results.

It works! Now we can try to make the problem a little more difficult and also have a bit of fun. For example, let’s add gravity to the ball! Or as my professor would say, let’s use a numerical integrator to solve the system of ordinary differential equations of the laws of kinematics for the ball subject to a gravitational field!

It took me way longer than I care to admit to make this.
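For the curious, a numerical integrator for a falling ball fits in a few lines. This sketch uses a simple semi-implicit Euler scheme, which is my assumption and not necessarily the integrator used for the animation; `simulate_fall` is a hypothetical name, and units are arbitrary (voxels and seconds, say).

```python
import numpy as np

def simulate_fall(position, velocity, gravity, dt, n_steps):
    """Integrate dx/dt = v, dv/dt = g with semi-implicit Euler steps."""
    position = np.asarray(position, float).copy()
    velocity = np.asarray(velocity, float).copy()
    gravity = np.asarray(gravity, float)
    trajectory = [position.copy()]
    for _ in range(n_steps):
        velocity += gravity * dt               # update the velocity first
        position += velocity * dt              # then move with the new velocity
        trajectory.append(position.copy())
    return np.array(trajectory)

trajectory = simulate_fall([16, 16, 28], [0, 0, 0], [0, 0, -9.81], dt=0.01, n_steps=100)
print(trajectory[-1])  # close to the analytic result 28 - 0.5 * 9.81 * 1**2 on the last axis
```

Each row of `trajectory` can then serve as the ball center for one frame of a synthetic 3D movie.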

But it’s not just balls! The beautiful thing about the algorithm is that it doesn’t care about the object it is tracking, so we can apply it to any sort of data, for example 3D MRI data! Let’s generate a 3D movie by again applying some translations to make the head move, and then track something, like the nose!

Having fun with 3D Median Flow. Database source: University of North Carolina

These toy problems are quite easy for the algorithm to handle because the object moves very slowly between one frame and the next, and they are also quite easy for my laptop to handle since the volumes are quite small. This is not the case for real world research data. Increasing the size of the data increases the computation required by the algorithm. Also, when the object moves faster between frames, we need to increase both the number of levels in the pyramidal representation and the size of the integration window for the Lucas-Kanade algorithm. This means even more computation, a lot more!

That’s why this is a problem for supercomputers and why I am moving on to the next point on my list of notes: “High performance computing”. But this is a topic for another blog post.

## References

[1] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Forward-backward error: Automatic detection of tracking failures. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2756–2759. IEEE, 2010.

[2] Jean-Yves Bouguet. Pyramidal implementation of the Lucas Kanade feature tracker. Description of the algorithm. Intel Corporation, 5, 2001.