In this assignment we use Matlab to implement the required algorithm: dynamic textures.
Dynamic textures are sequences of images of moving scenes that exhibit temporal regularity in a statistical sense, such as sea waves, smoke, foliage, and whirlwinds, but also talking faces, traffic scenes, and so on. The paper presents a characterization of dynamic textures that poses the problems of modeling, learning, and synthesis of this type of sequence. Once learned, a model has predictive power and can be used for extrapolating synthetic sequences to infinite length with negligible computational cost.
To summarize the paper, here are the main contributions of the authors' approach:
- Representation: they introduce a novel definition of dynamic texture that is general (even the simplest instance can capture the second-order statistics of a video signal) and precise (it allows making analytical statements and drawing from the rich literature on system identification).
- Learning: two criteria are proposed: total likelihood or prediction error. For the simplest instance of the model, they use a closed-form solution.
- Recognition: they found that similar textures tend to cluster in model space, and assessed the potential for recognition of dynamic visual processes.
- Synthesis: they found that even the simplest model (first-order autoregressive moving average model with Gaussian input) captures a wide range of natural phenomena.
- Implementation: the algorithm is simple to implement, efficient to learn, and fast to simulate; it allows one to generate infinitely long sequences from short input sequences. (Implementing this algorithm is exactly what we do in this project.)
Under the hypothesis of temporal stationarity, Soatto et al. adopted a linear dynamical system (LDS) to analyze and synthesize dynamic textures. The state-space representation of their model is given by

$$x_t = A x_{t-1} + v_t, \qquad v_t \sim \mathcal{N}(0, \Sigma_v)$$
$$y_t = C x_t + w_t, \qquad w_t \sim \mathcal{N}(0, \Sigma_w)$$

where $y_t \in \mathbb{R}^n$ is the observation vector, $x_t \in \mathbb{R}^r$ is the hidden state vector, $A$ is the system matrix, $C$ is the output matrix, and $v_t$, $w_t$ are Gaussian white noise processes.
Implementation
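To make the generative model concrete, here is a minimal NumPy sketch (not our Matlab code) that draws a sequence from the two state-space equations above. The function name `simulate_lds`, the toy dimensions, and the example matrices are illustrative assumptions; only the update equations come from the model.

```python
import numpy as np

def simulate_lds(A, C, Sigma_v, Sigma_w, x0, num_steps, seed=0):
    """Sample y_1..y_T from x_t = A x_{t-1} + v_t, y_t = C x_t + w_t.

    A is r x r, C is n x r, Sigma_v is r x r, Sigma_w is n x n.
    Returns an n x num_steps array whose columns are the observations.
    """
    rng = np.random.default_rng(seed)
    r = A.shape[0]          # hidden state dimension
    n = C.shape[0]          # observation dimension
    x = np.asarray(x0, dtype=float)
    Y = np.empty((n, num_steps))
    for t in range(num_steps):
        v = rng.multivariate_normal(np.zeros(r), Sigma_v)  # state noise v_t
        w = rng.multivariate_normal(np.zeros(n), Sigma_w)  # observation noise w_t
        x = A @ x + v                                      # state update
        Y[:, t] = C @ x + w                                # observation
    return Y

if __name__ == "__main__":
    # Toy example (assumed values): a damped rotation as the system matrix.
    theta = 0.3
    A = 0.9 * np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
    C = np.random.default_rng(1).standard_normal((5, 2))
    Y = simulate_lds(A, C, 0.01 * np.eye(2), 0.01 * np.eye(5), np.zeros(2), 10)
    print(Y.shape)
```

In a real dynamic texture each column of `Y` would be one video frame flattened into a vector, so n is the number of pixels and is much larger than r.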
In this project we analyze sequences of images of moving scenes solely as visual signals; interpreting and understanding a signal amounts to inferring a stochastic model that generates it. We used Matlab to define the dynamic texture model, use linear filters to do dimensionality reduction, run the learning process, compress and denoise, and then output the result as an extended video. Details are discussed in the paper, and the source code is included at the top of this page.
Results
Below are the videos we tested. For each one we show the original video and the synthesized video, which has also been extended. You can look through our results in the table and compare the videos. The original video serves as the training sequence in the learning procedure, which extracts the parameters of the model. We then simulate the model to synthesize new video sequences. Note that the learning procedure has been applied directly to the raw data, and no preprocessing has been performed.
Learning from a sequence of 100 frames takes about 1.5 minutes on a 2.2 GHz Intel Core i7 PC. Synthesis can be performed at frame rate. In our implementation we have used n between 20 and 50 and k between 10 and 30.
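For orientation, the learning and synthesis steps can be sketched in a few lines of NumPy. This is a hedged illustration of the closed-form, SVD-based procedure as we understand it from the paper, not our Matlab code or the toolbox. It assumes that n is the dimension of the hidden state and k is the number of driving-noise components (the text above does not define them), and it subtracts the temporal mean frame inside the learning step, which is also an assumption of this sketch. The function names are hypothetical.

```python
import numpy as np

def learn_dynamic_texture(Y, n, k):
    """Closed-form fit of a linear dynamic texture model.

    Y: pixels x tau matrix, one flattened frame per column.
    n: assumed state dimension; k: assumed number of noise components (k <= n).
    """
    tau = Y.shape[1]
    y_mean = Y.mean(axis=1, keepdims=True)           # temporal mean frame (assumption)
    U, s, Vt = np.linalg.svd(Y - y_mean, full_matrices=False)
    C = U[:, :n]                                     # output matrix: top-n spatial basis
    X = s[:n, None] * Vt[:n, :]                      # state sequence, n x tau
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])         # least-squares state transition
    E = X[:, 1:] - A @ X[:, :-1]                     # one-step prediction residual
    Ue, se, _ = np.linalg.svd(E, full_matrices=False)
    B = Ue[:, :k] * se[:k] / np.sqrt(tau - 1)        # rank-k noise input matrix
    return A, B, C, y_mean, X[:, 0]

def synthesize(A, B, C, y_mean, x0, num_frames, seed=0):
    """Run x_{t+1} = A x_t + B v_t with v_t ~ N(0, I_k); output C x_t + mean."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    frames = np.empty((C.shape[0], num_frames))
    for t in range(num_frames):
        frames[:, t] = C @ x + y_mean[:, 0]
        x = A @ x + B @ rng.standard_normal(B.shape[1])
    return frames

if __name__ == "__main__":
    # Random stand-in for a video: 64 "pixels", 30 frames (toy data only).
    Y = np.random.default_rng(0).standard_normal((64, 30))
    A, B, C, y_mean, x0 = learn_dynamic_texture(Y, n=5, k=3)
    print(synthesize(A, B, C, y_mean, x0, num_frames=100).shape)
```

Under these assumptions the SVD of the frame matrix dominates the learning cost, while each synthesized frame needs only two small matrix-vector products plus one product with `C`, which is consistent with synthesis running at frame rate.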
Most videos give great results: they look natural enough that you can hardly tell they are actually "fake". However, the method simply turns the original video into a texture, so it obviously cannot "predict" anything that was not in the input. In effect, the algorithm appears to find a short clip that seems to repeat, blur it and slightly reduce the differences in movement within it, and create a texture based on that clip.
Some of our tests failed. They are listed below, and each looks unnatural in a different way. We discuss why these results are imperfect in the analysis below, but first please look through the videos.
First, in the video of the trees, the result is a little disjointed: there is a jump in the flow, so it is not smooth, and there appears to be a loop that is easy to spot. This is because each tree in the original video sways separately, which makes it hard for the algorithm to capture coordinated motion across the whole frame.
As for the spin video, the result is quite blurry, with an obvious offset from the original outline that leaves a shadow of past movement; it looks more like vibration than slight movement. This is because the algorithm processes the whole frame at once, trying to reduce differences, while the position of the spin is somewhat random and its movement is very slight. As a result, the output shows an obvious white shadow and the spin vibrates.
In the firework video, the old fireworks do not fade away, which leads to a blurred video. Eventually, the screen is just a smear of color with small movements in it. This problem is like a combination of the first two: the movement of the fireworks is not as regular from a computational perspective as we perceive it to be, so the algorithm ends up blurring heavily.
Conclusions
Overall, we believe this is a successful implementation. We really like the results for the mountain and cloud video and the disk playing video. Although several results are imperfect, they show that the algorithm has limitations and is not very robust when the original video is less repetitive and predictable.
Bells & Whistles (Extra Points)
Varying the parameters can potentially help to get better results
Let us see how the results change when we vary the k parameter. The paper suggests using k in [10, 20].
| n = 50, k = 1 | n = 50, k = 20 |
| n = 50, k = 30 | n = 50, k = 40 |
| n = 50, k = 1 | n = 50, k = 30 |
In this case, videos with smaller k values look blurrier and lose some detail. When k exceeds some value (such as 30), the results turn blurry again. Our conclusion is that in most cases, choosing k in [20, 30] gives the best result.
| n = 50, k = 1 | n = 50, k = 10 |
| n = 50, k = 15 | n = 50, k = 20 |
| n = 50, k = 30 | n = 50, k = 40 |
As k increases, some black regions clearly appear in the right corner of the video. We think this is because the video is not strictly a second-order stationary process. Simple models such as the hidden Markov model (HMM) and the linear dynamical system (LDS) are efficient and easily learned, but limited in their expressiveness for complex motions. In such cases, a small k value gives better results, while for other videos a slightly larger k value might give better results.
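One rough way to reason about the choice of k is to check how much of the model's one-step prediction residual a given k retains. The sketch below assumes that k is the number of noise components kept from a singular value decomposition of that residual (our reading; the text does not define k). The helper name `retained_noise_fraction` and the random stand-in residual are purely illustrative, and the numbers it produces say nothing about our actual videos.

```python
import numpy as np

def retained_noise_fraction(E, k):
    """Fraction of residual energy captured by the top-k singular values of E.

    E: n x (tau - 1) matrix of one-step state prediction residuals (assumed input).
    """
    s = np.linalg.svd(E, compute_uv=False)   # singular values, largest first
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())

if __name__ == "__main__":
    # Random stand-in for a residual with n = 50 states and 99 transitions.
    E = np.random.default_rng(0).standard_normal((50, 99))
    for k in (1, 10, 20, 30, 40):
        print(k, round(retained_noise_fraction(E, k), 3))
```

The fraction can only grow with k, so a very small k drives the synthesized state with little randomness, while a large k also passes on residual components that the linear model may not explain well.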
Having finished this project, here are several thoughts and opinions we have:
- Though videos and images are different forms of media, the ideas behind processing them have a lot in common.
- The algorithm we used in this project is mainly from the paper by Soatto et al., and we used the accompanying toolbox as well. This was very helpful to our work on this project.
- We could continue optimizing the code to shorten the running time. Another good direction for improvement is trying to reduce the sudden jumps and image blur that appear during synthesis.
References
- S. Soatto, G. Doretto, and Y. N. Wu. Dynamic textures. In IEEE Conf. on Computer Vision and Pattern Recognition, 2001.