
Datasets and deep learning methodologies to extend image-based applications to videos

  • 4 min read
  • 05 Apr 2018

In today’s tutorial, we will extend image-based applications to videos, including pose estimation, captioning, and video generation.

Extending image-based applications to videos


Images can be used for pose estimation, style transfer, image generation, segmentation, captioning, and so on. These applications find a place in videos too. Using temporal information may improve the predictions made from individual images, and vice versa. In this section, we will see how to extend these applications to videos.

Regressing the human pose


Human pose estimation is an important application of video data and can improve other tasks, such as action recognition. Several datasets are available for pose estimation in videos.


Pfister et al. proposed a method to predict the human pose in videos. The following is the pipeline for regressing the human pose:

[Image: the pipeline for regressing the human pose]

Frames from the video are passed through a convolutional network, the layers are fused, and pose heatmaps are obtained. The heatmaps of neighbouring frames are combined with optical flow to get warped heatmaps, and the warped heatmaps across the time window are pooled into a single heatmap, giving the final pose.
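The following is a minimal sketch of this warping-and-pooling step, assuming the per-frame heatmaps are already available as NumPy arrays and using OpenCV's Farneback optical flow. It illustrates the idea only; it is not Pfister et al.'s implementation:

import cv2
import numpy as np

def warp_with_flow(heatmap, flow):
    """Warp a (H, W) float32 heatmap along a dense (H, W, 2) flow field."""
    h, w = heatmap.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(heatmap, map_x, map_y, interpolation=cv2.INTER_LINEAR)

def pooled_pose_heatmap(frames, heatmaps, center):
    """Pool one joint's heatmaps from neighbouring grayscale frames
    into the coordinate system of the frame at index `center`."""
    warped = [heatmaps[center]]
    for t in range(len(frames)):
        if t == center:
            continue
        # Dense flow from the centre frame to frame t, used to pull
        # frame t's heatmap back into the centre frame.
        flow = cv2.calcOpticalFlowFarneback(
            frames[center], frames[t], None, 0.5, 3, 15, 3, 5, 1.2, 0)
        warped.append(warp_with_flow(heatmaps[t], flow))
    return np.max(warped, axis=0)  # max-pool across the time window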

Tracking facial landmarks


Face analysis in videos requires face detection, landmark detection, pose estimation, verification, and so on. Computing landmarks is especially crucial for capturing facial animation, human-computer interaction, and human activity recognition. Instead of computing the landmarks frame by frame, they can be computed over the video as a whole. Gu et al. proposed a method that jointly estimates the detection and tracking of facial landmarks in videos using an RNN. The results outperform frame-wise predictions and other previous models. The landmarks are computed by a CNN, and the temporal aspect is encoded by an RNN. Synthetic data was used for training.
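A toy sketch of this CNN-plus-RNN structure, written with Keras, is shown below. The layer sizes and the choice of 68 landmarks are illustrative assumptions, not the configuration used by Gu et al.:

from tensorflow.keras import layers, models

num_frames, height, width, num_landmarks = 16, 64, 64, 68

# A small CNN that embeds a single grayscale frame into a feature vector.
frame_encoder = models.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(height, width, 1)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
])

model = models.Sequential([
    # Run the CNN on every frame independently...
    layers.TimeDistributed(frame_encoder,
                           input_shape=(num_frames, height, width, 1)),
    # ...then let an RNN encode the temporal dependencies.
    layers.LSTM(128, return_sequences=True),
    # Regress (x, y) for each landmark at every time step.
    layers.TimeDistributed(layers.Dense(num_landmarks * 2)),
])
model.compile(optimizer='adam', loss='mse')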

Segmenting videos


Videos can be segmented more accurately when temporal information is used. Gadde et al. proposed a method to incorporate temporal information by warping. The following image demonstrates the solution, which segments two frames and combines them by warping:

[Image: segmenting two frames and combining them by warping]

The warping net is shown in the following image:

[Image: the warping net, reproduced from Gadde et al.]


The optical flow is computed between the two frames. The warping module takes the optical flow, uses it to transform the representations of one frame into the coordinate system of the other, and combines the warped representations with the current frame's features.
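A rough sketch of this feature-warping step follows, assuming per-frame feature maps of shape (H, W, C) and a precomputed dense flow field. The simple blending weight alpha stands in for the learned combination used in the actual model:

import cv2
import numpy as np

def warp_and_combine(feat_prev, feat_curr, flow, alpha=0.5):
    """Warp the previous frame's features into the current frame along
    the optical flow, then blend them with the current features."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    # Warp each feature channel independently with the same flow field.
    warped = np.stack(
        [cv2.remap(feat_prev[..., c].astype(np.float32), map_x, map_y,
                   interpolation=cv2.INTER_LINEAR)
         for c in range(feat_prev.shape[-1])],
        axis=-1)
    return alpha * warped + (1.0 - alpha) * feat_curr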

Captioning videos


Captions describing the context can be generated for videos. Several datasets are available for captioning videos.


Yao et al. proposed a method for captioning videos. A 3D convolutional network trained for action recognition is used to extract local temporal features. An attention mechanism is then applied to these features to generate text with an RNN. The process is shown here:

[Image: the captioning process, reproduced from Yao et al.]

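A small NumPy sketch of soft attention over temporal features is given below. The projection matrices W_f, W_h, and v, and all sizes, are illustrative assumptions, not Yao et al.'s exact formulation:

import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """features: (T, D) temporal features; hidden: (H,) decoder state.
    Returns a (D,) context vector as a weighted sum over time."""
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # (T,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over time
    return weights @ features

# At each decoding step, the RNN consumes this context vector together
# with the previously generated word to predict the next word.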


Donahue et al. proposed another method for video captioning or description, which uses an LSTM with convolution features. It is similar to the preceding approach, except that 2D convolution features are used here, as shown in the following image:

[Image: video description with an LSTM and convolution features, reproduced from Donahue et al.]


We have several ways to combine text with images, such as activity recognition, image description, and video description techniques. The following image illustrates these techniques:

[Image: activity recognition, image description, and video description techniques, reproduced from Donahue et al.]


Venugopalan et al. proposed a method for video captioning using an encoder-decoder approach. The following is a visualization of the technique they proposed:

[Image: the encoder-decoder technique for video captioning, reproduced from Venugopalan et al.]





For this method, the CNN can be computed either on the frames themselves or on the optical flow of the images.
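A compact Keras sketch of the encoder-decoder idea follows: one LSTM encodes the per-frame CNN features, and a second LSTM decodes the caption from the resulting state. The vocabulary size and all dimensions are assumptions for illustration:

from tensorflow.keras import layers, models

num_frames, feat_dim, vocab_size, max_words = 16, 2048, 10000, 20

# Encoder: summarize the sequence of frame features into an LSTM state.
frame_feats = layers.Input(shape=(num_frames, feat_dim))
_, state_h, state_c = layers.LSTM(256, return_state=True)(frame_feats)

# Decoder: generate the caption word by word from the encoder state.
prev_words = layers.Input(shape=(max_words,))
embedded = layers.Embedding(vocab_size, 256)(prev_words)
decoded = layers.LSTM(256, return_sequences=True)(
    embedded, initial_state=[state_h, state_c])
next_word = layers.Dense(vocab_size, activation='softmax')(decoded)

model = models.Model([frame_feats, prev_words], next_word)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')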


 

Generating videos


Videos can be generated with generative models in an unsupervised manner. Future frames can be predicted from the current frames. Ranzato et al. proposed a method for generating videos that is inspired by language models: an RNN model takes a patch of the image and predicts the next patch.
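The language-model analogy can be sketched as follows: treat a sequence of flattened image patches as tokens and train an RNN to predict the next patch. The patch size and model dimensions are illustrative, not Ranzato et al.'s setup:

from tensorflow.keras import layers, models

seq_len, patch_dim = 8, 16 * 16  # eight 16x16 grayscale patches

model = models.Sequential([
    layers.LSTM(256, input_shape=(seq_len, patch_dim)),
    layers.Dense(patch_dim, activation='sigmoid'),  # pixel values in [0, 1]
])
model.compile(optimizer='adam', loss='mse')
# Training pairs: patches[t - seq_len : t] as input, patches[t] as target.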

To summarize, we learned how to extend image-based deep learning applications to videos, covering pose estimation, facial landmark tracking, segmentation, captioning, and video generation.


You read an excerpt from the book Deep Learning for Computer Vision, written by Rajalingappaa Shanmugamani. This book will help you learn to model and train advanced neural networks for implementing computer vision tasks.
