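================ Divider ================= This part is from Zhihu ====================
Link: http://www.zhihu.com/question/33272629/answer/60279003
On action recognition in videos: I have been working on this recently. The field runs deep, but the mainstream approaches come down to a handful of techniques, so at the risk of lecturing the experts, here are the mainstream ones for video:
Before deep learning, the approach that worked best was the INRIA group's Improved Dense Trajectories (IDT) + Fisher vector; paper and code:
LEAR - Improved Trajectories Video Description
Pretty much everything from INRIA works well, yeah..
The most representative deep learning method is the VGG group's two-stream network:
http://arxiv.org/abs/1406.2199
Its accuracy is actually not much different from IDT, and many people complain that the reported results are hard to reproduce; it took me a while to get roughly comparable numbers myself.
Many improvements build on these two works. The current state of the art is the combination you would intuitively expect, IDT + two-stream, from Xiaoou's group:
http://wanglimin.github.io/papers/WangQT_CVPR15.pdf
There is also Google's LSTM + two-stream, which was very popular a while ago and still gets a lot of attention:
http://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/43793.pdf
And a plug for Zhongwen's paper:
http://www.cs.cmu.edu/~zhongwen/pdf/MED_CNN.pdf
In the end you will find that every paper has to compare against IDT,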
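================ Divider ================= This part is also from Zhihu ====================
Link: http://www.zhihu.com/question/33272629/answer/60163859
I am not familiar with the video side, but I can talk about action recognition in still images.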
[1] Action Recognition from a Distributed Representation of Pose and Appearance, CVPR, 2010
[2] Combining Randomization and Discrimination for Fine-Grained Image Categorization, CVPR, 2011
[3] Object and Action Classification with Latent Variables, BMVC, 2011
[4] Human Action Recognition by Learning Bases of Action Attributes and Parts, ICCV, 2011
[5] Learning person-object interactions for action recognition in still images, NIPS, 2011
[6] Weakly Supervised Learning of Interactions between Humans and Objects, PAMI, 2012
[7] Discriminative Spatial Saliency for Image Classification, CVPR, 2012
[8] Expanded Parts Model for Human Attribute and Action Recognition in Still Images, CVPR, 2013
[9] Coloring Action Recognition in Still Images, IJCV, 2013
[10] Semantic Pyramids for Gender and Action Recognition, TIP, 2014
[11] Actions and Attributes from Wholes and Parts, arXiv, 2015
[12] Contextual Action Recognition with R*CNN, arXiv, 2015
[13] Recognizing Actions Through Action-Specific Person Detection, TIP, 2015
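I have not read anything from before 2010. In the years around 2010 (2011, 2012) there were three main lines of work: 1. Use the object being interacted with as a cue (person-object interaction) and model the interaction, as in [5] and [6]; 2. Build a model of pose and classify by the distribution of poses (or, more generally, parts), as in [1] and [4]; there is also poselets, which does not seem to be listed above and is fairly widely used; 3. Find discriminative regions and suppress meaningless ones, as in [2] and [7]. [10] and [11] also use this idea.
[9] and [10] both use a feature other than SIFT, namely color names, and describe how to fuse several different features for action classification.
[12] explores how to incorporate context (in action classification the person's bounding box is given).
The more recent works all replace SIFT features with CNN features ([11], [12], [13]); in terms of results, [12] is the latest.
Work on still images is mostly classification; detection work is rare, although [4] and [13] both include some. Perhaps classification results were not promising enough before 2015. Now that classification mAP on PASCAL VOC 2012 has reached 89%, attention may shift more toward detection.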
[1] http://lear.inrialpes.fr/software (lots of useful material; worth browsing)
[2] Action Recognition Paper Reading
- Tian, YingLi, et al. "Hierarchical filtered motion for action recognition in crowded videos." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 42.3 (2012): 313-323.
- A new 3D interest point detector, based on 2D Harris and the Motion History Image (MHI). Essentially, 2D Harris points with recent motion are selected as interest points.
- New descriptors based on HOG computed on image intensity and on the MHI. Some filtering is performed to remove cluttered motion and normalize the descriptors.
- KTH and MSR Action dataset
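The "recent motion" test in the notes above relies on a motion history image. Below is a minimal sketch of the textbook MHI update on plain Python lists, not necessarily the exact variant used in the paper. The names `update_mhi` and `recent_motion_mask`, the duration `tau`, and the difference threshold are all assumptions made for illustration.

```python
def update_mhi(mhi, prev_frame, frame, tau=5, diff_threshold=10):
    """Return the next motion history image (hypothetical helper).

    Pixels whose intensity changed by more than diff_threshold are set to
    tau; every other pixel decays by 1 toward 0.
    """
    rows, cols = len(frame), len(frame[0])
    new_mhi = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            moved = abs(frame[r][c] - prev_frame[r][c]) > diff_threshold
            new_mhi[r][c] = tau if moved else max(mhi[r][c] - 1, 0)
    return new_mhi


def recent_motion_mask(mhi, min_value=1):
    """Mark pixels that moved within the last tau frames."""
    return [[value >= min_value for value in row] for row in mhi]


# Toy 1x4 "video": a bright pixel moves one step to the right per frame.
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
]
mhi = [[0, 0, 0, 0]]
for prev, cur in zip(frames, frames[1:]):
    mhi = update_mhi(mhi, prev, cur, tau=5)

mask = recent_motion_mask(mhi)
```

A detector in this spirit would keep only the 2D Harris corners that fall where `mask` is true.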
- Yuan, Junsong, Zicheng Liu, and Ying Wu. "Discriminative subvolume search for efficient action detection." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
- A discriminative matching technique based on mutual information and the nearest-neighbor algorithm
- A better upper bound for branch and bound, used to locate the matched action that maximizes mutual information
- The key idea is to decompose the search space into spatial and temporal.
- Lampert, Christoph H., Matthew B. Blaschko, and Thomas Hofmann. "Beyond sliding windows: Object localization by efficient subwindow search." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Code online: https://sites.google.com/site/christophlampert/software (Efficient Subwindow Search)
- Reduces the complexity of sliding-window search from O(n^4) to O(n^2) on average
- Branch-and-bound technique
- Relies on a bounding function that gives an upper bound of the scoring function over a set of candidate boxes
- works well with linear classifiers and BOW features.
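The bounding function mentioned above can be sketched for the linear bag-of-words case, where a box's score is the sum of per-feature weights inside it. This is only a sketch under that assumption: positive weights are counted over the largest box in the set and negative weights over the smallest. The names `inside`, `box_score`, and `upper_bound`, and the `(y, x, weight)` point layout, are hypothetical.

```python
def inside(y, x, box):
    """True if (y, x) lies in box = (top, bottom, left, right), inclusive."""
    top, bottom, left, right = box
    return top <= y <= bottom and left <= x <= right


def box_score(points, box):
    """Score of one box: sum of the weights of the features it contains."""
    return sum(w for (y, x, w) in points if inside(y, x, box))


def upper_bound(points, box_set):
    """Upper-bound the score of every box in a set of boxes.

    box_set holds an inclusive (lo, hi) range for each of top, bottom,
    left and right. Positive weights are summed over the union of all
    boxes in the set, negative weights over their intersection.
    """
    (t_lo, t_hi), (b_lo, b_hi), (l_lo, l_hi), (r_lo, r_hi) = box_set
    largest = (t_lo, b_hi, l_lo, r_hi)
    smallest = (t_hi, b_lo, l_hi, r_lo)  # empty if t_hi > b_lo or l_hi > r_lo
    positive = sum(w for (y, x, w) in points if w > 0 and inside(y, x, largest))
    negative = sum(w for (y, x, w) in points if w < 0 and inside(y, x, smallest))
    return positive + negative


# Toy features: (row, column, classifier weight of the feature's visual word).
points = [(0, 0, 2.0), (1, 1, -1.0), (2, 2, 3.0), (1, 2, -0.5)]
box_set = ((0, 1), (1, 2), (0, 1), (1, 2))
bound = upper_bound(points, box_set)
```

Branch and bound would repeatedly split the box set with the highest bound until it contains a single box, at which point the bound equals that box's score.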
- Li, Li-Jia, et al. "Object Bank: A High-Level Image Representation for Scene Classification & Semantic Feature Sparsification." NIPS. Vol. 2. No. 3. 2010.
- Images are represented as a scale-invariant map of object detector response
- Detectors are applied to novel images in multiple scales. At each scale, a 3 level spatial pyramid is applied. Responses are concatenated to form the descriptors for the image.
- 200 objects are selected from a pool of 1000 objects
- Evaluated In Scene classification task
- L1 and L1/L2 regularized LR is applied to discover sparsity. For the L1/L2 group sparsity, a group is defined for each object, hence object-level sparsity. Bear in mind that there are multiple entries in the descriptor for each object. (marginal improvements)
- Wang, Heng, et al. "Dense trajectories and motion boundary descriptors for action recognition." International journal of computer vision 103.1 (2013): 60-79.
- Tracks densely sampled points to obtain trajectories, in contrast with local representations. Not truly dense sampling: grid points are filtered by the minimum-eigenvalue criterion (Shi and Tomasi)
- Motion boundary (derivative over optical flow field), to overcome camera motion
- Code online: http://lear.inrialpes.fr/people/wang/dense_trajectories
- The optical flow field is smoothed with a median filter. The implementation is based on OpenCV.
- Trajectory length is limited to avoid drift. Static points and erroneous trajectories are filtered out.
- Trajectory shape, HOG, HOF and MBH descriptors along the trajectory
- KTH (94.2%), Youtube (84.1%), Hollywood2 (58.2%), UCF Sports (88.0%), IXMAS (93.5%), UIUC (98.4%), Olympic Sports (74.1%), UCF50 (84.5%), HMDB51 (46.6%)
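The motion-boundary idea noted above (taking spatial derivatives of the optical flow field) can be illustrated with a small sketch. It assumes camera motion shows up as a constant offset added to the flow, which differentiation removes; real camera motion is only approximately constant. The helper `spatial_gradient` and the toy flow values are assumptions, and the actual MBH descriptor additionally histograms the gradient orientations of each flow component.

```python
def spatial_gradient(field):
    """Forward differences of a 2D field along x and y.

    The last column (for x) and last row (for y) get a difference of 0.
    """
    rows, cols = len(field), len(field[0])
    dx = [[field[r][min(c + 1, cols - 1)] - field[r][c] for c in range(cols)]
          for r in range(rows)]
    dy = [[field[min(r + 1, rows - 1)][c] - field[r][c] for c in range(cols)]
          for r in range(rows)]
    return dx, dy


# Horizontal flow component: the right half of the image moves right by 1 px.
flow_u = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Same scene filmed with a panning camera: every pixel gains a flow of 3 px.
flow_u_pan = [[u + 3 for u in row] for row in flow_u]

dx, dy = spatial_gradient(flow_u)
dx_pan, dy_pan = spatial_gradient(flow_u_pan)
```

Both gradient fields are identical and respond only at the boundary between the static and moving regions, which is the property the notes refer to.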
- Liang, Xiaodan, Liang Lin, and Liangliang Cao. "Learning latent spatio-temporal compositional model for human action recognition." Proceedings of the 21st ACM international conference on Multimedia. ACM, 2013.
- Laptev STIP with HOF and HOG, with BOW quantization
- Leaf node for detecting action parts
- Or node to account for intra-class variability
- And node to aggregate action in a frame
- Root node to identify temporal composition
- Contextual interaction (connecting leaf nodes)
- Everything is formulated in a latent SVM framework and solved by CCCP
- Since the leaf node can move around from one Or-node to another, a reconfiguration step is used to rearrange the feature vector
- UCF Youtube and Olympic Sports dataset
- Sadanand, Sreemanananth, and Jason J. Corso. "Action bank: A high-level representation of activity in video." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
- 98.2% KTH, 95.0% UCF Sports, 57.9% UCF50, 26.9% HMDB51
- 205 video clips are used as templates to detect actions in novel videos.
- Detectors are sampled from multiple viewpoints and run at multiple scales
- Detector outputs are max-pooled over the spatio-temporal (ST) volume through various pooling units
- "Action Spotting" is used as the template detector
- Code online: http://www.cse.buffalo.edu/~jcorso/r/actionbank/
- Liu, Jingen, Benjamin Kuipers, and Silvio Savarese. "Recognizing human actions by attributes." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
- 22 manually selected action attributes as semantic representation
- Data Driven attributes as complementary information
- Attributes are treated as latent variables, just like the parts in a DPM model
- Accounts for class matching, attribute matching, and attribute co-occurrence.
- STIP by 1D-Gabor detector. Gradient based + BOW over ST volume
- UIUC dataset, KTH, Olympic Sports Dataset
- Niebles, Juan Carlos, Hongcheng Wang, and Li Fei-Fei. "Unsupervised learning of human action categories using spatial-temporal words." International Journal of Computer Vision 79.3 (2008): 299-318.
- Unsupervised video categorization, using pLSA and LDA
- Action Localization
- Laptev's STIP detector is too sparse compared with Dollar's
- Simple gradient based descriptors and PCA applied to reduce dimensionality --> rely on codebook to deal with invariance
- K-means with Euclidean distance metric
- pLSA or LDA on top of BOW (# topic is equal to the categories to be recognized)
- Each STIP is associated with a BOW, hence topic distribution, so it's trivial to perform Localization
- Laptev, Ivan, et al. "Learning realistic human actions from movies." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
- Annotating videos by aligning transcripts
- A movie dataset
- Space-Time interest points + HOG + HOF around a ST volume
- ST BOW. A video sequence can be segmented in multiple ways; each segmentation is called a channel
- Multi-Channel chi^2 kernel classification. Channel selection using greedy shrink
- KTH (91.8%) and Movie (18.2% ~ 53.3%) dataset
- STIP + HOG and HOF code: http://www.di.ens.fr/~laptev/download.html
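The multi-channel chi^2 kernel in the notes above can be sketched as follows, assuming it takes the form K(x, y) = exp(-sum_c D_c(x, y) / A_c), where D_c is the chi-square distance in channel c and A_c is the mean distance between training samples in that channel. The function names, the channel names, and the A_c values below are made up for illustration.

```python
import math


def chi2_distance(h1, h2):
    """Chi-square distance between two histograms (empty bins are skipped)."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)


def multichannel_chi2_kernel(x, y, mean_dists):
    """K(x, y) = exp(-sum_c D_c(x_c, y_c) / A_c).

    x and y map channel name -> BOW histogram; mean_dists maps channel
    name -> A_c, the mean chi-square distance in that channel (assumed to
    be precomputed on the training set).
    """
    total = sum(chi2_distance(x[c], y[c]) / mean_dists[c] for c in x)
    return math.exp(-total)


# Two toy videos, each described by one histogram per channel.
video_a = {"hog_1x1": [0.5, 0.5], "hof_1x1": [1.0, 0.0]}
video_b = {"hog_1x1": [0.25, 0.75], "hof_1x1": [0.0, 1.0]}
mean_dists = {"hog_1x1": 0.2, "hof_1x1": 0.5}  # hypothetical A_c values

k_ab = multichannel_chi2_kernel(video_a, video_b, mean_dists)
k_aa = multichannel_chi2_kernel(video_a, video_a, mean_dists)
```

Dividing by A_c puts channels with different typical distances on a comparable scale before they are summed. Greedy channel selection would then evaluate this kernel on different subsets of channels.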
Links to Datasets:
- "Free Viewpoint Action Recognition using Motion History Volumes (CVIU Nov./Dec. '06)."
D. Weinland, R. Ronfard, E. Boyer
- "Actions as Space-Time Shapes (ICCV '05)."
M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri
- "Recognizing Human Actions: A Local SVM Approach (ICPR '04)."
C. Schuldt, I. Laptev and B. Caputo
- "Propagation Networks for Recognizing Partially Ordered Sequential Activity (CVPR '04)."
Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa
- "Tracking Multiple Objects through Occlusions (CVPR '05)."
Y. Huang, I. Essa
- Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS - ECCV 2004)
Recent Action Recognition Papers:
- D. Weinland, R. Ronfard, E. Boyer (CVIU Nov./Dec. '06)
"Free Viewpoint Action Recognition using Motion History Volumes"
11 actors each performing 3 times 13 actions: Check Watch, Cross Arms, Scratch Head, Sit Down, Get Up, Turn Around, Walk, Wave, Punch, Kick, Point, Pick Up, Throw.
Multiple views of 5 synchronized and calibrated cameras are provided.
- A. Yilmaz, M. Shah (ICCV '05)
"Recognizing Human Actions in Videos Acquired by Uncalibrated Moving Cameras"
18 Sequences, 8 Actions: 3 x Running, 3 x Bicycling, 3 x Sitting-down, 2 x Walking, 2 x Picking-up, 1 x Waving Hands, 1 x Forehand Stroke, 1 x Backhand Stroke
- Y. Sheikh, M. Shah (ICCV '05)
"Exploring the Space of an Action for Human Action Recognition"
6 Actions: Sitting, Standing, Falling, Walking, Dancing, Running
- M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri (ICCV '05)
"Actions as Space-Time Shapes"
81 Sequences, 9 Actions, 9 People: Running, Walking, Bending, Jumping-Jack, Jumping-Forward-On-Two-Legs, Jumping-In-Place-On-Two-Legs, Galloping-Sideways, Waving-Two-Hands, Waving-One-Hand Ballet
- A. Yilmaz, M. Shah (CVPR '05)
"Action Sketch: A Novel Action Representation"
28 Sequences, 12 Actions: 7 x Walking, 4 x Aerobics, 2 x Dancing, 2 x Sit-down, 2 x Stand-up, 2 x Kicking, 2 x Surrender, 2 x Hands-down, 2 x Tennis, 1 x Falling
- E. Shechtman, M. Irani (CVPR '05)
"Space-Time Behavioral Correlation"
Walking, Diving, Jumping, Waving Arms, Waving Hands, Ballet Figure, Water Fountain
- Y. Shi, Y. Huang, D. Minnen, A. Bobick, I. Essa (CVPR '04)
"Propagation Networks for Recognition of Partially Ordered Sequential Actions"
Glucose Monitor Calibration
- C. Schuldt, I. Laptev and B. Caputo (ICPR '04)
"Recognizing Human Actions: A Local SVM Approach."
6 Actions x 25 Subjects x 4 Scenarios
- V. Parameswaran, R. Chellappa (CVPR '03)
"View Invariants for Human Action Recognition"
25 x Walk, 6 x Run, 18 x Sit-down
- D. Minnen, I. Essa, T. Starner (CVPR '03)
"Expectation Grammars: Leveraging High-Level Expectations for Activity Recognition"
Towers of Hanoi (only hands)
- A. Efros, A. Berg, G. Mori, J. Malik (ICCV '03)
"Recognizing Actions at a Distance"
Soccer, Tennis, Ballet
[4] CVPR 2014 Tutorial on Emerging Topics in Human Activity Recognition
[5] http://yangxd.org/projects/surveillance/SED13
[6] Recognition of human actions
Sample sequences for each action (DivX-compressed)
person15_walking_d1_uncomp.avi
person15_jogging_d1_uncomp.avi
person15_running_d1_uncomp.avi
person15_boxing_d1_uncomp.avi
person15_handwaving_d1_uncomp.avi
person15_handclapping_d1_uncomp.avi
Action database in zip-archives (DivX-compressed)
Note: The database is publicly available for non-commercial use. Please refer to [Schuldt, Laptev and Caputo, Proc. ICPR'04, Cambridge, UK] if you use this database in your publications.
jogging.zip (168Mb)
running.zip (149Mb)
boxing.zip (194Mb)
handwaving.zip (218Mb)
handclapping.zip (176Mb)
Related publications
"Recognizing Human Actions: A Local SVM Approach",
Christian Schuldt, Ivan Laptev and Barbara Caputo; in Proc. ICPR'04, Cambridge, UK.
"Local Spatio-Temporal Image Features for Motion Interpretation",
Ivan Laptev; PhD Thesis, 2004, Computational Vision and Active Perception Laboratory (CVAP), NADA, KTH, Stockholm
"Local Descriptors for Spatio-Temporal Recognition",
Ivan Laptev and Tony Lindeberg; ECCV Workshop "Spatial Coherence for Visual Motion Analysis"
"Velocity adaptation of space-time interest points",
Ivan Laptev and Tony Lindeberg; in Proc. ICPR'04, Cambridge, UK.
"Space-Time Interest Points",
I. Laptev and T. Lindeberg; in Proc. ICCV'03, Nice, France, pp.I:432-439.