TY - JOUR
A1 - Long, Xiang
A1 - de Melo, Gerard
A1 - He, Dongliang
A1 - Li, Fu
A1 - Chi, Zhizhen
A1 - Wen, Shilei
A1 - Gan, Chuang
T1 - Purely attention based local feature integration for video classification
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
N2 - Recently, substantial research effort has focused on how to apply CNNs or RNNs to better capture temporal patterns in videos, so as to improve the accuracy of video classification. In this paper, we investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we first propose Basic Attention Clusters (BAC), which concatenates the outputs of multiple attention units applied in parallel, and introduce a shifting operation to capture more diverse signals. Experiments show that BAC can achieve excellent results on multiple datasets. However, BAC treats all feature channels as an indivisible whole, which is suboptimal for achieving finer-grained local feature integration along the channel dimension. Additionally, it treats the entire local feature sequence as an unordered set, thus ignoring the sequential relationships. To improve over BAC, we further propose the channel pyramid attention schema, which splits features into sub-features at multiple scales for coarse-to-fine sub-feature interaction modeling, and the temporal pyramid attention schema, which divides the feature sequence into ordered sub-sequences of multiple lengths to account for the sequential order. Our final model, pyramid×pyramid attention clusters (PPAC), combines both channel pyramid attention and temporal pyramid attention to focus on the most important sub-features, while also preserving the temporal information of the video. We demonstrate the effectiveness of PPAC on seven real-world video classification datasets. Our model achieves competitive results across all of these, showing that our proposed framework can consistently outperform existing local feature integration methods across a range of different scenarios.
KW - Feature extraction
KW - Convolution
KW - Computational modeling
KW - Plugs
KW - Three-dimensional displays
KW - Task analysis
KW - Two dimensional displays
KW - Video classification
KW - action recognition
KW - attention mechanism
KW - computer vision
KW - Algorithms
KW - Neural Networks
KW - Computer
Y1 - 2020
U6 - https://doi.org/10.1109/TPAMI.2020.3029554
SN - 0162-8828
SN - 1939-3539
SN - 2160-9292
VL - 44
IS - 4
SP - 2140
EP - 2154
PB - Inst. of Electr. and Electronics Engineers
CY - Los Alamitos
ER -