TY - JOUR
A1 - Long, Xiang
A1 - de Melo, Gerard
A1 - He, Dongliang
A1 - Li, Fu
A1 - Chi, Zhizhen
A1 - Wen, Shilei
A1 - Gan, Chuang
T1 - Purely attention based local feature integration for video classification
JF - IEEE Transactions on Pattern Analysis and Machine Intelligence
N2 - Recently, substantial research effort has focused on how to apply CNNs or RNNs to better capture temporal patterns in videos, so as to improve the accuracy of video classification. In this paper, we investigate the potential of purely attention-based local feature integration. Accounting for the characteristics of such features in video classification, we first propose Basic Attention Clusters (BAC), which concatenates the outputs of multiple attention units applied in parallel, and introduce a shifting operation to capture more diverse signals. Experiments show that BAC can achieve excellent results on multiple datasets. However, BAC treats all feature channels as an indivisible whole, which is suboptimal for achieving finer-grained local feature integration along the channel dimension. Additionally, it treats the entire local feature sequence as an unordered set, thus ignoring the sequential relationships. To improve over BAC, we further propose the channel pyramid attention schema, which splits features into sub-features at multiple scales for coarse-to-fine sub-feature interaction modeling, and the temporal pyramid attention schema, which divides the feature sequence into ordered sub-sequences of multiple lengths to account for the sequential order. Our final model, pyramid×pyramid attention clusters (PPAC), combines both channel pyramid attention and temporal pyramid attention to focus on the most important sub-features, while also preserving the temporal information of the video. We demonstrate the effectiveness of PPAC on seven real-world video classification datasets. Our model achieves competitive results across all of these, showing that our proposed framework can consistently outperform existing local feature integration methods across a range of different scenarios.
KW - Feature extraction
KW - Convolution
KW - Computational modeling
KW - Plugs
KW - Three-dimensional displays
KW - Task analysis
KW - Two dimensional displays
KW - Video classification
KW - action recognition
KW - attention mechanism
KW - computer vision
KW - Algorithms
KW - Neural Networks
KW - Computer
Y1 - 2020
U6 - https://doi.org/10.1109/TPAMI.2020.3029554
SN - 0162-8828
SN - 1939-3539
SN - 2160-9292
VL - 44
IS - 4
SP - 2140
EP - 2154
PB - Inst. of Electr. and Electronics Engineers
CY - Los Alamitos
ER -