A deep semantic framework for multimodal representation learning
Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation, respectively. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
Metadata

| Field | Value |
|---|---|
| Author details | Cheng Wang, Haojin Yang, Christoph Meinel |
| DOI | https://doi.org/10.1007/s11042-016-3380-8 |
| ISSN | 1380-7501 |
| ISSN | 1573-7721 |
| Title of parent work (English) | Multimedia tools and applications : an international journal |
| Publisher | Springer |
| Place of publishing | Dordrecht |
| Publication type | Article |
| Language | English |
| Year of first publication | 2016 |
| Publication year | 2016 |
| Release date | 2020/03/22 |
| Tag | Cross-modal retrieval; Deep neural networks; Multimodal representation; Semantic feature |
| Volume | 75 |
| Number of pages | 22 |
| First page | 9255 |
| Last page | 9276 |