A deep semantic framework for multimodal representation learning
Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation, respectively. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
Metadata

| Field | Value |
|---|---|
| Author details | Cheng Wang, Haojin Yang, Christoph Meinel |
| DOI | https://doi.org/10.1007/s11042-016-3380-8 |
| ISSN | 1380-7501 |
| ISSN | 1573-7721 |
| Title of parent work (English) | Multimedia tools and applications : an international journal |
| Publisher | Springer |
| Place of publishing | Dordrecht |
| Publication type | Article |
| Language | English |
| Year of first publication | 2016 |
| Publication year | 2016 |
| Release date | 2020/03/22 |
| Tag | Cross-modal retrieval; Deep neural networks; Multimodal representation; Semantic feature |
| Volume | 75 |
| Number of pages | 22 |
| First page | 9255 |
| Last page | 9276 |