
Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Generating a novel and descriptive caption of an image is drawing increasing interest in computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM, Long Short-Term Memory) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also prove that multi-task learning is beneficial to increase model generality and gain performance. We also demonstrate that transfer learning with the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
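
For illustration, a minimal PyTorch sketch of the general architecture the abstract describes (a CNN image encoder combined with a bidirectional LSTM over caption tokens) might look like the following. The module names, layer sizes, and the way the image feature is fused with the word sequence are illustrative assumptions, not the authors' published implementation.

```python
# Minimal sketch, assuming a small CNN encoder and a single bidirectional LSTM;
# the paper uses a deep pretrained CNN and two separate LSTMs, so treat this
# only as an outline of the idea, not as the authors' method.
import torch
import torch.nn as nn


class BiLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        # Small CNN encoder standing in for a deep pretrained CNN (assumption).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM reads the caption in both directions, so each
        # position sees both history and future context.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, images, captions):
        img_feat = self.cnn(images)                # (B, embed_dim)
        words = self.word_embed(captions)          # (B, T, embed_dim)
        # Prepend the image feature as the first "token" (one simple fusion choice).
        seq = torch.cat([img_feat.unsqueeze(1), words], dim=1)
        states, _ = self.bilstm(seq)               # (B, T+1, 2*hidden_dim)
        return self.out(states[:, 1:, :])          # word scores per caption position


# Toy usage: a random image batch and integer-encoded captions.
model = BiLSTMCaptioner(vocab_size=1000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 1000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 1000])
```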

Metadata
Author details:Cheng Wang, Haojin Yang, Christoph Meinel
DOI:https://doi.org/10.1145/3115432
ISSN:1551-6857
ISSN:1551-6865
Title of parent work (English):ACM Transactions on Multimedia Computing, Communications, and Applications
Publisher:Association for Computing Machinery
Place of publishing:New York
Publication type:Article
Language:English
Date of first publication:2018/04/25
Publication year:2018
Release date:2021/12/01
Tag:Deep learning; LSTM; image captioning; multimodal representations; multi-task learning
Volume:14
Issue:2
Number of pages:20
Organizational units:An-Institute / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:0 Computer science, information & general works / 00 Computer science, knowledge & systems / 000 Computer science, information & general works
0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing; computer science
Peer review:Peer-reviewed