TY  - JOUR
A1  - Wang, Cheng
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - A deep semantic framework for multimodal representation learning
JF  - Multimedia tools and applications : an international journal
N2  - Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
KW  - Multimodal representation
KW  - Deep neural networks
KW  - Semantic feature
KW  - Cross-modal retrieval
Y1  - 2016
U6  - https://doi.org/10.1007/s11042-016-3380-8
SN  - 1380-7501
SN  - 1573-7721
VL  - 75
SP  - 9255
EP  - 9276
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Yang, Haojin
A1  - Quehl, Bernhard
A1  - Sack, Harald
T1  - A framework for improved video text detection and recognition
JF  - Multimedia tools and applications : an international journal
N2  - Text displayed in a video is an essential part of the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds and make it processable by standard OCR (Optical Character Recognition) software. The operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test data sets.
KW  - Video OCR
KW  - Video indexing
KW  - Multimedia retrieval
Y1  - 2014
U6  - https://doi.org/10.1007/s11042-012-1250-6
SN  - 1380-7501
SN  - 1573-7721
VL  - 69
IS  - 1
SP  - 217
EP  - 245
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - THES
A1  - Yang, Haojin
T1  - Automatic video indexing and retrieval using video OCR technology
Y1  - 2013
CY  - Potsdam
ER  - 
TY  - THES
A1  - Yang, Haojin
T1  - Deep representation learning for multimedia data analysis
Y1  - 2019
ER  - 
TY  - JOUR
A1  - Wang, Cheng
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning
JF  - ACM transactions on multimedia computing, communications, and applications
N2  - Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also show that multi-task learning helps to increase model generality and improve performance. We also demonstrate that, with transfer learning, the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
KW  - Deep learning
KW  - LSTM
KW  - multimodal representations
KW  - image captioning
KW  - multi-task learning
Y1  - 2018
U6  - https://doi.org/10.1145/3115432
SN  - 1551-6857
SN  - 1551-6865
VL  - 14
IS  - 2
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Bartz, Christian
A1  - Yang, Haojin
A1  - Bethge, Joseph
A1  - Meinel, Christoph
T1  - LoANs
BT  - Weakly Supervised Object Detection with Localizer Assessor Networks
T2  - Computer Vision – ACCV 2018 Workshops
N2  - Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition. The reason for this success is mainly grounded in the availability of large-scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object; the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.
Y1  - 2019
SN  - 978-3-030-21074-8
SN  - 978-3-030-21073-1
U6  - https://doi.org/10.1007/978-3-030-21074-8_29
SN  - 0302-9743
SN  - 1611-3349
VL  - 11367
SP  - 341
EP  - 356
PB  - Springer
CY  - Cham
ER  - 
TY  - JOUR
A1  - Rezaei, Mina
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation
JF  - Multimedia tools and applications : an international journal
N2  - We propose a new recurrent generative adversarial architecture named RNN-GAN to mitigate the imbalanced data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to be biased towards healthy data, which is not desired in clinical applications, and the outputs predicted by such networks have high precision and low recall. To mitigate the impact of imbalanced training data, we train RNN-GAN with the proposed complementary segmentation mask in addition to ordinary segmentation masks. The RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on the sequence of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at a pixel level, while the discriminator is trained to distinguish whether a segmentation image comes from the ground truth or from the generator network. Both the generator and the discriminator incorporate bidirectional LSTM units to enhance temporal consistency and obtain inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks, we find consistently improved results, demonstrating the efficacy of our approach.
KW  - Imbalanced medical image semantic segmentation
KW  - Recurrent generative adversarial network
Y1  - 2019
U6  - https://doi.org/10.1007/s11042-019-7305-1
SN  - 1380-7501
SN  - 1573-7721
VL  - 79
IS  - 21-22
SP  - 15329
EP  - 15348
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - GEN
A1  - Bartz, Christian
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - SEE: Towards semi-supervised end-to-end scene text recognition
T2  - Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, Thirtieth Innovative Applications of Artificial Intelligence Conference, Eighth Symposium on Educational Advances in Artificial Intelligence
N2  - Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition, that can be optimized end-to-end. Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast to this, we propose to use a single deep neural network, that learns to detect and recognize text from natural images, in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility, by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.
Y1  - 2018
SN  - 978-1-57735-800-8
VL  - 10
SP  - 6674
EP  - 6681
PB  - Association for the Advancement of Artificial Intelligence
CY  - Palo Alto
ER  -