TY  - JOUR
A1  - Wang, Cheng
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - A deep semantic framework for multimodal representation learning
JF  - Multimedia tools and applications : an international journal
N2  - Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
KW  - Multimodal representation
KW  - Deep neural networks
KW  - Semantic feature
KW  - Cross-modal retrieval
Y1  - 2016
U6  - https://doi.org/10.1007/s11042-016-3380-8
SN  - 1380-7501
SN  - 1573-7721
VL  - 75
SP  - 9255
EP  - 9276
PB  - Springer
CY  - Dordrecht
ER  - 

TY  - JOUR
A1  - Yang, Haojin
A1  - Quehl, Bernhard
A1  - Sack, Harald
T1  - A framework for improved video text detection and recognition
JF  - Multimedia tools and applications : an international journal
N2  - Text displayed in a video is an essential part of the high-level semantic information of the video content.
Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate false alarms. For text recognition, we have developed a novel skeleton-based binarization method to separate text from complex backgrounds and make it processible for standard OCR (Optical Character Recognition) software. The operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test data sets.
KW  - Video OCR
KW  - Video indexing
KW  - Multimedia retrieval
Y1  - 2014
U6  - https://doi.org/10.1007/s11042-012-1250-6
SN  - 1380-7501
SN  - 1573-7721
VL  - 69
IS  - 1
SP  - 217
EP  - 245
PB  - Springer
CY  - Dordrecht
ER  - 

TY  - THES
A1  - Yang, Haojin
T1  - Automatic video indexing and retrieval using video OCR technology
Y1  - 2013
CY  - Potsdam
ER  - 

TY  - THES
A1  - Yang, Haojin
T1  - Deep representation learning for multimedia data analysis
Y1  - 2019
ER  - 

TY  - JOUR
A1  - Wang, Cheng
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning
JF  - ACM transactions on multimedia computing, communications, and applications
N2  - Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem.
By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also show that multi-task learning is beneficial for increasing model generality and improving performance. We also demonstrate that transfer learning of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
KW  - Deep learning
KW  - LSTM
KW  - multimodal representations
KW  - image captioning
KW  - multi-task learning
Y1  - 2018
U6  - https://doi.org/10.1145/3115432
SN  - 1551-6857
SN  - 1551-6865
VL  - 14
IS  - 2
PB  - Association for Computing Machinery
CY  - New York
ER  - 

TY  - GEN
A1  - Bartz, Christian
A1  - Yang, Haojin
A1  - Bethge, Joseph
A1  - Meinel, Christoph
T1  - LoANs
BT  - Weakly Supervised Object Detection with Localizer Assessor Networks
T2  - Computer Vision – ACCV 2018 Workshops
N2  - Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition.
The reason for this success is mainly grounded in the availability of large-scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object; the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.
Y1  - 2019
SN  - 978-3-030-21074-8
SN  - 978-3-030-21073-1
U6  - https://doi.org/10.1007/978-3-030-21074-8_29
SN  - 0302-9743
SN  - 1611-3349
VL  - 11367
SP  - 341
EP  - 356
PB  - Springer
CY  - Cham
ER  - 

TY  - JOUR
A1  - Rezaei, Mina
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation
JF  - Multimedia tools and applications : an international journal
N2  - We propose a new recurrent generative adversarial architecture named RNN-GAN to mitigate the imbalanced data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to be biased towards healthy data, which is undesirable in clinical applications, and the outputs predicted by such networks have high precision and low recall.
To mitigate the impact of imbalanced training data, we train RNN-GAN with the proposed complementary segmentation masks in addition to ordinary segmentation masks. The RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on the sequence of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at the pixel level, while the discriminator is trained to distinguish whether a segmentation image comes from the ground truth or from the generator network. Both the generator and the discriminator are substituted with bidirectional LSTM units to enhance temporal consistency and obtain inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks we find consistently improved results, demonstrating the efficacy of our approach.
KW  - Imbalanced medical image semantic segmentation
KW  - Recurrent generative adversarial network
Y1  - 2019
U6  - https://doi.org/10.1007/s11042-019-7305-1
SN  - 1380-7501
SN  - 1573-7721
VL  - 79
IS  - 21-22
SP  - 15329
EP  - 15348
PB  - Springer
CY  - Dordrecht
ER  - 

TY  - GEN
A1  - Bartz, Christian
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - SEE: Towards semi-supervised end-to-end scene text recognition
T2  - Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, Thirtieth Innovative Applications of Artificial Intelligence Conference, Eighth Symposium on Educational Advances in Artificial Intelligence
N2  - Detecting and recognizing text in natural scene images is a challenging, yet not completely solved task. In recent years, several new systems that try to solve at least one of the two sub-tasks (text detection and text recognition) have been proposed. In this paper we present SEE, a step towards semi-supervised neural networks for scene text detection and recognition that can be optimized end-to-end.
Most existing works consist of multiple deep neural networks and several pre-processing steps. In contrast, we propose to use a single deep neural network that learns to detect and recognize text from natural images in a semi-supervised way. SEE is a network that integrates and jointly learns a spatial transformer network, which can learn to detect text regions in an image, and a text recognition network that takes the identified text regions and recognizes their textual content. We introduce the idea behind our novel approach and show its feasibility by performing a range of experiments on standard benchmark datasets, where we achieve competitive results.
Y1  - 2018
SN  - 978-1-57735-800-8
VL  - 10
SP  - 6674
EP  - 6681
PB  - Association for the Advancement of Artificial Intelligence
CY  - Palo Alto
ER  - 