TY - JOUR
A1 - Rezaei, Mina
A1 - Yang, Haojin
A1 - Meinel, Christoph
T1 - Recurrent generative adversarial network for learning imbalanced medical image semantic segmentation
JF - Multimedia tools and applications : an international journal
N2 - We propose a new recurrent generative adversarial architecture named RNN-GAN to mitigate the imbalanced data problem in medical image semantic segmentation, where the number of pixels belonging to the desired object is significantly lower than the number belonging to the background. A model trained with imbalanced data tends to be biased towards healthy data, which is not desired in clinical applications, and the outputs predicted by such networks have high precision but low recall. To mitigate the impact of imbalanced training data, we train RNN-GAN with the proposed complementary segmentation masks in addition to ordinary segmentation masks. The RNN-GAN consists of two components: a generator and a discriminator. The generator is trained on sequences of medical images to learn the corresponding segmentation label map plus the proposed complementary label, both at the pixel level, while the discriminator is trained to distinguish whether a segmentation image comes from the ground truth or from the generator network. Both the generator and the discriminator use bidirectional LSTM units to enhance temporal consistency and to obtain inter- and intra-slice representations of the features. We show evidence that the proposed framework is applicable to different types of medical images of varied sizes. In our experiments on the ACDC-2017, HVSMR-2016, and LiTS-2017 benchmarks we find consistently improved results, demonstrating the efficacy of our approach.
KW - Imbalanced medical image semantic segmentation
KW - Recurrent generative adversarial network
Y1 - 2019
U6 - https://doi.org/10.1007/s11042-019-7305-1
SN - 1380-7501
SN - 1573-7721
VL - 79
IS - 21-22
SP - 15329
EP - 15348
PB - Springer
CY - Dordrecht
ER -
TY - JOUR
A1 - Wang, Cheng
A1 - Yang, Haojin
A1 - Meinel, Christoph
T1 - Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning
JF - ACM transactions on multimedia computing, communications, and applications
N2 - Generating a novel and descriptive caption of an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting when training deep models. To understand how our models "translate" an image to a sentence, we visualize and qualitatively analyze the evolution of the Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model).
Our experiments also show that multi-task learning is beneficial for increasing model generality and improving performance. We also demonstrate that transfer learning of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
KW - Deep learning
KW - LSTM
KW - multimodal representations
KW - image captioning
KW - multi-task learning
Y1 - 2018
U6 - https://doi.org/10.1145/3115432
SN - 1551-6857
SN - 1551-6865
VL - 14
IS - 2
PB - Association for Computing Machinery
CY - New York
ER -
TY - JOUR
A1 - Yang, Haojin
A1 - Quehl, Bernhard
A1 - Sack, Harald
T1 - A framework for improved video text detection and recognition
JF - Multimedia tools and applications : an international journal
N2 - Text displayed in a video is an essential part of the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with a high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds and make it processable for standard OCR (Optical Character Recognition) software. The operability and accuracy of the proposed text detection and binarization methods have been evaluated using publicly available test data sets.
KW - Video OCR
KW - Video indexing
KW - Multimedia retrieval
Y1 - 2014
U6 - https://doi.org/10.1007/s11042-012-1250-6
SN - 1380-7501
SN - 1573-7721
VL - 69
IS - 1
SP - 217
EP - 245
PB - Springer
CY - Dordrecht
ER -
TY - JOUR
A1 - Wang, Cheng
A1 - Yang, Haojin
A1 - Meinel, Christoph
T1 - A deep semantic framework for multimodal representation learning
JF - Multimedia tools and applications : an international journal
N2 - Multimodal representation learning has gained increasing importance in various real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. Canonical Correlation Analysis (CCA). These works neglected the exploration of fusing multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation, respectively. In joint model learning, a 5-layer neural network is designed and enforced with supervised pre-training in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
KW - Multimodal representation
KW - Deep neural networks
KW - Semantic feature
KW - Cross-modal retrieval
Y1 - 2016
U6 - https://doi.org/10.1007/s11042-016-3380-8
SN - 1380-7501
SN - 1573-7721
VL - 75
SP - 9255
EP - 9276
PB - Springer
CY - Dordrecht
ER -