TY  - JOUR
A1  - Yang, Haojin
A1  - Quehl, Bernhard
A1  - Sack, Harald
T1  - A framework for improved video text detection and recognition
JF  - Multimedia tools and applications : an international journal
N2  - Text displayed in a video is an essential part for the high-level semantic information of the video content. Therefore, video text can be used as a valuable source for automated video indexing in digital video libraries. In this paper, we propose a workflow for video text detection and recognition. In the text detection stage, we have developed a fast localization-verification scheme, in which an edge-based multi-scale text detector first identifies potential text candidates with high recall rate. Then, detected candidate text lines are refined by using an image entropy-based filter. Finally, Stroke Width Transform (SWT)- and Support Vector Machine (SVM)-based verification procedures are applied to eliminate the false alarms. For text recognition, we have developed a novel skeleton-based binarization method in order to separate text from complex backgrounds to make it processible for standard OCR (Optical Character Recognition) software. Operability and accuracy of proposed text detection and binarization methods have been evaluated by using publicly available test data sets.
KW  - Video OCR
KW  - Video indexing
KW  - Multimedia retrieval
Y1  - 2014
U6  - https://doi.org/10.1007/s11042-012-1250-6
SN  - 1380-7501
SN  - 1573-7721
VL  - 69
IS  - 1
SP  - 217
EP  - 245
PB  - Springer
CY  - Dordrecht
ER  - 
TY  - JOUR
A1  - Wang, Cheng
A1  - Yang, Haojin
A1  - Meinel, Christoph
T1  - Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning
JF  - ACM transactions on multimedia computing, communications, and applications
N2  - Generating a novel and descriptive caption of an image is drawing increasing interests in computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional LSTM (Bi-LSTM (Long Short-Term Memory)) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information at high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K datasets. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also prove that multi-task learning is beneficial to increase model generality and gain performance. We also demonstrate the performance of transfer learning of the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
KW  - Deep learning
KW  - LSTM
KW  - multimodal representations
KW  - image captioning
KW  - mutli-task learning
Y1  - 2018
U6  - https://doi.org/10.1145/3115432
SN  - 1551-6857
SN  - 1551-6865
VL  - 14
IS  - 2
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Bartz, Christian
A1  - Yang, Haojin
A1  - Bethge, Joseph
A1  - Meinel, Christoph
T1  - LoANs
BT  - Weakly Supervised Object Detection with Localizer Assessor Networks
T2  - Computer Vision – ACCV 2018 Workshops
N2  - Recently, deep neural networks have achieved remarkable performance on the task of object detection and recognition. The reason for this success is mainly grounded in the availability of large scale, fully annotated datasets, but the creation of such a dataset is a complicated and costly task. In this paper, we propose a novel method for weakly supervised object detection that simplifies the process of gathering data for training an object detector. We train an ensemble of two models that work together in a student-teacher fashion. Our student (localizer) is a model that learns to localize an object, the teacher (assessor) assesses the quality of the localization and provides feedback to the student. The student uses this feedback to learn how to localize objects and is thus entirely supervised by the teacher, as we are using no labels for training the localizer. In our experiments, we show that our model is very robust to noise and reaches competitive performance compared to a state-of-the-art fully supervised approach. We also show the simplicity of creating a new dataset, based on a few videos (e.g. downloaded from YouTube) and artificially generated data.
Y1  - 2019
SN  - 978-3-030-21074-8
SN  - 978-3-030-21073-1
U6  - https://doi.org/10.1007/978-3-030-21074-8_29
SN  - 0302-9743
SN  - 1611-3349
VL  - 11367
SP  - 341
EP  - 356
PB  - Springer
CY  - Cham
ER  -