VERGE comprises several modules, each of which integrates certain technologies.
A brief description of the technologies used in each module follows.
Automatic Query Formulation and Expansion using High-Level Concepts
Version 16.0.
This module formulates and expands an input query in order to translate it into
a set of high-level concepts.
It splits the textual query into a number of elementary subqueries using POS tagging and task-specific NLP rules
and translates them into high-level concepts from a predefined pool of concepts using the Explicit
Semantic Analysis (ESA) measure.
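To make the mapping concrete, the following is a minimal sketch, assuming a small hypothetical concept pool and using NLTK POS tagging with a TF-IDF cosine similarity as a simple stand-in for the ESA measure; the task-specific NLP rules are omitted.

```python
# Sketch: split a query into noun-based subqueries with POS tagging, then map
# each subquery to the closest high-level concepts by cosine similarity.
# The concept pool and the TF-IDF stand-in for ESA are illustrative assumptions.
# Requires one-off nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CONCEPT_POOL = {                      # hypothetical concept descriptions
    "Beach": "sand sea coast waves shore swimming",
    "Car": "vehicle road driving automobile traffic",
    "Crowd": "people group gathering street audience",
}

def subqueries(query):
    """Keep noun tokens as elementary subqueries (task-specific rules omitted)."""
    tokens = nltk.pos_tag(nltk.word_tokenize(query))
    return [w for w, tag in tokens if tag.startswith("NN")]

def map_to_concepts(query, top_k=2):
    names = list(CONCEPT_POOL)
    vec = TfidfVectorizer().fit(CONCEPT_POOL.values())
    concept_mat = vec.transform(CONCEPT_POOL.values())
    hits = []
    for sub in subqueries(query):
        sims = cosine_similarity(vec.transform([sub]), concept_mat)[0]
        ranked = sorted(zip(names, sims), key=lambda x: -x[1])[:top_k]
        hits.append((sub, ranked))
    return hits

print(map_to_concepts("people swimming at the sea"))
```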
Video clustering
Version 16.0.
This module clusters videos by using the visual features of
their keyframes. Specifically, for each video, we retrieve the top-N most similar
keyframes of each video keyframe. Then, a simple majority vote algorithm is
applied, which counts the frequencies of the returned videos. Subsequently, the
frequency values are normalized per video, and we consider as similar to a video
the top-M videos that are linked with a similarity value exceeding a pre-defined
threshold.
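A minimal sketch of this voting scheme follows, assuming the keyframe descriptors are already available as NumPy arrays and using scikit-learn's NearestNeighbors in place of the actual retrieval index; the parameter values are illustrative.

```python
# Sketch of the keyframe-based video clustering step: for every keyframe of a
# video, retrieve the top-N most similar dataset keyframes, vote for the videos
# they belong to, normalise the counts and keep the top-M videos above a threshold.
# Feature extraction and the real index are assumed to exist elsewhere.
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def similar_videos(video_id, keyframes, owner, index_feats, N=10, M=5, thr=0.2):
    # keyframes: (n, d) descriptors of this video's keyframes
    # owner[i]: id of the video that dataset keyframe i belongs to
    nn = NearestNeighbors(n_neighbors=N).fit(index_feats)
    _, idx = nn.kneighbors(keyframes)
    votes = Counter(owner[j] for row in idx for j in row if owner[j] != video_id)
    if not votes:
        return []
    total = sum(votes.values())                 # normalise frequencies per video
    scored = [(v, c / total) for v, c in votes.items() if c / total > thr]
    return sorted(scored, key=lambda x: -x[1])[:M]
```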
Multimodal Video Fusion
Version 16.0.
This module considers two modalities: visual concepts and video textual metadata.
In order to retrieve the visual concepts representing each video, we consider the top-K concepts of each video keyframe.
Then, the Jaccard Similarity Coefficient is used for finding similar videos for each modality. Eventually, the two modalities
are merged by ranking higher those videos that appear high in both lists.
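A minimal sketch of the fusion follows, assuming each video is described by a set of concept labels and a set of metadata terms; since the exact merging rule is not detailed, a simple rank-sum over the two Jaccard-ranked lists is used for illustration.

```python
# Sketch: rank videos per modality by Jaccard similarity to the query video,
# then merge the two lists so that videos ranked high in both come first.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_by_modality(query_items, collection):
    # collection: {video_id: item_set}; returns video ids sorted by similarity
    return sorted(collection, key=lambda v: -jaccard(query_items, collection[v]))

def fuse(query_concepts, query_meta, concept_sets, meta_sets):
    # assumes both dictionaries cover the same video ids
    r1 = rank_by_modality(query_concepts, concept_sets)
    r2 = rank_by_modality(query_meta, meta_sets)
    # videos that appear high in both lists receive a small combined rank
    return sorted(r1, key=lambda v: r1.index(v) + r2.index(v))
```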
Visual Similarity Search by Example Module
Two techniques have been applied for retrieving visually similar images.
In both techniques, during the query phase a feature vector, defined by each technique, is extracted from the query
image, and a distance metric is computed between the query image descriptor and the descriptors of the dataset images.
The dataset images are then ranked according to their level of similarity with the query example.
The techniques applied for retrieving visually similar images are the following:
Version 14.0.
In this version, the output of a pre-trained DCNN on 1000 ImageNet categories is used as a feature vector describing each image. Similar to Version 13.0, Nearest Neighbour search is realized using the Asymmetric Distance Computation, and a web service is
implemented in order to accelerate the querying process.
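A minimal sketch of the descriptor extraction follows, using torchvision's ResNet-50 purely as an illustrative pre-trained DCNN (the network actually used is not specified here); the 1000-dimensional class-score vector is taken as the image feature.

```python
# Sketch: use the 1000-d output of an ImageNet-trained network as an image
# descriptor. Requires torchvision >= 0.13 for the weights enum.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def descriptor(path):
    with torch.no_grad():
        logits = model(prep(Image.open(path).convert("RGB")).unsqueeze(0))
    return torch.softmax(logits, dim=1).squeeze(0).numpy()   # 1000-d vector
```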
Version 13.0.
In this version, SURF features are extracted, K-Means clustering is applied
on the database vectors in order to acquire the visual vocabulary, and VLAD encoding
is used for representing the images.
For Nearest Neighbour search, we build an IVFADC index for the database vectors and
then compute the k-Nearest Neighbours of the query vector. Search is realized by
combining an inverted file system with the Asymmetric Distance Computation
(ADC). Finally, a web service is implemented in order to accelerate the querying
process.
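A minimal sketch of the IVFADC search follows, using the faiss library; the dimensionality, the random placeholder data and the index parameters (number of inverted lists and sub-quantizers) are assumptions for illustration.

```python
# Sketch of an IVFADC index (inverted file + Asymmetric Distance Computation).
import numpy as np
import faiss

d = 128                                             # dimensionality of the vectors
xb = np.random.rand(10000, d).astype("float32")     # database vectors (placeholder)
xq = np.random.rand(5, d).astype("float32")         # query vectors (placeholder)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)  # 256 lists, 16 sub-quantizers, 8 bits
index.train(xb)
index.add(xb)
index.nprobe = 8                                    # number of inverted lists to visit
distances, ids = index.search(xq, 10)               # k-nearest neighbours via ADC
```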
Version 9.0-12.0.
MPEG-7 based technique
This technique captures global information such as colour and texture. Two descriptor combinations were selected to represent
the visual content: 1) ColorLayout and EdgeHistogram, and 2) ColorLayout and ColorStructure.
SIFT based technique
This technique captures local information by applying Lowe's SIFT transform. The method adopted is an implementation
of the bag-of-visual-words approach, where a large amount of local descriptors (training data) is used for learning the
visual words, which are extracted by applying clustering with a fixed number of clusters.
In both cases, an R-tree structure is constructed to support efficient indexing and fast retrieval.
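For the SIFT-based technique, the following is a minimal sketch of the bag-of-visual-words pipeline, using OpenCV and scikit-learn for illustration; the vocabulary size is arbitrary and the R-tree indexing step is omitted.

```python
# Sketch: SIFT descriptors are clustered into a fixed-size vocabulary and each
# image becomes a (normalised) histogram of visual-word occurrences.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(training_paths, n_words=500):
    data = np.vstack([sift_descriptors(p) for p in training_paths])
    return KMeans(n_clusters=n_words, n_init=5).fit(data)

def bow_vector(path, vocab):
    words = vocab.predict(sift_descriptors(path).astype(np.float64))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() or 1.0)
```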
Textual Information Processing Module
The textual query module attempts to exploit the shot audio information. This audio information is processed off-line
with the application of Automatic Speech Recognition and Machine Translation to the initial video, so that specific
sets of keywords can be assigned to each shot.
The textual processing module differs between versions, so we briefly describe the basics of each version.
Version 14.0.
In this version, text indexing is realized by the Apache Lucene search platform, which
allows full-text search and enables fast retrieval as well as easy formulation of
complicated queries.
In addition, DBpedia Spotlight is used to produce DBpedia entities/concepts; indexing in
this case is also realized by Apache Lucene.
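As an illustration of the entity annotation step, the sketch below queries a DBpedia Spotlight REST endpoint; the public demo URL and the response fields used here are assumptions, and a local Spotlight deployment would be called in the same way.

```python
# Sketch: annotate shot text with DBpedia Spotlight over its REST interface.
import requests

def spotlight_entities(text, confidence=0.5,
                       endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    resp = requests.get(endpoint,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    # return the DBpedia resource URIs found in the text
    return [r.get("@URI") for r in resp.json().get("Resources", [])]

print(spotlight_entities("A man rides a bicycle through Amsterdam."))
```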
Version 9.0.
In this version, the text-retrieval code base is re-written as a set of Perl modules and the full-text retrieval
engine is migrated from KinoSearch to the Lemur Toolkit, which is an open-source framework designed to facilitate research
in language modeling and information retrieval. Term weights for each keyword are still computed using the BM25 text algorithm.
Moreover, the query expansion based on WordNet synsets as well as the concept suggestions introduced in version 8.0
are preserved, but a semantic similarity measure that calculates the similarity between query terms and concept terms
(broader, narrower and related terms) is also introduced, in order to achieve better results with the query expansion and concept suggestion functions.
Version 8.0.
In this version, a full-text retrieval engine using KinoSearch, which is a Perl search engine library based on Lucene,
is introduced to allow indexing and query functions. Term weights for each keyword are computed using the BM25 text algorithm.
Moreover, the system's recall is boosted by using query expansion, which is implemented by generating a list of synonyms for
each query term based on WordNet synsets. Finally, to assist the user in subsequent query iteration tasks,
traditional thesaurus-style concept suggestions are generated by mapping WordNet hypernyms to broader terms and hyponyms
to narrower terms.
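A minimal sketch of the expansion and suggestion step follows, using NLTK's WordNet interface for illustration (the WordNet corpus must be downloaded once via nltk.download('wordnet')).

```python
# Sketch: synonyms come from the synsets of a query term, broader terms from
# its hypernyms and narrower terms from its hyponyms.
from nltk.corpus import wordnet as wn

def expand(term):
    synsets = wn.synsets(term)
    synonyms = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
    broader = {l.name().replace("_", " ")
               for s in synsets for h in s.hypernyms() for l in h.lemmas()}
    narrower = {l.name().replace("_", " ")
                for s in synsets for h in s.hyponyms() for l in h.lemmas()}
    return {"synonyms": synonyms, "broader": broader, "narrower": narrower}

print(expand("car"))
```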
Version 7.0.
In this version, the text algorithm employed by the module is the BM25 algorithm, which incorporates both
normalised document length and term frequency. The module is further capable of providing related keywords
to the searcher by processing the associated text of the initial results and eventually extracting the most
frequent keywords. In that way the module receives feedback from the results and suggests additional input to
the user for submitting similar queries.
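For reference, below is a minimal sketch of the Okapi BM25 weighting described above, with documents represented as plain keyword lists and the common default parameters k1 = 1.2 and b = 0.75; this is the textbook formula rather than the module's exact implementation.

```python
# Sketch: BM25 combines term frequency with normalised document length.
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)     # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                       # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```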
High Level Concept Retrieval Module
Version 14.0.
Similar to Version 13.0, a two-layer concept detection system is employed. The
difference compared to the previous version is the independent concept detectors used.
Specifically, each image is represented using one or more types of features based on
three different pre-trained convolutional neural networks (CNN): i) the 16-layer
deep ConvNet network, ii) the 22-layer GoogLeNet network and iii) the 8-layer
CaffeNet network. Each of these networks is applied on the keyframes and the output
of one or more layers is used as a feature. The CNN-based feature vectors serve
as input to Support Vector Machine classifiers. The output of the classifiers trained
for the same concept is combined using a cascade of classifiers. Again, fusion of
scores is realized in the second layer.
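A minimal sketch of the classification step follows, assuming the per-network feature matrices are already extracted; one calibrated linear SVM is trained per feature type, and the per-network scores for the same concept are fused here by simple averaging, whereas the module described above uses a cascade of classifiers.

```python
# Sketch: one SVM per concept and per CNN feature type; the probabilistic
# outputs of the per-network classifiers are averaged into a concept score.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def train_concept_detectors(features_per_network, labels):
    # features_per_network: list of (n_samples, d_i) arrays, one per CNN
    return [CalibratedClassifierCV(LinearSVC()).fit(X, labels)
            for X in features_per_network]

def concept_score(detectors, features_per_network):
    # average the positive-class probabilities of the per-network classifiers
    probs = [clf.predict_proba(X)[:, 1]
             for clf, X in zip(detectors, features_per_network)]
    return np.mean(probs, axis=0)
```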
Version 13.0.
A two-layer concept detection system is employed. The first layer
builds multiple independent concept detectors. Each image is represented using one or
more types of appropriate local descriptors (e.g. SIFT, RGB-SIFT, SURF, ORB etc.).
All the local descriptors are compacted using PCA and are subsequently aggregated
using the VLAD encoding. These VLAD vectors are compressed by applying a
modification of the random projection matrix and serve as input to Logistic
Regression (LR) classifiers. The output of the classifiers is combined by means of late
fusion. When different descriptors are combined, again late fusion is performed by
averaging the classifier output scores. In the second layer of the stacking
architecture, the fused scores from the first layer are aggregated into model vectors and
refined by applying a multi-label learning algorithm and temporal re-ranking.
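A minimal sketch of the VLAD aggregation used in the first layer follows, assuming the local descriptors are already extracted and compacted; the PCA and random-projection compression steps are omitted and the vocabulary size is arbitrary.

```python
# Sketch: assign local descriptors to a small visual vocabulary, accumulate the
# residuals to each centre, then power- and L2-normalise the concatenation.
import numpy as np
from sklearn.cluster import KMeans

def train_vlad_vocabulary(local_descriptors, k=64):
    return KMeans(n_clusters=k, n_init=5).fit(local_descriptors)

def vlad_encode(descriptors, vocab):
    k, d = vocab.n_clusters, descriptors.shape[1]
    assignments = vocab.predict(descriptors)
    vlad = np.zeros((k, d))
    for centre in range(k):
        assigned = descriptors[assignments == centre]
        if len(assigned):
            vlad[centre] = (assigned - vocab.cluster_centers_[centre]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))        # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm else vlad                # L2 normalisation
```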
Version 9.0-12.0.
This module provides high-level concept (e.g. animal, landscape, outdoor, etc.) selection for the user.
After an off-line preprocessing step, the images are sorted based on similarity coefficients for each concept.
The procedure required for the extraction of high-level concept information is based on a combination
of MPEG-7 and SIFT-based global image features.
A set of SVM classifiers is used to create classification models for the MPEG-7 and BoW features, using
the first half of the development set for the training procedure. The output of the classification is the
Degree of Confidence (DoC), by which the query may be classified in a particular concept. After the classification
process for both MPEG-7 and BoW features is completed, the respective DoCs are used as input to a second SVM.
This stage-2 classification uses the second half of the development set and self-optimized parameters for training,
in order to create a classification model. The results from this 2-stage SVM on the testing set are sorted by
DoC, and the 2000 highest-ranked are indexed in the database to support the concept retrieval.
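A minimal sketch of the 2-stage SVM procedure follows, assuming the MPEG-7 and BoW feature matrices of the development set are given; the probability output of scikit-learn's SVC stands in for the Degree of Confidence, and the parameter self-optimisation is omitted.

```python
# Sketch: stage-1 SVMs (per feature type) are trained on the first half of the
# development set; their DoCs on the second half train the stage-2 SVM.
import numpy as np
from sklearn.svm import SVC

def train_two_stage(mpeg7_feats, bow_feats, labels):
    half = len(labels) // 2
    svm_mpeg7 = SVC(probability=True).fit(mpeg7_feats[:half], labels[:half])
    svm_bow = SVC(probability=True).fit(bow_feats[:half], labels[:half])
    doc = np.column_stack([svm_mpeg7.predict_proba(mpeg7_feats[half:])[:, 1],
                           svm_bow.predict_proba(bow_feats[half:])[:, 1]])
    stage2 = SVC(probability=True).fit(doc, labels[half:])
    return svm_mpeg7, svm_bow, stage2
```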
Clustering Module
This module clusters images into color classes using Self Organizing Maps, in order to offer
image visualization and fast browsing. Shot representation is realized using the mean
reciprocal distance to the most red, green and blue shots. Eventually, the image
collection is represented as a color map, using the most representative image per color
class.
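A minimal sketch of the SOM-based color clustering follows, using the MiniSom library and representing each shot simply by its mean RGB value rather than by the mean reciprocal distance representation described above; the grid size and training parameters are illustrative.

```python
# Sketch: train a small Self Organizing Map on per-shot color features and
# assign each shot to the color class of its best matching unit.
from minisom import MiniSom

def colour_map(shot_rgb_features, grid=(4, 4)):
    # shot_rgb_features: (n_shots, 3) array of mean RGB values in [0, 1]
    som = MiniSom(grid[0], grid[1], 3, sigma=1.0, learning_rate=0.5)
    som.train_random(shot_rgb_features, 1000)
    return [som.winner(x) for x in shot_rgb_features]   # (row, col) class per shot
```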
Object-based Visual Module
This module is based on the Bag-of-Words (BoW) model. The Fast Hessian detector and
SIFT descriptors are extracted from the images; the detected features are then
randomly sampled and clustered using Repeated Bisecting K-Means, and a
2-layer visual vocabulary is constructed. An inverted index is built using the Apache
Lucene software for fast online search of the image database BoW vectors. The
similarity score is obtained based on Lucene’s scoring function and the ranking
position of the frame in the retrieved list.
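A minimal sketch of the inverted-index search over BoW vectors follows, with a plain Python dictionary standing in for the Apache Lucene index and a dot product standing in for Lucene's scoring function.

```python
# Sketch: treat each visual-word id as a "term" in an inverted index, so only
# images sharing words with the query are scored.
from collections import defaultdict

def build_inverted_index(bow_vectors):
    # bow_vectors: {image_id: {word_id: weight}}
    index = defaultdict(list)
    for image_id, vec in bow_vectors.items():
        for word_id, weight in vec.items():
            index[word_id].append((image_id, weight))
    return index

def search(query_vec, index, top_k=10):
    scores = defaultdict(float)
    for word_id, q_weight in query_vec.items():
        for image_id, weight in index.get(word_id, []):
            scores[image_id] += q_weight * weight
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```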