VERGE comprises several modules, each of which integrates certain technologies.
A brief description of the technologies used in each module follows.
Automatic Query Formulation and Expansion using High-Level Concepts
Version 16.0.
This module formulates and expands an input query in order to translate it into
a set of high-level concepts.
It splits the textual query into a number of elementary subqueries using POS tagging and task-specific NLP rules
and translates them into high-level concepts from a predefined pool of concepts using the Explicit
Semantic Analysis (ESA) measure.
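To make the mapping concrete, the following is a minimal sketch, assuming a small hypothetical concept pool and using NLTK POS tagging with a TF-IDF cosine similarity as a simple stand-in for the ESA measure; the task-specific NLP rules are omitted.

```python
# Sketch: split a query into noun-based subqueries with POS tagging, then map
# each subquery to the closest high-level concepts by cosine similarity.
# The concept pool and the TF-IDF stand-in for ESA are illustrative assumptions.
# Requires one-off nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

CONCEPT_POOL = {                      # hypothetical concept descriptions
    "Beach": "sand sea coast waves shore swimming",
    "Car": "vehicle road driving automobile traffic",
    "Crowd": "people group gathering street audience",
}

def subqueries(query):
    """Keep noun tokens as elementary subqueries (task-specific rules omitted)."""
    tokens = nltk.pos_tag(nltk.word_tokenize(query))
    return [w for w, tag in tokens if tag.startswith("NN")]

def map_to_concepts(query, top_k=2):
    names = list(CONCEPT_POOL)
    vec = TfidfVectorizer().fit(CONCEPT_POOL.values())
    concept_mat = vec.transform(CONCEPT_POOL.values())
    hits = []
    for sub in subqueries(query):
        sims = cosine_similarity(vec.transform([sub]), concept_mat)[0]
        ranked = sorted(zip(names, sims), key=lambda x: -x[1])[:top_k]
        hits.append((sub, ranked))
    return hits

print(map_to_concepts("people swimming at the sea"))
```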
Video clustering
Version 16.0.
This module clusters videos by using the visual features of
their keyframes. Specifically, for each video, we retrieve the top-N most similar
keyframes of each video keyframe. Then, a simple majority vote algorithm is
applied, which counts the frequencies of the returned videos. Subsequently, the
frequency values are normalized per video, and we consider as similar to a video
the top-M videos that are linked with a similarity value exceeding a pre-defined
threshold.
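A minimal sketch of this voting scheme follows, assuming the keyframe descriptors are already available as NumPy arrays and using scikit-learn's NearestNeighbors in place of the actual retrieval index; the parameter values are illustrative.

```python
# Sketch of the keyframe-based video clustering step: for every keyframe of a
# video, retrieve the top-N most similar dataset keyframes, vote for the videos
# they belong to, normalise the counts and keep the top-M videos above a threshold.
# Feature extraction and the real index are assumed to exist elsewhere.
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def similar_videos(video_id, keyframes, owner, index_feats, N=10, M=5, thr=0.2):
    # keyframes: (n, d) descriptors of this video's keyframes
    # owner[i]: id of the video that dataset keyframe i belongs to
    nn = NearestNeighbors(n_neighbors=N).fit(index_feats)
    _, idx = nn.kneighbors(keyframes)
    votes = Counter(owner[j] for row in idx for j in row if owner[j] != video_id)
    if not votes:
        return []
    total = sum(votes.values())                 # normalise frequencies per video
    scored = [(v, c / total) for v, c in votes.items() if c / total > thr]
    return sorted(scored, key=lambda x: -x[1])[:M]
```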
Multimodal Video Fusion
Version 16.0.
This module considers two modalities: visual concepts and video textual metadata.
In order to retrieve the visual concepts representing each video, we consider the top-K concepts of each video keyframe.
Then, the Jaccard Similarity Coefficient is used for finding similar videos for each modality. Eventually, the two modalities
are merged by ranking higher those videos that appear high in both lists.
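A minimal sketch of the fusion follows, assuming each video is described by a set of concept labels and a set of metadata terms; since the exact merging rule is not detailed, a simple rank-sum over the two Jaccard-ranked lists is used for illustration.

```python
# Sketch: rank videos per modality by Jaccard similarity to the query video,
# then merge the two lists so that videos ranked high in both come first.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_by_modality(query_items, collection):
    # collection: {video_id: item_set}; returns video ids sorted by similarity
    return sorted(collection, key=lambda v: -jaccard(query_items, collection[v]))

def fuse(query_concepts, query_meta, concept_sets, meta_sets):
    # assumes both dictionaries cover the same video ids
    r1 = rank_by_modality(query_concepts, concept_sets)
    r2 = rank_by_modality(query_meta, meta_sets)
    # videos that appear high in both lists receive a small combined rank
    return sorted(r1, key=lambda v: r1.index(v) + r2.index(v))
```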
Visual Similarity Search by Example Module
Two techniques have been applied for retrieving visually similar images.
In both techniques, during the query phase a feature vector, defined by each technique, is extracted from the query
image, and a distance metric is computed between the query image descriptor and the descriptors of the dataset images.
The dataset images are then ranked according to their level of similarity with the query example.
The techniques applied for retrieving visually similar images are the following:
Version 14.0.
In this version, the output of a pre-trained DCNN on 1000 ImageNet categories is used as a feature vector describing each image. Similar to Version 13.0, Nearest Neighbour search is realized using the Asymmetric Distance Computation, and a web service is
implemented in order to accelerate the querying process.
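A minimal sketch of the descriptor extraction follows, using torchvision's ResNet-50 purely as an illustrative pre-trained DCNN (the network actually used is not specified here); the 1000-dimensional class-score vector is taken as the image feature.

```python
# Sketch: use the 1000-d output of an ImageNet-trained network as an image
# descriptor. Requires torchvision >= 0.13 for the weights enum.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
prep = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def descriptor(path):
    with torch.no_grad():
        logits = model(prep(Image.open(path).convert("RGB")).unsqueeze(0))
    return torch.softmax(logits, dim=1).squeeze(0).numpy()   # 1000-d vector
```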
Version 13.0.
In this version, SURF features are extracted, K-Means clustering is applied
on the database vectors in order to acquire the visual vocabulary, and VLAD encoding
is used for representing the images.
For Nearest Neighbour search, we build an IVFADC index for the database vectors and
then compute the k-Nearest Neighbours of the query vector. Search is realized by
combining an inverted file system with the Asymmetric Distance Computation
(ADC). Finally, a web service is implemented in order to accelerate the querying
process.
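A minimal sketch of the IVFADC search follows, using the faiss library; the dimensionality, the random placeholder data and the index parameters (number of inverted lists and sub-quantizers) are assumptions for illustration.

```python
# Sketch of an IVFADC index (inverted file + Asymmetric Distance Computation).
import numpy as np
import faiss

d = 128                                             # dimensionality of the vectors
xb = np.random.rand(10000, d).astype("float32")     # database vectors (placeholder)
xq = np.random.rand(5, d).astype("float32")         # query vectors (placeholder)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, 256, 16, 8)  # 256 lists, 16 sub-quantizers, 8 bits
index.train(xb)
index.add(xb)
index.nprobe = 8                                    # number of inverted lists to visit
distances, ids = index.search(xq, 10)               # k-nearest neighbours via ADC
```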
Version 9.0-12.0.
MPEG-7 based technique
This technique captures global information such as colour and texture. Two descriptor combinations were selected to represent
the visual content: 1) ColorLayout and EdgeHistogram, and 2) ColorLayout and ColorStructure.
SIFT based technique
This technique captures local information by applying Lowe's SIFT transform. The method adopted is an implementation
of the bag-of-visual-words approach, where a large amount of local descriptors (training data) is used for learning the
visual words, which are extracted by applying clustering with a fixed number of clusters.
In both cases, an R-tree structure is constructed to support efficient indexing and fast retrieval.
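For the SIFT-based technique, the following is a minimal sketch of the bag-of-visual-words pipeline, using OpenCV and scikit-learn for illustration; the vocabulary size is arbitrary and the R-tree indexing step is omitted.

```python
# Sketch: SIFT descriptors are clustered into a fixed-size vocabulary and each
# image becomes a (normalised) histogram of visual-word occurrences.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def build_vocabulary(training_paths, n_words=500):
    data = np.vstack([sift_descriptors(p) for p in training_paths])
    return KMeans(n_clusters=n_words, n_init=5).fit(data)

def bow_vector(path, vocab):
    words = vocab.predict(sift_descriptors(path).astype(np.float64))
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() or 1.0)
```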
Textual Information Processing Module
The textual query module attempts to exploit the shot audio information. This audio information is processed off-line
with the application of Automatic Speech Recognition and Machine Translation to the initial video, so that specific
sets of keywords can be assigned to each shot.
The textual processing module differs between versions, so we briefly describe the basics of each version.
Version 14.0.
In this version, text indexing is realized by the Apache Lucene search platform, which
allows full-text search and enables fast retrieval as well as easy formulation of
complicated queries.
In addition, DBpedia Spotlight is used to produce DBpedia entities/concepts; indexing in
this case is also realized by Apache Lucene.
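As an illustration of the entity annotation step, the sketch below queries a DBpedia Spotlight REST endpoint; the public demo URL and the response fields used here are assumptions, and a local Spotlight deployment would be called in the same way.

```python
# Sketch: annotate shot text with DBpedia Spotlight over its REST interface.
import requests

def spotlight_entities(text, confidence=0.5,
                       endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
    resp = requests.get(endpoint,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    # return the DBpedia resource URIs found in the text
    return [r.get("@URI") for r in resp.json().get("Resources", [])]

print(spotlight_entities("A man rides a bicycle through Amsterdam."))
```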
Version 9.0.
In this version, the text-retrieval code base is re-written as a set of Perl modules and the full-text retrieval
engine is migrated from KinoSearch to the Lemur Toolkit, which is an open-source framework designed to facilitate research
in language modeling and information retrieval. Term weights for each keyword are still computed using the BM25 text algorithm.
Moreover, the query expansion based on WordNet synsets as well as the concept suggestions introduced in version 8.0
are preserved, but a semantic similarity measure that calculates the similarity between query terms and concept terms
(broader, narrower and related terms) is also introduced, in order to achieve better results with the query expansion and concept suggestion functions.
Version 8.0.
In this version, a full-text retrieval engine using KinoSearch, which is a Perl search engine library based on Lucene,
is introduced to allow indexing and query functions. Term weights for each keyword are computed using the BM25 text algorithm.
Moreover, the system's recall is boosted by using query expansion, which is implemented by generating a list of synonyms for
each query term based on WordNet synsets. Finally, to assist the user in subsequent query iteration tasks,
traditional thesaurus-style concept suggestions are generated by mapping WordNet hypernyms to broader terms and hyponyms
to narrower terms.
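A minimal sketch of the expansion and suggestion step follows, using NLTK's WordNet interface for illustration (the WordNet corpus must be downloaded once via nltk.download('wordnet')).

```python
# Sketch: synonyms come from the synsets of a query term, broader terms from
# its hypernyms and narrower terms from its hyponyms.
from nltk.corpus import wordnet as wn

def expand(term):
    synsets = wn.synsets(term)
    synonyms = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
    broader = {l.name().replace("_", " ")
               for s in synsets for h in s.hypernyms() for l in h.lemmas()}
    narrower = {l.name().replace("_", " ")
                for s in synsets for h in s.hyponyms() for l in h.lemmas()}
    return {"synonyms": synonyms, "broader": broader, "narrower": narrower}

print(expand("car"))
```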
Version 7.0.
In this version, the text algorithm employed by the module is the BM25 algorithm, which incorporates both
normalised document length and term frequency. The module is further capable of providing related keywords
to the searcher by processing the associated text of the initial results and eventually extracting the most
frequent keywords. In that way the module receives feedback from the results and suggests additional input to
the user for submitting similar queries.
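For reference, below is a minimal sketch of the Okapi BM25 weighting described above, with documents represented as plain keyword lists and the common default parameters k1 = 1.2 and b = 0.75; this is the textbook formula rather than the module's exact implementation.

```python
# Sketch: BM25 combines term frequency with normalised document length.
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in docs if term in d)     # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)                       # term frequency in this doc
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```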
High Level Concept Retrieval Module
Version 14.0.
Similar to Version 13.0, a two-layer concept detection system is employed. The
difference compared to the previous version is the independent concept detectors used.
Specifically, each image is represented using one or more types of features based on
three different pre-trained convolutional neural networks (CNN): i) the 16-layer
deep ConvNet network, ii) the 22-layer GoogLeNet network and iii) the 8-layer
CaffeNet network. Each of these networks is applied on the keyframes and the output
of one or more layers is used as a feature. The CNN-based feature vectors serve
as input to Support Vector Machine classifiers. The output of the classifiers trained
for the same concept is combined using a cascade of classifiers. Again, fusion of
scores is realized in the second layer.
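A minimal sketch of the classification step follows, assuming the per-network feature matrices are already extracted; one calibrated linear SVM is trained per feature type, and the per-network scores for the same concept are fused here by simple averaging, whereas the module described above uses a cascade of classifiers.

```python
# Sketch: one SVM per concept and per CNN feature type; the probabilistic
# outputs of the per-network classifiers are averaged into a concept score.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

def train_concept_detectors(features_per_network, labels):
    # features_per_network: list of (n_samples, d_i) arrays, one per CNN
    return [CalibratedClassifierCV(LinearSVC()).fit(X, labels)
            for X in features_per_network]

def concept_score(detectors, features_per_network):
    # average the positive-class probabilities of the per-network classifiers
    probs = [clf.predict_proba(X)[:, 1]
             for clf, X in zip(detectors, features_per_network)]
    return np.mean(probs, axis=0)
```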
Version 13.0.
A two-layer concept detection system is employed. The first layer
builds multiple independent concept detectors. Each image is represented using one or
more types of appropriate local descriptors (e.g. SIFT, RGB-SIFT, SURF, ORB etc.).
All the local descriptors are compacted using PCA and are subsequently aggregated
using the VLAD encoding. These VLAD vectors are compressed by applying a
modification of the random projection matrix and serve as input to Logistic
Regression (LR) classifiers. The output of the classifiers is combined by means of late
fusion. When different descriptors are combined, again late fusion is performed by
averaging the classifier output scores. In the second layer of the stacking
architecture, the fused scores from the first layer are aggregated into model vectors and
refined by applying a multi-label learning algorithm and temporal re-ranking.
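A minimal sketch of the VLAD aggregation used in the first layer follows, assuming the local descriptors are already extracted and compacted; the PCA and random-projection compression steps are omitted and the vocabulary size is arbitrary.

```python
# Sketch: assign local descriptors to a small visual vocabulary, accumulate the
# residuals to each centre, then power- and L2-normalise the concatenation.
import numpy as np
from sklearn.cluster import KMeans

def train_vlad_vocabulary(local_descriptors, k=64):
    return KMeans(n_clusters=k, n_init=5).fit(local_descriptors)

def vlad_encode(descriptors, vocab):
    k, d = vocab.n_clusters, descriptors.shape[1]
    assignments = vocab.predict(descriptors)
    vlad = np.zeros((k, d))
    for centre in range(k):
        assigned = descriptors[assignments == centre]
        if len(assigned):
            vlad[centre] = (assigned - vocab.cluster_centers_[centre]).sum(axis=0)
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))        # power normalisation
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm else vlad                # L2 normalisation
```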
Version 9.0-12.0.
This module provides high-level concept (e.g. animal, landscape, outdoor, etc.) selection for the user.
After an off-line preprocessing step, the images are sorted based on similarity coefficients for each concept.
The procedure required for the extraction of high-level concept information is based on a combination
of MPEG-7 and SIFT-based global image features.
A set of SVM classifiers is used to create classification models for the MPEG-7 and BoW features, using
the first half of the development set for the training procedure. The output of the classification is the
Degree of Confidence (DoC), by which the query may be classified in a particular concept. After the classification
process for both MPEG-7 and BoW features is completed, the respective DoCs are used as input to a second SVM.
This stage-2 classification uses the second half of the development set and self-optimized parameters for training,
in order to create a classification model. The results from this 2-stage SVM on the testing set are sorted by
DoC, and the 2000 highest-ranked are indexed in the database to support the concept retrieval.
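A minimal sketch of the 2-stage SVM procedure follows, assuming the MPEG-7 and BoW feature matrices of the development set are given; the probability output of scikit-learn's SVC stands in for the Degree of Confidence, and the parameter self-optimisation is omitted.

```python
# Sketch: stage-1 SVMs (per feature type) are trained on the first half of the
# development set; their DoCs on the second half train the stage-2 SVM.
import numpy as np
from sklearn.svm import SVC

def train_two_stage(mpeg7_feats, bow_feats, labels):
    half = len(labels) // 2
    svm_mpeg7 = SVC(probability=True).fit(mpeg7_feats[:half], labels[:half])
    svm_bow = SVC(probability=True).fit(bow_feats[:half], labels[:half])
    doc = np.column_stack([svm_mpeg7.predict_proba(mpeg7_feats[half:])[:, 1],
                           svm_bow.predict_proba(bow_feats[half:])[:, 1]])
    stage2 = SVC(probability=True).fit(doc, labels[half:])
    return svm_mpeg7, svm_bow, stage2
```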
Clustering Module
This module clusters images into color classes using Self Organizing Maps, in order to offer
image visualization and fast browsing. Shot representation is realized using the mean
reciprocal distance to the most red, green and blue shots. Eventually, the image
collection is represented as a color map, using the most representative image per color
class.
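A minimal sketch of the SOM-based color clustering follows, using the MiniSom library and representing each shot simply by its mean RGB value rather than by the mean reciprocal distance representation described above; the grid size and training parameters are illustrative.

```python
# Sketch: train a small Self Organizing Map on per-shot color features and
# assign each shot to the color class of its best matching unit.
from minisom import MiniSom

def colour_map(shot_rgb_features, grid=(4, 4)):
    # shot_rgb_features: (n_shots, 3) array of mean RGB values in [0, 1]
    som = MiniSom(grid[0], grid[1], 3, sigma=1.0, learning_rate=0.5)
    som.train_random(shot_rgb_features, 1000)
    return [som.winner(x) for x in shot_rgb_features]   # (row, col) class per shot
```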
Object-based Visual Module
This module is based on the Bag-of-Words (BoW) model. The Fast Hessian detector and
SIFT descriptors are extracted from the images; the detected features are then
randomly sampled and clustered using Repeated Bisecting K-Means, and a
2-layer visual vocabulary is constructed. An inverted index is built using the Apache
Lucene software for fast online search of the image database BoW vectors. The
similarity score is obtained based on Lucene’s scoring function and the ranking
position of the frame in the retrieved list.
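A minimal sketch of the inverted-index search over BoW vectors follows, with a plain Python dictionary standing in for the Apache Lucene index and a dot product standing in for Lucene's scoring function.

```python
# Sketch: treat each visual-word id as a "term" in an inverted index, so only
# images sharing words with the query are scored.
from collections import defaultdict

def build_inverted_index(bow_vectors):
    # bow_vectors: {image_id: {word_id: weight}}
    index = defaultdict(list)
    for image_id, vec in bow_vectors.items():
        for word_id, weight in vec.items():
            index[word_id].append((image_id, weight))
    return index

def search(query_vec, index, top_k=10):
    scores = defaultdict(float)
    for word_id, q_weight in query_vec.items():
        for image_id, weight in index.get(word_id, []):
            scores[image_id] += q_weight * weight
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
```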