Code

    This page contains the open-source code developed and used in the MULTISENSOR Project.

    Feel free to download, use and change the code as you like. We’d love to get your feedback.

    No. Description Project Partner Licence
    01 Deep statistical dependency parser [-> code]
    This parser takes as input a surface-syntactic dependency structure, as produced by, e.g., the MATE Tools parser, and outputs dependency structures annotated with deep-syntactic relations in the sense of Meaning-Text Theory. The parser is trained on a parallel corpus of surface-syntactic structures and deep-syntactic structures in the CoNLL’09 format.
    Check out the demo here (select “deep output”). For more details, see the “DSynt Converter” section.
    UPF GNU GPL v2
    02 Mate Tools Surface Dependency Parser [-> code]
    This tool takes as input plain text and produces dependency structures annotated with surface-syntactic relations (subject, object, etc.), lemmas, part of speech, and morpho-syntactic features. The MATE Tools parser is trained on surface-syntactic data annotated in the CoNLL’09 (one word per line) format.

    Check out the demo here (select “surface output”).
    UPF GNU GPL v2
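The parsers above work with the CoNLL’09 one-word-per-line format. As a rough illustration, the sketch below reads such data in Python, assuming the standard CoNLL-2009 shared task column order (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL). The function name `read_conll09` and the sample sentence are hypothetical and not part of the MULTISENSOR code.

```python
from typing import Dict, List

# First 12 columns of the CoNLL-2009 format (assumed standard order).
COLUMNS = ["id", "form", "lemma", "plemma", "pos", "ppos",
           "feat", "pfeat", "head", "phead", "deprel", "pdeprel"]


def read_conll09(text: str) -> List[List[Dict[str, str]]]:
    """Split tab-separated CoNLL'09 text into sentences of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # zip() ignores any columns beyond the 12 named above.
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample: "The parser works" with surface-syntactic labels.
SAMPLE = "\n".join([
    "1\tThe\tthe\tthe\tDT\tDT\t_\t_\t2\t2\tNMOD\tNMOD\t_\t_",
    "2\tparser\tparser\tparser\tNN\tNN\t_\t_\t3\t3\tSBJ\tSBJ\t_\t_",
    "3\tworks\twork\twork\tVBZ\tVBZ\t_\t_\t0\t0\tROOT\tROOT\t_\t_",
])

for token in read_conll09(SAMPLE)[0]:
    print(token["form"], token["head"], token["deprel"])
```

Each token is kept as a plain dictionary of strings, so the head index and relation label can be inspected without committing to any particular tree data structure.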
    03 Socially interconnected/interlinked and multimedia-enriched objects [-> code]
    A model for representing multimedia content in the context of the Web and Social Media. This model allows for representing in a common format heterogeneous content such as: web pages with images and videos, images, videos, textual documents, social media posts and user comments.
    CERTH Apache Licence v2.0
    04 VERGE [-> code]
    A hybrid interactive video retrieval system that searches video content by integrating different search modules based on visual and textual techniques. VERGE is built on open-source technologies such as HTML, PHP, JavaScript and MongoDB. It can be used by companies and organizations that are interested in indexing and searching image and video content.

    Check out the demo here.
    CERTH Apache Licence v2.0
    05 DSynt Converter (ENG) [-> code]
    A tool that converts the reference surface-syntactic annotation of English (Penn TreeBank) into its corresponding deep-syntactic annotation in the CoNLL’09 format. The conversion removes auxiliaries, modals, and functional prepositions, conjunctions and determiners, and maps the grammatical labels onto semantics-oriented labels. A deep-syntactic structure expresses at the same time the syntactic structure of the sentence and most predicate-argument relations between the meaning-bearing elements that are in it. Together with the corresponding surface-syntactic corpus, the deep-syntactic corpus is used for training a deep-syntactic parser or a deep-syntactic generator.
    UPF GNU GPL v2
    06 Deep statistical text generator [-> code]
    This tool takes as input a deep-syntactic dependency structure in the sense of Meaning-Text Theory (in the CoNLL’09 format), and outputs a linearized structure with all the words of the sentence. The generator is trained on a parallel corpus of surface-syntactic structures and deep-syntactic structures in the CoNLL’09 format. See the “DSynt Converter” section on this page for more details.
    UPF GNU GPL v2
    07 Twitter Crawler for Contributor Analysis and Name Search [-> code]
    This tool receives a Twitter handle as input and extracts information about the user and their immediate connections, including measures of the user’s authority. The authority scores are based on three criteria: 1) reach (number of followers and size of the ego network), 2) relevance to a given set of keywords and 3) retweet influence score (average fraction of followers that retweet a random post by the user). Alternatively, the tool accepts a search key instead of a specific Twitter handle. In that case, it retrieves the top 10 Twitter accounts relevant to the search key and proceeds as before with each of them.
    EURECAT MIT
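The retweet influence score described for the Twitter crawler (average fraction of followers that retweet a random post) can be sketched as follows. This is only an illustration under stated assumptions: the function name and inputs are hypothetical, every retweet is treated as coming from a follower, and the crawler's actual computation may differ.

```python
from typing import Sequence


def retweet_influence(followers: int, retweets_per_post: Sequence[int]) -> float:
    """Average, over posts, of retweets divided by the follower count.

    Assumption: each retweet is attributed to a follower, which the real
    tool may or may not verify.
    """
    if followers <= 0 or not retweets_per_post:
        return 0.0
    fractions = [count / followers for count in retweets_per_post]
    return sum(fractions) / len(fractions)


# Hypothetical account with 1000 followers and three sampled posts.
score = retweet_influence(1000, [10, 30, 20])
print(round(score, 4))
```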
    08 Framework for topic detection [-> code]
    In this framework, topic detection is treated as a clustering problem, and a hybrid clustering approach assigns news articles to topics. Prior knowledge of the correct number of clusters/topics is not required, as this number is estimated automatically by a novel methodology named DBSCAN-Martingale. The assignment of news articles to topics is done using Latent Dirichlet Allocation (LDA).
    CERTH Apache Licence v2.0
    09 Framework for category-based classification [-> code]
    This is the implementation of a framework for classification of news articles into a predefined set of generic categories, i.e. Nature_Environment, Politics, Science_Technology, Economy_Business_Finance, Health and Lifestyle_Leisure. The framework relies on the Random Forests (RF) machine learning method and a late fusion strategy that is based on the operational capabilities of RF, namely the OOB error estimate. For a given dataset, two types of textual features are extracted, namely word2vec and N-grams. One RF model is trained for each type of features. Next, the predicted probabilities from each model on the test set are aggregated, so as to calculate the final late fusion model predictions. These probabilities are not equally weighted. Weights are individually calculated for each class based on the OOB error estimate of each RF model.
    CERTH Apache Licence v2.0
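The class-wise late fusion of the two Random Forest models in the classification framework can be illustrated with a small pure-Python sketch. The page does not give the weighting formula, so the one used here (per-class accuracy `1 - OOB error`, normalised across the two models) is an assumption, and all function names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def class_weights(oob_err_a: Sequence[float],
                  oob_err_b: Sequence[float]) -> List[Tuple[float, float]]:
    """Per-class weights for models A and B from their per-class OOB errors.

    Assumed scheme: weight is proportional to (1 - OOB error).
    """
    weights = []
    for err_a, err_b in zip(oob_err_a, oob_err_b):
        acc_a, acc_b = 1.0 - err_a, 1.0 - err_b
        total = acc_a + acc_b
        w_a = acc_a / total if total > 0 else 0.5
        weights.append((w_a, 1.0 - w_a))
    return weights


def late_fusion(probs_a: Sequence[float], probs_b: Sequence[float],
                weights: Sequence[Tuple[float, float]]) -> List[float]:
    """Combine the class probabilities of two models for one test article."""
    fused = [w_a * p_a + w_b * p_b
             for (w_a, w_b), p_a, p_b in zip(weights, probs_a, probs_b)]
    total = sum(fused)
    # Renormalise so the fused scores sum to 1 again.
    return [f / total for f in fused] if total > 0 else fused


# Hypothetical two-class example: model A uses word2vec, model B uses N-grams.
w = class_weights([0.2, 0.4], [0.4, 0.2])
print(late_fusion([0.7, 0.3], [0.4, 0.6], w))
```

Because the weights are computed per class, a model that is reliable for one category but weak for another contributes accordingly to each.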
    10 Multimedia retrieval framework [-> code]
    This framework

    • fuses multiple modalities, so as to retrieve multimedia objects in response to a multimodal query;
    • integrates high-level information, i.e. multimedia objects are enriched with high-level textual and visual concepts;
    • is language-independent.

    The framework leverages 3 modalities from every multimedia object, namely visual features, visual concepts and textual concepts. Each modality provides a vector representation of the multimedia object through its corresponding features. The similarity matrices from the 3 modalities are constructed and fused for the computation of one relevance score vector.

    CERTH Apache Licence v2.0
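The idea of fusing per-modality similarities into one relevance score vector can be sketched as below. This is a simplified illustration, not the framework's actual fusion method: it assumes cosine similarity between a query and each object, combined by a weighted linear sum, and the modality names, weights and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevance_scores(query: Dict[str, Vector],
                     objects: List[Dict[str, Vector]],
                     weights: Dict[str, float]) -> List[float]:
    """One fused score per object: weighted sum of per-modality similarities."""
    return [
        sum(weights[m] * cosine(query[m], obj[m]) for m in weights)
        for obj in objects
    ]


# Hypothetical weights for the three modalities named on this page.
WEIGHTS = {"visual_features": 0.4, "visual_concepts": 0.3, "textual_concepts": 0.3}
query = {"visual_features": [1.0, 0.0], "visual_concepts": [0.0, 1.0],
         "textual_concepts": [1.0, 1.0]}
objects = [
    dict(query),  # identical to the query
    {"visual_features": [0.0, 1.0], "visual_concepts": [1.0, 0.0],
     "textual_concepts": [1.0, -1.0]},  # orthogonal in every modality
]
print(relevance_scores(query, objects, WEIGHTS))
```

Sorting the objects by the returned scores gives a ranked result list for the multimodal query.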
    11 Character-based Stack-LSTM surface parser [-> code]
    This tool takes as input plain text and produces dependency structures annotated with surface-syntactic relations (subject, object, etc.). The Stack-LSTM parser is trained on surface-syntactic data annotated in the CoNLL’06 (one word per line) format. The character-based representations are a way of overcoming the out-of-vocabulary (OOV) problem: without any additional resources, they enable the parser to substantially improve performance when OOV rates are high, since they make it possible to compute vector representations for words that the machine learning model has never seen during training (mainly out-of-domain words). The model can therefore handle and classify new words without additional resources.
    UPF Apache Licence v2.0
    12 Frame Semantics parser (ENG) [-> code]
    This parser produces structures as found in FrameNet. It has several advantages when compared to state-of-the-art systems:

    1. Unlike, e.g., the Semafor system, this parser does not consider spans of text as Frame Fillers, but instead individual meanings;
    2. Unlike, e.g., Semafor or FRED, this tool will eventually be able to process multilingual inputs.

    This parser builds upon the output of a deep-syntactic parser as described above. The level of abstraction of a Frame Semantics structure is greater than that of deep-syntactic structures, but it is also more complete from the perspective of semantics, in particular by making explicit many relations which are not given by a syntactic parser (shared arguments, gapping constructions, etc.).

    UPF Apache Licence v2.0
    13 Multimedia concept and event detection [-> code]
    Implementation of experiments for the MULTISENSOR video concept and event detection framework. The framework relies on DCNN features and the Support Vector Machine (SVM) classification algorithm. A three-fold cross-validation (CV) is executed to evaluate performance. The code has been developed and tested in Python, version 3.5.1, 64-bit.
    The experiments use a dataset of 106 videos from news reports. Videos are categorised into nine concepts/events. Note that one video may be relevant to zero or more of these concepts/events.
    The dataset is available on the project’s dataset page or through direct download here.
    CERTH Apache Licence v2.0
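The three-fold cross-validation over the 106 videos can be illustrated with a small index-splitting sketch. How the original experiments shuffle or stratify the videos is not stated on this page, so the random shuffle, the seed and the function name `k_fold_indices` are assumptions made for illustration only.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_items: int, k: int = 3,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 106 videos, as in the dataset described above.
for fold_no, (train, test) in enumerate(k_fold_indices(106), start=1):
    print(fold_no, len(train), len(test))
```

In each round, a classifier would be trained on the `train` indices and evaluated on the `test` indices, with the three results averaged.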
    14 Community detection [-> code]
    Implementation of the MULTISENSOR community detection task. In contrast to traditional modularity maximization approaches for finding community structure, MULTISENSOR adopts information-theoretic codelength minimization, known as the Infomap method. MULTISENSOR uses this module for the detection of Twitter communities, given a list of desired keywords/hashtags.
    CERTH Apache Licence v2.0
    15 Ontology alignment [-> code]
    The MULTISENSOR visual-based ontology alignment module implements an algorithm that computes a visual-based similarity metric for entity matching between two ontologies. Each ontological entity is associated with sets of images retrieved through ImageNet or web-based search; visual feature extraction, clustering and indexing are then used to compute the similarity between concepts. An adaptation of a popular WordNet-based matching algorithm that exploits the visual similarity has also been developed.
    CERTH Apache Licence v2.0
    16 User and Context-centric Content Analysis [-> code]
    This code implements models for representing contextual, sentiment and online social interaction features, and applies linguistic processing at different levels of accuracy and completeness. Our approach is based on disambiguated entities, relations between them, subjective expressions, opinion holders, and relations between pieces of sentiment-rich information.
    EURECAT Apache Licence v2.0