We are happy to announce the availability of a valuable resource for the Multimedia and Computer Vision community: a visual search index for the Yahoo! corpus of 100M CC-licensed images (and videos). The search index is based on state-of-the-art visual features (SURF+VLAD) and indexing techniques (Product Quantization) and is accompanied by an open-source Java project that can be used to perform rapid visual similarity search operations on the index. Feature extraction was completed in a short time with the help of Amazon Web Services (AWS), in particular the Elastic Map-Reduce (EMR) service. Simple wrapper code is included in the aforementioned project and can be used by anyone to perform feature extraction over massive image collections using EMR.
Being able to search quickly (in less than one second, see short report) through such a large index of images could be useful for a number of multimedia analysis tasks in which nearest neighbour search is a key component: image annotation, landmark detection, geolocation estimation, etc.
To make use of the index you need to download the index itself from this link (~24GB) and a number of "learning" files (see instructions below) from this link.
How to use
A Java class of the multimedia-indexing project gives an example of how a precomputed IVFPQ index of the YFCC100M collection can be loaded and used to answer queries using the multimedia-indexing library that was developed within the SocialSensor project.
The main method of the class takes six command line arguments:
- path to the folder where the IVFPQ index resides,
- number of images (i.e. vectors) to load. This number should be less than or equal to the total size of the index (95,213,780),
- path to the folder where the learning files reside,
- path to a file that contains the URLs of the query images, one URL per row,
- number of coarse quantizer lists to be searched out of 8192. Use small values to decrease query time (e.g. 1 or 2 when the full index is loaded, to obtain query times of less than 1 sec),
- the size of the cache in megabytes. Larger values decrease name look-up time but increase memory requirements (typical values are 1024 or 2048),
and performs the following operations:
- the IVFPQ index is loaded in memory (this could take up to 2 hours if you load the whole collection, depending on your hardware configuration),
- the coarse and product quantizers are loaded from the learning folder,
- an image vectorizer is initialized (i.e. codebooks and PCA matrix are loaded),
- a text file with one image URL per line is parsed and for each URL:
-- the image is downloaded,
-- the image is vectorized and the vector is used to query the index,
-- the 30 most similar images are downloaded and printed along with their distances from the query.
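To give an intuition for the "number of coarse quantizer lists to be searched" argument, here is a minimal toy sketch of how an inverted-file index typically picks which lists to scan: the query is compared against every coarse centroid and only the w nearest lists are visited. This is not the code of the multimedia-indexing project; the class name CoarseListProbe, its methods and the tiny 2-D centroids are hypothetical and chosen only for illustration (the real index uses 8192 lists).

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy illustration (hypothetical class) of choosing which coarse lists to probe. */
final class CoarseListProbe {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of the w centroids closest to the query, nearest first. */
    static int[] nearestLists(double[][] centroids, double[] query, int w) {
        if (w < 1 || w > centroids.length) {
            throw new IllegalArgumentException("w must be between 1 and " + centroids.length);
        }
        // Sort centroid indices by their distance to the query.
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(centroids[i], query)));
        // Keep only the first w indices: these are the lists that would be scanned.
        int[] result = new int[w];
        for (int i = 0; i < w; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Four toy centroids instead of 8192, in 2 dimensions.
        double[][] centroids = { {0, 0}, {10, 0}, {0, 10}, {10, 10} };
        double[] query = {9, 2};
        System.out.println(Arrays.toString(nearestLists(centroids, query, 2))); // prints [1, 3]
    }
}
```

Scanning fewer lists means fewer stored vectors are compared with the query, which is why small values such as 1 or 2 keep query times low, at the price of possibly missing neighbours that fall into lists that were not visited.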
Depending on the number of images that are loaded, a sufficient amount of memory should be allocated using the -Xmx JVM option (use -Xmx16g to load the full collection).
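Since loading can take a long time, it may be worth checking the arguments and the heap size before starting. The sketch below is a hypothetical helper, not part of the multimedia-indexing project: the class LaunchChecks and its methods are invented for illustration. It range-checks the two numeric arguments against the limits given above and compares the JVM heap with a rough estimate. The estimate assumes that memory grows linearly with the number of loaded vectors, scaled from the -Xmx16g figure for the full collection; this is only a heuristic and ignores any fixed overhead.

```java
/** Hypothetical launch-time checks; not part of the multimedia-indexing project. */
final class LaunchChecks {

    static final long FULL_INDEX_SIZE = 95_213_780L;   // total vectors in the index
    static final int COARSE_LISTS = 8192;              // number of coarse quantizer lists
    static final long FULL_INDEX_HEAP_MB = 16L * 1024; // from the -Xmx16g advice

    /** Parses the "number of images to load" argument and checks its range. */
    static long parseNumVectors(String arg) {
        long n = Long.parseLong(arg);
        if (n < 1 || n > FULL_INDEX_SIZE) {
            throw new IllegalArgumentException("number of images must be in 1.." + FULL_INDEX_SIZE);
        }
        return n;
    }

    /** Parses the "coarse lists to search" argument and checks its range. */
    static int parseListsToSearch(String arg) {
        int w = Integer.parseInt(arg);
        if (w < 1 || w > COARSE_LISTS) {
            throw new IllegalArgumentException("lists to search must be in 1.." + COARSE_LISTS);
        }
        return w;
    }

    /** Rough heap estimate in MB, ASSUMING memory scales linearly with loaded vectors. */
    static long estimateHeapMb(long numVectors) {
        // Integer ceiling of FULL_INDEX_HEAP_MB * numVectors / FULL_INDEX_SIZE.
        return (FULL_INDEX_HEAP_MB * numVectors + FULL_INDEX_SIZE - 1) / FULL_INDEX_SIZE;
    }

    public static void main(String[] args) {
        if (args.length != 6) {
            System.err.println("Expected 6 arguments: <indexFolder> <numImages> "
                    + "<learningFolder> <queryUrlsFile> <listsToSearch> <cacheSizeMB>");
            return;
        }
        long numVectors = parseNumVectors(args[1]);
        int listsToSearch = parseListsToSearch(args[4]);
        long neededMb = estimateHeapMb(numVectors);
        long availableMb = Runtime.getRuntime().maxMemory() / (1024L * 1024L);
        System.out.println("Loading " + numVectors + " vectors, searching "
                + listsToSearch + " of " + COARSE_LISTS + " lists");
        if (availableMb < neededMb) {
            System.err.println("Warning: heap is about " + availableMb
                    + " MB but roughly " + neededMb + " MB may be needed; consider a larger -Xmx");
        }
    }
}
```

The argument order in main follows the list above. Treat the estimate only as a starting point and adjust -Xmx according to what you observe on your own hardware.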
How to cite
If you use this dataset in your research, please cite the following paper:
If you have any questions or needs, please get in touch with Symeon Papadopoulos (papadop) or Eleftherios Spyromitros-Xioufis (espyromi), both at @iti.gr. For questions related to the deployment of feature extraction on AWS EMR, get in touch with Katerina Andreadou (kandreadou).
This research has been supported by the SocialSensor and REVEAL projects.