Spanish Passage Retrieval dataset (PRES)

Description

This dataset is a new test collection for passage retrieval from health-related Web resources in Spanish. The task being modeled is that of a Spanish-speaking user who has information needs on the subjects of "baby care", "vaccination", or "low back pain" and searches reputable health-related websites in Spanish; the system responds by retrieving relevant short passages.

The test collection contains 10,037 health-related Web documents in Spanish, 37 topics representing complex information needs formulated in a total of 167 natural language questions, and manual relevance assessments of text passages, pooled from multiple systems.

Details of the Dataset

For each topic of the dataset we provide:
1. The topic id
2. The name of the topic (in Spanish)
3. The category of the topic
4. A short description of the topic (in English)
5. A set of natural language questions that are relevant to the topic (in Spanish)
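
As an illustration of the topic fields listed above, the following minimal sketch parses topic records into a small Python structure. The JSON key names ("id", "name", "category", "description", "questions") are assumptions made for this example and may differ from the keys used in the distributed file; the sample record is an invented placeholder, not an actual topic from the collection.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Topic:
    topic_id: str
    name: str             # topic name, in Spanish
    category: str         # topic category
    description: str      # short description, in English
    questions: List[str]  # natural language questions, in Spanish


def parse_topics(raw_json: str) -> List[Topic]:
    """Parse a JSON array of topic records (key names are hypothetical)."""
    records = json.loads(raw_json)
    return [
        Topic(
            topic_id=str(r["id"]),
            name=r["name"],
            category=r["category"],
            description=r["description"],
            questions=list(r["questions"]),
        )
        for r in records
    ]


# Invented placeholder record, used only to exercise the parser.
SAMPLE = json.dumps([
    {
        "id": 1,
        "name": "Vacunación infantil",
        "category": "vaccination",
        "description": "Placeholder description of the information need.",
        "questions": [
            "¿Cuándo se debe vacunar a un bebé?",
            "¿Qué vacunas son obligatorias?",
        ],
    }
])

topics = parse_topics(SAMPLE)
```

Keeping the questions as a list per topic mirrors the description above, where each topic groups several questions.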

For each document of the dataset we provide:
1. The document number serving as its identifier (id)
2. The URL of the Web resource corresponding to the specific document
3. The resulting HTML content from the scraping procedure performed on the respective Web resource
4. The plain text that was finally indexed after processing the HTML content

For each relevance assessment at document level we provide:
1. The topic id
2. The document id
3. Its relevance score (always "1", since only the relevant documents are provided)
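
The document-level assessments can be read with a few lines of code. The sketch below assumes the file follows the usual four-column "trec_eval" qrels layout (topic id, an iteration column that is normally ignored, document id, relevance score); the function name and the sample lines are hypothetical and the ids shown are not taken from the collection.

```python
from collections import defaultdict
from typing import Dict, Set


def parse_qrels(text: str) -> Dict[str, Set[str]]:
    """Map each topic id to the set of document ids judged relevant.

    Assumes whitespace-separated lines of the form:
        <topic_id> <iteration> <doc_id> <relevance>
    """
    relevant = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic_id, _iteration, doc_id, score = parts
        if int(score) > 0:
            relevant[topic_id].add(doc_id)
    return dict(relevant)


# Invented sample lines in the assumed layout.
SAMPLE_QRELS = """\
1 0 105 1
1 0 2311 1
2 0 87 1
"""

qrels = parse_qrels(SAMPLE_QRELS)
```

Because only relevant documents are listed, any document absent from a topic's set should be treated as not relevant or unjudged, depending on the evaluation setup.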

For each relevance assessment at passage level we provide:
1. The topic id
2. The document id
3. The starting character index of the relevant passage
4. The ending character index of the relevant passage
5. Its relevance score (always "1", since only the relevant passages are provided)
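
Given the start and end character indices, a relevant passage can be recovered by slicing the plain text of the corresponding document. The sketch below is written under two assumptions that should be checked against the data: that the indices refer to positions in the indexed plain text (not the HTML), and that they count Unicode characters as Python strings do. Whether the ending index is inclusive or exclusive is not stated here, so the hypothetical `end_inclusive` parameter leaves that choice to the caller. The sample text is invented.

```python
def extract_passage(text: str, start: int, end: int,
                    end_inclusive: bool = False) -> str:
    """Return the passage of `text` delimited by character indices.

    `end_inclusive` selects whether `end` points at the last character
    of the passage (True) or one past it (False).
    """
    if start < 0 or end < start:
        raise ValueError("invalid passage boundaries")
    stop = end + 1 if end_inclusive else end
    return text[start:stop]


# Invented plain text standing in for a document's indexed content.
doc_text = "La vacunación protege al bebé. Consulte a su pediatra."

passage = extract_passage(doc_text, 0, 30)
same_passage = extract_passage(doc_text, 0, 29, end_inclusive=True)
```

A quick sanity check on a handful of assessments (for example, confirming that extracted passages start and end at word boundaries) is usually enough to tell which end-index convention applies.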

Provided files

The files needed to use the dataset can be downloaded from here. Unpacking the compressed file ("PRES_Dataset.rar") generates a single directory containing the following files:
1. A JSON file describing the topics
2. A JSON file describing the documents
3. A "trec_eval" format file providing the relevance assessments at the document level for pooled documents
4. A JSON file describing the relevance assessments at passage level for pooled documents

Copyright

TBD

Publication

If you use this dataset, please cite the following scientific work:
[1] E. Kamateri, T. Tsikrika, S. Symeonidis, S. Vrochidis, W. Minker and Y. Kompatsiaris, “A Test Collection for Passage Retrieval Evaluation of Spanish Health-Related Resources”, 41st European Conference on Information Retrieval (ECIR 2019), 14-18 April 2019, Cologne, Germany

Contact

For any queries, please contact:
ekamater@iti.gr (Eleni Kamateri)
theodora.tsikrika@iti.gr (Theodora Tsikrika)
spyridons@iti.gr (Spyridon Symeonidis)
stefanos@iti.gr (Stefanos Vrochidis)

Acknowledgements

This work is supported by the EU's H2020 KRISTINA project (645012) and the European Regional Development Fund of EU and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE for the REA project (T1EDK-00686).
