Proceedings - Terabyte 2006¶

The TREC 2006 Terabyte Track¶

Stefan Büttcher, Charles L. A. Clarke, Ian Soboroff

Paper: http://trec.nist.gov/pubs/trec15/papers/TERA06.OVERVIEW.pdf

Abstract

The primary goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. In addition, we are interested in efficiency and scalability issues, which can be studied more easily in the context of a larger collection. TREC 2006 is the third year for the track. The track was introduced as part of TREC 2004, with a single adhoc retrieval task. For TREC 2005, the track was expanded with two optional tasks: a named page finding task and an efficiency task. These three tasks were continued in 2006, with 20 groups submitting runs to the adhoc retrieval task, 11 groups submitting runs to the named page finding task, and 8 groups submitting runs to the efficiency task. This report provides an overview of each task, summarizes the results, and outlines directions for the future. Further background information on the development of the track can be found in the 2004 and 2005 track reports [4, 5].

For TREC 2006, we made the following major changes to the tasks:

1. We strongly encouraged the submission of adhoc manual runs, as well as runs using pseudo-relevance feedback and other query expansion techniques. Our goal was to increase the diversity of the judging pools in order to create a more reusable test collection. Special recognition (and a prize) was offered to the group submitting the run contributing the most unique relevant documents to the judging pool.
2. The named page finding topics were created by task participants, with each group asked to create at least 12 topics.
3. The experimental procedure for the efficiency track was re-defined to permit more realistic intra- and inter-system comparisons, and to generate separate measurements of latency and throughput. In order to compare systems across various hardware configurations, comparative runs using publicly available search engines were encouraged.

Bibtex

@inproceedings{DBLP:conf/trec/ButtcherCS06,
    author = {Stefan B{\"{u}}ttcher and Charles L. A. Clarke and Ian Soboroff},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {The {TREC} 2006 Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/TERA06.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ButtcherCS06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Melbourne University at the 2006 Terabyte Track¶

Vo Ngoc Anh, William Webber, Alistair Moffat

Participant: umelbourne.ngoc-anh
Paper: http://trec.nist.gov/pubs/trec15/papers/umelbourne.tera.final.pdf
Runs: MU06TBy1 | MU06TBy2 | MU06TBy5 | MU06TBy6 | MU06TBa2 | MU06TBa5 | MU06TBa1 | MU06TBa6 | MU06TBn2 | MU06TBn5 | MU06TBn6 | MU06TBn9

Abstract

This report describes the work done at The University of Melbourne for the TREC-2006 Terabyte Track. For this track, we participated in all three main tasks. We continued our work with impact-based ranking and sought to reduce indexing as well as query time. However, to support the named-page task, more conventional retrieval mechanisms were also employed. The results show that, in general, the efficiency performance is slightly better than the previous year. The effectiveness level remains the same.

Bibtex

@inproceedings{DBLP:conf/trec/AnhWM06,
    author = {Vo Ngoc Anh and William Webber and Alistair Moffat},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Melbourne University at the 2006 Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/umelbourne.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AnhWM06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

The Hedge Algorithm for Metasearch at TREC 2006¶

Javed A. Aslam, Virgiliu Pavlu, Carlos Rei

Participant: northeasternu.aslam
Paper: http://trec.nist.gov/pubs/trec15/papers/northeasternu.tera.final.pdf
Runs: hedge0 | hedge50 | hedge5 | hedge10 | hedge30

Abstract

Aslam, Pavlu, and Savell [3] introduced the Hedge algorithm for metasearch, which effectively combines the ranked lists of documents returned by multiple retrieval systems in response to a given query and learns which documents are likely to be relevant from a sequence of on-line relevance judgments. It has been demonstrated that the Hedge algorithm is an effective technique for metasearch, often significantly exceeding the performance of standard metasearch and IR techniques over small TREC collections. In this work, we explore the effectiveness of Hedge over the much larger Terabyte 2006 collection.
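As a rough illustration of the idea in this abstract, the sketch below applies the generic Hedge update (multiply each expert's weight by beta raised to its loss) to a set of ranked lists. The rank-based `confidence` function, the loss definition, the value of `beta`, and all function names are assumptions made for this example; they are not taken from the paper and may differ from the loss the authors actually use.

```python
def confidence(ranking, doc):
    """Hypothetical rank-based confidence in [0, 1]; 0 if the document was not retrieved."""
    if doc not in ranking:
        return 0.0
    return 1.0 - ranking.index(doc) / len(ranking)


def hedge_update(weights, rankings, doc, relevant, beta=0.5):
    """One Hedge step after a relevance judgment: w <- w * beta ** loss, then renormalise."""
    updated = {}
    for system, ranking in rankings.items():
        c = confidence(ranking, doc)
        # Assumed loss: a system is penalised for ranking a relevant document low
        # or a non-relevant document high.
        loss = 1.0 - c if relevant else c
        updated[system] = weights[system] * beta ** loss
    total = sum(updated.values())
    return {system: w / total for system, w in updated.items()}


def fused_ranking(weights, rankings, judged=()):
    """Rank the unjudged documents by the weighted sum of per-system confidences."""
    docs = {d for ranking in rankings.values() for d in ranking} - set(judged)

    def score(doc):
        return sum(weights[s] * confidence(rankings[s], doc) for s in rankings)

    return sorted(docs, key=lambda d: (-score(d), d))
```

After each judgment, systems that placed relevant documents near the top gain weight, so their remaining documents rise in the fused list.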

Bibtex

@inproceedings{DBLP:conf/trec/AslamPR06,
    author = {Javed A. Aslam and Virgiliu Pavlu and Carlos Rei},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {The Hedge Algorithm for Metasearch at {TREC} 2006},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/northeasternu.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AslamPR06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IO-Top-k at TREC 2006: Terabyte Track¶

Hannah Bast, Debapriyo Majumdar, Ralf Schenkel, Martin Theobald, Gerhard Weikum

Participant: max-planck.theobald
Paper: http://trec.nist.gov/pubs/trec15/papers/max-planck-inst.tera.final.pdf
Runs: mpiiotopk | mpiiotopkpar | mpiiotopk2p | mpiiotopk2 | mpiirmanual | mpiircomb | mpiirtitle | mpiirdesc | wumpus

Abstract

This paper describes the setup and results of our contribution to the TREC 2006 Terabyte Track. Our implementation was based on the algorithms proposed in [1] “IO-Top-k: Index-Access Optimized Top-K Query Processing, VLDB'06”, with a main focus on the efficiency track.

Bibtex

@inproceedings{DBLP:conf/trec/BastMSTW06,
    author = {Hannah Bast and Debapriyo Majumdar and Ralf Schenkel and Martin Theobald and Gerhard Weikum},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {IO-Top-k at {TREC} 2006: Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/max-planck-inst.tera.final.pdf},
    timestamp = {Thu, 14 Oct 2021 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/BastMSTW06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

MG4J at TREC 2006¶

Paolo Boldi, Sebastiano Vigna

Participant: umilano.vigna
Paper: http://trec.nist.gov/pubs/trec15/papers/umilano.tera.final.pdf
Runs: mg4jAdhocBV | mg4jAdhocBBV | mg4jAdhocBVV | mg4jAdhocV | mg4jAutoV | mg4jAutoBV | mg4jAutoBBV | mg4jAutoBVV

Abstract

MG4J participated in the ad hoc task of the Terabyte Track (find all the relevant documents with high precision from 25.2 million pages from the .gov domain) at TREC 2006. It was the second time the MG4J group participated in TREC. For this year, we integrated standard techniques (such as stemming and BM25 scoring) into MG4J, and also submitted automatic runs based on trivial query expansion techniques.

Bibtex

@inproceedings{DBLP:conf/trec/BoldiV06,
    author = {Paolo Boldi and Sebastiano Vigna},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{MG4J} at {TREC} 2006},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/umilano.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BoldiV06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Index Pruning and Result Reranking: Effects on Ad-Hoc Retrieval and Named Page Finding¶

Stefan Büttcher, Charles L. A. Clarke, Peter C. K. Yeung

Participant: uwaterloo-clarke
Paper: http://trec.nist.gov/pubs/trec15/papers/uwaterloo-clarke.tera.final.pdf
Runs: uwmtFdcp12 | uwmtFnoprune | uwmtFdcp03 | uwmtFdcp06 | uwmtFadTPRR | uwmtFadTPFB | uwmtFadDS | uwmtFmanual | uwmtFcompW | uwmtFcompW1 | uwmtFcompW2 | uwmtFcompW3 | uwmtFcompI0 | uwmtFcompI1 | uwmtFcompI2 | uwmtFcompI3 | uwmtFnpstr1 | uwmtFnpstr2 | uwmtFnpsRR1 | uwmtFcompZ0 | uwmtFcompZ1 | uwmtFcompZ2 | uwmtFcompZ3

Abstract

We describe experiments conducted for the TREC 2006 Terabyte track. Our experiments are centered around two concepts: static index pruning (for increased retrieval efficiency) and result reranking (for improved precision). We investigate their effect on retrieval efficiency and effectiveness, paying special attention to the difference between ad-hoc retrieval and named page finding. We show that index pruning and reranking based on relevance models can be beneficial in an ad-hoc retrieval setting, but have a disastrous effect on the effectiveness of named page finding. Result reranking based on anchor text, on the other hand, is very useful for named page finding, but should not be used for ad-hoc retrieval. This dichotomy poses a problem for search engines, as there is no easy way for a search engine to decide whether a given query represents an ad-hoc retrieval task, with the purpose of satisfying an abstract information need, or a named page finding task, targeting a specific document.

Bibtex

@inproceedings{DBLP:conf/trec/ButtcherCY06,
    author = {Stefan B{\"{u}}ttcher and Charles L. A. Clarke and Peter C. K. Yeung},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Index Pruning and Result Reranking: Effects on Ad-Hoc Retrieval and Named Page Finding},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/uwaterloo-clarke.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ButtcherCY06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Juru at TREC 2006: TAAT versus DAAT in the Terabyte Track¶

David Carmel, Einat Amitay

Participant: ibm.carmel
Paper: http://trec.nist.gov/pubs/trec15/papers/ibm-haifa.tera.final.pdf
Runs: JuruT | JuruTD | JuruTWE | JuruMan

Abstract

Our experiments this year focused on the ad hoc task of the Terabyte track. We experimented with WAND, a document-at-a-time evaluation algorithm we developed recently. Our results demonstrate the superiority of WAND over the traditional term-at-a-time strategy when searching over a large collection such as gov2. We demonstrate how Web expansion can be successfully applied to significantly improve search results. In addition, we describe several schemes for creating manual queries, following this year's goal to enrich the pool of results by manual runs.
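To make the term-at-a-time (TAAT) versus document-at-a-time (DAAT) distinction concrete, here is a minimal sketch of both traversal orders over a toy in-memory index. It is not WAND: WAND additionally uses per-term score upper bounds to skip documents that cannot enter the top k, which is omitted here. The index layout (term mapped to a doc-id-sorted list of `(doc_id, weight)` pairs) and the function names are assumptions for illustration only.

```python
import heapq
from collections import defaultdict


def taat(index, query_terms, k):
    """Term-at-a-time: consume one posting list completely before the next,
    keeping a score accumulator per document."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            acc[doc_id] += weight
    return sorted(acc.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


def daat(index, query_terms, k):
    """Document-at-a-time: walk all posting lists in parallel in doc-id order,
    so each document is fully scored before moving to the next one."""
    lists = [index.get(term, []) for term in query_terms]
    results = []
    current, total = None, 0.0
    # heapq.merge streams postings from all lists in ascending doc_id order.
    for doc_id, weight in heapq.merge(*lists):
        if doc_id != current:
            if current is not None:
                results.append((current, total))
            current, total = doc_id, 0.0
        total += weight
    if current is not None:
        results.append((current, total))
    return sorted(results, key=lambda kv: (-kv[1], kv[0]))[:k]
```

Both functions return the same ranking; the difference lies in memory use and in the pruning opportunities each traversal order allows on a large collection.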

Bibtex

@inproceedings{DBLP:conf/trec/CarmelA06,
    author = {David Carmel and Einat Amitay},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Juru at {TREC} 2006: {TAAT} versus {DAAT} in the Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/ibm-haifa.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CarmelA06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Partitioning the Gov2 Corpus by Internet Domain Name: A Result-set Merging Experiment¶

Christopher T. Fallen, Gregory B. Newby

Participant: ualaska.fairbanks.newby
Paper: http://trec.nist.gov/pubs/trec15/papers/ualaska.tera.final.pdf
Runs: arscDomAlog | arscDomAsrt | arscDomManL | arscDomManS

Abstract

To study the MultiSearch problem and complete the Ad Hoc Task of the 2006 TREC Terabyte Track, the Gov2 collection was divided according to web domain and, for each topic, the results from each domain were merged into a single ranked list. The mean average precision scores of the results from two different merge algorithms applied to the domain-divided Gov2 collection and a randomized domain-divided collection are compared with a 2-way analysis of variance.

Bibtex

@inproceedings{DBLP:conf/trec/FallenN06,
    author = {Christopher T. Fallen and Gregory B. Newby},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Partitioning the Gov2 Corpus by Internet Domain Name: {A} Result-set Merging Experiment},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/ualaska.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/FallenN06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dublin City University at the TREC 2006 Terabyte Track¶

Paul Ferguson, Alan F. Smeaton, Peter Wilkins

Participant: dublincityu.gurrin
Paper: http://trec.nist.gov/pubs/trec15/papers/dublin-cityu.tera.final.pdf
Runs: DCU05BASE

Abstract

For the 2006 Terabyte track in TREC, Dublin City University's participation was focussed on the ad hoc search task. As in the previous two years [7, 4], our experiments on the Terabyte track have concentrated on the evaluation of a sorted inverted index, the aim of which is to sort the postings within each posting list in such a way that only a limited number of postings need to be processed from each list, while at the same time minimising the loss of effectiveness in terms of query precision. This is done using the Fisreal search system, developed at Dublin City University [4, 8].
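The general idea of a sorted inverted index with truncated processing can be sketched as follows. This is only an illustration under stated assumptions: postings are sorted by descending per-posting weight (the abstract does not say which sort key is used), and the names `build_sorted_index`, `search`, and `max_postings` are hypothetical rather than part of the Fisreal system.

```python
from collections import defaultdict


def build_sorted_index(postings):
    """Sort every posting list by descending weight so the most useful postings come first.
    The choice of weight as sort key is an assumption of this sketch."""
    return {term: sorted(plist, key=lambda p: -p[1]) for term, plist in postings.items()}


def search(sorted_index, query_terms, max_postings, k):
    """Score documents using only the first max_postings entries of each list."""
    acc = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in sorted_index.get(term, [])[:max_postings]:
            acc[doc_id] += weight
    return sorted(acc.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

Lowering `max_postings` reduces the work per query at the cost of possibly missing contributions from the tail of each list, which is the efficiency and effectiveness trade-off the abstract refers to.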

Bibtex

@inproceedings{DBLP:conf/trec/FergusonSW06,
    author = {Paul Ferguson and Alan F. Smeaton and Peter Wilkins},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Dublin City University at the {TREC} 2006 Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/dublin-cityu.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/FergusonSW06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

RMIT University at TREC 2006: Terabyte Track¶

Steven Garcia, Nicholas Lester, Falk Scholer, Milad Shokouhi

Participant: rmit.scholer
Paper: http://trec.nist.gov/pubs/trec15/papers/rmit.tera.final.pdf
Runs: rmit06effic | zetamerg | zetabm | zetadir | zetamerg2 | zetaman | zetnpbm | zetnpft | zetnpfa | zetnpfta | rmit06cmpind | rmit06cmpwum | rmit06cmpzet

Abstract

The TREC 2006 terabyte track consisted of three tasks: informational (or ad hoc) search, named page finding, and efficient retrieval. This paper outlines RMIT University's participation in these tasks.

Bibtex

@inproceedings{DBLP:conf/trec/GarciaLSS06,
    author = {Steven Garcia and Nicholas Lester and Falk Scholer and Milad Shokouhi},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{RMIT} University at {TREC} 2006: Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/rmit.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GarciaLSS06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

PSM: A New Re-Ranking Algorithm for Named-Page¶

Jiafeng Guo, Lin Ding, Gang Zhang, Yue Liu, Xueqi Cheng

Participant: cas-ict.wang
Paper: http://trec.nist.gov/pubs/trec15/papers/cas-ict.tera.final.pdf
Runs: icttb0601 | icttb0604 | icttb0603 | icttb0600 | icttb0602

Abstract

This year, the IR group of ICT participated in the Terabyte track named-page finding subtask for the first time. Since the document collection is as large as about 426G, our most important goal is to find an efficient way to locate the target web page in such a large data set. Meanwhile, we want to keep indexing and retrieval at a reasonably low cost in both hardware and time. We used our “FirteX” engine for indexing and retrieval in this task. The indexing time is within 15 hours and the retrieval time is short (less than 2 seconds per query). The main contribution of our work is a Pattern Similarity Matching (PSM) re-ranking algorithm that reorders the results so as to rank the target document as close to the top as possible. We were glad to see encouraging performance on last year's (2005) topics in our experiments. Our work can be divided into three parts: data preprocessing, indexing and retrieval, and re-ranking.

Bibtex

@inproceedings{DBLP:conf/trec/GuoDZLC06,
    author = {Jiafeng Guo and Lin Ding and Gang Zhang and Yue Liu and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{PSM:} {A} New Re-Ranking Algorithm for Named-Page},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/cas-ict.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GuoDZLC06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

MonetDB/X100 at the 2006 TREC Terabyte Track¶

Sándor Héman, Marcin Zukowski, Arjen P. de Vries, Peter A. Boncz

Participant: lowlands-team.deVries
Paper: http://trec.nist.gov/pubs/trec15/papers/cwi-heman.tera.final.pdf
Runs: CWI06DIST8 | CWI06MEM4 | CWI06DISK1 | CWI06MEM1 | CWI06DIST8ah | CWI06DISK1ah | CWI06COMP1

Abstract

Requirements of database management (DB) and information retrieval (IR) systems overlap more and more. Database systems are being applied to scenarios where features such as text search and similarity scoring on multiple attributes become crucial. Many information retrieval systems are being extended beyond plain text, to rank semi-structured documents marked up in XML, or maintain ontologies or thesauri. In both areas, these new features are usually implemented using specialized solutions limited in their features and performance. Full integration of DB and IR has been considered highly desirable, see e.g. [5, 1] for some recent advocates. Yet, none of the attempts in this direction has been very successful. The explanation can be sought in what has been termed the 'structure chasm' [8]: database research builds upon the idea that all data should satisfy a pre-defined schema, and the natural language text documents of concern to information retrieval do not match this database application scenario. Still, the structure chasm does not explain why IR systems do not use database technology to alleviate their data management tasks during index construction and document ranking. In practice, however, custom-built information retrieval engines have always outperformed generic database technology, especially when also taking into account the trade-off between run-time performance and resources needed. To investigate the feasibility of running terabyte-scale information retrieval tasks on top of a relational engine, our team from CWI participated in the 2006 TREC Terabyte Track, using its experimental MonetDB/X100 database system [3, 11]. This system is designed for high performance on data-intensive workloads, of which TREC-TB is an excellent example.
Furthermore, we believe that standard relational algebra provides enough flexibility to express most IR retrieval models, and show that, by employing a hardware-conscious DBMS architecture, it is possible to achieve performance, both in terms of efficiency and effectiveness, that is competitive with leading, customized IR systems. This notebook is organized as follows. Section 2 describes the distinguishing features of MonetDB/X100 that allow it to run large-scale data processing tasks efficiently. Section 3 then explains the process of indexing the TREC-TB collection, and the resulting relational schema. This is followed by a description of the TREC-TB runs we submitted, together with the hardware platforms used to run them. Effectiveness and efficiency results for these runs are then presented in Sections 6 and 7, respectively, before concluding in Section 8.

Bibtex

@inproceedings{DBLP:conf/trec/HemanZVB06,
    author = {S{\'{a}}ndor H{\'{e}}man and Marcin Zukowski and Arjen P. de Vries and Peter A. Boncz},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {MonetDB/X100 at the 2006 {TREC} Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/cwi-heman.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HemanZVB06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Peking University at the TREC 2006 Terabyte Track¶

Li Jinging, Yan Hongfei

Participant: pekingu.yan
Paper: http://trec.nist.gov/pubs/trec15/papers/pekingu.tera.pdf
Runs: TWTB06AD01 | TWTB06AD02 | TWTB06AD03 | TWTB06AD04 | TWTB06AD05 | TWTB06NP01 | TWTB06NP02 | TWTB06NP03

Abstract

This paper details the experiments carried out at the TREC 2006 Terabyte Track using the Indri search engine. There were three tasks in the Terabyte track of TREC 2006: the efficiency task, the ad hoc task, and the named page finding task. We participated in two tasks, and submitted 5 runs for the ad hoc task and 3 runs for the named page finding task. In the ad hoc task, we looked at the importance of term proximity. In the named page finding task, we focused more on document structure and document priors.

Bibtex

@inproceedings{DBLP:conf/trec/JingingH06,
    author = {Li Jinging and Yan Hongfei},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Peking University at the {TREC} 2006 Terabyte Track},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/pekingu.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/JingingH06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments with Document and Query Representations for a Terabyte of Text¶

Jaap Kamps

Participant: uamsterdam.ilps
Paper: http://trec.nist.gov/pubs/trec15/papers/uamsterdam-kamps.tera.final.pdf
Runs: UAmsT06aTeLM | UAmsT06aAnLM | UAmsT06aTDN | UAmsT06aTTDN | UAmsT06a3SUM | UAmsT06n3SUM | UAmsT06nTeLM | UAmsT06nTurl | UAmsT06nAnLM

Abstract

As part of the TREC 2006 Terabyte track, we conducted a range of experiments investigating the effects of larger test collections for both Adhoc and known-item topics. First, we looked at the amount of smoothing required for large-scale collections, and found that the large-scale collections require little smoothing. Second, we investigated the relative effectiveness of various web-centric document representations based on document-text, incoming anchor-texts, and page titles. We found that these are of little value for the Adhoc task, but can provide crucial additional retrieval cues for the Named page finding task. Third, we studied the relative effectiveness of various query representations, both short and verbose statements of the topic of request, plus an intermediate query based on the most characteristic terms in the whole topic statement. We found that using a more verbose query leads to an improvement of retrieval effectiveness.

Bibtex

@inproceedings{DBLP:conf/trec/Kamps06,
    author = {Jaap Kamps},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments with Document and Query Representations for a Terabyte of Text},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/uamsterdam-kamps.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Kamps06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2006: Experiments in Terabyte and Enterprise Tracks with Terrier¶

Christina Lioma, Craig Macdonald, Vassilis Plachouras, Jie Peng, Ben He, Iadh Ounis

Participant: uglasgow.ounis
Paper: http://trec.nist.gov/pubs/trec15/papers/uglasgow.tera.ent.final.pdf
Runs: uogTB06QET1 | uogTB06QET2 | uogTB06S50L | uogTB06SS10L | uogTB06SSQL | uogTB06M | uogTB06MP | uogTB06MPIA

Abstract

In TREC 2006, we participate in three tasks of the Terabyte and Enterprise tracks. We continue experiments using Terrier, our modular and scalable Information Retrieval (IR) platform. Furthering our research into the Divergence From Randomness (DFR) framework of weighting models, we introduce two new effective and low-cost models, which combine evidence from document structure and capture term dependence and proximity, respectively. Additionally, in the Terabyte track, we improve on our query expansion mechanism on fields, presented in TREC 2005, with a new and more refined technique, which combines evidence in a linear, rather than uniform, way. We also introduce a novel, low-cost syntactically-based noise reduction technique, which we flexibly apply to both the queries and the index. Furthermore, in the Named Page Finding task, we present a new technique for combining query-independent evidence, in the form of prior probabilities. In the Enterprise track, we test our new voting model for expert search. Our experiments focus on the need for candidate length normalisation, and on how retrieval performance can be enhanced by applying retrieval techniques to the underlying ranking of documents.

Bibtex

@inproceedings{DBLP:conf/trec/LiomaMPPHO06,
    author = {Christina Lioma and Craig Macdonald and Vassilis Plachouras and Jie Peng and Ben He and Iadh Ounis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2006: Experiments in Terabyte and Enterprise Tracks with Terrier},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/uglasgow.tera.ent.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LiomaMPPHO06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Fuzzy Term Proximity With Boolean Queries at 2006 TREC Terabyte Task¶

Annabelle Mercier, Michel Beigbeder

Participant: ecole-des-mines.beigbeder
Paper: http://trec.nist.gov/pubs/trec15/papers/ecole.tera.final.pdf
Runs: AMRIMtp5006 | AMRIMtp20006 | AMRIMtpm5006

Abstract

We report here the results of the fuzzy term proximity method applied to the Terabyte Task. Fuzzy proximity is based on the idea that the closer the query terms are in a document, the more relevant this document is. This principle yields a high-precision method, so we complete its results with those obtained with the Zettair search engine default method (Dirichlet). Our model is able to deal with Boolean queries, but contrary to the traditional extensions of the basic Boolean IR model, it does not explicitly use a proximity operator because such an operator cannot be generalized to nodes. The fuzzy term proximity is controlled with an influence function. Given a query term and a document, the influence function associates to each position in the text a value dependent on the distance to the nearest occurrence of this query term. To model proximity, this function is decreasing with distance. Different forms of function can be used: triangular, Gaussian, etc. For practical reasons only functions with finite support were used. The support of the function is limited by a constant called k. The fuzzy term proximity functions are associated with every leaf of the query tree. Then fuzzy proximities are computed for every node with a post-order tree traversal. Given the fuzzy proximities of the children of a node, its fuzzy proximity is computed, as in the fuzzy IR models, with a minimum (resp. maximum) combination for conjunctive (resp. disjunctive) nodes. Finally, a fuzzy query proximity value is obtained for each position in the document at the root of the query tree. The score of this document is the integration of the function obtained at the tree root. For the experiments, we modify Lucy (version 0.5.2) to implement our matching function. Two query sets are used for our runs. One set is manually built with the title words (and sometimes some description words). Each of these words is OR'ed with its derivatives, such as plurals.
Then the OR nodes obtained are AND'ed at the tree root. Another, automatic query set is built with an AND of automatically extracted terms from the title field. These two query sets are submitted to our system with two values of k: 50 and 200. The two corresponding query sets with flat queries are also submitted to the Zettair search engine.
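The scoring procedure described in this abstract can be sketched roughly as below, assuming a triangular influence function of the form max(0, (k - d) / k), where d is the distance to the nearest occurrence, and approximating the integration at the root by a sum over token positions. The query-tree encoding (a term string, or a tuple of an operator and a list of children) and all function names are assumptions for this example, not the authors' implementation.

```python
def influence(positions, length, k):
    """Triangular influence of one term: 1 at an occurrence, decreasing linearly
    to 0 at distance k (finite support), evaluated at every text position."""
    values = [0.0] * length
    if positions:
        for x in range(length):
            d = min(abs(x - p) for p in positions)
            values[x] = max(0.0, (k - d) / k)
    return values


def evaluate(node, doc_positions, length, k):
    """Post-order evaluation of a Boolean query tree.
    A node is a term (str) or a tuple (op, children) with op in {"AND", "OR"}."""
    if isinstance(node, str):
        return influence(doc_positions.get(node, []), length, k)
    op, children = node
    child_values = [evaluate(child, doc_positions, length, k) for child in children]
    combine = min if op == "AND" else max  # min for conjunction, max for disjunction
    return [combine(vals) for vals in zip(*child_values)]


def score(query, doc_tokens, k):
    """Document score: discrete sum of the root proximity function over all positions."""
    positions = {}
    for i, token in enumerate(doc_tokens):
        positions.setdefault(token, []).append(i)
    return sum(evaluate(query, positions, len(doc_tokens), k))
```

With an AND query, a document in which the two terms are adjacent scores higher than one in which they are further apart than the support k, where the score drops to zero.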

Bibtex

@inproceedings{DBLP:conf/trec/MercierB06,
    author = {Annabelle Mercier and Michel Beigbeder},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Fuzzy Term Proximity With Boolean Queries at 2006 {TREC} Terabyte Task},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/ecole.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MercierB06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Indri TREC Notebook 2006: Lessons Learned From Three Terabyte Tracks¶

Donald Metzler, Trevor Strohman, W. Bruce Croft

Participant: umass.allan
Paper: http://trec.nist.gov/pubs/trec15/papers/umass.tera.final.pdf
Runs: indri06Aql | indri06AdmD | indri06AlceD | indri06AlceB | indri06AtdnD | indri06Nfi | indri06Nfip | indri06Nsd | indri06Nsdp

Abstract

This report describes the lessons learned using the Indri search system during the 2004-2006 TREC Terabyte Tracks. We provide an overview of Indri, and, for the ad hoc and named page finding tasks, discuss our general approach to the problem, what worked, what did not work, and what could possibly work in the future.

Bibtex

@inproceedings{DBLP:conf/trec/MetzlerSC06,
    author = {Donald Metzler and Trevor Strohman and W. Bruce Croft},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Indri {TREC} Notebook 2006: Lessons Learned From Three Terabyte Tracks},
    booktitle = {Proceedings of the Fifteenth Text REtrieval Conference, {TREC} 2006, Gaithersburg, Maryland, USA, November 14-17, 2006},
    series = {{NIST} Special Publication},
    volume = {500-272},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2006},
    url = {http://trec.nist.gov/pubs/trec15/papers/umass.tera.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MetzlerSC06.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}