Proceedings - Terabyte 2004

Overview of the TREC 2004 Terabyte Track

Charles L. A. Clarke, Nick Craswell, Ian Soboroff

Abstract

The Terabyte Track explores how ad hoc retrieval and evaluation techniques can scale to terabyte-sized collections. For TREC 2004, our first year, 50 new ad hoc topics were created and evaluated over a 426GB collection of 25 million documents taken from the .gov Web domain. A total of 70 runs were submitted by 17 groups. Along with the top documents, each group reported average query times, indexing times, index sizes, and hardware and software characteristics for their systems.

Bibtex
@inproceedings{DBLP:conf/trec/ClarkeCS04,
    author = {Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2004 Terabyte Track},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/TERA.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ClarkeCS04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Melbourne University 2004: Terabyte and Web Tracks

Vo Ngoc Anh, Alistair Moffat

Abstract

The University of Melbourne carried out experiments in the Terabyte and Web tracks of TREC 2004. We applied a further variant of our impact-based retrieval approach by integrating evidence from text content, anchor text, URL depth, and link structure into the process of ranking documents, working toward a retrieval system that handles all four query types employed in these two tracks equally well. That is, we sought to avoid specialized techniques and did not apply any explicit or implicit query classifiers. The system was designed to be scalable and efficient.
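As a rough illustration of the impact-based idea (a sketch, not the authors' exact method): each posting carries a small integer impact precomputed and quantized at indexing time, so query processing reduces to summing integers rather than evaluating a floating-point similarity function. The toy postings and impact values below are invented.

# Toy impact-ordered index: term -> list of (impact, doc_id),
# sorted by descending impact. Impacts are small integers assigned
# at indexing time; the values here are purely illustrative.
index = {
    "terabyte": [(7, "d3"), (5, "d1"), (2, "d9")],
    "track":    [(6, "d1"), (4, "d3"), (1, "d7")],
}

def impact_score(query_terms, index):
    """Score documents by summing precomputed term impacts."""
    scores = {}
    for term in query_terms:
        for impact, doc_id in index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + impact
    return sorted(scores.items(), key=lambda x: -x[1])

print(impact_score(["terabyte", "track"], index))
# [('d3', 11), ('d1', 11), ('d9', 2), ('d7', 1)]

Because impacts are precomputed and postings can be stored impact-ordered, early termination of query evaluation becomes straightforward, which is one source of the approach's efficiency.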

Bibtex
@inproceedings{DBLP:conf/trec/AnhM04,
    author = {Vo Ngoc Anh and Alistair Moffat},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Melbourne University 2004: Terabyte and Web Tracks},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/umelbourne.tera.web.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AnhM04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

RMIT University at TREC 2004

Bodo Billerbeck, Adam Cannane, Abhijit Chattaraj, Nicholas Lester, William Webber, Hugh E. Williams, John Yiannis, Justin Zobel

Abstract

RMIT University participated in two tracks at TREC 2004: Terabyte and Genomics, both for the first time. This paper describes the techniques we applied and our experiments in both tracks, and discusses the results of the Genomics track runs; the Terabyte track results were unavailable at the time of manuscript submission. We also describe our new zettair search engine, in use for the first time at TREC.

Bibtex
@inproceedings{DBLP:conf/trec/BillerbeckCCLWWYZ04,
    author = {Bodo Billerbeck and Adam Cannane and Abhijit Chattaraj and Nicholas Lester and William Webber and Hugh E. Williams and John Yiannis and Justin Zobel},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{RMIT} University at {TREC} 2004},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/rmit.tera.geo.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BillerbeckCCLWWYZ04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for TREC 2004

Stephen Blott, Fabrice Camous, Paul Ferguson, Georgina Gaughan, Cathal Gurrin, Gareth J. F. Jones, Noel Murphy, Noel E. O'Connor, Alan F. Smeaton, Peter Wilkins, Oisín Boydell, Barry Smyth

Abstract

In TREC 2004, Dublin City University took part in three tracks: Terabyte (in collaboration with University College Dublin), Genomics, and Novelty. In this paper we discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments in large-scale, distributed information retrieval, which underlies all of the track experiments described in this document.

Bibtex
@inproceedings{DBLP:conf/trec/BlottCFGGJMOSWBS04,
    author = {Stephen Blott and Fabrice Camous and Paul Ferguson and Georgina Gaughan and Cathal Gurrin and Gareth J. F. Jones and Noel Murphy and Noel E. O'Connor and Alan F. Smeaton and Peter Wilkins and Ois{\'{\i}}n Boydell and Barry Smyth},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for {TREC} 2004},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/dcu.tera.geo.novelty.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BlottCFGGJMOSWBS04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Initial Results with Structured Queries and Language Models on Half a Terabyte of Text

Kevyn Collins-Thompson, Paul Ogilvie, Jamie Callan

Abstract

The CMU Distributed IR group's experiments for the TREC 2004 Terabyte track are some of the first to use Indri, a new indexing and retrieval component developed by the University of Massachusetts for the Lemur Toolkit [2]. Indri combines an inference network with a language-modeling approach and is designed to scale to terabyte-sized collections. Our goals for this year's Terabyte track were modest: to complete a set of simple baseline runs successfully using the new Indri software, and to gain more experience with Indri's retrieval model, the track's GOV2 corpus, and terabyte-scale collections in general.
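On the language-modeling side of Indri's model, documents are scored with smoothed language-model estimates. As a minimal sketch, here is Dirichlet-smoothed query likelihood, the smoothing Indri applies by default; the mu value and the toy statistics are illustrative.

import math

def query_likelihood(query, doc_tf, doc_len, coll_tf, coll_len, mu=2500):
    """Dirichlet-smoothed log query likelihood:
    log P(Q|D) = sum_q log( (tf(q,D) + mu * cf(q)/|C|) / (|D| + mu) ).
    Assumes every query term occurs at least once in the collection."""
    score = 0.0
    for q in query:
        p_coll = coll_tf.get(q, 0) / coll_len   # background model P(q|C)
        score += math.log((doc_tf.get(q, 0) + mu * p_coll) / (doc_len + mu))
    return score

# Toy statistics (invented):
doc_tf = {"terabyte": 3, "retrieval": 1}
coll_tf = {"terabyte": 500, "retrieval": 8000}
print(query_likelihood(["terabyte", "retrieval"], doc_tf, 400, coll_tf, 10**9))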

Bibtex
@inproceedings{DBLP:conf/trec/Collins-ThompsonOC04,
    author = {Kevyn Collins{-}Thompson and Paul Ogilvie and Jamie Callan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Initial Results with Structured Queries and Language Models on Half a Terabyte of Text},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/cmu-dir.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Collins-ThompsonOC04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IIT at TREC 2004 Standard Retrieval Models Over Partitioned Indices for the Terabyte Track

Jefferson Heard, Ophir Frieder, David A. Grossman

Abstract

For TREC 2004, we participated in the Terabyte track. We focused on partitioning the data in the GOV2 collection across a homogeneous cluster of machines and on indexing and querying the collection in a distributed fashion, using different standard retrieval models on a single system, such as the Robertson BM25 probabilistic measure and a vector space measure. Each of our partitioned indices was independent of the others, with its own collection statistics and lexicon. We combined the results as if all indices were the same, weighting no result set more or less than any other.
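Since each partition is searched independently, the final ranking reduces to merging the per-partition ranked lists on raw score, with no per-partition weighting. A minimal sketch, assuming a (score, doc_id) result format:

import heapq

def merge_partitions(partition_results, k=1000):
    """Merge per-partition ranked lists [(score, doc_id), ...], each
    sorted by descending score, into one ranking, treating scores as
    directly comparable (no reweighting)."""
    merged = heapq.merge(*partition_results, key=lambda x: -x[0])
    return list(merged)[:k]

# Each sublist is one partition's results, sorted by descending score.
parts = [
    [(14.2, "p0-d7"), (9.1, "p0-d2")],
    [(12.8, "p1-d4"), (11.0, "p1-d9")],
]
print(merge_partitions(parts, k=3))
# [(14.2, 'p0-d7'), (12.8, 'p1-d4'), (11.0, 'p1-d9')]

Note that because each partition keeps its own collection statistics, raw scores are not strictly comparable across partitions; merging them unchanged is exactly the simplification the abstract describes.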

Bibtex
@inproceedings{DBLP:conf/trec/HeardFG04,
    author = {Jefferson Heard and Ophir Frieder and David A. Grossman},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IIT} at {TREC} 2004 Standard Retrieval Models Over Partitioned Indices for the Terabyte Track},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/iit.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HeardFG04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Using Normal PC to Index and Retrieval Terabyte Document - THUIR at TREC 2004 Terabyte Track

Yijiang Jin, Wei Qi, Min Zhang, Shaoping Ma

Abstract

This year, the Tsinghua University Information Retrieval Group (THUIR) participated in the Terabyte track of TREC for the first time. Since the document collection is about 426GB and we do not have supercomputers, our first and most important goal was to complete the task at a reasonably low cost, in terms of both hardware and time. This goal was achieved through careful data preprocessing, data set reduction, and optimization of algorithms and programs. As a result, the task was completed on an ordinary high-performance desktop PC, with an indexing time of no more than a few tens of hours and an acceptable retrieval time. Furthermore, the retrieval performance is respectable. All experiments were performed on the TMiner IR system, developed by the THUIR group last year.

Bibtex
@inproceedings{DBLP:conf/trec/JinQZM04,
    author = {Yijiang Jin and Wei Qi and Min Zhang and Shaoping Ma},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Using Normal {PC} to Index and Retrieval Terabyte Document - {THUIR} at {TREC} 2004 Terabyte Track},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/tsinghua-ma.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/JinQZM04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Indri at TREC 2004: Terabyte Track

Donald Metzler, Trevor Strohman, Howard R. Turtle, W. Bruce Croft

Abstract

This paper provides an overview of experiments carried out at the TREC 2004 Terabyte Track using the Indri search engine. Indri is an efficient, effective distributed search engine. Like INQUERY, it is based on the inference network framework and supports structured queries, but unlike INQUERY, it uses language modeling probabilities within the network which allows for added flexibility. We describe our approaches to the Terabyte Track, all of which involved automatically constructing structured queries from the title portions of the TREC topics. Our methods use term proximity information and HTML document structure. In addition, a number of optimization procedures for efficient query processing are explained.
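As a sketch of how a structured query with proximity evidence might be built from a topic title: the operators #combine, #weight, #1 (exact ordered phrase), and #uw8 (unordered window of 8) are real Indri query-language operators, but the template and weights below are illustrative, not necessarily those of the track runs.

def build_structured_query(title, w_terms=0.8, w_phrase=0.1, w_window=0.1):
    """Combine bag-of-words, exact-phrase, and unordered-window evidence
    in Indri query-language syntax. Weights are illustrative."""
    terms = title.lower().split()
    bag = "#combine( " + " ".join(terms) + " )"
    if len(terms) < 2:
        return bag
    phrase = "#1( " + " ".join(terms) + " )"
    window = "#uw8( " + " ".join(terms) + " )"
    return (f"#weight( {w_terms} {bag} "
            f"{w_phrase} {phrase} {w_window} {window} )")

print(build_structured_query("kyrgyzstan united states relations"))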

Bibtex
@inproceedings{DBLP:conf/trec/MetzlerSTC04,
    author = {Donald Metzler and Trevor Strohman and Howard R. Turtle and W. Bruce Croft},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Indri at {TREC} 2004: Terabyte Track},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/umass.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MetzlerSTC04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Amberfish at the TREC 2004 Terabyte Track

Nassib Nassar

Abstract

The TREC 2004 Terabyte Track evaluated information retrieval in large-scale text collections, using a set of 25 million documents (426 GB). This paper gives an overview of our experiences with this collection and describes Amberfish, the text retrieval software used for the experiments.

Bibtex
@inproceedings{DBLP:conf/trec/Nassar04,
    author = {Nassib Nassar},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Amberfish at the {TREC} 2004 Terabyte Track},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/etymon.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Nassar04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Towards Grid-Based Information Retrieval

Gregory B. Newby

Abstract

The IRTools software toolkit was used in TREC 2004 for submissions to the Web track and the Terabyte track. Terabyte track results were not available at the time of the due date for this Proceedings paper. While Web track results were available, qrels were not. Because we discovered a bug in the MySQL++ API that truncated docid numbers in our results, we will await qrels to reevaluate submitted runs and report results. This year, the Terabyte track dictated some changes to IRTools in order to handle the 430+GB of text (about 25M documents). The main change was to operate on chunks of the collection (272 separate chunks, each containing one of the Terabyte collection's subdirectories). Chunks were generated in parallel using the National Center for Supercomputing Applications' cluster, Mercury (dual Itanium systems). Up to about 40 systems were used simultaneously for both indexing and querying. Query merging was simplistic, based on the cosine value with Lnu.Ltc weighting. Use of the NCSA cluster, and other experiments with commodity clusters, is part of work underway to enable information retrieval in Grid computing environments. The site http://www.gir-wg.org has information about Grid Information Retrieval (GIR), including links to the published Requirements document and draft Architecture document. The GIR working group is chartered by the Global Grid Forum (GGF) to develop standards and reference implementations for GIR. TREC participants are urged to consider getting involved with Grid computing. Computational grids offer a very good fit for the needs of large-scale information retrieval research and practice. This brief abstract for the proceedings will be replaced with a complete analysis of this year's submissions for the full conference paper. Meanwhile, Newby (2004) provides a profile of IRTools, which is generally applicable to this year's submissions.
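For reference, a minimal sketch of the Lnu.Ltc weighting named above, following Singhal's pivoted-normalization reading of the SMART codes; the slope default and the separation into two helpers are assumptions.

import math

def lnu_doc_weight(tf, avg_tf, n_unique, pivot, slope=0.2):
    """SMART 'Lnu' document weight: 'L' log-averaged tf, 'n' no idf,
    'u' pivoted unique-term-count normalization. Assumes tf >= 1."""
    l = (1 + math.log(tf)) / (1 + math.log(avg_tf))
    return l / ((1 - slope) * pivot + slope * n_unique)

def ltc_query_weights(query_tf, df, num_docs):
    """SMART 'Ltc' query weights: log tf times idf = log(N/df),
    cosine-normalized over the query vector."""
    raw = {t: (1 + math.log(tf)) * math.log(num_docs / df[t])
           for t, tf in query_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

# A document's cosine score is then the dot product of its Lnu term
# weights with the Ltc query weights over their shared terms.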

Bibtex
@inproceedings{DBLP:conf/trec/Newby04,
    author = {Gregory B. Newby},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Towards Grid-Based Information Retrieval},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/ualaska.web.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Newby04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

JHU/APL at TREC 2004: Robust and Terabyte Tracks

Christine D. Piatko, James Mayfield, Paul McNamee, R. Scott Cost

Abstract

The Johns Hopkins University Applied Physics Laboratory (JHU/APL) focused on the Robust and Terabyte Tracks at the 2004 TREC conference.

Bibtex
@inproceedings{DBLP:conf/trec/PiatkoMMC04,
    author = {Christine D. Piatko and James Mayfield and Paul McNamee and R. Scott Cost},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{JHU/APL} at {TREC} 2004: Robust and Terabyte Tracks},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/jhu-apl.robust.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/PiatkoMMC04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2004: Experiments in Web, Robust, and Terabyte Tracks with Terrier

Vassilis Plachouras, Ben He, Iadh Ounis

Abstract

With our participation in TREC 2004, we test Terrier, a modular and scalable Information Retrieval framework, in three tracks. For the mixed query task of the Web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. For the Robust track, in order to cope with poorly-performing queries, we use two pre-retrieval performance predictors and a weighting-function recommender mechanism. We also test a new training approach for the automatic tuning of the term frequency normalisation parameters. In the Terabyte track, we employ a distributed version of Terrier and test the effectiveness of techniques such as using the anchor text, query expansion, and selecting an optimal weighting model for each query. Overall, in all three tracks in which we participated, Terrier and the tested Divergence From Randomness models were shown to be stable and effective.
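Pre-retrieval predictors estimate query difficulty from collection statistics alone, before any retrieval is run. The abstract does not name the two predictors used; as one common example of the genre (possibly not the one used here), average inverse collection term frequency:

import math

def avg_ictf(query_terms, coll_tf, coll_tokens):
    """Average inverse collection term frequency. Higher values suggest
    more discriminative query terms and, typically, an easier query.
    Illustrative only; not necessarily the predictor used in the runs."""
    ictfs = [math.log(coll_tokens / max(coll_tf.get(t, 1), 1))
             for t in query_terms]
    return sum(ictfs) / len(ictfs)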

Bibtex
@inproceedings{DBLP:conf/trec/PlachourasHO04,
    author = {Vassilis Plachouras and Ben He and Iadh Ounis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2004: Experiments in Web, Robust, and Terabyte Tracks with Terrier},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/uglasgow.web.robust.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/PlachourasHO04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Microsoft Research Asia at Web Track and Terabyte Track of TREC 2004

Ruihua Song, Ji-Rong Wen, Shuming Shi, Guomao Xin, Tie-Yan Liu, Tao Qin, Xin Zheng, Jiyu Zhang, Gui-Rong Xue, Wei-Ying Ma

Abstract

This paper describes Microsoft Research Asia's (MSRA) experiments on the mixed query task of the Web track and on the Terabyte track. For the Web track, we mainly test a set of new technologies. One of our efforts is to test some new features of Web pages to see if they are helpful to retrieval performance. Title extraction, sitemap-based feature propagation, and URL scoring are of this kind. Another effort is to propose new ranking functions and algorithms to improve relevance or importance ranking. For example, we found that a new link analysis algorithm named HostRank can outperform PageRank [4] for topic distillation queries, based on our experimental results. Finally, linear combination of multiple scores with normalization is used to achieve stable performance improvement on mixed queries.
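The abstract does not specify which normalization was used; a minimal sketch of one common choice, min-max normalization of each score list followed by a weighted linear combination (the weights and toy scores are invented):

def min_max(scores):
    """Rescale a {doc: score} map to [0, 1] (min-max normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def combine(score_maps, weights):
    """Weighted linear combination of several normalized score maps."""
    combined = {}
    for scores, w in zip(score_maps, weights):
        for d, s in min_max(scores).items():
            combined[d] = combined.get(d, 0.0) + w * s
    return sorted(combined.items(), key=lambda x: -x[1])

# Toy example: content-based scores fused with a link-based score.
content = {"d1": 12.0, "d2": 7.5, "d3": 3.1}
linkscore = {"d1": 0.02, "d2": 0.4, "d3": 0.1}
print(combine([content, linkscore], [0.7, 0.3]))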

Bibtex
@inproceedings{DBLP:conf/trec/SongWSXLQZZXM04,
    author = {Ruihua Song and Ji{-}Rong Wen and Shuming Shi and Guomao Xin and Tie{-}Yan Liu and Tao Qin and Xin Zheng and Jiyu Zhang and Gui{-}Rong Xue and Wei{-}Ying Ma},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Microsoft Research Asia at Web Track and Terabyte Track of {TREC} 2004},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/microsoft-asia.web.tera.pdf},
    timestamp = {Tue, 01 Dec 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SongWSXLQZZXM04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Robust, Web and Terabyte Retrieval with Hummingbird SearchServer at TREC 2004

Stephen Tomlinson

Abstract

Hummingbird participated in 3 tracks of TREC 2004: the ad hoc task of the Robust Retrieval Track (find at least one relevant document in the first 10 rows from 1.9GB of news and government data), the mixed navigational and distillation task of the Web Track (find the home or named page or key resource pages in 1.2 million pages (18GB) from the .GOV domain), and the ad hoc task of the Terabyte Track (find all the relevant documents with high precision from 25.2 million pages (426GB) from the .GOV domain). In the robustness task, SearchServer found a relevant document in the first 10 rows for 46 of the 49 new short (Title-only) topics. In the web task, SearchServer returned a desired page in the first 10 rows for more than 75% of the 225 queries. In the terabyte task, SearchServer found a relevant document in the first 10 rows for 45 of the 49 short topics.
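The headline figures ("a relevant document in the first 10 rows for 46 of the 49 topics") are instances of the Success@10 measure; a minimal sketch of computing it, with assumed input formats:

def success_at_10(run, qrels):
    """Fraction of topics with at least one relevant document in the
    top 10 of the run. run: {topic: [doc_id, ...] in rank order};
    qrels: {topic: set of relevant doc_ids}. Formats are assumptions."""
    hits = sum(1 for topic, docs in run.items()
               if any(d in qrels.get(topic, set()) for d in docs[:10]))
    return hits / len(run)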

Bibtex
@inproceedings{DBLP:conf/trec/Tomlinson04,
    author = {Stephen Tomlinson},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Robust, Web and Terabyte Retrieval with Hummingbird SearchServer at {TREC} 2004},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/humingbird.robust.web.tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Tomlinson04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Using Clustering and Blade Clusters in the Terabyte Task

Giuseppe Attardi, Andrea Esuli, Chirag Patel

Abstract

Web search engines exploit conjunctive queries and special ranking criteria which differ from the disjunctive queries typically used for ad hoc retrieval. We wanted to assess the effectiveness of those techniques in the Terabyte task, in particular scoring criteria such as link popularity, proximity boosting, home page score, descriptions, and anchor text. Since conjunctive queries sometimes produce low recall, we tested a new approach to query expansion, which extracts additional query terms from a clustering of the snippets returned by the first query. The technique proved effective, almost doubling the Mean Average Precision. However, the improvement was just enough to compensate for the drop that was introduced, contrary to our expectations, by the proximity boost.
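The abstract leaves the clustering method unspecified; as an illustrative stand-in, here is a greedy single-pass clustering of first-pass snippets over term-frequency vectors, taking frequent non-query terms from the largest cluster as expansion terms. The algorithm and all parameters are assumptions, not the authors' method.

from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def expand_query(query_terms, snippets, threshold=0.3, n_terms=5):
    """Greedy single-pass clustering of snippets, then pick frequent
    non-query terms from the largest cluster as expansion terms.
    query_terms: iterable of lowercase terms; snippets: list of str."""
    vecs = [Counter(s.lower().split()) for s in snippets]
    clusters = []                      # each cluster: list of vectors
    for v in vecs:
        for c in clusters:
            if cosine(v, c[0]) >= threshold:   # compare to cluster seed
                c.append(v)
                break
        else:
            clusters.append([v])
    biggest = max(clusters, key=len)
    pooled = Counter()
    for v in biggest:
        pooled.update(v)
    qset = set(query_terms)
    return [t for t, _ in pooled.most_common() if t not in qset][:n_terms]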

Bibtex
@inproceedings{DBLP:conf/trec/AttardiEP04,
    author = {Giuseppe Attardi and Andrea Esuli and Chirag Patel},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Using Clustering and Blade Clusters in the Terabyte Task},
    booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
    series = {{NIST} Special Publication},
    volume = {500-261},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2004},
    url = {http://trec.nist.gov/pubs/trec13/papers/upisa-tera.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/AttardiEP04.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}