Proceedings 2009

Relevance Feedback

FUB at TREC 2009 Relevance Feedback Track: Diversifying Feedback Documents (Extended Abstract)

Andrea Bernardini, Claudio Carpineto, Edgardo Ambrosi

Abstract

The focus of our participation was the optimal selection and use of diverse feedback documents. Assuming that the query has a topical structure, that the user is interested in only some of the query topics, and that only a small amount of feedback information will be made available, the goal was to select topic representatives to be used as feedback documents and then exploit the feedback information to bias the result set towards the topics of interest. Our method consists of the following steps. The selection of topic representatives was performed by running a first-pass retrieval on the original query and extracting the most novel and relevant documents from the obtained results. These documents were returned for manual evaluation, and the provided feedback was given as input to an improved version of the relevance feedback system that we developed for the TREC 2008 Relevance Feedback track. As this system was designed to deal explicitly with both positive and negative relevance feedback, we reckoned that in principle it would be able to use the diverse topic representatives to filter out highly ranked documents belonging to unwanted topics in an effective manner. Unfortunately, due to resource and time limitations, we were not able to fully implement and test the method outlined above. We had to make a number of approximations and simplifications that did not allow us to evaluate its true potential for improving relevance feedback. Likewise, we could not test our main hypothesis that this method is suitable for dealing with multi-topic queries.

Bibtex
@inproceedings{DBLP:conf/trec/BernardiniCA09,
    author = {Andrea Bernardini and Claudio Carpineto and Edgardo Ambrosi},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{FUB} at {TREC} 2009 Relevance Feedback Track: Diversifying Feedback Documents (Extended Abstract)},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/fub.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BernardiniCA09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Padua at TREC 2009: Relevance Feedback Track

Emanuele Di Buccio, Massimo Melucci

Abstract

In the Relevance Feedback (RF) task the user is directly involved in the search process: given an initial set of results, they specify whether each result is relevant to their information goal. In the TREC 2009 RF track the first five documents retrieved by the baseline systems were judged by the assessors and then used as evidence for the RF algorithms to be tested. The specific algorithm we tested is mainly based on a geometric framework which allows the latent semantic associations of terms in the feedback documents to be modeled as a vector subspace; the documents of the collection, represented as vectors of TF·IDF weights, were re-ranked according to their distance from the subspace. The adopted geometric framework was used in past work as a basis for Implicit Relevance Feedback (IRF) and Pseudo Relevance Feedback (PRF) algorithms; participation in the RF track allowed us to make some preliminary investigations into the effectiveness of the adopted framework when it is exploited to support explicit RF on much larger test collections, thus complementing the work carried out for the other RF strategies.
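
The subspace re-ranking step described in the abstract can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the Gram-Schmidt routine, the toy TF·IDF vectors, and the document names are all assumptions.

```python
import math

def gram_schmidt(vectors):
    """Orthonormalize the feedback-document vectors; the resulting
    basis spans the feedback subspace."""
    basis = []
    for v in vectors:
        w = v[:]
        for b in basis:
            dot = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - dot * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 1e-10:
            basis.append([wi / norm for wi in w])
    return basis

def distance_from_subspace(doc, basis):
    """Norm of the component of `doc` orthogonal to the subspace:
    a small distance means the document lies close to the feedback documents."""
    residual = doc[:]
    for b in basis:
        dot = sum(di * bi for di, bi in zip(residual, b))
        residual = [di - dot * bi for di, bi in zip(residual, b)]
    return math.sqrt(sum(di * di for di in residual))

# Toy TF-IDF vectors over a 4-term vocabulary; the values are made up.
relevant_feedback = [[0.9, 0.1, 0.0, 0.0], [0.8, 0.3, 0.1, 0.0]]
collection = {
    "d1": [0.7, 0.2, 0.0, 0.1],   # close to the feedback subspace
    "d2": [0.0, 0.1, 0.9, 0.4],   # far from it
}
basis = gram_schmidt(relevant_feedback)
ranking = sorted(collection, key=lambda d: distance_from_subspace(collection[d], basis))
print(ranking)  # documents nearest the subspace come first
```

Documents whose TF·IDF vectors lie near the subspace spanned by the judged-relevant documents are promoted, which is the re-ranking criterion the abstract describes.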

Bibtex
@inproceedings{DBLP:conf/trec/BuccioM09,
    author = {Emanuele Di Buccio and Massimo Melucci},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Padua at {TREC} 2009: Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/upadua.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BuccioM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Minimal Test Collections for Relevance Feedback

Ben Carterette, Praveen Chandar, Aparna Kailasam, Divya Muppaneni, Sree Lekha Thota

Abstract

The Information Retrieval Lab at the University of Delaware participated in the Relevance Feedback track at TREC 2009. We used only the Category B subset of the ClueWeb collection; our preprocessing and indexing steps are described in our paper on ad hoc and diversity runs [10]. The second year of the Relevance Feedback track focused on selection of documents for feedback. Our hypothesis is that documents that are good at distinguishing systems in terms of their effectiveness by mean average precision will also be good documents for relevance feedback. Thus we have applied the document selection algorithm MTC (Minimal Test Collections) developed by Carterette et al. [6, 4, 9, 5] that is used in the Million Query Track [2, 1, 8] for selecting documents to be judged to find the right ranking of systems. Our approach can therefore be described as “MTC for Relevance Feedback”.

Bibtex
@inproceedings{DBLP:conf/trec/CarteretteCKMT09,
    author = {Ben Carterette and Praveen Chandar and Aparna Kailasam and Divya Muppaneni and Sree Lekha Thota},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Minimal Test Collections for Relevance Feedback},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-ben.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CarteretteCKMT09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

UMass Amherst and UT Austin @ the TREC 2009 Relevance Feedback Track

Marc-Allen Cartright, Jangwon Seo, Matthew Lease

Abstract

We present a new supervised method for estimating term-based retrieval models and apply it to weight expansion terms from relevance feedback. While previous work on supervised feedback [Cao et al., 2008] demonstrated significantly improved retrieval accuracy over standard unsupervised approaches [Lavrenko and Croft, 2001, Zhai and Lafferty, 2001], feedback terms were assumed to be independent in order to reduce training time. In contrast, we adapt the AdaRank learning algorithm [Xu and Li, 2007] to simultaneously estimate parameterization of all feedback terms. While not evaluated here, the method can be more generally applied for joint estimation of both query and feedback terms. To apply our method to a large web collection, we also investigate use of sampling to reduce feature extraction time while maintaining robust learning.

Bibtex
@inproceedings{DBLP:conf/trec/CartrightSL09,
    author = {Marc{-}Allen Cartright and Jangwon Seo and Matthew Lease},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {UMass Amherst and {UT} Austin @ the {TREC} 2009 Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/umass-amhearst.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CartrightSL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks

Gordon V. Cormack, Mona Mojdeh

Abstract

For TREC 2009, we exhaustively classified every document in each corpus, using machine learning methods that had previously been shown to work well for email spam [9, 3]. We treated each document as a sequence of bytes, with no tokenization or parsing of tags or meta-information. This approach was used exclusively for the ad hoc web, diversity and relevance feedback tasks, as well as for the batch legal task: the ClueWeb09 and Tobacco collections were processed end-to-end and never indexed. We did the interactive legal task in two phases: first, we used interactive search and judging to find a large and diverse set of training examples; then we used an active learning process, similar to what we used for the other tasks, to find more relevant documents. Finally, we fitted a censored (i.e. truncated) mixed normal distribution to estimate recall and the cutoff to optimize F1, the principal effectiveness measure.

Bibtex
@inproceedings{DBLP:conf/trec/CormackM09,
    author = {Gordon V. Cormack and Mona Mojdeh},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Machine Learning for Information Retrieval: {TREC} 2009 Web, Relevance Feedback and Legal Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.WEB.RF.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CormackM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Microsoft Research at TREC 2009: Web and Relevance Feedback Track

Nick Craswell, Dennis Fetterly, Marc Najork, Stephen Robertson, Emine Yilmaz

Abstract

We took part in the Web and Relevance Feedback tracks, using the ClueWeb09 corpus. To process the corpus, we developed a parallel processing pipeline which avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline and how we ran the TREC experiments, and we present effectiveness results.

Bibtex
@inproceedings{DBLP:conf/trec/CraswellFNRY09,
    author = {Nick Craswell and Dennis Fetterly and Marc Najork and Stephen Robertson and Emine Yilmaz},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Microsoft Research at {TREC} 2009: Web and Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/microsoft.WEB.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CraswellFNRY09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

CMIC@TREC 2009: Relevance Feedback Track

Kareem Darwish, Ahmed El-Deeb

Abstract

This paper describes CMIC's submissions to the TREC'09 relevance feedback track. In our phase 1 runs, we experimented with two different techniques to produce 5 documents to be judged by the user in the initial feedback step, namely using knowledge bases and clustering. Both techniques attempt to topically diversify these 5 documents as much as possible in an effort to maximize the probability that they contain at least one relevant document. The basic premise is that if a query has n diverse interpretations, then diversifying results and picking the top 5 most likely interpretations would maximize the probability that a user would be interested in at least one interpretation. In the phase 2 runs, which involved the use of the feedback attained from the phase 1 judgments, we attempted to use positive and negative judgments in weighting the terms to be used for subsequent feedback.

Bibtex
@inproceedings{DBLP:conf/trec/DarwishE09,
    author = {Kareem Darwish and Ahmed El{-}Deeb},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {CMIC@TREC 2009: Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/cmic.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DarwishE09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Pairwise Document Classification for Relevance Feedback

Jonathan L. Elsas, Pinar Donmez, Jamie Callan, Jaime G. Carbonell

Abstract

In this paper we present Carnegie Mellon University's submission to the TREC 2009 Relevance Feedback Track. In this submission we take a classification approach over document pairs to exploit relevance feedback information. We explore using textual and non-textual document-pair features to classify unjudged documents as relevant or non-relevant, and use this prediction to re-rank a baseline document retrieval run. These features include co-citation measures and URL similarities, as well as features often used in machine learning systems for document ranking, such as the difference in scores assigned by the baseline retrieval system.

Bibtex
@inproceedings{DBLP:conf/trec/ElsasDCC09,
    author = {Jonathan L. Elsas and Pinar Donmez and Jamie Callan and Jaime G. Carbonell},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Pairwise Document Classification for Relevance Feedback},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/cmu.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ElsasDCC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Twente @ TREC 2009: Indexing Half a Million Web Pages

Claudia Hauff, Djoerd Hiemstra

Abstract

The University of Twente participated in three tasks of TREC 2009: the adhoc task, the diversity task and the relevance feedback task. All experiments are performed on the English part of ClueWeb09. We describe our approach to tuning our retrieval system in the absence of training data in Section 3. We describe the use of categories and a query log for diversifying search results in Section 4. Section 5 describes preliminary results for the relevance feedback task.

Bibtex
@inproceedings{DBLP:conf/trec/HauffH09,
    author = {Claudia Hauff and Djoerd Hiemstra},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Twente @ {TREC} 2009: Indexing Half a Million Web Pages},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/utwente.WEB.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HauffH09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

PRIS at 2009 Relevance Feedback Track: Experiments in Language Model for Relevance Feedback

Si Li, Xinsheng Li, Hao Zhang, Sanyuan Gao, Guang Chen, Jun Guo

Abstract

This paper describes BUPT (PRIS) participation in the 2009 Relevance Feedback Track. The track has two phases. In the first phase, 5 documents are submitted based on the results of k-means clustering. In the second phase, a language model is applied to relevance feedback for query expansion.
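
A language-model-based expansion step of the kind mentioned above can be sketched as follows. This is a hedged illustration, not the PRIS system: the smoothed term-scoring formula, the interpolation weight `mu`, and the toy documents are all assumptions.

```python
from collections import Counter

def expansion_terms(relevant_docs, query_terms, k=5, mu=0.5):
    """Score candidate expansion terms by a smoothed maximum-likelihood
    estimate of P(w | relevant), averaged over the judged-relevant
    documents (a relevance-model-style estimate; details are illustrative)."""
    # Background model for smoothing: term frequencies over the fed-back docs.
    coll = Counter(w for d in relevant_docs for w in d)
    coll_total = sum(coll.values())
    scores = Counter()
    for doc in relevant_docs:
        tf = Counter(doc)
        length = len(doc)
        for w in tf:
            p_ml = tf[w] / length            # within-document estimate
            p_coll = coll[w] / coll_total    # background estimate
            scores[w] += (1 - mu) * p_ml + mu * p_coll
    # Average over documents and drop the original query terms.
    n = len(relevant_docs)
    ranked = [(w, s / n) for w, s in scores.items() if w not in query_terms]
    ranked.sort(key=lambda x: -x[1])
    return ranked[:k]

docs = [["oil", "spill", "cleanup", "coast"],
        ["oil", "spill", "tanker", "coast"]]
print(expansion_terms(docs, query_terms={"oil", "spill"}, k=3))
```

Terms that recur across the judged-relevant documents ("coast" in the toy example) receive the highest weights and are appended to the query.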

Bibtex
@inproceedings{DBLP:conf/trec/LiLZGCG09,
    author = {Si Li and Xinsheng Li and Hao Zhang and Sanyuan Gao and Guang Chen and Jun Guo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{PRIS} at 2009 Relevance Feedback Track: Experiments in Language Model for Relevance Feedback},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/pris.RF.pdf},
    timestamp = {Tue, 17 Nov 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LiLZGCG09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Mining Specific and General Features in Both Positive and Negative Relevance Feedback

Yuefeng Li, Xiaohui Tao, Abdulmohsen Algarni, Sheng-Tang Wu

Abstract

User relevance feedback is usually utilized by Web systems to interpret user information needs and retrieve effective results for users. However, how to discover useful knowledge in user relevance feedback and how to wisely use the discovered knowledge are two critical problems. In TREC 2009, we participated in the Relevance Feedback Track and experimented with a model consisting of two innovative stages: one for subject-based query expansion to extract pseudo-relevance feedback; one for relevance feature discovery to find useful patterns and terms in relevance judgements to rank documents. In this paper, a detailed description of our model is given, along with a discussion of the experimental results.

Bibtex
@inproceedings{DBLP:conf/trec/LiTAW09,
    author = {Yuefeng Li and Xiaohui Tao and Abdulmohsen Algarni and Sheng{-}Tang Wu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Mining Specific and General Features in Both Positive and Negative Relevance Feedback},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/queenslandu.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LiTAW09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2009: Experiments with Terrier

Richard McCreadie, Craig Macdonald, Iadh Ounis, Jie Peng, Rodrygo L. T. Santos

Abstract

In TREC 2009, we extend our Voting Model for the faceted blog distillation, top stories identification, and related entity finding tasks. Moreover, we experiment with our novel xQuAD framework for search result diversification. Besides fostering our research in multiple directions, by participating in such a wide portfolio of tracks, we further develop the indexing and retrieval capabilities of our Terrier Information Retrieval platform, to effectively and efficiently cope with a new generation of large-scale test collections.

Bibtex
@inproceedings{DBLP:conf/trec/McCreadieMOPS09,
    author = {Richard McCreadie and Craig Macdonald and Iadh Ounis and Jie Peng and Rodrygo L. T. Santos},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2009: Experiments with Terrier},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/McCreadieMOPS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Topical Diversity and Relevance Feedback

Edgar Meij, Jiyin He, Wouter Weerkamp, Maarten de Rijke

Abstract

We describe the participation of the University of Amsterdam's Intelligent Systems Lab in the relevance feedback track at TREC 2009. Our main conclusion for the relevance feedback track is that a topical diversity approach provides good feedback documents. Further, we find that our relevance feedback algorithm seems to help most when there are sufficient relevant documents available.

Bibtex
@inproceedings{DBLP:conf/trec/MeijHWR09,
    author = {Edgar Meij and Jiyin He and Wouter Weerkamp and Maarten de Rijke},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Topical Diversity and Relevance Feedback},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-derijke.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MeijHWR09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments with ClueWeb09: Relevance Feedback and Web Tracks

Mark D. Smucker, Charles L. A. Clarke, Gordon V. Cormack

Abstract

In this paper, we report on our TREC experiments with the ClueWeb09 document collection. We participated in the relevance feedback and web tracks. While our phase 1 relevance feedback run's performance was good, our other relevance feedback and web track submissions' performances were lacking. We suspect this performance difference is caused by the Category B document subset of the ClueWeb09 collection having a higher prior probability of relevance than the rest of the collection. Future work will involve a more detailed error analysis of our experiments.

Bibtex
@inproceedings{DBLP:conf/trec/SmuckerCC09,
    author = {Mark D. Smucker and Charles L. A. Clarke and Gordon V. Cormack},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments with ClueWeb09: Relevance Feedback and Web Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SmuckerCC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Relevance Feedback Based on Constrained Clustering: FDU at TREC 09

Bingqing Wang, Xuanjing Huang

Abstract

We describe our participation in the 2009 TREC Relevance Feedback (RF) track. The RF09 track focused on explicit relevance feedback, where a few relevant and irrelevant documents are available for each query. Our system is implemented within the framework of a probabilistic language model. We apply constrained clustering to the top returned documents and extract expansion words to reformulate the query. We also extract named entities from the explicitly relevant documents to expand the query. The experiments were conducted on the ClueWeb09 TREC Category B collection, which is a new and huge test collection for the TREC tracks. The evaluation results show the performance of the constrained clustering.

Bibtex
@inproceedings{DBLP:conf/trec/WangH09,
    author = {Bingqing Wang and Xuanjing Huang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Relevance Feedback Based on Constrained Clustering: {FDU} at {TREC} 09},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/fudanu.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangH09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

York University at TREC 2009: Relevance Feedback Track

Zheng Ye, Xiangji Huang, Ben He, Hongfei Lin

Abstract

We describe a series of experiments conducted in our participation in the Relevance Feedback Track. We evaluate two traditional weighting models (BM25 and DFR) for the phase 1 task, which are widely used in the text retrieval domain. We also evaluate a statistics-based feedback model and our proposed feedback model for the phase 2 task. Currently, we are waiting for the overview paper to facilitate further analyses.

Bibtex
@inproceedings{DBLP:conf/trec/YeHHYL09,
    author = {Zheng Ye and Xiangji Huang and Ben He and Hongfei Lin},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {York University at {TREC} 2009: Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/yorku.RF.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/YeHHYL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

UCSC at Relevance Feedback Track

Lanbo Zhang, Jadiel de Arma, Kai Yu

Abstract

The relevance feedback track in TREC 2009 focuses on two subtasks: actively selecting good documents for users to provide relevance feedback on, and retrieving documents based on user relevance feedback. For the first task, we tried a clustering-based method and the Transductive Experimental Design (TED) method proposed by Yu et al. [5]. For the clustering-based method, we use the k-means algorithm to cluster the top retrieved documents and choose the most representative document of each cluster. The TED method aims to find documents that are hard to predict and representative of the unlabeled documents. For the second task, we did query expansion based on a relevance model learned on the relevant documents.
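
The clustering-based selection step described above can be sketched as follows. This is a hedged sketch of the general technique, not the UCSC code: the plain k-means loop, the nearest-to-centroid representative rule, and the toy 2-D vectors are all assumptions.

```python
import random

def kmeans_representatives(docs, k, iters=20, seed=0):
    """Cluster top-retrieved document vectors with k-means and return, for
    each cluster, the index of the document closest to its centroid, i.e.
    the cluster's most representative document."""
    rng = random.Random(seed)
    dim = len(docs[0])
    # Initialize centroids from k randomly chosen documents.
    centroids = [list(docs[i]) for i in rng.sample(range(len(docs)), k)]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    for _ in range(iters):
        # Assign each document to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for i, d in enumerate(docs):
            clusters[min(range(k), key=lambda c: dist2(d, centroids[c]))].append(i)
        # Recompute centroids as cluster means.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [sum(docs[i][j] for i in members) / len(members)
                                for j in range(dim)]
    reps = []
    for c, members in enumerate(clusters):
        if members:
            reps.append(min(members, key=lambda i: dist2(docs[i], centroids[c])))
    return reps

# Toy 2-D "document vectors": two obvious topical groups plus one in-between.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
print(kmeans_representatives(docs, k=2))
```

Picking one representative per cluster yields a topically diverse set of documents to send out for judging, which is the point of the first subtask.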

Bibtex
@inproceedings{DBLP:conf/trec/ZhangAY09,
    author = {Lanbo Zhang and Jadiel de Arma and Kai Yu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{UCSC} at Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uc-santa.cruz.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ZhangAY09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Chemical

Overview of the TREC 2009 Chemical IR Track

Mihai Lupu, Florina Piroi, Xiangji Huang, Jianhan Zhu, John Tait

Abstract

TREC 2009 was the first year of the Chemical IR Track, which focuses on evaluation of search techniques for discovery of digitally stored information on chemical patents and academic journal articles. The track included two tasks: Prior Art (PA) and Technical Survey (TS) tasks. This paper describes how we designed the two tasks and presents the official results of eight participating groups.

Bibtex
@inproceedings{DBLP:conf/trec/LupuPHZT09,
    author = {Mihai Lupu and Florina Piroi and Xiangji Huang and Jianhan Zhu and John Tait},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/CHEM09.OVERVIEW.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LupuPHZT09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Strategies for Effective Chemical Information Retrieval

Suleyman Cetintas, Luo Si

Abstract

We participated in the technology survey and prior art search subtasks of the TREC 2009 Chemical IR Track. This paper describes the methods developed for these two tasks. For the technology survey task, we propose a method that constructs highly structured queries to do retrieval on different fields of chemical patents and documents in a weighted way. The proposed method i) enriches these structured queries with synonyms of the chemicals that have been identified, and ii) uses simple entity recognition to extract information for increasing or decreasing the weights of some terms and to filter out documents from the ranked list. For the prior art search task, we propose an automated query generation method that uses all title words, and selects sets of terms from the claims, abstract and description fields of query patents to transform a query patent into a search query. From the selected terms, chemical entities are extracted and synonyms for the identified chemical entities are included from PubChem. Then structured queries are formed to do retrieval over different fields of documents with different weights. Furthermore, a post-processing step is also proposed that i) filters out some of the retrieved documents from the ranked list because of date constraints and ii) utilizes the IPC similarities between a query patent and its retrieved patents to re-rank the retrieved documents. Empirical results demonstrate the effectiveness of these methods in both tasks.

Bibtex
@inproceedings{DBLP:conf/trec/CetintasS09,
    author = {Suleyman Cetintas and Luo Si},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Strategies for Effective Chemical Information Retrieval},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/purdue.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CetintasS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Report on the TREC 2009 Experiments: Chemical IR Track

Julien Gobeill, Douglas Teodoro, Emilie Pasche, Patrick Ruch

Abstract

The goal of the first TREC Chemical track was to retrieve documents relevant to a given patent query, within a large collection of patents in chemistry. Regarding this objective, for the Prior Art subtask, our runs performed significantly better than runs submitted by other participating teams. Baseline retrieval methods achieved relatively poor performance (Mean Average Precision = 0.067). Query expansion, driven by chemical named entity recognition, resulted in some modest improvement (+2 to 3%). Filtering based on IPC codes did not result in any significant improvement. A re-ranking strategy based on claims alone improved MAP by about 3%. The most effective gain was obtained by using patent citation patterns. Somewhat similar to feedback, but restricted to citations, we used patents cited in the retrieved patents in order to boost the retrieval status value of the baseline run. This strategy led to a remarkable improvement (MAP 0.18, +168%). Nevertheless, as official topics were sampled from the collection disregarding their creation date, our strategy happened to exploit citations of patents which were patented after the topic itself. From a user perspective, such a setting is questionable. We think that future TREC-CHEM competitions should address this issue by using patents filed as recently as possible.
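
The citation-boosting idea described above, where patents cited by retrieved patents have their retrieval status value raised, can be sketched as follows. This is a minimal sketch under assumptions: the additive boost formula, the `alpha` weight, and the toy patent IDs are all hypothetical, not the authors' actual scheme.

```python
def citation_boost(baseline_scores, citations, alpha=0.5):
    """Add to each cited patent's score a fraction `alpha` of the baseline
    score of every retrieved patent that cites it. Patents not in the
    baseline run can enter the ranking through citations alone."""
    boosted = dict(baseline_scores)
    for patent, score in baseline_scores.items():
        for cited in citations.get(patent, ()):
            boosted[cited] = boosted.get(cited, 0.0) + alpha * score
    return boosted

# Toy baseline retrieval status values and citation lists.
baseline = {"EP1": 2.0, "EP2": 1.0}
cites = {"EP1": ["EP3"], "EP2": ["EP3", "EP1"]}
print(citation_boost(baseline, cites))
```

In the toy run, EP3 is absent from the baseline but is cited by both retrieved patents, so it enters the boosted ranking; EP1 is both retrieved and cited, so its score rises.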

Bibtex
@inproceedings{DBLP:conf/trec/GobeillTPR09,
    author = {Julien Gobeill and Douglas Teodoro and Emilie Pasche and Patrick Ruch},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Report on the {TREC} 2009 Experiments: Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/bitem.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GobeillTPR09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Patent Retrieval in Chemistry Based on Semantically Tagged Named Entities

Harsha Gurulingappa, Bernd Müller, Roman Klinger, Heinz-Theodor Mevissen, Martin Hofmann-Apitius, Juliane Fluck, Christoph M. Friedrich

Abstract

This paper reports on the work conducted by Fraunhofer SCAI for the TREC Chemistry (TREC-CHEM) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by TREC. For the technology survey, three runs were submitted based on semantic dictionaries and noun phrases. For the prior art search task, several fields were introduced into the index that contained normalized noun phrases, as well as biomedical and chemical entities. Altogether, 36 runs were submitted for this task, based on automatic querying with tokens, noun phrases and entities along with different search strategies.

Bibtex
@inproceedings{DBLP:conf/trec/GurulingappaMKMHFF09,
    author = {Harsha Gurulingappa and Bernd M{\"{u}}ller and Roman Klinger and Heinz{-}Theodor Mevissen and Martin Hofmann{-}Apitius and Juliane Fluck and Christoph M. Friedrich},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Patent Retrieval in Chemistry Based on Semantically Tagged Named Entities},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/scai.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GurulingappaMKMHFF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

DUTIR at TREC 2009: Chemical IR Track

Song Jin, Zheng Ye, Hongfei Lin

Abstract

This paper presents the DUTIR submission to the TREC 2009 Chemical IR Track. The track included two tasks: the Prior Art (PA) task and the Technology Survey (TS) task. We present a series of experiments with two text retrieval models, BM25 and the Language Model for IR (LMIR). For the Prior Art task, we focused on formulating queries from the query patents and on date filtering. In addition, some traditional search techniques were used for the Technology Survey task.
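
For reference, Okapi BM25, one of the two retrieval models compared above, can be sketched as below. The function name and toy statistics are illustrative, and the defaults k1=1.2, b=0.75 are common textbook choices, not necessarily DUTIR's settings.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, df, n_docs, avgdl, k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    df     -- term -> document frequency over the collection
    n_docs -- number of documents in the collection
    avgdl  -- average document length in tokens
    """
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        if t not in tf:
            continue
        # Robertson-Sparck Jones idf, shifted to stay non-negative
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        num = tf[t] * (k1 + 1.0)
        den = tf[t] + k1 * (1.0 - b + b * dl / avgdl)
        score += idf * num / den
    return score
```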

Bibtex
@inproceedings{DBLP:conf/trec/JinYL09,
    author = {Song Jin and Zheng Ye and Hongfei Lin},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{DUTIR} at {TREC} 2009: Chemical {IR} Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/dalianu.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/JinYL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC Blog and TREC Chem: A View from the Corn Fields

Yelena Mejova, Viet Ha-Thuc, Steven Foster, Christopher G. Harris, Robert J. Arens, Padmini Srinivasan

Abstract

The University of Iowa team participated in the blog track and the chemistry track of TREC 2009. This is our first year participating in either track.

Bibtex
@inproceedings{DBLP:conf/trec/MejovaHFHAS09,
    author = {Yelena Mejova and Viet Ha{-}Thuc and Steven Foster and Christopher G. Harris and Robert J. Arens and Padmini Srinivasan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} Blog and {TREC} Chem: {A} View from the Corn Fields},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uiowa.BLOG.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MejovaHFHAS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC Chemical IR Track 2009: A Distributed Dimensional Indexing Model for Chemical Patent Search

Jay Urbain, Ophir Frieder

Abstract

For the TREC-2009 Chemical IR Track, we explore development of a distributed information retrieval system based on a dimensional data model. The indexing model supports named entity identification and aggregation of term statistics at multiple levels of patent structure including individual words, sentences, claims, descriptions, abstracts, and titles. The system was deployed across 15 Amazon Web Services (AWS) Elastic Cloud Compute (EC2) instances and 15 Elastic Block Storage (EBS) database shards to support efficient indexing and query processing of the relatively large index generated from indexing each individual word (sans stop words) in the 100G+ collection of chemical patent documents. The query processing algorithm for technology survey search and prior art search uses information extraction techniques and locally aggregated term statistics to help disambiguate candidate entities and terms in context. Query processing for prior art search automatically generates a structured query based on the relative distinctiveness of individual terms and candidate entity phrases from the query patent's claims, abstract, and title sections. For both the technology survey and prior art search, we evaluated several probabilistic retrieval functions for integrating statistics of retrieved named entities with term statistics at multiple levels of document structure to identify relevant patents.

Bibtex
@inproceedings{DBLP:conf/trec/UrbainF09,
    author = {Jay Urbain and Ophir Frieder},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} Chemical {IR} Track 2009: {A} Distributed Dimensional Indexing Model for Chemical Patent Search},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/milwaukee.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/UrbainF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Formulating Simple Structured Queries Using Temporal and Distributional Cues in Patents

Le Zhao, James P. Callan

Abstract

Patent prior art retrieval aims to find related publications, especially patents, that may invalidate the query patent. The task has its own characteristics because a whole patent may be used as a query. This work focuses on the use of the date fields and content fields of the query patent to formulate effective structured queries. Retrieval is performed on a collection of patents that share the same structure as the query patent, mainly priority dates, application date, publication date and content fields. Unsurprisingly, results show that filtering using date information improves retrieval significantly. However, results also show that a careful choice of the date filter is important, given the multiple date fields present in a patent. The actual ranking query is constructed from the word distributions of the title, claims and content fields of the query patent. The overall MAP for this citation finding task is still in the lower 0.1 range. An error analysis focusing on the lower-performing topics finds that the citation finding task (given a publication, recommend citations, which is a setup very similar to this year's prior art evaluation) can be very different from the prior art task (finding patents that invalidate the query patent). This raises the concern that the citations included in query patents alone can be a biased and incomplete set of relevance judgements for the prior art task.
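
The date-filtering idea can be illustrated with a minimal sketch; this is a generic reconstruction (the paper's point is precisely that *which* date field to compare is a careful choice), and the function name and dictionary layout are hypothetical. ISO-8601 date strings compare correctly as plain text:

```python
def date_filter(candidates, pub_dates, query_priority_date):
    """Keep only candidates with a known date strictly earlier than the
    query patent's chosen reference date (e.g. its earliest priority date).

    candidates          -- ranked list of doc ids
    pub_dates           -- doc id -> ISO date string, e.g. "2001-05-01"
    query_priority_date -- ISO date string for the query patent
    """
    return [d for d in candidates
            if d in pub_dates and pub_dates[d] < query_priority_date]
```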

Bibtex
@inproceedings{DBLP:conf/trec/ZhaoC09,
    author = {Le Zhao and James P. Callan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Formulating Simple Structured Queries Using Temporal and Distributional Cues in Patents},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/cmu.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ZhaoC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

York University at TREC 2009: Chemical Track

Jiashu Zhao, Xiangji Huang, Zheng Ye, Jianhan Zhu

Abstract

Our chemical experiments mainly focus on addressing three major problems in the two chemical information retrieval tasks, the Technology Survey (TS) task and the Prior Art (PA) task. The three problems are: (1) how to deal with chemical terminology synonyms; (2) how to deal with chemical terminology abbreviations; and (3) how to deal with long queries in the Prior Art (PA) task. In particular, we propose a query expansion algorithm for the TS task and a keyword-selection algorithm for the PA task. The Mean Average Precision (MAP) for our TS task run “york09ca07” using Algorithm 1 was 0.2519, and for our PA task run “york09caPA01” using Algorithm 2 it was 0.0566. The evaluation results show that both algorithms are effective for improving retrieval performance.
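
The MAP figures quoted here (and throughout these abstracts) average a per-topic quantity, average precision, over all topics. A minimal sketch of the per-topic computation, with an illustrative function name:

```python
def average_precision(ranking, relevant):
    """Average precision for one topic.

    ranking  -- ranked list of doc ids
    relevant -- set of relevant doc ids (the judged qrels for the topic)
    """
    hits, precision_sum = 0, 0.0
    for i, doc in enumerate(ranking):
        if doc in relevant:
            hits += 1
            precision_sum += hits / (i + 1)  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0
```

MAP is then the mean of this value over the topic set, as computed by the standard trec_eval tool.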

Bibtex
@inproceedings{DBLP:conf/trec/ZhaoHYYZ09,
    author = {Jiashu Zhao and Xiangji Huang and Zheng Ye and Jianhan Zhu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {York University at {TREC} 2009: Chemical Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/yorku.CHEM.pdf},
    timestamp = {Sun, 02 Oct 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhaoHYYZ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Overview of the TREC 2009 Legal Track

Bruce Hedin, Stephen Tomlinson, Jason R. Baron, Douglas W. Oard

Abstract

TREC 2009 was the fourth year of the Legal Track, which focuses on evaluation of search technology for “discovery” (i.e., responsive review) of electronically stored information in litigation and regulatory settings. The track included two tasks: an Interactive task (in which real users could iteratively refine their queries and/or engage in multi-pass relevance feedback) and a Batch task (two-pass search in a controlled setting with some relevant and nonrelevant documents manually marked after the first pass). This paper describes the design of the two tasks and presents the results.

Bibtex
@inproceedings{DBLP:conf/trec/HedinTBO09,
    author = {Bruce Hedin and Stephen Tomlinson and Jason R. Baron and Douglas W. Oard},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Legal Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/LEGAL09.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HedinTBO09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks

Gordon V. Cormack, Mona Mojdeh

Abstract

For TREC 2009, we exhaustively classified every document in each corpus, using machine learning methods that had previously been shown to work well for email spam [9, 3]. We treated each document as a sequence of bytes, with no tokenization or parsing of tags or meta-information. This approach was used exclusively for the adhoc web, diversity and relevance feedback tasks, as well as for the batch legal task: the ClueWeb09 and Tobacco collections were processed end-to-end and never indexed. We did the interactive legal task in two phases: first, we used interactive search and judging to find a large and diverse set of training examples; then we used an active learning process, similar to the one we used for the other tasks, to find more relevant documents. Finally, we fitted a censored (i.e. truncated) mixed normal distribution to estimate recall and the cutoff to optimize F1, the principal effectiveness measure.
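
The last step, choosing a cutoff to optimize F1, can be illustrated with a simple expected-F1 sweep over ranks. This is a generic sketch under stated assumptions (calibrated per-document relevance probabilities and an externally estimated relevant-set size), not the authors' censored-normal fitting procedure; the function name is hypothetical.

```python
def best_f1_cutoff(rel_probs, est_total_relevant):
    """Pick the rank cutoff K that maximizes expected F1.

    rel_probs          -- estimated P(relevant) for each doc, in rank order
    est_total_relevant -- estimated number of relevant docs in the corpus
                          (e.g. from a fitted score distribution)
    """
    best_k, best_f1 = 0, 0.0
    expected_hits = 0.0
    for k, p in enumerate(rel_probs, start=1):
        expected_hits += p
        if expected_hits == 0.0:
            continue  # F1 undefined with no expected hits
        precision = expected_hits / k
        recall = expected_hits / est_total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_k, best_f1 = k, f1
    return best_k, best_f1
```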

Bibtex
@inproceedings{DBLP:conf/trec/CormackM09,
    author = {Gordon V. Cormack and Mona Mojdeh},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Machine Learning for Information Retrieval: {TREC} 2009 Web, Relevance Feedback and Legal Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.WEB.RF.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CormackM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Backstop LLP and Cleary Gottlieb Steen & Hamilton LLP at TREC Legal Track 2009

Bruce Ellis Fein, Brian Merrell, F. Eli Nelson

Abstract

This paper presents the results of the collaborative entry of Backstop LLP and Cleary Gottlieb Steen & Hamilton LLP in the Legal Track of the 2009 Text REtrieval Conference (TREC) sponsored by the National Institute of Standards and Technology (NIST). The Legal Track served as a truncated replication of a document review of almost one million documents. Backstop software, assisted by attorney document review of less than one-tenth of one percent of the overall document set, classified the documents and achieved a combined accuracy rate (“F1 score”) of approximately 80%.

Bibtex
@inproceedings{DBLP:conf/trec/FeinMN09,
    author = {Bruce Ellis Fein and Brian Merrell and F. Eli Nelson},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Backstop {LLP} and Cleary Gottlieb Steen {\&} Hamilton {LLP} at {TREC} Legal Track 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/backstop.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/FeinMN09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Sparse Matrix Factorization: Applications to Latent Semantic Indexing

Erin Moulding, Raymond J. Spiteri, April Kontostathis

Abstract

This article describes the use of Latent Semantic Indexing (LSI) and some of its variants for the TREC Legal batch task. Both folding-in and Essential Dimensions of LSI (EDLSI) appeared as if they might be successful for recall-focused retrieval on a collection of this size. Furthermore, we developed a new LSI technique, one which replaces the Singular Value Decomposition (SVD) with another technique for matrix factorization, the sparse column-row approximation (SCRA). We were able to conclude that all three LSI techniques have similar performance. Although our 2009 results showed significant improvement when compared to our 2008 results, the use of a better method for selecting the parameter K, which is the rank that results in the best balance between precision and recall, appears to have provided the most benefit.
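
Classical LSI retrieval via a rank-K truncated SVD can be sketched as below; this is a minimal generic version (the SCRA and EDLSI variants the paper studies are not shown), and the function name and matrix layout are illustrative.

```python
import numpy as np

def lsi_retrieve(term_doc, query_vec, k):
    """Rank documents by cosine similarity in a k-dimensional LSI space.

    term_doc  -- terms x docs weight matrix (e.g. tf-idf)
    query_vec -- length-|terms| vector for the query
    k         -- number of latent dimensions kept (the parameter K above)
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
    docs_k = (np.diag(sk) @ Vtk).T        # one row per document in latent space
    q_k = query_vec @ Uk                  # project the query into the same space
    sims = docs_k @ q_k / (
        np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k) + 1e-12)
    return np.argsort(-sims)              # doc indices, best first
```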

Bibtex
@inproceedings{DBLP:conf/trec/MouldingSK09,
    author = {Erin Moulding and Raymond J. Spiteri and April Kontostathis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Sparse Matrix Factorization: Applications to Latent Semantic Indexing},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ursinus.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MouldingSK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Clearwell Systems at TREC 2009 Legal Interactive

Venkat Rangan, Maojin Jiang

Abstract

The TREC Legal Track 2009 features an Interactive Task that is designed to replicate real-world challenges in producing a collection of responsive documents from a large collection of documents. The task required us to produce responsive documents for any of the seven topics, which are production requests. Clearwell Systems incorporated novel methods for producing a responsive collection using a combination of automated sampling, evaluation of the samples, and use of the samples as input to a blind relevance feedback engine. The algorithms applied use an automatic correlation covariance matrix for automatic evaluation of the samples and, using the correlation coefficient, determine whether the process of blind feedback converges to a highly correlated set of responsive documents. The number of iterations of sampling and the K-value for blind feedback, along with the final convergence threshold, are monitored. The F-measure results are compared, for discussion, across the three Interactive topics in which Clearwell participated.

Bibtex
@inproceedings{DBLP:conf/trec/RanganJ09,
    author = {Venkat Rangan and Maojin Jiang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Clearwell Systems at {TREC} 2009 Legal Interactive},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/clearwell09.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/RanganJ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

EQUIVIO at TREC 2009 Legal Interactive

Tal Sterenzy

Abstract

Equivio participated in two runs under the legal interactive track: topics 205 and 207. The runs utilized the Equivio>Relevance product. Equivio>Relevance is an expert-guided system which enables automated prioritization of documents and keywords. Based on initial input from a lead attorney, Equivio>Relevance uses statistical and self-learning techniques to calculate graduated relevance scores for each document in the data collection. [...]

Bibtex
@inproceedings{DBLP:conf/trec/Sterenzy09,
    author = {Tal Sterenzy},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{EQUIVIO} at {TREC} 2009 Legal Interactive},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/equivio.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Sterenzy09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments with the Negotiated Boolean Queries of the TREC 2009 Legal Track

Stephen Tomlinson

Abstract

For our participation in the Batch Task of the TREC 2009 Legal Track, we produced several retrieval sets to compare experimental Boolean, vector, fusion and relevance feedback techniques for e-Discovery requests. In this paper, we have reported not just the mean scores of the experimental approaches but also the largest per-topic impacts of the techniques for several measures. The experimental automatic relevance feedback technique was found to attain a statistically significant gain over the reference Boolean result in both the mean Precision@B and F1@K measures.

Bibtex
@inproceedings{DBLP:conf/trec/Tomlinson09,
    author = {Stephen Tomlinson},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments with the Negotiated Boolean Queries of the {TREC} 2009 Legal Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/open-text.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Tomlinson09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

ZL Technologies at TREC 2009 Legal Interactive: Comparing Exclusionary and Investigative Approaches for Electronic Discovery Using the TREC Enron Corpus

John Wang, Cameron Coles, Rob Elliot, Sofia Andrianakou

Abstract

Organizations responding to requests to produce electronically stored information (ESI) for litigation today often conduct information retrieval with a limited amount of data that has first been culled by custodian mailboxes, date ranges, or other factors chosen semi-arbitrarily based on legal negotiations or other exogenous factors. The culling process does not necessarily take into account the composition of the data set, and may, in fact, impede the expediency and cost-effectiveness of the eDiscovery process, as ESI not initially identified may need to be collected later in the eDiscovery process. This exclusionary eDiscovery approach has been recommended by search and information retrieval technology providers in the past, in part based on the state of technology available at the time; however, the technology now exists to perform an inclusive, content-based, investigative eDiscovery across a large document collection without the introduction of semi-arbitrary exclusion factors. In this paper, we investigate whether limited document retrieval based on custodian email mailboxes results in lower recall and produces fewer responsive documents than a broader, inclusive search process that covers all potential custodians. In order to compare the two approaches, we designed an experiment with two independent teams conducting electronic discovery using the different approaches. We found that searching across the entire data set resulted in finding significantly more responsive documents and more initial custodians than implementing an approach that relies on custodian-based culling. Specifically, investigative eDiscovery found 516% more relevant documents and 1825% more initial custodians in our study.
Based on these results, we believe organizations that employ an exclusionary, culling-based methodology may require subsequent collections, risk under production and sanctions during litigation, and will ultimately expend more resources in responding to eDiscovery production requests with a less comprehensive result.

Bibtex
@inproceedings{DBLP:conf/trec/WangCEA09,
    author = {John Wang and Cameron Coles and Rob Elliot and Sofia Andrianakou},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ZL} Technologies at {TREC} 2009 Legal Interactive: Comparing Exclusionary and Investigative Approaches for Electronic Discovery Using the {TREC} Enron Corpus},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/zlti.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangCEA09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC 2009 at the University of Buffalo: Interactive Legal E-Discovery With Enron Emails

Jianqiang Wang, Ying Sun, Paul Thompson

Abstract

For TREC 2009, the team from the University at Buffalo, the State University of New York, participated in the Legal E-Discovery track, working on the interactive search task. We explored indexing and searching at both the record level and the document level with the Enron email collection. We studied the usefulness of fielded search and of document presentation features such as clustering documents based on email threads. For query formulation for the selected search topic, we combined a precision-oriented Specific Query method with a recall-oriented Generic Query method. Future evaluation of the effectiveness of these query techniques is still needed.

Bibtex
@inproceedings{DBLP:conf/trec/WangST09,
    author = {Jianqiang Wang and Ying Sun and Paul Thompson},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} 2009 at the University of Buffalo: Interactive Legal E-Discovery With Enron Emails},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ubuffalo.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangST09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A Model for Understanding Collaborative Information Behavior in E-Discovery

Zhen Yue, Daqing He

Abstract

The University of Pittsburgh team participated in the interactive task of the Legal Track in TREC 2009. We designed an experiment to investigate the collaborative information behavior (CIB) of groups of people working on the e-discovery tasks provided by the Legal Track. Through these studies, we proposed a model for understanding CIB in e-discovery.

Bibtex
@inproceedings{DBLP:conf/trec/YueH09,
    author = {Zhen Yue and Daqing He},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A Model for Understanding Collaborative Information Behavior in E-Discovery},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/pitt\_sis.legal.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/YueH09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Web

Overview of the TREC 2009 Web Track

Charles L. A. Clarke, Nick Craswell, Ian Soboroff

Abstract

The TREC Web Track explores and evaluates Web retrieval technologies. Currently, the Web Track conducts experiments using the new billion-page ClueWeb09 collection. The TREC 2009 track is the successor to the Terabyte Retrieval Track, which ran from 2004 to 2006, and to the older Web Track, which ran from 1999 to 2003. The TREC 2009 Web Track includes both a traditional adhoc retrieval task and a new diversity task. The goal of this diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. For example, given the query “windows”, a system might return the Windows update page first, followed by the Microsoft home page, and then a news article discussing the release of Windows 7. Mixed in these results might be pages providing product information on doors and windows for homes and businesses. The track used the new ClueWeb09 dataset as its document collection. The full collection consists of roughly 1 billion web pages, comprising approximately 25TB of uncompressed data (5TB compressed) in multiple languages. The dataset was crawled from the Web during January and February 2009. For groups who were unable to work with this full “Category A” dataset, the track accepted runs over the smaller ClueWeb09 “Category B” dataset, a subset of about 50 million English-language pages. Topics for the track were created from the logs of a commercial search engine, with the aid of tools developed at Microsoft Research. Given a target query, these tools extracted and analyzed groups of related queries, using co-clicks and other information, to identify clusters of queries that highlight different aspects and interpretations of the target query. These clusters were employed by NIST for topic development. Each resulting topic is structured as a representative set of subtopics, each related to a different user need.
Documents were judged with respect to the subtopics, as well as with respect to the topic as a whole. For each subtopic, NIST assessors made a binary judgment as to whether or not the document satisfies the information need associated with the subtopic. These topics were used for both the adhoc task and the diversity task. For both tasks, participants executed the original target queries over the ClueWeb09 collection. The tasks differ primarily in their evaluation measures. The adhoc task uses an estimate of mean average precision, based on overall topical relevance [3]. The diversity task uses newer measures, based on the subtopics, which explicitly consider novelty in the result list (intent-aware precision [1] and alpha-nDCG [4]). A total of 26 groups submitted runs to the track, with many groups participating in both tasks. Table 1 summarizes the participation of these groups. About half the groups worked with the full collection. A few groups submitted runs over both the full (Category A) collection and the Category B collection. This report provides an overview of the track, including topic development, evaluation measures, and results.
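
The novelty-aware gain underlying alpha-nDCG can be sketched as below. This is a generic illustration of the unnormalized alpha-DCG (dividing by the gain of an ideal ordering yields alpha-nDCG); the function name and data layout are hypothetical.

```python
import math

def alpha_dcg(ranking, subtopic_sets, alpha=0.5, depth=10):
    """Compute alpha-DCG@depth for a ranked list.

    ranking       -- ranked list of doc ids
    subtopic_sets -- doc id -> set of subtopics the doc satisfies
    alpha         -- penalty for repeating an already-covered subtopic
    """
    seen = {}  # subtopic -> number of times already covered
    gain = 0.0
    for i, doc in enumerate(ranking[:depth]):
        g = 0.0
        for s in subtopic_sets.get(doc, ()):
            g += (1.0 - alpha) ** seen.get(s, 0)  # novelty-discounted gain
            seen[s] = seen.get(s, 0) + 1
        gain += g / math.log2(i + 2)              # rank discount
    return gain
```

Computing the ideal ordering for the normalizer is itself hard, so it is typically approximated greedily.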

Bibtex
@inproceedings{DBLP:conf/trec/ClarkeCS09,
    author = {Charles L. A. Clarke and Nick Craswell and Ian Soboroff},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Web Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/WEB09.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ClarkeCS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

ICTNET at Web Track 2009 Diversity Track

Wenjing Bi, Xiaoming Yu, Yue Liu, Feng Guan, Zeying Peng, Hongbo Xu, Xueqi Cheng

Abstract

We (the ICTNET team) participated in the Web Track of TREC 2009, and in this paper we summarize our work on the Diversity task of the Web Track, which is new this year. The goal of the diversity task is to return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list. For this task, we cluster the results of the ad hoc task and rerank the results depending on the subtopics the documents cover. Besides, we introduce two methods that try to find implicit subtopics using documents returned from a commercial search engine.
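The cluster-then-rerank idea described above can be sketched as a single greedy pass that promotes one document per cluster before falling back to the original order; this helper is a hypothetical illustration, not the ICTNET implementation:

```python
def rerank_by_coverage(docs, cluster_of):
    """Greedy rerank for subtopic coverage.

    docs: doc ids in the original (ad hoc) ranked order.
    cluster_of: doc id -> cluster (assumed subtopic) label.
    Promotes the highest-ranked document of each not-yet-seen
    cluster, then appends the remaining documents in order.
    """
    seen, head, tail = set(), [], []
    for d in docs:
        c = cluster_of[d]
        if c not in seen:
            seen.add(c)
            head.append(d)
        else:
            tail.append(d)
    return head + tail
```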

Bibtex
@inproceedings{DBLP:conf/trec/BiYLGPXC09,
    author = {Wenjing Bi and Xiaoming Yu and Yue Liu and Feng Guan and Zeying Peng and Hongbo Xu and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ICTNET} at Web Track 2009 Diversity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ictnet.WEB-DIV.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BiYLGPXC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Ad Hoc and Diversity Retrieval at the University of Delaware

Praveen Chandar, Aparna Kailasam, Divya Muppaneni, Sree Lekha Thota, Ben Carterette

Abstract

This is the report on the University of Delaware Information Retrieval Lab's participation in the TREC 2009 Web and Million Query tracks. Our report on the Relevance Feedback track is in a separate document [3].

Bibtex
@inproceedings{DBLP:conf/trec/ChandarKMTC09,
    author = {Praveen Chandar and Aparna Kailasam and Divya Muppaneni and Sree Lekha Thota and Ben Carterette},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Ad Hoc and Diversity Retrieval at the University of Delaware},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-ben.WEB.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ChandarKMTC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Machine Learning for Information Retrieval: TREC 2009 Web, Relevance Feedback and Legal Tracks

Gordon V. Cormack, Mona Mojdeh

Abstract

For TREC 2009, we exhaustively classified every document in each corpus, using machine learning methods that had previously been shown to work well for email spam [9, 3]. We treated each document as a sequence of bytes, with no tokenization or parsing of tags or meta-information. This approach was used exclusively for the adhoc web, diversity and relevance feedback tasks, as well as for the batch legal task: the ClueWeb09 and Tobacco collections were processed end-to-end and never indexed. We did the interactive legal task in two phases: first, we used interactive search and judging to find a large and diverse set of training examples; then we used an active learning process, similar to the one we used for the other tasks, to find more relevant documents. Finally, we fitted a censored (i.e. truncated) mixed normal distribution to estimate recall and the cutoff to optimize F1, the principal effectiveness measure.

Bibtex
@inproceedings{DBLP:conf/trec/CormackM09,
    author = {Gordon V. Cormack and Mona Mojdeh},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Machine Learning for Information Retrieval: {TREC} 2009 Web, Relevance Feedback and Legal Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.WEB.RF.LEGAL.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CormackM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Microsoft Research at TREC 2009: Web and Relevance Feedback Track

Nick Craswell, Dennis Fetterly, Marc Najork, Stephen Robertson, Emine Yilmaz

Abstract

We took part in the Web and Relevance Feedback tracks, using the ClueWeb09 corpus. To process the corpus, we developed a parallel processing pipeline which avoids the generation of an inverted file. We describe the components of the parallel architecture and the pipeline and how we ran the TREC experiments, and we present effectiveness results.

Bibtex
@inproceedings{DBLP:conf/trec/CraswellFNRY09,
    author = {Nick Craswell and Dennis Fetterly and Marc Najork and Stephen Robertson and Emine Yilmaz},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Microsoft Research at {TREC} 2009: Web and Relevance Feedback Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/microsoft.WEB.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CraswellFNRY09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IRRA at TREC 2009: Index Term Weighting Based on Divergence From Independence Model

Bekir Taner Dinçer, Ilker Kocabas, Bahar Karaoglan

Abstract

The IRRA (IR-Ra) group participated in the 2009 Web track (both the adhoc task and the diversity task) and the Million Query track. This year, the major concern is to examine the effectiveness of a novel, nonparametric index term weighting model, divergence from independence (DFI). The notion of independence, which underlies the well-known statistical exploratory data analysis technique called correspondence analysis (Greenacre, 1984; Jambu, 1991), can be adapted to the index term weighting problem. In this respect, it can be thought of as a qualitative description of the importance of terms for the documents in which they appear, importance in the sense of contribution to the information contents of documents relative to other terms. According to the independence notion, if the ratios of the frequencies of two different terms are the same across documents, the terms are independent of documents. For example, each Web page contains a pair of “html” and a pair of “body” tags, so the ratio of the frequencies of these tags is the same across all Web pages, indicating that the “html” and “body” tags are independent of Web pages. They are used by design, irrespective of the information contents of Web pages. On the other hand, some tags, such as “image” and “table”, which are also independent of Web pages, may occur more or less often in some pages than the expected frequencies suggested by the independence model; so their associated frequency ratios may not be the same for all Web pages. However, it is reasonable to expect that, if the pages are not about the tags' usage, such as an “HTML Handbook”, the frequencies of those tags should not differ significantly from their expected frequencies: they should be close to the expectation, i.e., from a parametric point of view, their observed frequencies in individual documents should be attributed to chance fluctuation. 
Although this tag example is helpful in exemplifying the use of the independence notion, it is obvious that tags are artificial, and so governed by rules completely different from the rules of a spoken language. Nonetheless, some words, like the ones in a common “stopwords list”, appear in documents not because of their contribution to the information contents of the documents, but because of grammatical rules. On this account, such words can be modeled as if they were tags, because they are independent of documents in the same manner. Their observed frequencies in individual documents are expected to fluctuate around the frequencies expected under independence, as in the case of tags. Content-bearing words are, therefore, the words whose frequencies highly diverge from the frequencies expected under independence. The results of the TREC experiments on the IRRA runs show that the independence notion promises a natural basis for quantifying the categorical relationships between terms and documents. The TERRIER retrieval platform (Ounis et al., 2007) is used to index and search the ClueWeb09-T09B data set, a subset of about 50 million Web pages in English (the TREC 2009 “Category B” data set). During indexing and searching, terms are stemmed and a particular set of stop words is eliminated.
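As a rough illustration of the DFI idea, the sketch below scores a term by how far its observed in-document frequency exceeds the frequency expected if the term were independent of documents; the standardisation by the square root of the expectation is one plausible (chi-style) instantiation, assumed here for illustration, and may differ from the exact model IRRA used:

```python
from math import sqrt

def dfi_weight(tf, term_total, doc_len, coll_len):
    """Divergence-from-independence weight for one term in one document.

    tf: observed frequency of the term in the document.
    term_total: frequency of the term in the whole collection.
    doc_len / coll_len: token counts of the document and collection.
    Under independence the expected in-document frequency is
    term_total * doc_len / coll_len; only terms that exceed it
    (i.e. bear content) get a positive weight.
    """
    expected = term_total * doc_len / coll_len
    if tf <= expected:          # at or below expectation: chance fluctuation
        return 0.0
    return (tf - expected) / sqrt(expected)
```

A stopword-like term whose frequency matches its expectation thus gets weight 0, while a content-bearing term that diverges strongly from expectation is rewarded.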

Bibtex
@inproceedings{DBLP:conf/trec/DincerKK09,
    author = {Bekir Taner Din{\c{c}}er and Ilker Kocabas and Bahar Karaoglan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IRRA} at {TREC} 2009: Index Term Weighting Based on Divergence From Independence Model},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/muglau.WEB.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DincerKK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Microsoft Research Asia at the Web Track of TREC 2009

Zhicheng Dou, Kun Chen, Ruihua Song, Yunxiao Ma, Shuming Shi, Ji-Rong Wen

Abstract

In TREC 2009, we participate in the Web track, and focus on the diversity task. We propose to diversify web search results by first mining subtopics, and then rank results based on mined subtopics. We propose a model to diversify search results by considering both relevance of documents and richness of mined subtopics. Our experimental results show that the model improves diversity of search results in terms of α-NDCG, and combining subtopics from multiple data sources helps further improve result diversity.

Bibtex
@inproceedings{DBLP:conf/trec/DouCSMSW09,
    author = {Zhicheng Dou and Kun Chen and Ruihua Song and Yunxiao Ma and Shuming Shi and Ji{-}Rong Wen},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Microsoft Research Asia at the Web Track of {TREC} 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/microsoft-asia.WEB.pdf},
    timestamp = {Tue, 01 Dec 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DouCSMSW09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

RMIT University at TREC 2009: Web Track

Steven Garcia

Abstract

RMIT participated in the 2009 Web Track tasks. Our submissions utilised the Zettair search engine to index and search the Category B subset of the ClueWeb collection used by the Web Track. The Web Track was composed of two tasks: a traditional adhoc retrieval task, and a new diversity task where participants attempted to retrieve documents covering a range of subtopics for each query. Subtopics were not provided with the queries. Our experiments utilised the well-known measures Okapi BM25 and language modeling with Dirichlet smoothing for the adhoc task. For the diversity task we attempted to improve the diversity of query results by minimising the number of documents returned from a single domain.
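Capping the number of results per domain, as described above, can be sketched in a few lines; this function is an illustrative reconstruction (the cap value is an arbitrary assumption), not RMIT's code:

```python
from collections import Counter
from urllib.parse import urlparse

def cap_per_domain(ranked_urls, cap=2):
    """Demote results once their domain has already appeared `cap`
    times, pushing the surplus to the end of the list in order."""
    counts, head, tail = Counter(), [], []
    for url in ranked_urls:
        dom = urlparse(url).netloc
        if counts[dom] < cap:
            counts[dom] += 1
            head.append(url)
        else:
            tail.append(url)
    return head + tail
```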

Bibtex
@inproceedings{DBLP:conf/trec/Garcia09,
    author = {Steven Garcia},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{RMIT} University at {TREC} 2009: Web Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/rmit.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Garcia09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

ICTNET at Web Track 2009 Ad-hoc Task

Feng Guan, Xiaoming Yu, Zeying Peng, Hongbo Xu, Yue Liu, Linhai Song, Xueqi Cheng

Abstract

This paper describes the work done for the ad-hoc task of the TREC 2009 Web Track. We introduce three methods for this task, including two improved BM25 models and query expansion. The results of these models indicate that both the minimum-window approach and query expansion can improve the BM25 model.

Bibtex
@inproceedings{DBLP:conf/trec/GuanYPXLSC09,
    author = {Feng Guan and Xiaoming Yu and Zeying Peng and Hongbo Xu and Yue Liu and Linhai Song and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ICTNET} at Web Track 2009 Ad-hoc Task},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ictnet.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GuanYPXLSC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Twente @ TREC 2009: Indexing Half a Million Web Pages

Claudia Hauff, Djoerd Hiemstra

Abstract

The University of Twente participated in three tasks of TREC 2009: the adhoc task, the diversity task and the relevance feedback task. All experiments are performed on the English part of ClueWeb09. We describe our approach to tuning our retrieval system in absence of training data in Section 3. We describe the use of categories and a query log for diversifying search results in Section 4. Section 5 describes preliminary results for the relevance feedback task.

Bibtex
@inproceedings{DBLP:conf/trec/HauffH09,
    author = {Claudia Hauff and Djoerd Hiemstra},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Twente @ {TREC} 2009: Indexing Half a Million Web Pages},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/utwente.WEB.RF.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HauffH09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Heuristic Ranking and Diversification of Web Documents

Jiyin He, Krisztian Balog, Katja Hofmann, Edgar Meij, Maarten de Rijke, Manos Tsagkias, Wouter Weerkamp

Abstract

We describe the participation of the University of Amsterdam's Intelligent Systems Lab in the web track at TREC 2009. We participated in the adhoc and diversity tasks. We find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost retrieval performance, which is assumed to potentially reduce spam in the top ranked results. As for the diversity task, we explored different methods. Clustering and a topic model-based approach have similar performance, and both are relatively better than a query log based approach.

Bibtex
@inproceedings{DBLP:conf/trec/HeBHMRTW09,
    author = {Jiyin He and Krisztian Balog and Katja Hofmann and Edgar Meij and Maarten de Rijke and Manos Tsagkias and Wouter Weerkamp},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Heuristic Ranking and Diversification of Web Documents},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-derijke.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/HeBHMRTW09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Result Diversity and Entity Ranking Experiments: Anchors, Links, Text and Wikipedia

Rianne Kaptein, Marijn Koolen, Jaap Kamps

Abstract

In this paper, we document our efforts in participating in the TREC 2009 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track's Adhoc task we experiment with document text and anchor text representations, and the use of the link structure. For the Web Track's Diversity task we experiment with a top-down sliding window that, given the top-ranked documents, chooses as the next ranked document the one that has the most unique terms or links. We test our sliding window method on a standard document text index and an index of propagated anchor texts. We also experiment with extreme query expansion by taking the top n results of the initial ranking as multi-faceted aspects of the topic and constructing n relevance models to obtain n sets of results. A final diverse set of results is obtained by merging the n result lists. For the Entity Ranking Track, we also explore the effectiveness of the anchor text representation, look at the co-citation graph, and experiment with using Wikipedia as a pivot. Our main findings can be summarized as follows: Anchor text is very effective for diversity. It gives high early precision and the results cover more relevant sub-topics than the document text index. Our baseline runs have low diversity, which limits the possible impact of the sliding window approach. New link information seems more effective for diversifying text-based search results than the number of unique terms added by a document. In the entity ranking task, anchor text finds few primary pages, but it does retrieve a large number of relevant pages. Using Wikipedia as a pivot results in large gains in P10 and NDCG when only primary pages are considered. Although the links between the Wikipedia entities and pages in the ClueWeb collection are sparse, the precision of the existing links is very high.
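A greedy "most novel terms next" selection of the kind described might look like the following sketch; the function and its term-set input are assumptions for illustration, not the authors' implementation:

```python
def diversify_by_novel_terms(docs, terms, k=10):
    """Top-down rerank: keep the initial top document, then repeatedly
    choose from the remaining pool the document that contributes the
    most terms not yet seen in the ranking so far.

    docs: doc ids in the original ranked order (non-empty).
    terms: doc id -> set of (unique) terms in that document.
    """
    pool = list(docs)
    ranking = [pool.pop(0)]
    seen = set(terms[ranking[0]])
    while pool and len(ranking) < k:
        best = max(pool, key=lambda d: len(terms[d] - seen))
        pool.remove(best)
        ranking.append(best)
        seen |= terms[best]
    return ranking + pool        # untouched remainder keeps its order
```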

Bibtex
@inproceedings{DBLP:conf/trec/KapteinKK09,
    author = {Rianne Kaptein and Marijn Koolen and Jaap Kamps},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Result Diversity and Entity Ranking Experiments: Anchors, Links, Text and Wikipedia},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-kamps.ENT.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KapteinKK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

THUIR at TREC 2009 Web Track: Finding Relevant and Diverse Results for Large Scale Web Search

Zhichao Li, Fei Chen, Qianli Xing, Junwei Miao, Yufei Xue, Tong Zhu, Bo Zhou, Rongwei Cen, Yiqun Liu, Min Zhang, Yijiang Jin, Shaoping Ma

Abstract

This is the 8th year that the IR group of Tsinghua University (THUIR) participates in TREC. This year we focus on the Web track, which contains two tasks, namely ad hoc and diversity. On the ad hoc task, we improved the efficiency of our distributed retrieval system TMiner to handle terabytes of Web data. Then three studies were done, namely page quality estimation, ranking feature analysis, and model comparison. On the diversity task, we proposed several new approaches to searching strategy, user intention detection, and duplication elimination. To mine users' intentions, we proposed and compared two different strategies, namely “searching + content-based diversity”, which is a kind of result clustering, and “user based diverse intention prediction + searching”, which is in the branch of query expansion.

Bibtex
@inproceedings{DBLP:conf/trec/LiCXMXZZCLZJM09,
    author = {Zhichao Li and Fei Chen and Qianli Xing and Junwei Miao and Yufei Xue and Tong Zhu and Bo Zhou and Rongwei Cen and Yiqun Liu and Min Zhang and Yijiang Jin and Shaoping Ma},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{THUIR} at {TREC} 2009 Web Track: Finding Relevant and Diverse Results for Large Scale Web Search},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/tsinghuau.WEB.pdf},
    timestamp = {Wed, 16 Sep 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LiCXMXZZCLZJM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

UCD SIFT in the TREC 2009 Web Track

David Lillis, Fergus Toolan, Ling Lin, Rem W. Collier, John Dunnion

Abstract

The SIFT (SIFT Information Fusion Techniques) group in UCD is dedicated to researching Data Fusion in Information Retrieval. This area of research involves the merging of multiple sets of results into a single result set that is presented to the user. As a means of evaluating the effectiveness of this work, the group entered Category B of the TREC 2009 Web Track. This paper discusses the strategies and experiments employed by the UCD SIFT group in entering the TREC Web Track 2009. This involved the use of freely-available Information Retrieval tools to provide inputs to the data fusion process, with the aim of contrasting with more sophisticated systems.

Bibtex
@inproceedings{DBLP:conf/trec/LillisTLCD09,
    author = {David Lillis and Fergus Toolan and Ling Lin and Rem W. Collier and John Dunnion},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{UCD} {SIFT} in the {TREC} 2009 Web Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ucollege-dublin.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LillisTLCD09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search

Jimmy Lin, Tamer Elsayed, Lidan Wang, Donald Metzler

Abstract

This paper describes Ivory, an attempt to build a distributed retrieval system around the open-source Hadoop implementation of MapReduce. We focus on three noteworthy aspects of our work: a retrieval architecture built directly on the Hadoop Distributed File System (HDFS), a scalable MapReduce algorithm for inverted indexing, and webpage classification to enhance retrieval effectiveness.
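MapReduce inverted indexing has a familiar two-phase shape: mappers emit (term, posting) pairs per document, and reducers group postings by term. This toy single-machine sketch mimics those contracts; Ivory itself runs on Hadoop over HDFS, so the function names and posting format here are illustrative only:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit (term, (doc_id, tf)) pairs for one document."""
    tf = defaultdict(int)
    for term in text.lower().split():
        tf[term] += 1
    return [(term, (doc_id, count)) for term, count in tf.items()]

def reduce_phase(mapped):
    """Reducer: group emitted postings by term into an inverted index,
    with each postings list sorted by document id."""
    index = defaultdict(list)
    for term, posting in mapped:
        index[term].append(posting)
    for postings in index.values():
        postings.sort()
    return dict(index)
```

In a real Hadoop job the framework performs the shuffle (grouping by term) between the two phases; here `reduce_phase` does that grouping explicitly.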

Bibtex
@inproceedings{DBLP:conf/trec/LinEWM09,
    author = {Jimmy Lin and Tamer Elsayed and Lidan Wang and Donald Metzler},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/umd-yahoo.WEB.pdf},
    timestamp = {Fri, 27 Aug 2021 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LinEWM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2009: Experiments with Terrier

Richard McCreadie, Craig Macdonald, Iadh Ounis, Jie Peng, Rodrygo L. T. Santos

Abstract

In TREC 2009, we extend our Voting Model for the faceted blog distillation, top stories identification, and related entity finding tasks. Moreover, we experiment with our novel xQuAD framework for search result diversification. Besides fostering our research in multiple directions, by participating in such a wide portfolio of tracks, we further develop the indexing and retrieval capabilities of our Terrier Information Retrieval platform, to effectively and efficiently cope with a new generation of large-scale test collections.

Bibtex
@inproceedings{DBLP:conf/trec/McCreadieMOPS09,
    author = {Richard McCreadie and Craig Macdonald and Iadh Ounis and Jie Peng and Rodrygo L. T. Santos},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2009: Experiments with Terrier},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/McCreadieMOPS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Lucene for n-grams using the CLUEWeb Collection

Gregory B. Newby, Christopher T. Fallen, Kylie McCormick

Abstract

The ARSC team made modifications to the Apache Lucene engine to accommodate “go words,” taken from the Google Gigaword vocabulary of n-grams. Indexing the Category “B” subset of the ClueWeb collection was accomplished by a divide and conquer method, working across the separate ClueWeb subsets for 1, 2 and 3-grams.

Bibtex
@inproceedings{DBLP:conf/trec/NewbyFM09,
    author = {Gregory B. Newby and Christopher T. Fallen and Kylie McCormick},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Lucene for n-grams using the CLUEWeb Collection},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/arsc.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/NewbyFM09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Northeastern University in the TREC 2009 Web Track

Shahzad Rajput, Evangelos Kanoulas, Virgiliu Pavlu, Javed A. Aslam

Abstract

In a typical retrieval scenario a user poses a query to a retrieval system in order to satisfy an information need generated during some task the user is undertaking. Retrieval systems access an underlying collection of searchable material, rank it according to some definition of the relevance of the material to the user's request, and return this ranked list to the user. In the case of web search, where typical users express their information needs by 2-3 keywords, submitted queries often have ambiguous meanings, representing more than one information need. Given a query, a good retrieval system should be able to satisfy all possible users by ranking documents in a way that their content covers as many information needs as possible. The primary goal of our Web Track submission is to explore whether named entity tags can be utilized to diversify the returned ranked list of documents. Our hypothesis is that each information need could be represented by a certain named entity tag (or a certain combination of them). For instance, in Table 1 one can see the example query taken from the Web Track web page. The query is “physical therapists”. The subtopics that correspond to this query are listed in the left column of the table. To illustrate our hypothesis, next to each subtopic, in bold, we have manually identified a possible combination of entity tags that could represent each subtopic/information need. Further, each document relevant to the original query could also be represented by a set of named entity tags. Instead of attempting to diversify documents based on the distance of their language models over text, we explored whether it is possible to diversify them according to the distance of their language models over entity tags. Entity tags could allow a further abstraction of documents, avoiding issues like language mismatch. 
Our methodology depended heavily on two assumptions: (1) retrieval methods based on a bag-of-words representation can retrieve many relevant documents in the top 2,000 positions, and (2) the relevant documents are diverse enough in the first place. Then, using our methodology, we could abstract the representation of those documents and diversify the list based on their tag distributions. A second goal of our Web Track submission was to develop a simple spam filter. By analyzing a small subset of the documents, selected at random from the top 2,000 documents ranked by the Indri language model per query over the new ClueWeb09 collection (Category B), we observed that 44.5% of them were spam. A large subset of the spam documents were those that contained query terms far too many times. For this purpose, we decided to develop a simple spam filter to remove these documents from the ranked list.
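A spam filter keyed on excessive query-term repetition, as described, could be as simple as a density threshold; the cutoff value below is an arbitrary assumption for illustration, not the authors' setting:

```python
def filter_keyword_stuffing(ranked, query_terms, max_density=0.2):
    """Drop documents whose query-term density (query-term tokens
    divided by all tokens) exceeds a threshold -- a crude signal of
    keyword stuffing. ranked: list of (doc_id, text) pairs in rank
    order; returns the surviving doc ids."""
    q = {t.lower() for t in query_terms}
    kept = []
    for doc_id, text in ranked:
        tokens = text.lower().split()
        if not tokens:
            continue
        density = sum(t in q for t in tokens) / len(tokens)
        if density <= max_density:
            kept.append(doc_id)
    return kept
```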

Bibtex
@inproceedings{DBLP:conf/trec/RajputKPA09,
    author = {Shahzad Rajput and Evangelos Kanoulas and Virgiliu Pavlu and Javed A. Aslam},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Northeastern University in the {TREC} 2009 Web Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/northeasternu.WEB.pdf},
    timestamp = {Wed, 07 Dec 2022 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/RajputKPA09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

PARADISE Based Search Engine at TREC 2009 Web Track

Dongdong Shan, Dongsheng Zhao, Jing He, Hongfei Yan

Abstract

In this paper, we introduce the PARADISE search engine in the TREC 2009 Web track. PARADISE is the abbreviation for Platform for Applying, Research and Developing Intelligent Search Engine, a search engine platform developed by the SEWM group, Peking University. The system is designed to support both English and Chinese information retrieval. This system preprocessed and indexed the five hundred million web pages for this year's Web Track. In the preprocessing stage, templates were removed, encodings were identified and unified, and anchor texts and in-link information were extracted with the MapReduce framework (using Hadoop in this system). In retrieval, our runs used an extension of BM25. This model distinguishes terms from different fields and integrates both term counts and position information. Furthermore, some web-based features are also considered.

Bibtex
@inproceedings{DBLP:conf/trec/ShanZHY09,
    author = {Dongdong Shan and Dongsheng Zhao and Jing He and Hongfei Yan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{PARADISE} Based Search Engine at {TREC} 2009 Web Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/pekingu.WEB.pdf},
    timestamp = {Mon, 15 May 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ShanZHY09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments with ClueWeb09: Relevance Feedback and Web Tracks

Mark D. Smucker, Charles L. A. Clarke, Gordon V. Cormack

Abstract

In this paper, we report on our TREC experiments with the ClueWeb09 document collection. We participated in the relevance feedback and web tracks. While our phase 1 relevance feedback run's performance was good, our other relevance feedback and web track submissions' performances were lacking. We suspect this performance difference is caused by the Category B document subset of the ClueWeb09 collection having a higher prior probability of relevance than the rest of the collection. Future work will involve a more detailed error analysis of our experiments.

Bibtex
@inproceedings{DBLP:conf/trec/SmuckerCC09,
    author = {Mark D. Smucker and Charles L. A. Clarke and Gordon V. Cormack},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments with ClueWeb09: Relevance Feedback and Web Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uwaterloo-cormack.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SmuckerCC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Dawei Yin, Zhenzhen Xue, Xiaoguang Qi, Brian D. Davison

Abstract

This paper describes the method we used in the diversity task of the Web track at TREC 2009. The problem we aim to solve is the diversification of search results for ambiguous web queries. We present a model based on knowledge of the diversity of query subtopics to generate a diversified ranking for retrieved documents. We expand the original query into several related queries, assuming that query expansions expose subtopics of the original query. Moreover, each query expansion is given a weight reflecting the likelihood of that interpretation (the fraction of users who issued this query given the general query topic). We issue all the expanded queries, including the original query, to a standard BM25 search engine, then re-rank the retrieved documents to generate the final ranking. Our method can detect possible subtopics of a given query and provide a reasonable ranking that satisfies both relevance and diversity metrics. The TREC evaluations show our method is effective on the diversity task.
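
The weighted-subtopic re-ranking can be sketched as a greedy selection over per-expansion relevance scores. The coverage-discounting rule below is a simplified IA-select-style heuristic and may differ from the paper's exact formula.

```python
def diversify(doc_scores, weights, k=3):
    """Greedy diversified re-ranking over subtopic (expansion) scores.

    doc_scores: {doc_id: {subtopic: relevance}}; weights: {subtopic: w}.
    At each step, pick the document maximizing sum_t w_t * (1 - covered_t)
    * rel(d, t), so subtopics already covered contribute less.
    """
    covered = {t: 0.0 for t in weights}
    remaining = dict(doc_scores)
    ranking = []
    for _ in range(min(k, len(remaining))):
        best = max(remaining, key=lambda d: sum(
            weights[t] * (1.0 - covered[t]) * remaining[d].get(t, 0.0)
            for t in weights))
        ranking.append(best)
        for t in weights:
            covered[t] = min(1.0, covered[t] + remaining[best].get(t, 0.0))
        del remaining[best]
    return ranking
```

Note how the second pick jumps to the minority subtopic once the dominant one is covered.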

Bibtex
@inproceedings{DBLP:conf/trec/YinXQD09,
    author = {Dawei Yin and Zhenzhen Xue and Xiaoguang Qi and Brian D. Davison},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Diversifying Search Results with Popular Subtopics},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/lehighu.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/YinXQD09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Axiomatic Approaches to Information Retrieval–University of Delaware at TREC 2009 Million Query and Web Tracks

Wei Zheng, Hui Fang

Abstract

We report our experiments in the TREC 2009 Million Query track and the ad-hoc task of the Web track. Our goal is to evaluate the effectiveness of axiomatic retrieval models on a large data collection. Axiomatic approaches to information retrieval have recently been proposed and studied. The basic idea is to search for retrieval functions that satisfy all reasonable retrieval constraints. Previous studies showed that the derived basic axiomatic retrieval functions are less sensitive to their parameters than other state-of-the-art retrieval functions with comparable optimal performance. In this paper, we focus on evaluating the effectiveness of the basic axiomatic retrieval functions as well as a semantic term matching based query expansion strategy. Experimental results from the two tracks demonstrate the effectiveness of the axiomatic retrieval models.

Bibtex
@inproceedings{DBLP:conf/trec/ZhengF09,
    author = {Wei Zheng and Hui Fang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Axiomatic Approaches to Information Retrieval--University of Delaware at {TREC} 2009 Million Query and Web Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-fang.MQ.WEB.pdf},
    timestamp = {Tue, 20 Oct 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhengF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Million Query

Million Query Track 2009 Overview

Ben Carterette, Virgiliu Pavlu, Hui Fang, Evangelos Kanoulas

Abstract

The Million Query Track ran for the third time in 2009. The track is designed to serve two purposes: first, it is an exploration of ad hoc retrieval over a large set of queries and a large collection of documents; second, it investigates questions of system evaluation, in particular whether it is better to evaluate using many queries judged shallowly or fewer queries judged thoroughly. Fundamentally, the Million Query tracks (2007-2009) are ad-hoc tasks, but they use complex yet very efficient evaluation methodologies that allow human assessment effort to be spread over up to 20 times more queries than previous ad-hoc tasks. We estimate metrics like Average Precision fairly well and produce system rankings that (with high confidence) match the true ranking that would be obtained with complete judgments. We can answer budget-related questions, such as how many queries versus how many assessments per query give an optimal strategy; a variance analysis is possible due to the large number of queries involved. While we are confident we can evaluate participating runs well, an important question is whether the assessments produced by the evaluation process can be reused (together with the collection and the topics) for a new search strategy, that is, one that did not participate in the assessment done by NIST. To answer this, we designed a reusability study which concludes that a variant of a participating track system may be evaluated with reasonably high confidence using the MQ data, while a completely new system cannot. The 2009 track quadrupled the number of queries of previous years, from 10,000 to 40,000.
In addition, this year saw the introduction of a number of new threads to the basic framework established in the 2007 and 2008 tracks: queries were classified by the task they represented as well as by their apparent difficulty; participating sites could choose to do increasing numbers of queries, depending on the time and resources available to them; and we designed and implemented a novel in situ reusability study. Section 1 describes the tasks for participants. Section 2 provides an overview of the test collection that will result from the track. Section 3 briefly describes the document selection and evaluation methods. Section 4 summarizes the submitted runs. In Section 5 we summarize evaluation results from the task, and Section 6 provides deeper analysis of the results. For TREC 2009, Million Query, Relevance Feedback, and Web track ad-hoc task judging was conducted simultaneously using MQ track methods. A number of compromises had to be made to accomplish this; a note about the usability of the resulting data is included in Section 3.3.

Bibtex
@inproceedings{DBLP:conf/trec/CarterettePFK09,
    author = {Ben Carterette and Virgiliu Pavlu and Hui Fang and Evangelos Kanoulas},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Million Query Track 2009 Overview},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/MQ09OVERVIEW.pdf},
    timestamp = {Wed, 07 Dec 2022 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/CarterettePFK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Ad Hoc and Diversity Retrieval at the University of Delaware

Praveen Chandar, Aparna Kailasam, Divya Muppaneni, Sree Lekha Thota, Ben Carterette

Abstract

This is the report on the University of Delaware Information Retrieval Lab's participation in the TREC 2009 Web and Million Query tracks. Our report on the Relevance Feedback track is in a separate document [3].

Bibtex
@inproceedings{DBLP:conf/trec/ChandarKMTC09,
    author = {Praveen Chandar and Aparna Kailasam and Divya Muppaneni and Sree Lekha Thota and Ben Carterette},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Ad Hoc and Diversity Retrieval at the University of Delaware},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-ben.WEB.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ChandarKMTC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IRRA at TREC 2009: Index Term Weighting Based on Divergence From Independence Model

Bekir Taner Dinçer, Ilker Kocabas, Bahar Karaoglan

Abstract

The IRRA (IR-Ra) group participated in the 2009 Web track (both the adhoc and diversity tasks) and the Million Query track. This year, the major concern was to examine the effectiveness of a novel, nonparametric index term weighting model, divergence from independence (DFI). The notion of independence, which underlies the well-known statistical exploratory data analysis technique called correspondence analysis (Greenacre, 1984; Jambu, 1991), can be adapted to the index term weighting problem. In this respect, it can be thought of as a qualitative description of the importance of terms for the documents in which they appear, importance in the sense of contribution to the information content of documents relative to other terms. According to the independence notion, if the ratios of the frequencies of two different terms are the same across documents, the terms are independent of documents. For example, each Web page contains a pair of “html” and a pair of “body” tags, so the ratio of the frequencies of these tags is the same across all Web pages, indicating that the “html” and “body” tags are independent of Web pages. They are used by design, irrespective of the information content of Web pages. On the other hand, some tags, such as “image” and “table”, which are also independent of Web pages, may occur more or less often in some pages than the expected frequencies suggested by the independence model; so, their associated frequency ratios may not be the same for all Web pages. However, it is reasonable to expect that, if the pages are not about the tags' usage (such as an “HTML Handbook”), the frequencies of those tags should not differ significantly from their expected frequencies: they should be close to the expectation, i.e., from a parametric point of view, their observed frequencies in individual documents should be attributed to chance fluctuation.
Although this tag example is helpful in exemplifying the use of the independence notion, the tags are obviously artificial and thus governed by rules completely different from those of a spoken language. Nonetheless, some words, like the ones in a common “stopwords list”, appear in documents not because of their contribution to the information content of documents, but because of grammatical rules. On this account, such words can be modeled as if they were tags, because they are independent of documents in the same manner. Their observed frequencies in individual documents are expected to fluctuate around the frequencies expected under independence, as in the case of tags. Content-bearing words are, therefore, the words whose frequencies highly diverge from the frequencies expected under independence. The results of the TREC experiments on the IRRA runs show that the independence notion promises a natural basis for quantifying the categorical relationships between terms and documents. The TERRIER retrieval platform (Ounis et al., 2007) is used to index and search the ClueWeb09-T09B data set, a subset of about 50 million Web pages in English (the TREC 2009 “Category B” data set). During indexing and searching, terms are stemmed and a particular set of stop words is eliminated.
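
The core quantity can be sketched as follows: the expected count under independence is the row total times the column total over the grand total, exactly as in correspondence analysis. The standardized-residual form with log damping below is one common DFI variant, not necessarily the exact configuration of the IRRA runs.

```python
import math

def dfi_weight(tf, doc_len, term_cf, collection_len):
    """Divergence-from-independence weight for one term in one document.

    tf: observed term frequency in the document; term_cf: the term's
    total frequency in the collection; doc_len / collection_len: token
    counts. Under independence the expected count is
    e = term_cf * doc_len / collection_len.
    """
    e = term_cf * doc_len / collection_len
    if e <= 0 or tf <= e:
        return 0.0  # at or below expectation: no evidence of content bearing
    # standardized residual, log-damped
    return math.log2((tf - e) / math.sqrt(e) + 1.0)
```

A stopword-like term whose observed frequency stays near (or below) its expectation gets weight zero, which is the behaviour the abstract motivates.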

Bibtex
@inproceedings{DBLP:conf/trec/DincerKK09,
    author = {Bekir Taner Din{\c{c}}er and Ilker Kocabas and Bahar Karaoglan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IRRA} at {TREC} 2009: Index Term Weighting Based on Divergence From Independence Model},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/muglau.WEB.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DincerKK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Northeastern University in TREC 2009 Million Query Track

Evangelos Kanoulas, Keshi Dai, Virgiliu Pavlu, Stefan Savev, Javed A. Aslam

Bibtex
@inproceedings{DBLP:conf/trec/KanoulasDPSA09,
    author = {Evangelos Kanoulas and Keshi Dai and Virgiliu Pavlu and Stefan Savev and Javed A. Aslam},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Northeastern University in {TREC} 2009 Million Query Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/northeasternu.MQ.pdf},
    timestamp = {Wed, 07 Dec 2022 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KanoulasDPSA09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A Study of Term Proximity and Document Weighting Normalization in Pseudo Relevance Feedback–UIUC at TREC 2009 Million Query Track

Yuanhua Lv, Jing He, V. G. Vinod Vydiswaran, Kavita Ganesan, ChengXiang Zhai

Abstract

In this paper, we report our experiments in the TREC 2009 Million Query Track. Our first line of study is proximity-based feedback, in which we propose a positional relevance model (PRM) to exploit term proximity evidence so as to assign more weight to expansion words that are closer to query words in feedback documents. The second line of study is to improve the weighting of feedback documents in the relevance model by using a regression-based method to approximate the probability of relevance (hence the name RegRM). In the third line of study, we test a supervised approach for query classification. In addition, we evaluate a selective pseudo-feedback strategy that stops pseudo feedback for precision-oriented queries and uses it only for recall-oriented ones. The proposed PRM shows clear improvements over the relevance model for pseudo feedback, suggesting that capturing the term proximity heuristic appropriately could lead to a better feedback model. RegRM performs as well as the relevance model, but no noticeable improvement is observed. Unfortunately, the proposed query classification methods do not appear to work well. The results also show that the proposed selective pseudo feedback may not work well, since precision-oriented queries can also benefit from pseudo feedback, though not as much as recall-oriented queries.
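
The positional intuition behind PRM can be sketched with a Gaussian proximity kernel: terms occurring near query-term positions accumulate more expansion credit. The published model is probabilistic and more involved, so this is an illustration only; the kernel width is an assumed default.

```python
import math
from collections import defaultdict

def proximity_expansion_weights(doc_tokens, query_terms, sigma=25.0):
    """Score candidate expansion terms by closeness to query-term positions.

    Each occurrence of a non-query term receives a Gaussian kernel credit
    based on its distance to the nearest query-term position, so terms
    co-occurring near query words weigh more in feedback.
    """
    qset = {t.lower() for t in query_terms}
    qpos = [i for i, t in enumerate(doc_tokens) if t.lower() in qset]
    weights = defaultdict(float)
    if not qpos:
        return dict(weights)
    for i, t in enumerate(doc_tokens):
        if t.lower() in qset:
            continue  # query terms themselves are not expansion candidates
        d = min(abs(i - p) for p in qpos)
        weights[t.lower()] += math.exp(-d * d / (2 * sigma * sigma))
    return dict(weights)
```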

Bibtex
@inproceedings{DBLP:conf/trec/LvHVGZ09,
    author = {Yuanhua Lv and Jing He and V. G. Vinod Vydiswaran and Kavita Ganesan and ChengXiang Zhai},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A Study of Term Proximity and Document Weighting Normalization in Pseudo Relevance Feedback--UIUC at {TREC} 2009 Million Query Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uiuc.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LvHVGZ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2009: Experiments with Terrier

Richard McCreadie, Craig Macdonald, Iadh Ounis, Jie Peng, Rodrygo L. T. Santos

Abstract

In TREC 2009, we extend our Voting Model for the faceted blog distillation, top stories identification, and related entity finding tasks. Moreover, we experiment with our novel xQuAD framework for search result diversification. Besides fostering our research in multiple directions, by participating in such a wide portfolio of tracks, we further develop the indexing and retrieval capabilities of our Terrier Information Retrieval platform, to effectively and efficiently cope with a new generation of large-scale test collections.

Bibtex
@inproceedings{DBLP:conf/trec/McCreadieMOPS09,
    author = {Richard McCreadie and Craig Macdonald and Iadh Ounis and Jie Peng and Rodrygo L. T. Santos},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2009: Experiments with Terrier},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/McCreadieMOPS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

IIIT Hyderabad at Million Query Track TREC 2009

Prashant Ullegaddi, Sudip Datta, Vinay Pande, Kushal S. Dave, Vasudeva Varma

Abstract

This was our maiden attempt at the Million Query track, TREC 2009. We submitted three runs for the ad-hoc retrieval task in the Million Query track. We explored ad-hoc retrieval of web pages using Hadoop, a distributed infrastructure. To enhance recall, we expanded the queries using WordNet and also by combining the query with all possible subsets of the tokens present in the query. To prevent query drift, we experimented with giving selective boosts to different steps of expansion, including higher boosts to sub-queries containing named entities as opposed to those that did not; in fact, this run achieved the highest precision among our runs. Using simple statistics, we identified authoritative domains such as wikipedia.org and answers.com, and attempted to boost hits from them while preventing them from overly biasing the results. An attempt at query classification was also made.
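
The subset-based expansion can be sketched as below; the boost schedule (proportional to subset length, so the full query dominates) is illustrative, and the named-entity bonus the authors additionally apply is omitted.

```python
from itertools import combinations

def subset_subqueries(query, min_len=2, base_boost=1.0):
    """Expand a query into boosted sub-queries over token subsets.

    Returns (sub-query, boost) pairs, longest subsets first, so the
    full query receives the highest boost and shorter token subsets
    contribute recall with progressively smaller weights.
    """
    tokens = query.split()
    subqueries = []
    for n in range(len(tokens), min_len - 1, -1):
        for combo in combinations(tokens, n):
            boost = base_boost * n / len(tokens)
            subqueries.append((" ".join(combo), boost))
    return subqueries
```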

Bibtex
@inproceedings{DBLP:conf/trec/UllegaddiDPDV09,
    author = {Prashant Ullegaddi and Sudip Datta and Vinay Pande and Kushal S. Dave and Vasudeva Varma},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IIIT} Hyderabad at Million Query Track {TREC} 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/iiit-hyderabad.MQ.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/UllegaddiDPDV09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Axiomatic Approaches to Information Retrieval–University of Delaware at TREC 2009 Million Query and Web Tracks

Wei Zheng, Hui Fang

Abstract

We report our experiments in the TREC 2009 Million Query track and the ad-hoc task of the Web track. Our goal is to evaluate the effectiveness of axiomatic retrieval models on a large data collection. Axiomatic approaches to information retrieval have recently been proposed and studied. The basic idea is to search for retrieval functions that satisfy all reasonable retrieval constraints. Previous studies showed that the derived basic axiomatic retrieval functions are less sensitive to their parameters than other state-of-the-art retrieval functions with comparable optimal performance. In this paper, we focus on evaluating the effectiveness of the basic axiomatic retrieval functions as well as a semantic term matching based query expansion strategy. Experimental results from the two tracks demonstrate the effectiveness of the axiomatic retrieval models.

Bibtex
@inproceedings{DBLP:conf/trec/ZhengF09,
    author = {Wei Zheng and Hui Fang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Axiomatic Approaches to Information Retrieval--University of Delaware at {TREC} 2009 Million Query and Web Tracks},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-fang.MQ.WEB.pdf},
    timestamp = {Tue, 20 Oct 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhengF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Blog

Overview of the TREC 2009 Blog Track

Craig Macdonald, Iadh Ounis, Ian Soboroff

Abstract

The Blog track explores information seeking behaviour in the blogosphere. Since its inception in 2006 [9], the Blog track has addressed two main search tasks based on the analysis of a commercial blog search engine: the opinion-finding task (i.e. “What do people think about X?”) and the blog distillation task (i.e. “Find me a blog with a principal, recurring interest in X.”). In TREC 2009, the Blog track was markedly revamped with the use of a new and larger sample of the blogosphere, called Blogs08, which has a 13-month timespan covering the period from 14th January 2008 to 10th February 2009, and the introduction of two new search tasks addressing more refined and typical search scenarios on the blogosphere. Faceted blog distillation: a more refined version of the blog distillation task, addressing the quality aspect of the retrieved blogs. Top stories identification: a task that addresses news-related issues on the blogosphere. Most of the organisers' efforts in the 2009 Blog track were spent on defining the new search tasks, on building a suitable infrastructure to support the investigation of these tasks, and on establishing an appropriate methodology to evaluate the effectiveness of the submitted runs. The remainder of this paper is structured as follows. Section 2 describes the newly created Blogs08 collection. Section 3 describes the new faceted blog distillation task and discusses the main results obtained by the participating groups. Section 4 describes the top stories identification task and summarises the results of the runs and the main effective approaches deployed by the participating groups. Concluding remarks are provided in Section 5.

Bibtex
@inproceedings{DBLP:conf/trec/MacdonaldOS09,
    author = {Craig Macdonald and Iadh Ounis and Ian Soboroff},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Blog Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/BLOG09.OVERVIEW.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MacdonaldOS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BRAT: A Random Walk Through the Semantic Spaces of the Blogosphere

Adil El Ghali, Yann Vigile Hoareau

Abstract

Semantic spaces, such as Latent Semantic Analysis (LSA), Hyperspace Analog to Language (HAL) or Random Indexing (RI), offer convenient methods to represent semantic relations between words and concepts, abstracted from a distribution of documents. The distribution of documents determines the local co-occurrence pattern between words over the corpus, and thus the semantics abstracted from that local distribution. Such methods are sensitive to the statistical properties of the distribution of words over documents. For instance, the semantics of the word table abstracted from a scientific corpus and from a general corpus may differ. In the first case, since table may occur in the context of table of correlations or table of results, it would be considered associated with the word correlation, whereas in the second case, because it may co-occur with kitchen or living-room, it would rather be considered similar to chair. Nevertheless, the formal relation between the properties of the distribution of word co-occurrences and the final semantics produced by semantic space methods has not been described until now. In the case of a mixed “scientific and general” corpus, what makes the semantics of table more similar to chair than to Spearman, and vice versa? We approached the Top-stories task of the 2009 Blog track using a system named Blogosphere Random Analysis using Texts (BRAT), composed of two layers. The first layer distributes and represents blog posts in different semantic spaces built using Random Indexing. The second layer is a retrieval algorithm that navigates the semantic space via a random walk. BRAT was constructed under two main working hypotheses that we considered important for dealing with the semantics of the blogosphere: the notion of semantic identity and the notion of semantic pollution. The article is organized as follows.
In the first part, we briefly overview the methods and properties of semantic space models. The notions of semantic identity and semantic pollution are described in general, together with their practical implications within the Top-stories task. In the second part, the BRAT system is described. The third part gives an overview of the performance of BRAT on the Top-stories task.
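
Random Indexing itself is a standard construction and can be sketched as follows: each word gets a fixed sparse random index vector, and a word's semantic vector accumulates the index vectors of its window neighbours. Dimensionality, sparsity, and window size here are illustrative defaults, not BRAT's settings.

```python
import random
from collections import defaultdict

def build_random_index_space(docs, dim=512, nonzeros=8, window=2, seed=0):
    """Random Indexing sketch: sparse ternary index vectors, summed contexts."""
    rng = random.Random(seed)
    index_vecs = {}

    def index_vec(word):
        # lazily assign each word a fixed sparse +/-1 index vector
        if word not in index_vecs:
            v = [0] * dim
            for pos in rng.sample(range(dim), nonzeros):
                v[pos] = rng.choice((-1, 1))
            index_vecs[word] = v
        return index_vecs[word]

    semantic = defaultdict(lambda: [0] * dim)
    for doc in docs:
        tokens = doc.lower().split()
        for i, w in enumerate(tokens):
            # add the index vectors of neighbours within the window
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    neighbour = index_vec(tokens[j])
                    vec = semantic[w]
                    for k in range(dim):
                        vec[k] += neighbour[k]
    return dict(semantic)
```

Words occurring in identical contexts end up with identical vectors, which is the distributional behaviour the abstract relies on.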

Bibtex
@inproceedings{DBLP:conf/trec/GhaliH09,
    author = {Adil El Ghali and Yann Vigile Hoareau},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{BRAT:} {A} Random Walk Through the Semantic Spaces of the Blogosphere},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uparis.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GhaliH09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BIT at TREC 2009 Faceted Blog Distillation Task

Peng Jiang, Qing Yang, Chunxia Zhang, Zhendong Niu

Abstract

This paper presents the work done for the TREC 2009 faceted blog distillation task of the Blog track. In our approach, we use a mixture of language models based on a global representation. Our model can be regarded as a combination of a topic relevance model and a faceted relevance model. Using a pseudo-relevance feedback method, we estimate these two models from topic relevance feedback documents and facet relevance feedback documents, respectively. Experimental results on the TREC Blogs08 collection show the effectiveness of our proposed approach.

Bibtex
@inproceedings{DBLP:conf/trec/JiangYZN09,
    author = {Peng Jiang and Qing Yang and Chunxia Zhang and Zhendong Niu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{BIT} at {TREC} 2009 Faceted Blog Distillation Task},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/bit.BLOG.pdf},
    timestamp = {Fri, 04 Sep 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/JiangYZN09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Lugano at TREC 2009 Blog Track

Mostafa Keikha, Mark James Carman, Robert Gwadera, Shima Gerani, Ilya Markov, Giacomo Inches, Az Azrinudin Alidin, Fabio Crestani

Abstract

We report on the University of Lugano's participation in the Blog track of TREC 2009. In particular we describe our system for performing blog distillation, faceted search and top stories identification.

Bibtex
@inproceedings{DBLP:conf/trec/KeikhaCGGMIAC09,
    author = {Mostafa Keikha and Mark James Carman and Robert Gwadera and Shima Gerani and Ilya Markov and Giacomo Inches and Az Azrinudin Alidin and Fabio Crestani},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Lugano at {TREC} 2009 Blog Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ulugano.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KeikhaCGGMIAC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

POSTECH at TREC 2009 Blog Track: Top Stories Identification

Yeha Lee, Hun-Young Jung, Woosang Song, Jong-Hyeok Lee

Abstract

This paper describes our participation in the TREC 2009 Blog Track. Our system consists of the query likelihood component and the news headline prior component, based on the language model framework. For the query likelihood, we propose several approaches to estimate the query language model and the news headline language model. We also suggest two approaches to choose the 10 supporting relevant posts: Feed-Based Selection and Cluster-Based Selection. Furthermore, we propose two criteria to estimate the news headline prior for a given day. Experimental results show that using the prior significantly improves the performance of the task.
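The two components above, a query likelihood and a headline prior, multiply in the language-model framework, i.e. add in log space. The following sketch is illustrative only: the smoothing scheme, background mass, and toy probabilities are assumptions, not the paper's estimators.

```python
import math

# Hedged sketch of language-model scoring for top stories identification:
# log P(query | headline LM) plus the log of a headline prior for the day.
# The mixing weight mu and background probability bg are illustrative.

def headline_score(query_terms, headline_lm, prior, mu=0.1, bg=1e-3):
    """headline_lm: {term: probability}; prior: headline prior for the day."""
    log_lik = sum(math.log((1.0 - mu) * headline_lm.get(t, 0.0) + mu * bg)
                  for t in query_terms)
    return log_lik + math.log(prior)

# A headline whose language model covers the query term outranks one
# that relies on its prior alone.
s_match = headline_score(["election"], {"election": 0.3}, prior=0.6)
s_prior = headline_score(["election"], {}, prior=0.9)
```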

Bibtex
@inproceedings{DBLP:conf/trec/LeeJSL09,
    author = {Yeha Lee and Hun{-}Young Jung and Woosang Song and Jong{-}Hyeok Lee},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{POSTECH} at {TREC} 2009 Blog Track: Top Stories Identification},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/postech-kle.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LeeJSL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Facet Classification of Blogs: Know-Center at the TREC 2009 Blog Distillation Task

Elisabeth Lex, Michael Granitzer, Andreas Juffinger

Abstract

In this paper, we outline our experiments carried out at the TREC 2009 Blog Distillation Task. Our system is based on a plain text index extracted from the XML feeds of the TREC Blogs08 dataset. This index was used to retrieve candidate blogs for the given topics. The resulting blogs were classified using a Support Vector Machine that was trained on a manually labelled subset of the TREC Blogs08 dataset. Our experiments included three runs on different features: firstly on nouns, secondly on stylometric properties, and thirdly on punctuation statistics. The facet identification based on our approach was successful, although a significant number of candidate blogs were not retrieved at all.
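One of the three feature sets above, punctuation statistics, can be sketched as a small feature extractor. The exact features the authors fed to their Support Vector Machine are not specified here; the punctuation set and density feature below are assumptions.

```python
from collections import Counter

# Illustrative punctuation-statistics features for facet classification:
# relative frequency of each mark plus overall punctuation density.
# The choice of marks is an assumption, not the authors' feature set.

PUNCT = set(".,;:!?\"'()-")

def punctuation_features(text):
    counts = Counter(ch for ch in text if ch in PUNCT)
    total = max(len(text), 1)
    feats = {ch: counts[ch] / total for ch in PUNCT}
    feats["density"] = sum(counts.values()) / total
    return feats
```

Such a vector could then be passed, together with noun and stylometric features, to any off-the-shelf SVM trainer.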

Bibtex
@inproceedings{DBLP:conf/trec/LexGJ09,
    author = {Elisabeth Lex and Michael Granitzer and Andreas Juffinger},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Facet Classification of Blogs: Know-Center at the {TREC} 2009 Blog Distillation Task},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/know-center.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LexGJ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A Study of Faceted Blog Distillation–PRIS at TREC 2009 Blog Track

Si Li, Huiji Gao, Hao Sun, Fei Chen, Oupeng Feng, Sanyuan Gao, Hao Zhang, Xinsheng Li, Caili Tan, Weiran Xu, Guang Chen, Jun Guo

Abstract

This paper describes BUPT (PRIS)'s participation in the faceted blog distillation task at the TREC 2009 Blog Track. The system adopts a two-stage strategy for the task. In the first stage, the system carries out a basic topic relevance retrieval to get the top k blogs for each query. In the second stage, different models are designed to judge the facets and produce the ranking.

Bibtex
@inproceedings{DBLP:conf/trec/LiGSCFGZLTXCG09,
    author = {Si Li and Huiji Gao and Hao Sun and Fei Chen and Oupeng Feng and Sanyuan Gao and Hao Zhang and Xinsheng Li and Caili Tan and Weiran Xu and Guang Chen and Jun Guo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A Study of Faceted Blog Distillation--PRIS at {TREC} 2009 Blog Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/pris.BLOG.pdf},
    timestamp = {Tue, 17 Nov 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LiGSCFGZLTXCG09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2009: Experiments with Terrier

Richard McCreadie, Craig Macdonald, Iadh Ounis, Jie Peng, Rodrygo L. T. Santos

Abstract

In TREC 2009, we extend our Voting Model for the faceted blog distillation, top stories identification, and related entity finding tasks. Moreover, we experiment with our novel xQuAD framework for search result diversification. Besides fostering our research in multiple directions, by participating in such a wide portfolio of tracks, we further develop the indexing and retrieval capabilities of our Terrier Information Retrieval platform, to effectively and efficiently cope with a new generation of large-scale test collections.

Bibtex
@inproceedings{DBLP:conf/trec/McCreadieMOPS09,
    author = {Richard McCreadie and Craig Macdonald and Iadh Ounis and Jie Peng and Rodrygo L. T. Santos},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2009: Experiments with Terrier},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/McCreadieMOPS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC Blog and TREC Chem: A View from the Corn Fields

Yelena Mejova, Viet Ha-Thuc, Steven Foster, Christopher G. Harris, Robert J. Arens, Padmini Srinivasan

Abstract

The University of Iowa team participated in the blog track and the chemistry track of TREC 2009. This is our first year participating in the blog track as well as the chemistry track.

Bibtex
@inproceedings{DBLP:conf/trec/MejovaHFHAS09,
    author = {Yelena Mejova and Viet Ha{-}Thuc and Steven Foster and Christopher G. Harris and Robert J. Arens and Padmini Srinivasan},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{TREC} Blog and {TREC} Chem: {A} View from the Corn Fields},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uiowa.BLOG.CHEM.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/MejovaHFHAS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

FEUP at TREC 2009 Blog Track: Temporal Evidence in the Faceted Blog Distillation Task

Sérgio Nunes, Cristina Ribeiro, Gabriel David

Abstract

This paper describes the participation of FEUP, from the University of Porto, in the TREC 2009 Blog Track. FEUP participated in the faceted blog distillation task with work focused on the use of temporal features available in the new TREC Blogs08 collection. The approach presented in this paper uses the temporal information available in most individual posts to amplify (or reduce) each post's score. Blog scores, and subsequent ranks, are obtained by combining the individual posts' scores. While preparing the runs, no attempt was made to identify a priori any temporal differences between the three distinct facets.
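The score amplification and combination steps above can be sketched as follows. The exponential decay form, the half-life parameter, and summation as the combination rule are all assumptions for illustration; the paper's actual temporal weighting may differ.

```python
import math

# Hedged sketch: dampen a post's retrieval score by its age (so recent
# posts are amplified relative to old ones), then combine post scores
# into a blog score by summation. Decay shape and half_life are assumed.

def temporal_post_score(base_score, age_days, half_life=30.0):
    return base_score * math.exp(-math.log(2) * age_days / half_life)

def blog_score(posts):
    """posts: list of (base_score, age_days) tuples for one blog."""
    return sum(temporal_post_score(s, a) for s, a in posts)
```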

Bibtex
@inproceedings{DBLP:conf/trec/NunesRD09,
    author = {S{\'{e}}rgio Nunes and Cristina Ribeiro and Gabriel David},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{FEUP} at {TREC} 2009 Blog Track: Temporal Evidence in the Faceted Blog Distillation Task},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/feup.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/NunesRD09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

From Blogs to News: Identifying Hot Topics in the Blogosphere

Wouter Weerkamp, Manos Tsagkias, Maarten de Rijke

Abstract

We describe the participation of the University of Amsterdam's ILPS group in the blog track at TREC 2009. We focus on the top stories identification task, and take an approach that does not require the headlines of top stories to be known beforehand. We explore the feasibility of a so-called blogs to news approach: given a date and a set of blog posts, identify the main topics for that date. This approach is more general than just finding top stories, but it can still be applied to the task of headline ranking. Results show that this general approach, applied to the task at hand, is among the top performing approaches in this year's TREC.

Bibtex
@inproceedings{DBLP:conf/trec/WeerkampTR09,
    author = {Wouter Weerkamp and Manos Tsagkias and Maarten de Rijke},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {From Blogs to News: Identifying Hot Topics in the Blogosphere},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-derijke.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WeerkampTR09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

ICTNET at Blog Track TREC 2009

Xueke Xu, Yue Liu, Hongbo Xu, Xiaoming Yu, Linhai Song, Feng Guan, Zeying Peng, Xueqi Cheng

Abstract

This paper describes our participation in the blog track of TREC 2009. Runs were submitted for both tasks, namely the top stories identification task and the faceted blog distillation task. The “FirteX” platform was used to index and retrieve posts. For the top stories identification task, we measure the importance of a headline by accumulating its BM25 relevance scores against the posts on the query day. We propose a graph-based iterative approach and a sub-topic-detection-based approach, respectively, to identify diverse blog posts. For the faceted blog distillation task, we adopt a straightforward approach and measure topical relevance by exploiting only the top 10,000 ad hoc posts. To identify facet inclination, we either train a centroid classifier or compute per-term facet inclination weights to obtain a facet inclination score, and rerank feeds by combining the relevance score and the facet inclination score.
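The headline-importance step above, accumulating BM25 relevance over the day's posts, can be sketched with a minimal BM25 scorer. The scorer below uses the standard Okapi formula with default parameters as a stand-in; the actual system used the "FirteX" platform, and the toy documents are assumptions.

```python
import math
from collections import Counter

# Minimal Okapi BM25 over tokenized documents (k1, b at common defaults).
def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    tf = Counter(doc)
    score = 0.0
    for t in query:
        df = sum(1 for d in docs if t in d)
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)
        f = tf[t]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

def headline_importance(headline, day_posts):
    """Accumulate the headline's BM25 relevance over all posts of the day."""
    return sum(bm25_score(headline, p, day_posts) for p in day_posts)
```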

Bibtex
@inproceedings{DBLP:conf/trec/XuLXYSGPC09,
    author = {Xueke Xu and Yue Liu and Hongbo Xu and Xiaoming Yu and Linhai Song and Feng Guan and Zeying Peng and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ICTNET} at Blog Track {TREC} 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ictnet.BLOG.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/XuLXYSGPC09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Entity

Overview of the TREC 2009 Entity Track

Krisztian Balog, Arjen P. de Vries, Pavel Serdyukov, Paul Thomas, Thijs Westerveld

Abstract

The goal of the entity track is to perform entity-oriented search tasks on the World Wide Web. Many user information needs would be better answered by specific entities instead of just any type of documents. The track defines entities as “typed search results,” “things,” represented by their homepages on the web. Searching for entities thus corresponds to ranking these homepages. The track thereby investigates a problem quite similar to the QA list task. In this pilot year, we limited the track's scope to searches for instances of the organizations, people, and product entity types.

Bibtex
@inproceedings{DBLP:conf/trec/BalogVSTW09,
    author = {Krisztian Balog and Arjen P. de Vries and Pavel Serdyukov and Paul Thomas and Thijs Westerveld},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2009 Entity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ENT09.OVERVIEW.pdf},
    timestamp = {Wed, 07 Jul 2021 16:44:22 +0200},
    biburl = {https://dblp.org/rec/conf/trec/BalogVSTW09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Related Entity Finding Based on Co-Occurance

Marc Bron, Krisztian Balog, Maarten de Rijke

Abstract

We report on experiments for the Related Entity Finding task in which we focus on using only Wikipedia as a target corpus in which to identify (related) entities. Our approach is based on co-occurrences between the source entity and potential target entities. We observe improvements in performance when a context-independent co-occurrence model is combined with context-dependent co-occurrence models in which we stress the importance of the expected relation between source and target entity. Applying type filtering yields further improvements.
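The combination and filtering steps above can be sketched as follows. The linear interpolation, the weight `lam`, and the toy candidate scores and types are assumptions for illustration; only the overall shape (combine two co-occurrence scores, then filter by target type) comes from the abstract.

```python
# Hedged sketch: interpolate a context-independent co-occurrence score
# with a context-dependent one, then keep only candidates whose type
# matches the requested target type.

def combined_score(cooc, cooc_rel, lam=0.5):
    return lam * cooc + (1.0 - lam) * cooc_rel

def rank_candidates(candidates, target_type, lam=0.5):
    """candidates: {entity: (cooc, cooc_rel, type)} -> filtered ranking."""
    keep = {e: v for e, v in candidates.items() if v[2] == target_type}
    return sorted(keep,
                  key=lambda e: combined_score(keep[e][0], keep[e][1], lam),
                  reverse=True)
```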

Bibtex
@inproceedings{DBLP:conf/trec/BronBR09,
    author = {Marc Bron and Krisztian Balog and Maarten de Rijke},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Related Entity Finding Based on Co-Occurance},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-derijke.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BronBR09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Entity Retrieval with Hierarchical Relevance Model, Exploiting the Structure of Tables and Learning Homepage Classifiers

Yi Fang, Luo Si, Zhengtao Yu, Yantuan Xian, Yangbo Xu

Abstract

This paper gives an overview of our work done for the TREC 2009 Entity track. We propose a hierarchical relevance retrieval model for entity ranking. In this model, three levels of relevance are examined, which are document, passage and entity, respectively. The final ranking score is a linear combination of the relevance scores from the three levels. Furthermore, we exploit the structure of tables and lists to identify the target entities from them by making a joint decision on all the entities with the same attribute. To find entity homepages, we train logistic regression models for each type of entity. A set of templates and filtering rules is also used to identify target entities. The key lessons that we learned by participating in this year's Entity track include: 1) our special treatment of table and list data paid off; 2) high accuracy in homepage finding is crucial in this track; 3) Wikipedia can serve as a valuable knowledge resource for different aspects of the related entity finding task.
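The three-level linear combination described above reduces to a weighted sum. The weights below are free parameters chosen for illustration, not the values used in the paper.

```python
# Sketch of the hierarchical relevance score: a linear combination of
# document-, passage-, and entity-level relevance. Weights are assumed.

def entity_score(doc_score, passage_score, ent_score, w=(0.3, 0.3, 0.4)):
    return w[0] * doc_score + w[1] * passage_score + w[2] * ent_score
```

With weights summing to one, the combined score stays on the same scale as its components, which simplifies tuning the three weights on training topics.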

Bibtex
@inproceedings{DBLP:conf/trec/FangSYXX09,
    author = {Yi Fang and Luo Si and Zhengtao Yu and Yantuan Xian and Yangbo Xu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Entity Retrieval with Hierarchical Relevance Model, Exploiting the Structure of Tables and Learning Homepage Classifiers},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/purdue.ENT.pdf},
    timestamp = {Tue, 17 May 2022 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/FangSYXX09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Result Diversity and Entity Ranking Experiments: Anchors, Links, Text and Wikipedia

Rianne Kaptein, Marijn Koolen, Jaap Kamps

Abstract

In this paper, we document our efforts in participating in the TREC 2009 Entity Ranking and Web Tracks. We had multiple aims: For the Web Track's Adhoc task we experiment with document text and anchor text representation, and the use of the link structure. For the Web Track's Diversity task we experiment with using a top down sliding window that, given the top ranked documents, chooses as the next ranked document the one that has the most unique terms or links. We test our sliding window method on a standard document text index and an index of propagated anchor texts. We also experiment with extreme query expansions by taking the top n results of the initial ranking as multi-faceted aspects of the topic to construct n relevance models to obtain n sets of results. A final diverse set of results is obtained by merging the n results lists. For the Entity Ranking Track, we also explore the effectiveness of the anchor text representation, look at the co-citation graph, and experiment with using Wikipedia as a pivot. Our main findings can be summarized as follows: Anchor text is very effective for diversity. It gives high early precision and the results cover more relevant sub-topics than the document text index. Our baseline runs have low diversity, which limits the possible impact of the sliding window approach. New link information seems more effective for diversifying text-based search results than the number of unique terms added by a document. In the entity ranking task, anchor text finds few primary pages, but it does retrieve a large number of relevant pages. Using Wikipedia as a pivot results in large gains in P10 and NDCG when only primary pages are considered. Although the links between the Wikipedia entities and pages in the ClueWeb collection are sparse, the precision of the existing links is very high.
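The sliding-window reranking described above can be sketched as a greedy loop: from a window of the next candidates, pick the document that contributes the most terms not yet covered by the selection. The window size and term sets below are illustrative, and the variant using links instead of terms works the same way.

```python
# Hedged sketch of top-down sliding-window diversification: at each step,
# choose from the next `window` candidates the document adding the most
# unseen terms. ranked_docs: list of (doc_id, set_of_terms) in rank order.

def diversify(ranked_docs, window=3):
    remaining = list(ranked_docs)
    seen, out = set(), []
    while remaining:
        cand = remaining[:window]
        best = max(cand, key=lambda d: len(d[1] - seen))
        out.append(best[0])
        seen |= best[1]
        remaining.remove(best)
    return out
```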

Bibtex
@inproceedings{DBLP:conf/trec/KapteinKK09,
    author = {Rianne Kaptein and Marijn Koolen and Jaap Kamps},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Result Diversity and Entity Ranking Experiments: Anchors, Links, Text and Wikipedia},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uamsterdam-kamps.ENT.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/KapteinKK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

University of Glasgow at TREC 2009: Experiments with Terrier

Richard McCreadie, Craig Macdonald, Iadh Ounis, Jie Peng, Rodrygo L. T. Santos

Abstract

In TREC 2009, we extend our Voting Model for the faceted blog distillation, top stories identification, and related entity finding tasks. Moreover, we experiment with our novel xQuAD framework for search result diversification. Besides fostering our research in multiple directions, by participating in such a wide portfolio of tracks, we further develop the indexing and retrieval capabilities of our Terrier Information Retrieval platform, to effectively and efficiently cope with a new generation of large-scale test collections.

Bibtex
@inproceedings{DBLP:conf/trec/McCreadieMOPS09,
    author = {Richard McCreadie and Craig Macdonald and Iadh Ounis and Jie Peng and Rodrygo L. T. Santos},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2009: Experiments with Terrier},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uglasgow.BLOG.ENT.MQ.RF.WEB.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/McCreadieMOPS09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A Journey in Entity Related Retrieval for TREC 2009

Jagadish Pamarthi, GuangXu Zhou, Coskun Bayrak

Abstract

The focus of this paper is to present the results obtained from performing entity information retrieval, namely finding the home pages of products, organizations and persons. The preliminary results of this study and experimentation, based on the Indri search engine, were presented at the Entity Track in TREC 2009. Indri is an efficient and effective open-source search engine, driven by the Indri query language, that runs on Windows and UNIX-based platforms. Indri is based on the inference network framework and supports structured queries.

Bibtex
@inproceedings{DBLP:conf/trec/PamarthiZB09,
    author = {Jagadish Pamarthi and GuangXu Zhou and Coskun Bayrak},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A Journey in Entity Related Retrieval for {TREC} 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uarkansas-lr.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/PamarthiZB09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Delft University at the TREC 2009 Entity Track: Ranking Wikipedia Entities

Pavel Serdyukov, Arjen P. de Vries

Abstract

This paper describes the details of our participation in the Entity track of TREC 2009.

Bibtex
@inproceedings{DBLP:conf/trec/SerdyukovV09,
    author = {Pavel Serdyukov and Arjen P. de Vries},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Delft University at the {TREC} 2009 Entity Track: Ranking Wikipedia Entities},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/delft.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SerdyukovV09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Finding Related Entities by Retrieving Relations: UIUC at TREC 2009 Entity Track

V. G. Vinod Vydiswaran, Kavita Ganesan, Yuanhua Lv, Jing He, ChengXiang Zhai

Abstract

Our goal in participating in the TREC 2009 Entity Track was to study whether relation extraction techniques can help in improving accuracy of the entity finding task. Finding related entities is informational in nature and we wanted to explore if inducing structure on the queries helps satisfy this information need. The research outlook we took was to study techniques that retrieve relations between two entities from a large corpus, and from those, find the most relevant entities that participate in the given relation with another given entity. Instead of aiming at retrieving pages about specific entities, we tried to address the problem of directly finding the entities from the text. Our experimental results show that we were able to find many related entities using relation-based extraction, and ranking entities based on further evidence from the text helps to a certain extent.

Bibtex
@inproceedings{DBLP:conf/trec/VydiswaranGLHZ09,
    author = {V. G. Vinod Vydiswaran and Kavita Ganesan and Yuanhua Lv and Jing He and ChengXiang Zhai},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Finding Related Entities by Retrieving Relations: {UIUC} at {TREC} 2009 Entity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/uiuc.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/VydiswaranGLHZ09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

BUPT at TREC 2009: Entity Track

Zhanyi Wang, Dong-xin Liu, Weiran Xu, Guang Chen, Jun Guo

Abstract

This report introduces the work of BUPT (PRIS) in the Entity Track at TREC 2009. The task and data are both new this year. In our work, an improved two-stage retrieval model is proposed for the task. The first stage is document retrieval, to obtain the similarity between the query and documents. The second stage finds the relationship between documents and entities. We also focus on entity extraction in the second stage and on the final ranking.

Bibtex
@inproceedings{DBLP:conf/trec/WangLXCG09,
    author = {Zhanyi Wang and Dong{-}xin Liu and Weiran Xu and Guang Chen and Jun Guo},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{BUPT} at {TREC} 2009: Entity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/bupt.ENT.pdf},
    timestamp = {Tue, 17 Nov 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WangLXCG09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

NiCT at TREC 2009: Employing Three Models for Entity Ranking Track

Youzheng Wu, Hideki Kashioka

Abstract

This paper describes experiments carried out at NiCT for the TREC 2009 Entity Ranking track. Our main study is to develop an effective approach to rank entities via measuring the “similarities” between supporting snippets of entities and input query. Three models are implemented to this end. 1) The DLM regards entity ranking as a task of calculating the probabilities of generating input query given supporting snippets of entities via language model. 2) The RSVM ranks entities via a supervised Ranking SVM. 3) The CSVM, an unsupervised model, ranks entities according to the probabilities of input query belonging to topics represented by entities and their supporting snippets via SVM classifier. The evaluation shows that the DLM is the best on P@10, while the RSVM outperforms the others on nDCG.
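The DLM described above ranks entities by the query likelihood of their supporting snippets. The sketch below uses Jelinek-Mercer smoothing against a collection model as a plausible instantiation; the smoothing choice, weight, and toy snippets are assumptions, not the paper's exact estimator.

```python
import math
from collections import Counter

# Hedged sketch of the DLM: score an entity by log P(query | snippets),
# smoothing the snippet language model with a collection model.
# The interpolation weight lam is illustrative.

def dlm_score(query, snippet_terms, collection_terms, lam=0.8):
    tf, ctf = Counter(snippet_terms), Counter(collection_terms)
    dl, cl = len(snippet_terms), len(collection_terms)
    score = 0.0
    for t in query:
        p_doc = tf[t] / dl if dl else 0.0
        p_col = ctf[t] / cl if cl else 0.0
        p = lam * p_doc + (1.0 - lam) * p_col
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Entities whose supporting snippets cover the query terms receive higher scores; the RSVM and CSVM variants replace this generative score with supervised and classification-based rankers, respectively.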

Bibtex
@inproceedings{DBLP:conf/trec/WuK09,
    author = {Youzheng Wu and Hideki Kashioka},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {NiCT at {TREC} 2009: Employing Three Models for Entity Ranking Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/nict.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WuK09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Experiments on Related Entity Finding Track at TREC 2009

Qing Yang, Peng Jiang, Chunxia Zhang, Zhendong Niu

Abstract

Our goal in participating in the TREC 2009 Entity Track is to study whether QA list techniques can help improve the accuracy of the entity finding task. We also look at homepage finding, identifying the homepages of an entity by training a maximum entropy classifier and logistic regression models for the three entity types, respectively.

Bibtex
@inproceedings{DBLP:conf/trec/YangJZN09,
    author = {Qing Yang and Peng Jiang and Chunxia Zhang and Zhendong Niu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Experiments on Related Entity Finding Track at {TREC} 2009},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/bit.ENT.pdf},
    timestamp = {Fri, 04 Sep 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/YangJZN09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

A Novel Framework for Related Entities Finding: ICTNET at TREC 2009 Entity Track

Haijun Zhai, Xueqi Cheng, Jiafeng Guo, Hongbo Xu, Yue Liu

Abstract

This paper addresses the problem of related entity finding, which was proposed at TREC 2009. The overall aim of related entity finding (REF) is to perform entity-related search on Web data, addressing common information needs that are not well modeled as ad hoc document search. In this paper, a novel framework based on a probabilistic model is proposed for related entity finding in a Web collection. This model consists of two parts. One is the probability indicating the relation between the source entity and the candidate entities. The other is the probability indicating the relevance between the candidate entities and the topic. Experimental evaluations on the ClueWeb09 dataset show the effectiveness of our REF framework.
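The two-part model above combines multiplicatively: a candidate's score is the product of its relation probability with the source entity and its relevance probability to the topic. The independence assumption and the toy probabilities below are assumptions for illustration.

```python
# Sketch of the two-part probabilistic REF model: score each candidate by
# P(relation to source entity) * P(relevance to topic), assuming the two
# factors are estimated independently. Probability values are toy inputs.

def ref_score(p_relation, p_topic):
    return p_relation * p_topic

def rank(candidates):
    """candidates: {entity: (p_relation, p_topic)} -> ids by product score."""
    return sorted(candidates,
                  key=lambda e: ref_score(*candidates[e]),
                  reverse=True)
```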

Bibtex
@inproceedings{DBLP:conf/trec/ZhaiCGXL09,
    author = {Haijun Zhai and Xueqi Cheng and Jiafeng Guo and Hongbo Xu and Yue Liu},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {A Novel Framework for Related Entities Finding: {ICTNET} at {TREC} 2009 Entity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/ictnet.ENT.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ZhaiCGXL09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

UDEL/SMU at TREC 2009 Entity Track

Wei Zheng, Swapna Gottipati, Jing Jiang, Hui Fang

Abstract

We report our methods and experimental results from the collaborative participation of the InfoLab group from the University of Delaware and the School of Information Systems from Singapore Management University in the TREC 2009 Entity track. Our general goal is to study how we may apply language modeling approaches and natural language processing techniques to the task. Specifically, we proposed to find supporting information based on segment retrieval, to extract entities using the Stanford NER tagger, and to rank entities based on a previously proposed probabilistic framework for expert finding.
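The three-step pipeline above (segment retrieval, entity extraction, probabilistic ranking) can be sketched as follows. This is a toy illustration: a capitalized-token matcher stands in for the Stanford NER tagger, and the segment scores are given rather than produced by a retrieval model.

```python
import re
from collections import defaultdict

# Step 1 (assumed done): retrieved segments with retrieval scores P(s|q)
segments = [
    ("Apple works with Foxconn on assembly.", 0.9),
    ("Foxconn and Pegatron are suppliers.", 0.7),
    ("Weather was mild today.", 0.1),
]

def extract_entities(text):
    # Step 2: placeholder for the NER step — capitalized tokens only
    return re.findall(r"\b[A-Z][a-z]+\b", text)

# Step 3: aggregate supporting-segment scores per candidate, in the
# spirit of expert-finding models: P(e|q) = sum over segments of P(e|s) P(s|q)
scores = defaultdict(float)
for text, seg_score in segments:
    for entity in extract_entities(text):
        scores[entity] += seg_score

ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # Foxconn, supported by two high-scoring segments
```

Aggregating over segments rewards entities that co-occur with the query in many well-scored passages, which is the core intuition carried over from expert finding.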

Bibtex
@inproceedings{DBLP:conf/trec/ZhengGJF09,
    author = {Wei Zheng and Swapna Gottipati and Jing Jiang and Hui Fang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{UDEL/SMU} at {TREC} 2009 Entity Track},
    booktitle = {Proceedings of The Eighteenth Text REtrieval Conference, {TREC} 2009, Gaithersburg, Maryland, USA, November 17-20, 2009},
    series = {{NIST} Special Publication},
    volume = {500-278},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2009},
    url = {http://trec.nist.gov/pubs/trec18/papers/udelaware-fang.ENT.pdf},
    timestamp = {Tue, 20 Oct 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhengGJF09.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}