Proceedings - Web 2012

Overview of the TREC 2012 Web Track

Charles L. A. Clarke, Nick Craswell, Ellen M. Voorhees

Paper: 10.6028/NIST.SP.500-298.web-overview

Abstract

If you are an experienced participant, you may not need to read the full report. Apart from the results themselves (see tables 1, 2, and 3) little has changed from TREC 2011 [6]. A six-point scale was used for relevance assessment (see section 4.1). Limitations on available assessor time meant that some topics were judged to depth 30 and others to depth 20, as well as causing other minor problems (see section 4.3). However, our plans for next year, as outlined in the concluding section, are quite different from this year.

Bibtex

@inproceedings{DBLP:conf/trec/ClarkeCV12,
    author = {Charles L. A. Clarke and Nick Craswell and Ellen M. Voorhees},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Overview of the {TREC} 2012 Web Track},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/WEB12.overview.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/ClarkeCV12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-overview}
}

Query-Structure Based Web Page Indexing

Falah Hassan Al-akashi, Diana Inkpen

Participant: uottawa
Paper: 10.6028/NIST.SP.500-298.web-uottawa
Runs: DFalah121A | DFalah121D | DFalah120A | DFalah120D

Abstract

Indexing is a crucial technique for dealing with the massive amount of data present on the web. In our third participation in the Web track at TREC 2012, we explore the idea of building an efficient query-based indexing system over a Web page collection. Our prototype explores the trends in user queries and consequently indexes texts using particular attributes available in the documents. This paper provides an in-depth description of our approach for indexing web documents efficiently; that is, topics available in the web documents are discovered with the assistance of knowledge available in Wikipedia. The well-defined articles in Wikipedia are shown to be valuable as a training set when indexing web pages. Our complex index structure also records information from titles and URLs, and pays attention to web domains. Our approach is designed to close the gaps, for some queries, in our approaches from the previous two years. Our framework is able to efficiently index the 50 million pages available in subset B of the ClueWeb09 collection. Our preliminary experiments on the TREC 2012 testing queries showed that our indexing scheme is robust and efficient for both indexing and retrieving relevant web pages, for both the ad-hoc and diversity tasks.

Bibtex

@inproceedings{DBLP:conf/trec/Al-akashiI12,
    author = {Falah Hassan Al{-}akashi and Diana Inkpen},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Query-Structure Based Web Page Indexing},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/uottawa.web.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Al-akashiI12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-uottawa}
}

Does Category A Anchor Text Improve Category B Results?

Leonid Boytsov

Participant: srchvrs
Paper: 10.6028/NIST.SP.500-298.web-srchvrs
Runs: srchvrs12c10 | srchvrs12c09 | srchvrs12c00

Abstract

We merged results obtained from the Category B index with results obtained from the index built over complete (Category A) anchor text. However, we were unable to improve over Category B results in either the ad hoc or the diversity task.

Bibtex

@inproceedings{DBLP:conf/trec/Boytsov12,
    author = {Leonid Boytsov},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Does Category {A} Anchor Text Improve Category {B} Results?},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/srchvrs.web.nb.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Boytsov12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-srchvrs}
}

LIA at TREC 2012 Web Track: Unsupervised Search Concepts Identification from General Sources of Information

Romain Deveaud, Eric SanJuan, Patrice Bellot

Participant: LIA
Paper: 10.6028/NIST.SP.500-298.web-LIA
Runs: lcmweb | lcmweb10p | lcmwebnoW | lcm4res

Abstract

In this paper, we report the experiments we conducted for our participation in the TREC 2012 Web Track. We experimented with a brand new system that models the latent concepts underlying a query. We use Latent Dirichlet Allocation (LDA), a generative probabilistic topic model, to extract highly specific query-related topics from pseudo-relevant feedback documents. We define these topics as the latent concepts of the user query. Our approach automatically estimates the number of latent concepts as well as the needed number of feedback documents, without any prior training step. These concepts are incorporated into the ranking function with the aim of promoting documents that refer to many different query-related themes. We also explored the use of different types of information sources for modeling the latent concepts. For this purpose, we use four general sources of information of various natures (web, news, encyclopedic) from which the feedback documents are extracted.

Bibtex

@inproceedings{DBLP:conf/trec/DeveaudSB12,
    author = {Romain Deveaud and Eric SanJuan and Patrice Bellot},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{LIA} at {TREC} 2012 Web Track: Unsupervised Search Concepts Identification from General Sources of Information},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/LIA.web.nb.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DeveaudSB12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-LIA}
}

IRRA at TREC 2012: Divergence From Independence (DFI)

Bekir Taner Dinçer

Participant: irra
Paper: 10.6028/NIST.SP.500-298.web-irra

Abstract

The IRRA (IR-Ra) group participated in the 2012 Web track with a system implementing a non-parametric term weighting method based on measuring the divergence from independence (DFI). This is the third year of participation for the IRRA group, following participation in the TREC 2009 and 2010 Web tracks. This year, the aim is to evaluate a new DFI-based term weighting model developed on the basis of Shannon's information theory (Shannon, 1949), along with a heuristic approach that is expected to provide early precision when used together with DFI term weighting. The TERRIER retrieval platform version 3.0 (Ounis et al., 2007) is used to index and search the ClueWeb09-T09B1 data set (“Category B” data set), a subset of about 50 million Web pages in English. During indexing and searching, terms are stemmed (Porter's stemmer as implemented in TERRIER) but not stopped. The result sets are filtered using the fusion of two spam-page lists provided by Cormack et al. (2010) for the ClueWeb09 document collection.

Bibtex

@inproceedings{DBLP:conf/trec/Dincer12,
    author = {Bekir Taner Din{\c{c}}er},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{IRRA} at {TREC} 2012: Divergence From Independence {(DFI)}},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/irra.web.nb.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/Dincer12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-irra}
}

University of Glasgow at TREC 2012: Experiments with Terrier in Medical Records, Microblog, and Web Tracks

Nut Limsopatham, Richard McCreadie, M-Dyaa Albakour, Craig Macdonald, Rodrygo L. T. Santos, Iadh Ounis

Participant: uogTr
Paper: 10.6028/NIST.SP.500-298.medical-uogTr
Runs: uogTrA44s9 | uogTrA44xi | uogTrA44xu | uogTrA44xl | uogTrB44xu | uogTrB45aIs

Abstract

In TREC 2012, we focus on tackling the new challenges posed by the Medical, Microblog and Web tracks, using our Terrier Information Retrieval Platform. In particular, for the Medical track, we investigate how to exploit implicit knowledge within medical records, with the aim of better identifying those records from patients with specific medical conditions. For the Microblog track adhoc task, we investigate novel techniques to leverage documents hyperlinked from tweets to better estimate relevance of those tweets and increase recall. Meanwhile, for the Microblog track filtering task, we developed a new stream processing infrastructure for real-time adaptive filtering on top of the Storm framework. For the TREC Web track, we continue to build upon our learning-to-rank approaches and novel xQuAD framework within Terrier, increasing both effectiveness and efficiency when ranking.

Bibtex

@inproceedings{DBLP:conf/trec/LimsopathamMAMS12,
    author = {Nut Limsopatham and Richard McCreadie and M{-}Dyaa Albakour and Craig Macdonald and Rodrygo L. T. Santos and Iadh Ounis},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {University of Glasgow at {TREC} 2012: Experiments with Terrier in Medical Records, Microblog, and Web Tracks},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/uogTr.medical.microblog.web.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LimsopathamMAMS12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.medical-uogTr}
}

Ensemble Clustering for Result Diversification

Dong Nguyen, Djoerd Hiemstra

Participant: utwente
Paper: 10.6028/NIST.SP.500-298.web-utwente
Runs: utw2012lm09 | utw2012lda | utw2012c1 | utw2012sc1 | utw2012c2 | utw2012fc1

Abstract

This paper describes the participation of the University of Twente in the Web track of TREC 2012. Our baseline approach uses the Mirex toolkit, an open source tool that sequentially scans all the documents. For result diversification, we experimented with improving the quality of clusters through ensemble clustering. We combined clusters obtained by different clustering methods (such as LDA and K-means) and clusters obtained by using different types of data (such as document text and anchor text). Our two-layer ensemble run performed better than the LDA-based diversification and also better than a non-diversification run.

Bibtex

@inproceedings{DBLP:conf/trec/NguyenH12,
    author = {Dong Nguyen and Djoerd Hiemstra},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Ensemble Clustering for Result Diversification},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/utwente.web.final.pdf},
    timestamp = {Tue, 24 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/NguyenH12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-utwente}
}

QUT_Para at TREC 2012 Web Track: Word Associations for Retrieving Web Documents

Mike Symonds, Guido Zuccon, Bevan Koopman, Peter Bruza

Participant: QUT_Para
Paper: 10.6028/NIST.SP.500-298.web-QUT_Para
Runs: QUTparaTQEg1 | QUTparaBline

Abstract

Many existing information retrieval models do not explicitly take into account information about word associations. Our approach makes use of first and second order relationships found in natural language, known as syntagmatic and paradigmatic associations, respectively. This is achieved by using a formal model of word meaning within the query expansion process. On ad hoc retrieval, our approach achieves statistically significant improvements in MAP (0.158) and P@20 (0.396) over our baseline model. The ERR@20 and nDCG@20 of our system were 0.249 and 0.192, respectively. Our results and discussion suggest that information about both syntagmatic and paradigmatic associations can assist with improving retrieval effectiveness on ad hoc retrieval.

Bibtex

@inproceedings{DBLP:conf/trec/SymondsZKBZK12,
    author = {Mike Symonds and Guido Zuccon and Bevan Koopman and Peter Bruza},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {QUT{\_}Para at {TREC} 2012 Web Track: Word Associations for Retrieving Web Documents},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/QUT\_Para.web.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SymondsZKBZK12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-QUT_Para}
}

Exploiting Ontologies for Search Result Diversification

Wei Zheng, Hui Fang

Participant: udel_fang
Paper: 10.6028/NIST.SP.500-298.web-udel_fang
Runs: UDInfoDivSt | UDInfoDivC1 | UDInfoDivC2

Abstract

We report our systems and experimental results in the diversity task of the Web track 2012. Our goal is to exploit structured data, i.e., ontologies, as well as unstructured data for search result diversification. We use two strategies in the diversification systems. The first strategy combines the ontology and unstructured data to extract integrated subtopics. It then uses a coverage-based diversification function to diversify documents based on the integrated subtopics. The second strategy exploits the structural information in the ontology for diversification. We use structural diversification to diversify documents based on the structural relationships of their subtopics in the ontology.

Bibtex

@inproceedings{DBLP:conf/trec/ZhengF12,
    author = {Wei Zheng and Hui Fang},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {Exploiting Ontologies for Search Result Diversification},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/udel\_fang.web.nb.pdf},
    timestamp = {Tue, 20 Oct 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhengF12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-udel_fang}
}

ICTNET at Web Track 2012 Diversity Task

Zilong Feng, Yuanhai Xue, Xiaoming Yu, Hongbo Xu, Yue Liu, Xueqi Cheng

Participant: ICTNET
Paper: 10.6028/NIST.SP.500-298.web-ICTNET
Runs: ICTNET12ADR1 | ICTNET12DVR1 | ICTNET12DVR2 | ICTNET12ADR2 | ICTNET12DVR3 | ICTNET12ADR3

Abstract

In this paper, we report our experiments in the Diversity task of Web Track 2012. This year, we attempt to use query expansion and topic models such as LDA [5] to obtain subtopics. A model based on xQuAD [10] was used to re-rank the ad-hoc search results.

Bibtex

@inproceedings{DBLP:conf/trec/FengXYXLC12,
    author = {Zilong Feng and Yuanhai Xue and Xiaoming Yu and Hongbo Xu and Yue Liu and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ICTNET} at Web Track 2012 Diversity Task},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/ICTNET.web-diversity.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/FengXYXLC12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-ICTNET}
}

ICTNET at Web Track 2012 adhoc Task

Heyuan Li, Yuanhai Xue, Shaohua Guo, Feng Guan, Xiaoming Yu, Yue Liu, Xueqi Cheng

Participant: ICTNET
Paper: 10.6028/NIST.SP.500-298.web-ICTNET2
Runs: ICTNET12ADR1 | ICTNET12DVR1 | ICTNET12DVR2 | ICTNET12ADR2 | ICTNET12DVR3 | ICTNET12ADR3

Abstract

In this paper, we report our experiments in the Ad-hoc task of Web Track 2012. This year, we attempt to use a new web parser with noise elimination. Conditional Boolean BM25 was used as the major ranking function. We also introduce Learning-To-Rank to combine multiple features for ranking, but the performance was poor due to the low quality of the training data.

Bibtex

@inproceedings{DBLP:conf/trec/LiXGGYLC12,
    author = {Heyuan Li and Yuanhai Xue and Shaohua Guo and Feng Guan and Xiaoming Yu and Yue Liu and Xueqi Cheng},
    editor = {Ellen M. Voorhees and Lori P. Buckland},
    title = {{ICTNET} at Web Track 2012 adhoc Task},
    booktitle = {Proceedings of The Twenty-First Text REtrieval Conference, {TREC} 2012, Gaithersburg, Maryland, USA, November 6-9, 2012},
    series = {{NIST} Special Publication},
    volume = {500-298},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2012},
    url = {http://trec.nist.gov/pubs/trec21/papers/ICTNET.web-adhoc.final.pdf},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/LiXGGYLC12.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-298.web-ICTNET2}
}