Proceedings - Novelty 2004¶
Overview of the TREC 2004 Novelty Track¶
Ian Soboroff
Abstract
TREC 2004 marks the third and final year for the novelty track. The task is as follows: given a TREC topic and an ordered list of documents, systems must find the relevant and novel sentences that should be returned to the user from this set. This task integrates aspects of passage retrieval and information filtering. As in 2003, there were two categories of topics, events and opinions, and four subtasks which provided systems with varying amounts of relevance or novelty information as training data. This year, the task was made harder by the inclusion of irrelevant documents in the document sets. Fourteen groups participated in the track this year.
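The track specifies only the input/output contract, not an implementation; a minimal sketch of the per-topic pipeline (all function names here are hypothetical) might look like:

```python
# Hypothetical skeleton of the Novelty track task: given a topic and an
# ordered document list, return the relevant sentences, then the novel
# subset of those, preserving document order.

def run_topic(topic: str, documents: list[list[str]],
              is_relevant, is_novel) -> tuple[list[str], list[str]]:
    """documents: ordered list of docs, each a list of sentences.
    is_relevant(topic, sent) and is_novel(sent, seen) are the two
    components a track participant must supply."""
    relevant, novel = [], []
    for doc in documents:               # documents arrive in a fixed order
        for sent in doc:
            if is_relevant(topic, sent):
                relevant.append(sent)
                if is_novel(sent, novel):   # compare only to prior novel picks
                    novel.append(sent)
    return relevant, novel
```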
Bibtex
@inproceedings{DBLP:conf/trec/Soboroff04,
author = {Ian Soboroff},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Overview of the {TREC} 2004 Novelty Track},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/NOVELTY.OVERVIEW.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/Soboroff04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for TREC 2004¶
Stephen Blott, Fabrice Camous, Paul Ferguson, Georgina Gaughan, Cathal Gurrin, Gareth J. F. Jones, Noel Murphy, Noel E. O'Connor, Alan F. Smeaton, Peter Wilkins, Oisín Boydell, Barry Smyth
- Participant: dublincity.u
- Paper: http://trec.nist.gov/pubs/trec13/papers/dcu.tera.geo.novelty.pdf
- Runs: cdvp4QePnD2 | cdvp4CnQry2 | cdvp4QePDPC2 | cdvp4CnS101 | cdvp4QeSnD1 | cdvp4UnHis3 | cdvp4NSen4 | cdvp4NTerFr3 | cdvp4NTerFr1 | cdvp4NSnoH4
Abstract
In TREC 2004, Dublin City University took part in three tracks: Terabyte (in collaboration with University College Dublin), Genomic, and Novelty. In this paper we discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments in large-scale, distributed information retrieval, which underlies all of the track experiments described in this document.
Bibtex
@inproceedings{DBLP:conf/trec/BlottCFGGJMOSWBS04,
author = {Stephen Blott and Fabrice Camous and Paul Ferguson and Georgina Gaughan and Cathal Gurrin and Gareth J. F. Jones and Noel Murphy and Noel E. O'Connor and Alan F. Smeaton and Peter Wilkins and Ois{\'{\i}}n Boydell and Barry Smyth},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Experiments in Terabyte Searching, Genomic Retrieval and Novelty Detection for {TREC} 2004},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/dcu.tera.geo.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/BlottCFGGJMOSWBS04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
A Hidden Markov Model for the TREC Novelty Task¶
John M. Conroy
- Participant: ida.ccs.nsa
- Paper: http://trec.nist.gov/pubs/trec13/papers/ida-ccs-conroy.novelty.pdf
- Runs: ccs3fqrt1 | ccs1f0t1 | ccs1ftop0t1 | ccs3ftop0t1 | ccs3fmmrt1 | ccs3fmmr95t3 | ccsqrt2 | ccsmmr2t2 | ccsmmr3t2 | ccsmmr5t2 | ccsmmr4t2 | ccsfbmmrt3 | ccs3fmmrt3 | ccs3fbqrt3
Abstract
The algorithms for choosing relevant sentences were tuned versions of those presented in past DUC evaluations and TREC 2003 (see [4, 5, 10, 11] for more details). The enhancements to the previous system are detailed in Section 3. Two methods were explored to find a subset of the relevant sentences that had good coverage but low redundancy. In the multi-document summarization system, the QR algorithm is used on term-sentence matrices. For this work, the method of maximum marginal relevance was also employed. The evaluation of these methods is discussed in Section 5.
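The abstract names maximal marginal relevance (MMR) as one of the two redundancy-removal methods; a minimal sketch of generic MMR selection over bag-of-words sentence vectors (not the paper's exact scoring, which combines HMM relevance estimates with QR on term-sentence matrices) could be:

```python
# Generic maximal-marginal-relevance selection: repeatedly pick the
# sentence balancing relevance to the query against similarity to the
# sentences already selected. A sketch, not the paper's implementation.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def mmr_select(query: str, sentences: list[str], k: int,
               lam: float = 0.7) -> list[str]:
    q = Counter(query.lower().split())
    vecs = {s: Counter(s.lower().split()) for s in sentences}
    selected: list[str] = []
    candidates = list(sentences)
    while candidates and len(selected) < k:
        def score(s):
            rel = cosine(vecs[s], q)
            red = max((cosine(vecs[s], vecs[t]) for t in selected), default=0.0)
            return lam * rel - (1 - lam) * red   # relevance minus redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```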
Bibtex
@inproceedings{DBLP:conf/trec/Conroy04,
author = {John M. Conroy},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {A Hidden Markov Model for the {TREC} Novelty Task},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/ida-ccs-conroy.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/Conroy04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
TREC Novelty Track at IRIT-SIG¶
Taoufiq Dkaki, Josiane Mothe
- Participant: irit.sig.boughanem
- Paper: http://trec.nist.gov/pubs/trec13/papers/irit-sig.novelty.pdf
- Runs: IRITT1 | IRITT2 | IRITT3 | IRITT4 | IRITT5 | IritTask2 | Irit2T2 | Irit2Task3 | Irit3Task3 | Irit4Task3 | Irit5Task3 | Irit1T3
Abstract
In TREC 2004, IRIT modified important features of the strategy developed for TREC 2003. Changes include tuning parameter values, topic expansion, and exploitation of sentence context. In our method, a sentence is considered relevant if it matches the topic with a certain level of coverage, where the coverage depends on the category of the terms used in the texts. Four types of terms have been defined: highly relevant, scarcely relevant, non-relevant (like stop words), and highly non-relevant (negative terms). Term categorization is based on topic analysis: highly non-relevant terms are extracted from the narrative parts that describe what a non-relevant document will be, while the three other types of terms are extracted from the rest of the query. Each term of a topic is weighted according to both its occurrence and the topic part it belongs to (title, description, narrative). Additionally, we increase the score of a sentence when either the previous or the next sentence is relevant. When topic expansion is applied, terms from relevant sentences (task 3) or from the first retrieved sentences (task 1) are added to the initial terms. With regard to the novelty part, a sentence is considered novel if its similarity with each previously processed sentence selected as novel does not exceed a certain threshold. In addition, the sentence should not be too similar to a virtual sentence made of the n best-matching previously selected sentences.
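A minimal sketch of the two decisions described above (the term categories, weights, and thresholds are illustrative placeholders, not the values IRIT tuned):

```python
# Sketch of IRIT-style decisions: relevance = weighted topic-term coverage;
# novelty = similarity to each prior novel sentence AND to a "virtual"
# sentence built from the n best-matching prior selections. Illustrative only.
from collections import Counter

WEIGHTS = {"high": 2.0, "scarce": 1.0, "negative": -2.0}  # hypothetical weights

def coverage(sentence: str, topic_terms: dict[str, str]) -> float:
    """topic_terms maps a term to its category; stop words are simply absent."""
    words = set(sentence.lower().split())
    return sum(WEIGHTS[cat] for term, cat in topic_terms.items() if term in words)

def overlap(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def is_novel(sentence: str, novel_so_far: list[str],
             n: int = 3, thr: float = 0.6) -> bool:
    s = set(sentence.lower().split())
    prior = [set(t.lower().split()) for t in novel_so_far]
    if any(overlap(s, p) > thr for p in prior):     # too close to one sentence
        return False
    # virtual sentence: union of the n prior sentences most similar to s
    best = sorted(prior, key=lambda p: overlap(s, p), reverse=True)[:n]
    virtual = set().union(*best) if best else set()
    return overlap(s, virtual) <= thr
```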
Bibtex
@inproceedings{DBLP:conf/trec/DkakiM04,
author = {Taoufiq Dkaki and Josiane Mothe},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {{TREC} Novelty Track at {IRIT-SIG}},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/irit-sig.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/DkakiM04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Novelty, Question Answering and Genomics: The University of Iowa Response¶
David Eichmann, Yi Zhang, Shannon Bradshaw, Xin Ying Qiu, Li Zhou, Padmini Srinivasan, Aditya Kumar Sehgal, Hudon Wong
- Participant: u.iowa
- Paper: http://trec.nist.gov/pubs/trec13/papers/uiowa.novelty.qa.geo.pdf
- Runs: UIowa04Nov14 | UIowa04Nov15 | UIowa04Nov11 | UIowa04Nov12 | UIowa04Nov13 | UIowa04Nov21 | UIowa04Nov31 | UIowa04Nov41 | UIowa04Nov22 | UIowa04Nov32 | UIowa04Nov42 | UIowa04Nov23 | UIowa04Nov24 | UIowa04Nov25 | UIowa04Nov33 | UIowa04Nov34 | UIowa04Nov35 | UIowa04Nov43 | UIowa04Nov44 | UIowa04Nov45
Abstract
Our system for novelty this year comprises three distinct variations. The first is a refinement of last year's system, involving named-entity occurrences, and functions as a comparative baseline. The second variation extends the baseline system in an exploration of the connection between word sense and novelty. The third variation involves more statistical similarity schemes, in the positive sense for relevance and the negative sense for novelty.
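A minimal sketch of a named-entity novelty baseline in the spirit of the first variation (a trivial capitalized-token heuristic stands in for the real named-entity recognizer the system used):

```python
# Sketch: a sentence counts as novel if it mentions named entities not
# seen in earlier sentences. Non-initial capitalized tokens stand in for
# real named-entity tagging here; the actual system used an NE recognizer.

def entities(sentence: str) -> set[str]:
    toks = sentence.split()
    return {t.strip(".,") for t in toks[1:] if t[:1].isupper()}

def novel_by_entities(sentences: list[str]) -> list[str]:
    seen: set[str] = set()
    out = []
    for s in sentences:
        ents = entities(s)
        if ents - seen:            # at least one previously unseen entity
            out.append(s)
        seen |= ents
    return out
```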
Bibtex
@inproceedings{DBLP:conf/trec/EichmannZBQZSSW04,
author = {David Eichmann and Yi Zhang and Shannon Bradshaw and Xin Ying Qiu and Li Zhou and Padmini Srinivasan and Aditya Kumar Sehgal and Hudon Wong},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Novelty, Question Answering and Genomics: The University of Iowa Response},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/uiowa.novelty.qa.geo.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/EichmannZBQZSSW04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
The University of Michigan in Novelty 2004¶
Günes Erkan
- Participant: u.michigan
- Paper: http://trec.nist.gov/pubs/trec13/papers/umichigan.novelty.pdf
- Runs: umich0411 | umich0412 | umich0413 | umich0414 | umich0415 | umich0421 | umich0422 | umich0423 | umich0424 | umich0425 | umich0433 | umich0431 | umich0432 | umich0434 | umich0435 | umich0441 | umich0442 | umich0443 | umich0444
Abstract
This year we participated in the Novelty track. To find the relevant sentences, we combine sentence salience features inherited from the text summarization domain with other heuristic features based on the topic statements. We propose a novel method to extract the new sentences based on graph-based ranking over the similarity relation between sentences.
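A minimal sketch of graph-based ranking over a sentence-similarity graph (a LexRank-style power-iteration simplification; the paper's exact method and thresholds may differ):

```python
# Sketch: build a graph whose nodes are sentences and whose edges link
# pairs above a similarity threshold, then rank sentences by a damped
# power-iteration centrality score. Threshold and damping are illustrative.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def rank_sentences(sentences: list[str], thr: float = 0.2,
                   iters: int = 30) -> list[float]:
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) >= thr else 0.0
            for j in range(n)] for i in range(n)]
    score = [1.0 / n] * n
    for _ in range(iters):          # plain power iteration with damping
        score = [0.15 / n + 0.85 * sum(adj[j][i] * score[j] / max(sum(adj[j]), 1.0)
                                       for j in range(n))
                 for i in range(n)]
    return score
```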
Bibtex
@inproceedings{DBLP:conf/trec/Erkan04,
author = {G{\"{u}}nes Erkan},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {The University of Michigan in Novelty 2004},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/umichigan.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/Erkan04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
UMass at TREC 2004: Novelty and HARD¶
Nasreen Abdul Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah S. Larkey, Xiaoyan Li, Mark D. Smucker, Courtney Wade
- Participant: u.mass
- Paper: http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf
- Runs: CIIRT1R2 | CIIRT1R1 | CIIRT1R3 | CIIRT1R5 | CIIRT1R6 | CIIRT3R1 | CIIRT3R2 | CIIRT3R3 | CIIRT2R1 | CIIRT2R2 | CIIRT3R4 | CIIRT3R5 | CIIRT4R1 | CIIRT4R2 | CIIRT4R3
Abstract
For the TREC 2004 Novelty track, UMass participated in all four tasks. Although finding relevant sentences was harder this year than last, we continue to show marked improvements over the baseline of calling all sentences relevant, with a variant of tfidf being the most successful approach. We achieve 5-9% improvements over the baseline in locating novel sentences, primarily by looking at the similarity of a sentence to earlier sentences and focusing on named entities. For the High Accuracy Retrieval from Documents (HARD) track, we investigated the use of clarification forms, fixed- and variable-length passage retrieval, and the use of metadata. Clarification form results indicate that passage level feedback can provide improvements comparable to user supplied related-text for document evaluation and outperforms related-text for passage evaluation. Document retrieval methods without a query expansion component show the most gains from related-text. We also found that displaying the top passages for feedback outperformed displaying centroid passages. Named entity feedback resulted in mixed performance. Our primary findings for passage retrieval are that document retrieval methods performed better than passage retrieval methods on the passage evaluation metric of binary preference at 12,000 characters, and that clarification forms improved passage retrieval for every retrieval method explored. We found no benefit to using variable-length passages over fixed-length passages for this corpus. Our use of geography and genre metadata resulted in no significant changes in retrieval performance.
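A minimal sketch of the two Novelty-track ideas mentioned above: a tfidf-weighted match against the topic for relevance, and a maximum-similarity-to-earlier-sentences test for novelty (the weighting and thresholds are illustrative, not UMass's tuned variant):

```python
# Sketch: score each sentence against the topic with tf.idf weights; call
# a relevant sentence novel when its maximum cosine similarity to earlier
# kept sentences falls below a threshold. Thresholds are illustrative.
import math
from collections import Counter

def tfidf_vectors(texts: list[str]) -> list[dict[str, float]]:
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(term for d in docs for term in d)
    n = len(docs)
    return [{t: c * math.log(n / df[t]) for t, c in d.items()} for d in docs]

def cosine(a: dict, b: dict) -> float:
    num = sum(w * b.get(t, 0.0) for t, w in a.items())
    den = (math.sqrt(sum(w * w for w in a.values()))
           * math.sqrt(sum(w * w for w in b.values())))
    return num / den if den else 0.0

def relevant_and_novel(topic: str, sentences: list[str],
                       rel_thr: float = 0.05, nov_thr: float = 0.4):
    vecs = tfidf_vectors(sentences + [topic])
    topic_vec, vecs = vecs[-1], vecs[:-1]
    relevant = [(s, v) for s, v in zip(sentences, vecs)
                if cosine(v, topic_vec) >= rel_thr]
    novel, kept = [], []
    for s, v in relevant:
        if max((cosine(v, u) for u in kept), default=0.0) < nov_thr:
            novel.append(s)
            kept.append(v)
    return [s for s, _ in relevant], novel
```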
Bibtex
@inproceedings{DBLP:conf/trec/JaleelACDLLSW04,
author = {Nasreen Abdul Jaleel and James Allan and W. Bruce Croft and Fernando Diaz and Leah S. Larkey and Xiaoyan Li and Mark D. Smucker and Courtney Wade},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {UMass at {TREC} 2004: Novelty and {HARD}},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/umass.novelty.hard.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/JaleelACDLLSW04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
ISI Novelty Track System for TREC 2004¶
Soo-Min Kim, Deepak Ravichandran, Eduard H. Hovy
- Participant: usc.isi.kim
- Paper: http://trec.nist.gov/pubs/trec13/papers/usc-isi.novelty.pdf
- Runs: ISIALL04 | ISIRUN204 | ISIRUN304 | ISIRUN404 | ISIRUN504
Abstract
We describe our system developed at ISI for the Novelty track at TREC 2004. The system's two modules recognize relevant event and opinion sentences respectively. We focused mainly on recognizing relevant opinion sentences using various opinion-bearing word lists. Of our 5 runs submitted for task 1, the best run provided an F-score of 0.390 (precision 0.30 and recall 0.71).
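A minimal sketch of opinion-sentence detection with a word list (the tiny list and threshold below are placeholders for ISI's much larger opinion-bearing word lists):

```python
# Sketch: flag a sentence as opinion-bearing when it contains enough
# words from an opinion word list. List and threshold are placeholders.

OPINION_WORDS = {"believe", "argue", "claim", "wrong", "should",
                 "excellent", "terrible", "unfortunately", "support", "oppose"}

def is_opinion_sentence(sentence: str, min_hits: int = 1) -> bool:
    tokens = {t.strip(".,;!?").lower() for t in sentence.split()}
    return len(tokens & OPINION_WORDS) >= min_hits

# Example: is_opinion_sentence("Critics argue the plan is wrong.") -> True
```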
Bibtex
@inproceedings{DBLP:conf/trec/KimRH04,
author = {Soo{-}Min Kim and Deepak Ravichandran and Eduard H. Hovy},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {{ISI} Novelty Track System for {TREC} 2004},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/usc-isi.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/KimRH04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Evolving XML and Dictionary Strategies for Question Answering and Novelty Tasks¶
Kenneth C. Litkowski
- Participant: clresearch
- Paper: http://trec.nist.gov/pubs/trec13/papers/clresearch.qa.novelty.pdf
- Runs: clr04n1h2 | clr04n1h3 | clr04n2 | clr04n3h1f1 | clr04n3h1f2 | clr04n3h2f1 | clr04n3h2f2 | clr04n4
Abstract
CL Research participated in the question answering and novelty tracks in TREC 2004. The Knowledge Management System (KMS), which provides a single interface for question answering, text summarization, information extraction, and document exploration, was used for these tasks. Question answering is performed directly within KMS, which answers questions either from a repository or the Internet. The novelty task was performed with the XML Analyzer, which includes many of the functions used in the KMS summarization routines. These tasks are based on creating and exploiting an XML representation of the texts used for these two tracks. For the QA track, we submitted one run and our overall score was 0.156, with scores of 0.161 for factoid questions, 0.064 for list questions, and 0.239 for “other” questions; these scores are significantly improved from TREC 2003. For the novelty track, we submitted two runs for task 1, one run for task 2, four runs for task 3, and one run for task 4. For most tasks, our scores were above the median. We describe our system in some detail, particularly emphasizing strategies that are emerging in the use of XML and lexical resources for the question answering and novelty tasks.
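A minimal sketch of the kind of per-sentence XML representation the abstract describes, built with the Python standard library (tag names are hypothetical; KMS's actual schema is far richer, with discourse, named-entity, and syntactic annotations):

```python
# Sketch: wrap a document's sentences in a simple XML representation that
# downstream analysis can exploit. Tag and attribute names are hypothetical.
import xml.etree.ElementTree as ET

def to_xml(doc_id: str, sentences: list[str]) -> ET.Element:
    root = ET.Element("document", id=doc_id)
    for i, s in enumerate(sentences, start=1):
        sent = ET.SubElement(root, "sentence", num=str(i))
        sent.text = s
    return root

# ET.tostring(to_xml("NYT001", ["First sentence.", "Second one."]))
```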
Bibtex
@inproceedings{DBLP:conf/trec/Litkowski04,
author = {Kenneth C. Litkowski},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Evolving {XML} and Dictionary Strategies for Question Answering and Novelty Tasks},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/clresearch.qa.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/Litkowski04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Improved Feature Selection and Redundance Computing - THUIR at TREC 2004 Novelty Track¶
Liyun Ru, Le Zhao, Min Zhang, Shaoping Ma
- Participant: tsinghua.ma
- Paper: http://trec.nist.gov/pubs/trec13/papers/tsinghua-ma.novelty.pdf
- Runs: THUIRnv0411 | THUIRnv0412 | THUIRnv0413 | THUIRnv0414 | THUIRnv0415 | THUIRnv0421 | THUIRnv0422 | THUIRnv0424 | THUIRnv0423 | THUIRnv0425 | THUIRnv0431 | THUIRnv0432 | THUIRnv0433 | THUIRnv0434 | THUIRnv0435 | THUIRnv0441 | THUIRnv0442 | THUIRnv0443 | THUIRnv0444 | THUIRnv0445
Abstract
This is the third year that the Tsinghua University Information Retrieval Group (THUIR) has participated in the Novelty task of TREC. Our research on this year's Novelty track mainly focused on four aspects: (1) text feature selection and reduction; (2) improved sentence classification for finding relevant information; (3) efficient sentence redundancy computing; (4) effective result filtering. All experiments were performed on the TMiner IR system, developed by the THUIR group last year.
Bibtex
@inproceedings{DBLP:conf/trec/RuZZM04,
author = {Liyun Ru and Le Zhao and Min Zhang and Shaoping Ma},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Improved Feature Selection and Redundance Computing - {THUIR} at {TREC} 2004 Novelty Track},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/tsinghua-ma.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/RuZZM04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Columbia University in the Novelty Track at TREC 2004¶
Barry Schiffman, Kathleen R. McKeown
- Participant: columbia.u.schiffman
- Paper: http://trec.nist.gov/pubs/trec13/papers/columbiau.novelty.pdf
- Runs: novcolp1 | novcolp2 | novcolrcl | novcosine | novcombo
Abstract
Our system for the Novelty Track at TREC 2004 looks beyond sentence boundaries as well as within sentences to identify novel, nonduplicative passages. It tries to identify text spans of two or more sentences that encompass mini-segments of new information. At the same time, we avoid any pairwise comparison of sentences, but rely on the presence of previously unseen terms to provide evidence of novelty. The system is guided by a number of parameters, both weights and thresholds, that are learned automatically with a randomized hill-climbing algorithm. During learning, we varied the target function to produce configurations that emphasize either precision or recall. We also implemented a straightforward vector-space model as a comparison and to test a combined approach.
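A minimal sketch of the unseen-terms signal described above, applied to two-sentence spans (the window size and threshold are illustrative placeholders; the real system learns its weights and thresholds by randomized hill-climbing):

```python
# Sketch: avoid pairwise sentence comparison; mark a span of sentences as
# novel when it introduces enough previously unseen terms. Window size and
# threshold are illustrative placeholders.

def novel_spans(sentences: list[str], window: int = 2,
                min_new: int = 3) -> list[str]:
    seen: set[str] = set()
    spans = []
    for i in range(0, len(sentences), window):
        span = sentences[i:i + window]
        terms = {t.lower().strip(".,") for s in span for t in s.split()}
        if len(terms - seen) >= min_new:   # enough unseen terms -> novel span
            spans.append(" ".join(span))
        seen |= terms
    return spans
```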
Bibtex
@inproceedings{DBLP:conf/trec/SchiffmanM04,
author = {Barry Schiffman and Kathleen R. McKeown},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Columbia University in the Novelty Track at {TREC} 2004},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/columbiau.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/SchiffmanM04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Meiji University Web, Novelty and Genomic Track Experiments¶
Tomoe Tomiyama, Kosuke Karoji, Takeshi Kondo, Yuichi Kakuta, Tomohiro Takagi
- Participant: meiji.u
- Paper: http://trec.nist.gov/pubs/trec13/papers/meijiu.web.novelty.geo.pdf
- Runs: HIL10 | MeijiHIL1cfs | MeijiHIL1odp | MeijiHIL2RS | MeijiHIL3 | MeijiHIL2WRS | MeijiHIL2WR | MeijiHIL3Tc | MeijiHIL3TSc | MeijiHIL2WCS | MeijiHIL2CS | MeijiHIL4WRS | MeijiHIL4WR | MeijiHIL4WRc | MeijiHIL4RS | MeijiHIL4RSc
Abstract
We participated in the Novelty track, the topic distillation task of the Web track, and the ad hoc task of the Genomic track. Our main challenge was to deal with the meaning of words and improve retrieval performance.
Bibtex
@inproceedings{DBLP:conf/trec/TomiyamaKKKT04,
author = {Tomoe Tomiyama and Kosuke Karoji and Takeshi Kondo and Yuichi Kakuta and Tomohiro Takagi},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Meiji University Web, Novelty and Genomic Track Experiments},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/meijiu.web.novelty.geo.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/TomiyamaKKKT04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Similarity Computation in Novelty Detection and Biomedical Text Categorization¶
Ming-Feng Tsai, Ming-Hung Hsu, Hsin-Hsi Chen
- Participant: ntu.chen
- Paper: http://trec.nist.gov/pubs/trec13/papers/ntu.novelty.pdf
- Runs: NTU11 | NTU12 | NTU13 | NTU14 | NTU15 | NTU21 | NTU22 | NTU23 | NTU24 | NTU25
Abstract
The novelty track was first introduced in TREC 2002. Given a TREC topic, the goal of the 2004 task is to locate relevant and new information from a set of documents. From the results in TREC 2002 and 2003, we realized that the major challenge in recognizing relevant sentences is the lack of information for similarity computation among sentences. This year, we utilized a method based on employing an information retrieval (IR) system to find relevant and novel sentences. This methodology, called IR with reference corpus, can also be considered an information expansion of sentences. A sentence is treated as a query against a reference corpus, and similarity between sentences is measured in terms of the weighting vectors of the document lists ranked by the IR system. Basically, relevant sentences are extracted by comparing their retrieval results on a certain IR system: two sentences are regarded as similar if the document lists the IR system returns for them are similar. For the novelty part, we used a similar approach to extract novel sentences from the relevant ones. We also present an effective dynamic threshold-setting approach based on the percentage of relevant sentences within a relevant document. In this paper, we pay attention to three points: first, how to utilize the results of an IR system to compare the similarity between sentences; second, how to filter out redundant sentences; third, how to determine appropriate relevance and novelty thresholds.
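A minimal sketch of the IR-with-reference-corpus idea: each sentence is issued as a query, and two sentences are compared via the ranked document lists the engine returns (the `search` function below is a hypothetical stand-in for their actual IR system over a reference corpus):

```python
# Sketch: represent each sentence by the weighted list of reference-corpus
# documents an IR engine retrieves for it, then compare sentences by the
# cosine of those document-weight vectors. `search` is a hypothetical
# interface: search(query, k) -> [(doc_id, score), ...].
import math

def doc_list_vector(sentence: str, search, top_k: int = 50) -> dict[str, float]:
    return {doc_id: score for doc_id, score in search(sentence, top_k)}

def similarity(s1: str, s2: str, search) -> float:
    v1, v2 = doc_list_vector(s1, search), doc_list_vector(s2, search)
    num = sum(w * v2.get(d, 0.0) for d, w in v1.items())
    den = (math.sqrt(sum(w * w for w in v1.values()))
           * math.sqrt(sum(w * w for w in v2.values())))
    return num / den if den else 0.0
```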
Bibtex
@inproceedings{DBLP:conf/trec/TsaiHC04,
author = {Ming{-}Feng Tsai and Ming{-}Hung Hsu and Hsin{-}Hsi Chen},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Similarity Computation in Novelty Detection and Biomedical Text Categorization},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/ntu.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/TsaiHC04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Experiments in TREC 2004 Novelty Track at CAS-ICT¶
Huaping Zhang, Hongbo Xu, Shuo Bai, Bin Wang, Xueqi Cheng
- Participant: cas.ict.wang
- Paper: http://trec.nist.gov/pubs/trec13/papers/cas.ict.novelty.pdf
- Runs: ICTVSMCOSAP | ICTOKAPIOVLP | ICTVSMFDBKH | ICTVSMFDBKL | ICTVSMLCE | ICT2VSMOLP | ICT2VSMIG95 | ICT2VSMLCE | ICT2OKAPIAP | ICT2OKALCEAP | ICT3VSMOLP | ICT3OKAPIIG | ICT3OKAPIOLP | ICT3OKAPFDBK | ICT4OVERLAP | ICT4IG | ICT4OKAPIIG | ICT4OKAAP | ICT4OVLPCHI
Abstract
The main task in the Novelty track is to retrieve relevant sentences and remove duplicates from a document set, given a TREC topic. This track took place for the first time in TREC 2002 and was refined into four tasks in TREC 2003. Besides 25 relevant documents, irrelevant ones are given in this year's Novelty track; in other words, a given document is either relevant or irrelevant to the topic. There are 1808 documents across the 50 TREC topics, with an average of 11.18 noise documents per topic; topic N75 alone has 45. Once an irrelevant document is mistaken as relevant, all results drawn from it are wrong. Apart from document retrieval, less information is available in the last three tasks than before: among the first 5 given documents, on average 3.14 are relevant and 2.76 are new, and 9 topics have no relevant sentence in the first 5 documents. In TREC 2004, ICT divided the Novelty track into four sequential stages: customized language parsing of the original dataset, document retrieval, sentence relevance, and novelty detection. The architecture is given in Figure 1. In the preprocessing stage, we applied sentence segmentation, tokenization, part-of-speech tagging, morphological analysis, stop-word removal, and query analysis to topics and documents. For query analysis, we categorized words in topics into description words and query words; the title, description, and narrative parts are all merged into the query with different weights. In the document and sentence retrieval stage, we introduced the vector space model (VSM) and its variants, the probabilistic model OKAPI, and a statistical language model. Based on VSM, we tried various query expansion strategies: pseudo-feedback, term expansion with synsets or synonyms from WordNet [1], and expansion with highly co-occurring local terms. For the novelty stage, we defined three types of novelty degree: word overlapping and its extension, similarity comparison, and information gain. In the last three tasks, we used the known results to adjust thresholds and estimate the number of results, and turned to classifiers such as inductive and transductive SVMs.
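A minimal sketch of the simplest of the three novelty degrees named above, word overlap against the sentences already selected (the similarity-comparison and information-gain variants follow the same shape; the threshold value is illustrative):

```python
# Sketch: the word-overlap novelty degree. A sentence is new when the
# fraction of its words already covered by previously selected sentences
# stays below a threshold. Threshold value is illustrative.

def overlap_ratio(sentence: str, history: set[str]) -> float:
    words = [w.lower().strip(".,") for w in sentence.split()]
    return sum(w in history for w in words) / len(words) if words else 1.0

def filter_novel(sentences: list[str], thr: float = 0.7) -> list[str]:
    history: set[str] = set()
    novel = []
    for s in sentences:
        if overlap_ratio(s, history) < thr:
            novel.append(s)
        history |= {w.lower().strip(".,") for w in s.split()}
    return novel
```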
Bibtex
@inproceedings{DBLP:conf/trec/ZhangXBWC04,
author = {Huaping Zhang and Hongbo Xu and Shuo Bai and Bin Wang and Xueqi Cheng},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {Experiments in {TREC} 2004 Novelty Track at {CAS-ICT}},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/cas.ict.novelty.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/ZhangXBWC04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
From the Texts to the Contexts They Contain: A Chain of Linguistic Treatments¶
Ahmed Amrani, Jérôme Azé, Thomas Heitz, Yves Kodratoff, Mathieu Roche
- Participant: u.paris.lri
- Paper: http://trec.nist.gov/pubs/trec13/papers/uparis.novelty2.pdf
- Runs: LRIaze1 | LRIaze2 | LRIaze3 | LRIaze4 | LRIaze5 | LRIaze22 | LRIaze32 | LRIaze42 | LRIaze52 | LRIaze12
Abstract
The text-mining system we are building deals with the specific problem of identifying the instances of relevant concepts present in texts. Our system therefore relies on interactions between a field expert and the various linguistic modules we use, often adapted from existing ones such as Brill's tagger or CMU's Link parser. We have developed learning procedures adapted to various steps of the linguistic treatment, mainly for grammatical tagging, terminology, and concept learning. Our interaction with the expert differs from classical supervised learning in that the expert is not simply a resource who can only provide examples and cannot provide the formalized knowledge underlying them. We are developing specific programming languages that enable the field expert to intervene directly in some of the linguistic tasks. Our approach is thus devoted to helping an expert in a field detect the concepts relevant to that field, using a large amount of text. It consists of two steps: the first is an automatic approach that finds relevant and novel sentences in the texts; the second is based on the expert's knowledge and finds more specific relevant sentences. Working on 50 different domains without an expert has been a challenge in itself, and explains our relatively poor results for the first Novelty task.
Bibtex
@inproceedings{DBLP:conf/trec/AmraniAHKR04,
author = {Ahmed Amrani and J{\'{e}}r{\^{o}}me Az{\'{e}} and Thomas Heitz and Yves Kodratoff and Mathieu Roche},
editor = {Ellen M. Voorhees and Lori P. Buckland},
title = {From the Texts to the Contexts They Contain: {A} Chain of Linguistic Treatments},
booktitle = {Proceedings of the Thirteenth Text REtrieval Conference, {TREC} 2004, Gaithersburg, Maryland, USA, November 16-19, 2004},
series = {{NIST} Special Publication},
volume = {500-261},
publisher = {National Institute of Standards and Technology {(NIST)}},
year = {2004},
url = {http://trec.nist.gov/pubs/trec13/papers/uparis.novelty2.pdf},
timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
biburl = {https://dblp.org/rec/conf/trec/AmraniAHKR04.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}