
Proceedings - Spanish 1995

A TREC Evaluation of Query Translation Methods For Multi-Lingual Text Retrieval

Mark W. Davis, Ted Dunning

Abstract

In a Multi-lingual Text Retrieval (MLTR) system, queries in one language are used to retrieve documents in several languages. Although all of the collection documents could be translated to a single language, a more efficient approach is to simply translate the queries into each of the document languages. We have investigated five methods for query translation that rely on lexical-transfer and corpus-based methods for creating multi-lingual queries. The resulting queries produced by these systems were then used in a competitive information-retrieval environment and the results evaluated by the TREC evaluation group.
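
The core idea, translating the query rather than the documents, can be illustrated with a minimal dictionary-based (lexical-transfer) sketch. The lexicon fragment, the policy of keeping every candidate translation, and the example terms are illustrative assumptions, not the specific methods evaluated in the paper.

# Minimal sketch of dictionary-based (lexical-transfer) query translation.
# The bilingual lexicon and the keep-all-candidates policy are illustrative
# assumptions, not the methods evaluated in the paper.

def translate_query(query_terms, bilingual_lexicon):
    """Map each source-language term to its target-language candidates."""
    translated = []
    for term in query_terms:
        candidates = bilingual_lexicon.get(term.lower())
        if candidates:                      # keep every candidate translation
            translated.extend(candidates)
        else:                               # untranslatable terms pass through
            translated.append(term)         # (e.g. proper names)
    return translated

# Hypothetical English->Spanish lexicon fragment, for illustration only.
lexicon = {"oil": ["petroleo", "aceite"], "spill": ["derrame"], "coast": ["costa"]}
print(translate_query(["oil", "spill", "Alaska", "coast"], lexicon))
# -> ['petroleo', 'aceite', 'derrame', 'Alaska', 'costa']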

Bibtex
@inproceedings{DBLP:conf/trec/DavisD95,
    author = {Mark W. Davis and Ted Dunning},
    editor = {Donna K. Harman},
    title = {A {TREC} Evaluation of Query Translation Methods For Multi-Lingual Text Retrieval},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/nmsu.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/DavisD95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Recent Experiments with INQUERY

James Allan, Lisa Ballesteros, James P. Callan, W. Bruce Croft, Zhihong Lu

Abstract

Past TREC experiments by the University of Massachusetts have focused primarily on ad-hoc query creation. Substantial effort was directed towards automatically translating TREC topics into queries, using a set of simple heuristics and query expansion. Less emphasis was placed on the routing task, although results were generally good. The Spanish experiments in TREC-3 concentrated on simple indexing, sophisticated stemming, and simple methods of creating queries. The TREC-4 experiments were a departure from the past. The ad-hoc experiments involved 'fine tuning' existing approaches, and modifications to the INQUERY term weighting algorithm. However, much of the research focus in TREC-4 was on the routing, Spanish, and collection merging experiments. These tracks more closely match our broader research interests in document routing, document filtering, distributed IR, and multilingual retrieval. The University of Massachusetts' experiments were conducted with version 3.0 of the INQUERY information retrieval system. INQUERY is based on the Bayesian inference network retrieval model. It is described elsewhere [7, 5, 12, 11], so this paper focuses on the relevant differences from the previously published algorithms.
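
As background for the abstract's reference to the inference network model and term weighting, here is a rough sketch of inference-network-style scoring, in which each matching term contributes a "belief" that an operator node combines. The default belief, the tf and idf shapes, and the averaging operator are illustrative choices, not INQUERY 3.0's actual weighting algorithm.

import math

# Hedged sketch of inference-network-style scoring: each matching query term
# contributes a "belief" between a default value and 1, and an operator node
# (here a plain average) combines the beliefs into a document score.
# The 0.4 default, the tf/idf shapes, and the averaging are illustrative
# choices, not the INQUERY 3.0 term-weighting algorithm.

def term_belief(tf, doc_len, avg_doc_len, df, num_docs, default=0.4):
    if tf == 0:
        return default
    tf_part = tf / (tf + 0.5 + 1.5 * doc_len / avg_doc_len)
    idf_part = math.log((num_docs + 0.5) / df) / math.log(num_docs + 1.0)
    return default + (1.0 - default) * tf_part * idf_part

def average_node(beliefs):
    return sum(beliefs) / len(beliefs)

# Toy example: two query terms scored against one 100-word document.
beliefs = [term_belief(3, 100, 120, 50, 10000),
           term_belief(0, 100, 120, 400, 10000)]
print(round(average_node(beliefs), 3))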

Bibtex
@inproceedings{DBLP:conf/trec/AllanBCCL95,
    author = {James Allan and Lisa Ballesteros and James P. Callan and W. Bruce Croft and Zhihong Lu},
    editor = {Donna K. Harman},
    title = {Recent Experiments with {INQUERY}},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/umass.ps.gz},
    timestamp = {Wed, 07 Jul 2021 16:44:22 +0200},
    biburl = {https://dblp.org/rec/conf/trec/AllanBCCL95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

New Retrieval Approaches Using SMART: TREC 4

Chris Buckley, Amit Singhal, Mandar Mitra

Abstract

The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 4, performing runs in the routing, ad-hoc, confused text, interactive, and foreign language environments.

Bibtex
@inproceedings{DBLP:conf/trec/BuckleySM95,
    author = {Chris Buckley and Amit Singhal and Mandar Mitra},
    editor = {Donna K. Harman},
    title = {New Retrieval Approaches Using {SMART:} {TREC} 4},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/Cornell\_trec4.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BuckleySM95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Similarity Measures for Short Queries

Ross Wilkinson, Justin Zobel, Ron Sacks-Davis

Abstract

Ad-hoc queries are usually short, of perhaps two to ten terms. However, in previous rounds of TREC we have concentrated on obtaining optimal performance for the long TREC topics. In this paper we investigate the behaviour of similarity measures on short queries, and show experimentally that two successful measures which give similar, good performance on long TREC topics do not work well for short queries. We explore methods for achieving greater effectiveness for short queries, and conclude that a successful approach is to combine these similarity measures with other evidence. We also briefly describe our experiments with the Spanish data.
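
The paper's conclusion, that short queries do better when similarity measures are combined with other evidence, can be sketched as simple score combination. The min-max normalization, the equal weights, and the toy scores below are assumptions for illustration only, not the combination the authors used.

# Hedged sketch of evidence combination for short queries: normalize each
# measure's document scores and add them. Equal weights and min-max
# normalization are illustrative assumptions, not the paper's method.

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def combine(*score_dicts):
    combined = {}
    for scores in map(normalize, score_dicts):
        for doc, s in scores.items():
            combined[doc] = combined.get(doc, 0.0) + s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

cosine_scores = {"d1": 0.82, "d2": 0.40, "d3": 0.75}   # toy values
okapi_scores  = {"d1": 6.1,  "d2": 7.9,  "d3": 3.2}
print(combine(cosine_scores, okapi_scores))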

Bibtex
@inproceedings{DBLP:conf/trec/WilkinsonZS95,
    author = {Ross Wilkinson and Justin Zobel and Ron Sacks{-}Davis},
    editor = {Donna K. Harman},
    title = {Similarity Measures for Short Queries},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/citri.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/WilkinsonZS95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

TREC-4 Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and POS Tagging of Spanish

Alan F. Smeaton, Fergus Kelledy, Ruairi O'Donnell

Abstract

In this paper we describe work done as part of the TREC-4 benchmarking exercise by a team from Dublin City University. In TREC-4 we carried out three activities, as follows. In work on improving the efficiency of standard SMART-like query processing we applied various thresholding processes to the posting lists of an inverted file and we limited the number of document score accumulators available during query processing. The first run we submitted for evaluation in TREC-4 (DCU951) used our best set of thresholding and accumulator-set parameters. The second run we submitted is based upon query expansion using terms from WordNet. Essentially, for each original query term we determine its level of specificity or abstraction; for broad terms we add more specific terms; for specific original terms we add broader ones; for terms in between we add both broader and narrower terms. Once the query is expanded we delete all the original query terms, in order to add to the judged pool documents that our expansion would find but that would not have been found by other retrieval runs. This is run DCU952. The third run we submitted was for the Spanish data. We ran the entire document corpus through a POS tagger and indexed documents (and queries) by a combination of the base form of non-stopwords plus their POS class. Retrieval is performed using SMART with extra weights for query and document terms depending on their POS class. The performance figures we obtained in terms of precision and recall are given at the end of the paper.
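
A minimal sketch of the kind of accumulator-limited, thresholded query evaluation the abstract describes follows. The processing order, the minimum-tf rule, and the accumulator cap are illustrative assumptions rather than the parameter settings used in run DCU951.

# Hedged sketch of accumulator-limited query evaluation with a posting
# threshold, in the spirit of the efficiency work described above.

def ranked_retrieval(query_terms, index, idf, max_accumulators=1000, min_tf=2):
    accumulators = {}
    # Process rarer (higher-idf) terms first so the accumulator budget is
    # spent on documents matching the most selective terms.
    for term in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        for doc_id, tf in index.get(term, []):
            if tf < min_tf:
                continue                          # posting falls below threshold
            if doc_id in accumulators:
                accumulators[doc_id] += tf * idf.get(term, 0.0)
            elif len(accumulators) < max_accumulators:
                accumulators[doc_id] = tf * idf.get(term, 0.0)
            # else: no free accumulators, so new documents are ignored
    return sorted(accumulators.items(), key=lambda kv: kv[1], reverse=True)

# Toy inverted index: term -> list of (doc_id, term frequency) postings.
index = {"petroleo": [("d1", 4), ("d2", 1)], "derrame": [("d1", 3), ("d3", 5)]}
idf = {"petroleo": 2.1, "derrame": 3.0}
print(ranked_retrieval(["petroleo", "derrame"], index, idf, max_accumulators=2))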

Bibtex
@inproceedings{DBLP:conf/trec/SmeatonKO95,
    author = {Alan F. Smeaton and Fergus Kelledy and Ruairi O'Donnell},
    editor = {Donna K. Harman},
    title = {{TREC-4} Experiments at Dublin City University: Thresholding Posting Lists, Query Expansion with WordNet and {POS} Tagging of Spanish},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/dublin.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/SmeatonKO95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Xerox Site Report: Four TREC-4 Tracks

Marti A. Hearst, Jan O. Pedersen, Peter Pirolli, Hinrich Schütze, Gregory Grefenstette, David A. Hull

Abstract

The Xerox research centers participated in four TREC-4 activities: the routing task, the filtering track, the Spanish track, and the interactive track. We addressed the core routing task as a problem in statistical classification: given a training set of judged documents, build an error-minimizing statistical classifier to assess the relevance of new test documents. This year, we built on the methodology developed in [21] by adding a combination strategy that pooled evidence across a number of separately trained classification schemes. Since many of our classifiers infer probability of relevance, adapting our routing methods to the filtering track consisted of obtaining probability estimates for the remaining classifiers and reporting those documents scoring above the probability thresholds determined by the three set linear utility functions. Our contribution to the Spanish track focussed on the effect of principled language analysis on a baseline retrieval system. We employed finite-state morphology [14] and hidden-Markov-model-based part-of-speech tagging [17] to analyze Spanish language text into canonical stemmed forms, and to identify verbs and noun phrases. Various combinations of these were then fed into SMART [1] for ranked retrieval. This year our activity on the ad hoc task focussed on the interactive track, which allows arbitrary user interaction in the process of finding relevant documents. We developed a graphical user interface to two interactive tools, Scatter/Gather [6] and Tilebars [11], and asked a number of subjects to use this tool to 'find as many good documents as you can for a topic, in around 30 minutes, without collecting too much rubbish'. We set up an experimental design to measure the value of each tool, and their combination, averaging out subject effects. That is, we were interested in determining how well the average user might perform with interactive tools rather than measuring the very best performance possible assuming an expert searcher. These efforts are described in more detail in the following sections.
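
The filtering step described above, reporting documents whose pooled probability of relevance exceeds a threshold, can be sketched as follows. Simple averaging and the fixed cutoff are illustrative assumptions, not the paper's combination strategy or its utility-derived thresholds.

# Hedged sketch of probability-threshold filtering: pool relevance
# probabilities from several classifiers and report documents above a cutoff.
# Averaging and the 0.8 threshold are illustrative assumptions only.

def filter_documents(doc_probs_per_classifier, threshold=0.8):
    """doc_probs_per_classifier: list of {doc_id: P(relevant)} dicts."""
    pooled = {}
    for probs in doc_probs_per_classifier:
        for doc_id, p in probs.items():
            pooled.setdefault(doc_id, []).append(p)
    return [doc_id for doc_id, ps in pooled.items()
            if sum(ps) / len(ps) >= threshold]

clf_a = {"d1": 0.92, "d2": 0.35, "d3": 0.81}    # toy probability estimates
clf_b = {"d1": 0.88, "d2": 0.50, "d3": 0.70}
print(filter_documents([clf_a, clf_b]))          # -> ['d1']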

Bibtex
@inproceedings{DBLP:conf/trec/HearstPPSGH95,
    author = {Marti A. Hearst and Jan O. Pedersen and Peter Pirolli and Hinrich Sch{\"{u}}tze and Gregory Grefenstette and David A. Hull},
    editor = {Donna K. Harman},
    title = {Xerox Site Report: Four {TREC-4} Tracks},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/xerox.ps.gz},
    timestamp = {Thu, 08 Oct 2020 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/HearstPPSGH95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Logistic Regression at TREC4: Probabilistic Retrieval from Full Text Document Collections

Fredric C. Gey, Aitao Chen, Jianzhang He, Jason Meggs

Abstract

The Berkeley experiments for TREC4 extend those of TREC3 in three ways: for ad-hoc retrieval we retain the manual reformulations of the topics and experiment with limited query expansion based upon the assumption that top documents are relevant (this experiment was an interesting failure); for routing retrieval we introduce a logistic regression which assumes relevance weights to be only one clue among several in predicting probability of relevance. Finally, for Spanish retrieval we retrain the basic logistic regression equations to apply to the statistical distributions of Spanish words. In addition we apply two approaches to Spanish stemming, one which attempts to resolve verb variants into a standardized form, the other of which eschews stemming in favor of a massive stop word list of variants of common words.
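
A minimal sketch of logistic-regression retrieval in the spirit of the abstract: the probability of relevance is a logistic function of several query-document clues. The clue definitions and coefficients below are placeholders, not the regression equations trained by the Berkeley group.

import math

# Hedged sketch of logistic-regression retrieval: P(relevant) is a logistic
# function of several query-document "clues". The clues and coefficients are
# illustrative placeholders, not trained values from the paper.

def probability_of_relevance(clues, coefficients, intercept):
    log_odds = intercept + sum(c * x for c, x in zip(coefficients, clues))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Example clues: mean log term frequency, mean log idf, number of matches.
clues = [1.2, 3.4, 5.0]
coefficients = [0.9, 0.6, 0.15]     # illustrative, not trained values
print(round(probability_of_relevance(clues, coefficients, -3.5), 3))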

Bibtex
@inproceedings{DBLP:conf/trec/GeyCHM95,
    author = {Fredric C. Gey and Aitao Chen and Jianzhang He and Jason Meggs},
    editor = {Donna K. Harman},
    title = {Logistic Regression at {TREC4:} Probabilistic Retrieval from Full Text Document Collections},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/berkeley.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GeyCHM95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Multi-lingual Text Filtering Using Semantic Modeling

James R. Driscoll, S. Abbott, K. Hu, M. Miller, G. Theis

Abstract

Semantic Modeling is used to investigate multilingual text filtering. In our approach, the Entity-Relationship (ER) Model is used as a basis for descriptions of information preferences (profiles) in the information filtering process. A profile is viewed as having both a static aspect and a dynamic aspect. The static aspect of a profile can be represented as an ER schema; and the dynamic aspect of the profile can be represented by synonyms of schema components and domain values for schema attributes. For TREC-4, the routing task and the Spanish adhoc task are accomplished using this technique. For the routing task, a large amount of time was spent in an effort to optimize filter performance using the training data that was available for the routing topics. For the Spanish adhoc task, a large amount of time was spent using external sources to develop good filters; in addition, some time was spent implementing a program to help port our approach to this second language. A multi-lingual (English, French, German, and Spanish) experiment is also reported.
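
A small sketch of what a profile with a static ER-style schema and a dynamic layer of synonyms and attribute domain values might look like; the profile contents and the simple hit-count scoring are illustrative assumptions, not the authors' semantic modeling system.

# Hedged sketch of an ER-style filtering profile: entities and attributes
# form the static schema, while synonyms and domain values supply the dynamic
# aspect. Hit-count scoring is an illustrative assumption.

profile = {
    "entities": {
        "vessel": {"synonyms": ["ship", "tanker", "barco", "buque"]},
        "spill":  {"synonyms": ["leak", "derrame", "vertido"]},
    },
    "attributes": {
        "location": {"domain": ["alaska", "gulf", "golfo"]},
    },
}

def score_document(text, profile):
    words = set(text.lower().split())
    hits = 0
    for entity in profile["entities"].values():
        if words & set(entity["synonyms"]):
            hits += 1
    for attribute in profile["attributes"].values():
        if words & set(attribute["domain"]):
            hits += 1
    return hits

print(score_document("Un derrame del buque cerca de Alaska", profile))  # -> 3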

Bibtex
@inproceedings{DBLP:conf/trec/DriscollAHMT95,
    author = {James R. Driscoll and S. Abbott and K. Hu and M. Miller and G. Theis},
    editor = {Donna K. Harman},
    title = {Multi-lingual Text Filtering Using Semantic Modeling},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {https://trec.nist.gov/pubs/trec4/papers/ucentral-florida.pdf},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/DriscollAHMT95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Improving Accuracy and Run-Time Performance for TREC-4

David A. Grossman, David O. Holmes, Ophir Frieder, Matthew D. Nguyen, Christopher E. Kingsbury

Abstract

For TREC-4, we enhanced our existing prototype that implements relevance ranking using the AT&T DBC-1012 Model 4 parallel database machine to support the entire document collection. Additionally, we developed a special purpose IR prototype to test a new index compression algorithm and to provide performance comparisons to the relational approach. We submitted official results for both automatic and manual adhoc queries for the entire 2GB English collection and the provided Spanish collection. Additionally, we submitted results using n-grams to process the corrupted data. In addition to implementing the vector-space model, we experimented with query reduction based on term frequency. Query reduction was shown to result in dramatically improved run-time performance and, in many cases, resulted in little or no degradation of precision/recall.
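
The query-reduction idea, dropping the least selective query terms to speed up processing, can be sketched as follows. The frequency-based ranking and the 50% reduction ratio are illustrative assumptions, not the paper's exact reduction rule.

# Hedged sketch of query reduction by term frequency: drop the query terms
# with the highest collection frequency (the least selective ones) before
# retrieval. The keep_fraction value is an illustrative assumption.

def reduce_query(query_terms, collection_freq, keep_fraction=0.5):
    ranked = sorted(query_terms, key=lambda t: collection_freq.get(t, 0))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]                      # keep only the rarest terms

freqs = {"the": 900000, "oil": 12000, "exploration": 800, "subsidy": 150}
print(reduce_query(["the", "oil", "exploration", "subsidy"], freqs))
# -> ['subsidy', 'exploration']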

Bibtex
@inproceedings{DBLP:conf/trec/GrossmanHFNK95,
    author = {David A. Grossman and David O. Holmes and Ophir Frieder and Matthew D. Nguyen and Christopher E. Kingsbury},
    editor = {Donna K. Harman},
    title = {Improving Accuracy and Run-Time Performance for {TREC-4}},
    booktitle = {Proceedings of The Fourth Text REtrieval Conference, {TREC} 1995, Gaithersburg, Maryland, USA, November 1-3, 1995},
    series = {{NIST} Special Publication},
    volume = {500-236},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1995},
    url = {http://trec.nist.gov/pubs/trec4/papers/gmu.ps.gz},
    timestamp = {Thu, 12 Mar 2020 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/GrossmanHFNK95.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}