Proceedings - Cross-Language 1998

Cross-Language Information Retrieval (CLIR) Track Overview

Martin Braschler, Jürgen Krause, Carol Peters, Peter Schäuble

Paper: 10.6028/NIST.SP.500-242.xlingual-overview

Abstract

This year, the TREC cross-language retrieval track took place for the second time. In TREC-7, we extended the task presented to the participants. The goal was for groups to use queries written in a single language in order to retrieve documents from a multilingual pool of documents written in many different languages. This is also a more comprehensive task than the usual definition of cross-language information retrieval, where systems work with a language pair, retrieving documents in a language L1 using queries in language L2. The document languages used this year were English, German, French, and, newly introduced for TREC-7, Italian. The queries were available in all of these languages. Because it seemed unlikely that all interested parties could work with all four languages, it was agreed that there would be a secondary evaluation involving a smaller task. Consequently, groups were allowed to send in runs using the English queries to retrieve documents from a subset of the pool containing just the English and French documents. Coordination of the track took place at ETH in Zurich, as it did last year. The continued interest in the cross-language track showed the importance of this emerging area. There are many applications where information should be accessible to users regardless of its language. With the ever growing amount of information available to us all, situations in which a user of an information retrieval system is faced with the task of querying a multilingual document collection are becoming increasingly common. Such collections can be made up of documents from multinational companies, from multilingual countries or from large international organizations such as the United Nations or the European Commission. Of course, the World Wide Web is also an example of such a document collection.
Many users of such multilingual data sources have some foreign language knowledge, but their proficiency may not be good enough to formulate queries that appropriately express their information need. Such users will benefit greatly if they can enter queries in their native language, because they will be able to inspect the documents even if they are untranslated. Monolingual users, on the other hand, can use translation aids, manual or automatic, to help them access the search results.

Bibtex

@inproceedings{DBLP:conf/trec/BraschlerKPS98,
    author = {Martin Braschler and J{\"{u}}rgen Krause and Carol Peters and Peter Sch{\"{a}}uble},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Cross-Language Information Retrieval {(CLIR)} Track Overview},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {1--8},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/CLIROverview_Trec7L.pdf.gz},
    timestamp = {Tue, 26 Mar 2019 00:00:00 +0100},
    biburl = {https://dblp.org/rec/conf/trec/BraschlerKPS98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-overview}
}

EMIR at the CLIR track of TREC7

Frédérique Bisson, Jérôme Charron, Christian Fluhr, Dominique Schmit

Participant: CEA
Paper: 10.6028/NIST.SP.500-242.xlingual-CEA
Runs: ceat7f1 | ceat7f2 | ceat7e1 | ceat7e2 | ceat7d1 | ceat7d2 | ceat7e1n | ceat7e2n

Abstract

EMIR (European Multilingual Information Retrieval) was a European ESPRIT project whose aim was to demonstrate the feasibility of cross-lingual interrogation based on the use of bilingual dictionaries. The project lasted from November 1990 to April 1994. Part of the results has been incorporated into a commercial product, 'SPIRIT', released by the T.GID Company in France.

Bibtex

@inproceedings{DBLP:conf/trec/BissonCFS98,
    author = {Fr{\'{e}}d{\'{e}}rique Bisson and J{\'{e}}r{\^{o}}me Charron and Christian Fluhr and Dominique Schmit},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{EMIR} at the {CLIR} track of {TREC7}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {281--286},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/CEA.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/BissonCFS98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-CEA}
}

ClickIR: Text Retrieval using a Dynamic Hypertext Interface

Richard C. Bodner, Mark H. Chignell

Participant: toronto
Paper: 10.6028/NIST.SP.500-242.xlingual-toronto

Abstract

In this report we describe our model of dynamic hypertext and how the ClickIR system uses this model to assist users in interactive search. The system was used in both the ad hoc task and the interactive track. In the context of the ad hoc task we were interested in the effects relevance feedback would have on our system. Comparison of ClickIR performance with and without relevance feedback showed that relevance feedback was critical in boosting the performance of the system from below median performance to the upper rank of TREC-7 systems. In the interactive track we compared the ClickIR (experimental) system where the tasks of querying and browsing were integrated, with a system which closely approximated a Web search engine, where the task of querying is separated from the task of browsing a list of hits. A trade-off between recall and precision was observed, with ClickIR leading to significantly greater recall, but at the expense of significantly lower precision and longer time taken to perform the task.

Bibtex

@inproceedings{DBLP:conf/trec/BodnerC98,
    author = {Richard C. Bodner and Mark H. Chignell},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {ClickIR: Text Retrieval using a Dynamic Hypertext Interface},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {506--515},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/uoftimg_trec7_report2.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/BodnerC98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-toronto}
}

TREC-7 Evaluation of Conceptual Interlingua Document Retrieval (CINDOR) in English and French

Anne Diekema, Farhad Oroumchian, Paraic Sheridan, Elizabeth D. Liddy

Participant: TextWise
Paper: 10.6028/NIST.SP.500-242.xlingual-TextWise
Runs: TW1E2EF | TW2F2EF | TW3E2F | TW4F2E

Abstract

TextWise LLC. participated in the TREC-7 Cross-Language Retrieval track using the CINDOR system, which utilizes a 'conceptual interlingua' representation of documents and queries. The current CINDOR research system uses a conceptual interlingua constructed around the Princeton WordNet, which we are mapping into French and Spanish. The use of an interlingual representation of documents and queries allows us to perform retrieval on any combination of supported languages, rather than having to rely on pairwise translations, while the use of a resource like WordNet allows us to match equivalent terms (including synonyms) across languages. Although the analysis of our TREC-7 results is clouded somewhat by the kinds of system errors which inevitably occur in a first-time evaluation over large TREC corpora, our evaluation of the conceptual interlingua approach suggests that it provides highly effective cross-language retrieval performance. In particular, we notice that the CINDOR system achieves cross-language retrieval results equivalent in many cases to corresponding monolingual queries, without the loss in retrieval precision observed in many other approaches to cross-language retrieval. Future work on the CINDOR system, which was evaluated here in its research prototype form, will focus on improving further the coverage of our conceptual interlingua resources and the efficiency of our document processing modules. We are also investigating the construction of an interlingual resource of proper nouns, using technology from other TextWise products, since proper nouns constitute the largest category of 'out-of-vocabulary' terms with respect to our current conceptual interlingua knowledge base. We will also continue to adapt the CINDOR system to handle more languages.
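
The interlingua idea in the abstract above can be illustrated with a toy sketch: terms from each language are mapped to shared concept identifiers, and matching happens on those identifiers rather than on translated words. Everything here is an assumption made for illustration. The concept table, the identifiers (`c001`, `c002`), and the function names `to_concepts` and `concept_overlap` are hypothetical and are not taken from the CINDOR system or from WordNet.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical concept table: (language, term) -> language-neutral concept id.
# Synonyms and cross-language equivalents share one id.
CONCEPTS: Dict[Tuple[str, str], str] = {
    ("en", "car"): "c001",
    ("en", "automobile"): "c001",
    ("fr", "voiture"): "c001",
    ("en", "dog"): "c002",
    ("fr", "chien"): "c002",
}


def to_concepts(tokens: Iterable[str], lang: str) -> Set[str]:
    """Map tokens to concept ids, silently dropping out-of-vocabulary terms."""
    found = set()
    for token in tokens:
        key = (lang, token.lower())
        if key in CONCEPTS:
            found.add(CONCEPTS[key])
    return found


def concept_overlap(query_tokens, query_lang, doc_tokens, doc_lang) -> float:
    """Fraction of query concepts that also occur in the document."""
    query_ids = to_concepts(query_tokens, query_lang)
    doc_ids = to_concepts(doc_tokens, doc_lang)
    if not query_ids:
        return 0.0
    return len(query_ids & doc_ids) / len(query_ids)
```

Because both sides are reduced to the same identifiers, adding a language in this sketch means adding rows to the table rather than a new translation step for every language pair. Terms missing from the table are simply ignored, which mirrors the out-of-vocabulary problem the abstract mentions for proper nouns.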

Bibtex

@inproceedings{DBLP:conf/trec/DiekemaOSL98,
    author = {Anne Diekema and Farhad Oroumchian and Paraic Sheridan and Elizabeth D. Liddy},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TREC-7} Evaluation of Conceptual Interlingua Document Retrieval {(CINDOR)} in English and French},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {116--127},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/textwis1.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/DiekemaOSL98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-TextWise}
}

TREC-7 Experiments at the University of Maryland

Douglas W. Oard

Participant: UMD
Paper: 10.6028/NIST.SP.500-242.xlingual-UMD
Runs: umdxeof | umdxeot

Abstract

The University of Maryland participated in three TREC-7 tasks: ad hoc retrieval, cross-language retrieval, and spoken document retrieval. The principal focus of the work was evaluation of merging techniques for cross-language text retrieval from mixed language collections. The results show that biasing the merging strategy in favor of documents in the query language can be helpful. Ad hoc and spoken document retrieval results are also presented.
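
As a rough illustration of what biasing a merging strategy toward the query language could look like, the sketch below normalises each per-language result list by its top score and then multiplies query-language scores by a constant. This is only one possible scheme, assumed for the example. The abstract does not say how the bias was applied, and the function name `merge_runs`, the `bias` parameter, and its default of 1.2 are hypothetical.

```python
from typing import Dict, List, Tuple

Ranked = List[Tuple[str, float]]  # (document id, retrieval score)


def merge_runs(runs: Dict[str, Ranked], query_lang: str, bias: float = 1.2) -> Ranked:
    """Merge per-language result lists into one ranking.

    Scores are divided by the best score in their own list so that lists
    from different collections are comparable, then documents written in
    the query language are multiplied by `bias` (an assumed scheme).
    """
    merged: Ranked = []
    for lang, results in runs.items():
        if not results:
            continue
        top = max(score for _, score in results)
        weight = bias if lang == query_lang else 1.0
        for doc_id, score in results:
            normalised = score / top if top > 0 else 0.0
            merged.append((doc_id, normalised * weight))
    # Highest combined score first.
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

With `bias` set to 1.0 the function reduces to plain top-score normalisation, so the effect of the bias can be checked by running the same lists through both settings.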

Bibtex

@inproceedings{DBLP:conf/trec/Oard98,
    author = {Douglas W. Oard},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TREC-7} Experiments at the University of Maryland},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {477--481},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/umdfinal.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Oard98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-UMD}
}

TREC-7 CLIR using a Probabilistic Translation Model

Jian-Yun Nie

Participant: montreal
Paper: 10.6028/NIST.SP.500-242.xlingual-montreal
Runs: RaliDicAPf2e | RaliAPf2e | RaliSDAe2f | RaliDicSDAef | RaliDicE2EF | RaliDicF2EF

Abstract

In this report, we describe the approach we used in the TREC-7 Cross-Language IR (CLIR) track. The approach is based on a probabilistic translation model estimated from a parallel training corpus (Canadian HANSARD). The problem of translating a query from one language to another (between French and English) becomes the problem of determining the most probable words that may appear in the translation of the query. In this paper, we will describe the principle of building the probabilistic model, and the runs we submitted using the model as a translation tool.
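
The step of "determining the most probable words that may appear in the translation of the query" can be sketched as follows, assuming a word-level table of translation probabilities is already available. The table values in the usage note are invented for illustration and do not come from the HANSARD corpus. Summing probabilities over the query words is an assumption of this sketch, not necessarily the combination rule used in the paper, and `translate_query` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# table[source_word][target_word] = estimated p(target_word | source_word)
TranslationTable = Dict[str, Dict[str, float]]


def translate_query(
    query_words: Iterable[str], table: TranslationTable, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_n target words ranked by accumulated probability."""
    scores: Dict[str, float] = defaultdict(float)
    for source in query_words:
        # Source words missing from the table contribute nothing.
        for target, prob in table.get(source, {}).items():
            scores[target] += prob
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

For example, with an invented table such as `{"droit": {"right": 0.6, "law": 0.3, "straight": 0.1}, "homme": {"man": 0.8, "human": 0.2}}`, the query `["droit", "homme"]` yields a weighted bag of English words that can be passed to a monolingual retrieval system as the translated query.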

Bibtex

@inproceedings{DBLP:conf/trec/Nie98,
    author = {Jian{-}Yun Nie},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {{TREC-7} {CLIR} using a Probabilistic Translation Model},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {482--488},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/Trec-nie.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Nie98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-montreal}
}

Ad hoc and Multilingual Information Retrieval at IBM

Martin Franz, J. Scott McCarley, Salim Roukos

Participant: ibm-franz
Paper: 10.6028/NIST.SP.500-242.xlingual-ibm-franz
Runs: ibmcl7al | ibmcl7cl | ibmcl7as | ibmcl7cs | ibmcl7ef

Abstract

IBM participated in two tracks at TREC-7: ad hoc and cross-language. In the ad hoc task we contrasted the performance of two different query expansion techniques: local context analysis and a probabilistic model. Two themes characterize IBM's participation in the CLIR track at TREC-7. The first is the use of statistical methods. In order to use the document translation approach, we built a fast (translation time within an order of magnitude of the indexing time) French→English translation model trained from parallel corpora. We also trained German→French and Italian→French translation models entirely from comparable corpora. The unique characteristic of the work described here is that all bilingual resources and translation models were learned automatically from corpora (parallel and comparable). The other theme is that the widely varying quality and availability of bilingual resources means that language pairs must be treated separately. We will describe methods for using one language as a pivot language in order to decrease the number of pairs, as well as methods for merging the results from several retrievals.
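
One simple way to picture the pivot-language idea is to compose two word-level translation tables through the pivot, so that a German→English table is derived from German→French and French→English tables without training the direct pair. This is a minimal sketch under that assumption. The abstract does not describe the models at this level of detail, and the function name `compose_via_pivot` and the table layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict

# table[source_word][target_word] = p(target_word | source_word)
TranslationTable = Dict[str, Dict[str, float]]


def compose_via_pivot(
    source_to_pivot: TranslationTable, pivot_to_target: TranslationTable
) -> TranslationTable:
    """Derive p(target | source) by summing over pivot words:

    p(t | s) = sum over f of p(f | s) * p(t | f)
    """
    composed: TranslationTable = {}
    for source, pivots in source_to_pivot.items():
        accumulated: Dict[str, float] = defaultdict(float)
        for pivot_word, p_pivot in pivots.items():
            # Pivot words with no onward translation are dropped.
            for target, p_target in pivot_to_target.get(pivot_word, {}).items():
                accumulated[target] += p_pivot * p_target
        composed[source] = dict(accumulated)
    return composed
```

With four languages, routing everything through one pivot needs a table for each non-pivot language to and from the pivot instead of one per ordered language pair, which is the reduction in pairs the abstract refers to. The cost in this sketch is that any ambiguity in the pivot step is carried into the composed table.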

Bibtex

@inproceedings{DBLP:conf/trec/FranzMR98,
    author = {Martin Franz and J. Scott McCarley and Salim Roukos},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Ad hoc and Multilingual Information Retrieval at {IBM}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {104--115},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/ibmy_t7_2.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/FranzMR98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-ibm-franz}
}

Twenty-One at TREC7: Ad-hoc and Cross-Language Track

Djoerd Hiemstra, Wessel Kraaij

Participant: TwentyOne
Paper: 10.6028/NIST.SP.500-242.xlingual-TwentyOne
Runs: tno7mx | tno7edp | tno7ddp | tno7egr | tno7edpx | tno7eef

Abstract

This paper describes the official runs of the Twenty-One group for TREC-7. The Twenty-One group participated in the ad-hoc and the cross-language track and made the following accomplishments: We developed a new weighting algorithm, which outperforms the popular Cornell version of BM25 on the ad-hoc collection. For the CLIR task we developed a fuzzy matching algorithm to recover from missing translations and spelling variants of proper names. Also for CLIR we investigated translation strategies that make extensive use of information from our dictionaries by identifying preferred translations, main translations and synonym translations, by defining weights of possible translations and by experimenting with probabilistic boolean matching strategies.
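
To give a feel for how fuzzy matching can recover spelling variants of proper names, the sketch below compares character trigrams with the Dice coefficient and returns index terms that are similar enough to an untranslated query term. Trigram overlap is a stand-in chosen for this example. The abstract does not say which fuzzy matching algorithm the group used, and the names `trigrams`, `dice`, and `fuzzy_candidates` and the 0.5 threshold are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def trigrams(word: str) -> Set[str]:
    """Character trigrams of a lower-cased word, padded at both ends."""
    padded = f"  {word.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over the two trigram sets (1.0 means identical)."""
    ta, tb = trigrams(a), trigrams(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb))


def fuzzy_candidates(
    term: str, vocabulary: Iterable[str], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Vocabulary entries whose similarity to `term` reaches the threshold."""
    scored = [(word, dice(term, word)) for word in vocabulary]
    kept = [(word, score) for word, score in scored if score >= threshold]
    # Best match first, ties broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

For instance, a name spelled "Jeltsin" in the query language shares most of its trigrams with "Yeltsin" in an English index, so the latter would be proposed as a match even though no dictionary entry links the two. The threshold trades recall against spurious matches and would need tuning on real data.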

Bibtex

@inproceedings{DBLP:conf/trec/HiemstraK98,
    author = {Djoerd Hiemstra and Wessel Kraaij},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Twenty-One at {TREC7:} Ad-hoc and Cross-Language Track},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {174--185},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/twentyone.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/HiemstraK98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.xlingual-TwentyOne}
}

Manual Queries and Machine Translation in Cross-Language Retrieval and Interactive Retrieval with Cheshire II at TREC-7

Fredric C. Gey, Hailing Jiang, Aitao Chen, Ray R. Larson

Participant: Berkeley
Paper: 10.6028/NIST.SP.500-242.interactive-Berkeley
Runs: BKYCL7ME | BKYCL7AG | BKYCL7AI | BKYCL7AF | BKYCL7AEF | BKYCL7MEF

Abstract

For TREC-7, the Berkeley ad-hoc experiments explored more phrase discovery in topics and documents. We utilized Boolean retrieval combined with probabilistic ranking for 17 topics in ad-hoc manual entry. Our cross-language experiments tested 3 different widely available machine translation software packages. For language pairs (e.g. German to French) for which no direct machine translation was available we made use of English as a universal intermediate language. For CLIR we also manually reformulated the English topics before doing machine translation, and this elicited a significant performance increase for both quad language retrieval and for English against English and French documents. In our Interactive Track entry eight searchers conducted eight searches each, half on the Cheshire II system and the other half on the Zprise system, for a total of 64 searches. Questionnaires were administered to gather information about basic demographic and searching experience, about each search, about each of the systems, and finally, about the user's perceptions of the systems.

Bibtex

@inproceedings{DBLP:conf/trec/GeyJCL98,
    author = {Fredric C. Gey and Hailing Jiang and Aitao Chen and Ray R. Larson},
    editor = {Ellen M. Voorhees and Donna K. Harman},
    title = {Manual Queries and Machine Translation in Cross-Language Retrieval and Interactive Retrieval with Cheshire {II} at {TREC-7}},
    booktitle = {Proceedings of The Seventh Text REtrieval Conference, {TREC} 1998, Gaithersburg, Maryland, USA, November 9-11, 1998},
    series = {{NIST} Special Publication},
    volume = {500-242},
    pages = {463--476},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {1998},
    url = {https://trec.nist.gov/pubs/trec7/papers/berkeley.trec7.pdf.gz},
    timestamp = {Tue, 07 Apr 2015 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/GeyJCL98.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org},
    doi = {10.6028/NIST.SP.500-242.interactive-Berkeley}
}