Proceedings - NeuCLIR 2022

Overview of the TREC 2022 NeuCLIR Track

Dawn J. Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang

Abstract

This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches on cross-language information retrieval. The main task in this year's track was ad hoc ranked retrieval of Chinese, Persian, or Russian newswire documents using queries expressed in English. Topics were developed using standard TREC processes, except that a topic developed by an annotator for one language was assessed by a different annotator when that topic was evaluated on another language. Twelve teams submitted a total of 172 runs.

Bibtex
@inproceedings{DBLP:conf/trec/LawrieMMMOSY22,
    author = {Dawn J. Lawrie and Sean MacAvaney and James Mayfield and Paul McNamee and Douglas W. Oard and Luca Soldaini and Eugene Yang},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Overview of the {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/Overview\_neuclir.pdf},
    timestamp = {Wed, 30 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LawrieMMMOSY22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

CFDA & CLIP at TREC 2022 NeuCLIR Track

Jia-Huei Ju, Wei-Chih Chen, Heng-Ta Chang, Cheng-Wei Lin, Ming-Feng Tsai, Chuan-Ju Wang

Abstract

In this notebook paper, we report our methods and submitted results for the NeuCLIR track at TREC 2022. We adopt the common multi-stage pipeline for the cross-language information retrieval (CLIR) task: machine translation, sparse passage retrieval, and cross-language passage re-ranking. In particular, we fine-tune cross-language passage re-rankers with different query formulation settings. In an empirical evaluation on the HC4 dataset, our passage re-rankers achieved better re-ranking effectiveness than the baseline multilingual re-rankers. We also report the evaluation results of our submitted NeuCLIR runs.
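
The three stages compose naturally. Below is a minimal sketch of the translate-retrieve-rerank flow, where translate, bm25_search, and rerank are hypothetical stand-ins for the MT system, the sparse retriever, and the fine-tuned cross-language re-ranker; it illustrates the architecture, not the authors' code.

    # Hedged sketch: translate / retrieve / rerank CLIR pipeline.
    # translate, bm25_search, and rerank are hypothetical stand-ins.
    def clir_pipeline(query_en, translate, bm25_search, rerank, k=1000):
        query_doclang = translate(query_en)           # stage 1: machine translation
        candidates = bm25_search(query_doclang, k=k)  # stage 2: sparse passage retrieval
        return rerank(query_en, candidates)           # stage 3: cross-language re-ranking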

Bibtex
@inproceedings{DBLP:conf/trec/JuCCLTW22,
    author = {Jia{-}Huei Ju and Wei{-}Chih Chen and Heng{-}Ta Chang and Cheng{-}Wei Lin and Ming{-}Feng Tsai and Chuan{-}Ju Wang},
    editor = {Ian Soboroff and Angela Ellis},
    title = {{CFDA} {\&} {CLIP} at {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/CFDA\_CLIP.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/JuCCLTW22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

HNUST @ TREC 2022 NeuCLIR Track

Ge Zhang, Qiwen Ye, Mengmeng Wang, Dong Zhou

Abstract

With the rapid development of deep learning, neural cross-language information retrieval (CLIR) has attracted extensive attention from researchers. Exploring the effectiveness of neural CLIR requires large-scale efforts and new platforms. To that end, the TREC 2022 NeuCLIR track presents a cross-language information retrieval challenge. This paper describes our first participation in the TREC 2022 NeuCLIR track. We explored two approaches to CLIR: (1) a lexical method, which consists of a translation step followed by a retrieval step, and (2) a neural method, which uses the DistilmBERT model in an end-to-end network. In our preliminary results, the lexical method performs better than the neural method.
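
As a concrete illustration of the lexical route, the sketch below runs BM25 over a toy document-language corpus using the rank_bm25 package, with a placeholder string standing in for the machine-translated query; it is an assumption-laden stand-in for the authors' translate-then-retrieve setup, not their system.

    # Hedged sketch of translate-then-retrieve; rank_bm25 is a stand-in retriever.
    from rank_bm25 import BM25Okapi

    corpus = ["перемирие подписано сегодня", "экономика выросла на два процента"]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    translated_query = "перемирие подписано"   # placeholder MT output for an English query
    scores = bm25.get_scores(translated_query.split())
    ranking = sorted(range(len(corpus)), key=scores.__getitem__, reverse=True)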

Bibtex
@inproceedings{DBLP:conf/trec/ZhangYWZ22,
    author = {Ge Zhang and Qiwen Ye and Mengmeng Wang and Dong Zhou},
    editor = {Ian Soboroff and Angela Ellis},
    title = {{HNUST} @ {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/F4.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ZhangYWZ22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Extremely Fast Fine-Tuning for Cross Language Information Retrieval via Generalized Canonical Correlation

John M. Conroy, Neil P. Molino, Julia S. Yang

Abstract

Recent work using language-agnostic transformer sentence embeddings shows promise for robust multilingual sentence representations. Our TREC submission tested how well these embeddings can be fine-tuned cheaply to perform cross-lingual information retrieval. We explore the use of the MS MARCO dataset with machine translations as a model problem. We demonstrate that a single generalized canonical correlation analysis (GCCA) model trained on previous queries significantly improves the ability of sentence embeddings to find relevant passages. The dominant computational cost of training is computing dense singular value decompositions (SVDs) of matrices derived from the fine-tuning data: one SVD per language retrieval view and query view, plus one. This approach illustrates that GCCA methods can serve as a rapid alternative to fine-tuning a neural network, allowing models to be re-fitted frequently based on a user's previous queries. The resulting model was used to prepare submissions for the NeuCLIR reranking task.
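
To make the SVD accounting concrete, here is a minimal numpy sketch of a MAXVAR-style GCCA: one thin SVD per view, plus one SVD of the concatenated singular-vector bases. It illustrates this family of methods on assumed toy inputs and is not the authors' implementation.

    # Hedged sketch of SVD-based (MAXVAR-style) generalized CCA.
    import numpy as np

    def gcca(views, r):
        """views: list of centered (n, d_i) matrices; r: shared dimension."""
        # One dense SVD per view...
        bases = [np.linalg.svd(X, full_matrices=False)[0] for X in views]
        # ...plus one SVD of the concatenated bases (k + 1 SVDs in total).
        G = np.linalg.svd(np.hstack(bases), full_matrices=False)[0][:, :r]
        # Per-view linear maps into the shared space.
        return G, [np.linalg.pinv(X) @ G for X in views]

    rng = np.random.default_rng(0)
    raw = [rng.normal(size=(100, d)) for d in (32, 48)]   # e.g. query and passage views
    views = [X - X.mean(axis=0) for X in raw]
    G, maps = gcca(views, r=8)
    shared = views[0] @ maps[0]   # rows of view 0 embedded in the shared space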

Bibtex
@inproceedings{DBLP:conf/trec/ConroyMY22,
    author = {John M. Conroy and Neil P. Molino and Julia S. Yang},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Extremely Fast Fine-Tuning for Cross Language Information Retrieval via Generalized Canonical Correlation},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/IDACCS.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/ConroyMY22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

KASYS at the TREC 2022 NeuCLIR Track

Kenya Abe

Abstract

This paper describes the KASYS team's participation in the TREC 2022 NeuCLIR track. Our approach is One-for-All, which employs a single multilingual pre-trained language model to retrieve documents in any language in response to an English query. The basic architecture is the same as ColBERT and its CLIR adaptation, ColBERT-X, but in our approach only a single model is trained, on a mixture of MS MARCO and its translated version, neuMARCO. Through the run submission, we evaluated two variants of the One-for-All approach: an end-to-end approach and a reranking approach. As the first-stage retriever, the former uses the approximate nearest-neighbor search proposed in ColBERT, while the latter reranks the track organizers' baseline run (the top 1,000 documents of that run serve as the first-stage retrieval results). To evaluate our runs, we used the results provided by the track organizers as a baseline (document translation). The official evaluation results showed that the reranking approach outperforms the baseline in all three languages, while the end-to-end approach achieved higher scores than the baseline only in Russian. In addition to our submissions to the TREC 2022 NeuCLIR track, we conducted experiments with the HC4 development data. The HC4 results showed a similar trend: the reranking approach was superior to the end-to-end approach in Persian and Russian. We also observed that, even within the same language, the performance of our approaches varies across datasets.
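
Both variants score documents with ColBERT-style late interaction. The sketch below shows the MaxSim scoring rule at the heart of ColBERT and ColBERT-X, with random unit vectors standing in for encoder output; it illustrates the mechanism only.

    # Hedged sketch of ColBERT-style late interaction ("MaxSim") scoring.
    import numpy as np

    def maxsim_score(Q, D):
        """Q: (query_tokens, dim), D: (doc_tokens, dim); rows L2-normalized.
        Each query token takes its best-matching document token; the maxima are summed."""
        return float((Q @ D.T).max(axis=1).sum())

    rng = np.random.default_rng(0)
    Q = rng.normal(size=(8, 128));   Q /= np.linalg.norm(Q, axis=1, keepdims=True)
    D = rng.normal(size=(180, 128)); D /= np.linalg.norm(D, axis=1, keepdims=True)
    print(maxsim_score(Q, D))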

Bibtex
@inproceedings{DBLP:conf/trec/Abe22,
    author = {Kenya Abe},
    editor = {Ian Soboroff and Angela Ellis},
    title = {{KASYS} at the {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/KASYS.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/Abe22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Naver Labs Europe (SPLADE) @ TREC NeuCLIR 2022

Carlos Lassance, Stéphane Clinchant

Abstract

This paper describes our participation in the 2022 TREC NeuCLIR challenge. We submitted runs for two of the three languages (Farsi and Russian), with a focus on first-stage rankers and on comparing monolingual strategies to ad hoc ones. For the monolingual runs, we start by pretraining models on the target language with MLM+FLOPS and then fine-tune on MS MARCO translated into that language, using either ColBERT or SPLADE as the retrieval model. For the ad hoc runs, we test both query translation (into the target language) and back-translation of the documents (into English). Initial analysis shows that the monolingual strategy is strong, but that for the moment ad hoc achieved the best results, with back-translating documents working better than translating queries.
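
For reference, SPLADE builds its sparse lexical vectors by passing masked-language-model logits through a ReLU, a log-saturation, and a max-pool over input positions. The numpy sketch below mimics that transformation on random stand-in logits; it is illustrative, not the NLE code.

    # Hedged sketch of the SPLADE pooling: log(1 + ReLU(logits)), max over positions.
    import numpy as np

    def splade_vector(mlm_logits):
        """mlm_logits: (seq_len, vocab_size) -> (vocab_size,) term weights."""
        return np.log1p(np.maximum(mlm_logits, 0.0)).max(axis=0)

    rng = np.random.default_rng(0)
    q_vec = splade_vector(rng.normal(size=(8, 30522)))    # query-side logits (stand-in)
    d_vec = splade_vector(rng.normal(size=(64, 30522)))   # document-side logits (stand-in)
    score = float(q_vec @ d_vec)                          # rank by sparse dot product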

Bibtex
@inproceedings{DBLP:conf/trec/LassanceC22,
    author = {Carlos Lassance and St{\'{e}}phane Clinchant},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Naver Labs Europe {(SPLADE)} @ {TREC} NeuCLIR 2022},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/NLE.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LassanceC22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

NeuralMind-UNICAMP at 2022 TREC NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval

Vitor Jeronymo, Roberto de Alencar Lotufo, Rodrigo Frassetto Nogueira

Abstract

This paper reports on a study of cross-lingual information retrieval (CLIR) using the mT5-XXL reranker on the NeuCLIR track of TREC 2022. Perhaps the biggest contribution of this study is the finding that, despite being fine-tuned only on query-document pairs in the same language, the mT5 model proved viable for CLIR tasks, where query and document are in different languages, even in the presence of suboptimal first-stage retrieval. The results show outstanding performance across all tasks and languages, leading to a high number of winning positions. Finally, this study provides valuable insights into the use of mT5 for CLIR and highlights its potential as a viable solution.
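
For context, mT5 rerankers in the monoT5 family score a query-document pair by the probability of a "yes" token at the first decoding step. The sketch below shows that pattern with Hugging Face transformers; the checkpoint name, prompt wording, and target tokens are assumptions in the monoT5/mMARCO style, not necessarily the authors' exact setup.

    # Hedged sketch of monoT5/mT5-style pointwise reranking.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    name = "unicamp-dl/mt5-base-mmarco-v2"   # assumed checkpoint name
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    def score(query, doc):
        inputs = tok(f"Query: {query} Document: {doc} Relevant:",
                     return_tensors="pt", truncation=True)
        start = torch.tensor([[model.config.decoder_start_token_id]])
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
        yes_id = tok.encode("yes", add_special_tokens=False)[0]   # assumed target token
        no_id = tok.encode("no", add_special_tokens=False)[0]
        return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()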

Bibtex
@inproceedings{DBLP:conf/trec/JeronymoLN22,
    author = {Vitor Jeronymo and Roberto de Alencar Lotufo and Rodrigo Frassetto Nogueira},
    editor = {Ian Soboroff and Angela Ellis},
    title = {NeuralMind-UNICAMP at 2022 {TREC} NeuCLIR: Large Boring Rerankers for Cross-lingual Retrieval},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/NM.unicamp.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/JeronymoLN22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Frassetto Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang

Abstract

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), the task of searching documents in one language with queries in another. However, the rapid pace of progress has led to a confusing panoply of methods, and reproducibility has lagged behind the state of the art. In this context, our work makes two important contributions: First, we provide a conceptual framework for organizing different approaches to cross-lingual retrieval, using multi-stage architectures for monolingual retrieval as a scaffold. Second, we implement simple yet effective reproducible baselines in the Anserini and Pyserini IR toolkits for the test collections of the TREC 2022 NeuCLIR Track, in Persian, Russian, and Chinese. Our efforts build on a collaboration of the two teams that submitted the most effective runs to the TREC evaluation. These contributions provide a firm foundation for future advances.
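
A first-stage BM25 run with Pyserini takes only a few lines. In the sketch below, the prebuilt-index name is an assumption (a local Lucene index path works the same way), and the query is a placeholder.

    # Hedged sketch: BM25 first stage with Pyserini.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index("neuclir22-fa")  # assumed index name
    hits = searcher.search("economic sanctions", k=1000)           # placeholder query
    for hit in hits[:5]:
        print(hit.docid, round(hit.score, 3))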

Bibtex
@inproceedings{DBLP:conf/trec/LinAJKLNORTYZ22,
    author = {Jimmy Lin and David Alfonso{-}Hermelo and Vitor Jeronymo and Ehsan Kamalloo and Carlos Lassance and Rodrigo Frassetto Nogueira and Odunayo Ogundepo and Mehdi Rezagholizadeh and Nandan Thakur and Jheng{-}Hong Yang and Xinyu Zhang},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/h2oloo.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/LinAJKLNORTYZ22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

HLTCOE at TREC 2022 NeuCLIR Track

Eugene Yang, Dawn J. Lawrie, James Mayfield

Abstract

The HLTCOE team applied ColBERT-X to the TREC 2022 NeuCLIR track with two training techniques: translate-train (TT) and multilingual translate-train (MTT). TT trains ColBERT-X with English queries and passages from the MS MARCO v1 collection automatically translated into the document language, yielding three cross-language models for the track, one per language. MTT creates a single model for all three document languages by combining the translations of MS MARCO passages in all three languages into mixed-language batches; the model thus learns to match queries to passages in all three languages simultaneously. While TT is more effective than MTT in each individual language due to its specificity, MTT still outperforms a strong baseline of BM25 with document translation. On average, MTT and TT perform 34% and 48% above the median MAP with title queries, respectively.
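
The mixed-language batching that distinguishes MTT from TT can be sketched in a few lines. The function below interleaves translated MS MARCO training triples from the three document languages; the data structures and language codes are illustrative, not the authors' training code.

    # Hedged sketch of multilingual translate-train (MTT) batch mixing.
    import random

    def mtt_batches(triples_by_lang, batch_size, seed=0):
        """triples_by_lang: {"zho": [...], "fas": [...], "rus": [...]} lists of
        (english_query, positive_passage, negative_passage) triples."""
        pool = [t for triples in triples_by_lang.values() for t in triples]
        random.Random(seed).shuffle(pool)      # so each batch mixes languages
        for i in range(0, len(pool), batch_size):
            yield pool[i:i + batch_size]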

Bibtex
@inproceedings{DBLP:conf/trec/YangLM22,
    author = {Eugene Yang and Dawn J. Lawrie and James Mayfield},
    editor = {Ian Soboroff and Angela Ellis},
    title = {{HLTCOE} at {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/hltcoe-jhu.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/YangLM22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Huawei Noah's Ark Lab at TREC NeuCLIR 2022

Ehsan Kamalloo, David Alfonso-Hermelo, Mehdi Rezagholizadeh

Abstract

In this paper, we describe our participation in the NeuCLIR track at TREC 2022. Our focus is on building strong ensembles of full-ranking models, including dense retrievers, BM25, and learned sparse models.
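
The abstract does not say how the ensemble combines its component runs; reciprocal rank fusion (RRF) is one standard way to merge rankings from heterogeneous retrievers and is sketched below purely as an illustration.

    # Hedged sketch: reciprocal rank fusion over rankings from several retrievers.
    from collections import defaultdict

    def rrf(rankings, k=60):
        """rankings: list of docid lists, best first; returns fused docid list."""
        scores = defaultdict(float)
        for ranking in rankings:
            for rank, docid in enumerate(ranking, start=1):
                scores[docid] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    fused = rrf([["d1", "d2", "d3"],    # e.g. dense run
                 ["d2", "d1", "d4"],    # e.g. BM25 run
                 ["d3", "d2", "d1"]])   # e.g. learned sparse run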

Bibtex
@inproceedings{DBLP:conf/trec/KamallooAR22,
    author = {Ehsan Kamalloo and David Alfonso{-}Hermelo and Mehdi Rezagholizadeh},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Huawei Noah's Ark Lab at {TREC} NeuCLIR 2022},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/huaweimtl.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/KamallooAR22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Non-Neural Baselines Experiments for CLIR at TREC 2022

Paul McNamee

Abstract

Cross-Language Information Retrieval (CLIR) returned to TREC with the advent of the NeuCLIR track in 2022. The track provided document collections in three languages: Chinese, Farsi, and Russian; the principal task involved ranking documents in response to English-language queries. Our goal in participating in the NeuCLIR track was to provide a statistical baseline for retrieval, for which we used the HAIRCUT retrieval engine. Experiments included character n-gram indexing, pseudo-relevance feedback, and collection enrichment.
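
Character n-gram indexing replaces word tokens with overlapping character substrings, which sidesteps morphology and segmentation issues across languages. The sketch below shows the tokenization step; the choice of n and the normalization are assumptions, not HAIRCUT's exact configuration.

    # Hedged sketch of character n-gram tokenization for indexing.
    def char_ngrams(text, n=4):
        text = " ".join(text.lower().split())   # normalize case and whitespace
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("cross language retrieval", n=4)[:6])
    # ['cros', 'ross', 'oss ', 'ss l', 's la', ' lan']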

Bibtex
@inproceedings{DBLP:conf/trec/McNamee22,
    author = {Paul McNamee},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Non-Neural Baselines Experiments for {CLIR} at {TREC} 2022},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/jhu.mcnamee.N.pdf},
    timestamp = {Tue, 29 Aug 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/McNamee22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}

Probabilistic Structured Queries: The University of Maryland at the TREC 2022 NeuCLIR Track

Suraj Nair, Douglas W. Oard

Abstract

The University of Maryland submitted three baseline runs to the Ad Hoc CLIR Task of the TREC 2022 NeuCLIR track. This paper describes three baseline systems that cross the language barrier using a well-known translation-based CLIR technique, Probabilistic Structured Queries.
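
In PSQ, each English query term is mapped to a probability distribution over document-language terms, and term statistics are combined under that distribution (Pirkola-style structured queries weighted by translation probabilities). The sketch below computes an expected term frequency with a toy translation table; all names and numbers are illustrative.

    # Hedged sketch of the core PSQ computation: expected term frequency.
    def psq_tf(query_term, doc_tf, translation_probs):
        """translation_probs: {query_term: {doc_lang_term: p(f|e)}};
        doc_tf: {doc_lang_term: raw term frequency in one document}."""
        return sum(p * doc_tf.get(f, 0)
                   for f, p in translation_probs[query_term].items())

    table = {"treaty": {"dogovor": 0.7, "soglashenie": 0.3}}
    doc = {"dogovor": 2, "mir": 1}
    print(psq_tf("treaty", doc, table))   # 1.4 expected occurrences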

Bibtex
@inproceedings{DBLP:conf/trec/NairO22,
    author = {Suraj Nair and Douglas W. Oard},
    editor = {Ian Soboroff and Angela Ellis},
    title = {Probabilistic Structured Queries: The University of Maryland at the {TREC} 2022 NeuCLIR Track},
    booktitle = {Proceedings of the Thirty-First Text REtrieval Conference, {TREC} 2022, online, November 15-19, 2022},
    series = {{NIST} Special Publication},
    volume = {500-338},
    publisher = {National Institute of Standards and Technology {(NIST)}},
    year = {2022},
    url = {https://trec.nist.gov/pubs/trec31/papers/umcp.N.pdf},
    timestamp = {Wed, 06 Sep 2023 01:00:00 +0200},
    biburl = {https://dblp.org/rec/conf/trec/NairO22.bib},
    bibsource = {dblp computer science bibliography, https://dblp.org}
}