Proceedings - NeuCLIR 2024
IRLab-AMS at TREC’24 NeuCLIR Track
Andrew Yates, Jia-Huei Ju
- Participant: IRLab-Amsterdam
- Paper: https://trec.nist.gov/pubs/trec33/papers/IRLab-Amsterdam.neuclir.pdf
- Runs: mlir-IRLabAmsterdam-ANEL-titledesc | mlir-IRLabAmsterdam-ANEL-desc | mlir-IRLabAmsterdam-ANEL-title | rus_irlab-ams-std-translate-llama-70B-api | zho_irlab-ams-std-translate-llama-70B-api | fas_irlab-ams-std-translate-llama-70B-api | fas_irlab-ams-postcite-v | rus_irlab-ams-postcite-v | zho_irlab-ams-postcite-v | zho_irlab-ams-std-recomp-llama-8B | rus_irlab-ams-std-recomp-llama-8B | fas_irlab-ams-std-recomp-llama-8B | fas_irlab-ams-postcite | rus_irlab-ams-postcite | zho_irlab-ams-postcite | fas_irlab-ams-std-translate-llama-8B | rus_irlab-ams-std-translate-llama-8B | zho_irlab-ams-std-translate-llama-8B | zho_irlab-ams-std-mdcomp-330-translate-llama-8B | rus_irlab-ams-std-mdcomp-330-translate-llama-8B | fas_irlab-ams-std-mdcomp-330-translate-llama-8B | fas_irlab-ams-std-mdcomp-331-translate-llama-8B | rus_irlab-ams-std-mdcomp-331-translate-llama-8B | zho_irlab-ams-std-mdcomp-331-translate-llama-8B
Abstract
In this notebook paper, we describe our participation as IRLab-AMS in the NeuCLIR track. We submitted results for two tasks: multilingual information retrieval (MLIR) and cross-language report generation (ReportGen). For MLIR, we explore learned sparse representations in multilingual retrieval settings. For ReportGen, we experiment with several pipelines for generating long-form reports, including standard retrieval-augmented generation (RAG) and post-hoc citation methods. Additionally, we add an extra retrieval augmentation module to address the limitations of the ad-hoc retriever. The module can serve distinct purposes, including relevance ranking, novelty ranking, and summarization, either individually or in combination.
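The post-hoc citation idea mentioned above can be illustrated with a small sketch: after a report is drafted, each sentence is matched back against the retrieved documents and the best-matching document id is attached as its citation. The function names and the simple token-overlap scoring here are illustrative assumptions, not the authors' implementation.

```python
def _overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def attach_citations(report_sentences, docs, threshold=0.1):
    """For each report sentence, cite the retrieved document with the
    highest overlap; leave it uncited when no document clears the threshold."""
    cited = []
    for sent in report_sentences:
        best_id, best_score = None, threshold
        for doc_id, text in docs.items():
            score = _overlap(sent, text)
            if score > best_score:
                best_id, best_score = doc_id, score
        cited.append((sent, best_id))
    return cited
```

A real pipeline would use a stronger matcher (e.g. a reranker or entailment model) in place of token overlap, but the control flow, drafting first and grounding afterwards, is what distinguishes post-hoc citation from standard RAG.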
Bibtex
@inproceedings{IRLab-Amsterdam-trec2024-papers-proc-1,
title = {IRLab-AMS at TREC’24 NeuCLIR Track},
author = {Andrew Yates and Jia-Huei Ju},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
Overview of the TREC 2024 NeuCLIR Track
Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang
- Participant: hltcoe
- Paper: https://trec.nist.gov/pubs/trec33/papers/hltcoe.neuclir.pdf
- Runs: fas-hltcoe-MNED-plaid_distill_engeng_zs2engfas | fas-hltcoe-MNED-plaid_distill_engeng_zs2fasfas | fas-hltcoe-MNED-plaid_distill_engfas | fas-hltcoe-MNED-plaid_distill_engfas.mt5rerank.gpt4rerank | fas-hltcoe-MNED-plaid_distill_engfas_450p | fas-hltcoe-MNED-plaid_distill_engfas_termpool2 | fas-hltcoe-MNED-plaid_distill_engmlir | fas-hltcoe-MNED-plaid_distill_engfas.mt5rerank | fas-hltcoe-MNED-plaid_distill_mono_fasfas | fas-hltcoe-MNED-plaid_eqsynms_distill_engfas | fas-hltcoe-MNED-plaid_syn_distill_engfas | fas-hltcoe-MNTEH-kitchen_rankfuse | fas-hltcoe-MNTEH-kitchen_rankfuse,mt5rerank | fas-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | fas-hltcoe-MTED-plaid_distill_engeng | rus-hltcoe-MNED-plaid_distill_engeng_zs2engrus | rus-hltcoe-MNED-plaid_distill_engeng_zs2rusrus | rus-hltcoe-MNED-plaid_distill_engmlir | rus-hltcoe-MNED-plaid_distill_engrus | rus-hltcoe-MNED-plaid_distill_engrus.mt5rerank | rus-hltcoe-MNED-plaid_distill_engrus.mt5rerank.gpt4rerank | rus-hltcoe-MNED-plaid_distill_engrus_450p | rus-hltcoe-MNED-plaid_distill_engrus_termpool2 | rus-hltcoe-MNED-plaid_distill_mono_rusrus | rus-hltcoe-MNED-plaid_eqsynms_distill_engrus | rus-hltcoe-MNED-plaid_syn_distill_engrus | rus-hltcoe-MNTEH-kitchen_rankfuse | rus-hltcoe-MNTEH-kitchen_rankfuse,mt5rerank | rus-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | rus-hltcoe-MTED-plaid_distill_engeng | zho-hltcoe-MNED-plaid_distill_engeng_zs2engzho | zho-hltcoe-MNED-plaid_distill_engeng_zs2zhozho | zho-hltcoe-MNED-plaid_distill_engmlir | zho-hltcoe-MNED-plaid_distill_engzho | zho-hltcoe-MNED-plaid_distill_engzho.mt5rerank | zho-hltcoe-MNED-plaid_distill_engzho.mt5rerank.gpt4rerank | zho-hltcoe-MNED-plaid_distill_engzho_450p | zho-hltcoe-MNED-plaid_distill_engzho_termpool2 | zho-hltcoe-MNED-plaid_distill_mono_zhozho | zho-hltcoe-MNED-plaid_eqsynms_distill_engzho | zho-hltcoe-MNED-plaid_syn_distill_engzho | zho-hltcoe-MNTEH-kitchen_rankfuse | zho-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank | 
zho-hltcoe-MTED-plaid_distill_engeng | zho-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | tech-hltcoe-MNED-plaid_distill_engeng_zs2engzho | tech-hltcoe-MNED-plaid_distill_engeng_zs2zhozho | tech-hltcoe-MNED-plaid_distill_engzho | tech-hltcoe-MNED-plaid_distill_zhozho | tech-hltcoe-MNED-plaid_syn_distill_engzho | tech-hltcoe-MNTEH-kitchen_rankfuse | tech-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank | tech-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | tech-hltcoe-MTED-plaid_distill_engeng | tech-hltcoe-MNED-plaid_eqsynms_distill_engzho | mlir-hltcoe-MNED-plaid_distill_clir_scorefuse | mlir-hltcoe-MNED-plaid_distill_clir.mt5rerank.scorefuse | mlir-hltcoe-MNED-plaid_distill_clir.mt5rerank.scorefuse.gpt4rerank | mlir-hltcoe-MNED-plaid_distill_engeng_zs | mlir-hltcoe-MNED-plaid_distill_mlir_bycoll_scorefuse | mlir-hltcoe-MNED-plaid_distill_mlir_mixedentry | mlir-hltcoe-MNED-plaid_distill_mlir_mixedentry_termpool2 | mlir-hltcoe-MNED-plaid_distill_mlir_mixedpass | mlir-hltcoe-MNED-plaid_distill_mlir_rr | mlir-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.scorefuse | mlir-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.scorefuse.gpt4rerank | mlir-hltcoe-MTED-plaid_distill_engeng | zho-jhu-orion-aggregated-w-gpt4o | rus-jhu-orion-aggregated-w-gpt4o | fas-jhu-orion-aggregated-w-gpt4o | zho-jhu-orion-aggregated-w-claude | rus-jhu-orion-aggregated-w-claude | fas-jhu-orion-aggregated-w-claude | fas-hltcoe-eugene-gpt4o | rus-hltcoe-eugene-gpt4o | fas-hltcoe-eugene-gpt35turbo | rus-hltcoe-eugene-gpt35turbo | zho-hltcoe-eugene-gpt35turbo | zho-hltcoe-eugene-gpt4o-fixed
Abstract
The principal goal of the TREC Neural Cross-Language Information Retrieval (NeuCLIR) track is to study the effect of neural approaches on cross-language information access. The track has created test collections containing Chinese, Persian, and Russian news stories and Chinese academic abstracts. NeuCLIR includes four task types: Cross-Language Information Retrieval (CLIR) from news, Multilingual Information Retrieval (MLIR) from news, Report Generation from news, and CLIR from technical documents. A total of 274 runs were submitted by five participating teams (and as baselines by the track coordinators) for eight tasks across these four task types. Task descriptions and the available results are presented.
Bibtex
@inproceedings{hltcoe-trec2024-papers-proc-2,
title = {Overview of the TREC 2024 NeuCLIR Track},
author = {Dawn Lawrie and Sean MacAvaney and James Mayfield and Paul McNamee and Douglas W. Oard and Luca Soldaini and Eugene Yang},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
HLTCOE at TREC 2024 NeuCLIR Track
Eugene Yang, Dawn Lawrie, Orion Weller, James Mayfield
- Participant: hltcoe
- Paper: https://trec.nist.gov/pubs/trec33/papers/hltcoe.neuclir.pdf
- Runs: fas-hltcoe-MNED-plaid_distill_engeng_zs2engfas | fas-hltcoe-MNED-plaid_distill_engeng_zs2fasfas | fas-hltcoe-MNED-plaid_distill_engfas | fas-hltcoe-MNED-plaid_distill_engfas.mt5rerank.gpt4rerank | fas-hltcoe-MNED-plaid_distill_engfas_450p | fas-hltcoe-MNED-plaid_distill_engfas_termpool2 | fas-hltcoe-MNED-plaid_distill_engmlir | fas-hltcoe-MNED-plaid_distill_engfas.mt5rerank | fas-hltcoe-MNED-plaid_distill_mono_fasfas | fas-hltcoe-MNED-plaid_eqsynms_distill_engfas | fas-hltcoe-MNED-plaid_syn_distill_engfas | fas-hltcoe-MNTEH-kitchen_rankfuse | fas-hltcoe-MNTEH-kitchen_rankfuse,mt5rerank | fas-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | fas-hltcoe-MTED-plaid_distill_engeng | rus-hltcoe-MNED-plaid_distill_engeng_zs2engrus | rus-hltcoe-MNED-plaid_distill_engeng_zs2rusrus | rus-hltcoe-MNED-plaid_distill_engmlir | rus-hltcoe-MNED-plaid_distill_engrus | rus-hltcoe-MNED-plaid_distill_engrus.mt5rerank | rus-hltcoe-MNED-plaid_distill_engrus.mt5rerank.gpt4rerank | rus-hltcoe-MNED-plaid_distill_engrus_450p | rus-hltcoe-MNED-plaid_distill_engrus_termpool2 | rus-hltcoe-MNED-plaid_distill_mono_rusrus | rus-hltcoe-MNED-plaid_eqsynms_distill_engrus | rus-hltcoe-MNED-plaid_syn_distill_engrus | rus-hltcoe-MNTEH-kitchen_rankfuse | rus-hltcoe-MNTEH-kitchen_rankfuse,mt5rerank | rus-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | rus-hltcoe-MTED-plaid_distill_engeng | zho-hltcoe-MNED-plaid_distill_engeng_zs2engzho | zho-hltcoe-MNED-plaid_distill_engeng_zs2zhozho | zho-hltcoe-MNED-plaid_distill_engmlir | zho-hltcoe-MNED-plaid_distill_engzho | zho-hltcoe-MNED-plaid_distill_engzho.mt5rerank | zho-hltcoe-MNED-plaid_distill_engzho.mt5rerank.gpt4rerank | zho-hltcoe-MNED-plaid_distill_engzho_450p | zho-hltcoe-MNED-plaid_distill_engzho_termpool2 | zho-hltcoe-MNED-plaid_distill_mono_zhozho | zho-hltcoe-MNED-plaid_eqsynms_distill_engzho | zho-hltcoe-MNED-plaid_syn_distill_engzho | zho-hltcoe-MNTEH-kitchen_rankfuse | zho-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank | 
zho-hltcoe-MTED-plaid_distill_engeng | zho-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | tech-hltcoe-MNED-plaid_distill_engeng_zs2engzho | tech-hltcoe-MNED-plaid_distill_engeng_zs2zhozho | tech-hltcoe-MNED-plaid_distill_engzho | tech-hltcoe-MNED-plaid_distill_zhozho | tech-hltcoe-MNED-plaid_syn_distill_engzho | tech-hltcoe-MNTEH-kitchen_rankfuse | tech-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank | tech-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.gpt4rerank | tech-hltcoe-MTED-plaid_distill_engeng | tech-hltcoe-MNED-plaid_eqsynms_distill_engzho | mlir-hltcoe-MNED-plaid_distill_clir_scorefuse | mlir-hltcoe-MNED-plaid_distill_clir.mt5rerank.scorefuse | mlir-hltcoe-MNED-plaid_distill_clir.mt5rerank.scorefuse.gpt4rerank | mlir-hltcoe-MNED-plaid_distill_engeng_zs | mlir-hltcoe-MNED-plaid_distill_mlir_bycoll_scorefuse | mlir-hltcoe-MNED-plaid_distill_mlir_mixedentry | mlir-hltcoe-MNED-plaid_distill_mlir_mixedentry_termpool2 | mlir-hltcoe-MNED-plaid_distill_mlir_mixedpass | mlir-hltcoe-MNED-plaid_distill_mlir_rr | mlir-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.scorefuse | mlir-hltcoe-MNTEH-kitchen_rankfuse.mt5rerank.scorefuse.gpt4rerank | mlir-hltcoe-MTED-plaid_distill_engeng | zho-jhu-orion-aggregated-w-gpt4o | rus-jhu-orion-aggregated-w-gpt4o | fas-jhu-orion-aggregated-w-gpt4o | zho-jhu-orion-aggregated-w-claude | rus-jhu-orion-aggregated-w-claude | fas-jhu-orion-aggregated-w-claude | fas-hltcoe-eugene-gpt4o | rus-hltcoe-eugene-gpt4o | fas-hltcoe-eugene-gpt35turbo | rus-hltcoe-eugene-gpt35turbo | zho-hltcoe-eugene-gpt35turbo | zho-hltcoe-eugene-gpt4o-fixed
Abstract
The HLTCOE team applied PLAID, an mT5 reranker, a GPT-4 reranker, score fusion, and document translation to the TREC 2024 NeuCLIR track. For PLAID we included a variety of models and training techniques: Translate-Distill (TD), Generate-Distill (GD), and Multilingual Translate-Distill (MTD). TD uses scores from the mT5 model over English MS MARCO query-document pairs to teach the student to score query-document pairs in which the documents are translated to match the CLIR setting. GD follows TD but uses passages from the collection and queries generated by an LLM as training examples. MTD uses MS MARCO translated into multiple languages, allowing experiments on how to batch the data during training. Finally, for report generation we experimented with system combination over different runs. One family of systems used either GPT-4o or Claude-3.5-Sonnet to summarize the retrieved results from a series of decomposed sub-questions; another system took the output from those two models and verified and combined them with Claude-3.5-Sonnet. The other family used GPT-4o and GPT-3.5-Turbo to extract and group relevant facts from the retrieved documents based on the decomposed queries; the resulting submissions directly concatenate the grouped facts to form the report, citing each fact's documents of origin. The team submitted runs to all NeuCLIR tasks: the CLIR and MLIR news tasks as well as the technical documents task and the report generation task.
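The Translate-Distill objective described in the abstract can be sketched as a standard distillation loss: a cross-encoder teacher (here, mT5) scores English query-document pairs, and the CLIR student is trained so that its scores over the *translated* documents match the teacher's relevance distribution. This is a minimal sketch under that assumption; the helpers below are illustrative, not the authors' code.

```python
import math

def softmax(scores):
    """Convert raw ranking scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_scores, student_scores):
    """KL(teacher || student) over one query's candidate list.

    teacher_scores: mT5 scores on English query-document pairs.
    student_scores: CLIR student scores on the translated documents.
    The loss is zero when the student reproduces the teacher's ranking
    distribution exactly, and grows as the distributions diverge.
    """
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In practice this loss is computed per training query over a sampled candidate list and minimized with gradient descent on the student; only the distillation signal, not the optimizer loop, is shown here.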
Bibtex
@inproceedings{hltcoe-trec2024-papers-proc-1,
title = {HLTCOE at TREC 2024 NeuCLIR Track},
author = {Eugene Yang and Dawn Lawrie and Orion Weller and James Mayfield},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}