
Proceedings - RAG TREC Instrument for Multilingual Evaluation (RAGTIME) 2025

WueRAG at RAGTIME 2025: Retrieval, Fusion, and Citation for Grounded Report Generation

Julia Wunderle, Julian Schubert, Joachim Baumeister, Andreas Hotho

Abstract

We present WueRAG, a retrieval-augmented generation pipeline for the TREC 2025 RAGTIME English report generation task. Our approach combines query reformulation, remote candidate retrieval, and local per-topic reranking using hybrid dense and lexical fusion. To ensure citation accuracy, grounding is enforced at two levels: first, through a generation phase that requires bracketed citations for factual claims, and second, via a postprocessing filter that removes any remaining unverified sentences. On the official evaluation, WueRAG achieved the highest F1 score (0.421) on the English subtask, indicating that combining multi-stage retrieval with explicit grounding constraints can effectively balance attribution accuracy and report quality.
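
As a rough illustration of the second grounding level, the sentence-level filter can be sketched in a few lines of Python; the bracketed-citation pattern, the sentence splitter, and the function name below are our own illustrative assumptions, not the authors' implementation.

    import re

    # Hypothetical citation pattern: any bracketed reference such as [doc-12].
    CITATION = re.compile(r"\[[^\]]+\]")

    def filter_ungrounded(report: str) -> str:
        """Drop every sentence that carries no bracketed citation."""
        # Naive sentence split on ., !, or ? followed by whitespace (an assumption).
        sentences = re.split(r"(?<=[.!?])\s+", report.strip())
        return " ".join(s for s in sentences if CITATION.search(s))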

Bibtex
@inproceedings{WueRAG-trec2025-papers-proc-1,
    title = {WueRAG at RAGTIME 2025: Retrieval, Fusion, and Citation for Grounded Report Generation},
    author = {Julia Wunderle and Julian Schubert and Joachim Baumeister and Andreas Hotho},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Question-Driven Multilingual Retrieval and Report Generation for the RAGTIME Track at TREC 2025

Suveyda Yeniterzi, Reyyan Yeniterzi

Abstract

We present GenAIus’s participation in the TREC 2025 RAGTIME track, focusing on multilingual retrieval and multilingual report generation in the news domain. Our approach follows a question-driven framework in which a set of targeted questions is generated for each report request and used to guide both document retrieval and report synthesis. For retrieval, we rely on the organizers’ multilingual search API and introduce a dynamic merging strategy that allocates an equal retrieval quota per generated question and aggregates scores across repeated document occurrences. For report generation, we explore two pipelines: a question-based approach that generates short, cited answers from multiple retrieved documents and synthesizes them into a final report, and a cluster-based approach that extracts nuggets from retrieved documents, clusters them by semantic similarity, and generates reports grounded in these structured clusters. We experiment with both proprietary and open-source LLMs for question generation, including GPT-4o and Llama3.3-70B. Our submissions achieve strong performance across tasks, including first place in MAP in the multilingual retrieval task and top rankings in several report-generation metrics. These results highlight the effectiveness of question-driven retrieval and structured evidence synthesis for multilingual report generation.
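
The dynamic merging strategy lends itself to a compact sketch: each generated question receives an equal slice of the retrieval budget, and scores of documents returned by several questions are summed. The function below is a minimal reading of that description; the names and the use of raw API scores are our own assumptions.

    from collections import defaultdict

    def merge_question_results(results_per_question, total_budget):
        # results_per_question: one ranked list of (doc_id, score) per generated question.
        quota = total_budget // len(results_per_question)  # equal quota per question
        aggregated = defaultdict(float)
        for ranked in results_per_question:
            for doc_id, score in ranked[:quota]:
                aggregated[doc_id] += score  # repeated occurrences accumulate score
        return sorted(aggregated.items(), key=lambda kv: kv[1], reverse=True)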

Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-6,
    title = {Question-Driven Multilingual Retrieval and Report Generation for the RAGTIME Track at TREC 2025},
    author = {Suveyda Yeniterzi and Reyyan Yeniterzi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen

Laura Dietz, Bryan Li, James Mayfield, Dawn Lawrie, Eugene Yang, William Walden

Abstract

The HLTCOE Evaluation team participated in several tracks focused on Retrieval-Augmented Generation (RAG), including RAG, RAGTIME, DRAGUN, and BioGen. Drawing inspiration from recent work on nugget-based evaluations, we introduce the Crucible system, which scrambles the traditional retrieval → generation → evaluation workflow of a RAG task by automatically curating a set of high-quality question-answer pairs (nuggets) from retrieved documents and then conditioning generation on this set. This enables us to study not only how effectively we can recover the set of gold nuggets for each request, but also how nugget-set quality impacts final performance.
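
Read as pseudocode, the Crucible reordering amounts to inserting a nugget-curation stage between retrieval and generation. The skeleton below is our schematic reading of the abstract; the callables and the Nugget type are placeholders, not the HLTCOE code.

    from dataclasses import dataclass

    @dataclass
    class Nugget:
        question: str
        answer: str
        doc_id: str  # provenance for later attribution

    def crucible_report(request, retrieve, curate_nuggets, generate):
        docs = retrieve(request)                 # standard first stage
        nuggets = curate_nuggets(request, docs)  # -> list[Nugget], the inserted stage
        return generate(request, nuggets)        # generation conditioned on the nugget set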

Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-2,
    title = {HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen},
    author = {Laura Dietz and Bryan Li and James Mayfield and Dawn Lawrie and Eugene Yang and William Walden},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Improving Completeness in Deep Research Agents through Targeted Enrichment

Jesse Wonnink

Abstract

Deep research agents represent a significant advance in AI-assisted information synthesis, capable of conducting comprehensive investigations that traditionally required substantial human effort. However, ensuring completeness in automatically generated research reports remains challenging: existing systems rely on ad-hoc query decomposition through prompt engineering, providing no formal guarantees about coverage or diversity, and evaluation frameworks often assess single dimensions rather than holistic report quality. This thesis addresses these limitations through three primary contributions. First, we propose a multi-dimensional framework that operationalizes completeness as three interdependent aspects: coverage (breadth and relevance of information), grounding (citation accuracy and factual consistency), and presentation quality (clarity and structural coherence). Second, we present HERO (High Enrichment Retrieval Orchestrator), a hierarchical deep research architecture that combines submodular optimization for query diversification with a novel two-stage enrichment mechanism that identifies and addresses information gaps through targeted follow-up investigation. Third, we conduct comprehensive evaluation across both academic (ScholarQABench) and general knowledge (DeepResearchGym) domains, enabling holistic evaluation of Deep Research agents. HERO achieves state-of-the-art performance on both benchmarks, with the highest coverage metrics (Key Point Recall of 67.63 on DeepResearchGym), strongest grounding (Citation F1: 91.57), and superior presentation quality scores. Ablation studies reveal that submodular optimization and hierarchical enrichment each contribute distinct improvements, with synergistic effects when combined. However, important limitations remain: our analysis reveals systematic sycophantic bias where the system adapts argumentative positions to match query framing, demonstrating that architectural improvements alone cannot overcome inherent behavioral patterns in foundation models. This work contributes both a concrete system demonstrating measurable improvements in report completeness and a framework for multi-dimensional evaluation of deep research agents. As these systems evolve toward deployment in high-stakes domains, the architectural principles and evaluation methodology established here provide a foundation for building reliable, comprehensive AI research assistants.
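
The abstract does not give HERO's objective, but submodular query diversification is commonly instantiated as greedy maximization of a facility-location function; the sketch below shows that generic recipe under our own assumptions (candidate queries as strings, a pairwise similarity callable).

    def greedy_diverse_queries(candidates, sim, k):
        # Greedily maximize the facility-location objective
        #   F(S) = sum over candidates c of max over q in S of sim(c, q);
        # submodularity yields the classic (1 - 1/e) approximation guarantee.
        selected = []
        best_cover = {c: 0.0 for c in candidates}  # similarity of c to the chosen set
        while len(selected) < min(k, len(candidates)):
            def marginal_gain(q):
                return sum(max(best_cover[c], sim(c, q)) - best_cover[c]
                           for c in candidates)
            q_star = max((q for q in candidates if q not in selected), key=marginal_gain)
            selected.append(q_star)
            for c in candidates:
                best_cover[c] = max(best_cover[c], sim(c, q_star))
        return selected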

Bibtex
@inproceedings{UvA-trec2025-papers-proc-1,
    title = {Improving Completeness in Deep Research Agents through Targeted Enrichment},
    author = {Jesse Wonnink},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Team IDACCS at TREC 2025: RAG and RAGTIME Tracks

John M. Conroy, Mike Green, Neil P. Molino, Yue “Ray” Wang, Julia S. Yang

Abstract

This paper gives an overview of team IDA/CCS’s submissions to the 2025 TREC RAG and RAGTIME tracks. Our approach builds on our 2024 RAG (team LAS) and NeuCLIR (team IDA/CCS) approaches. We started from our 2024 NeuCLIR system, tuning it on the NeuCLIR pilot data, and then adapted this approach for both the RAGTIME and RAG generation task submissions. As the 2025 RAGTIME task is multilingual rather than cross-lingual as in 2024, it was natural to examine stratified retrieval and compare it to multilingual ranking on the NeuCLIR pilot data. We found that stratified query retrieval with reranking, adapted from our RAG 2024 work, was particularly helpful for generating reports within the 2K and 10K character limits. In addition, we present work on improving extraction with occams and on attribution. Finally, we include a detailed meta-analysis of the automatic and semi-automatic metrics.
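
"Stratified retrieval" here contrasts with a single multilingual ranking: each language collection contributes a fixed number of candidates before one reranker orders the merged pool. A minimal sketch, with all function names assumed for illustration:

    def stratified_retrieve(query, search_by_language, k_per_language, rerank):
        # Take an equal candidate slice from each language stratum, then let a
        # single (cross-lingual) reranker order the merged pool.
        pool = []
        for language, search in search_by_language.items():
            pool.extend(search(query, k=k_per_language))
        return rerank(query, pool)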

Bibtex
@inproceedings{IDACCS-trec2025-papers-proc-1,
    title = {Team IDACCS at TREC 2025: RAG and RAGTIME Tracks},
    author = {John M. Conroy and Mike Green and Neil P. Molino and Yue ``Ray'' Wang and Julia S. Yang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks

Yue Wang, John M. Conroy, Neil Molino, Julia Yang, Mike Green

Abstract

This report describes submissions by the Laboratory for Analytic Sciences to the TREC 2025 RAG and RAGTIME tracks. By leveraging autonomous agent workflows, including query decomposition, planner-executor architectures, and ensemble retrieval techniques (e.g., BM25, SPLADE, T5 sentence embeddings), we examine whether “Agentic RAG” can surpass traditional RAG systems in terms of retrieval relevance, groundedness, factuality, and citation quality. Our evaluations using Open-RAG-Eval, Autonuggetizer, and other metrics indicate gains in nugget coverage, groundedness, and citation accuracy, albeit with trade-offs in retrieval relevance and factual consistency. In addition, we explore trade-offs among retrieval methodologies for the TREC RAG retrieval-only task and agentic generation for the RAGTIME report generation task.
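
The abstract names the ensemble's components (BM25, SPLADE, T5 sentence embeddings) but not how their scores are combined; a min-max-normalized weighted sum is one common recipe, sketched below purely as an assumption.

    def ensemble_fuse(runs, weights=None):
        # runs: {system_name: {doc_id: raw_score}} for BM25, SPLADE, dense, etc.
        # Min-max normalize each run, then take a weighted sum per document.
        weights = weights or {name: 1.0 for name in runs}
        fused = {}
        for name, scores in runs.items():
            lo, hi = min(scores.values()), max(scores.values())
            span = (hi - lo) or 1.0  # guard against a constant-score run
            for doc_id, s in scores.items():
                fused[doc_id] = fused.get(doc_id, 0.0) + weights[name] * (s - lo) / span
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)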

Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-1,
    title = {Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks},
    author = {Yue Wang and John M. Conroy and Neil Molino and Julia Yang and Mike Green},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Hybrid Sparse-Neural Fusion for Passage Retrieval

Georgios Arampatzis, Avi Arampatzis

Abstract

This paper studies multilingual information retrieval (MLIR) and report generation under retrieval-augmented evaluation settings, with an emphasis on robustness, reproducibility, and interpretability. We focus on efficient and lightweight transformer-based cross-encoder architectures for passage re-ranking in a multilingual retrieval scenario. Our approach follows a two-stage retrieval framework. In the first stage, BM25 is used for initial candidate selection, while in the second stage lightweight transformer-based cross-encoders (MiniLM, ELECTRA, and TinyBERT) are applied for passage re-ranking. To integrate multiple model predictions, we employ Reciprocal Rank Fusion (RRF), enabling robust aggregation across heterogeneous ranking signals. Unlike multilingual fine-tuning or agent-based approaches, our system relies exclusively on English-only cross-encoders applied in a zero-shot setting. This design allows us to assess the cross-lingual generalization capacity of compact re-ranking models under strict efficiency and reproducibility constraints. Experimental results from the validation phase show that, while our runs do not match the absolute effectiveness of large-scale multi-agent or generative systems, they achieve stable and interpretable performance given their lightweight architecture. Overall, the findings highlight the trade-off between retrieval effectiveness and computational efficiency, and demonstrate that compact re-ranking architectures combined with simple fusion strategies remain viable baselines for multilingual retrieval in low-resource and efficiency-constrained settings.
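
Reciprocal Rank Fusion itself is fully specified in the literature: each ranking contributes 1/(k + rank) per document, with k = 60 the usual constant from Cormack et al. (2009). A self-contained version:

    def reciprocal_rank_fusion(rankings, k=60):
        # rankings: one ranked list of doc ids per cross-encoder.
        # RRF score: sum over rankings of 1 / (k + rank of the document).
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)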

Bibtex
@inproceedings{DUTH-trec2025-papers-proc-2,
    title = {Hybrid Sparse-Neural Fusion for Passage Retrieval},
    author = {Georgios Arampatzis and Avi Arampatzis},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}