Proceedings - Retrieval Augmented Generation (RAG) 2025

NITATREC at TREC RAG 2025: A Report on Exploring Sparse, Dense, and Hybrid Retrieval for Retrieval-Augmented Generation

Aparajita Sinha, Kunal Chakma

Abstract

This paper describes our participation in the TREC RAG 2025 shared task, which investigates Retrieval-Augmented Generation methods for addressing complex information needs using the MS MARCO v2.1 document and segment collections. We submitted systems to all four subtasks: Retrieval (R), Augmented Generation (AG), Retrieval-Augmented Generation (RAG), and Relevance Judgment. For the retrieval task, we explored three approaches: a lexical BM25 baseline, a dense retrieval model based on DPR embeddings, and a hybrid pipeline combining sparse and dense retrieval with cross-encoder reranking. For the generation tasks, we employed an instruction-tuned language model to produce evidence-grounded responses with citations. Experimental results show that the hybrid retrieval system achieves the best performance, obtaining a MAP of 0.1037, an nDCG@30 of 0.527, and a Recall@100 of 0.158. In the generation tasks, the RAG system achieved a strict vital score of 0.19 and a weighted precision/recall of 0.472, while the AG submission achieved a slightly higher weighted precision/recall of 0.481. These results highlight the importance of combining lexical and semantic retrieval signals to improve Retrieval-Augmented Generation.
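A hybrid pipeline of the kind this abstract describes is often built by normalizing each ranker's scores and taking a weighted sum before reranking. The sketch below is an illustrative assumption (min-max normalization and the `alpha` weight are not stated in the paper), not the authors' exact method.

```python
# Hypothetical sketch: fuse lexical (BM25) and dense scores with min-max
# normalization and a weighted sum. Weights and the normalization scheme
# are assumptions for illustration only.

def minmax(scores):
    """Scale a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def hybrid_fuse(bm25_scores, dense_scores, alpha=0.6):
    """Weighted sum of normalized sparse and dense scores.

    Documents missing from one ranker contribute 0 from that side.
    Returns doc ids sorted by fused score, best first.
    """
    sparse = minmax(bm25_scores)
    dense = minmax(dense_scores)
    docs = set(sparse) | set(dense)
    fused = {d: alpha * sparse.get(d, 0.0) + (1 - alpha) * dense.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)
```

In a full pipeline, the top of this fused list would then be passed to a cross-encoder for reranking.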

Bibtex
@inproceedings{NIT_Agartala-trec2025-papers-proc-1,
    title = {NITATREC at TREC RAG 2025: A Report on Exploring Sparse, Dense, and Hybrid Retrieval for Retrieval-Augmented Generation},
    author = {Aparajita Sinha and Kunal Chakma},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Processing Queries with Complex Information Needs: University of Glasgow Terrier Team at TREC RAG 2025

Fangzheng Tian, Debasis Ganguly, Craig Macdonald

Abstract

In our participation in the TREC 2025 Retrieval-Augmented Generation (RAG) Track, we investigate how existing RAG frameworks can be adapted to answer descriptive queries with complex information needs. Unlike short topical queries, TREC RAG 2025 queries are expressed as paragraph-length descriptions that often involve multiple aspects of a topic. To address this setting, we adapt two existing RAG paradigms: (1) a single-hop RAG pipeline that decomposes the original query into sub-queries and answers them independently, and (2) an iterative agentic RAG pipeline that decomposes and answers sub-queries in a cascading manner. We submitted two runs that instantiated these two paradigms. Our results show that directly adapting existing RAG systems designed for short queries provides a practical baseline for this task, but remains limited by the quality of query decomposition and long-horizon generation. According to the evaluation results, the generated answers often address only part of the original information need and fail to cover some vital nuggets. These findings highlight the limitations of current RAG frameworks when applied to complex, multi-aspect queries and suggest directions for future research on query decomposition and generation strategies in this emerging setting.

Bibtex
@inproceedings{uogTr-trec2025-papers-proc-1,
    title = {Processing Queries with Complex Information Needs: University of Glasgow Terrier Team at TREC RAG 2025},
    author = {Fangzheng Tian and Debasis Ganguly and Craig Macdonald},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Submodular Evidence Selection for Grounded Answer Generation in TREC RAG 2025

Zizhen Li, Hai-Tao Yu

Abstract

This paper describes and analyzes four submissions to the TREC 2025 Retrieval-Augmented Generation Answer Generation (AG) task. All runs use the organizer-provided top-100 retrieved segments and differ only in evidence selection, answer construction, and post-hoc refinement. The primary system combines greedy submodular evidence selection, evidence-card compression, and citation-first answer generation, while three companion runs test post-hoc rewriting and a lighter concatenation-style baseline. On the organizer’s AG scores, the primary system family achieves the strongest coverage-oriented results, reaching a strict vital score of 0.35 and sub coverage of 0.51, compared with 0.18 and 0.37 for the unrefined concatenation baseline. Adding a refiner to the primary system preserves those two scores while improving weighted precision and weighted recall from 0.578 to 0.690. A refiner applied to the concatenation baseline reaches the highest weighted scores in the released package, but on weaker strict vital and sub coverage. Taken together, the results suggest that upstream evidence selection and grounding decisions matter more than surface-level rewriting for strong AG performance.
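Greedy maximization of a monotone submodular coverage objective, as this abstract describes, can be sketched as follows. The coverage function (word overlap with the query vocabulary) and tokenization here are illustrative assumptions; the paper's actual objective may differ.

```python
# Hypothetical sketch of greedy submodular evidence selection: treat each
# candidate segment as a set of content words and repeatedly pick the
# segment with the largest marginal coverage gain over the query terms.

def greedy_select(query, segments, k=3):
    """Greedy maximization of a monotone coverage function.

    segments: list of (seg_id, text) pairs. Returns up to k segment ids.
    """
    target = set(query.lower().split())
    seg_words = {sid: set(text.lower().split()) & target
                 for sid, text in segments}
    covered, chosen = set(), []
    for _ in range(k):
        # Marginal gain: how many uncovered target words each segment adds.
        gains = {sid: len(words - covered)
                 for sid, words in seg_words.items() if sid not in chosen}
        best = max(gains, key=gains.get)
        if gains[best] == 0:  # no remaining segment adds coverage
            break
        chosen.append(best)
        covered |= seg_words[best]
    return chosen
```

Because the coverage function is submodular, this greedy loop enjoys the classic (1 - 1/e) approximation guarantee for the optimal size-k subset.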

Bibtex
@inproceedings{WING-II-trec2025-papers-proc-1,
    title = {Submodular Evidence Selection for Grounded Answer Generation in TREC RAG 2025},
    author = {Zizhen Li and Hai-Tao Yu},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG

Paul Owoicho, Jeff Dalton

Abstract

This paper describes the GRILL Lab’s participation in the TREC 2025 Interactive Knowledge Assistance Track (IKAT) and the Retrieval-Augmented Generation (RAG) track, covering four sub-tasks: IKAT Passage Ranking/Response Generation, IKAT Simulation, RAG Retrieval Only, and RAG Full. Our approach centres on a modular, agentic pipeline that pursues high recall through iterative feedback. The system proceeds in three stages: (1) initial candidate generation via BM25; (2) document expansion using Query-by-Document techniques; and (3) an LLM-driven gap analysis phase in which the model identifies informational gaps and formulates supplementary queries. A key architectural feature is a fine-tuned GPT-4.1 nano binary relevance filter, trained on TREC CAsT 2022 and IKAT 2023 relevance judgments, which prunes irrelevant documents between each stage to contain topic drift.

Bibtex
@inproceedings{grilllab-trec2025-papers-proc-1,
    title = {GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG},
    author = {Paul Owoicho and Jeff Dalton},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

IIUoT at TREC 2025 Retrieval-Augmented Generation Track

Yating Zhang, Hai-Tao Yu

Abstract

In this paper, we present the University of Tsukuba’s submission to the TREC 2025 Retrieval-Augmented Generation (RAG) Track. Our work addresses the critical challenges of retrieval instability in deep candidate pools and the prevalent issue of hallucinated attributions in Large Language Models (LLMs). We propose a unified framework that tightly couples a progressive retrieval strategy with an evidence-constrained generation mechanism. To handle the limitations of context windows during retrieval, we employ a listwise LLM-based reranker utilizing a sliding window approach, effectively balancing recall and precision across broad document lists. In the generation phase, we introduce a method for claim-citation alignment that enforces a strict structural dependency, ensuring that every generated statement is immediately preceded by and grounded in a specific reference. By constraining the generation process to verified evidence indices, our system aims to produce highly attributable and factually consistent responses for open-domain information needs.

Bibtex
@inproceedings{ii_research-trec2025-papers-proc-1,
    title = {IIUoT at TREC 2025 Retrieval-Augmented Generation Track},
    author = {Yating Zhang and Hai-Tao Yu},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

From Nuggets to Clusters: Multi-Level Evidence Structuring for the TREC 2025 RAG Track

Reyyan Yeniterzi, Suveyda Yeniterzi

Abstract

We present GenAIus’s participation in the TREC RAG track, focusing on the augmented generation and relevance judgment tasks. For augmented generation, we build two LLM-based pipelines: a nugget-based approach that converts passages into concise evidence units for response generation, and a cluster-based approach that groups nuggets by subtopic before synthesizing a citation-grounded answer. For the relevance judgment task, we reuse the nuggets, clusters, and generated responses to automatically score passage relevance. We develop five ranking methods based on nugget counts, length-normalized nugget counts, cluster membership, unique cluster coverage, and citation frequency.

Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-4,
    title = {From Nuggets to Clusters: Multi-Level Evidence Structuring for the TREC 2025 RAG Track},
    author = {Reyyan Yeniterzi and Suveyda Yeniterzi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Keystone-Docs RAG at TREC 2025: Coverage-Aware Few-Document Augmentation via Narrative Decomposition

Yukio Uematsu, Koyuki Otani, Tomoyuki Shiroyama, Masaaki Tsuchida

Abstract

The proliferation of large language models (LLMs) with long-context capabilities has enabled Retrieval-Augmented Generation (RAG) to process numerous context segments. However, the introduction of a large number of segments has been reported to degrade RAG’s answer generation accuracy. Specifically, recent reports indicate that high-precision retrieval often returns hard negative examples—documents containing similar but differing opinions or facts—which can negatively influence LLM generation by introducing confusion. The Narrative inputs targeted by TREC this year (long texts composed of multiple sentences) require diverse and explanatory responses that account for shifting viewpoints, temporal sequences, and implied relationships, rather than single, fact-oriented answers. We define a Keystone-Doc as a document rich in information covering various aspects of a Narrative. We propose Keystone-Docs RAG, which aims to mitigate the aforementioned problems by selecting a minimal, yet sufficient, set of documents and providing them as RAG context.

Bibtex
@inproceedings{tus-trec2025-papers-proc-1,
    title = {Keystone-Docs RAG at TREC 2025: Coverage-Aware Few-Document Augmentation via Narrative Decomposition},
    author = {Yukio Uematsu and Koyuki Otani and Tomoyuki Shiroyama and Masaaki Tsuchida},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen

Laura Dietz, Bryan Li, James Mayfield, Dawn Lawrie, Eugene Yang, William Walden

Abstract

The HLTCOE Evaluation team participated in several tracks focused on Retrieval-Augmented Generation (RAG), including RAG, RAGTIME, DRAGUN, and BioGen. Drawing inspiration from recent work on nugget-based evaluations, we introduce the Crucible system, which scrambles the traditional retrieval → generation → evaluation workflow of a RAG task by automatically curating a set of high-quality question-answer pairs (nuggets) from retrieved documents and then conditioning generation on this set. This not only enables us to study how effectively we can recover the set of gold nuggets for each request but additionally how nugget set quality impacts final performance.

Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-2,
    title = {HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen},
    author = {Laura Dietz and Bryan Li and James Mayfield and Dawn Lawrie and Eugene Yang and William Walden},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

MIT Lincoln Laboratory at TREC 2025 Retrieval-Augmented Generation Track

Daniel Gwon, Kynnedy Smith, Nour Jedidi

Abstract

This paper describes MIT Lincoln Laboratory’s participation in the TREC 2025 RAG track. We made submissions to each of the Retrieval (R) and Retrieval-Augmented Generation (RAG) tasks. We focused primarily on various steps in the retrieval pipeline and used a strong, proprietary LLM for the generation step. We describe a multistage retrieval pipeline that combines query decomposition, learned sparse retrieval, and pointwise reranking to retrieve the highest-quality ranked list prior to generation. The LLM uses the ranked list to generate a response with citations using a standard prompt.

Bibtex
@inproceedings{MITLL-trec2025-papers-proc-1,
    title = {MIT Lincoln Laboratory at TREC 2025 Retrieval-Augmented Generation Track},
    author = {Daniel Gwon and Kynnedy Smith and Nour Jedidi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Team IDACCS at TREC 2025: RAG and RAGTIME Tracks

John M. Conroy, Mike Green, Neil P. Molino, Yue “Ray” Wang, Julia S. Yang

Abstract

This paper gives an overview of team IDA/CCS’s submissions to the 2025 TREC RAG and RAGTIME tracks. Our approach builds on our 2024 RAG (team LAS) and NeuCLIR (team IDA/CCS) approaches. We started with our 2024 NeuCLIR approach, tuning it on the NeuCLIR pilot data, and then adapted it for both the RAGTIME and RAG generation task submissions. As the 2025 RAGTIME task is multilingual, rather than cross-lingual as in 2024, it was natural to examine stratified retrieval and compare it to multilingual ranking using the NeuCLIR pilot data. We found that stratified query retrieval with reranking, adapted from our RAG 2024 work, was particularly helpful for generating reports within 2K and 10K character limits. In addition, we present work on improving extraction using occams and attribution. Finally, we include a detailed meta-analysis of the automatic and semi-automatic metrics.

Bibtex
@inproceedings{IDACCS-trec2025-papers-proc-1,
    title = {Team IDACCS at TREC 2025: RAG and RAGTIME Tracks},
    author = {John M. Conroy and Mike Green and Neil P. Molino and Yue ``Ray'' Wang and Julia S. Yang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks

Yue Wang, John M. Conroy, Neil Molino, Julia Yang, Mike Green

Abstract

This report describes submissions by the Laboratory for Analytic Sciences to the TREC 2025 RAG and RAGTIME tracks. By leveraging autonomous agent workflows, including query decomposition, planner-executor architectures, and ensemble retrieval techniques (e.g., BM25, SPLADE, T5 sentence embeddings), we examine whether “Agentic RAG” can surpass traditional RAG systems in terms of retrieval relevance, groundedness, factuality, and citation quality. Our evaluations using Open-RAG-Eval, Autonuggetizer, and other metrics indicate gains in nugget coverage, groundedness, and citation accuracy, albeit with trade-offs in retrieval relevance and factual consistency. In addition, we explored trade-offs in retrieval methodologies for the TREC RAG retrieval-only task, and agentic generation for the Report Generation task in RAGTIME.
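One standard way to ensemble heterogeneous rankers such as BM25, SPLADE, and dense sentence embeddings is Reciprocal Rank Fusion (RRF). The abstract does not specify the fusion rule, so the sketch below is an illustrative assumption rather than the team's actual implementation.

```python
# Hypothetical sketch: Reciprocal Rank Fusion over several ranked lists.
# Each ranker contributes 1 / (k + rank) per document; k (commonly 60)
# damps the influence of any single ranker's top positions.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; returns ids sorted best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization across rankers, which makes it convenient when mixing lexical and embedding-based systems whose raw scores live on different scales.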

Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-1,
    title = {Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks},
    author = {Yue Wang and John M. Conroy and Neil Molino and Julia Yang and Mike Green},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

UTokyo-HitU at TREC 2025 RAG Track: HyDE-Enhanced Sparse-Dense Retrieval Fusion with LLM Reranking

Sho Fukada, Atsushi Keyaki, Yusuke Matsui

Abstract

This paper describes our submission to TREC RAG 2025 from the University of Tokyo and Hitotsubashi University. Our approach integrates hybrid retrieval combining sparse retrievers (BM25 and SPLADE) with dense retrievers (BGE-small and Qwen3-Embedding-0.6B), Hypothetical Document Embeddings (HyDE) for query augmentation, and LLM-based reranking. A key contribution is our HyDE Vector Mix strategy, which creates a weighted combination of original query and hypothetical answer embeddings with a mixing ratio alpha. Our four-method hybrid retrieval system achieved first place in the Retrieval task among 12 participating teams (46 runs), with nDCG@30 of 0.693, nDCG@100 of 0.613, and recall@100 of 0.257. We participated in all three tasks: Retrieval (R), Augmented Generation (AG), and RAG.
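The HyDE Vector Mix described above — a weighted combination of the query embedding and a hypothetical-answer embedding with mixing ratio alpha — can be sketched as a convex combination followed by renormalization. The exact formula is an assumption inferred from the abstract's description.

```python
# Hypothetical sketch of the described HyDE Vector Mix: blend the query
# embedding with the embedding of an LLM-generated hypothetical answer
# using mixing ratio alpha, then L2-normalize the result.
import math

def hyde_mix(query_vec, hypo_vec, alpha=0.5):
    """Return the normalized combination alpha*query + (1-alpha)*hypothetical."""
    mixed = [alpha * q + (1 - alpha) * h
             for q, h in zip(query_vec, hypo_vec)]
    norm = math.sqrt(sum(x * x for x in mixed)) or 1.0  # guard zero vector
    return [x / norm for x in mixed]
```

With alpha = 1 the system degenerates to plain query retrieval, and with alpha = 0 to pure HyDE; intermediate values let the hypothetical answer steer the dense search without discarding the original query signal.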

Bibtex
@inproceedings{UTokyo-trec2025-papers-proc-1,
    title = {UTokyo-HitU at TREC 2025 RAG Track: HyDE-Enhanced Sparse-Dense Retrieval Fusion with LLM Reranking},
    author = {Sho Fukada and Atsushi Keyaki and Yusuke Matsui},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Justification Retrieval with LLMs, Retrieval-Augmented Generation, and Hybrid Labels

Georgios Arampatzis, Vasileios Perifanis, Avi Arampatzis

Abstract

This paper presents a hybrid justification labeling framework for retrieval-augmented generation (RAG), focusing exclusively on the Relevance Judgment (RJ) subtask of the TREC 2025 RAG Track. The proposed approach integrates open-weight large language models (LLMs) with traditional retrieval signals and confidence calibration mechanisms. Specifically, we combine deterministic pre-labeling using Qwen2.5-3B-Instruct and StableLM 2 1.6B Chat with multi-stage confidence normalization and lexical-overlap heuristics. This design enables small open-weight models to approximate the reasoning behavior of larger proprietary systems while remaining transparent and fully reproducible. We describe the end-to-end pipeline used for both automatic and semi-manual relevance judgment runs, analyze their validation consistency, and examine the impact of calibration parameters on justification quality and coverage. Empirical results indicate that hybrid confidence blending improves mid-range justification reliability and reduces variance across topics. All runs were validated using the official evaluation infrastructure and correspond solely to submissions for the RAG Relevance Judgment task.

Bibtex
@inproceedings{DUTH-trec2025-papers-proc-3,
    title = {Justification Retrieval with LLMs, Retrieval-Augmented Generation, and Hybrid Labels},
    author = {Georgios Arampatzis and Vasileios Perifanis and Avi Arampatzis},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

WaterlooClarke at TREC 2025

Siqing Huo, Charles L. A. Clarke

Abstract

Participating as the WaterlooClarke group, we focused on the RAG track; we also submitted runs for the DRAGUN track. For the full retrieval-augmented generation (RAG) task, we explored five pipelines: (1) a Nuggetizer pipeline; (2) Generate an Answer and support with Retrieved Evidence (GARE); (3) an Automatic Retrieval and Generation Plan; (4) a Combined pipeline; and (5) an Automatically Selected Best Response. For the DRAGUN task, we explored one-shot prompting with feedback in the loop.

Bibtex
@inproceedings{WaterlooClarke-trec2025-papers-proc-1,
    title = {WaterlooClarke at TREC 2025},
    author = {Siqing Huo and Charles L. A. Clarke},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}