Proceedings 2025¶
The 34th Text REtrieval Conference¶
Ian Soboroff, George Awad
Abstract
TREC 2025 is the thirty-fourth edition of the Text REtrieval Conference (TREC). The main goal of TREC is to create the evaluation infrastructure required for large-scale testing of information retrieval (IR) technology. This includes research on best methods for evaluation as well as development of the evaluation materials themselves. “Retrieval technology” is broadly interpreted to include a variety of techniques that enable and/or facilitate access to information that is not specifically structured for machine use. The TREC 2025 meeting was held online on December 11–12, 2025.
Bibtex
@inproceedings{conf-overview-proc,
title = {The 34th Text REtrieval Conference},
author = {Ian Soboroff and George Awad},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Adhoc Video Search¶
TREC 2025 Ad-hoc Video Search (AVS) Track Overview¶
George Awad
Abstract
The Ad-hoc Video Search (AVS) task at TREC continues to serve as a long-running benchmark for measuring progress in open-vocabulary video retrieval. The 2025 cycle builds on more than a decade of work and reflects the rapidly evolving landscape of multimodal and vision-language models. This overview describes the task design, dataset characteristics, evaluation protocol, participating teams, and the general retrieval trends observed during this assessment cycle.
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-3,
title = {TREC 2025 Ad-hoc Video Search (AVS) Track Overview},
author = {George Awad},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
WHU-NERCMS AT TRECVID2025: AD-HOC VEDIO SEARCH(AVS) AND VIDEO QUESTION ANSWER(VQA) TASK¶
Fangyun Duan, Haixiang Ni, Xiusong Wang, Chao Liang
- Paper: https://trec.nist.gov/pubs/trec34/papers/WHU-NERCMS.avs.vqa.pdf
- Participant: WHU-NERCMS
- Runs: Fuse all sub-models | HPA | Proportional fusion | BLIP BLIP2 CLIP LaCLIP SLIP diffusion
Abstract
The WHU-NERCMS team participated in the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tasks at TRECVID 2025. For the AVS task, we continued to use multiple visual semantic embedding methods, combined with ranking aggregation techniques to integrate different models and their outputs and generate the final ranked video shot list. For the VQA task, we propose using a vision-language model (VLM) to generate an answer that serves as a baseline. This answer is then embedded in the same vector space as the four options, and the similarities of these vectors are computed to rank the results.
Bibtex
@inproceedings{WHU-NERCMS-trec2025-papers-proc-1,
title = {WHU-NERCMS AT TRECVID2025: AD-HOC VEDIO SEARCH(AVS) AND VIDEO QUESTION ANSWER(VQA) TASK},
author = {Fangyun Duan and Haixiang Ni and Xiusong Wang and Chao Liang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Exploiting Temporal and Semantic Diversity: A Multi-Stage Retrieval–Reranking Pipeline for AVS 2025¶
Thuyen Tran Doan, Bao Tran, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh
- Paper: https://trec.nist.gov/pubs/trec34/papers/NII_UIT.avs.pdf
- Participant: NII_UIT
- Runs: T2V_VILA_NVILA_VideoLLaMA3_weights | Paraphrase_T2V_VILA_NVILA_VideoLLaMA3 | T2V_VILA_v2 | T2V_VILA_NVILA_VideoLLaMA3_v2 | T2V_VILA_NVILA_VideoLLaMA3_Aria
Abstract
With the explosive growth in video content and volume, efficient video retrieval systems have become increasingly essential. Our existing system, however, still underperforms on queries involving temporal or action-related information. This limitation stems from the reliance on Text-to-Image (T2I) retrieval models, such as BEiT and BLIP, whose architectures are inherently image-based. In contrast, Text-to-Video (T2V) retrieval models, such as CLIP4Clip and TS2Net, are built upon pretrained backbones like CLIP and incorporate simple yet effective temporal modeling mechanisms, which enhance the system’s ability to understand temporal aspects in textual queries. For our participation in the TRECVID 2025 Ad-hoc Video Search (AVS) task, we have integrated several T2V models into both the initial retrieval stage and the fusion step, in addition to the existing T2I models. This integration not only boosts the overall average precision (AP) score but also improves system diversity and recall. To further leverage the increased recall, we employ a reranking step using several Large Vision-Language Models (LVLMs). These models, equipped with advanced reasoning capabilities, can better interpret complex or ambiguous query elements, such as exclusion terms, that are often challenging for smaller T2I/T2V models to handle effectively. Evaluated on the AVS 24 and 25 main tasks, our system achieves xinfAP scores of 0.4334 and 0.4361, respectively, demonstrating the effectiveness of combining diverse T2V models with multi-LVLM reranking.
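A minimal sketch of the late-fusion step such a pipeline relies on: per-model similarity scores are combined with a weighted sum before reranking. The model names, weights, and scores below are hypothetical, not the NII_UIT team's actual configuration.

```python
# Weighted late fusion of per-model retrieval scores (illustrative only).
def fuse_scores(model_scores, weights):
    """model_scores: {model_name: {shot_id: score}}; returns shot ids
    ranked by the weighted sum of scores, highest first."""
    fused = {}
    for model, scores in model_scores.items():
        w = weights.get(model, 1.0)
        for shot, s in scores.items():
            fused[shot] = fused.get(shot, 0.0) + w * s
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical scores from two T2V models.
model_scores = {
    "t2v_a": {"shot1": 0.9, "shot2": 0.4},
    "t2v_b": {"shot1": 0.2, "shot2": 0.8, "shot3": 0.5},
}
ranking = fuse_scores(model_scores, {"t2v_a": 0.6, "t2v_b": 0.4})
```

A real system would then pass the top of `ranking` to the LVLM rerankers described in the abstract.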
Bibtex
@inproceedings{NII_UIT-trec2025-papers-proc-1,
title = {Exploiting Temporal and Semantic Diversity: A Multi-Stage Retrieval–Reranking Pipeline for AVS 2025},
author = {Thuyen Tran Doan and Bao Tran and Tien Do and Tien-Dung Mai and Thanh Duc Ngo and Duy-Dinh Le and Shin'ichi Satoh},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
VLM-based Binary Judgment Re-ranking for TREC 2025 Ad-hoc Video Search¶
Kazuya Ueki
- Paper: https://trec.nist.gov/pubs/trec34/papers/meisei.avs.pdf
- Participant: meisei
- Runs: tv25_Meisei_A1 | tv25_Meisei_A2 | tv25_Meisei_A3 | tv25_Meisei_A4
Abstract
We participated in the Ad-hoc Video Search (AVS) task at TREC 2025. Building upon our previous system, we aimed to further enhance search performance through a re-ranking approach. Our method employs multiple state-of-the-art Vision-Language Models (VLMs) to verify whether retrieved video shots truly match a given query, enabling more accurate semantic filtering. Among the 29 systems submitted, our four runs achieved the top four ranks, demonstrating the effectiveness of the proposed VLM-based binary judgment strategy. These results confirm the strong potential of recent VLMs to improve large-scale video retrieval.
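The binary-judgment idea can be sketched as follows: shots the verifier judges relevant are promoted ahead of the rest, preserving order within each group. The `judge` callable here is a stand-in for a real VLM query, and the shot ids and judgments are invented.

```python
# VLM-style binary judgment re-ranking (sketch; judge is a stand-in).
def rerank_by_binary_judgment(ranked_shots, judge):
    """Promote shots the binary judge accepts, keeping relative order."""
    accepted = [s for s in ranked_shots if judge(s)]
    rejected = [s for s in ranked_shots if not judge(s)]
    return accepted + rejected

ranked = ["s1", "s2", "s3", "s4"]
# Fabricated judgments in place of an actual VLM yes/no call.
fake_judgments = {"s1": False, "s2": True, "s3": True, "s4": False}
reranked = rerank_by_binary_judgment(ranked, fake_judgments.get)
```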
Bibtex
@inproceedings{meisei-trec2025-papers-proc-1,
title = {VLM-based Binary Judgment Re-ranking for TREC 2025 Ad-hoc Video Search},
author = {Kazuya Ueki},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Doshisha University at TREC 2025 AVS Task¶
Dai Morisaki, Miho Ohsaki, Kimiaki Shirahama
- Paper: https://trec.nist.gov/pubs/trec34/papers/ccilab.avs.pdf
- Participant: ccilab
- Runs: ccilab1
Abstract
This paper presents the results obtained by the Co-creation Informatics Laboratory (ccilab), Doshisha University, on the Ad-hoc Video Search (AVS) task. Our initial plan was to test the performance of a recent vision-language model, SigLip2 [1]. However, in our preliminary experiments, features extracted by the pre-trained SigLip2 available at [2] did not work at all. Thus, our submitted run F_M_C_D_ccilab.25.1 was obtained using a basic vision-language model, namely OpenAI CLIP [3], specifically the pre-trained one released at [4]. The MAPs of our submitted run are 0.082 and 0.055 when using the 2024 ground truth and the combination of the 2024 and 2025 ground truths, respectively.
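The CLIP-based retrieval described above reduces to cosine similarity between a text embedding and visual embeddings. A minimal sketch with tiny hypothetical 3-d vectors (real CLIP embeddings are 512-d or larger):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def rank_shots(query_vec, shot_vecs):
    """Rank shot ids by cosine similarity to the query embedding."""
    scored = {shot: cosine(query_vec, v) for shot, v in shot_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Invented embeddings standing in for CLIP text/image encoder outputs.
query = [1.0, 0.0, 0.0]
shots = {"shotA": [0.9, 0.1, 0.0], "shotB": [0.0, 1.0, 0.0]}
order = rank_shots(query, shots)
```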
Bibtex
@inproceedings{ccilab-trec2025-papers-proc-1,
title = {Doshisha University at TREC 2025 AVS Task},
author = {Dai Morisaki and Miho Ohsaki and Kimiaki Shirahama},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Laboratory for Analytic Sciences in TREC 2025 Ad-hoc Video Search¶
Edward Sheriff, John Nolan, Yue Wang, Xi Niu
- Paper: https://trec.nist.gov/pubs/trec34/papers/ncsu-las.avs.pdf
- Participant: ncsu-las
- Runs: gpt | fg-clip | phi-only | clap | phi-subgroup | decomp
Abstract
This paper describes the Laboratory for Analytic Sciences (LAS) participation in the 2025 TREC Ad-hoc Video Search (AVS) task on the V3C2 collection. Motivated by deployment settings with constrained bandwidth and compute, our systems use a scalable keyframe-based text-to-video retrieval pipeline with dense 1 fps indexing and cosine-similarity search. We profile contrastive vision–language embedding models to select an efficient visual backbone for large-scale indexing over 4.5M keyframes. At query time, we evaluate training-free enhancements aimed at improving recall and ranking, including LLM-based semantic expansion, modality-aware query decomposition, Vision–Language Model (VLM)–based relevance scoring and reranking, and selectively triggered CLAP audio fusion for topics implying non-speech sounds. Our results emphasize the role of Vision–Language Models (VLMs), including large multimodal generative models such as gpt-4.1-mini and phi-3.5-vision, as relevance judges. VLM scores align closely with human judgments on prior topics and provide a useful reranking signal, though some mismatches persist. Disagreements are largely attributable to lexical ambiguity, subjective topic phrasing, and temporal uncertainty from single-keyframe evidence. Across six official runs spanning four workflows, VLM reranking yields the largest gains, with VLM quality emerging as the most influential variable: when operating over identical candidate pools, gpt-4.1-mini improves performance on 19 of 20 topics relative to phi-3.5-vision. Semantic expansion followed by gpt-4.1-mini reranking achieves our best score of 0.399 mean infAP.
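A sketch of the keyframe-based search pattern described above (assumptions, not the LAS codebase): keyframes carry unit-normalized embeddings, the query is scored against every keyframe by dot product (equal to cosine on unit vectors), and each shot inherits the maximum score over its keyframes.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def search(query_vec, keyframes):
    """keyframes: list of (shot_id, embedding). Returns shot ids ranked
    by the max dot-product over each shot's keyframes."""
    q = normalize(query_vec)
    best = {}
    for shot, vec in keyframes:
        score = sum(a * b for a, b in zip(q, normalize(vec)))
        if score > best.get(shot, -1.0):
            best[shot] = score
    return sorted(best, key=best.get, reverse=True)

# Hypothetical keyframe embeddings; the real index holds ~4.5M of these
# sampled at 1 fps.
keyframes = [
    ("shot1", [1.0, 0.1, 0.0]),
    ("shot1", [0.2, 0.9, 0.0]),
    ("shot2", [0.0, 1.0, 0.2]),
]
ranked = search([1.0, 0.0, 0.0], keyframes)
```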
Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-2,
title = {Laboratory for Analytic Sciences in TREC 2025 Ad-hoc Video Search},
author = {Edward Sheriff and John Nolan and Yue Wang and Xi Niu},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
AFRL at TREC 2025: Zero-Shot Human-Free Video Search with Caption Expansion¶
Andrew Young, Emily Conway, Jeremy Gwinnup
- Paper: https://trec.nist.gov/pubs/trec34/papers/AFRL.avs.pdf
- Participant: AFRL
- Runs: InternVL3 Baseline
Abstract
We describe the Air Force Research Laboratory’s submission to the TREC 2025 Ad-hoc Video Search (AVS) task. Our approach addresses the challenge of indexing and searching video content without human-generated metadata by employing modern multimodal large language models for caption generation. We upgrade from traditional short-caption baselines generating 7-word descriptions to state-of-the-art models producing up to 100-word descriptions with inter-frame context awareness. Our system integrates three components: caption expansion for richer semantic coverage, context-aware learning through weighted concept banks and unlikelihood training for better embeddings, and query decomposition with question-answering models for precise re-ranking. While our official submission encountered critical integration failures resulting in poor performance, we present our methodology, analyze the failures, and demonstrate the viability of our core approach through preliminary validation results.
Bibtex
@inproceedings{AFRL-trec2025-papers-proc-1,
title = {AFRL at TREC 2025: Zero-Shot Human-Free Video Search with Caption Expansion},
author = {Andrew Young and Emily Conway and Jeremy Gwinnup},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025¶
Andreas Goulas, Damianos Galanopoulos, Ioannis Patras, Vasileios Mezaris
- Paper: https://trec.nist.gov/pubs/trec34/papers/CERTH-ITI.avs.vqa.pdf
- Participant: CERTH-ITI
- Runs: run_1 | run_2 | run_3 | run_4
Abstract
This paper presents the overview of the runs related to the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tracks of TRECVID 2025 on behalf of the CERTH-ITI team. For the AVS track, we introduce a two-stage framework built on foundation models. In the first stage, multiple vision–language models (VLMs) encode both the input query, augmented through LLM-generated rephrasings, and the candidate video shots, producing weighted similarity scores for initial retrieval. In the second stage, we utilize a Multimodal-LLM (MLLM)-based reranking module that evaluates the semantic alignment between each shot among the top-N highest-ranked ones and the original query, generating updated relevance scores for reordering these shots. This MLLM-driven reranking significantly improves contextual matching and produces more accurate final rankings without requiring any model training. Regarding the VQA track, we fine-tune an audio-visual MLLM on the provided TRECVID training dataset and implement an inference-time scaling technique to enhance the multimodal understanding capabilities of the MLLM. For the open-ended Answer Generation (AG) task, we aggregate multiple model responses per question via a majority vote. The responses are generated with greedy sampling from different random frame subsets of the video and they are ranked based on the number of votes. For the Multiple-Choice (MC) task, instead of voting, we use mean pooling on the logits assigned by the fine-tuned model to each candidate response. Through the combination of fine-tuning and frame subset ensembling we achieve the highest score across 3 metrics in the VQA AG task and the second highest in the VQA MC task.
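The two ensembling modes described above can be sketched with made-up model outputs: majority voting over answers generated from different frame subsets (AG), and mean pooling of per-option logits across subsets (MC).

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent answer; ties resolve by first occurrence."""
    return Counter(answers).most_common(1)[0][0]

def mean_pool_choice(logits_per_subset):
    """Average per-option logits across frame subsets, return argmax index."""
    n = len(logits_per_subset)
    num_options = len(logits_per_subset[0])
    pooled = [sum(run[i] for run in logits_per_subset) / n
              for i in range(num_options)]
    return max(range(num_options), key=pooled.__getitem__)

# Fabricated answers/logits standing in for MLLM outputs over frame subsets.
ag = majority_vote(["a dog", "a cat", "a dog"])
mc = mean_pool_choice([[0.1, 2.0, 0.3, 0.0], [0.2, 1.5, 0.4, 0.1]])
```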
Bibtex
@inproceedings{CERTH-ITI-trec2025-papers-proc-1,
title = {MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025},
author = {Andreas Goulas and Damianos Galanopoulos and Ioannis Patras and Vasileios Mezaris},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
BioGen¶
Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track¶
Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts
Abstract
This overview of the TREC 2025 BioGen track discusses the tasks, datasets, evaluation metrics, participating systems, and their performance. We evaluated the performance of the submitted runs across multiple levels (answers, citations, and documents) using expert and automated evaluation. For Task A, most teams followed an NLI-based approach. For Task B, most teams used a two-step RAG approach: first, they retrieved documents via lexical search (BM25), then re-ranked them to obtain the top-k relevant documents/snippets. In the second stage, LLMs were used to generate the answer, citing appropriate documents. We hope that introducing the task has created ground-truth datasets to foster research on designing systems that generate answers to health-related questions with appropriate citations, thereby providing a trusted, reliable source to support the assertions in the answers.
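The lexical first stage of the two-step RAG pattern can be sketched as a tiny BM25 implementation over a toy corpus; the second-stage re-ranker and LLM generator are omitted. The documents and query below are invented, and production systems use tuned BM25 libraries rather than this bare-bones version.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Minimal Okapi BM25: score each doc against a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            s += idf * tf[term] * (k1 + 1) / denom
        scores.append(s)
    return scores

docs = [
    "aspirin reduces risk of heart attack",
    "vitamin c and the common cold",
    "aspirin side effects in heart patients",
]
scores = bm25_scores("aspirin heart", docs)
# First-stage candidates: top-2 by BM25, to be re-ranked in stage two.
first_stage = sorted(range(len(docs)), key=scores.__getitem__, reverse=True)[:2]
```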
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-4,
title = {Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track},
author = {Deepak Gupta and Dina Demner-Fushman and William Hersh and Steven Bedrick and Kirk Roberts},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Contradiction-Aware Grounded QA System for TREC 2025 BioGen: Lexical Retrieval Trade-offs and Citation Attribution¶
Soumya Ranjan Sahoo, Gagan N., Sanand Sasidharan, Divya Bharti
- Paper: https://trec.nist.gov/pubs/trec34/papers/GEHC-HTIC.biogen.pdf
- Participant: GEHC-HTIC
- Runs: gehc_htic_task_a | gehc_htic_task_b | task_a_gehc_htic_run2
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in biomedical question-answering tasks, yet their tendency to generate plausible yet unverified claims poses significant risks in clinical contexts. To mitigate the clinical risks of LLM hallucinations, the TREC 2025 BioGen track mandates grounded answers that explicitly surface contradictory evidence (Task A) and the generation of narrative-driven, fully attributed responses (Task B). Addressing the critical absence of target ground truth, we present a proxy-based development framework utilizing the SciFact dataset to systematically optimize retrieval architectures. Our iterative evaluation revealed a “Simplicity Paradox”: complex adversarial dense retrieval strategies failed catastrophically on contradiction detection (MRR 0.023) due to Semantic Collapse, where negation signals were indistinguishable in vector space. Furthermore, we identified a distinct Retrieval Asymmetry: while filtering dense embeddings improved contradiction detection, it degraded support recall, compromising holistic reliability. We resolve this via a Decoupled Lexical Architecture utilizing a unified BM25 backbone to balance semantic support recall (0.810) with precise contradiction surfacing (0.750). This approach achieves the highest Weighted MRR (0.790) on the proxy benchmark while remaining the only computationally viable strategy for scaling to the 30-million-document PubMed corpus. For answer generation, we introduce Narrative-Aware Reranking and One-Shot In-Context Learning, which improved citation coverage from 50% (zero-shot) to 100%. Our official TREC evaluation results confirm these findings: our system ranks 2nd among all teams on Task A contradiction F1 and 3rd out of 50 runs on Task B citation coverage (98.77%), achieving zero citation contradict rate. 
Our work transforms LLMs from stochastic generators into honest evidence synthesizers, demonstrating that epistemic integrity in biomedical AI requires prioritizing lexical precision and architectural scalability over isolated metric optimization.
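The MRR figures quoted above can be made concrete with a small sketch of (weighted) mean reciprocal rank: rank is the 1-based position of the first relevant document per query, and queries may carry importance weights. The ranks and weights below are invented, not the paper's data.

```python
def weighted_mrr(first_relevant_ranks, weights=None):
    """Mean of 1/rank over queries, optionally importance-weighted."""
    if weights is None:
        weights = [1.0] * len(first_relevant_ranks)
    total_w = sum(weights)
    return sum(w / r for r, w in zip(first_relevant_ranks, weights)) / total_w

# Three hypothetical queries whose first relevant hits sit at ranks 1, 2, 4.
mrr = weighted_mrr([1, 2, 4])
```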
Bibtex
@inproceedings{GEHC-HTIC-trec2025-papers-proc-1,
title = {Contradiction-Aware Grounded QA System for TREC 2025 BioGen: Lexical Retrieval Trade-offs and Citation Attribution},
author = {Soumya Ranjan Sahoo and Gagan N. and Sanand Sasidharan and Divya Bharti},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
TREC BioGen 2025: A Retrieval and NLI-Based Approach for Biomedical Evidence Grounding¶
Riccardo Lunardi, Riccardo Zamolo, Maria Elena Zuliani, Stefano Mizzaro, Vincenzo Della Mea, Kevin Roitero
- Paper: https://trec.nist.gov/pubs/trec34/papers/uniud.biogen.pdf
- Participant: uniud
- Runs: run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct | run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct | run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct | run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct | run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct | run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct | run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct | run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct | run9_no_rerank_index-passages-sparse_gpt-4o-mini | run10_no_rerank_index-passages-dense_gpt-4o-mini | run1_no-rerank_index-passages-sparse | run2_rerank_index-passages-sparse | run3_no-rerank_index-passages-dense | run4_rerank_index-passages-dense
Abstract
This technical report presents the system developed for the TREC 2025 Biomedical Generative Retrieval Track. Our approach integrates sparse and dense retrieval, LLM-based reranking, and Natural Language Inference (NLI) to identify both supporting and contradicting evidence for grounded answer generation, evaluating the interplay between them. We submitted 14 runs, four for Task A and ten for Task B, aimed at analyzing how different retrieval and grounding configurations impact the factuality and reliability of biomedical answer generation.
Bibtex
@inproceedings{uniud-trec2025-papers-proc-1,
title = {TREC BioGen 2025: A Retrieval and NLI-Based Approach for Biomedical Evidence Grounding},
author = {Riccardo Lunardi and Riccardo Zamolo and Maria Elena Zuliani and Stefano Mizzaro and Vincenzo Della Mea and Kevin Roitero},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Generating Long-Form Answers to Biomedical Questions: Effectiveness and Efficiency¶
Jan Bakker, Jaap Kamps
- Paper: https://trec.nist.gov/pubs/trec34/papers/UAmsterdam.biogen.pdf
- Participant: UAmsterdam
- Runs: UAmsterdam_bergen | UAmsterdam_bergen_llama-8b | UAmsterdam_bergen_llama-70b | UAmsterdam_bergen_mistral-7b | UAmsterdam_bergen_pisco-llama | UAmsterdam_bergen_pisco-mistral
Abstract
In this paper, we report on the University of Amsterdam’s participation in the TREC 2025 BioGen Track (Gupta et al., 2025). The goal is to generate answers to biomedical questions that are grounded with appropriate citations. First, we investigated the zero-shot generalization of PISCO, an efficient open-domain question answering method, to this task. Second, we compared the use of small and large Llama models. Our findings reveal a trade-off between answer quality and generation time. Therefore, we emphasize the need for better approaches to combining LLM reliability with efficiency.
Bibtex
@inproceedings{UAmsterdam-trec2025-papers-proc-3,
title = {Generating Long-Form Answers to Biomedical Questions: Effectiveness and Efficiency},
author = {Jan Bakker and Jaap Kamps},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Multi-System Biomedical QA with Post-Hoc Sentence Attribution for TREC BioGen¶
Harikrishnan Gurushankar Saisudha, Sabine Bergler
- Paper: https://trec.nist.gov/pubs/trec34/papers/CLaC.biogen.pdf
- Participant: CLaC
- Runs: simpleQA_BM25 | simpleQA_hybrid | MedHopQA_BM25 | MedHopQA_FAISS | LLM_BM25 | LLM_NLI_BM25
Abstract
This paper presents six systems developed for the TREC 2025 Biomedical Reference Attribution task, covering both Task A (grounding answers) and Task B (reference attribution). For Task A, two LLM-based citation attribution systems were designed to identify supporting and contradicting evidence from PubMed, using a combination of sparse retrieval, dense reranking, and optional NLI-based contradiction filtering. For Task B, four systems were developed: two SimpleQA pipelines emphasizing lightweight, narrative-aligned answer generation, and two MedHopQA-based pipelines adapted from a multi-hop biomedical QA framework. Across systems, we evaluate the impact of retrieval strategies, question decomposition, and reranking on evidence selection and answer quality. Our analysis highlights key challenges in grounding citations, particularly for generic statements and contradiction detection, and identifies opportunities to improve LLM-based attribution through refined retrieval strategies and specialized fine-tuning. Together, these systems provide a comprehensive exploration of reference attribution in biomedical QA and illustrate the trade-offs between retrieval coverage, narrative alignment, and attribution accuracy.
Bibtex
@inproceedings{CLaC-trec2025-papers-proc-1,
title = {Multi-System Biomedical QA with Post-Hoc Sentence Attribution for TREC BioGen},
author = {Harikrishnan Gurushankar Saisudha and Sabine Bergler},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Modular Sparse and Dense Retrieval for Evidence-Constrained Biomedical Question Answering¶
Ganesh Chandrasekar, Benjamin Lofo Follo, Aleksandr Vinokhodov, Sabine Bergler
- Paper: https://trec.nist.gov/pubs/trec34/papers/CLaC Lab.biogen.pdf
- Participant: CLaC Lab
- Runs: system_a | system_b_tfidf | system_c_medcpt | system_d_medcpt_wide | system_e
Abstract
We describe a submission to TREC BioGen Task B, which generates short answers grounded in PubMed and cited with PMIDs. Our approach is modular: a sparse pipeline (BM25 with cross-encoder reranking) and a dense pipeline (MedCPT bi-encoder with cross-encoder) are both coupled with evidence-constrained generation that preserves PMIDs end to end. Each pipeline runs under two retrieval budgets (narrow and wide) using a shared query set formed by the original question plus three reformulations, yielding five runs in total including a baseline. We outline design choices for reformulation, reranking, evidence selection, and citation control, and we report official results for answer quality and citation metrics.
Bibtex
@inproceedings{CLaC_Lab-trec2025-papers-proc-1,
title = {Modular Sparse and Dense Retrieval for Evidence-Constrained Biomedical Question Answering},
author = {Ganesh Chandrasekar and Benjamin Lofo Follo and Aleksandr Vinokhodov and Sabine Bergler},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
SIB Text-Mining group at TREC BioGen 2025¶
Luc Mottin, Anaïs Mottaz, Alexandre Flament, Julien Knafou, Patrick Ruch
- Paper: https://trec.nist.gov/pubs/trec34/papers/SIB.biogen.pdf
- Participant: SIB
- Runs: SIB-task-a-3 | SIB-task-a-4 | SIB-task-a-2 | SIB-task-a-1 | SIB-task-a-6 | SIB-task-a-5 | SIB-task-a-7 | SIB-task-b-1 | SIB-task-b-2 | SIB-task-b-3 | SIB-task-b-4
Abstract
In the 2025 TREC Biomedical Generative Retrieval (BioGen) Track, we evaluated approaches for producing evidence-grounded biomedical answers across two tasks: sentence-level grounding (Task A) and citation-attributed answer generation (Task B). Our pipelines combined specialized retrieval components with both open-source and instruction-tuned large language models (LLMs), integrating sparse and dense retrieval, re-ranking, and, in later runs, LLM-based claim decomposition, splitting complex claims into positive and negative subclaims for more precise evidence evaluation. Retrieved evidence was used to guide models such as Qwen2.5-7B-Instruct and Meta-Llama-3-8B in classifying supporting and contradicting references, with PMIDs selected using round-robin strategies to balance subclaim coverage. For Task B, this approach was extended to produce complete citation-grounded biomedical summaries. Across all runs, we aimed at evaluating the impact of hybrid retrieval, model adaptation, and structured prompting on factual grounding, interpretability, and traceability of LLM outputs.
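The round-robin PMID selection mentioned above can be sketched as drawing one candidate from each subclaim's ranked PMID list in turn until a citation budget is filled. The subclaim lists and PMIDs below are fabricated.

```python
def round_robin_select(ranked_pmids_per_subclaim, budget):
    """Cycle over subclaims, taking the next unseen PMID from each ranked
    list, until the budget is met or all lists are exhausted."""
    selected, seen = [], set()
    queues = [list(p) for p in ranked_pmids_per_subclaim]
    while len(selected) < budget and any(queues):
        for q in queues:
            if len(selected) >= budget:
                break
            while q:
                pmid = q.pop(0)
                if pmid not in seen:  # skip PMIDs already cited elsewhere
                    seen.add(pmid)
                    selected.append(pmid)
                    break
    return selected

# Three hypothetical subclaims with ranked candidate PMIDs.
picks = round_robin_select([["p1", "p2"], ["p3", "p1"], ["p4"]], budget=4)
```

This balances coverage across subclaims rather than letting one well-supported subclaim consume the whole citation budget.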
Bibtex
@inproceedings{SIB-trec2025-papers-proc-1,
title = {SIB Text-Mining group at TREC BioGen 2025},
author = {Luc Mottin and Anaïs Mottaz and Alexandre Flament and Julien Knafou and Patrick Ruch},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Dal@TREC25: Improving Biomedical QA with Adaptive Retrieval and Multi-Stage RAG¶
Jitansh Arora, Aman Jaiswal, Dr. Juan Ramirez-Orta, Dr. Evangelos Milios
- Paper: https://trec.nist.gov/pubs/trec34/papers/dal.biogen.pdf
- Participant: dal
- Runs: rrf_monot5-msmarco_deepseek-r1 | rrf_monot5-msmarco_llama70b | rmmdn | rmmln | afmmdn | afmmd | expd | empd | expert_prompt | emotional_prompt
Abstract
This paper presents the design and implementation of Retrieval-Augmented Generation (RAG) pipelines developed for the TREC BioGen 2025 Challenge by the Dalhousie team, which focuses on grounding and attribution in Biomedical Question Answering. Our approach integrates Hybrid Retrieval, Re-Ranking, and Large Language Model (LLM) reasoning to improve Factual Accuracy and Citation Fidelity. For Task A (Grounding Answer), we introduce a pipeline that reformulates questions and answer sentences into supportive and contradictory queries, retrieves relevant PubMed articles, and classifies their stance using LLM-based reasoning. For Task B (Reference Attribution), we extend this framework to generate concise, evidence-grounded biomedical answers through a combination of Retrieval–Generation and Generate–then–Retrieve architectures. Multiple prompting strategies and validation mechanisms were explored to balance retrieval coverage, precision, and logical consistency. The proposed systems emphasize modularity, reproducibility, and adaptability to evolving biomedical corpora, providing a robust foundation for advancing trustworthy, citation-grounded question answering in the biomedical domain.
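The run names above suggest reciprocal rank fusion (RRF) for combining hybrid retrieval results; a minimal sketch follows, with invented rankings. The constant k=60 is the commonly used default, not necessarily the team's setting.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: each doc scores sum of 1/(k + rank)
    over the input rankings; return docs sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical ranked lists, e.g. from sparse and dense retrievers.
fused = rrf([["d1", "d2", "d3"], ["d2", "d3", "d1"]])
```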
Bibtex
@inproceedings{dal-trec2025-papers-proc-1,
title = {Dal@TREC25: Improving Biomedical QA with Adaptive Retrieval and Multi-Stage RAG},
author = {Jitansh Arora and Aman Jaiswal and Dr. Juan Ramirez-Orta and Dr. Evangelos Milios},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Detection, Retrieval, and Augmented Generation for Understanding News (DRAGUN)¶
Overview of the TREC 2025 DRAGUN Track: Detection, Retrieval, and Augmented Generation for Understanding News¶
Dake Zhang, Mark D. Smucker, Charles L. A. Clarke
Abstract
Many internet users struggle to assess whether online information is trustworthy, a critical skill in today’s digital environment where accurate content coexists with false or misleading material. As the successor to the previous TREC 2024 Lateral Reading Track, the TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track aims to advance research on supporting readers in assessing the trustworthiness of online news by providing reader-oriented, well-attributed reports. The track had two tasks: (1) Question Generation, which asked participants to propose critical, ranked questions a reader might investigate for a given article; and (2) Report Generation, which asked participants to produce a short (up to 250 words) background report grounded in the MS MARCO V2.1 Segmented Corpus. Using assessor-built rubrics with importance-weighted questions and short answers, we evaluated question coverage and report support/contradiction. We release topics, rubrics, annotations, runs, and evaluation results to support research on developing systems to help people assess the trustworthiness of news.
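The importance-weighted rubric evaluation described above can be sketched as a coverage score: each rubric question carries a weight, and a system earns the weight of the questions it addresses divided by the total weight. The questions, weights, and matches below are illustrative, not the official DRAGUN scorer.

```python
def weighted_coverage(question_weights, covered):
    """Fraction of total rubric weight earned by the covered questions."""
    total = sum(question_weights.values())
    gained = sum(w for q, w in question_weights.items() if q in covered)
    return gained / total

# Hypothetical rubric for one topic.
rubric = {
    "who funded the study?": 2.0,
    "what do other outlets report?": 1.0,
    "is the source named?": 1.0,
}
score = weighted_coverage(rubric, {"who funded the study?", "is the source named?"})
```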
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-6,
title = {Overview of the TREC 2025 DRAGUN Track: Detection, Retrieval, and Augmented Generation for Understanding News},
author = {Dake Zhang and Mark D. Smucker and Charles L. A. Clarke},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
An Iterative Multi-agent RAG System for the TREC 2025 DRAGUN Track¶
Dake Zhang
- Paper: https://trec.nist.gov/pubs/trec34/papers/UWaterlooMDS.dragun.pdf
- Participant: UWaterlooMDS
Abstract
The main goal of the TREC 2025 DRAGUN Track is to advance research on helping people assess the trustworthiness of online news articles via two tasks: question generation (producing critical questions that readers should consider when evaluating a news article’s trustworthiness) and report generation (creating well-sourced reports that provide readers with useful background and context for more informed trustworthiness evaluation). In this paper, we describe our organizer baselines for both tasks, including a starter kit made available to participants at the track’s launch. This multi-agent system uses an iterative retrieval-augmented generation pipeline consisting of a query generator, segment retriever, information evaluator, question generator, and report generator. The system is available at: https://github.com/trec-dragun/2025-starter-kit.
Bibtex
@inproceedings{UWaterlooMDS-trec2025-papers-proc-1,
title = {An Iterative Multi-agent RAG System for the TREC 2025 DRAGUN Track},
author = {Dake Zhang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
TREMA-UNH at TREC 2025 DRAGUN Track: Iterative Multi-Agent Pipeline for News Verification via Adversarial Credibility Analysis with Local LLMs¶
Naghmeh Farzi, Laura Dietz
- Paper: https://trec.nist.gov/pubs/trec34/papers/TREMA-UNH.dragun.pdf
- Participant: TREMA-UNH
- Runs: SK_MI_1 | SK_MI_2 | SK_Critique_MI_5 | SK_Critique_MI_5_RG | SK_MI_2_RG | SK_ConvinceF_MI_2 | SK_ConvinceF_MI_2_RG | ConvF_all-t12_5 | ConvF_all-t12_5_RG | ConvF_all_MI_5 | ConvF_all_MI_5_RG
Abstract
This notebook describes our submission to the TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track. We adapt the official starter kit to use local large language models via Ollama and implement an adversarial module that produces both balanced and aggressive critiques of news articles, highlighting potential weaknesses, unsupported claims, contradictions, and source biases to inform and guide subsequent query generation and evidence retrieval. Our system generates investigative questions (Task 1) and, building on them, a report (Task 2) through an iterative retrieval-augmented generation approach.
Bibtex
@inproceedings{TREMA-UNH-trec2025-papers-proc-1,
title = {TREMA-UNH at TREC 2025 DRAGUN Track: Iterative Multi-Agent Pipeline for News Verification via Adversarial Credibility Analysis with Local LLMs},
author = {Naghmeh Farzi and Laura Dietz},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen¶
Laura Dietz, Bryan Li, James Mayfield, Dawn Lawrie, Eugene Yang, William Walden
- Paper: https://trec.nist.gov/pubs/trec34/papers/HLTCOE.dragun.rag.ragtime.pdf
- Participant: HLTCOE
- Runs: cru-claude-chatty | cru-most_common | cru-claude | cru-ablR_ | cru-ablR-conf_ | cru-confirm-ansR_ | cru-clod-ablR-conf_ | cru-cloch-ablR-conf_
Abstract
The HLTCOE Evaluation team participated in several tracks focused on Retrieval-Augmented Generation (RAG), including RAG, RAGTIME, DRAGUN, and BioGen. Drawing inspiration from recent work on nugget-based evaluations, we introduce the Crucible system, which scrambles the traditional retrieval → generation → evaluation workflow of a RAG task by automatically curating a set of high-quality question-answer pairs (nuggets) from retrieved documents and then conditioning generation on this set. This not only enables us to study how effectively we can recover the set of gold nuggets for each request, but also how nugget-set quality impacts final performance.
Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-2,
title = {HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen},
author = {Laura Dietz and Bryan Li and James Mayfield and Dawn Lawrie and Eugene Yang and William Walden},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Intelligent News Comprehension through Query Expansion and LLM-Augmented Generation¶
Jack Cheverton, Oliwia Majtyka, Ting Liu
- Paper: https://trec.nist.gov/pubs/trec34/papers/SCIAI.dragun.pdf
- Participant: SCIAI
- Runs: Team02_Run01_1000SegmentsExpansion | Team02_Run02_100SegmentsExpansion | Team02_Run03_100SegmentsNoExpansion | Team01_Run01_Winner | Team02_Task1 | 03_01_Baseline | SCIAI_03_02_Three | SCIAI_03_03_Five | SCIAI_03_04_Eight
Abstract
This paper discusses our work and participation in the Text Retrieval Conference (TREC) Detection, Retrieval, and Augmented Generation for Understanding News (DRAGUN) track of 2025. In our current digital landscape, it can be difficult to determine the accuracy of what we see online. This is especially true with the rise in fake news. The DRAGUN track challenges participants to develop a system that analyzes an article and generates a list of questions a thoughtful reader should ask if they are trying to determine trustworthiness, as well as generate a report that answers many of these questions. This paper discusses our team’s use of query expansion techniques and Large Language Models to approach this task.
Bibtex
@inproceedings{SCIAI-trec2025-papers-proc-2,
title = {Intelligent News Comprehension through Query Expansion and LLM-Augmented Generation},
author = {Jack Cheverton and Oliwia Majtyka and Ting Liu},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
From Questions to Trust Reports: A LLM-IR Framework for the TREC 2025 DRAGUN Track¶
Ignacy Alwasiak, Kene Nnolim, Jaclyn Thi, Samy Ateia, Markus Bink, Gregor Donabauer, David Elsweiler, Udo Kruschwitz
- Paper: https://trec.nist.gov/pubs/trec34/papers/UR_trecking.dragun.pdf
- Participant: UR_trecking
- Runs: UR_IW_run_1 | UR_IW_run_1_task2
Abstract
The DRAGUN Track at TREC 2025 targets the growing need for effective support tools that help users evaluate the trustworthiness of online news. We describe the UR_Trecking system submitted for both Task 1 (critical question generation) and Task 2 (retrieval augmented trustworthiness reporting). Our approach combines LLM-based question generation with semantic filtering, diversity enforcement using clustering, and several query expansion strategies (including reasoning-based Chain-of-Thought expansion) to retrieve relevant evidence from the MS MARCO V2.1 segmented corpus. Retrieved documents are re-ranked using a monoT5 model and filtered using an LLM relevance judge together with a domain-level trustworthiness dataset. For Task 2, selected evidence is synthesized by an LLM into concise trustworthiness reports with citations. Results from the official evaluation indicate that Chain-of-Thought query expansion and re-ranking substantially improve both relevance and domain trust compared to baseline retrieval, while question-generation performance shows moderate quality with room for improvement. We conclude by outlining key challenges encountered and suggesting directions for enhancing robustness and trustworthiness assessment in future iterations of the system.
Bibtex
@inproceedings{UR_trecking-trec2025-papers-proc-1,
title = {From Questions to Trust Reports: A LLM-IR Framework for the TREC 2025 DRAGUN Track},
author = {Ignacy Alwasiak and Kene Nnolim and Jaclyn Thi and Samy Ateia and Markus Bink and Gregor Donabauer and David Elsweiler and Udo Kruschwitz},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
CITADEL — Citation-Driven Draft-Evaluate Loop¶
Daniel Seredensky, Dylan Iddings, Sharon G. Small
- Paper: https://trec.nist.gov/pubs/trec34/papers/SCIAI.dragun.pdf
- Participant: SCIAI
- Runs: Team02_Run01_1000SegmentsExpansion | Team02_Run02_100SegmentsExpansion | Team02_Run03_100SegmentsNoExpansion | Team01_Run01_Winner | Team02_Task1 | 03_01_Baseline | SCIAI_03_02_Three | SCIAI_03_03_Five | SCIAI_03_04_Eight
Abstract
This paper describes our team's submission from the Siena University Institute for Artificial Intelligence for the Text Retrieval Conference (TREC) 2025 Detection, Retrieval, and Augmented Generation for Understanding News Track (DRAGUN). Our approach combines classical retrieval methods with multiple large language model (LLM) agents to generate concise, evidence-based reports. First, a hybrid retrieval pipeline integrates BM25, synonym-based query expansion, and cross-encoder reranking to maximize recall and precision across the corpus. Then, retrieved passages are processed through a generation-evaluation loop, where different LLM agents separately generate and critique reports according to rubric-based criteria for coverage, accuracy, and citation quality. This design emphasizes factual accuracy and access to citations to align with DRAGUN's goal of supporting critical engagement with news.
Bibtex
@inproceedings{SCIAI-trec2025-papers-proc-1,
title = {CITADEL — Citation-Driven Draft-Evaluate Loop},
author = {Daniel Seredensky and Dylan Iddings and Sharon G. Small},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
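The generation-evaluation loop described in the CITADEL abstract above can be sketched as follows. This is an illustrative skeleton only: `draft_report`, `critique`, the word-count scoring heuristic, and the threshold are all hypothetical stand-ins for the paper's LLM agents and rubric-based criteria.

```python
def draft_report(passages, feedback=None):
    # Stand-in for an LLM agent that drafts (or revises) a cited report.
    body = " ".join(p["text"] for p in passages[:3])
    if feedback:
        body += " [revised: " + feedback + "]"
    return body

def critique(report):
    # Stand-in for an LLM judge scoring coverage/accuracy/citations in [0, 1].
    score = min(1.0, len(report.split()) / 50)
    feedback = None if score >= 0.8 else "expand coverage"
    return score, feedback

def generate_with_critique(passages, max_rounds=3, threshold=0.8):
    # Alternate drafting and critiquing until the judge is satisfied
    # or the round budget is exhausted.
    report, feedback = None, None
    for _ in range(max_rounds):
        report = draft_report(passages, feedback)
        score, feedback = critique(report)
        if score >= threshold:
            break
    return report

passages = [{"text": "Claim A is supported by source 1."},
            {"text": "Source 2 disputes claim B."}]
print(generate_with_critique(passages))
```

Separating the drafting and critiquing roles into distinct agents, as the abstract describes, keeps the judge from simply rubber-stamping its own output.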
A LangChain-Based Framework for Investigative Question Generation Using Large Language Models¶
Adnan Faisal, Shiti Chowdhury
- Paper: https://trec.nist.gov/pubs/trec34/papers/CUET.dragun.pdf
- Participant: CUET
- Runs: CUET-DeepSeek-R1-Qwen-32B | CUET-qwen4B-v2 | CUET-unsloth-Mistral-Small | CUET-qwen4B-v3 | CUET-qwen14B-v1 | CUET-qwen14B-v2 | CUET-qwen14B-v3 | CUET-qwen14B-v5 | CUET-Mistral-Small-24B | CUET-QwQ-32B
Abstract
The increasing prevalence of online misinformation has amplified the demand for automated approaches that assist readers in assessing the credibility of news articles. The TREC 2025 DRAGUN (Detection, Retrieval and Augmented Generation for Understanding News) Track addresses this need through its Question Generation task, which requires systems to formulate ranked investigative questions that support reader-oriented credibility assessment. This study presents a LangChain-based pipeline for generating focused and investigative questions from news articles in the MS MARCO V2.1 segmented corpus. The proposed framework combines structured prompt design, controlled decoding and semantic reranking to improve question relevance, coherence and interpretability. We have evaluated several experimental configurations covering Qwen-based, Mistral-based and reasoning-oriented large language models using the rectified DRAGUN evaluation protocol, where compound questions are removed prior to scoring. Our experimental results indicate that reasoning-aligned models exhibit stronger and more consistent performance under strict evaluation constraints, with the CUET-QwQ-32B configuration achieving the highest average score among our submissions. At the same time, Qwen-14B variants demonstrate stable and competitive performance across diverse topics, showing substantial agreement with assessor-defined evaluation rubrics. Overall, our findings demonstrate that a structured and modular question-generation pipeline can effectively translate large language model reasoning into practical support for reader-centric news trustworthiness assessment, while also providing insights for extending such systems toward multi-source report generation.
Bibtex
@inproceedings{CUET-trec2025-papers-proc-1,
title = {A LangChain-Based Framework for Investigative Question Generation Using Large Language Models},
author = {Adnan Faisal and Shiti Chowdhury},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
LLM-Based Question Generation and Retrieval-Augmented Reporting for News Credibility¶
Georgios Arampatzis, Ioannis Maslaris, Avi Arampatzis
- Paper: https://trec.nist.gov/pubs/trec34/papers/DUTH.dragun.pdf
- Participant: DUTH
- Runs: garamp_qwen25_7b_imp | garamp_qwen25_14b | garamp_yi15_9b | garamp_qwen25_72b | garamp_mistral_7b | garamp_qwen25_14b_r4 | garamp_dragun_t2_q7b | garamp_yi9b_t2_v1 | garamp_qwen25_3b_t2 | garamp_zephyr7b_t2
Abstract
This paper presents the participation of the DUTH team in both tasks of the TREC 2025 DRAGUN (Detection, Retrieval, and Augmented Generation for Understanding News) Track. The track addresses the challenge of misinformation and biased narratives in digital news through two complementary tasks: Question Generation (Task 1) and Report Generation (Task 2). Task 1 focuses on generating investigative questions that assist readers in assessing news credibility, while Task 2 evaluates systems’ ability to retrieve evidence and generate grounded, well-attributed reports. Our approach employs recent open-weight, instruction-tuned large language models (LLMs), including Qwen2.5, Yi-1.5, and Mistral-7B, combined with prompt engineering, semantic filtering, and retrieval-grounded generation pipelines. All systems were implemented locally using the transformers and accelerate libraries, without external fine-tuning or API access, ensuring full reproducibility and controlled model comparison. Experimental results show that mid-sized instruction-tuned models, most notably Mistral-7B-Instruct, achieve the strongest rubric coverage among the DUTH submissions in the Question Generation task. In the Report Generation task, all evaluated systems exhibit very low contradiction rates, indicating robust factual grounding, but achieve limited rubric coverage under strict retrieval and attribution constraints. Overall, these findings suggest that prompt design, question specificity, and retrieval quality play a more decisive role than raw model scale in supporting explainable and evidence-based news trustworthiness assessment.
Bibtex
@inproceedings{DUTH-trec2025-papers-proc-1,
title = {LLM-Based Question Generation and Retrieval-Augmented Reporting for News Credibility},
author = {Georgios Arampatzis and Ioannis Maslaris and Avi Arampatzis},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
WaterlooClarke at TREC 2025¶
Siqing Huo, Charles L. A. Clarke
- Paper: https://trec.nist.gov/pubs/trec34/papers/WaterlooClarke.dragun.rag.pdf
- Participant: WaterlooClarke
- Runs: feedbackintheloop | garag_rubric
Abstract
Participating as the WaterlooClarke group, we focused on the RAG track; we also submitted runs for the DRAGUN Track. For the full retrieval augmented generation (RAG) task, we explored five pipelines: 1) Nuggetizer pipeline; 2) Generate an Answer and support with Retrieved Evidence (GARE); 3) Automatic Retrieval and Generation Plan; 4) Combined; 5) Automatically Selected Best Response. For the DRAGUN task, we explored one-shot prompting with feedback in the loop.
Bibtex
@inproceedings{WaterlooClarke-trec2025-papers-proc-1,
title = {WaterlooClarke at TREC 2025},
author = {Siqing Huo and Charles L. A. Clarke},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Interactive Knowledge Assistance Track (IKAT)¶
TREC iKAT 2025: The Interactive Knowledge Assistance Track Overview¶
Mohammad Aliannejadi, Simon Lupart, Marcel Gohsen, Zahra Abbasiantaeb, Nailia Mirzakhmedova, Johannes Kiesel, Jeffrey Dalton
Abstract
Conversational information seeking has evolved rapidly in the last few years with the development of large language models (LLMs), which provide the basis for interpreting and responding in a naturalistic manner to user requests. iKAT emphasizes the creation and research of conversational search agents that adapt responses based on the user’s prior interactions and present context, maintaining a long-term memory of user-system interactions. This means that the same question might yield varied answers, contingent on the user’s profile and preferences. The challenge lies in enabling conversational search agents (CSAs) to incorporate personalized context to guide users through the relevant information effectively. iKAT’s third year introduced an interactive conversation task, attracting seven teams and a total of 47 runs. Most of the runs leveraged LLMs in their pipelines for single or multiple query rewriting; some also adopted agentic pipelines.
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-5,
title = {TREC iKAT 2025: The Interactive Knowledge Assistance Track Overview},
author = {Mohammad Aliannejadi and Simon Lupart and Marcel Gohsen and Zahra Abbasiantaeb and Nailia Mirzakhmedova and Johannes Kiesel and Jeffrey Dalton},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
USIIR at TREC 2025 iKAT Track¶
Lili Lu
- Paper: https://trec.nist.gov/pubs/trec34/papers/usiir.ikat.pdf
- Participant: usiir
- Runs: usiir_run1 | usiir_run2
Abstract
This year’s TREC iKAT track contains several tasks, such as passage ranking, response generation, and Personal Text Knowledge Base (PTKB) statement classification. Due to time and budget limitations, we focus only on the response generation task, which is to generate a response based on retrieved passages, given the additional context. We submitted two runs for this task, mainly to explore the impact of user profiles on the quality of generated responses. In this short report, we describe the method used for generation and present the results.
Bibtex
@inproceedings{usiir-trec2025-papers-proc-1,
title = {USIIR at TREC 2025 iKAT Track},
author = {Lili Lu},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG¶
Paul Owoicho, Jeff Dalton
- Paper: https://trec.nist.gov/pubs/trec34/papers/grilllab.ikat.rag.pdf
- Participant: grilllab
- Runs: grilllab-larf-finetuned | grilllab-larf-finetuned-10-rounds | grilllab-larf-finetuned-rankllm | grilllab-larf-finetuned-22-rounds | grilllab-agentic-gpt4.1 | grilllab-agentic-gpt4.1-larf | grilllab-agentic-gpt4.1-larf-v2 | grilllab-larf-fine-tuned-judge
Abstract
This paper describes the GRILL Lab’s participation in the TREC 2025 Interactive Knowledge Assistance Track (IKAT) and the Retrieval-Augmented Generation (RAG) track, covering four sub-tasks: IKAT Passage Ranking/Response Generation, IKAT Simulation, RAG Retrieval Only, and RAG Full. Our approach centres on a modular, agentic pipeline that pursues high recall through iterative feedback. The system proceeds in three stages: (1) initial candidate generation via BM25; (2) document expansion using Query-by-Document techniques; and (3) an LLM-driven gap analysis phase in which the model identifies informational gaps and formulates supplementary queries. A key architectural feature is a fine-tuned GPT-4.1 nano binary relevance filter, trained on TREC CAsT 2022 and IKAT 2023 relevance judgments, which prunes irrelevant documents between each stage to contain topic drift.
Bibtex
@inproceedings{grilllab-trec2025-papers-proc-1,
title = {GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG},
author = {Paul Owoicho and Jeff Dalton},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
GUIDANCE@TREC iKAT 2025¶
Ahmed Rayane Kebir, Victor Morand, Pierre-Antoine Lequeu, Zineddine Tighidet, Mitodru Niyogi, Jonah Turner, Rishu Kumar, Benjamin Piwowarski
- Paper: https://trec.nist.gov/pubs/trec34/papers/guidance.ikat.pdf
- Participant: guidance
- Runs: cosine-orconvqa-sum-top10 | agg_true-qrec-mse-sum-top10 | agg_false-qrec-mse-sum-top10 | gpt-clarif-sum-top10
Abstract
This report describes the work conducted by several teams involved in the ANR GUIDANCE project for the iKAT evaluation campaign. The GUIDANCE project aims to advance research in General Purpose Dialogue-assisted Digital Information Access. This involves tackling challenges such as the design and adaptation of large language models (LLMs) for better information access, enhancing LLMs' generalization capabilities to new domains and languages, ensuring the truthfulness of outputs, and addressing the lack of open-access state-of-the-art models. The project also seeks to unite the French Information Retrieval community and produce open-access resources for model evaluation and development.
Bibtex
@inproceedings{guidance-trec2025-papers-proc-1,
title = {GUIDANCE@TREC iKAT 2025},
author = {Ahmed Rayane Kebir and Victor Morand and Pierre-Antoine Lequeu and Zineddine Tighidet and Mitodru Niyogi and Jonah Turner and Rishu Kumar and Benjamin Piwowarski},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Evaluating Full Dialogue History vs. Summarized Context for Personalized Knowledge Assistance: Findings from the TREC 2025 iKAT Track¶
Suveyda Yeniterzi, Reyyan Yeniterzi
- Paper: https://trec.nist.gov/pubs/trec34/papers/GenAIus.ikat.pdf
- Participant: GenAIus
- Runs: genaius-genonly-summary-gpt4o | genaius-genonly-full-gpt4o | genaius-full-rewrite | genaius-summary-rewrite
Abstract
We present GenAIus’s participation in the TREC 2025 Interactive Knowledge Assistance Track (iKAT), focusing on personalized and context-aware response generation in both offline and interactive settings. We develop a multi-stage pipeline that integrates conversation summarization, Personal Textual Knowledge Base (PTKB) classification, query rewriting, passage retrieval, and grounded response generation. To study the impact of conversational context modeling, we compare two configurations: conditioning on the full dialogue history versus using an evolving conversation summary updated at each turn. Experimental results show that full-history conditioning yields slightly stronger performance in offline generation and dialogue-level interactive metrics, while summary-based conditioning achieves comparable overall results with improvements in engagement and contextual efficiency. Both approaches rank within the top tier of participating systems, demonstrating the robustness of our pipeline and the viability of structured conversational summarization as a scalable alternative to full-history conditioning.
Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-5,
title = {Evaluating Full Dialogue History vs. Summarized Context for Personalized Knowledge Assistance: Findings from the TREC 2025 iKAT Track},
author = {Suveyda Yeniterzi and Reyyan Yeniterzi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
CFDA & CLIP at TREC iKAT 2025: Enhancing Personalized Conversational Search via Query Reformulation and Rank Fusion¶
Yu-Cheng Chang, Guan-Wei Yeo, Quah Eugene, Fan-Jie Shih, Yuan-Ching Kuo, Tsung-En Yu, Hung-Chun Hsu, Ming-Feng Tsai, Chuan-Ju Wang
- Paper: https://trec.nist.gov/pubs/trec34/papers/cfda.ikat.pdf
- Participant: cfda
- Runs: cfda-auto-3 | cfda-auto-4 | cfda-auto-1 | cfda-auto-2 | cfda-gen-only-2 | cfda-gen-only-1 | cfda-adarewriter-chiq-llm4cs-splade | cfda-chiq-llm4cs-splade-rrf
Abstract
The 2025 TREC Interactive Knowledge Assistance Track (iKAT) featured both interactive and offline submission tasks. The former requires systems to operate under real-time constraints, making robustness and efficiency as important as accuracy, while the latter enables controlled evaluation of passage ranking and response generation with pre-defined datasets. To address this, we explored query rewriting and retrieval fusion as core strategies. We built our pipelines around Best-of-N selection and Reciprocal Rank Fusion (RRF) strategies to handle different submission tasks. Results show that reranking and fusion improve robustness while revealing trade-offs between effectiveness and efficiency across both tasks.
Bibtex
@inproceedings{cfda-trec2025-papers-proc-1,
title = {CFDA \& CLIP at TREC iKAT 2025: Enhancing Personalized Conversational Search via Query Reformulation and Rank Fusion},
author = {Yu-Cheng Chang and Guan-Wei Yeo and Quah Eugene and Fan-Jie Shih and Yuan-Ching Kuo and Tsung-En Yu and Hung-Chun Hsu and Ming-Feng Tsai and Chuan-Ju Wang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
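Reciprocal Rank Fusion (RRF), one of the fusion strategies named in the CFDA abstract above, is a standard technique and can be sketched in a few lines. This is a minimal illustration, not the team's implementation; the constant k=60 is the conventional default from the RRF literature, not necessarily their setting.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score = better; ties keep first-seen order (sorted is stable).
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked first by both retrievers dominates the fused list.
fused = rrf([["d1", "d2", "d3"], ["d1", "d3", "d2"]])
print(fused)
```

Because RRF uses only ranks, not raw scores, it sidesteps score-calibration issues when fusing heterogeneous retrievers (e.g., SPLADE and a dense model), which is what makes it attractive for the robustness goals the abstract describes.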
UvA-IRLab at iKAT25: Exploring Learned Sparse Retrieval and Query Rewriting for Personalized Conversational QA¶
Simon Lupart, Zahra Abbasiantaeb, Mohammad Aliannejadi
- Paper: https://trec.nist.gov/pubs/trec34/papers/uva.ikat.pdf
- Participant: uva
- Runs: genonly-noptkb | genonly-ptkb | disco-qrecc-norerank | mq4cs-gpt41-bm25 | mq4cs-gpt41-splade | mq4cs-llamaft-splade | uva-gpt5-bm25-debertav3-gpt5 | uva-gpt5-bm25-debertav3-gpt5mini-nopersonal | uva-gpt5mini-bm25-debertav3-gpt5mini | uva-gpt5mini-no-no-gpt5mini
Abstract
The TREC interactive Knowledge Assistant Track (iKAT) 2025 is the third edition of the iKAT shared task. It focuses on developing conversational assistants that can adapt their responses using personal user knowledge from a Personal Textual Knowledge Base (PTKB). This year’s edition also introduces a new interactive task that evaluates systems using a user simulator. Since query rewriting is an effective way to handle conversational context, we study the use of Large Language Models (LLMs) as query rewriters. In our runs, we generate multiple query aspects using the MQ4CS framework and frontier LLMs (GPT-4.1), as well as open-source LLMs finetuned for the task (Llama-8B). We also strengthen the approach with SPLADE-based sparse retrieval and cross-encoder reranking. Finally, we also explore a rewrite-free technique, based on learned sparse retrieval (LSR) using the DiSCo model. Our results show that multi-aspect query generation improves performance when paired with strong retrieval and reranking models. They also suggest that LLM-based query rewriting can support better personalization in conversational search.
Bibtex
@inproceedings{uva-trec2025-papers-proc-1,
title = {UvA-IRLab at iKAT25: Exploring Learned Sparse Retrieval and Query Rewriting for Personalized Conversational QA},
author = {Simon Lupart and Zahra Abbasiantaeb and Mohammad Aliannejadi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Million LLMs Track (MLLM)¶
Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking¶
Gabrielle Poerwawinata, Jingfen Qiao
- Paper: https://trec.nist.gov/pubs/trec34/papers/uvairlab.mllm.pdf
- Participant: uvairlab
- Runs: lightgbm_job266431
Abstract
The Million LLMs Track (TREC Million LLMs) focuses on methods for ranking large language models (LLMs) based on their expected ability to answer a given query. As tasks increasingly require a combination of both general-purpose and domain-specific models, it is vital to predict which LLM is best suited for a given query without needing to query each model directly. We propose a supervised learning-to-rank approach that exploits next-token log probabilities from pre-generated responses as zero-cost pseudo-relevance signals. For each (query, LLM) pair, we derive a soft relevance label from the mean next-token log probability of the model’s prior response, indicating the model’s confidence as a proxy for answer quality. A LightGBM-based LambdaRank model is trained on feature vectors combining query embeddings (Sentence-BERT), categorical LLM identifiers, and global token-level statistics filtered at the top percentile. On the TREC Million LLMs test set, our best configuration achieves NDCG@10 of 0.3695, substantially outperforming tag-based (0.013) and response-based (0.195) baselines. Our ablation analysis shows that LLM-level statistics contribute more to ranking quality than query-specific embeddings, suggesting that global model capability is a dominant signal in the current evaluation setting.
Bibtex
@inproceedings{uvairlab-trec2025-papers-proc-1,
title = {Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking},
author = {Gabrielle Poerwawinata and Jingfen Qiao},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
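The soft relevance label described in the abstract above, the mean next-token log probability of a model's stored response, is straightforward to compute. A minimal sketch follows; the example probabilities are invented for illustration, and the actual system feeds these labels into a LightGBM LambdaRank model rather than using them directly.

```python
import math

def soft_relevance_label(token_logprobs):
    """Mean next-token log probability of a model's pre-generated response,
    used as a zero-cost proxy for that model's confidence/answer quality."""
    return sum(token_logprobs) / len(token_logprobs)

# Hypothetical example: model A is consistently more confident on its
# response tokens than model B, so A gets the higher pseudo-relevance label.
lp_a = [math.log(0.8), math.log(0.7), math.log(0.9)]
lp_b = [math.log(0.3), math.log(0.4), math.log(0.2)]
print(soft_relevance_label(lp_a) > soft_relevance_label(lp_b))
```

Averaging per-token log probabilities (rather than summing) keeps the label comparable across responses of different lengths, which matters when the same query draws short answers from some models and long ones from others.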
How to Choose the Right LLM? Exploring Methods for the Million LLMs Track¶
Catalina Riano, Hui Fang
- Paper: https://trec.nist.gov/pubs/trec34/papers/UDInfo.mllm.pdf
- Participant: UDInfo
- Runs: infolab_UD_run1 | infolab_UD_run2 | infolab_UD_run3 | infolab_UD_run4 | infolab_UD_run5
Abstract
The Million LLMs Track, part of TREC 2025, addresses the challenge of determining which LLMs are best suited to answer a specific user query. The main goal of the track is to evaluate the ability of a system to predict LLM expertise: given a query and a set of LLM IDs, the system must rank the models by how likely they are to provide a high quality answer. Unlike traditional IR settings, this ranking must be produced without querying the models at test time. Instead, participants must rely on precomputed discovery data, including LLM responses, metadata, and development labels, to infer each model’s strengths and capabilities. This report explains our submission to the Million LLMs Track and outlines the methods we implemented for the task.
Bibtex
@inproceedings{UDInfo-trec2025-papers-proc-1,
title = {How to Choose the Right LLM? Exploring Methods for the Million LLMs Track},
author = {Catalina Riano and Hui Fang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks¶
Hongyu Li, Yuming Zhang, Junyu Zhou, Yongwei Zhang, Shanshan Jiang, Bin Dong
- Paper: https://trec.nist.gov/pubs/trec34/papers/SRCB.mllm.tot.pdf
- Participant: SRCB
- Runs: submission_5 | submission_4 | ensemble_v1 | ensemble_v2 | ensemble_v3
Abstract
This paper reports the performance of SRCB’s system in the Million LLMs and Tip-of-the-Tongue tracks. For the Million LLMs Track, we rely mainly on powerful LLMs and a variety of methods to construct the missing training labels; we then use the constructed training data to devise several approaches to the ranking target and conduct experiments. For the Tip-of-the-Tongue task, we propose a retrieval framework that integrates dense and LLM-based components. Original queries are transformed into cue lists, and additional data are used to fine-tune both the dense retriever and the re-ranker. Furthermore, retrieval results from LLMs are incorporated to supplement the reranker, and the final ranking is produced using an LLM reranker.
Bibtex
@inproceedings{SRCB-trec2025-papers-proc-1,
title = {SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks},
author = {Hongyu Li and Yuming Zhang and Junyu Zhou and Yongwei Zhang and Shanshan Jiang and Bin Dong},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track¶
Reyyan Yeniterzi, Suveyda Yeniterzi
- Paper: https://trec.nist.gov/pubs/trec34/papers/GenAIus.mllm.pdf
- Participant: GenAIus
- Runs: llm-f-100 | q-f-rm3-100-ss | r-f-rm3-1000-ss | r-ef-rm3-1000-ss | r-f-rm3-1000-rr-ss
Abstract
We address the Million LLM ranking task by formulating it as an expert retrieval problem and adapting established Information Retrieval techniques to estimate the expertise of large language models. Our approach centers on two families of methods: a profile-based strategy that aggregates all query–response pairs from each LLM into a unified representation, and document-based strategies that operate either at the response level or at the query level. Before applying these models, we introduce a two-stage data filtering pipeline to remove uninformative and low-confidence responses, yielding a cleaner signal for expertise estimation. Experimental results on the development set show that response-based aggregation provides the most fine-grained and reliable ranking of LLMs, outperforming both profile-based and question-based variants. Guided by these findings, we prepared five submissions combining different retrieval, filtering, and aggregation configurations, including a re-ranking variant using naver-splade-v3. Our study demonstrates that classical expert retrieval methods, when adapted appropriately, can effectively model and rank LLM expertise.
Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-3,
title = {Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track},
author = {Reyyan Yeniterzi and Suveyda Yeniterzi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Product Search and Recommendation¶
Precision by Design: RM3 and Fusion in Product Search¶
Georgios Arampatzis, Symeon Symeonidis, Avi Arampatzis
- Paper: https://trec.nist.gov/pubs/trec34/papers/DUTH.product.pdf
- Participant: DUTH
- Runs: garamp_rm3_v1 | garamp_bm25_v1 | garamp_prf_v1 | gar_rm3_f120d10w3 | garamp_rm3_f40d5_w05
Abstract
In this work, we present the fully lexical and reproducible system developed by the DUTH team for the TREC 2025 Product Search & Recommendation track, aiming to improve performance on task-oriented e-commerce queries. Such queries (e.g., home office makeover, birthday party essentials) often perform poorly in purely lexical retrieval systems because they express high-level user intents rather than concrete product attributes. Our system indexes approximately 1.08M products using Lucene/Pyserini, retrieves with BM25 (tuned to k1=0.9, b=0.4), and bridges the intent–metadata gap through carefully calibrated RM3 pseudo-relevance feedback. For the interactive setting, we automatically generate four PRF-based query reformulations per topic and aggregate complementary signals using weighted Reciprocal Rank Fusion. The system requires neither neural re-ranking nor external resources, runs efficiently on a single CPU node, and produces standard six-column TREC runs with strict de-duplication. Official evaluation results confirm that RM3 and fusion yield consistent improvements over the BM25 baseline across task completion nDCG, MAP, and Essential Recall@1000. These findings highlight that thoughtful lexical reformulation, classical PRF, and simple fusion strategies remain strong and efficient baselines for task-oriented product search.
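The weighted Reciprocal Rank Fusion step described above can be sketched as follows; the smoothing constant k=60 and uniform default weights are common conventions, not the paper's tuned values.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights=None, k=60):
    """Weighted Reciprocal Rank Fusion over several ranked lists.

    rankings: list of ranked doc-id lists (best first), e.g. one list
    per PRF-based query reformulation.
    score(d) = sum_i w_i / (k + rank_i(d)), with 1-based ranks; documents
    absent from a list simply contribute nothing for that list.
    """
    weights = weights or [1.0] * len(rankings)
    scores = defaultdict(float)
    for w, ranking in zip(weights, rankings):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += w / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = weighted_rrf([["d1", "d2", "d3"], ["d2", "d4", "d1"]])
print(fused)  # → ['d2', 'd1', 'd4', 'd3']
```

Because RRF uses only ranks, it combines BM25 and RM3 runs without score normalization, which fits the track's fully lexical, single-CPU design goal.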
Bibtex
@inproceedings{DUTH-trec2025-papers-proc-5,
title = {Precision by Design: RM3 and Fusion in Product Search},
author = {Georgios Arampatzis and Symeon Symeonidis and Avi Arampatzis},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
JBNU at TREC 2025 Product Search and Recommendations Track¶
Seong-Hyuk Yim, Jae-Young Park, Woo-Seok Choi, Gi-Taek An, Kyung-Soon Lee
- Paper: https://trec.nist.gov/pubs/trec34/papers/JBNU.product.pdf
- Participant: JBNU
- Runs: jbnu-r01 | jbnu-r02 | jbnu-r03 | jbnu-r04 | jbnu-r05 | jbnu-s01 | jbnu-s02 | jbnu-s03 | jbnu-s04
Abstract
This paper presents the JBNU team’s participation in the TREC 2025 Product Search and Recommendations Track. For the Search Task, we develop two complementary query reformulation strategies: an LLM-driven method that generates structured Lucene-style reformulations to reduce query ambiguity, and a multimodal approach that leverages a vision–language model (VLM) to extract additional semantic cues from web-sourced images. For the Recommendation Task, we adopt a two-stage architecture in which neural retrieval models (dense and learned sparse) generate candidate products, and relation classification—performed by either an LLM or a fine-tuned BERT model—reranks them as substitutes or complements, with final lists refined through weighted score aggregation. Experimental results show that both LLM-based query reformulation and classification-driven reranking consistently improve effectiveness across tasks. Overall, the study demonstrates that lightweight LLM components, when strategically integrated into retrieval and recommendation pipelines, provide a scalable and robust approach to product understanding in the TREC setting.
Bibtex
@inproceedings{JBNU-trec2025-papers-proc-1,
title = {JBNU at TREC 2025 Product Search and Recommendations Track},
author = {Seong-Hyuk Yim and Jae-Young Park and Woo-Seok Choi and Gi-Taek An and Kyung-Soon Lee},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Retrieval Augmented Generation (RAG)¶
NITATREC at TREC RAG 2025: A Report on Exploring Sparse, Dense, and Hybrid Retrieval for Retrieval-Augmented Generation¶
Aparajita Sinha, Kunal Chakma
- Paper: https://trec.nist.gov/pubs/trec34/papers/NIT Agartala.rag.pdf
- Participant: NIT Agartala
- Runs: NITA-Qrels
Abstract
This paper describes our participation in the TREC RAG 2025 shared task, which investigates retrieval-augmented methods for addressing complex information needs using the MS MARCO v2.1 document and segment collections. We submitted systems to all four subtasks: Retrieval (R), Augmented Generation (AG), Retrieval-Augmented Generation (RAG), and Relevance Judgment. For the retrieval task, we explored three approaches: a lexical BM25 baseline, a dense retrieval model based on DPR embeddings, and a hybrid pipeline combining sparse and dense retrieval with cross-encoder reranking. For the generation tasks, we employed an instruction-tuned language model to produce evidence-grounded responses with citations. Experimental results show that the hybrid retrieval system achieves the best retrieval performance, obtaining a MAP of 0.1037, an nDCG@30 of 0.527, and a Recall@100 of 0.158. In the generation tasks, the RAG system achieved a Strict Vital Score of 0.19 and a weighted precision/recall of 0.472, while the AG submission achieved a slightly higher weighted precision/recall of 0.481. These results highlight the importance of combining lexical and semantic retrieval signals to improve Retrieval-Augmented Generation.
Bibtex
@inproceedings{NIT_Agartala-trec2025-papers-proc-1,
title = {NITATREC at TREC RAG 2025: A Report on Exploring Sparse, Dense, and Hybrid Retrieval for Retrieval-Augmented Generation},
author = {Aparajita Sinha and Kunal Chakma},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Processing Queries with Complex Information Needs: University of Glasgow Terrier Team at TREC RAG 2025¶
Fangzheng Tian, Debasis Ganguly, Craig Macdonald
- Paper: https://trec.nist.gov/pubs/trec34/papers/uogTr.rag.pdf
- Participant: uogTr
- Runs: e5_monot5_searchR1 | genSubQ_merge
Abstract
In our participation in the TREC 2025 Retrieval-Augmented Generation (RAG) Track, we investigate how existing RAG frameworks can be adapted to answer descriptive queries with complex information needs. Unlike short topical queries, TREC RAG 2025 queries are expressed as paragraph-length descriptions that often involve multiple aspects of a topic. To address this setting, we adapt two existing RAG paradigms: (1) a single-hop RAG pipeline that decomposes the original query into sub-queries and answers them independently, and (2) an iterative agentic RAG pipeline that decomposes and answers sub-queries in a cascading manner. We submitted two runs instantiating these paradigms. Our results show that directly adapting existing RAG systems designed for short queries provides a practical baseline for this task, but remains limited by the quality of query decomposition and long-horizon generation. According to the evaluation results, the generated answers often address only part of the original information need and fail to cover some vital nuggets. These findings highlight the limitations of current RAG frameworks when applied to complex, multi-aspect queries and suggest directions for future research on query decomposition and generation strategies in this emerging setting.
Bibtex
@inproceedings{uogTr-trec2025-papers-proc-1,
title = {Processing Queries with Complex Information Needs: University of Glasgow Terrier Team at TREC RAG 2025},
author = {Fangzheng Tian and Debasis Ganguly and Craig Macdonald},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Submodular Evidence Selection for Grounded Answer Generation in TREC RAG 2025¶
Zizhen Li, Hai-Tao Yu
- Paper: https://trec.nist.gov/pubs/trec34/papers/WING-II.rag.pdf
- Participant: WING-II
- Runs: wingii-v3-gpt | wingii-3-rl-refined | no-llm-refined | no-llm
Abstract
This paper describes and analyzes four submissions to the TREC 2025 Retrieval-Augmented Generation Answer Generation (AG) task. All runs use the organizer-provided top-100 retrieved segments and differ only in evidence selection, answer construction, and post-hoc refinement. The primary system combines greedy submodular evidence selection, evidence-card compression, and citation-first answer generation, while three companion runs test post-hoc rewriting and a lighter concatenation-style baseline. On the organizer’s AG scores, the primary system family achieves the strongest coverage-oriented results, reaching a strict vital score of 0.35 and sub coverage of 0.51, compared with 0.18 and 0.37 for the unrefined concatenation baseline. Adding a refiner to the primary system preserves those two scores while improving weighted precision and weighted recall from 0.578 to 0.690. A refiner applied to the concatenation baseline reaches the highest weighted scores in the released package, but on weaker strict vital and sub coverage. Taken together, the results suggest that upstream evidence selection and grounding decisions matter more than surface-level rewriting for strong AG performance.
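Greedy submodular evidence selection of the kind described above is typically implemented as maximal-marginal-gain selection under a coverage objective. A minimal sketch, where `nuggets_of` is an assumed stand-in for the paper's actual objective (mapping each segment to the evidence units it covers):

```python
def greedy_submodular_select(segments, nuggets_of, budget):
    """Greedy maximization of a coverage (submodular) objective.

    segments: candidate segment ids (e.g. the provided top-100).
    nuggets_of: dict mapping segment id -> set of covered evidence units.
    Repeatedly picks the segment with the largest marginal coverage gain
    until `budget` segments are chosen or no gain remains.
    """
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(
            (s for s in segments if s not in chosen),
            key=lambda s: len(nuggets_of[s] - covered),
            default=None,
        )
        if best is None or not (nuggets_of[best] - covered):
            break  # every remaining segment adds nothing new
        chosen.append(best)
        covered |= nuggets_of[best]
    return chosen

nuggets = {"s1": {"a", "b"}, "s2": {"b", "c", "d"}, "s3": {"a"}}
print(greedy_submodular_select(["s1", "s2", "s3"], nuggets, 2))  # → ['s2', 's1']
```

Because coverage is submodular, this greedy loop carries the classic (1 − 1/e) approximation guarantee, which is why it pairs naturally with the coverage-oriented AG metrics the abstract reports.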
Bibtex
@inproceedings{WING-II-trec2025-papers-proc-1,
title = {Submodular Evidence Selection for Grounded Answer Generation in TREC RAG 2025},
author = {Zizhen Li and Hai-Tao Yu},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG¶
Paul Owoicho, Jeff Dalton
- Paper: https://trec.nist.gov/pubs/trec34/papers/grilllab.ikat.rag.pdf
- Participant: grilllab
- Runs: grilllab-agentic-gpt4 | grilllab-agentic-gpt4-generation | grilllab-agent-gpt45 | grilllab-gpt45-gen
Abstract
This paper describes the GRILL Lab’s participation in the TREC 2025 Interactive Knowledge Assistance Track (IKAT) and the Retrieval-Augmented Generation (RAG) track, covering four sub-tasks: IKAT Passage Ranking/Response Generation, IKAT Simulation, RAG Retrieval Only, and RAG Full. Our approach centres on a modular, agentic pipeline that pursues high recall through iterative feedback. The system proceeds in three stages: (1) initial candidate generation via BM25; (2) document expansion using Query-by-Document techniques; and (3) an LLM-driven gap analysis phase in which the model identifies informational gaps and formulates supplementary queries. A key architectural feature is a fine-tuned GPT-4.1 nano binary relevance filter, trained on TREC CAsT 2022 and IKAT 2023 relevance judgments, which prunes irrelevant documents between each stage to contain topic drift.
Bibtex
@inproceedings{grilllab-trec2025-papers-proc-1,
title = {GRILL Lab at TREC 2025: Agentic Iterative Retrieval and Gap-Aware Refinement for TREC IKAT and TREC RAG},
author = {Paul Owoicho and Jeff Dalton},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
IIUoT at TREC 2025 Retrieval-Augmented Generation Track¶
Yating Zhang, Haitao Yu
- Paper: https://trec.nist.gov/pubs/trec34/papers/ii_research.rag.pdf
- Participant: ii_research
- Runs: bm25-rz7b-2025a
Abstract
In this paper, we present the University of Tsukuba’s submission to the TREC 2025 Retrieval-Augmented Generation (RAG) Track. Our work addresses the critical challenges of retrieval instability in deep candidate pools and the prevalent issue of hallucinated attributions in Large Language Models (LLMs). We propose a unified framework that tightly couples a progressive retrieval strategy with an evidence-constrained generation mechanism. To handle the limitations of context windows during retrieval, we employ a listwise LLM-based reranker utilizing a sliding window approach, effectively balancing recall and precision across broad document lists. In the generation phase, we introduce a method for claim-citation alignment that enforces a strict structural dependency, ensuring that every generated statement is immediately preceded by and grounded in a specific reference. By constraining the generation process to verified evidence indices, our system aims to produce highly attributable and factually consistent responses for open-domain information needs.
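A sliding-window listwise pass of the kind described can be sketched as follows. Here `rerank_window` stands in for the LLM-based listwise reranker, and the window and stride values are illustrative; moving from the tail of the candidate list toward the head lets strong late-ranked documents bubble up through overlapping windows.

```python
def sliding_window_rerank(docs, rerank_window, window=20, stride=10):
    """One back-to-front sliding-window pass of a listwise reranker.

    docs: candidate list (best first). rerank_window: callable that
    reorders a window of docs (a stand-in for the LLM reranker, whose
    context only fits `window` documents at a time).
    """
    docs = list(docs)
    start = max(len(docs) - window, 0)
    while True:
        docs[start:start + window] = rerank_window(docs[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs

# Toy oracle reranker: sorts a window by a known relevance table.
rel = {"a": 1, "b": 5, "c": 2, "d": 4, "e": 3}
out = sliding_window_rerank(
    ["a", "b", "c", "d", "e"],
    lambda w: sorted(w, key=rel.get, reverse=True),
    window=3, stride=2,
)
print(out)  # → ['b', 'd', 'a', 'e', 'c']
```

The overlap (`stride < window`) is what allows a relevant document found deep in the pool to advance more than one window per pass.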
Bibtex
@inproceedings{ii_research-trec2025-papers-proc-1,
title = {IIUoT at TREC 2025 Retrieval-Augmented Generation Track},
author = {Yating Zhang and Haitao Yu},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
From Nuggets to Clusters: Multi-Level Evidence Structuring for the TREC 2025 RAG Track¶
Reyyan Yeniterzi, Suveyda Yeniterzi
- Paper: https://trec.nist.gov/pubs/trec34/papers/GenAIus.rag.pdf
- Participant: GenAIus
- Runs: nugget-generation | cluster-generation | unique_cluster_cnt | cluster_cnt | citation_cnt | nugget_cnt | norm_nugget_cnt
Abstract
We present GenAIus’s participation in the TREC RAG track, focusing on the augmented generation and relevance judgment tasks. For augmented generation, we build two LLM-based pipelines: a nugget-based approach that converts passages into concise evidence units for response generation, and a cluster-based approach that groups nuggets by subtopic before synthesizing a citation-grounded answer. For the relevance judgment task, we reuse the nuggets, clusters, and generated responses to automatically score passage relevance. We develop five ranking methods based on nugget counts, length-normalized nugget counts, cluster membership, unique cluster coverage, and citation frequency.
Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-4,
title = {From Nuggets to Clusters: Multi-Level Evidence Structuring for the TREC 2025 RAG Track},
author = {Reyyan Yeniterzi and Suveyda Yeniterzi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Keystone-Docs RAG at TREC 2025: Coverage-Aware Few-Document Augmentation via Narrative Decomposition¶
Yukio Uematsu, Koyuki Otani, Tomoyuki Shiroyama, Masaaki Tsuchida
- Paper: https://trec.nist.gov/pubs/trec34/papers/tus.rag.pdf
- Participant: tus
- Runs: uema2lab_B4 | uema2lab_base | uema2lab_rag_org | uema2lab_rrf | uema2lab_rrf_k10 | uema2lab_rag_fewdoc | uema2lab_narrative | uema2lab_segment
Abstract
The proliferation of large language models (LLMs) with long-context capabilities has enabled Retrieval-Augmented Generation (RAG) to process numerous context segments. However, the introduction of a large number of segments has been reported to degrade RAG’s answer generation accuracy. Specifically, recent reports indicate that high-precision retrieval often returns hard negative examples—documents containing similar but differing opinions or facts—which can negatively influence LLM generation by introducing confusion. The Narrative inputs targeted by TREC this year (long texts composed of multiple sentences) require diverse and explanatory responses that account for shifting viewpoints, temporal sequences, and implied relationships, rather than single, fact-oriented answers. We define a Keystone-Doc as a document rich in information covering various aspects of a Narrative. We propose Keystone-Docs RAG, which aims to mitigate the aforementioned problems by selecting a minimal, yet sufficient, set of documents and providing them as RAG context.
Bibtex
@inproceedings{tus-trec2025-papers-proc-1,
title = {Keystone-Docs RAG at TREC 2025: Coverage-Aware Few-Document Augmentation via Narrative Decomposition},
author = {Yukio Uematsu and Koyuki Otani and Tomoyuki Shiroyama and Masaaki Tsuchida},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen¶
Laura Dietz, Bryan Li, James Mayfield, Dawn Lawrie, Eugene Yang, William Walden
- Paper: https://trec.nist.gov/pubs/trec34/papers/HLTCOE.dragun.rag.ragtime.pdf
- Participant: HLTCOE
- Runs: cru-ansR | cru-ansR-conf | cru-ablR | cru-ablR-conf | cru-ansR-bareconf | jcru-ansR | jcru-ansR-all | jcru-ablR | jcru-ablR-all
Abstract
The HLTCOE Evaluation team participated in several tracks focused on Retrieval-Augmented Generation (RAG), including RAG, RAGTIME, DRAGUN, and BioGen. Drawing inspiration from recent work on nugget-based evaluations, we introduce the Crucible system, which scrambles the traditional retrieval → generation → evaluation workflow of a RAG task by automatically curating a set of high-quality question-answer pairs (nuggets) from retrieved documents and then conditioning generation on this set. This not only enables us to study how effectively we can recover the set of gold nuggets for each request but additionally how nugget set quality impacts final performance.
Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-2,
title = {HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen},
author = {Laura Dietz and Bryan Li and James Mayfield and Dawn Lawrie and Eugene Yang and William Walden},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
MIT Lincoln Laboratory at TREC 2025 Retrieval-Augmented Generation Track¶
Daniel Gwon, Kynnedy Smith, Nour Jedidi
- Paper: https://trec.nist.gov/pubs/trec34/papers/MITLL.rag.pdf
- Participant: MITLL
- Runs: full | no-decomp | no-reranker | no-decomp-reranker | full-ret | ret-gemma | ret-no-decomp | ret-no-reranker | ret-splade-only
Abstract
This paper describes MIT Lincoln Laboratory’s participation in the TREC 2025 RAG track. We made submissions to each of the Retrieval (R) and Retrieval-Augmented Generation (RAG) tasks. We focused primarily on various steps in the retrieval pipeline and used a strong, proprietary LLM for the generation step. We describe a multistage retrieval pipeline that combines query decomposition, learned sparse retrieval, and pointwise reranking to retrieve the highest-quality ranked list prior to generation. The LLM uses the ranked list to generate a response with citations using a standard prompt.
Bibtex
@inproceedings{MITLL-trec2025-papers-proc-1,
title = {MIT Lincoln Laboratory at TREC 2025 Retrieval-Augmented Generation Track},
author = {Daniel Gwon and Kynnedy Smith and Nour Jedidi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Team IDACCS at TREC 2025: RAG and RAGTIME Tracks¶
John M. Conroy, Mike Green, Neil P. Molino, Yue “Ray” Wang, Julia S. Yang
- Paper: https://trec.nist.gov/pubs/trec34/papers/IDACCS.rag.ragtime.pdf
- Participant: IDACCS
- Runs: IDACCS-hybrid-gpt4-1 | IDACCS-nugg-gpt-4-1 | IDACCSabstrct-gpt4-1 | IDACCS-hybrid-gpt4o | IDACCS-nugg-gpt-4o
Abstract
This paper gives an overview of team IDA/CCS’s submissions to the 2025 TREC RAG and RAGTIME tracks. Our approach builds on our 2024 RAG (team LAS) and NeuCLIR (team IDA/CCS) approaches. We started from our 2024 NeuCLIR system, fine-tuning it on the NeuCLIR pilot data. We then adapted this approach for both the RAGTIME and RAG generation task submissions. As the 2025 RAGTIME task is multilingual, instead of cross-lingual like 2024, it was natural to look at stratified retrieval and compare it to a multilingual ranking using the NeuCLIR pilot data. We found that stratified query retrieval with reranking, adapted from our RAG 2024 work, was particularly helpful for generating reports within 2K and 10K character limits. In addition, we show work on improving extraction, using occams and attribution. Finally, we include a detailed meta-analysis of the automatic and semi-automatic metrics.
Bibtex
@inproceedings{IDACCS-trec2025-papers-proc-1,
title = {Team IDACCS at TREC 2025: RAG and RAGTIME Tracks},
author = {John M. Conroy and Mike Green and Neil P. Molino and Yue “Ray” Wang and Julia S. Yang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks¶
Yue Wang, John M. Conroy, Neil Molino, Julia Yang, Mike Green
- Paper: https://trec.nist.gov/pubs/trec34/papers/ncsu-las.rag.ragtime.pdf
- Participant: ncsu-las
- Runs: LAS-agentic-RAG-selector | LAS-agentic-RAG-agent | single-agent-trim | selector-agent-trim | LAS_con-que | LAS_sep-que | LAS_con-que-con-nug | LAS_con-que-sep-nug
Abstract
This report describes submissions by the Laboratory for Analytic Sciences to the TREC 2025 RAG and RAGTIME tracks. By leveraging autonomous agent workflows, including query decomposition, planner-executor architectures, and ensemble retrieval techniques (e.g., BM25, SPLADE, T5 sentence embeddings), we examine whether “Agentic RAG” can surpass traditional RAG systems in terms of retrieval relevance, groundedness, factuality, and citation quality. Our evaluations using Open-RAG-Eval, Autonuggetizer, and other metrics indicate gains in nugget coverage, groundedness, and citation accuracy, albeit with trade-offs in retrieval relevance and factual consistency. In addition, we explored trade-offs in retrieval methodologies for the TREC RAG retrieval-only task, and agentic generation for the Report Generation task in RAGTIME.
Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-1,
title = {Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks},
author = {Yue Wang and John M. Conroy and Neil Molino and Julia Yang and Mike Green},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
UTokyo-HitU at TREC 2025 RAG Track: HyDE-Enhanced Sparse-Dense Retrieval Fusion with LLM Reranking¶
Sho Fukada, Atsushi Keyaki, Yusuke Matsui
- Paper: https://trec.nist.gov/pubs/trec34/papers/UTokyo.rag.pdf
- Participant: UTokyo
- Runs: qwen_splade | 4method_merge | gpt41 | r_2method_ag_gpt41 | r_4method_ag_gpt41
Abstract
This paper describes our submission to TREC RAG 2025 from the University of Tokyo and Hitotsubashi University. Our approach integrates hybrid retrieval combining sparse retrievers (BM25 and SPLADE) with dense retrievers (BGE-small and Qwen3-Embedding-0.6B), Hypothetical Document Embeddings (HyDE) for query augmentation, and LLM-based reranking. A key contribution is our HyDE Vector Mix strategy, which creates a weighted combination of original query and hypothetical answer embeddings with a mixing ratio alpha. Our four-method hybrid retrieval system achieved first place in the Retrieval task among 12 participating teams (46 runs), with nDCG@30 of 0.693, nDCG@100 of 0.613, and recall@100 of 0.257. We participated in all three tasks: Retrieval (R), Augmented Generation (AG), and RAG.
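The HyDE Vector Mix strategy above amounts to a convex combination of the two embeddings. A minimal sketch, assuming the form e = α·e_hyde + (1 − α)·e_query with L2 normalization; the exact α and normalization used by the authors may differ:

```python
import math

def hyde_vector_mix(query_emb, hyde_emb, alpha=0.5):
    """Weighted mix of query and hypothetical-document embeddings.

    Returns alpha * hyde + (1 - alpha) * query, L2-normalized so the
    result can be used directly for cosine/dot-product retrieval.
    (Illustrative reconstruction of the paper's mixing rule.)
    """
    mixed = [alpha * h + (1 - alpha) * q for q, h in zip(query_emb, hyde_emb)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]

# Orthogonal toy vectors: alpha=0.5 lands exactly between them.
e = hyde_vector_mix([1.0, 0.0], [0.0, 1.0], alpha=0.5)
print(e)  # unit vector halfway between the query and HyDE directions
```

Sweeping α trades off faithfulness to the user's wording (α → 0) against the richer vocabulary of the hypothetical answer (α → 1).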
Bibtex
@inproceedings{UTokyo-trec2025-papers-proc-1,
title = {UTokyo-HitU at TREC 2025 RAG Track: HyDE-Enhanced Sparse-Dense Retrieval Fusion with LLM Reranking},
author = {Sho Fukada and Atsushi Keyaki and Yusuke Matsui},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Justification Retrieval with LLMs, Retrieval-Augmented Generation, and Hybrid Labels¶
Georgios Arampatzis, Vasileios Perifanis, Avi Arampatzis
- Paper: https://trec.nist.gov/pubs/trec34/papers/DUTH.rag.pdf
- Participant: DUTH
- Runs: duth_stablelm2_rj_v1 | duth.hybrid.qwen.cal | duth.hybrid.stableri | duth.hybrid.qwencon | hybrid.stable.loose2
Abstract
This paper presents a hybrid justification labeling framework for retrieval-augmented generation (RAG), focusing exclusively on the Relevance Judgment (RJ) subtask of the TREC 2025 RAG Track. The proposed approach integrates open-weight large language models (LLMs) with traditional retrieval signals and confidence calibration mechanisms. Specifically, we combine deterministic pre-labeling using Qwen2.5-3B-Instruct and StableLM 2 1.6B Chat with multi-stage confidence normalization and lexical-overlap heuristics. This design enables small open-weight models to approximate the reasoning behavior of larger proprietary systems while remaining transparent and fully reproducible. We describe the end-to-end pipeline used for both automatic and semi-manual relevance judgment runs, analyze their validation consistency, and examine the impact of calibration parameters on justification quality and coverage. Empirical results indicate that hybrid confidence blending improves mid-range justification reliability and reduces variance across topics. All runs were validated using the official evaluation infrastructure and correspond solely to submissions for the RAG Relevance Judgment task.
Bibtex
@inproceedings{DUTH-trec2025-papers-proc-3,
title = {Justification Retrieval with LLMs, Retrieval-Augmented Generation, and Hybrid Labels},
author = {Georgios Arampatzis and Vasileios Perifanis and Avi Arampatzis},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
WaterlooClarke at TREC 2025¶
Siqing Huo, Charles L. A. Clarke
- Paper: https://trec.nist.gov/pubs/trec34/papers/WaterlooClarke.dragun.rag.pdf
- Participant: WaterlooClarke
- Runs: ronly_combined | ronly_auto_plan | ronly_garag | ronly_nuggetizer | ronly_auto_selected | auto_selected | combined | auto_plan | garag | nuggetizer
Abstract
Participating as the WaterlooClarke group, we focused on the RAG track and also submitted runs for the DRAGUN track. For the full retrieval-augmented generation (RAG) task, we explored five pipelines: (1) Nuggetizer; (2) Generate an Answer and support with Retrieved Evidence (GARE); (3) Automatic Retrieval and Generation Plan; (4) Combined; and (5) Automatically Selected Best Response. For the DRAGUN task, we explored one-shot prompting with feedback in the loop.
Bibtex
@inproceedings{WaterlooClarke-trec2025-papers-proc-1,
title = {WaterlooClarke at TREC 2025},
author = {Siqing Huo and Charles L. A. Clarke},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
RAG TREC Instrument for Multilingual Evaluation (RAGTIME)¶
WueRAG at RAGTIME 2025: Retrieval, Fusion, and Citation for Grounded Report Generation¶
Julia Wunderle, Julian Schubert, Joachim Baumeister, Andreas Hotho
- Paper: https://trec.nist.gov/pubs/trec34/papers/WueRAG.ragtime.pdf
- Participant: WueRAG
- Runs: WueRAG_2025_07_08_20_05_00 | WueRAG_2025_08_22
Abstract
We present WueRAG, a retrieval-augmented generation pipeline for the TREC 2025 RAGTIME English report generation task. Our approach combines query reformulation, remote candidate retrieval, and local per-topic reranking using hybrid dense and lexical fusion. To ensure citation accuracy, grounding is enforced at two levels: first, through a generation phase that requires bracketed citations for factual claims, and second, via a postprocessing filter that removes any remaining unverified sentences. On the official evaluation, WueRAG achieved the highest F1 score (0.421) for the English subtask, indicating that combining multi-stage retrieval with explicit grounding constraints can effectively balance attribution accuracy and report quality.
Bibtex
@inproceedings{WueRAG-trec2025-papers-proc-1,
title = {WueRAG at RAGTIME 2025: Retrieval, Fusion, and Citation for Grounded Report Generation},
author = {Julia Wunderle and Julian Schubert and Joachim Baumeister and Andreas Hotho},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Question-Driven Multilingual Retrieval and Report Generation for the RAGTIME Track at TREC 2025¶
Suveyda Yeniterzi, Reyyan Yeniterzi
- Paper: https://trec.nist.gov/pubs/trec34/papers/GenAIus.ragtime.pdf
- Participant: GenAIus
- Runs: dry_all3_eng_1000 | dry_gq_eng_1000 | genaius-llama3-3-70B | genaius-gpt-4o | genaius-gpt-oss-120b | genaius-gpt-oss-20b | genaius-question | genaius-cluster
Abstract
We present GenAIus’s participation in the TREC 2025 RAGTIME track, focusing on multilingual retrieval and multilingual report generation in the news domain. Our approach follows a question-driven framework in which a set of targeted questions is generated for each report request and used to guide both document retrieval and report synthesis. For retrieval, we rely on the organizers’ multilingual search API and introduce a dynamic merging strategy that allocates an equal retrieval quota per generated question and aggregates scores across repeated document occurrences. For report generation, we explore two pipelines: a question-based approach that generates short, cited answers from multiple retrieved documents and synthesizes them into a final report, and a cluster-based approach that extracts nuggets from retrieved documents, clusters them by semantic similarity, and generates reports grounded in these structured clusters. We experiment with both proprietary and open-source LLMs for question generation, including GPT-4o and Llama3.3-70B. Our submissions achieve strong performance across tasks, including first place in MAP in the multilingual retrieval task and top rankings in several report-generation metrics. These results highlight the effectiveness of question-driven retrieval and structured evidence synthesis for multilingual report generation.
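The dynamic merging strategy above (an equal retrieval quota per generated question, with scores aggregated across repeated document occurrences) can be sketched as follows; the function name and sum-based aggregation are illustrative assumptions about the paper's exact rule.

```python
from collections import defaultdict

def merge_per_question(results_per_question, quota, top_k):
    """Equal-quota merge of per-question retrieval results.

    results_per_question: one ranked (doc_id, score) list per generated
    question. Takes `quota` hits from each list and sums scores for
    documents retrieved by multiple questions, rewarding cross-question
    agreement.
    """
    merged = defaultdict(float)
    for hits in results_per_question:
        for doc, score in hits[:quota]:
            merged[doc] += score
    return sorted(merged, key=merged.get, reverse=True)[:top_k]

q1 = [("d1", 0.9), ("d2", 0.8), ("d3", 0.2)]
q2 = [("d2", 0.7), ("d4", 0.6)]
print(merge_per_question([q1, q2], quota=2, top_k=3))  # → ['d2', 'd1', 'd4']
```

The per-question quota keeps any single broad question from crowding out evidence for the report's narrower aspects.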
Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-6,
title = {Question-Driven Multilingual Retrieval and Report Generation for the RAGTIME Track at TREC 2025},
author = {Suveyda Yeniterzi and Reyyan Yeniterzi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen¶
Laura Dietz, Bryan Li, James Mayfield, Dawn Lawrie, Eugene Yang, William Walden
- Paper: https://trec.nist.gov/pubs/trec34/papers/HLTCOE.dragun.rag.ragtime.pdf
- Participant: HLTCOE
- Runs: cru-ansR- | cru-ansR-conf- | cru-ablR- | cru-ansR-LSR- | cru-ansR-PlaidX- | cru-ansR-mostcommon- | cru-ansR-bareconf- | cru-ablR-LSR- | cru-ablR-PlaidX- | cru-ablR-conf-
Abstract
The HLTCOE Evaluation team participated in several tracks focused on Retrieval-Augmented Generation (RAG), including RAG, RAGTIME, DRAGUN, and BioGen. Drawing inspiration from recent work on nugget-based evaluations, we introduce the Crucible system, which scrambles the traditional retrieval → generation → evaluation workflow of a RAG task by automatically curating a set of high-quality question-answer pairs (nuggets) from retrieved documents and then conditioning generation on this set. This not only enables us to study how effectively we can recover the set of gold nuggets for each request but additionally how nugget set quality impacts final performance.
Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-2,
title = {HLTCOE Evaluation Team at TREC 2025: RAG, RAGTIME, DRAGUN, and BioGen},
author = {Laura Dietz and Bryan Li and James Mayfield and Dawn Lawrie and Eugene Yang and William Walden},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Improving Completeness in Deep Research Agents through Targeted Enrichment¶
Jesse Wonnink
- Paper: https://trec.nist.gov/pubs/trec34/papers/UvA.ragtime.pdf
- Participant: UvA
- Runs: zetaalpha
Abstract
Deep research agents represent a significant advance in AI-assisted information synthesis, capable of conducting comprehensive investigations that traditionally required substantial human effort. However, ensuring completeness in automatically generated research reports remains challenging: existing systems rely on ad-hoc query decomposition through prompt engineering, providing no formal guarantees about coverage or diversity, and evaluation frameworks often assess single dimensions rather than holistic report quality. This thesis addresses these limitations through three primary contributions. First, we propose a multi-dimensional framework that operationalizes completeness as three interdependent aspects: coverage (breadth and relevance of information), grounding (citation accuracy and factual consistency), and presentation quality (clarity and structural coherence). Second, we present HERO (High Enrichment Retrieval Orchestrator), a hierarchical deep research architecture that combines submodular optimization for query diversification with a novel two-stage enrichment mechanism that identifies and addresses information gaps through targeted follow-up investigation. Third, we conduct comprehensive evaluation across both academic (ScholarQABench) and general knowledge (DeepResearchGym) domains, enabling holistic evaluation of Deep Research agents. HERO achieves state-of-the-art performance on both benchmarks, with the highest coverage metrics (Key Point Recall of 67.63 on DeepResearchGym), strongest grounding (Citation F1: 91.57), and superior presentation quality scores. Ablation studies reveal that submodular optimization and hierarchical enrichment each contribute distinct improvements, with synergistic effects when combined. 
However, important limitations remain: our analysis reveals systematic sycophantic bias where the system adapts argumentative positions to match query framing, demonstrating that architectural improvements alone cannot overcome inherent behavioral patterns in foundation models. This work contributes both a concrete system demonstrating measurable improvements in report completeness and a framework for multi-dimensional evaluation of deep research agents. As these systems evolve toward deployment in high-stakes domains, the architectural principles and evaluation methodology established here provide a foundation for building reliable, comprehensive AI research assistants.
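The submodular query-diversification step mentioned above can be illustrated with a generic greedy selection under a coverage-style objective: repeatedly pick the candidate query least similar to anything already selected. This is a sketch of the general technique, not the HERO implementation; the similarity measure and names are assumptions.

```python
def greedy_diverse_queries(candidates, sim, k):
    """Greedy diversification: repeatedly select the candidate query
    with the largest marginal gain, where gain is the distance to the
    closest already-selected query (1 - max similarity)."""
    selected = [candidates[0]]                   # seed with the first candidate
    while len(selected) < k:
        best = max(
            (c for c in candidates if c not in selected),
            key=lambda c: 1.0 - max(sim(c, s) for s in selected),
        )
        selected.append(best)
    return selected

# Toy similarity: Jaccard overlap of word sets.
def jaccard(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

qs = ["solar panel cost", "solar panel price", "wind turbine noise"]
print(greedy_diverse_queries(qs, jaccard, k=2))
```

The near-duplicate "solar panel price" is skipped in favor of the topically distinct "wind turbine noise", which is the behavior diversification objectives are designed to produce.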
Bibtex
@inproceedings{UvA-trec2025-papers-proc-1,
title = {Improving Completeness in Deep Research Agents through Targeted Enrichment},
author = {Jesse Wonnink},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Team IDACCS at TREC 2025: RAG and RAGTIME Tracks¶
John M. Conroy, Mike Green, Neil P. Molino, Yue “Ray” Wang, Julia S. Yang
- Paper: https://trec.nist.gov/pubs/trec34/papers/IDACCS.rag.ragtime.pdf
- Participant: IDACCS
- Runs: ragtime_run_75_topics_25_docs_per_query_per_languge_1_20250707_00_08_occams_budget_4000_hybrid_raranked_concat_by_lang | ragtime_run_100_topics_25_docs_per_query_per_languge_1_rus_arb_zho_eng_20250708_14_33_occams_budget_1000_hybrid_10_reranked | ragtime_run_100_topics_25_docs_per_query_per_languge_1_rus_arb_zho_eng_20250708_14_33_occams_budget_1000_nuggets_10_reranked | IDACCS_nugget_4.1 | IDACCS_hybrid_4.1 | IDACCS_nugget_tb4.1 | IDACCS_extract_4.1 | IDACCS_hybridtb_4.1
Abstract
This paper gives an overview of team IDA/CCS’s submissions to the 2025 TREC RAG and RAGTIME tracks. Our approach builds on our 2024 RAG (team LAS) and NeuCLIR (team IDA/CCS) approaches. We started from our 2024 NeuCLIR approach, tuning it on the NeuCLIR pilot data, and then adapted it for both the RAGTIME and RAG generation task submissions. As the 2025 RAGTIME task is multilingual, rather than cross-lingual as in 2024, it was natural to examine stratified retrieval and compare it with multilingual ranking on the NeuCLIR pilot data. We found that stratified query retrieval with reranking, adapted from our RAG 2024 work, was particularly helpful for generating reports within 2K and 10K character limits. In addition, we show work on improving extraction, using occams and attribution. Finally, we include a detailed meta-analysis of the automatic and semi-automatic metrics.
Bibtex
@inproceedings{IDACCS-trec2025-papers-proc-1,
title = {Team IDACCS at TREC 2025: RAG and RAGTIME Tracks},
author = {John M. Conroy and Mike Green and Neil P. Molino and Yue ``Ray'' Wang and Julia S. Yang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks¶
Yue Wang, John M. Conroy, Neil Molino, Julia Yang, Mike Green
- Paper: https://trec.nist.gov/pubs/trec34/papers/ncsu-las.rag.ragtime.pdf
- Participant: ncsu-las
- Runs: las_ag_sel_29 | las_ag_sel_all_4.1 | las_ag_sel_28 | las_ag_sel_new_prompt | las_ag_round_robin
Abstract
This report describes submissions by the Laboratory for Analytic Sciences to the TREC 2025 RAG and RAGTIME tracks. By leveraging autonomous agent workflows, including query decomposition, planner-executor architectures, and ensemble retrieval techniques (e.g., BM25, SPLADE, T5 sentence embeddings), we examine whether “Agentic RAG” can surpass traditional RAG systems in terms of retrieval relevance, groundedness, factuality, and citation quality. Our evaluations using Open-RAG-Eval, Autonuggetizer, and other metrics indicate gains in nugget coverage, groundedness, and citation accuracy, albeit with trade-offs in retrieval relevance and factual consistency. In addition, we explore trade-offs in retrieval methodologies for the TREC RAG retrieval-only task, and agentic generation for the Report Generation task in RAGTIME.
Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-1,
title = {Laboratory for Analytic Sciences in TREC 2025 RAG and RAGTIME Tracks},
author = {Yue Wang and John M. Conroy and Neil Molino and Julia Yang and Mike Green},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Hybrid Sparse-Neural Fusion for Passage Retrieval¶
Georgios Arampatzis, Avi Arampatzis
- Paper: https://trec.nist.gov/pubs/trec34/papers/DUTH.ragtime.pdf
- Participant: DUTH
- Runs: duth_mlir_eng_rrf | duth_mlir_xenc | duth-mlir-mlm6 | duth-mlir-mlm6loc | mlir-tblocal | mlir-pybm25 | mlir-mlm12 | mlir-elec | mlir-tb | mlir-fused | mlir-rrf-report | xenc-report | eng_mlm6 | eng_mlm6loc | eng_fused | tblocal | pybm25 | mlm12 | electra | tb
Abstract
This paper studies multilingual information retrieval (MLIR) and report generation under retrieval-augmented evaluation settings, with an emphasis on robustness, reproducibility, and interpretability. We focus on efficient and lightweight transformer-based cross-encoder architectures for passage re-ranking in a multilingual retrieval scenario. Our approach follows a two-stage retrieval framework. In the first stage, BM25 is used for initial candidate selection, while in the second stage lightweight transformer-based cross-encoders (MiniLM, ELECTRA, and TinyBERT) are applied for passage re-ranking. To integrate multiple model predictions, we employ Reciprocal Rank Fusion (RRF), enabling robust aggregation across heterogeneous ranking signals. Unlike multilingual fine-tuning or agent-based approaches, our system relies exclusively on English-only cross-encoders applied in a zero-shot setting. This design allows us to assess the cross-lingual generalization capacity of compact re-ranking models under strict efficiency and reproducibility constraints. Experimental results on the validation phase show that, while our runs do not match the absolute effectiveness of large-scale multi-agent or generative systems, they achieve stable and interpretable performance given their lightweight architecture. Overall, the findings highlight the trade-off between retrieval effectiveness and computational efficiency, and demonstrate that compact re-ranking architectures combined with simple fusion strategies remain viable baselines for multilingual retrieval in low-resource and efficiency-constrained settings.
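Reciprocal Rank Fusion, which the DUTH runs use to aggregate the heterogeneous re-ranker outputs, has a standard formulation: each list contributes 1 / (k + rank) for every document it contains, with k = 60 by convention. A minimal sketch (the toy run names are illustrative):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked lists with Reciprocal Rank Fusion.

    rankings: list of ranked lists of doc ids (best first).
    The constant k damps the influence of very top ranks, so agreement
    across lists matters more than any single system's top pick.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "d2" is ranked well by both systems, so it fuses to the top.
bm25 = ["d1", "d2", "d3"]
xenc = ["d2", "d3", "d1"]
print(reciprocal_rank_fusion([bm25, xenc]))
```

Because RRF operates only on ranks, it needs no score normalization across the MiniLM, ELECTRA, and TinyBERT re-rankers, which is why it suits heterogeneous ensembles like this one.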
Bibtex
@inproceedings{DUTH-trec2025-papers-proc-2,
title = {Hybrid Sparse-Neural Fusion for Passage Retrieval},
author = {Georgios Arampatzis and Avi Arampatzis},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Tip of the Tongue (TOT)¶
OVERVIEW OF THE TREC 2025 TIP-OF-THE-TONGUE TRACK¶
Jaime Arguello, Fernando Diaz, Maik Fröbe, To Eun Kim, Bhaskar Mitra
Abstract
Tip-of-the-tongue (ToT) known-item retrieval involves re-finding an item for which the searcher does not reliably recall an identifier. ToT information requests (or queries) are verbose and tend to include several complex phenomena, making them especially difficult for existing information retrieval systems. The TREC 2025 ToT track focused on a single ad-hoc retrieval task. This year, we extended the track to the general domain and incorporated different sets of test queries from diverse sources, namely the MS-ToT dataset, manual topic development, and LLM-based synthetic query generation. In total, 9 groups (including the track coordinators) submitted 32 runs.
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-2,
title = {OVERVIEW OF THE TREC 2025 TIP-OF-THE-TONGUE TRACK},
author = {Jaime Arguello and Fernando Diaz and Maik Fröbe and To Eun Kim and Bhaskar Mitra},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Webis at TREC 2025: Tip-of-the-Tongue Track and AutoJudge¶
Maik Fröbe, Jan Heinrich Merker, Eric Oliver Schmidt, Martin Potthast, Matthias Hagen
- Paper: https://trec.nist.gov/pubs/trec34/papers/webis.tot.pdf
- Participant: webis
- Runs: webis-bm25-gpt-oss | webis-bm25-llama
Abstract
This paper describes the Webis Group’s participation in the 2025 edition of TREC. We participated in the Tip-of-the-Tongue track and the pilot round of the AutoJudge track. For the Tip-of-the-Tongue track, we re-executed the query relaxation strategies developed in our previous years’ submissions (removing terms that likely reduce retrieval effectiveness). For the pilot round of the AutoJudge track, we apply axiomatic thinking, using preferences and features from all 29 axiomatic constraints for retrieval-augmented generation implemented in the ir_axioms package (evaluation is in progress).
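Query relaxation of the kind described above, dropping terms that likely hurt retrieval, can be sketched as a greedy search over one-term deletions. The scoring function and the toy hedge list below are illustrative stand-ins, not the Webis implementation:

```python
def relax_query(terms, score_fn):
    """Greedy one-term relaxation: try dropping each term and keep the
    variant the scoring function likes best (the original query wins
    ties). score_fn stands in for a retrieval-effectiveness estimate,
    e.g. the top-ranked document's score under the relaxed query.
    """
    best = list(terms)
    best_score = score_fn(best)
    for i in range(len(terms)):
        candidate = terms[:i] + terms[i + 1:]
        s = score_fn(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best

# Toy scorer: hedging words ("maybe", "possibly") hurt the score.
HEDGES = {"maybe", "possibly", "think"}
toy_score = lambda q: sum(0 if t in HEDGES else 1 for t in q) - 0.5 * sum(t in HEDGES for t in q)
print(relax_query(["movie", "maybe", "french"], toy_score))
```

Iterating the function to a fixed point would remove multiple harmful terms; a single pass is shown here for brevity.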
Bibtex
@inproceedings{webis-trec2025-papers-proc-1,
title = {Webis at TREC 2025: Tip-of-the-Tongue Track and AutoJudge},
author = {Maik Fröbe and Jan Heinrich Merker and Eric Oliver Schmidt and Martin Potthast and Matthias Hagen},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks¶
Hongyu Li, Yuming Zhang, Junyu Zhou, Yongwei Zhang, Shanshan Jiang, Bin Dong
- Paper: https://trec.nist.gov/pubs/trec34/papers/SRCB.mllm.tot.pdf
- Participant: SRCB
- Runs: scrb-tot-01 | scrb-tot-02 | scrb-tot-03 | scrb-tot-04
Abstract
This paper reports the performance of SRCB’s systems in the Million LLMs and Tip-of-the-Tongue tracks. For the Million LLMs track, we mainly rely on powerful LLMs and various methods to construct the missing training labels; we then use the constructed training data and devise several approaches to achieve the ranking target, and conduct experiments. For the Tip-of-the-Tongue task, we propose a retrieval framework that integrates dense and LLM-based components. Original queries are transformed into cue lists, and additional data are used to fine-tune both the dense retriever and the re-ranker. Furthermore, retrieval results from LLMs are incorporated to supplement the re-ranker, and the final ranking is produced using an LLM reranker.
Bibtex
@inproceedings{SRCB-trec2025-papers-proc-1,
title = {SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks},
author = {Hongyu Li and Yuming Zhang and Junyu Zhou and Yongwei Zhang and Shanshan Jiang and Bin Dong},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Hedge Removal, Query Drift, and the Simulation Gap in Tip-of-the-Tongue Retrieval¶
Bruno N. Sotic, Jaap Kamps
- Paper: https://trec.nist.gov/pubs/trec34/papers/UAmsterdam.tot.pdf
- Participant: UAmsterdam
- Runs: bm25_hedge_aware | bm25_hedges_neg | bm25_negations | rm3_hedges | rm3_negations | rm3_hedge_neg
Abstract
Retrieving known items from vague, verbose queries (the “Tip-of-the-Tongue” or ToT problem) poses a unique challenge for information retrieval. In the TREC 2025 ToT Track, we investigated linguistic preprocessing strategies (such as hedge removal and negation penalties) and hybrid retrieval methods across simulated and human-generated queries. Our experiments reveal a substantial divergence between LLM-simulated development data and the official test set. Hedge removal yielded large gains on verbose synthetic queries (+25.4% NDCG on dev3), but minimal improvement on sparse human queries (+2.7% on test, not significant). Negation penalties produced no measurable effect across all conditions. Pseudo-Relevance Feedback (RM3) consistently degraded performance, amplifying query drift rather than resolving vocabulary mismatch. Analysis of the recall pool reveals a fundamental bottleneck: only 32% of target documents appeared among the top 100 BM25 results on the test set. Within this constraint, hybrid dense reranking improved early precision by 10.6% on hard queries where lexical matching failed, but degraded performance on queries with strong term overlap. We conclude that systems optimized on LLM-simulated ToT data risk overfitting to synthetic linguistic patterns that do not reflect the sparse, fragmented nature of real human memory retrieval.
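The negation-penalty idea tested above (which the authors found to have no measurable effect) amounts to subtracting a weighted match score for terms under a negation cue. A minimal sketch, with plain term overlap standing in for BM25 to stay self-contained:

```python
def score_with_negation_penalty(doc_terms, pos_terms, neg_terms, lam=1.0):
    """Score = matches on ordinary query terms minus a penalty for
    matching terms the searcher negated ("it was NOT a musical").
    lam controls the penalty strength; the value is illustrative.
    """
    doc = set(doc_terms)
    positive = len(doc & set(pos_terms))
    penalty = len(doc & set(neg_terms))
    return positive - lam * penalty

# The searcher remembers a French film that is *not* a musical.
doc_a = ["french", "drama", "film"]
doc_b = ["french", "musical", "film"]
pos, neg = ["french", "film"], ["musical"]
print(score_with_negation_penalty(doc_a, pos, neg))  # 2.0
print(score_with_negation_penalty(doc_b, pos, neg))  # 1.0
```

In a real system the two components would be separate BM25 queries over the positive and negated term sets; the set-overlap scorer here only shows the shape of the intervention.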
Bibtex
@inproceedings{UAmsterdam-trec2025-papers-proc-2,
title = {Hedge Removal, Query Drift, and the Simulation Gap in Tip-of-the-Tongue Retrieval},
author = {Bruno N. Sotic and Jaap Kamps},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking¶
Wenxin Zhou, Ritesh Mehta, Anthony Miyaguchi
- Paper: https://trec.nist.gov/pubs/trec34/papers/DS@GT.tot.pdf
- Participant: DS@GT
- Runs: bge-m3 | gmn-rerank-500 | lambdamart-rerank | gm27q-LMART-1000 | gm27q-comb-500 | gemini-retrieval | top_model_dense
Abstract
We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and LLM-based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges LLM-based retrieval, sparse (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and LLM-based reranking. To support model training, we generate 5000 synthetic ToT queries using LLMs. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.
Bibtex
@inproceedings{DS_GT-trec2025-papers-proc-1,
title = {DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking},
author = {Wenxin Zhou and Ritesh Mehta and Anthony Miyaguchi},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
UFMG at TREC 2025: Retriever-Aligned Query Rewriting for Tip-of-the-Tongue Retrieval¶
Arthur Pontes Nader, Rodrygo L. T. Santos
- Paper: https://trec.nist.gov/pubs/trec34/papers/ufmg.tot.pdf
- Participant: ufmg
- Runs: runid1 | runid2 | runid3 | runid4
Abstract
Tip-of-the-Tongue queries are difficult to rewrite due to vague user descriptions and limited supervised training data. We address this by generating rewrite preference pairs automatically from dense and cross-encoder retrieval scores, yielding a reliable dataset for fine-tuning directly on ranker preferences. We compare prompt tuning, domain-specific DPO, and general DPO models within a Tree-of-Thoughts rewriting and retrieval pipeline. Results on the TREC Tip-of-the-Tongue track show steady gains from prompt tuning to DPO, with a GPT-5-nano ensemble of all runs achieving the best performance among our submissions (NDCG@1000 = 0.277, MRR@1000 = 0.199).
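Turning ranker scores into DPO preference pairs, as described above, might be sketched as follows; the pairing rule, score margin, and names are illustrative assumptions rather than the authors' procedure:

```python
def build_preference_pairs(rewrites_with_scores, margin=0.05):
    """Build (chosen, rejected) DPO pairs from retrieval-scored query
    rewrites: any two rewrites whose ranker scores differ by at least
    `margin` form a pair, higher-scored rewrite as 'chosen'. The margin
    filters noisy near-ties out of the training signal.
    """
    pairs = []
    items = sorted(rewrites_with_scores, key=lambda x: x[1], reverse=True)
    for i, (better, score_b) in enumerate(items):
        for worse, score_w in items[i + 1:]:
            if score_b - score_w >= margin:
                pairs.append({"chosen": better, "rejected": worse})
    return pairs

# Three hypothetical rewrites of one ToT query, scored by a ranker.
rewrites = [("a 90s french film about a red balloon", 0.81),
            ("french movie balloon", 0.62),
            ("movie with balloon maybe", 0.60)]
print(len(build_preference_pairs(rewrites)))
```

The two closely scored rewrites (0.62 vs. 0.60) do not form a pair, so only confident preferences reach the DPO fine-tuning stage.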
Bibtex
@inproceedings{ufmg-trec2025-papers-proc-1,
title = {UFMG at TREC 2025: Retriever-Aligned Query Rewriting for Tip-of-the-Tongue Retrieval},
author = {Arthur Pontes Nader and Rodrygo L. T. Santos},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval¶
Debayan Mukhopadhyay, Utshab Kumar Ghosh, Shubham Chatterjee
- Paper: https://trec.nist.gov/pubs/trec34/papers/mst.tot.pdf
- Participant: mst
- Runs: llama_norm_fusion_z | llama_norm_fusion_v2
Abstract
Retrieving known items from vague, partial, or inaccurate descriptions, a phenomenon known as Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge for modern information retrieval systems. Our approach integrates a single call to an 8B-parameter Large Language Model (LLM), using a model-agnostic prompt, for query reformulation and controlled expansion. This bridges the gap between ill-formed ToT queries and well-specified information needs in scenarios where Pseudo-Relevance Feedback (PRF) based expansion is ineffective because poor first-stage recall and ranking on the raw queries lead to expansion from the wrong documents. Importantly, the LLM employed in our framework was deliberately kept generic: it was not fine-tuned for ToT queries, nor adapted to any specific content domains (e.g., movies, books, landmarks). This design choice underscores that the observed gains stem from the proposed prompting and expansion strategy itself, rather than from task- or domain-specific specialization of the underlying model. Rewritten queries are then processed by a multi-stage retrieval pipeline consisting of an initial sparse retrieval stage (BM25), followed by an ensemble of bi-encoder and late-interaction re-rankers (Contriever, E5-large-v2, and ColBERTv2), cross-encoder re-ranking using monoT5, and a final list-wise re-ranking stage powered by a 72B-parameter LLM (Qwen 2.5 Instruct, 4-bit quantized). Experiments on the datasets provided in the 2025 TREC-ToT track show that supplying original user queries directly to this otherwise competitive multi-stage ranking pipeline still yields poor retrieval effectiveness, highlighting the central role of query formulation in ToT scenarios.
In contrast, our lightweight LLM-based pre-retrieval query transformation improves initial recall by a net 20.61%, and subsequent re-ranking using the rewritten queries improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98% over the same pipeline operating on raw queries. It thus serves as a highly cost-effective intervention, unlocking substantial performance gains and enabling downstream retrievers and rankers to realize their full potential in Tip-of-the-Tongue retrieval.
Bibtex
@inproceedings{mst-trec2025-papers-proc-1,
title = {Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval},
author = {Debayan Mukhopadhyay and Utshab Kumar Ghosh and Shubham Chatterjee},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Bridging Lexical and Neural Ranking for Topic-Oriented Retrieval¶
Georgios Arampatzis, Konstantina Safouri, Avi Arampatzis
- Paper: https://trec.nist.gov/pubs/trec34/papers/DUTH.tot.pdf
- Participant: DUTH
- Runs: lex-stronger-test | bm25-porterblk-test | lex-stronger-testv2
Abstract
This paper presents the DUTH team’s participation in the TREC Tip-of-the-Tongue (TREC-TOT) 2025 shared task. Although we explored a hybrid retrieval pipeline combining BM25 with Sentence-BERT dense embeddings and a MiniLM-based cross-encoder during development, our official submitted runs rely exclusively on unsupervised lexical retrieval models implemented in the Terrier and PyTerrier frameworks. The submitted systems integrate multiple probabilistic models—including BM25, Divergence From Randomness variants, and query-likelihood language models—with RM3 pseudo-relevance feedback and Reciprocal Rank Fusion. This multi-stage lexical architecture aims to maximize early precision and robust recall for underspecified Tip-of-the-Tongue queries. Experiments on the official TREC-TOT 2025 development split show that the fused lexical pipelines achieve strong performance across early ranking and recall-based metrics, highlighting the competitiveness of carefully tuned lexical ensembles for memory-based retrieval. Hybrid dense re-ranking demonstrated improvements during development but was not part of the official submission.
Bibtex
@inproceedings{DUTH-trec2025-papers-proc-4,
title = {Bridging Lexical and Neural Ranking for Topic-Oriented Retrieval},
author = {Georgios Arampatzis and Konstantina Safouri and Avi Arampatzis},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Video Question Answering (VQA)¶
Video Question Answering (VQA) 2025 Track¶
George Awad, Sanjay Purushotham, Afzal Godil
Abstract
Recent advancements in large multimodal models have significantly improved AI’s ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, which requires AI systems to integrate visual, auditory, and temporal information to answer questions in a meaningful way. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants developed and tested models that answer a diverse set of questions about video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track serves as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current multimodal AI architectures. By fostering innovation in multimodal learning, this track contributes to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information. The track completed its first pilot year, which included two subtasks: an Answer Generation task and a Multiple Choice task. Based on lessons learned and participant feedback, we plan to run the track again in 2026.
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-1,
title = {Video Question Answering (VQA) 2025 Track},
author = {George Awad and Sanjay Purushotham and Afzal Godil},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Nagaoka University of Technology at TREC 2025 Video Question Answering¶
Isabel Gonzalez, Shungo Kubosaka, Takashi Yukawa
- Paper: https://trec.nist.gov/pubs/trec34/papers/kslab.vqa.pdf
- Participant: kslab
- Runs: nut-kslab-2025 | tv25-kslab-mc
Abstract
This paper details our approach to two Video Question Answering (VQA) tasks in the TREC 2025 challenge, using VideoLLaMA3-2B as the base model. For both the Answer Generation (AG) task and the Multiple Choice (MC) task, the primary training data was the dataset provided by TREC, which was used to fine-tune the model using LoRA. For the AG task, we generated diverse answers using sampling and ranked them by the model's average log-probability, which proved effective with an NDCG_BERT score of 0.993. Our generated answers were found to be semantically similar (BERT score: 0.893) but lexically different (METEOR: 0.226). We also identified a potential bias in confidence-based ranking that favors shorter answers. For the MC task, our approach was a two-stage process in which the model first generates a "ground truth" answer via greedy decoding, which a Sentence-Transformer then uses to rank the given options by cosine similarity. This approach achieved a Top-1 Accuracy of 0.499 and a Mean Reciprocal Rank (MRR) of 0.686. The system's effectiveness depended on both the accuracy of the "ground truth" generation and the Sentence-Transformer's similarity measurement. We found that this "generate-then-compare" strategy is viable, but its main limitation is error propagation from the first step. This paper outlines these methods, our experimental findings, and their limitations.
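The "generate-then-compare" MC strategy above reduces to ranking the options by similarity to the model's freely generated answer. In this sketch a bag-of-words cosine stands in for the Sentence-Transformer embeddings, so the block stays self-contained; the example strings are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_options(generated_answer, options):
    """Rank multiple-choice options by similarity to the answer the
    model produced with greedy decoding; the option closest to the
    free-form answer is returned first."""
    return sorted(options, key=lambda o: cosine(generated_answer, o), reverse=True)

gen = "a man is riding a bicycle"
opts = ["a man rides a bicycle", "a dog runs in a park",
        "a woman is cooking", "two men are talking"]
print(rank_options(gen, opts))
```

The error-propagation limitation the authors note is visible in this structure: if the generated answer is wrong, every similarity in the second stage is computed against the wrong target.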
Bibtex
@inproceedings{kslab-trec2025-papers-proc-1,
title = {Nagaoka University of Technology at TREC 2025 Video Question Answering},
author = {Isabel Gonzalez and Shungo Kubosaka and Takashi Yukawa},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
WHU-NERCMS AT TRECVID2025: AD-HOC VEDIO SEARCH(AVS) AND VIDEO QUESTION ANSWER(VQA) TASK¶
Fangyun Duan, Haixiang Ni, Xiusong Wang, Chao Liang
- Paper: https://trec.nist.gov/pubs/trec34/papers/WHU-NERCMS.avs.vqa.pdf
- Participant: WHU-NERCMS
- Runs: videollama-2B | videollama-2B-prompt | videollama-7B-prompt | videollama-7B-vqa
Abstract
The WHU-NERCMS team participated in the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tasks at TRECVID 2025. For the AVS task, we continued to use multiple visual-semantic embedding methods, combined with ranking aggregation techniques to integrate different models and their outputs and generate the final ranked video shot list. For the VQA task, we use a VLM to generate an answer that serves as a baseline answer. This answer is then embedded in the same vector space as the four options, and the similarities of these vectors are computed to rank the results.
Bibtex
@inproceedings{WHU-NERCMS-trec2025-papers-proc-1,
title = {WHU-NERCMS AT TRECVID2025: AD-HOC VEDIO SEARCH(AVS) AND VIDEO QUESTION ANSWER(VQA) TASK},
author = {Fangyun Duan and Haixiang Ni and Xiusong Wang and Chao Liang},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Spatio-Temporal Input Densification for Efficient and Robust Open-Domain Video Question Answering¶
Bao Tran, Thuyen Tran Doan, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh
- Paper: https://trec.nist.gov/pubs/trec34/papers/NII_UIT.vqa.pdf
- Participant: NII_UIT
- Runs: Aria8x3.5B_VidLLaMa | HIGHEST_PIPELINE
Abstract
Video Question Answering (VQA) requires systems to jointly reason over visual, auditory, and linguistic cues, and remains challenging due to complex temporal dependencies and the diverse, open-ended nature of real-world queries. Recent approaches often depend on supervised finetuning of large vision-language models, which yields strong in-distribution performance but comes with substantial data and computational demands. Furthermore, finetuned systems can struggle with temporal reasoning, small-object recognition, and effective use of audio information, limiting their robustness in open-domain benchmarks such as TRECVID. In this work, we introduce a VQA framework that enhances existing multimodal models without task-specific finetuning. Its core component is a spatio-temporal input densification strategy that reorganizes video evidence using dense frame sampling and spatial tiling, enabling finer visual understanding and more reliable temporal inference. The framework also incorporates lightweight modules for textualized audio integration, question type–aware prompting, and output normalization, contributing to improved robustness and answer consistency. Despite requiring no task-specific finetuning, the proposed system achieves strong results on the TRECVID 2025 VQA test set. Our Multiple Choice submission attains a Top-1 Accuracy of 0.774 and an MRR of 0.859, ranking as the top-performing run. For Answer Generation, the system reaches METEOR 0.173, BERTScore 0.887, STS 0.270, and NDCGBERTScore 0.996, placing it among the highest-ranked submissions. These results demonstrate the effectiveness of inference-time input densification as a scalable alternative to supervised finetuning.
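The spatio-temporal densification step above, dense temporal sampling plus spatial tiling of each sampled frame, can be sketched as pure index arithmetic. Parameter names and defaults are illustrative, not the NII_UIT configuration:

```python
def densify(num_frames, stride, grid=2, frame_size=(720, 1280)):
    """Sample frames at a fixed stride, then split each sampled frame
    into a grid x grid set of spatial tiles so that small objects
    survive the model's input downscaling. Returns a list of
    (frame_index, (top, left, height, width)) crops.
    """
    h, w = frame_size
    th, tw = h // grid, w // grid
    crops = []
    for f in range(0, num_frames, stride):          # dense temporal sampling
        for r in range(grid):                       # spatial tiling
            for c in range(grid):
                crops.append((f, (r * th, c * tw, th, tw)))
    return crops

crops = densify(num_frames=8, stride=4, grid=2)
print(len(crops))  # 2 sampled frames x 4 tiles per frame = 8 crops
```

Each crop would then be encoded by the multimodal model alongside the full frame, trading a larger token budget at inference time for finer visual detail, the same trade the paper makes in place of supervised fine-tuning.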
Bibtex
@inproceedings{NII_UIT-trec2025-papers-proc-2,
title = {Spatio-Temporal Input Densification for Efficient and Robust Open-Domain Video Question Answering},
author = {Bao Tran and Thuyen Tran Doan and Tien Do and Tien-Dung Mai and Thanh Duc Ngo and Duy-Dinh Le and Shin'ichi Satoh},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
HLTCOE Evaluation Team at TREC 2025: VQA Track¶
Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme
- Paper: https://trec.nist.gov/pubs/trec34/papers/HLTCOE.vqa.pdf
- Participant: HLTCOE
- Runs: MARS-GenR_1
Abstract
The HLTCOE Evaluation team participated in TREC VQA’s Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video–question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
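One plausible reading of the pointer-based listwise objective above: at each rank the model "points" at one remaining candidate, the gold candidate's cross-entropy is discounted by rank, and chosen or masked candidates are excluded from the softmax. The DCG-style rank weight and all names here are assumptions; the paper's exact loss is not reproduced:

```python
import math

def masked_pointer_rank_loss(logits, gold_ranking, mask):
    """Sketch of a masked pointer cross-entropy loss with rank weights.

    logits: one score per candidate answer.
    gold_ranking: candidate indices in gold order, best first.
    mask: False for candidates excluded from selection (vocabulary
    restriction); chosen candidates are removed at each pointer step.
    """
    loss, available = 0.0, list(mask)
    for rank, gold in enumerate(gold_ranking, start=1):
        # softmax restricted to candidates that are still available
        denom = sum(math.exp(l) for l, m in zip(logits, available) if m)
        p = math.exp(logits[gold]) / denom
        loss += (1.0 / math.log2(rank + 1)) * -math.log(p)
        available[gold] = False          # pointer: remove chosen candidate
    return loss

logits = [2.0, 1.0, 0.5, -1.0]
print(masked_pointer_rank_loss(logits, gold_ranking=[0, 1], mask=[True, True, True, False]))
```

Removing each chosen candidate before the next step is what makes the objective listwise rather than a sum of independent classification losses.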
Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-3,
title = {HLTCOE Evaluation Team at TREC 2025: VQA Track},
author = {Dengjia Zhang and Charles Weng and Katherine Guerrerio and Yi Lu and Kenton Murray and Alexander Martin and Reno Kriz and Benjamin Van Durme},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025¶
Andreas Goulas, Damianos Galanopoulos, Ioannis Patras, Vasileios Mezaris
- Paper: https://trec.nist.gov/pubs/trec34/papers/CERTH-ITI.avs.vqa.pdf
- Participant: CERTH-ITI
- Runs: certh.vqa.ag.run_1 | certh.vqa.ag.run_2 | certh.vqa.ag.run_3 | certh.vqa.ag.run_4 | certh.vqa.mc.run_1 | certh.vqa.mc.run_2 | certh.vqa.mc.run_3 | certh.vqa.mc.run_4
Abstract
This paper presents an overview of the runs submitted by the CERTH-ITI team to the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tracks of TRECVID 2025. For the AVS track, we introduce a two-stage framework built on foundation models. In the first stage, multiple vision–language models (VLMs) encode both the input query, augmented through LLM-generated rephrasings, and the candidate video shots, producing weighted similarity scores for initial retrieval. In the second stage, we utilize a Multimodal-LLM (MLLM)-based reranking module that evaluates the semantic alignment between each of the top-N highest-ranked shots and the original query, generating updated relevance scores for reordering these shots. This MLLM-driven reranking significantly improves contextual matching and produces more accurate final rankings without requiring any model training. For the VQA track, we fine-tune an audio-visual MLLM on the provided TRECVID training dataset and implement an inference-time scaling technique to enhance the multimodal understanding capabilities of the MLLM. For the open-ended Answer Generation (AG) task, we aggregate multiple model responses per question via a majority vote: responses are generated with greedy sampling from different random frame subsets of the video and ranked by their number of votes. For the Multiple-Choice (MC) task, instead of voting, we apply mean pooling to the logits the fine-tuned model assigns to each candidate response. Through the combination of fine-tuning and frame-subset ensembling, we achieve the highest score on three metrics in the VQA AG task and the second-highest in the VQA MC task.
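The two frame-subset ensembling rules described for the VQA runs (majority vote over generated answers for AG, mean-pooled per-choice logits for MC) can be sketched in a few lines. This is a minimal illustration of the aggregation step only; the function names are hypothetical and the actual MLLM inference producing the inputs is out of scope.

```python
from collections import Counter

def rank_by_votes(answers):
    """AG sketch: rank distinct answers by how many frame-subset runs
    produced them (normalizing case/whitespace before counting)."""
    counts = Counter(a.strip().lower() for a in answers)
    return [ans for ans, _ in counts.most_common()]

def pick_choice(logits_per_subset):
    """MC sketch: mean-pool each candidate choice's logit across frame
    subsets, then return the index of the highest pooled logit."""
    n = len(logits_per_subset)
    pooled = [sum(run[i] for run in logits_per_subset) / n
              for i in range(len(logits_per_subset[0]))]
    return max(range(len(pooled)), key=pooled.__getitem__)
```

For example, three greedy generations from different frame subsets that twice produce "a cat" and once "a dog" would rank "a cat" first, while mean pooling smooths out a single frame subset's outlier logit before the argmax is taken.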
Bibtex
@inproceedings{CERTH-ITI-trec2025-papers-proc-1,
title = {MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025},
author = {Andreas Goulas and Damianos Galanopoulos and Ioannis Patras and Vasileios Mezaris},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}