Proceedings - BioGen 2025
Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track
Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts
Abstract
This overview of the TREC 2025 BioGen track discusses the tasks, datasets, evaluation metrics, participating systems, and their performance. We evaluated the submitted runs at multiple levels (answers, citations, and documents) using both expert and automated evaluation. For Task A, most teams followed an NLI-based approach. For Task B, most teams used a two-stage RAG approach: in the first stage, they retrieved documents via lexical search (BM25) and re-ranked them to obtain the top-k relevant documents/snippets; in the second stage, LLMs generated the answer, citing the appropriate documents. We hope that introducing the task, and the ground-truth datasets it has produced, will foster research on designing systems that generate answers to health-related questions with appropriate citations, thereby providing a trusted, reliable source to support the assertions in the answers.
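The two-stage pattern described above can be sketched end to end. This is a minimal, self-contained illustration: the tiny corpus, the BM25 parameters, and the prompt template are all invented for the example, and the actual LLM call of stage two is only indicated.

```python
# Sketch of the common Task B pipeline: stage 1 scores documents with
# BM25 and keeps the top-k; stage 2 would prompt an LLM to answer while
# citing the retrieved PMIDs. Corpus and query are illustrative.
import math
from collections import Counter

corpus = {
    "PMID:111": "metformin lowers blood glucose in type 2 diabetes",
    "PMID:222": "statins reduce ldl cholesterol and cardiovascular risk",
    "PMID:333": "metformin is associated with gastrointestinal side effects",
}

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with standard BM25."""
    tokenized = {pid: text.split() for pid, text in docs.items()}
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / n
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    q = query.lower().split()
    scores = {}
    for pid, toks in tokenized.items():
        tf = Counter(toks)
        s = 0.0
        for term in q:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[pid] = s
    return scores

query = "does metformin cause side effects"
ranked = sorted(bm25_scores(query, corpus).items(), key=lambda kv: -kv[1])
top_k = [pid for pid, s in ranked[:2] if s > 0]

# Stage 2 (not executed here): hand the evidence to an LLM.
prompt = f"Question: {query}\nEvidence:\n" + "\n".join(
    f"[{pid}] {corpus[pid]}" for pid in top_k
) + "\nAnswer, citing PMIDs in brackets:"
```

In the participating systems this first stage is typically followed by a neural re-ranker before the evidence reaches the generator; that step is omitted here for brevity.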
Bibtex
@inproceedings{coordinators-trec2025-papers-proc-4,
title = {Overview of TREC 2025 Biomedical Generative Retrieval (BioGen) Track},
author = {Deepak Gupta and Dina Demner-Fushman and William Hersh and Steven Bedrick and Kirk Roberts},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Contradiction-Aware Grounded QA System for TREC 2025 BioGen: Lexical Retrieval Trade-offs and Citation Attribution
Soumya Ranjan Sahoo, Gagan N., Sanand Sasidharan, Divya Bharti
- Participant: GEHC-HTIC
- Paper: https://trec.nist.gov/pubs/trec34/papers/GEHC-HTIC.biogen.pdf
- Runs: gehc_htic_task_a | gehc_htic_task_b | task_a_gehc_htic_run2
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in biomedical question-answering tasks, yet their tendency to generate plausible yet unverified claims poses significant risks in clinical contexts. To mitigate the clinical risks of LLM hallucinations, the TREC 2025 BioGen track mandates grounded answers that explicitly surface contradictory evidence (Task A) and the generation of narrative-driven, fully attributed responses (Task B). Addressing the critical absence of target ground truth, we present a proxy-based development framework utilizing the SciFact dataset to systematically optimize retrieval architectures. Our iterative evaluation revealed a “Simplicity Paradox”: complex adversarial dense retrieval strategies failed catastrophically on contradiction detection (MRR 0.023) due to Semantic Collapse, where negation signals were indistinguishable in vector space. Furthermore, we identified a distinct Retrieval Asymmetry: while filtering dense embeddings improved contradiction detection, it degraded support recall, compromising holistic reliability. We resolve this via a Decoupled Lexical Architecture utilizing a unified BM25 backbone to balance semantic support recall (0.810) with precise contradiction surfacing (0.750). This approach achieves the highest Weighted MRR (0.790) on the proxy benchmark while remaining the only computationally viable strategy for scaling to the 30-million-document PubMed corpus. For answer generation, we introduce Narrative-Aware Reranking and One-Shot In-Context Learning, which improved citation coverage from 50% (zero-shot) to 100%. Our official TREC evaluation results confirm these findings: our system ranks 2nd among all teams on Task A contradiction F1 and 3rd out of 50 runs on Task B citation coverage (98.77%), achieving zero citation contradict rate. 
Our work transforms LLMs from stochastic generators into honest evidence synthesizers, demonstrating that epistemic integrity in biomedical AI requires prioritizing lexical precision and architectural scalability over isolated metric optimization.
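The MRR figures quoted in this abstract follow the standard mean-reciprocal-rank definition, which can be sketched as below; note that the paper's "Weighted MRR" combines support and contradiction scores with a weighting of its own that is not reproduced here.

```python
# Plain MRR: average over queries of 1/rank of the first relevant hit
# (0 when nothing relevant is retrieved). Example data is invented.
def mrr(rankings):
    """rankings: list of (ranked doc-id list, relevant doc id) pairs."""
    total = 0.0
    for ranked, relevant in rankings:
        rank = next((i + 1 for i, d in enumerate(ranked) if d == relevant), None)
        if rank is not None:
            total += 1.0 / rank
    return total / len(rankings)

runs = [(["d2", "d1", "d3"], "d1"),   # relevant at rank 2 -> 0.5
        (["d7", "d8", "d9"], "d7")]   # relevant at rank 1 -> 1.0
score = mrr(runs)
```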
Bibtex
@inproceedings{GEHC-HTIC-trec2025-papers-proc-1,
title = {Contradiction-Aware Grounded QA System for TREC 2025 BioGen: Lexical Retrieval Trade-offs and Citation Attribution},
author = {Soumya Ranjan Sahoo and Gagan N. and Sanand Sasidharan and Divya Bharti},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
TREC BioGen 2025: A Retrieval and NLI-Based Approach for Biomedical Evidence Grounding
Riccardo Lunardi, Riccardo Zamolo, Maria Elena Zuliani, Stefano Mizzaro, Vincenzo Della Mea, Kevin Roitero
- Participant: uniud
- Paper: https://trec.nist.gov/pubs/trec34/papers/uniud.biogen.pdf
- Runs: run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct | run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct | run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct | run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct | run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct | run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct | run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct | run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct | run9_no_rerank_index-passages-sparse_gpt-4o-mini | run10_no_rerank_index-passages-dense_gpt-4o-mini | run1_no-rerank_index-passages-sparse | run2_rerank_index-passages-sparse | run3_no-rerank_index-passages-dense | run4_rerank_index-passages-dense
Abstract
This technical report presents the system developed for the TREC 2025 Biomedical Generative Retrieval Track. Our approach integrates both sparse and dense retrieval, LLM-based reranking, and Natural Language Inference (NLI) to identify both supporting and contradicting evidence for grounded answer generation, evaluating the interplay between them. We submitted 14 runs, four for Task A and ten for Task B, aimed to analyze how different retrieval and grounding configurations impact the factuality and reliability of biomedical answer generation.
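The NLI grounding step described here can be sketched as a small decision rule: an NLI model scores each (answer sentence, passage) pair, and the probabilities are mapped to evidence labels. The thresholds and label names below are illustrative assumptions, not the team's settings.

```python
# Hedged sketch of NLI-based evidence labeling: given entailment and
# contradiction probabilities from some NLI cross-encoder (not shown),
# classify a passage relative to an answer sentence.
def label_evidence(entail_p, contra_p, threshold=0.5):
    """Map NLI probabilities to an evidence label for one passage."""
    if entail_p >= threshold and entail_p >= contra_p:
        return "supporting"
    if contra_p >= threshold:
        return "contradicting"
    return "neutral"

labels = [label_evidence(0.9, 0.05),   # clear entailment
          label_evidence(0.10, 0.80),  # clear contradiction
          label_evidence(0.30, 0.30)]  # neither -> neutral
```

In a real pipeline the two probabilities would come from an NLI model run over each retrieved passage; only passages labeled supporting or contradicting would be kept as grounding evidence.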
Bibtex
@inproceedings{uniud-trec2025-papers-proc-1,
title = {TREC BioGen 2025: A Retrieval and NLI-Based Approach for Biomedical Evidence Grounding},
author = {Riccardo Lunardi and Riccardo Zamolo and Maria Elena Zuliani and Stefano Mizzaro and Vincenzo Della Mea and Kevin Roitero},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Generating Long-Form Answers to Biomedical Questions: Effectiveness and Efficiency
Jan Bakker, Jaap Kamps
- Participant: UAmsterdam
- Paper: https://trec.nist.gov/pubs/trec34/papers/UAmsterdam.biogen.pdf
- Runs: UAmsterdam_bergen | UAmsterdam_bergen_llama-8b | UAmsterdam_bergen_llama-70b | UAmsterdam_bergen_mistral-7b | UAmsterdam_bergen_pisco-llama | UAmsterdam_bergen_pisco-mistral
Abstract
In this paper, we report on the University of Amsterdam’s participation in the TREC 2025 BioGen Track (Gupta et al., 2025). The goal is to generate answers to biomedical questions that are grounded with appropriate citations. First, we investigated the zero-shot generalization of PISCO, an efficient open-domain question answering method, to this task. Second, we compared the use of small and large Llama models. Our findings reveal a trade-off between answer quality and generation time. Therefore, we emphasize the need for better approaches to combining LLM reliability with efficiency.
Bibtex
@inproceedings{UAmsterdam-trec2025-papers-proc-3,
title = {Generating Long-Form Answers to Biomedical Questions: Effectiveness and Efficiency},
author = {Jan Bakker and Jaap Kamps},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Multi-System Biomedical QA with Post-Hoc Sentence Attribution for TREC BioGen
Harikrishnan Gurushankar Saisudha, Sabine Bergler
- Participant: CLaC
- Paper: https://trec.nist.gov/pubs/trec34/papers/CLaC.biogen.pdf
- Runs: simpleQA_BM25 | simpleQA_hybrid | MedHopQA_BM25 | MedHopQA_FAISS | LLM_BM25 | LLM_NLI_BM25
Abstract
This paper presents six systems developed for the TREC 2025 Biomedical Reference Attribution task, covering both Task A (grounding answers) and Task B (reference attribution). For Task A, two LLM-based citation attribution systems were designed to identify supporting and contradicting evidence from PubMed, using a combination of sparse retrieval, dense reranking, and optional NLI-based contradiction filtering. For Task B, four systems were developed: two SimpleQA pipelines emphasizing lightweight, narrative-aligned answer generation, and two MedHopQA-based pipelines adapted from a multi-hop biomedical QA framework. Across systems, we evaluate the impact of retrieval strategies, question decomposition, and reranking on evidence selection and answer quality. Our analysis highlights key challenges in grounding citations, particularly for generic statements and contradiction detection, and identifies opportunities to improve LLM-based attribution through refined retrieval strategies and specialized fine-tuning. Together, these systems provide a comprehensive exploration of reference attribution in biomedical QA and illustrate the trade-offs between retrieval coverage, narrative alignment, and attribution accuracy.
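The post-hoc sentence attribution named in the title can be sketched as matching each answer sentence to its best-scoring retrieved passage. Token overlap stands in here for the actual scorer (the paper's pipelines use dense reranking and NLI filtering); the passages are invented.

```python
# Post-hoc attribution sketch: attach the best-matching passage id to
# each answer sentence. Token overlap is a stand-in similarity only.
def attribute(answer_sentences, passages):
    """Pair each answer sentence with its highest-overlap passage id."""
    cited = []
    for sent in answer_sentences:
        s_tokens = {t.strip(".,") for t in sent.lower().split()}
        best = max(
            passages,
            key=lambda pid: len(s_tokens & set(passages[pid].lower().split())),
        )
        cited.append((sent, best))
    return cited

passages = {"PMID:1": "aspirin inhibits platelet aggregation",
            "PMID:2": "ibuprofen reduces inflammation and pain"}
pairs = attribute(["Aspirin affects platelet aggregation."], passages)
```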
Bibtex
@inproceedings{CLaC-trec2025-papers-proc-1,
title = {Multi-System Biomedical QA with Post-Hoc Sentence Attribution for TREC BioGen},
author = {Harikrishnan Gurushankar Saisudha and Sabine Bergler},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Modular Sparse and Dense Retrieval for Evidence-Constrained Biomedical Question Answering
Ganesh Chandrasekar, Benjamin Lofo Follo, Aleksandr Vinokhodov, Sabine Bergler
- Participant: CLaC Lab
- Paper: https://trec.nist.gov/pubs/trec34/papers/CLaC%20Lab.biogen.pdf
- Runs: system_a | system_b_tfidf | system_c_medcpt | system_d_medcpt_wide | system_e
Abstract
We describe a submission to TREC BioGen Task B, which generates short answers grounded in PubMed and cited with PMIDs. Our approach is modular: a sparse pipeline (BM25 with cross-encoder reranking) and a dense pipeline (MedCPT bi-encoder with cross-encoder) are both coupled with evidence-constrained generation that preserves PMIDs end to end. Each pipeline runs under two retrieval budgets (narrow and wide) using a shared query set formed by the original question plus three reformulations, yielding five runs in total including a baseline. We outline design choices for reformulation, reranking, evidence selection, and citation control, and we report official results for answer quality and citation metrics.
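The shared query set above (original question plus three reformulations) implies merging several per-query rankings under a retrieval budget. One simple way to do this, sketched below, keeps each document at its best rank across queries; the fusion rule and the narrow/wide budget values are assumptions for illustration, not the paper's exact merge.

```python
# Merge per-query rankings from a shared query set: each candidate keeps
# its best (lowest) rank across queries, then the budget is applied.
def merge_by_best_rank(rankings, budget):
    """Fuse ranked lists by best rank seen; return top `budget` ids."""
    best = {}
    for ranked in rankings:
        for i, pid in enumerate(ranked):
            best[pid] = min(best.get(pid, i), i)
    return sorted(best, key=best.get)[:budget]

# One list per query: the original question and its reformulations.
runs = [["p1", "p2", "p3"], ["p2", "p4"], ["p5", "p1"]]
merged_narrow = merge_by_best_rank(runs, budget=3)  # "narrow" budget
merged_wide = merge_by_best_rank(runs, budget=5)    # "wide" budget
```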
Bibtex
@inproceedings{CLaC_Lab-trec2025-papers-proc-1,
title = {Modular Sparse and Dense Retrieval for Evidence-Constrained Biomedical Question Answering},
author = {Ganesh Chandrasekar and Benjamin Lofo Follo and Aleksandr Vinokhodov and Sabine Bergler},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
SIB Text-Mining group at TREC BioGen 2025
Luc Mottin, Anaïs Mottaz, Alexandre Flament, Julien Knafou, Patrick Ruch
- Participant: SIB
- Paper: https://trec.nist.gov/pubs/trec34/papers/SIB.biogen.pdf
- Runs: SIB-task-a-3 | SIB-task-a-4 | SIB-task-a-2 | SIB-task-a-1 | SIB-task-a-6 | SIB-task-a-5 | SIB-task-a-7 | SIB-task-b-1 | SIB-task-b-2 | SIB-task-b-3 | SIB-task-b-4
Abstract
In the 2025 TREC Biomedical Generative Retrieval (BioGen) Track, we evaluated approaches for producing evidence-grounded biomedical answers across two tasks: sentence-level grounding (Task A) and citation-attributed answer generation (Task B). Our pipelines combined specialized retrieval components with both open-source and instruction-tuned large language models (LLMs), integrating sparse and dense retrieval, re-ranking, and, in later runs, LLM-based claim decomposition, splitting complex claims into positive and negative subclaims for more precise evidence evaluation. Retrieved evidence was used to guide models such as Qwen2.5-7B-Instruct and Meta-Llama-3-8B in classifying supporting and contradicting references, with PMIDs selected using round-robin strategies to balance subclaim coverage. For Task B, this approach was extended to produce complete citation-grounded biomedical summaries. Across all runs, we aimed at evaluating the impact of hybrid retrieval, model adaptation, and structured prompting on factual grounding, interpretability, and traceability of LLM outputs.
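The round-robin PMID selection mentioned above, which balances coverage across subclaims, can be sketched with a simple interleave-and-dedupe loop. The subclaim lists and the selection limit below are invented for illustration.

```python
# Round-robin selection: take one PMID from each subclaim's ranked list
# in turn, skipping gaps and duplicates, until the limit is reached.
from itertools import chain, zip_longest

def round_robin(per_subclaim, limit):
    """Interleave each subclaim's ranked PMIDs, dedupe, cut at limit."""
    seen, out = set(), []
    for pid in chain.from_iterable(zip_longest(*per_subclaim)):
        if pid is not None and pid not in seen:
            seen.add(pid)
            out.append(pid)
        if len(out) == limit:
            break
    return out

# Hypothetical ranked PMIDs for three subclaims of one decomposed claim.
subclaims = [["11", "12", "13"], ["21", "22"], ["31"]]
picked = round_robin(subclaims, limit=4)
```

Each subclaim contributes its top-ranked evidence before any subclaim contributes its second choice, which is what keeps coverage balanced.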
Bibtex
@inproceedings{SIB-trec2025-papers-proc-1,
title = {SIB Text-Mining group at TREC BioGen 2025},
author = {Luc Mottin and Anaïs Mottaz and Alexandre Flament and Julien Knafou and Patrick Ruch},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}
Dal@TREC25: Improving Biomedical QA with Adaptive Retrieval and Multi-Stage RAG
Jitansh Arora, Aman Jaiswal, Juan Ramirez-Orta, Evangelos Milios
- Participant: dal
- Paper: https://trec.nist.gov/pubs/trec34/papers/dal.biogen.pdf
- Runs: rrf_monot5-msmarco_deepseek-r1 | rrf_monot5-msmarco_llama70b | rmmdn | rmmln | afmmdn | afmmd | expd | empd | expert_prompt | emotional_prompt
Abstract
This paper presents the design and implementation of Retrieval-Augmented Generation (RAG) pipelines developed for the TREC BioGen 2025 Challenge by the Dalhousie team, which focuses on grounding and attribution in Biomedical Question Answering. Our approach integrates Hybrid Retrieval, Re-Ranking, and Large Language Model (LLM) reasoning to improve Factual Accuracy and Citation Fidelity. For Task A (Grounding Answer), we introduce a pipeline that reformulates questions and answer sentences into supportive and contradictory queries, retrieves relevant PubMed articles, and classifies their stance using LLM-based reasoning. For Task B (Reference Attribution), we extend this framework to generate concise, evidence-grounded biomedical answers through a combination of Retrieval–Generation and Generate–then–Retrieve architectures. Multiple prompting strategies and validation mechanisms were explored to balance retrieval coverage, precision, and logical consistency. The proposed systems emphasize modularity, reproducibility, and adaptability to evolving biomedical corpora, providing a robust foundation for advancing trustworthy, citation-grounded question answering in the biomedical domain.
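The hybrid retrieval in this pipeline is fused with reciprocal rank fusion, as the `rrf_*` run names above indicate. A standard RRF sketch follows; `k=60` is the commonly used default, not necessarily the team's setting, and the example rankings are invented.

```python
# Reciprocal rank fusion: each list contributes 1/(k + rank) per
# document, and documents are re-sorted by their summed score.
def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids with standard RRF scoring."""
    scores = {}
    for ranked in rankings:
        for i, pid in enumerate(ranked):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + i + 1)
    return sorted(scores, key=scores.get, reverse=True)

# E.g. one sparse (BM25) ranking and one dense ranking of the same pool.
fused = rrf([["a", "b", "c"], ["b", "c", "a"]])
```

Documents ranked consistently high across both retrievers rise to the top, which is why RRF is a common backbone for hybrid sparse+dense retrieval.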
Bibtex
@inproceedings{dal-trec2025-papers-proc-1,
title = {Dal@TREC25: Improving Biomedical QA with Adaptive Retrieval and Multi-Stage RAG},
author = {Jitansh Arora and Aman Jaiswal and Juan Ramirez-Orta and Evangelos Milios},
booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
year = {2025},
address = {Gaithersburg, Maryland},
series = {NIST SP xxxx}
}