Runs - BioGen 2025
afmmd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: afmmd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: b5c62c53b36f483e6e2d12b9a4c9326a
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval. Queries are expanded with DeepSeek-R1 and refined with Rocchio feedback for BM25 retrieval (top 5000 docs), while five semantically diverse sub-queries are generated for dense retrieval via a FAISS index using BioBERT embeddings (up to 1000 docs each). Results are merged with Reciprocal Rank Fusion and re-ranked using MonoT5 to select the top 10 documents, which are split into sentences and scored to produce the top 30 evidence snippets. The LLM then evaluates whether the snippets are sufficient; if not, it identifies missing information and generates reformulated queries for iterative retrieval (up to 5 rounds). Finally, DeepSeek-R1 generates an answer strictly from the collected snippets, adhering to formatting and citation constraints (≤250 words, max three PMIDs per sentence). Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails any validation, it is regenerated using corrective feedback up to a maximum of 20 times.
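The merging step these runs share can be sketched with a minimal Reciprocal Rank Fusion. This is an illustrative sketch, not the team's code; the constant k=60 is the commonly used default and is not stated in the run description.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; k dampens the influence of any single top rank.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: two retrievers returning overlapping PMID lists.
bm25_hits = ["p1", "p2", "p3"]
dense_hits = ["p3", "p1", "p4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents appearing in both lists (here p1 and p3) accumulate score from each, so they outrank documents found by only one retriever.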
afmmdn
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: afmmdn
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: 6f7e4ce9a7fc813b342760f43fe74cf1
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval. Queries are expanded with DeepSeek-R1 and refined with Rocchio feedback for BM25 retrieval (top 5000 docs), while five semantically diverse sub-queries are generated for dense retrieval via a FAISS index using BioBERT embeddings (up to 1000 docs each). Results are merged with Reciprocal Rank Fusion and re-ranked using MonoT5 to select the top 10 documents, which are split into sentences and scored to produce the top 30 evidence snippets. The LLM then evaluates whether the snippets are sufficient; if not, it identifies missing information and generates reformulated queries for iterative retrieval (up to 5 rounds). Finally, DeepSeek-R1 generates an answer strictly from the collected snippets, adhering to formatting and citation constraints (≤250 words, max three PMIDs per sentence). This time, we do not validate the content of the answer sentences against the cited documents using the Mistral-7B model. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
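The non-LLM programmatic validation mentioned above could look roughly like this. It is a sketch under assumptions: the naive sentence splitter and the bracketed-PMID citation format are illustrative, not the team's actual conventions.

```python
import re

def validate_answer(answer, max_words=250, max_pmids_per_sentence=3):
    """Return a list of constraint violations; an empty list means the answer passes."""
    problems = []
    # Word-limit check: count whitespace-separated tokens.
    if len(answer.split()) > max_words:
        problems.append(f"answer exceeds {max_words} words")
    # Citation check: naive sentence split, then count bracketed PMIDs per sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        pmids = re.findall(r"\[(\d{7,8})\]", sentence)
        if len(pmids) > max_pmids_per_sentence:
            problems.append(f"too many PMIDs in: {sentence[:40]}...")
    return problems
```

On failure, the returned violation strings could be fed back to the generator as the corrective feedback the description mentions.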
emotional_prompt
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: emotional_prompt
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: 616eda8aa5cbed92cb2bcafdc78155b7
- Run description: This system generates expanded search queries from each answer sentence using an emotionally focused prompting strategy. The queries are encoded with BioBERT and matched against a FAISS index of PubMed. Retrieved documents are evaluated by a Mistral model for support or contradiction, and the system outputs only citations that are verified as relevant evidence.
empd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: empd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-21
- Task: trec2025-biogen-task-b
- MD5: b0245fa529fbeddc8c7c9297bde8f471
- Run description: This run implements an emotional prompting strategy for biomedical question answering. Answers are generated by DeepSeek-R1 using a carefully crafted emotional system prompt to ensure precise and professional biomedical language. Supporting citations are added through a dense retrieval and verification pipeline: BioBERT + FAISS retrieves up to 30 relevant PubMed documents, and Mistral-7B filters them using an emotional prompt to retain only truly supportive evidence. The final outputs are concise (<250 words) and contain no more than three supporting PubMed citations per sentence.
expd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: expd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-21
- Task: trec2025-biogen-task-b
- MD5: ca553d654b95be07f73c31ed4bca4398
- Run description: This run implements an expert prompting strategy for biomedical question answering. Answers are generated by DeepSeek-R1 using a carefully crafted expert system prompt to ensure precise and professional biomedical language. Supporting citations are added through a dense retrieval and verification pipeline: BioBERT + FAISS retrieves up to 30 relevant PubMed documents, and Mistral-7B filters them using an expert prompt to retain only truly supportive evidence. The final outputs are concise (<250 words) and contain no more than three supporting PubMed citations per sentence.
expert_prompt
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: expert_prompt
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: d6f54db4b2d2ba8f65451769b9556f6d
- Run description: This system expands each answer sentence into new search queries using an expert-driven prompting strategy. The queries are embedded with BioBERT and searched against a FAISS index of PubMed. Retrieved documents are then filtered by a Mistral model that classifies their relationship to the claim, ensuring only genuinely supportive or contradictory citations are kept.
gehc_htic_task_a
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: gehc_htic_task_a
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: 9f3205f0da80b0b2a4c3402fa5909544
- Run description: This run uses a BM25-only pipeline over the official PubMed Lucene index (Pyserini), which stores each document’s raw JSON with keys {"id","contents"}. For every answer, we formulate a base query by concatenating the question and answer text, ensuring retrieval is keyed to the specific claim being evaluated. We inject a single, per-item metadata block (team_id, run_id, qa_id, question) in the output to match the submission schema, and we enable checkpointing with resume support so long jobs don’t lose progress. For SUPPORT, we retrieve the top-100 BM25 hits, build (text, pmid) candidates from the stored raw JSON, and exclude any PMIDs already known as existing supports or later selected as contradictions. We then apply a MedCPT Cross-Encoder to rerank these candidates by semantic relevance and keep the top-3 PMIDs. If reranking produces fewer than three results or text is missing for some hits, we backfill directly from the BM25 ranking to guarantee three supporting citations whenever possible. For CONTRADICTIONS, we run a broad BM25 search (top-1000) to maximize recall, then split each abstract into sentences and apply a negative-cue filter (e.g., “did not,” “no effect,” “not significant”) to focus on likely contradictory spans, with a fallback to the first few sentences if no cues are found. These sentences are classified with a MedNLI seq2seq model (doc→claim prompts), and any document with at least one sentence decoding to “contradiction” is kept; we then select up to three contradicted PMIDs. Label decoding is tolerant, and the approach is robust to truncation by operating at the sentence level.
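The negative-cue filter for contradiction candidates might be sketched as follows. This is an assumption-laden illustration: the cue list extends the three examples given in the description, and the fallback size and sentence splitter are guesses.

```python
import re

# Hypothetical cue list; only the first three phrases appear in the run description.
NEGATIVE_CUES = ("did not", "no effect", "not significant", "no significant", "failed to")

def contradiction_candidates(abstract, fallback_n=3):
    """Select abstract sentences likely to contradict a claim.

    Sentences containing a negative cue are returned; if none match,
    fall back to the first few sentences, as the run description notes.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
    cued = [s for s in sentences if any(cue in s.lower() for cue in NEGATIVE_CUES)]
    return cued if cued else sentences[:fallback_n]
```

Only the surviving sentences would then be passed to the NLI classifier, which keeps the per-document inference cost low.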
gehc_htic_task_b
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: gehc_htic_task_b
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-b
- MD5: 2ce4d2fd9ddf718b5804a6c1b28455e8
- Run description: This run builds a two-stage retrieval-and-generation pipeline to produce concise, citation-grounded answers. First, we perform a high-recall BM25 search over the PubMed Lucene index using the question alone (top 1000). We then rerank these candidates with a MedCPT Cross-Encoder, but crucially we use the narrative (not the question) as the reranking query. This leverages the richer intent and context in the narrative to prioritize documents that are not just topically related but also aligned with what the user actually wants to read about. From this reranked set, we keep the top 10 documents (with their PMIDs and cross-encoder scores) as the working evidence set for the LLM. Answer synthesis uses a one-shot prompt tailored for biomedical communication. The prompt includes the topic, question, narrative, and a detailed “iron and ferritin” exemplar that demonstrates the desired style and structure. The system lists the selected documents with numeric indices ([1], [2], …), PMIDs, and relevance scores, then asks GPT-4o (via an enterprise Azure OpenAI endpoint) to produce a short, narrative-style answer under strict rules: every sentence must be supported by 1–3 citations, conflicts must be explicitly called out, and only provided evidence can be used. The generation is run with deterministic settings (temperature=0, fixed seed) and a retry/backoff wrapper for resiliency.
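The retry/backoff wrapper mentioned for resiliency is a standard pattern around a flaky endpoint call; a minimal sketch, in which the attempt count and delay schedule are assumptions rather than the team's settings:

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Invoke call(), retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The generation call (e.g. a lambda wrapping the Azure OpenAI request) is passed in as `call`, keeping the backoff logic independent of the API client.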
h2oloo_rr_g41_t50
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_g41_t50
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 512af248565e3b80d5a8442562c015a8
- Run description: GPT-4.1 takes the top 50 results and uses Ragnarok Biogen V6 to generate a grounded answer.
h2oloo_rr_q3-30b_nc
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_nc
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 40afec226e67f6d1698dbf8728a7f77e
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) uses Ragnarok Biogen V6 without citations (“No Cite”).
h2oloo_rr_q3-30b_t20
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_t20
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: d075fa276c82dadc219f413cd02ad35c
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) takes the top 20 results and uses Ragnarok Biogen V6 to generate a grounded answer.
h2oloo_rr_q3-30b_t50
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_t50
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 9a1d1b64a57bf74e79cfc30d40e7b993
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) takes the top 50 results and uses Ragnarok Biogen V6 to generate a grounded answer.
hltbio-gpt5.searcher
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-gpt5.searcher
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: f266d732fcf4cd737ded99e33f3c9ea0
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Searcher II pointwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for most steps. GPT-5 is used for final answer generation (drafting) and answer shortening (revising the report). LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg-fsrrf
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg-fsrrf
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: e25dd639464cc09c5a50d45d99265564
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.crux
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-lg.crux
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 82cc22b26eea943ce83c75e71c17ba3d
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from meta-llama/Llama-3.3-70B-Instruct pointwise subquestion reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.fsrrfprf
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.fsrrfprf
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 0c77b30653dd9f4e62eca775351f5d79
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3, using PRF with each. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.jina
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.jina
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 08d343b91246db31e4f0d62256a3b0a9
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from jinaai/jina-reranker-m0 (2.4B) reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.jina.qwen
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.jina.qwen
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: b77494d2abda361ec8afeca9a32219da
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of two rerankings (by jinaai/jina-reranker-m0 (2.4B) and Qwen/Qwen3-Reranker-8B) of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.listllama
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.listllama
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: eadb4895bafa93315ab238162930631d
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from meta-llama/Llama-3.3-70B-Instruct listwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.qwen
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.qwen
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 795bbfabadb5c310596484553e2160ff
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Qwen/Qwen3-Reranker-8B reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.searcher
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-lg.searcher
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 33826798de65d1713babe2ad1616fdd7
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Searcher II pointwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltcoe-eval-common-smoothed-sonnet
Participants | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-eval-common-smoothed-sonnet
- Participant: EvalHLTCOE
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 6dfc15f8e00fc42fbf289e9f30f533c0
- Run description: unspecified
hltcoe-eval-svc-smoothed-sonnet
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-eval-svc-smoothed-sonnet
- Participant: EvalHLTCOE
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: ffb2d40ede881d2f4a7d51e7085f4fc7
- Run description: unspecified
hltcoe-multiagt.llama70B.ag_sw-ret_plq-6-wbg-limit
Participants | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.ag_sw-ret_plq-6-wbg-limit
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 7cfeffd4818d652c4c1375622a7e4442
- Run description: unspecified
hltcoe-multiagt.llama70B.lg-w-ret_lpq-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.lg-w-ret_lpq-nt-s_cite
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 88b249e2005db8ff69707bb186125fc9
- Run description: unspecified
hltcoe-multiagt.llama70B.lg-w-ret_p-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.lg-w-ret_p-nt-s_cite
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 1d4d8205029439fecf42330289283276
- Run description: unspecified
hltcoe-rerank.llama70B.lg-w-ret_plq-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-rerank.llama70B.lg-w-ret_plq-nt-s_cite
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 40e8ec7fd6ed5fd1d0bb37b3c917b2bb
- Run description: unspecified
LLM_BM25
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: LLM_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5: 73f640521d34a2f90939466da64224f0
- Run description: This run implements a pipeline to identify both supporting and contradictory citations for each sentence of a given medical answer. The system processes each sentence independently. First, it employs a two-stage document retrieval method (BM25 followed by a ColBERT re-ranker) to gather a small, highly relevant set of documents from a PubMed corpus. These documents are then formatted and passed as context to a Mistral-7B-Instruct-v0.3 model. Guided by specific, constraint-driven prompts, the LLM analyzes the documents and identifies up to three PMIDs that either support or contradict the claim made in the sentence. The final output is a structured list containing each sentence paired with its corresponding citations.
LLM_NLI_BM25
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: LLM_NLI_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5: 22be2ab0a22af2161c22d0c6c0f72501
- Run description: This run is designed to automatically find both supporting and contradictory citations for a series of medical statements. For each statement, it first generates search queries and retrieves a set of relevant scientific articles from PubMed using a BM25 and ColBERT reranking pipeline. It then uses the Mistral-7B language model to identify and extract supporting PMIDs from these articles. To find contradictory evidence, it employs a specialized SciFive NLI model to filter the retrieved articles for statements that logically contradict the original claim, and then uses Mistral-7B to extract the corresponding PMIDs. The final output is a structured JSON file that appends the identified supporting and contradictory citations to each original statement.
MedHopQA_BM25
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: MedHopQA_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5: 74b2ee5c867f02265a540b54fd72bf81
- Run description: We use the MedHopQA pipeline that we developed for the NLM BioCreative 9 workshop. The system uses iterative, multi-hop strategies for QA. The document retriever searches PubMed indexed with BM25.
MedHopQA_FAISS
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: MedHopQA_FAISS
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5: 74ed8175eaf93808e1c7e0d2d6fd9c7a
- Run description: We use the MedHopQA pipeline that we developed for the NLM BioCreative 9 workshop. The system uses iterative, multi-hop strategies for QA. The document retriever searches PubMed indexed with BM25 and FAISS.
rmmdn
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: rmmdn
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: c68d81194b874b12898ed1db7c339c86
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the DeepSeek-R1 model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. This time, we do not validate the content of the answer sentences against the cited documents. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
rmmln
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: rmmln
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: d3f4fb984338cb0f4d3b7285c92d62f6
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the Llama-3-70B model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. This time, we do not validate the content of the answer sentences against the cited documents. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
rrf_monot5-msmarco_deepseek-r1
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: rrf_monot5-msmarco_deepseek-r1
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-05
- Task: trec2025-biogen-task-b
- MD5: 0b610aab283a8902c2094990ca702ffc
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the DeepSeek-R1 model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails validation, it is regenerated using corrective feedback up to a maximum of 20 times.
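The snippet-selection step shared by these runs (top-10 documents, split into sentences, top-30 snippets kept) can be sketched as below. This is an illustration, not the team's code: `score` stands in for the MonoT5 relevance scorer, and the regex sentence splitter is an assumption.

```python
import re

def select_snippets(ranked_docs, score, top_docs=10, top_snippets=30):
    """Split the top-ranked documents into sentences and keep the
    highest-scoring sentences as evidence snippets.

    ranked_docs: list of (pmid, text) pairs, best first.
    score: callable mapping a sentence to a relevance score.
    """
    candidates = []
    for pmid, text in ranked_docs[:top_docs]:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence.strip():
                candidates.append((pmid, sentence.strip()))
    # Rank all candidate sentences by the supplied relevance scorer.
    candidates.sort(key=lambda c: score(c[1]), reverse=True)
    return candidates[:top_snippets]
```

Keeping the PMID alongside each sentence lets the generator cite the source document for every snippet it uses.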
rrf_monot5-msmarco_llama70b
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: rrf_monot5-msmarco_llama70b
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-05
- Task: trec2025-biogen-task-b
- MD5: 4d303fa037442df5b21fa8d8bcc5d6b6
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the Llama-3-70B model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails validation, it is regenerated using corrective feedback up to a maximum of 20 times.
run10_no_rerank_index-passages-dense_gpt-4o-mini
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run10_no_rerank_index-passages-dense_gpt-4o-mini
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-b
- MD5: 01a1c9ea98b7eccf7659dab6b3bafe92
- Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
gpt-4o-mini is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run1_no-rerank_index-passages-sparse
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: run1_no-rerank_index-passages-sparse
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5: 6b20b2995cf71fd52124eb39f4cd7389
- Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-04
- Task: trec2025-biogen-task-b
- MD5: ffcc246665d8aae8fbb06b742517bf74
- Run description: Retrieval with BM25 on a sparse index of full documents. No re-ranking was performed. Each document was tagged with an MNLI model and then passed to an LLM to generate the grounded answer.
run2_rerank_index-passages-sparse
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run2_rerank_index-passages-sparse
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5: cd44046a583ead20e7e5fe0290558dd8
- Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5: 87cebf600f096e612ac35d0109956691
- Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run3_no-rerank_index-passages-dense
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run3_no-rerank_index-passages-dense
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5:
721f32f97fbf6b88fc83f030008ecc61 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct¶
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
a6fb85df0f46b58616c4ae28e40302e8 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run4_rerank_index-passages-dense¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run4_rerank_index-passages-dense
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5:
1bdad7e06c941f0659fd43bbd4e93373 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
44427437ce824c599ecd5469b90c3399 - Run description: Document retrieval is first performed over a dense index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
043ddd3fc98fbe59c99ce97c38eceddc - Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
efd4c936885a047e7f1dbf07a8254afd - Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
ea23d8ad09ad49b3fd61456d9209c6e2 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
7707a11e5a46d30b004100e6ed137c3c - Run description: Document retrieval is first performed over a dense index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run9_no_rerank_index-passages-sparse_gpt-4o-mini¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run9_no_rerank_index-passages-sparse_gpt-4o-mini
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-b
- MD5:
5ec35351669728f016cd88a224f2ccab - Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
gpt-4o-mini is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
scifive-ft-512CL-lex¶
Participants | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: scifive-ft-512CL-lex
- Participant: polito
- Track: BioGen
- Year: 2025
- Submission: 2025-08-13
- Task: trec2025-biogen-task-a
- MD5:
d6a5006caffd4d432f415961896ebfb0 - Run description: To fine-tune the SciFive classifier, I generated a mixed dataset of synthetic and real data with the following structure: {abstract, question, claim, label, metadata}. I used data from the PubMedQA and SciFact datasets, and six LLMs to create the dataset. The final label distribution is: Neutral: 17982 samples (41.93%), Entailment: 14674 samples (34.21%), Contradiction: 10233 samples (23.86%).
Extra information about the dataset:
2. ABSTRACT SOURCES¶
- original (from PubMedQA): 3464 samples (8.08%)
- generated by medgemma-4b-it: 8775 samples (20.46%)
- generated by BioMistral-7B-DARE: 9207 samples (21.47%)
- original (from SciFact): 773 samples (1.80%)
- original: 20670 samples (48.19%)
3. CLAIM SOURCES¶
- original (from PubMedQA): 5224 samples (12.18%)
- generated by GPT-oss-20b: 5224 samples (12.18%)
- original (from SciFact): 1471 samples (3.43%)
- generated by medgemma-4b-it: 6145 samples (14.33%)
- generated by Llama-3.1-8B-Instruct: 6094 samples (14.21%)
- generated by OLMo-2-1124-7B-Instruct: 6323 samples (14.74%)
- generated by SOLAR-10.7B-Instruct-v1.0: 6236 samples (14.54%)
- generated by Mistral-7B-Instruct-v0.3: 6172 samples (14.39%)
4. QUESTION SOURCES¶
- original (from PubMedQA): 10448 samples (24.36%)
- generated by GPT-oss-20b: 1471 samples (3.43%)
- generated by medgemma-4b-it: 6145 samples (14.33%)
- generated by Llama-3.1-8B-Instruct: 6094 samples (14.21%)
- generated by OLMo-2-1124-7B-Instruct: 6323 samples (14.74%)
- generated by SOLAR-10.7B-Instruct-v1.0: 6236 samples (14.54%)
- generated by Mistral-7B-Instruct-v0.3: 6172 samples (14.39%)
SIB-task-a-1¶
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: SIB-task-a-1
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
f8d64b1a404244973ce33d59d4422dc0 - Run description: Lucene BM25 index + spaCy NLP (en_core_web_lg)
SIB-task-a-2¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-2
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
aa3b01e738a17d206c125d646d84727c - Run description: Lucene BM25 index + facebook/bart-large-mnli to classify {support/contradict/irrelevant}
SIB-task-a-3¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-3
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5:
cb4d3159947197de9bd6597f560e89fe - Run description: - Reciprocal Rank Fusion to combine Elasticsearch and FAISS results (using ncbi/MedCPT-Query-Encoder)
- BAAI/bge-reranker-v2-m3 to rerank
- BioMistral/BioMistral-7B-SLERP with a prompt to select PMIDs that support or contradict the answers
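The Reciprocal Rank Fusion step used to merge the Elasticsearch and FAISS result lists can be sketched in a few lines of Python. This is an illustrative sketch only: the k=60 constant and the toy document IDs are assumptions, not values taken from this run.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper, not from this run.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a BM25 (Elasticsearch) list with a dense (FAISS) list.
bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, dense])
```

Documents appearing high in both lists (here `d1`) dominate the fused ranking without any score normalization, which is the main appeal of RRF over raw score interpolation.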
SIB-task-a-4¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-4
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5:
fba46f805344c67bcba33a45adaf81e1 - Run description: - Reciprocal Rank Fusion to combine Elasticsearch and FAISS results (using ncbi/MedCPT-Query-Encoder)
- BAAI/bge-reranker-v2-m3 to rerank
- ContactDoctor/Bio-Medical-Llama-3-8B with a prompt to select PMIDs that support or contradict the answers
SIB-task-a-5¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-5
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
274d593cf57c64e30f43d8912e933f81 - Run description: SIBiLS Elasticsearch + facebook/bart-large-mnli to classify {support/contradict/irrelevant} + prompt engineering
SIB-task-a-6¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-6
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
0b4c6bafd96cafcb02315dd320839cd4 - Run description: A pipeline that expands a claim into sub-claims (and their negations), retrieves candidate papers per (sub)claim using ES kNN + BM25 (0.3) with 0.7 score threshold, then asks Qwen-7B-Instruct to pick strictly confirming papers for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score.
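One plausible reading of "ES kNN + BM25 (0.3) with 0.7 score threshold" is a weighted blend of normalized dense (kNN) and BM25 scores with a retention cutoff. The sketch below follows that reading; the 0.3 weight, 0.7 threshold, and toy scores are assumptions, not confirmed details of this run.

```python
def combine_scores(knn_scores, bm25_scores, bm25_weight=0.3, threshold=0.7):
    """Blend normalized kNN and BM25 scores per document and filter by cutoff.

    Assumes both score dicts are already normalized to [0, 1]; the 0.3 BM25
    weight and 0.7 threshold follow one reading of the run description.
    """
    combined = {}
    for doc_id in set(knn_scores) | set(bm25_scores):
        score = ((1 - bm25_weight) * knn_scores.get(doc_id, 0.0)
                 + bm25_weight * bm25_scores.get(doc_id, 0.0))
        if score >= threshold:
            combined[doc_id] = score
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))

# Toy scores: d1 passes the cutoff (0.87), d2 is filtered out (0.57).
candidates = combine_scores({"d1": 0.9, "d2": 0.6}, {"d1": 0.8, "d2": 0.5})
```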
SIB-task-a-7¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-7
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
ca6393c825384011ff7be1214056c075 - Run description: A pipeline that expands a claim into sub-claims (and their negations), retrieves candidate papers per (sub)claim using ES kNN + BM25 (0.3) with 0.7 score threshold, then asks Qwen-7B-Instruct to pick strictly confirming papers for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score.
SIB-task-b-1¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-1
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
70b29aba3bcddba0d4acf402d259ee23 - Run description: SIBiLS Elasticsearch + llama-3-8B-instruct + prompt engineering
SIB-task-b-2¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-2
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
4dfa8c4c323f9edf8ed1a5b69d65e368 - Run description: SIBiLS Elasticsearch + prompt + LoRA (llama3-8b fine-tuned on the BioGen 2024 dataset)
SIB-task-b-3¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-3
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
eb71eff4459265fef079863c48aff421 - Run description: SIBiLS ES + prompt engineering + Qwen/Qwen2.5-7B-Instruct
SIB-task-b-4¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-4
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
997cc3d02e85a1d347b642b54a175426 - Run description: A pipeline that first generates claims from the patient’s question (history). Each claim is expanded into sub-claims (and their negations). For each (sub)claim, candidate sentences are retrieved from PubMed abstracts using ES kNN + BM25 (0.3) with a 0.7 score threshold. Qwen-7B-Instruct is then asked to select strictly confirming sentences for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score, and the selected sentences are summarized into concise, patient-friendly responses.
simpleQA_BM25¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: simpleQA_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
36b4fcfde6dd3fb909583886049ac8b1 - Run description: In this run, we performed QA using a two-stage document retrieval pipeline. Up to five sub-queries were generated from each question, combined with the topic and narrative, to retrieve documents via BM25, followed by ColBERT re-ranking of the top 25 documents. Answers were split into sentences, and for each sentence, the top 10 relevant documents were used by an LLM to generate up to three citation prompts.
simpleQA_hybrid¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: simpleQA_hybrid
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
18af27dc8a33061a3726b9696f008231 - Run description: In this run, we performed QA using a hybrid document retrieval pipeline combining BM25 and FAISS, followed by ColBERT re-ranking of the top 25 documents. Up to five sub-queries were generated from each question and used together with the topic and narrative to retrieve documents. Answers were split into sentences, and for each sentence, the top 10 relevant documents were used by an LLM to generate up to three citation prompts.
system_a¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_a
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
72a22ab259a29b7c83ea7e0f2c247008 - Run description: Baseline pipeline: BM25 retrieves 25 passages for the original question, a cross-encoder re-ranks them, and the top-5 are fed to Qwen to generate a ≤250-word answer. Each sentence is grounded with 1–3 bracketed citations that map to PMIDs from the presented evidence.
system_b_tfidf¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_b_tfidf
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
c3318e39fca5a1e219fa26bb125b16ea - Run description: The model takes a three-stage approach to retrieve, generate, and enhance its answers.
The first stage searches the PubMed index using BM25, with query expansion through multiple reformulations of the original question for wider coverage.
Next, we deduplicate and summarize the content of all retrieved articles while tracking the article each summary belongs to. We then compute the cosine similarity of the queries against the document summaries using their TF-IDF vectors, and apply Maximal Marginal Relevance (MMR) to the scores to select documents that are relevant but also diverse. The selected documents are summarized, ordered by MMR rank, and appended as context to generate the final answer.
Citation parsing then maps citation indices back to the original documents for each generated sentence.
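The Maximal Marginal Relevance selection described above can be sketched as a greedy loop over unit-normalized TF-IDF vectors. The lam=0.7 trade-off and the toy vectors below are illustrative assumptions, not this run's actual settings.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance over unit-normalized vectors.

    At each step pick the document maximizing
    lam * sim(query, d) - (1 - lam) * max(sim(d, s) for selected s);
    lam=0.7 is an illustrative relevance/diversity trade-off.
    """
    sims_q = doc_vecs @ query_vec          # relevance to the query
    sims_dd = doc_vecs @ doc_vecs.T        # pairwise document similarity
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((sims_dd[i, j] for j in selected), default=0.0)
            return lam * sims_q[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy unit vectors: d1 duplicates d0, d2 is dissimilar to both.
query = np.array([1.0, 0.0])
docs = np.array([[0.8, 0.6], [0.8, 0.6], [0.6, -0.8]])
picked = mmr_select(query, docs, k=2)  # picks d0, then the diverse d2 over the duplicate d1
```

The redundancy penalty is what lets the second pick skip the exact duplicate even though it is just as relevant as the first.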
system_c_medcpt¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_c_medcpt
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
ef6ebce0880c8ae0533276d588b4fb24 - Run description: We perform a search on the PubMed index using BM25, with query expansion through multiple reformulations of the original question for a wider search (original + 3 reformulated questions). The retrieved documents and the queries are fed into the MedCPT Bi-Encoder to generate L2-normalized query and document embeddings. The top 20 documents are then reranked with the MedCPT Cross-Encoder. The resulting documents are truncated and appended to the context for the final question answering.
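The bi-encoder scoring stage (L2-normalizing embeddings so that a dot product equals cosine similarity) can be sketched with NumPy. The 2-D vectors below are toy stand-ins for real MedCPT embeddings and are purely illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def top_k_by_cosine(query_emb, doc_embs, k=20):
    """Rank documents by cosine similarity to the query embedding."""
    q = l2_normalize(query_emb)
    d = l2_normalize(doc_embs)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy stand-ins for MedCPT query/document embeddings.
query = np.array([3.0, 4.0])                           # normalizes to [0.6, 0.8]
docs = np.array([[6.0, 8.0], [1.0, 0.0], [-3.0, -4.0]])
order, scores = top_k_by_cosine(query, docs, k=2)
```

In a full pipeline, the IDs surviving this top-k cut would then be paired with the query text and rescored by the cross-encoder.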
system_d_medcpt_wide¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_d_medcpt_wide
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
7245a6093b489be4693c92357220c9b9 - Run description: We perform a search on the PubMed index using BM25 with the original query. A wider search is done: out of 200 retrieved documents, we keep the top 30 after reranking. The retrieved documents' original abstracts and the queries are fed into the MedCPT Bi-Encoder to generate L2-normalized query and document embeddings. The resulting vectors are scored with the MedCPT Cross-Encoder, reranking the top 20 documents. The documents are finally appended to the context for the final question answering task.
system_e¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_e
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
1cb8646f4aa0f99b6c75ee92db1169d5 - Run description: The model takes a three-stage approach to retrieve, generate, and enhance its answers.
The first stage searches the PubMed index using BM25. A wide search is performed: out of 200 results, 30 documents are retained for each question queried (1 original question + 3 reformulated questions), for 120 top articles per topic overall.
We then deduplicate the articles and condense each into a summary of one to two sentences, totaling at most 80 words. Next, we compute the cosine similarity of the queries against the document summaries using their TF-IDF vectors.
We apply Maximal Marginal Relevance (MMR) to the scores to select documents that are relevant but also diverse. The selected documents are summarized, ordered by MMR rank, and appended as context to generate the final answer.
Citation parsing then maps citation indices back to the original documents for each generated sentence.
task_a_gehc_htic_run2¶
Participants | Proceedings | Input | Appendix
- Run ID: task_a_gehc_htic_run2
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2026-03-17
- Task: trec2025-biogen-task-a
- MD5:
9f3205f0da80b0b2a4c3402fa5909544 - Run description: This was an early-submission Task A run with no provided metadata.
task_a_run1¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run1
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
89f38e82fb2ca20309c7a8ce60f38344 - Run description: .
task_a_run2¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run2
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
8c767da83ede08c3f1a6ff66bd6d0328 - Run description: .
task_a_run3¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run3
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-a
- MD5:
71b78ed8caa239567b18d8a287b97c82 - Run description: .
task_a_run4¶
Participants | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: task_a_run4
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-26
- Task: trec2025-biogen-task-a
- MD5:
47270c10dc71c12626e51c7f6c78df97 - Run description: .
task_a_run6_A¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run6_A
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
3d7667d4d0e07069ccc2d26ac3451cda - Run description: .
task_b_output_reranker_sum¶
Participants | Input | task-b-auto-citation-human | task-b-human-eval | Appendix
- Run ID: task_b_output_reranker_sum
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2026-03-17
- Task: trec2025-biogen-task-b
- MD5:
8f6016974ee4dfda54ec276a36f1961c - Run description: This is a Task B run submitted early with no metadata.
task_b_run1¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run1
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
453e1b60a04d14fed3cce29f3c3ef9a6 - Run description: .
task_b_run4¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run4
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
e26f98e850949e266d5fbfa4412d043f - Run description: .
task_b_run4_B¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run4_B
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
ab3534350e56bfbc322fb0586d927a25 - Run description: .
task_b_run5¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run5
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
ea30a7fb10e3a83b60d67e706082b08d - Run description: .
UAmsterdam_bergen¶
Participants | Proceedings | Input | Appendix
- Run ID: UAmsterdam_bergen
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
409569452de90eff213ed9b6236e8e13 - Run description: unspecified
UAmsterdam_bergen_llama-70b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_llama-70b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
bba83d794e2c382bc7696e7942b30dbb - Run description: unspecified
UAmsterdam_bergen_llama-8b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_llama-8b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
20a089ce61d15212f97f8317dfc9f87c - Run description: unspecified
UAmsterdam_bergen_mistral-7b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_mistral-7b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
cbbfb8cdd1c5cb7aed3cab065db8ad87 - Run description: unspecified
UAmsterdam_bergen_pisco-llama¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_pisco-llama
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
064dc55a61adb0625a1c9620ab3b0bc1 - Run description: unspecified
UAmsterdam_bergen_pisco-mistral¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_pisco-mistral
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
309a695ce484589e7948ba6258e0db43 - Run description: unspecified