Runs - BioGen 2025
afmmd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: afmmd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: b5c62c53b36f483e6e2d12b9a4c9326a
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval. Queries are expanded with DeepSeek-R1 and refined with Rocchio feedback for BM25 retrieval (top 5000 docs), while five semantically diverse sub-queries are generated for dense retrieval via a FAISS index using BioBERT embeddings (up to 1000 docs each). Results are merged with Reciprocal Rank Fusion and re-ranked using MonoT5 to select the top 10 documents, which are split into sentences and scored to produce the top 30 evidence snippets. The LLM then evaluates whether the snippets are sufficient; if not, it identifies missing information and generates reformulated queries for iterative retrieval (up to 5 rounds). Finally, DeepSeek-R1 generates an answer strictly from the collected snippets, adhering to formatting and citation constraints (≤250 words, max three PMIDs per sentence). Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails any validation, it is regenerated using corrective feedback up to a maximum of 20 times.
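The merging step these runs share can be sketched with a minimal Reciprocal Rank Fusion. This is an illustrative sketch, not the team's code; the constant k=60 is the commonly used default and is not stated in the run description.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document's fused score is the sum of 1 / (k + rank) over every
    list it appears in; k dampens the influence of any single top rank.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: two retrievers returning overlapping PMID lists.
bm25_hits = ["p1", "p2", "p3"]
dense_hits = ["p3", "p1", "p4"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents appearing in both lists (here p1 and p3) accumulate score from each, so they outrank documents found by only one retriever.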
afmmdn
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: afmmdn
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: 6f7e4ce9a7fc813b342760f43fe74cf1
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval. Queries are expanded with DeepSeek-R1 and refined with Rocchio feedback for BM25 retrieval (top 5000 docs), while five semantically diverse sub-queries are generated for dense retrieval via a FAISS index using BioBERT embeddings (up to 1000 docs each). Results are merged with Reciprocal Rank Fusion and re-ranked using MonoT5 to select the top 10 documents, which are split into sentences and scored to produce the top 30 evidence snippets. The LLM then evaluates whether the snippets are sufficient; if not, it identifies missing information and generates reformulated queries for iterative retrieval (up to 5 rounds). Finally, DeepSeek-R1 generates an answer strictly from the collected snippets, adhering to formatting and citation constraints (≤250 words, max three PMIDs per sentence). This time, we do not validate the content of the answer sentences against the cited documents using the Mistral-7B model. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
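The non-LLM programmatic validation mentioned above could look roughly like this. It is a sketch under assumptions: the naive sentence splitter and the bracketed-PMID citation format are illustrative, not the team's actual conventions.

```python
import re

def validate_answer(answer, max_words=250, max_pmids_per_sentence=3):
    """Return a list of constraint violations; an empty list means the answer passes."""
    problems = []
    # Word-limit check: count whitespace-separated tokens.
    if len(answer.split()) > max_words:
        problems.append(f"answer exceeds {max_words} words")
    # Citation check: naive sentence split, then count bracketed PMIDs per sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        pmids = re.findall(r"\[(\d{7,8})\]", sentence)
        if len(pmids) > max_pmids_per_sentence:
            problems.append(f"too many PMIDs in: {sentence[:40]}...")
    return problems
```

On failure, the returned violation strings could be fed back to the generator as the corrective feedback the description mentions.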
emotional_prompt
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: emotional_prompt
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: 616eda8aa5cbed92cb2bcafdc78155b7
- Run description: This system generates expanded search queries from each answer sentence using an emotionally focused prompting strategy. The queries are encoded with BioBERT and matched against a FAISS index of PubMed. Retrieved documents are evaluated by a Mistral model for support or contradiction, and the system outputs only citations that are verified as relevant evidence.
empd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: empd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-21
- Task: trec2025-biogen-task-b
- MD5: b0245fa529fbeddc8c7c9297bde8f471
- Run description: This run implements an emotional prompting strategy for biomedical question answering. Answers are generated by DeepSeek-R1 using a carefully crafted emotional system prompt to ensure precise and professional biomedical language. Supporting citations are added through a dense retrieval and verification pipeline: BioBERT + FAISS retrieves up to 30 relevant PubMed documents, and Mistral-7B filters them using an emotional prompt to retain only truly supportive evidence. The final outputs are concise (<250 words) and contain no more than three supporting PubMed citations per sentence.
expd
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: expd
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-21
- Task: trec2025-biogen-task-b
- MD5: ca553d654b95be07f73c31ed4bca4398
- Run description: This run implements an expert prompting strategy for biomedical question answering. Answers are generated by DeepSeek-R1 using a carefully crafted expert system prompt to ensure precise and professional biomedical language. Supporting citations are added through a dense retrieval and verification pipeline: BioBERT + FAISS retrieves up to 30 relevant PubMed documents, and Mistral-7B filters them using an expert prompt to retain only truly supportive evidence. The final outputs are concise (<250 words) and contain no more than three supporting PubMed citations per sentence.
expert_prompt
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: expert_prompt
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: d6f54db4b2d2ba8f65451769b9556f6d
- Run description: This system expands each answer sentence into new search queries using an expert-driven prompting strategy. The queries are embedded with BioBERT and searched against a FAISS index of PubMed. Retrieved documents are then filtered by a Mistral model that classifies their relationship to the claim, ensuring only genuinely supportive or contradictory citations are kept.
gehc_htic_task_a
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: gehc_htic_task_a
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5: 9f3205f0da80b0b2a4c3402fa5909544
- Run description: This run uses a BM25-only pipeline over the official PubMed Lucene index (Pyserini), which stores each document’s raw JSON with keys {"id","contents"}. For every answer, we formulate a base query by concatenating the question and answer text, ensuring retrieval is keyed to the specific claim being evaluated. We inject a single, per-item metadata block (team_id, run_id, qa_id, question) in the output to match the submission schema, and we enable checkpointing with resume support so long jobs don’t lose progress. For SUPPORT, we retrieve the top-100 BM25 hits, build (text, pmid) candidates from the stored raw JSON, and exclude any PMIDs already known as existing supports or later selected as contradictions. We then apply a MedCPT Cross-Encoder to rerank these candidates by semantic relevance and keep the top-3 PMIDs. If reranking produces fewer than three results or text is missing for some hits, we backfill directly from the BM25 ranking to guarantee three supporting citations whenever possible. For CONTRADICTIONS, we run a broad BM25 search (top-1000) to maximize recall, then split each abstract into sentences and apply a negative-cue filter (e.g., “did not,” “no effect,” “not significant”) to focus on likely contradictory spans, with a fallback to the first few sentences if no cues are found. These sentences are classified with a MedNLI seq2seq model (doc→claim prompts), and any document with at least one sentence decoding to “contradiction” is kept; we then select up to three contradicted PMIDs. Label decoding is tolerant, and the approach is robust to truncation by operating at the sentence level.
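The negative-cue filter for contradiction candidates might be sketched as follows. This is an assumption-laden illustration: the cue list extends the three examples given in the description, and the fallback size and sentence splitter are guesses.

```python
import re

# Hypothetical cue list; only the first three phrases appear in the run description.
NEGATIVE_CUES = ("did not", "no effect", "not significant", "no significant", "failed to")

def contradiction_candidates(abstract, fallback_n=3):
    """Select abstract sentences likely to contradict a claim.

    Sentences containing a negative cue are returned; if none match,
    fall back to the first few sentences, as the run description notes.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
    cued = [s for s in sentences if any(cue in s.lower() for cue in NEGATIVE_CUES)]
    return cued if cued else sentences[:fallback_n]
```

Only the surviving sentences would then be passed to the NLI classifier, which keeps the per-document inference cost low.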
gehc_htic_task_b
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: gehc_htic_task_b
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-b
- MD5: 2ce4d2fd9ddf718b5804a6c1b28455e8
- Run description: This run builds a two-stage retrieval-and-generation pipeline to produce concise, citation-grounded answers. First, we perform a high-recall BM25 search over the PubMed Lucene index using the question alone (top 1000). We then rerank these candidates with a MedCPT Cross-Encoder, but crucially we use the narrative (not the question) as the reranking query. This leverages the richer intent and context in the narrative to prioritize documents that are not just topically related but also aligned with what the user actually wants to read about. From this reranked set, we keep the top 10 documents (with their PMIDs and cross-encoder scores) as the working evidence set for the LLM. Answer synthesis uses a one-shot prompt tailored for biomedical communication. The prompt includes the topic, question, narrative, and a detailed “iron and ferritin” exemplar that demonstrates the desired style and structure. The system lists the selected documents with numeric indices ([1], [2], …), PMIDs, and relevance scores, then asks GPT-4o (via an enterprise Azure OpenAI endpoint) to produce a short, narrative-style answer under strict rules: every sentence must be supported by 1–3 citations, conflicts must be explicitly called out, and only provided evidence can be used. The generation is run with deterministic settings (temperature=0, fixed seed) and a retry/backoff wrapper for resiliency.
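The retry/backoff wrapper mentioned for resiliency is a standard pattern around a flaky endpoint call; a minimal sketch, in which the attempt count and delay schedule are assumptions rather than the team's settings:

```python
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Invoke call(), retrying on any exception with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

The generation call (e.g. a lambda wrapping the Azure OpenAI request) is passed in as `call`, keeping the backoff logic independent of the API client.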
h2oloo_rr_g41_t50
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_g41_t50
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 512af248565e3b80d5a8442562c015a8
- Run description: GPT-4.1 takes the top 50 results and uses Ragnarok Biogen V6 to generate a grounded answer.
h2oloo_rr_q3-30b_nc
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_nc
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 40afec226e67f6d1698dbf8728a7f77e
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) uses Ragnarok Biogen V6 without citations (“No Cite”).
h2oloo_rr_q3-30b_t20
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_t20
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: d075fa276c82dadc219f413cd02ad35c
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) takes the top 20 results and uses Ragnarok Biogen V6 to generate a grounded answer.
h2oloo_rr_q3-30b_t50
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: h2oloo_rr_q3-30b_t50
- Participant: h2oloo
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5: 9a1d1b64a57bf74e79cfc30d40e7b993
- Run description: Qwen3-30B-A3B-Thinking-2507 served with vLLM (https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507) takes the top 50 results and uses Ragnarok Biogen V6 to generate a grounded answer.
hltbio-gpt5.searcher
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-gpt5.searcher
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: f266d732fcf4cd737ded99e33f3c9ea0
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Searcher II pointwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for most steps. GPT-5 is used for final answer generation (drafting) and answer shortening (revising the report). LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg-fsrrf
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg-fsrrf
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: e25dd639464cc09c5a50d45d99265564
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.crux
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-lg.crux
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 82cc22b26eea943ce83c75e71c17ba3d
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from meta-llama/Llama-3.3-70B-Instruct pointwise subquestion reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.fsrrfprf
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.fsrrfprf
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 0c77b30653dd9f4e62eca775351f5d79
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3, using PRF with each. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.jina
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.jina
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 08d343b91246db31e4f0d62256a3b0a9
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from jinaai/jina-reranker-m0 (2.4B) reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.jina.qwen
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.jina.qwen
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: b77494d2abda361ec8afeca9a32219da
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from the RRF of two rerankings (by jinaai/jina-reranker-m0 (2.4B) and Qwen/Qwen3-Reranker-8B) of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.listllama
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.listllama
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: eadb4895bafa93315ab238162930631d
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from meta-llama/Llama-3.3-70B-Instruct listwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.qwen
Participants | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: hltbio-lg.qwen
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 795bbfabadb5c310596484553e2160ff
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Qwen/Qwen3-Reranker-8B reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltbio-lg.searcher
Participants | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: hltbio-lg.searcher
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2025-08-19
- Task: trec2025-biogen-task-b
- MD5: 33826798de65d1713babe2ad1616fdd7
- Run description: LangGraph generator (reflection, note-taking, query generation, etc.) with retrieval results from Searcher II pointwise reranking of the RRF of PLAID-X, Qwen-Embedding-8B, and SPLADE-v3. LangGraph uses Llama 3.3 70B for all steps. LangGraph generates 4 initial queries, retrieves 12 results per query, and runs up to 5 research loops.
hltcoe-eval-common-smoothed-sonnet
Participants | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-eval-common-smoothed-sonnet
- Participant: EvalHLTCOE
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 6dfc15f8e00fc42fbf289e9f30f533c0
- Run description: unspecified
hltcoe-eval-svc-smoothed-sonnet
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-eval-svc-smoothed-sonnet
- Participant: EvalHLTCOE
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: ffb2d40ede881d2f4a7d51e7085f4fc7
- Run description: unspecified
hltcoe-multiagt.llama70B.ag_sw-ret_plq-6-wbg-limit
Participants | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.ag_sw-ret_plq-6-wbg-limit
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 7cfeffd4818d652c4c1375622a7e4442
- Run description: unspecified
hltcoe-multiagt.llama70B.lg-w-ret_lpq-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.lg-w-ret_lpq-nt-s_cite
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 88b249e2005db8ff69707bb186125fc9
- Run description: unspecified
hltcoe-multiagt.llama70B.lg-w-ret_p-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-multiagt.llama70B.lg-w-ret_p-nt-s_cite
- Participant: hltcoe-multiagt
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 1d4d8205029439fecf42330289283276
- Run description: unspecified
hltcoe-rerank.llama70B.lg-w-ret_plq-nt-s_cite
Participants | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: hltcoe-rerank.llama70B.lg-w-ret_plq-nt-s_cite
- Participant: hltcoe-rerank
- Track: BioGen
- Year: 2025
- Submission: 2026-02-11
- Task: trec2025-biogen-task-b
- MD5: 40e8ec7fd6ed5fd1d0bb37b3c917b2bb
- Run description: unspecified
LLM_BM25
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: LLM_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5: 73f640521d34a2f90939466da64224f0
- Run description: This run implements a pipeline to identify both supporting and contradictory citations for each sentence of a given medical answer. The system processes each sentence independently. First, it employs a two-stage document retrieval method (BM25 followed by a ColBERT re-ranker) to gather a small, highly relevant set of documents from a PubMed corpus. These documents are then formatted and passed as context to a Mistral-7B-Instruct-v0.3 model. Guided by specific, constraint-driven prompts, the LLM analyzes the documents and identifies up to three PMIDs that either support or contradict the claim made in the sentence. The final output is a structured list containing each sentence paired with its corresponding citations.
LLM_NLI_BM25
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: LLM_NLI_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5: 22be2ab0a22af2161c22d0c6c0f72501
- Run description: This run is designed to automatically find both supporting and contradictory citations for a series of medical statements. For each statement, it first generates search queries and retrieves a set of relevant scientific articles from PubMed using a BM25 and ColBERT reranking pipeline. It then uses the Mistral-7B language model to identify and extract supporting PMIDs from these articles. To find contradictory evidence, it employs a specialized SciFive NLI model to filter the retrieved articles for statements that logically contradict the original claim, and then uses Mistral-7B to extract the corresponding PMIDs. The final output is a structured JSON file that appends the identified supporting and contradictory citations to each original statement.
MedHopQA_BM25
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: MedHopQA_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5: 74b2ee5c867f02265a540b54fd72bf81
- Run description: We use the MedHopQA pipeline that we developed for the NLM BioCreative 9 workshop. The system uses iterative, multi-hop strategies for QA. The document retriever searches PubMed indexed with BM25.
MedHopQA_FAISS
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: MedHopQA_FAISS
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5: 74ed8175eaf93808e1c7e0d2d6fd9c7a
- Run description: We use the MedHopQA pipeline that we developed for the NLM BioCreative 9 workshop. The system uses iterative, multi-hop strategies for QA. The document retriever searches PubMed indexed with BM25 and FAISS.
rmmdn
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: rmmdn
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: c68d81194b874b12898ed1db7c339c86
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the DeepSeek-R1 model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. This time, we do not validate the content of the answer sentences against the cited documents. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
rmmln
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: rmmln
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-20
- Task: trec2025-biogen-task-b
- MD5: d3f4fb984338cb0f4d3b7285c92d62f6
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the Llama-3-70B model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. This time, we do not validate the content of the answer sentences against the cited documents. If the answer fails non-LLM programmatic validation (the 250-word limit and the maximum of three PMIDs per sentence), it is regenerated using corrective feedback up to a maximum of 20 times.
rrf_monot5-msmarco_deepseek-r1
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: rrf_monot5-msmarco_deepseek-r1
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-05
- Task: trec2025-biogen-task-b
- MD5: 0b610aab283a8902c2094990ca702ffc
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the DeepSeek-R1 model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails validation, it is regenerated using corrective feedback up to a maximum of 20 times.
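The snippet-selection step shared by these runs (top-10 documents, split into sentences, top-30 snippets kept) can be sketched as below. This is an illustration, not the team's code: `score` stands in for the MonoT5 relevance scorer, and the regex sentence splitter is an assumption.

```python
import re

def select_snippets(ranked_docs, score, top_docs=10, top_snippets=30):
    """Split the top-ranked documents into sentences and keep the
    highest-scoring sentences as evidence snippets.

    ranked_docs: list of (pmid, text) pairs, best first.
    score: callable mapping a sentence to a relevance score.
    """
    candidates = []
    for pmid, text in ranked_docs[:top_docs]:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence.strip():
                candidates.append((pmid, sentence.strip()))
    # Rank all candidate sentences by the supplied relevance scorer.
    candidates.sort(key=lambda c: score(c[1]), reverse=True)
    return candidates[:top_snippets]
```

Keeping the PMID alongside each sentence lets the generator cite the source document for every snippet it uses.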
rrf_monot5-msmarco_llama70b
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: rrf_monot5-msmarco_llama70b
- Participant: dal
- Track: BioGen
- Year: 2025
- Submission: 2025-08-05
- Task: trec2025-biogen-task-b
- MD5: 4d303fa037442df5b21fa8d8bcc5d6b6
- Run description: This run implements a hybrid retrieval pipeline combining BM25 and dense retrieval to collect relevant PubMed documents for each biomedical question. Retrieved documents are fused using Reciprocal Rank Fusion and re-ranked using a MonoT5 model. The top 10 documents are then split into sentences, and the top 30 evidence snippets are selected using MonoT5 relevance scoring. These snippets, along with the question, topic, and narrative, are provided as context to the Llama-3-70B model. The model is prompted to generate an answer using only the provided data. The answer must comply with strict formatting and citation constraints, including a 250-word limit and a maximum of three PMIDs per sentence. Next, we validate the content of each answer sentence against its cited documents using the Mistral-7B model. If the answer fails validation, it is regenerated using corrective feedback up to a maximum of 20 times.
run10_no_rerank_index-passages-dense_gpt-4o-mini
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run10_no_rerank_index-passages-dense_gpt-4o-mini
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-b
- MD5: 01a1c9ea98b7eccf7659dab6b3bafe92
- Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
gpt-4o-mini is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run1_no-rerank_index-passages-sparse
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: run1_no-rerank_index-passages-sparse
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5: 6b20b2995cf71fd52124eb39f4cd7389
- Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run1_no-rerank_index-passages-sparse_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-04
- Task: trec2025-biogen-task-b
- MD5: ffcc246665d8aae8fbb06b742517bf74
- Run description: Retrieval with BM25 on a sparse index of full documents. No re-ranking was performed. Each document was tagged with an MNLI model and then passed to an LLM to generate the grounded answer.
run2_rerank_index-passages-sparse
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run2_rerank_index-passages-sparse
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5: cd44046a583ead20e7e5fe0290558dd8
- Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run2_rerank_index-passages-sparse_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5: 87cebf600f096e612ac35d0109956691
- Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run3_no-rerank_index-passages-dense
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run3_no-rerank_index-passages-dense
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5:
721f32f97fbf6b88fc83f030008ecc61 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct¶
Participants | Proceedings | Input | task-b-human-eval | task-b-auto-citation | task-b-auto-answer | task-b-auto-answer-human | task-b-auto-citation-human | Appendix
- Run ID: run3_no_rerank_index-passages-dense_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
a6fb85df0f46b58616c4ae28e40302e8 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run4_rerank_index-passages-dense¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: run4_rerank_index-passages-dense
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-a
- MD5:
1bdad7e06c941f0659fd43bbd4e93373 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run4_rerank_index-passages-dense_Llama-3.1-8B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
44427437ce824c599ecd5469b90c3399 - Run description: Document retrieval is first performed over a dense index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.1-8B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run5_no_rerank_index-passages-sparse_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
043ddd3fc98fbe59c99ce97c38eceddc - Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run6_rerank_index-passages-sparse_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
efd4c936885a047e7f1dbf07a8254afd - Run description: Document retrieval is first performed using BM25 over a sparse index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run7_no_rerank_index-passages-dense_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
ea23d8ad09ad49b3fd61456d9209c6e2 - Run description: Document retrieval is first performed over a dense index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run8_rerank_index-passages-dense_Llama-3.3-70B-Instruct
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-06
- Task: trec2025-biogen-task-b
- MD5:
7707a11e5a46d30b004100e6ed137c3c - Run description: Document retrieval is first performed over a dense index. The initial results are then refined by a re-ranking model that reorders them based on relevance. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
Llama-3.3-70B-Instruct is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
run9_no_rerank_index-passages-sparse_gpt-4o-mini¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation-human | task-b-auto-citation | task-b-auto-answer-human | Appendix
- Run ID: run9_no_rerank_index-passages-sparse_gpt-4o-mini
- Participant: uniud
- Track: BioGen
- Year: 2025
- Submission: 2025-08-08
- Task: trec2025-biogen-task-b
- MD5:
5ec35351669728f016cd88a224f2ccab - Run description: Document retrieval is first performed using BM25 over a sparse index. An MNLI model is applied to assess the relationship between each document and the query, classifying them as supporting, contradicting, or neutral.
gpt-4o-mini is then prompted with the retrieved documents, along with only the supporting and contradicting evidence. Its generated response is subsequently parsed to fit the required output schema.
scifive-ft-512CL-lex¶
Participants | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: scifive-ft-512CL-lex
- Participant: polito
- Track: BioGen
- Year: 2025
- Submission: 2025-08-13
- Task: trec2025-biogen-task-a
- MD5:
d6a5006caffd4d432f415961896ebfb0 - Run description: To fine-tune the SciFive classifier, I generated a mixed dataset of synthetic and real data with the following structure: {abstract, question, claim, label, metadata}. I used data from the PubMedQA and SciFact datasets, and six LLMs to create the dataset. The final label distribution is: Neutral: 17982 samples (41.93%), Entailment: 14674 samples (34.21%), Contradiction: 10233 samples (23.86%).
Extra information about the dataset:
2. ABSTRACT SOURCES¶
- original (from PubMedQA): 3464 samples (8.08%)
- generated by medgemma-4b-it: 8775 samples (20.46%)
- generated by BioMistral-7B-DARE: 9207 samples (21.47%)
- original (from SciFact): 773 samples (1.80%)
- original: 20670 samples (48.19%)
3. CLAIM SOURCES¶
- original (from PubMedQA): 5224 samples (12.18%)
- generated by GPT-oss-20b: 5224 samples (12.18%)
- original (from SciFact): 1471 samples (3.43%)
- generated by medgemma-4b-it: 6145 samples (14.33%)
- generated by Llama-3.1-8B-Instruct: 6094 samples (14.21%)
- generated by OLMo-2-1124-7B-Instruct: 6323 samples (14.74%)
- generated by SOLAR-10.7B-Instruct-v1.0: 6236 samples (14.54%)
- generated by Mistral-7B-Instruct-v0.3: 6172 samples (14.39%)
4. QUESTION SOURCES¶
- original (from PubMedQA): 10448 samples (24.36%)
- generated by GPT-oss-20b: 1471 samples (3.43%)
- generated by medgemma-4b-it: 6145 samples (14.33%)
- generated by Llama-3.1-8B-Instruct: 6094 samples (14.21%)
- generated by OLMo-2-1124-7B-Instruct: 6323 samples (14.74%)
- generated by SOLAR-10.7B-Instruct-v1.0: 6236 samples (14.54%)
- generated by Mistral-7B-Instruct-v0.3: 6172 samples (14.39%)
SIB-task-a-1¶
Participants | Proceedings | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: SIB-task-a-1
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
f8d64b1a404244973ce33d59d4422dc0 - Run description: Lucene BM25 index + spaCy NLP (en_core_web_lg)
SIB-task-a-2¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-2
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
aa3b01e738a17d206c125d646d84727c - Run description: Lucene BM25 index + facebook/bart-large-mnli to classify {support/contradict/irrelevant}
SIB-task-a-3¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-3
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5:
cb4d3159947197de9bd6597f560e89fe - Run description: - Reciprocal Rank Fusion to combine Elasticsearch and FAISS results (using ncbi/MedCPT-Query-Encoder)
- BAAI/bge-reranker-v2-m3 to rerank
- BioMistral/BioMistral-7B-SLERP with a prompt to select PMIDs that support or contradict the answers
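The Reciprocal Rank Fusion step used to merge the Elasticsearch and FAISS result lists can be sketched in a few lines of Python. This is an illustrative sketch only: the k=60 constant and the toy document IDs are assumptions, not values taken from this run.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one fused ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper, not from this run.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: fuse a BM25 (Elasticsearch) list with a dense (FAISS) list.
bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, dense])
```

Documents appearing high in both lists (here `d1`) dominate the fused ranking without any score normalization, which is the main appeal of RRF over raw score interpolation.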
SIB-task-a-4¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-4
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-22
- Task: trec2025-biogen-task-a
- MD5:
fba46f805344c67bcba33a45adaf81e1 - Run description: - Reciprocal Rank Fusion to combine Elasticsearch and FAISS results (using ncbi/MedCPT-Query-Encoder)
- BAAI/bge-reranker-v2-m3 to rerank
- ContactDoctor/Bio-Medical-Llama-3-8B with a prompt to select PMIDs that support or contradict the answers
SIB-task-a-5¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-5
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
274d593cf57c64e30f43d8912e933f81 - Run description: SIBiLS Elasticsearch + facebook/bart-large-mnli to classify {support/contradict/irrelevant} + prompt engineering
SIB-task-a-6¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-6
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
0b4c6bafd96cafcb02315dd320839cd4 - Run description: A pipeline that expands a claim into sub-claims (and their negations), retrieves candidate papers per (sub)claim using ES kNN + BM25 (0.3) with 0.7 score threshold, then asks Qwen-7B-Instruct to pick strictly confirming papers for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score.
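One plausible reading of "ES kNN + BM25 (0.3) with 0.7 score threshold" is a weighted blend of normalized dense (kNN) and BM25 scores with a retention cutoff. The sketch below follows that reading; the 0.3 weight, 0.7 threshold, and toy scores are assumptions, not confirmed details of this run.

```python
def combine_scores(knn_scores, bm25_scores, bm25_weight=0.3, threshold=0.7):
    """Blend normalized kNN and BM25 scores per document and filter by cutoff.

    Assumes both score dicts are already normalized to [0, 1]; the 0.3 BM25
    weight and 0.7 threshold follow one reading of the run description.
    """
    combined = {}
    for doc_id in set(knn_scores) | set(bm25_scores):
        score = ((1 - bm25_weight) * knn_scores.get(doc_id, 0.0)
                 + bm25_weight * bm25_scores.get(doc_id, 0.0))
        if score >= threshold:
            combined[doc_id] = score
    return dict(sorted(combined.items(), key=lambda kv: kv[1], reverse=True))

# Toy scores: d1 passes the cutoff (0.87), d2 is filtered out (0.57).
candidates = combine_scores({"d1": 0.9, "d2": 0.6}, {"d1": 0.8, "d2": 0.5})
```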
SIB-task-a-7¶
Participants | Proceedings | Input | task-a-auto-eval | Appendix
- Run ID: SIB-task-a-7
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
ca6393c825384011ff7be1214056c075 - Run description: A pipeline that expands a claim into sub-claims (and their negations), retrieves candidate papers per (sub)claim using ES kNN + BM25 (0.3) with 0.7 score threshold, then asks Qwen-7B-Instruct to pick strictly confirming papers for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score.
SIB-task-b-1¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-1
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
70b29aba3bcddba0d4acf402d259ee23 - Run description: SIBiLS Elasticsearch + llama-3-8B-instruct + prompt engineering
SIB-task-b-2¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-2
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
4dfa8c4c323f9edf8ed1a5b69d65e368 - Run description: SIBiLS Elasticsearch + prompt + LoRA (llama3-8b fine-tuned on the BioGen 2024 dataset)
SIB-task-b-3¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-3
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
eb71eff4459265fef079863c48aff421 - Run description: SIBiLS ES + prompt engineering + Qwen/Qwen2.5-7B-Instruct
SIB-task-b-4¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: SIB-task-b-4
- Participant: SIB
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
997cc3d02e85a1d347b642b54a175426 - Run description: A pipeline that first generates claims from the patient’s question (history). Each claim is expanded into sub-claims (and their negations). For each (sub)claim, candidate sentences are retrieved from PubMed abstracts using ES kNN + BM25 (0.3) with a 0.7 score threshold. Qwen-7B-Instruct is then asked to select strictly confirming sentences for positives (supported) and for negatives (contradicted). Final top-3 per category are chosen by combining sub-claim representativeness and retrieval score, and the selected sentences are summarized into concise, patient-friendly responses.
simpleQA_BM25¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: simpleQA_BM25
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
36b4fcfde6dd3fb909583886049ac8b1 - Run description: In this run, we performed QA using a two-stage document retrieval pipeline. Up to five sub-queries were generated from each question, combined with the topic and narrative, to retrieve documents via BM25, followed by ColBERT re-ranking of the top 25 documents. Answers were split into sentences, and for each sentence, the top 10 relevant documents were used by an LLM to generate up to three citation prompts.
simpleQA_hybrid¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: simpleQA_hybrid
- Participant: CLaC
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
18af27dc8a33061a3726b9696f008231 - Run description: In this run, we performed QA using a hybrid document retrieval pipeline combining BM25 and FAISS, followed by ColBERT re-ranking of the top 25 documents. Up to five sub-queries were generated from each question and used together with the topic and narrative to retrieve documents. Answers were split into sentences, and for each sentence, the top 10 relevant documents were used by an LLM to generate up to three citation prompts.
system_a¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_a
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
72a22ab259a29b7c83ea7e0f2c247008 - Run description: Baseline pipeline: BM25 retrieves 25 passages for the original question, a cross-encoder re-ranks them, and the top-5 are fed to Qwen to generate a ≤250-word answer. Each sentence is grounded with 1–3 bracketed citations that map to PMIDs from the presented evidence.
system_b_tfidf¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_b_tfidf
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
c3318e39fca5a1e219fa26bb125b16ea - Run description: The model takes a three-stage approach to retrieve, generate, and enhance its answers.
The first stage searches the PubMed index using BM25, with query expansion through multiple reformulations of the original question for wider coverage.
Next, we deduplicate and summarize the content of all retrieved articles while tracking the article each summary belongs to. We then compute the cosine similarity of the queries against the document summaries using their TF-IDF vectors, and apply Maximal Marginal Relevance (MMR) to the scores to select documents that are relevant but also diverse. The selected documents are summarized, ordered by MMR rank, and appended as context to generate the final answer.
Citation parsing then maps citation indices back to the original documents for each generated sentence.
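The Maximal Marginal Relevance selection described above can be sketched as a greedy loop over unit-normalized TF-IDF vectors. The lam=0.7 trade-off and the toy vectors below are illustrative assumptions, not this run's actual settings.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=3, lam=0.7):
    """Greedy Maximal Marginal Relevance over unit-normalized vectors.

    At each step pick the document maximizing
    lam * sim(query, d) - (1 - lam) * max(sim(d, s) for selected s);
    lam=0.7 is an illustrative relevance/diversity trade-off.
    """
    sims_q = doc_vecs @ query_vec          # relevance to the query
    sims_dd = doc_vecs @ doc_vecs.T        # pairwise document similarity
    selected, remaining = [], list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr(i):
            redundancy = max((sims_dd[i, j] for j in selected), default=0.0)
            return lam * sims_q[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy unit vectors: d1 duplicates d0, d2 is dissimilar to both.
query = np.array([1.0, 0.0])
docs = np.array([[0.8, 0.6], [0.8, 0.6], [0.6, -0.8]])
picked = mmr_select(query, docs, k=2)  # picks d0, then the diverse d2 over the duplicate d1
```

The redundancy penalty is what lets the second pick skip the exact duplicate even though it is just as relevant as the first.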
system_c_medcpt¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_c_medcpt
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
ef6ebce0880c8ae0533276d588b4fb24 - Run description: We perform a search on the PubMed index using BM25, with query expansion through multiple reformulations of the original question for a wider search (original + 3 reformulated questions). The retrieved documents and the queries are fed into the MedCPT Bi-Encoder to generate L2-normalized query and document embeddings. The top 20 documents are then reranked with the MedCPT Cross-Encoder. The resulting documents are truncated and appended to the context for the final question answering.
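The bi-encoder scoring stage (L2-normalizing embeddings so that a dot product equals cosine similarity) can be sketched with NumPy. The 2-D vectors below are toy stand-ins for real MedCPT embeddings and are purely illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def top_k_by_cosine(query_emb, doc_embs, k=20):
    """Rank documents by cosine similarity to the query embedding."""
    q = l2_normalize(query_emb)
    d = l2_normalize(doc_embs)
    scores = d @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy stand-ins for MedCPT query/document embeddings.
query = np.array([3.0, 4.0])                           # normalizes to [0.6, 0.8]
docs = np.array([[6.0, 8.0], [1.0, 0.0], [-3.0, -4.0]])
order, scores = top_k_by_cosine(query, docs, k=2)
```

In a full pipeline, the IDs surviving this top-k cut would then be paired with the query text and rescored by the cross-encoder.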
system_d_medcpt_wide¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_d_medcpt_wide
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
7245a6093b489be4693c92357220c9b9 - Run description: We perform a search on the PubMed index using BM25 with the original query. A wider search is done: out of 200 retrieved documents, we keep the top 30 after reranking. The retrieved documents' original abstracts and the queries are fed into the MedCPT Bi-Encoder to generate L2-normalized query and document embeddings. The resulting vectors are scored with the MedCPT Cross-Encoder, reranking the top 20 documents. The documents are finally appended to the context for the final question answering task.
system_e¶
Participants | Proceedings | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: system_e
- Participant: CLaC Lab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-b
- MD5:
1cb8646f4aa0f99b6c75ee92db1169d5 - Run description: The model takes a three-stage approach to retrieve, generate, and enhance its answers.
The first stage searches the PubMed index using BM25. A wide search is performed: out of 200 results, 30 documents are retained for each question queried (1 original question + 3 reformulated questions), for 120 top articles per topic overall.
We then deduplicate the articles and condense each into a summary of one to two sentences, totaling at most 80 words. Next, we compute the cosine similarity of the queries against the document summaries using their TF-IDF vectors.
We apply Maximal Marginal Relevance (MMR) to the scores to select documents that are relevant but also diverse. The selected documents are summarized, ordered by MMR rank, and appended as context to generate the final answer.
Citation parsing then maps citation indices back to the original documents for each generated sentence.
task_a_gehc_htic_run2¶
Participants | Proceedings | Input | Appendix
- Run ID: task_a_gehc_htic_run2
- Participant: GEHC-HTIC
- Track: BioGen
- Year: 2025
- Submission: 2026-03-17
- Task: trec2025-biogen-task-a
- MD5:
9f3205f0da80b0b2a4c3402fa5909544 - Run description: This was an early-submission Task A run with no provided metadata.
task_a_run1¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run1
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
89f38e82fb2ca20309c7a8ce60f38344 - Run description: .
task_a_run2¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run2
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
8c767da83ede08c3f1a6ff66bd6d0328 - Run description: .
task_a_run3¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run3
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-a
- MD5:
71b78ed8caa239567b18d8a287b97c82 - Run description: .
task_a_run4¶
Participants | Input | task-a-auto-eval | task-a-human-eval | Appendix
- Run ID: task_a_run4
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-26
- Task: trec2025-biogen-task-a
- MD5:
47270c10dc71c12626e51c7f6c78df97 - Run description: .
task_a_run6_A¶
Participants | Input | task-a-auto-eval | Appendix
- Run ID: task_a_run6_A
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-23
- Task: trec2025-biogen-task-a
- MD5:
3d7667d4d0e07069ccc2d26ac3451cda - Run description: .
task_b_output_reranker_sum¶
Participants | Input | task-b-auto-citation-human | task-b-human-eval | Appendix
- Run ID: task_b_output_reranker_sum
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2026-03-17
- Task: trec2025-biogen-task-b
- MD5:
8f6016974ee4dfda54ec276a36f1961c - Run description: This is a Task B run submitted early with no metadata.
task_b_run1¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run1
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
453e1b60a04d14fed3cce29f3c3ef9a6 - Run description: .
task_b_run4¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run4
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
e26f98e850949e266d5fbfa4412d043f - Run description: .
task_b_run4_B¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run4_B
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
ab3534350e56bfbc322fb0586d927a25 - Run description: .
task_b_run5¶
Participants | Input | task-b-auto-answer | task-b-auto-citation | Appendix
- Run ID: task_b_run5
- Participant: InfoLab
- Track: BioGen
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-biogen-task-b
- MD5:
ea30a7fb10e3a83b60d67e706082b08d - Run description: .
UAmsterdam_bergen¶
Participants | Proceedings | Input | Appendix
- Run ID: UAmsterdam_bergen
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
409569452de90eff213ed9b6236e8e13 - Run description: unspecified
UAmsterdam_bergen_llama-70b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_llama-70b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
bba83d794e2c382bc7696e7942b30dbb - Run description: unspecified
UAmsterdam_bergen_llama-8b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-human-eval | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_llama-8b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
20a089ce61d15212f97f8317dfc9f87c - Run description: unspecified
UAmsterdam_bergen_mistral-7b¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_mistral-7b
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
cbbfb8cdd1c5cb7aed3cab065db8ad87 - Run description: unspecified
UAmsterdam_bergen_pisco-llama¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_pisco-llama
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
064dc55a61adb0625a1c9620ab3b0bc1 - Run description: unspecified
UAmsterdam_bergen_pisco-mistral¶
Participants | Proceedings | Input | task-b-auto-citation-human | task-b-auto-answer-human | Appendix
- Run ID: UAmsterdam_bergen_pisco-mistral
- Participant: UAmsterdam
- Track: BioGen
- Year: 2025
- Submission: 2026-02-12
- Task: trec2025-biogen-task-b
- MD5:
309a695ce484589e7948ba6258e0db43 - Run description: unspecified