Runs - Tip of the Tongue (TOT) 2025¶
anserini-bm25¶
Participants | Input | trec_eval | Appendix
- Run ID: anserini-bm25
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: 4f40a9559e7bb4044464de3928e92982
- Run description: We used the ir-datasets integration that we prepared to index the corpus into Anserini and performed BM25 retrieval with default parameter values.
bge-m3¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bge-m3
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-01
- Task: trec2025-tot-main
- MD5: 52af7e65d9d0644d04d99d0d19c6b864
- Run description: This is a dense retrieval run. We directly use the Wikipedia embeddings from https://huggingface.co/datasets/Upstash/wikipedia-2024-06-bge-m3 and the bge-m3 model (https://huggingface.co/BAAI/bge-m3) to embed all the queries; cosine similarity is computed to retrieve the top 1000 passages.
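The scoring step described above (embed queries, rank passages by cosine similarity) can be sketched as follows. This is a minimal illustration with toy 3-dimensional vectors, not the actual bge-m3 embeddings or the real corpus.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=1000):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]    # indices of the top-k documents
    return order, scores[order]

# Toy example: 3-dimensional "embeddings" for 4 documents.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.7, 0.7, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.2, 0.0])
idx, sims = top_k_cosine(query, docs, k=2)
```

In the actual run, `doc_matrix` would hold the precomputed Wikipedia embeddings and `k` would be 1000.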
bm25-porterblk-test¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25-porterblk-test
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5: bcdf217f3fad8aa82e343cddd01153ff
- Run description: Automatic BM25 baseline using PyTerrier/Terrier over the official TREC ToT 2025 Wikipedia corpus. Index: Terrier with Stopwords + PorterStemmer, EnglishTokeniser, and blocks (positions) enabled. Query processing: remove control chars and punctuation; keep ≤128 tokens (Terrier further truncates). Retrieval: BM25, top-1000 per query. No manual intervention on test; parameters verified on the provided dev splits.
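A query-cleaning step like the one described (remove control characters and punctuation, keep at most 128 tokens) might look like this; the regexes are illustrative, not DUTH's actual code.

```python
import re

def normalize_query(q, max_tokens=128):
    """Parser-safe query normalization for Terrier-style retrieval."""
    q = re.sub(r"[\x00-\x1f]", " ", q)   # remove control characters
    q = re.sub(r"[^\w\s]", " ", q)       # strip punctuation
    tokens = q.split()[:max_tokens]      # keep at most max_tokens tokens
    return " ".join(tokens)
```

Terrier would then truncate the cleaned query further on its own, as the description notes.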
bm25_hedge_aware¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_hedge_aware
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 8be1b496b0c0f3d2aedcc046a0f375b6
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index (title + full text), no field weighting changes.
Software: PyTerrier 0.10.0, Terrier 5.11. Queries executed with controls parse=false (bag-of-words).
Query normalization: strip punctuation that can trip Terrier (slashes, curly quotes, long dashes, ellipses), remove remaining non-alphanumerics, collapse whitespace. No manual per-query edits.
Hedges: removed via fixed list data/hedges.txt (case-insensitive phrase removal; removes only hedge/booster phrases such as “maybe”, “i’m not sure”, “kind of”, “definitely”; content words like names/years/places are kept).
Retrieval: BM25 first-stage only, num_results=1000, applied to the hedges-removed query.
Post-processing: enforce exactly 1000 docs per query; sort by score; TREC format with run_id bm25_hedges.
External resources: none; no official baseline runfiles used. Run type: Automatic.
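The hedge-removal step described above (case-insensitive, phrase-level, longest-first) could be sketched as follows. The `HEDGES` list is a hypothetical stand-in for data/hedges.txt, which is not published here.

```python
import re

# Hypothetical stand-in for data/hedges.txt; the actual lexicon is not published here.
HEDGES = ["i'm not sure", "definitely", "kind of", "maybe"]

def remove_hedges(query, hedges=HEDGES):
    """Delete hedge/booster phrases, longest first, case-insensitively."""
    for phrase in sorted(hedges, key=len, reverse=True):
        query = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", query,
                       flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", query).strip()  # collapse leftover whitespace
```

Matching longest phrases first prevents a short hedge ("not sure") from breaking apart a longer one ("i'm not sure") before it can match.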
bm25_hedges_neg¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_hedges_neg
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: cf7f8087c2e6dd5cb07eebfacc9d4f77
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, parse=false.
Query processing: Parser-safe normalization, then hedge/uncertainty removal using a fixed lexicon (data/hedges.txt). Removal is case-insensitive, phrase-level, longest-first; only matched hedge phrases are deleted and content words are preserved.
Negation detection: The normalized (pre-removal) query text is analyzed for negation cues. We match single-token cues (not, no, never, without, cannot), two-token “aux + not” forms (do/does/did/is/are/was/were/should/could/would/will), and split contractions (e.g., “don t”, “isn t”). We capture up to four subsequent tokens and retain a span only if it includes an attribute head from data/neg_heads.txt (version/remake/year/language/color/cut/etc.).
Retrieval: BM25 first-stage retrieval on the hedges-removed query, returning 1000 documents per query.
Negation-aware re-scoring: After retrieval, we penalize candidates whose title/lead foreground a negated span: −2.0 if matched in the ~first 128 chars; −1.0 if in the early lead (~first 400 chars). No hard filtering; body-only mentions are not penalized.
Ranking/output: Sort by adjusted score; guarantee exactly 1000 per query (top up from base BM25 if needed); TREC format with run_id bm25_hedges_neg.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
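The cue-and-span detection described above might be sketched like this. `NEG_HEADS` is a hypothetical stand-in for data/neg_heads.txt, and the tokenization is a simplification of whatever the actual run does.

```python
import re

NEG_CUES = {"not", "no", "never", "without", "cannot"}
# Hypothetical stand-in for data/neg_heads.txt; the real list is not published here.
NEG_HEADS = {"version", "remake", "year", "language", "color", "cut"}

def negated_spans(query, window=4):
    """Find spans of up to `window` tokens after a negation cue that contain an attribute head."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    spans = []
    for i, tok in enumerate(tokens):
        # Single-token cues; "aux + not" two-token forms are covered since "not" is itself a cue.
        cue = tok in NEG_CUES
        # Split contractions left behind by normalization, e.g. "don t", "isn t".
        cue = cue or (tok == "t" and i > 0 and tokens[i - 1].endswith("n"))
        if cue:
            span = tokens[i + 1 : i + 1 + window]
            if any(t in NEG_HEADS for t in span):  # keep only spans with an attribute head
                spans.append(" ".join(span))
    return spans
```

The attribute-head requirement is what keeps “No …” titles and incidental negations from producing spurious spans.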
bm25_negations¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_negations
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 94a747046236076c44293944a432d8c0
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, parse=false.
Query processing: Parser-safe normalization only (remove problematic punctuation as described above). We do not remove hedge phrases in this run; the normalized query text is used as-is for retrieval.
Negation detection: We detect negation cues in the normalized query by matching single-token cues (not, no, never, without, cannot), two-token “aux + not” forms (for do/does/did/is/are/was/were/should/could/would/will), and split-contraction forms that arise after normalization (e.g., “don t”, “isn t”). After a cue, we capture up to four subsequent tokens and keep a span only if it contains an attribute head from a fixed vocabulary (data/neg_heads.txt: version/remake/year/language/color/cut/etc.). This focuses on facets users commonly negate and avoids misinterpreting “No …” titles (which rarely include such heads).
Retrieval: BM25 first-stage retrieval on the normalized query, returning 1000 documents per query. We do not remove the negated words from the query.
Negation-aware re-scoring: After retrieval, we examine each candidate’s title plus the first ~1000 characters of the page (lead). If a negated span appears within the first ~128 characters (title-like window), we subtract 2.0 from the score; if it appears within the early lead (~first 400 characters), we subtract 1.0. No hard filtering is applied; body-only mentions are not penalized. This preserves recall and only demotes items that strongly foreground the negated facet.
Ranking/output: Sort by the adjusted score; ensure exactly 1000 documents per query by topping up from the BM25 list if needed; TREC format with run_id bm25_negpen.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
dgMxbaiL01¶
Participants | Input | trec_eval | Appendix
- Run ID: dgMxbaiL01
- Participant: dgthesis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 0b1e4548f0f9b65599875de5c3b606db
- Run description: Building on the PyTerrier BM25 baseline run on the test set provided by the organizing team, I first extracted all documents that appear in that run into a smaller corpus to rerank over. Using the reranker module and the 'mixedbread-ai/mxbai-rerank-large-v1' model, I then reranked the 1000 documents for each query, directly writing each query's reranking results to the runfile.
gemini-retrieval¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gemini-retrieval
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5: 09e8692b9234f6ed6567441d269d6b94
- Run description: LLM retrieval with a prompt that returns up to 20 Wikipedia page titles. LLM name matching uses MIN-SCORE=100 and BM25-K=10; includes redirect search. Model: gemini-2.5-flash, temperature=0.0, max-token=5000.
gm27q-comb-500¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gm27q-comb-500
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5: 4f596268322db6e07462dd83d04e01b5
- Run description: Combine the LLM retrieval results with the top-200 PyTerrier sparse retrieval results and the top-200 dense retrieval results, then rerank them with the quantized Gemma 27B model.
gm27q-LMART-1000¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gm27q-LMART-1000
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5: f522818b330a501da5c0100ab93f769b
- Run description: Use a trained LambdaMART reranker to rerank all retrieval results from LLM, sparse, and dense retrieval and take the top 1000. Then use the quantized Gemma 27B model to rerank the top 500.
gmn-rerank-500¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gmn-rerank-500
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-01
- Task: trec2025-tot-main
- MD5: 23693da08063f2a393d9b804598796b5
- Run description: This run is based on three first-stage retrieval results: 1) using the gemini-2.5-flash LLM to generate 20 entities that answer the ToT query; 2) the official PyTerrier BM25 results; 3) dense retrieval results from the BGE-M3 model. The 20 docs from 1), the top 500 from 2), and the top 500 from 3) are concatenated to form the first-stage retrieval results. Then the LLM (gemini-2.5-flash) is used to rerank those 1020 documents per query, using the listwise reranking implementation from the rank_llm library.
lambdamart-rerank¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lambdamart-rerank
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-03
- Task: trec2025-tot-main
- MD5: 50888eaf1d897ff53f0fdf8ba4c8540e
- Run description: Reranking all results of LLM (Gemini) retrieval, sparse retrieval, and dense retrieval using a trained LambdaMART model. The model was trained on the official dataset and LLM-generated synthetic queries.
lex-stronger-test¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lex-stronger-test
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5: 7909b9c9154715d9689fc4faa6b8ce7a
- Run description: Index: built over the official TREC ToT 2025 Wikipedia corpus using IterDictIndexer (blocks=false). Retrieval per query: multiple lexical retrievers – BM25 variants (b=0.55,k1=1.5; b=0.45,k1=1.6; b=0.60,k1=1.3), PL2, InL2, DFIC, DPH, BB2, DFRee, DirichletLM(mu=1500), Hiemstra_LM(c=0.7 and c=0.35). PRF: RM3 on BM25 pipelines (fb_terms=40, fb_docs=15, lambda=0.55; plus mid/light variants 30/12/0.60 and 20/8/0.50). Depth: each retriever returns up to 10,000 docs. Fusion: Reciprocal Rank Fusion with k=60. Query processing: remove control chars and normalize quotes/slashes; keep up to 128 tokens (Terrier further limits to ~64 terms). If the Terrier parser raises an error, we retry with a punctuation-stripped query. Output: top 1000 doc ids per query; if fusion yields fewer, we fill from a BM25 fallback to guarantee 1000 per qid. No manual intervention on test; parameters were tuned on the provided dev sets.
lex-stronger-testv2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lex-stronger-testv2
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-03
- Task: trec2025-tot-main
- MD5: ae902037e5ea380f8139a992c61e9adb
- Run description: Automatic lexical ensemble using PyTerrier/Terrier over the official TREC ToT 2025 Wikipedia corpus. Indexing: Terrier with Stopwords + PorterStemmer, EnglishTokeniser, and blocks (positions). Document text = title + body (whitespace normalized). Query processing: remove control characters and punctuation, normalize spaces, keep ≤128 tokens (Terrier truncates long queries internally). Retrieval pipelines: BM25, PL2, and DPH (document-level) plus pseudo-relevance feedback with RM3 (fb_terms=50, fb_docs=20, lambda=0.6) on top of BM25. Fusion: RRF (k=60) across base lexical and RM3 branches. Output: top-1000 Wikipedia page ids per query. No manual intervention on test; parameters chosen a-priori/validated only on the provided dev splits.
lightning-ir-dense¶
Participants | Input | trec_eval | Appendix
- Run ID: lightning-ir-dense
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: c501ad7c00548ba99b932e52aafeebcf
- Run description: We used the dense retrieval model that was trained and used as a baseline in the ToT 2023 task, run in lightning-ir (i.e., the model is trained for the movie domain). The document collection was processed via ir-datasets.
llama_norm_fusion_v2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: llama_norm_fusion_v2
- Participant: mst
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: b17814bbb9ef82631569652f33e10fa0
- Run description:
System Overview¶
Team: IRIS
Approach: Multi-Stage Hierarchical Retrieval and Reranking Pipeline
Pipeline Description¶
Our submission employs a sophisticated 5-stage hierarchical retrieval and reranking pipeline specifically designed for tip-of-the-tongue queries:
Stage 1: Sparse Retrieval¶
- Models Used: BM25, DPH (Divergence from Randomness), TF-IDF
- Index: TREC-TOT 2025 corpus (6.4M documents) via PyTerrier
- Query Enhancement: LLaMA-rewritten queries for improved query understanding
- Output: Three ranked lists per query (one per sparse model)
Stage 2: RRF Fusion¶
- Method: Reciprocal Rank Fusion with k=60
- Formula: score(d) = Σ(1/(k + rank_i(d)))
- Purpose: Combines sparse retrieval signals for robust initial ranking
- Output: Single fused ranking per query
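The fusion formula above can be sketched directly (illustrative toy lists, not the actual run output):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_i(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists standing in for the three sparse retrievers.
bm25  = ["d1", "d2", "d3"]
dph   = ["d2", "d1", "d4"]
tfidf = ["d2", "d3", "d1"]
fused = rrf([bm25, dph, tfidf])
```

Documents ranked highly by several lists accumulate the largest scores, which is what makes RRF robust to any single retriever's failure mode.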
Stage 3: Dense Retrieval (Bi-Encoders)¶
- Models:
- sentence-transformers/all-MiniLM-L6-v2 (lightweight semantic matching)
- sentence-transformers/all-mpnet-base-v2 (high-quality representations)
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (QA-optimized)
- Performance Optimizations:
- In-memory PyTerrier index (7.9GB RAM) eliminating disk I/O bottlenecks
- Document caching (50GB RAM) for instant document access
- 90% VRAM utilization with adaptive GPU batching
- Mixed precision (FP16) inference for 2x throughput improvement
- Multi-GPU workload distribution and optimization
- Process: Semantic similarity computation via cosine similarity
- Integration: Weighted combination with sparse scores (70% semantic, 30% sparse)
- Output: Dense retrieval rankings per bi-encoder model
Stage 4: LTR Fusion¶
- Algorithm: LightGBM Learning-to-Rank
- Features: 6-dimensional feature vector (3 sparse + 3 dense scores)
- Training: Pre-trained on training set with TREC relevance judgments
- Application: Applied to test queries using pre-trained model (no test QRELs)
- Output: Optimally fused ranking combining all signals
Stage 5: ColBERT Reranking¶
- Model: sentence-transformers/all-MiniLM-L6-v2 for late interaction
- Document Scope: Top 1000 documents per query from LTR stage
- Score Normalization: Z-score normalization (μ=0, σ=1)
- Fusion Strategy: 50/50 weighted combination of normalized LTR and ColBERT scores
- Output: Final ranking with 1000 documents per query
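The Stage 5 score combination (z-score normalization followed by a 50/50 weighted sum) can be sketched as follows; this is a minimal stand-in, not the submission's code.

```python
import statistics

def zscore(scores):
    """Z-score normalize a list of scores (mean 0, std 1)."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # guard against constant score lists
    return [(s - mu) / sigma for s in scores]

def fuse(ltr_scores, colbert_scores, w=0.5):
    """Weighted combination of normalized LTR and ColBERT scores (50/50 by default)."""
    return [w * a + (1 - w) * b
            for a, b in zip(zscore(ltr_scores), zscore(colbert_scores))]

# Toy scores on very different scales; normalization makes them comparable.
fused = fuse([1.0, 2.0, 3.0], [30.0, 10.0, 20.0])
```

Normalizing each score list before mixing is what prevents the scale-mismatch problem the description mentions: raw LTR and ColBERT scores live on different ranges and would otherwise dominate one another.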
Technical Specifications¶
- Query Processing: 622 test queries processed through complete pipeline
- Document Coverage: Up to 1000 documents per query in final ranking
- Score Normalization: Z-score for statistical consistency and outlier handling
- Implementation: PyTerrier + sentence-transformers + LightGBM + custom neural components
Performance Characteristics¶
- Training Performance: 42.58% NDCG@10 on training set (98.2% of LTR baseline)
- Robustness: Multi-stage fusion mitigates individual model weaknesses
- Semantic Understanding: Combines lexical precision with semantic matching
- Scalability: Efficient processing architecture for large-scale retrieval
Innovation Highlights¶
- Hierarchical Architecture: Each stage refines previous ranking with different signal types
- Advanced Score Fusion: Z-score normalization prevents scale mismatch issues
- Comprehensive Features: 6-dimensional sparse+dense feature engineering
- Neural Late Interaction: ColBERT-style semantic reranking for fine-grained relevance
- Query Enhancement: LLaMA rewriting optimized for tip-of-the-tongue scenarios
Implementation Notes¶
- Environment: Python 3.11, PyTerrier 0.11, sentence-transformers, LightGBM
- Hardware: Optimized for multi-core CPU processing + GPU acceleration
- Memory Management:
- In-memory index loading (7.9GB RAM)
- Document caching (50GB RAM)
- 90% GPU memory utilization with adaptive batching
- Performance: Mixed precision (FP16) + multi-GPU distribution
- Reliability: Comprehensive error handling and fallback mechanisms
This approach leverages complementary strengths of sparse retrieval (lexical precision), dense retrieval (semantic understanding), learning-to-rank (optimal feature combination), and neural reranking (fine-grained relevance modeling) to achieve optimal performance on tip-of-the-tongue information retrieval tasks.
llama_norm_fusion_z¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: llama_norm_fusion_z
- Participant: mst
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: b8794b6a4b89017c99b14803209dd2cf
- Run description:
System Overview¶
Team: IRIS
Approach: Multi-Stage Hierarchical Retrieval and Reranking Pipeline
Pipeline Description¶
Our submission employs a sophisticated 5-stage hierarchical retrieval and reranking pipeline specifically designed for tip-of-the-tongue queries:
Stage 1: Sparse Retrieval¶
- Models Used: BM25, DPH (Divergence from Randomness), TF-IDF
- Index: TREC-TOT 2025 corpus (6.4M documents) via PyTerrier
- Query Enhancement: LLaMA-rewritten queries for improved query understanding
- Output: Three ranked lists per query (one per sparse model)
Stage 2: RRF Fusion¶
- Method: Reciprocal Rank Fusion with k=60
- Formula: score(d) = Σ(1/(k + rank_i(d)))
- Purpose: Combines sparse retrieval signals for robust initial ranking
- Output: Single fused ranking per query
Stage 3: Dense Retrieval (Bi-Encoders)¶
- Models:
- sentence-transformers/all-MiniLM-L6-v2 (lightweight semantic matching)
- sentence-transformers/all-mpnet-base-v2 (high-quality representations)
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (QA-optimized)
- Performance Optimizations:
- In-memory PyTerrier index (7.9GB RAM) eliminating disk I/O bottlenecks
- Document caching (50GB RAM) for instant document access
- 90% VRAM utilization with adaptive GPU batching
- Mixed precision (FP16) inference for 2x throughput improvement
- Multi-GPU workload distribution and optimization
- Process: Semantic similarity computation via cosine similarity
- Integration: Weighted combination with sparse scores (70% semantic, 30% sparse)
- Output: Dense retrieval rankings per bi-encoder model
Stage 4: LTR Fusion¶
- Algorithm: LightGBM Learning-to-Rank
- Features: 6-dimensional feature vector (3 sparse + 3 dense scores)
- Training: Pre-trained on training set with TREC relevance judgments
- Application: Applied to test queries using pre-trained model (no test QRELs)
- Output: Optimally fused ranking combining all signals
Stage 5: ColBERT Reranking¶
- Model: sentence-transformers/all-MiniLM-L6-v2 for late interaction
- Document Scope: Top 1000 documents per query from LTR stage
- Score Normalization: Z-score normalization (μ=0, σ=1)
- Fusion Strategy: 50/50 weighted combination of normalized LTR and ColBERT scores
- Output: Final ranking with 1000 documents per query
Technical Specifications¶
- Query Processing: 622 test queries processed through complete pipeline
- Document Coverage: Up to 1000 documents per query in final ranking
- Score Normalization: Z-score for statistical consistency and outlier handling
- Implementation: PyTerrier + sentence-transformers + LightGBM + custom neural components
Performance Characteristics¶
- Training Performance: 42.58% NDCG@10 on training set (98.2% of LTR baseline)
- Robustness: Multi-stage fusion mitigates individual model weaknesses
- Semantic Understanding: Combines lexical precision with semantic matching
- Scalability: Efficient processing architecture for large-scale retrieval
Innovation Highlights¶
- Hierarchical Architecture: Each stage refines previous ranking with different signal types
- Advanced Score Fusion: Z-score normalization prevents scale mismatch issues
- Comprehensive Features: 6-dimensional sparse+dense feature engineering
- Neural Late Interaction: ColBERT-style semantic reranking for fine-grained relevance
- Query Enhancement: LLaMA rewriting optimized for tip-of-the-tongue scenarios
Implementation Notes¶
- Environment: Python 3.11, PyTerrier 0.11, sentence-transformers, LightGBM
- Hardware: Optimized for multi-core CPU processing + GPU acceleration
- Memory Management:
- In-memory index loading (7.9GB RAM)
- Document caching (50GB RAM)
- 90% GPU memory utilization with adaptive batching
- Performance: Mixed precision (FP16) + multi-GPU distribution
- Reliability: Comprehensive error handling and fallback mechanisms
This approach leverages complementary strengths of sparse retrieval (lexical precision), dense retrieval (semantic understanding), learning-to-rank (optimal feature combination), and neural reranking (fine-grained relevance modeling) to achieve optimal performance on tip-of-the-tongue information retrieval tasks.
pyterrier-bm25¶
Participants | Input | trec_eval | Appendix
- Run ID: pyterrier-bm25
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: 7d959809c188ea7e56e7f7b5219a1a31
- Run description: We used the ir-datasets integration that we created to index the dataset into PyTerrier and performed retrieval with BM25 (all parameters at their defaults).
rm3_hedge_neg¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_hedge_neg
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: ca9a53de6c06268f01ae7ce6397a243b
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin; parse=false.
Query processing: Parser-safe normalization combined with hedge/uncertainty removal (data/hedges.txt), applied as case-insensitive, phrase-level deletion of hedge phrases only (longest-first).
Negation detection: We detect negated spans from the normalized query by matching single-token cues (not/no/never/without/cannot), two-token “aux + not” forms (do/does/did/is/are/was/were/should/could/would/will), and split contractions arising after normalization (“don t”, “isn t”, etc.). We capture up to four subsequent tokens and retain the span only if it contains an attribute head from data/neg_heads.txt (e.g., version/remake/year/language/color/cut).
Retrieval (pseudo-relevance feedback): BM25 with feedback depth 50 on the hedges-removed query → RM3 (fb_docs=10, fb_terms=20) → BM25 final retrieval with 1000 results per query.
Negation-aware re-scoring: After final retrieval, we apply a soft penalty if a negated span appears within a candidate’s title-like window (~first 128 chars, −2.0) or early lead (~first 400 chars, −1.0). No hard filtering and no removal of query terms.
Ranking/output: Sort by adjusted score; guarantee exactly 1000 docs per query; TREC format with run_id rm3_hedges_neg.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
rm3_hedges¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_hedges
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: b6f472ebe30256c2167bafdd2351b882
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin for RM3; parse=false.
Query processing: Parser-safe normalization, followed by hedge/uncertainty removal using a fixed lexicon (data/hedges.txt). Removal is case-insensitive, phrase-level, longest-first; only hedge phrases are deleted and content words are preserved.
Negations: No negation detection or penalties are applied in this run.
Retrieval (pseudo-relevance feedback): Two-stage PRF pipeline on the hedges-removed query: BM25 initial retrieval with feedback depth 50, RM3 with fb_docs=10 and fb_terms=20 to build an expansion, and BM25 final retrieval to return 1000 documents per query.
Ranking/output: Sort by score; enforce exactly 1000 docs per query; TREC format with run_id rm3_hedges.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
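A toy illustration of the RM3 idea behind these PRF runs: estimate an expansion-term distribution from the top feedback documents and interpolate it with the original query. The actual runs use Terrier's terrier-prf implementation; this sketch only shows the shape of the computation.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=20, orig_weight=0.5):
    """Toy RM3-style expansion: feedback term distribution interpolated with the query."""
    counts = Counter(t for doc in feedback_docs for t in doc)
    total = sum(counts.values())
    # Expansion distribution over the fb_terms most frequent feedback terms.
    expansion = {t: c / total for t, c in counts.most_common(fb_terms)}
    # Uniform weights over the original query terms, scaled by orig_weight.
    weights = {t: orig_weight / len(query_terms) for t in query_terms}
    for t, p in expansion.items():
        weights[t] = weights.get(t, 0.0) + (1 - orig_weight) * p
    return weights

docs = [["french", "film", "heist"], ["french", "jazz"]]
w = rm3_expand(["french", "film"], docs)
```

The expanded weights are then used for a second BM25 retrieval pass, as in the pipeline above (fb_docs=10, fb_terms=20 in the actual run).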
rm3_negations¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_negations
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: a03fae124ad7fccf5362df32273665a4
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin; parse=false.
Query processing: Parser-safe normalization only; no hedge removal in this run.
Negation detection: We analyze the normalized query for negation cues (single-token cues: not/no/never/without/cannot; two-token “aux + not” forms for do/does/did/is/are/was/were/should/could/would/will; split contractions like “don t”, “isn t”). After a cue, we capture up to four subsequent tokens and keep a span only if it contains an attribute head from data/neg_heads.txt (version/remake/year/language/color/cut/etc.).
Retrieval (pseudo-relevance feedback): BM25 with feedback depth 50 → RM3 (fb_docs=10, fb_terms=20) → BM25 final retrieval with 1000 results per query.
Negation-aware re-scoring: After the final BM25 stage, we penalize candidates that highlight a negated span in the title/lead: −2.0 if matched within the ~first 128 chars; −1.0 if matched within the ~first 400 chars. We do not remove terms or filter documents.
Ranking/output: Sort by adjusted score; ensure exactly 1000 results per query; TREC format with run_id rm3_negpen.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
runid1¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid1
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 313d0f36baab05263b64c1567f0f1b88
- Run description: This run was generated using a Direct Preference Optimization (DPO)-based query rewriting system, where a single general-purpose language model was fine-tuned to align its rewrites with the preferences of dense and cross-encoder retrieval systems. The model was applied uniformly across all queries, without domain classification.
- Rewrite Model with DPO: A fixed pool of rewrite candidates was generated for each training query using a base language model. These candidates were ranked by:
- A dense retriever using all-mpnet-base-v2 over the corpus
- A cross-encoder reranker using cross-encoder/ms-marco-MiniLM-L-12-v2
From these scores, preference pairs were derived based on improvements in rank or cross-encoder logit of the target item. These preferences were then used to fine-tune two separate LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct via DPO. Both adapters were trained using the same general-purpose query distribution (no domain filtering), enabling the system to generalize across a wide range of vague, incomplete, or ambiguous user inputs.
- Rewrite Inference: At test time:
- All queries were passed through the same fixed pipeline, with no classification or routing.
- The dense rewrite was generated using the adapter and a prompt instructing the model to expand on vague details and maintain specificity.
- The cross rewrite was generated using the adapter and a prompt encouraging compact, fact-based formulations.
- Retrieval via Tree-of-Thoughts (ToT): The generated rewrites were passed to a Tree-of-Thoughts search module, which simulates iterative refinement of the query through LLM-generated hypotheses ("thoughts") and new rewrites:
- Embedding generation used all-mpnet-base-v2
- Dense retrieval was performed using cosine similarity on a normalized vector index
- Reranking was done with the same cross-encoder used during training
- The search proceeds greedily by expanding nodes that produce higher reranking scores
- Final result aggregation is done at the document level using the reranking score
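The preference-pair derivation described above (comparing rewrites by how well they rank the target item) can be illustrated with toy data. The actual pipeline also used cross-encoder logits and thresholding, which this sketch omits; the candidate strings and ranks below are invented.

```python
def preference_pairs(candidates):
    """Build (chosen, rejected) DPO pairs from rewrite candidates scored by
    the retrieval rank of the target item (lower rank = better rewrite)."""
    pairs = []
    for i, (rw_a, rank_a) in enumerate(candidates):
        for rw_b, rank_b in candidates[i + 1:]:
            if rank_a < rank_b:
                pairs.append((rw_a, rw_b))   # a preferred over b
            elif rank_b < rank_a:
                pairs.append((rw_b, rw_a))   # b preferred over a
    return pairs

# Toy rewrite candidates for one training query: (rewrite, rank of target item).
cands = [("a 1960s french heist film", 3),
         ("old movie", 120),
         ("french new wave heist film with jazz score", 1)]
pairs = preference_pairs(cands)
```

Each resulting (chosen, rejected) pair would then feed the DPO objective used to fine-tune the LoRA adapters.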
runid2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid2
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 2fba9b59fe54f462539263a41213442e
- Run description: This run was produced using a multi-stage pipeline combining prompt-tuned LLMs, dense and cross-encoder retrieval, and a Tree-of-Thoughts reasoning framework.
- Rewrite Generation via Prompt-Tuned LLaMA-3: We fine-tuned four adapters (using LoRA) on top of meta-llama/Llama-3.1-8B-Instruct, specialized in query rewriting for two domains:
- Movies: movies-dense (optimized for dense retrieval) and movies-cross (optimized for cross-encoder reranking)
- General domain (e.g., people, places, etc.): all-dense (for dense retrieval) and all-cross (for reranking)
Each adapter was trained using prompt tuning to transform vague or partial user queries into precise and informative rewrites. The target rewrites were selected based on their performance (i.e., best retrieval rank) from a pool of LLM-generated candidates. At inference time, each test query was classified as either "movie" or "other" (e.g., "person", "place", etc.) using a query classifier, and routed to the appropriate prompt and adapter. Two rewrites were generated per query: one using the -dense adapter and one using the -cross adapter. Both rewrites were stored and passed to the retrieval module.
- Retrieval with Tree-of-Thoughts (ToT) Framework: We employed a Tree-of-Thoughts architecture that combines LLM-based reasoning with dense retrieval and cross-encoder reranking:
a. Initial Retrieval:
- An embedding encoder (all-mpnet-base-v2) generated query embeddings.
- Dense similarity search was performed over either a general-purpose index or a movie-specific index, depending on the query classification.
b. Reranking:
- Candidates were reranked using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2), scoring each candidate against the original vague query and the rewrite from the *-cross adapter.
c. Tree Expansion:
- A greedy search explored the thought space: at each level, the LLM (LLaMA-3.1-8B with base weights) was prompted to produce new thoughts and rewrites, guided by a specialized multi-turn prompt. Each rewrite was embedded, searched via dense retrieval, and reranked again. The highest-scoring node was expanded until maximum depth or convergence.
This iterative process simulates how a human might refine their query through successive hypotheses. The resulting ranked list was built by aggregating the best reranked scores across all nodes.
- Final Output Construction: For each test query, we returned a ranked list of item IDs, ordered by reranker score.
runid3¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid3
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 464bec2e01351c314b0d572a43f1df5c
- Run description: This run was generated using a Dense + Cross-Encoder preference-based rewriting pipeline, in which a LLaMA 3.1-8B model was fine-tuned using Direct Preference Optimization (DPO) to generate reformulations aligned with the behavior of both dense and cross-encoder retrievers.
- Rewrite Preference Modeling via DPO: We began by creating a pool of candidate rewrites for each training query using a pretrained LLM. These rewrites were evaluated using two retrieval modules:
- Dense retriever: MPNet embeddings over the corpus.
- Cross-encoder reranker: cross-encoder/ms-marco-MiniLM-L-12-v2.
For each query, pairwise preferences were derived by comparing rewrites based on their downstream retrieval performance (e.g., higher-ranked retrieved results). We applied a threshold over the NDCG difference or rank delta to filter consistent preference pairs. These preference pairs were then used to train a DPO objective on the base model meta-llama/Llama-3.1-8B-Instruct, resulting in several LoRA adapters specialized for rewrite generation in different domains.
- Prompt-Guided Inference for Rewrites: At inference time, for each test query:
- A domain classifier (movie vs. other) routed the input to the appropriate set of adapters and prompts.
- The LLaMA model loaded the selected LoRA adapter and generated multiple rewrites: three using different dense-focused adapters and prompts, and one cross-encoder-aligned rewrite using the cross adapter.
- Each rewrite was generated with a specialized prompt instructing the model to be precise, short, and factual (for reranking) and to preserve user-specified context and details.
- Retrieval via Tree-of-Thoughts Framework: The output rewrites were passed into a Tree-of-Thoughts (ToT) search module:
- Dense embedding retrieval using MPNet
- Cross-encoder reranking.
- Greedy tree expansion with LLM-generated thoughts and rewrites
- Node evaluation based on reranker score
- Final aggregation of results based on score
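The preference pairs this run trains DPO on are derived by comparing rewrites on downstream retrieval performance with a threshold filter. A rough sketch of the rank-delta variant follows; the function name and the threshold value of 5 are illustrative, not taken from the submission.

```python
# Build (chosen, rejected) DPO pairs from the rank each rewrite achieved
# for the gold document (lower rank = better retrieval).

def build_preference_pairs(rewrites_with_ranks, min_rank_delta=5):
    """rewrites_with_ranks: list of (rewrite_text, gold_doc_rank).
    A pair (a, b) is kept only when a beats b by at least min_rank_delta
    positions, filtering out weak or inconsistent preferences."""
    pairs = []
    for a, rank_a in rewrites_with_ranks:
        for b, rank_b in rewrites_with_ranks:
            if rank_b - rank_a >= min_rank_delta:
                pairs.append((a, b))  # a is "chosen", b is "rejected"
    return pairs
```

Each resulting pair feeds the DPO loss as a (chosen, rejected) example for the corresponding query.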
runid4¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid4
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5:
1804c2742ca1f2a10e5b4fb82ecd2039 - Run description: This run was generated using a model to classify the relevance of the top-1 retrieved item from each of the three previous runs. For each query, we applied a relevance classifier built on gpt-5-nano to determine whether the top-1 result from each run is semantically aligned with the user's query. We then selected the result whose top-1 was judged most relevant.
Previous Runs Summary:
- Prompt Tuning – Domain Split: This run used prompt-tuned LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct, with separate models for movie vs. general queries. Each query was first classified by a query-type classifier (movie vs. other), then routed to specialized adapters. Rewrites were generated using domain-specific prompts and evaluated in a Tree-of-Thoughts retrieval framework with dense + cross reranking.
- DPO – Movie vs. General Split: This run used a set of LoRA adapters trained via Direct Preference Optimization (DPO) to align rewrites with dense or cross retriever preferences. The query was first classified, and different adapters were used for the movie vs. general domain. Rewrites were scored based on how well they matched previously learned preferences and used for retrieval via the same ToT framework.
- DPO – General Model (No Classification): This run removed domain-specific routing and used a general-purpose DPO model trained on all query types. A single dense-aligned and a single cross-aligned adapter were used across all queries. Rewrites were generated with consistent prompts and evaluated using the same dense + reranking pipeline without any per-query logic or adaptation.
GPT-5-nano Relevance Scoring: To combine these systems, we used a lightweight GPT-5 classifier to assess the semantic relevance of the top-1 document retrieved by each run for every query. The model received the query and the top-1 result text and was asked to classify the relevance. The classifier produced a three-way decision for each run's top result:
- 2 = Relevant
- 1 = Maybe
- 0 = Not relevant
For each query, we selected the run with the highest relevance score. If two or more runs tied, we applied a fixed priority order: General DPO → Movies DPO → Prompt Tuning.
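The tie-broken selection rule reduces to a few lines. In this sketch, only the 0/1/2 scores and the priority order come from the description above; the run keys and dictionary interface are hypothetical.

```python
# Pick the run whose top-1 result got the highest classifier score
# (2=Relevant, 1=Maybe, 0=Not relevant); ties go to the earlier entry
# in the fixed priority order.

PRIORITY = ["general_dpo", "movies_dpo", "prompt_tuning"]

def select_run(relevance_scores):
    """relevance_scores: dict mapping run name -> classifier score (0/1/2)."""
    best = max(relevance_scores.values())
    for run in PRIORITY:              # earliest priority run with the best score wins
        if relevance_scores.get(run) == best:
            return run
```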
scrb-tot-01¶
Participants | Proceedings | Input | Appendix
- Run ID: scrb-tot-01
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-tot-main
- MD5:
41ceb6c18ad3610283cd66272f956c2b - Run description: A pipeline composed of Dense Retriever, Reranker, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
Listwise Reranker using DeepSeek-V3: reranks the top-20 results from the reranker.
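The cascade above (cue-based dense retrieval, pointwise rerank of the top 2000, listwise LLM rerank of the top 20) can be sketched as a simple orchestration function; all callables are hypothetical stand-ins for the Qwen3-Embedding-8B retriever, the fine-tuned Qwen3-Reranker-8B, and the DeepSeek-V3 listwise reranker.

```python
# Three-stage cascade sketch: each stage narrows the candidate set while
# later, more expensive models only touch the head of the ranking.

def cascade(cues, dense_retrieve, rerank, llm_rerank):
    candidates = dense_retrieve(cues)           # domain-specific dense index
    reranked = rerank(cues, candidates[:2000])  # pointwise rerank of top 2000
    top20 = llm_rerank(cues, reranked[:20])     # listwise LLM rerank of top 20
    return top20 + reranked[20:]                # tail keeps the reranker order
```

The design point is cost: the listwise LLM sees only 20 documents per query, so its price stays bounded regardless of index size.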
scrb-tot-02¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-02
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-tot-main
- MD5:
95ea84f74abcee61e0f8a06edffd19db - Run description: A pipeline composed of Dense Retriever, Reranker, LLM Retriever and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: replace the top 11–20 docs from the reranker with the docs retrieved by the LLM Retriever, and then rerank the top-20 results (deduplicated).
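The merge step above (replace ranks 11–20 with the LLM Retriever's docs, then deduplicate before the listwise rerank) might look like this order-preserving sketch; the function name is illustrative.

```python
# Swap ranks 11-20 for LLM-retrieved docs, keeping ranks 1-10 intact,
# then deduplicate while preserving order. The result is the candidate
# list handed to the listwise LLM reranker.

def merge_for_listwise(reranked, llm_docs):
    merged = reranked[:10] + llm_docs[:10]
    seen, deduped = set(), []
    for doc in merged:
        if doc not in seen:   # first occurrence wins, so reranker head is kept
            seen.add(doc)
            deduped.append(doc)
    return deduped
```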
scrb-tot-03¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-03
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5:
5f45c7d3be9369f22ecb29c37ab8684f - Run description: A pipeline composed of Dense Retriever, Reranker, LLM Retriever, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: We design a three-stage ranking pipeline. First, the LLM retrieval results are inserted into the candidate list starting from rank 6, while ranks 1–5 are preserved from the baseline ranking. Second, we apply DeepSeek-V3 in a listwise ranking setting to reorder candidates from rank 2 through rank 10. Third, from the resulting ranking, we select the top four titles and conduct a fine-grained reranking using GPT-5 with the analyze-ranking strategy. The final output is obtained from this refined ranking.
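The three-stage pipeline can be sketched as list surgery over 0-indexed Python lists; `listwise_rerank` and `fine_rerank` are hypothetical stand-ins for the DeepSeek-V3 and GPT-5 calls, and the sketch ignores scores and ties.

```python
# Three-stage merge sketch:
#   stage 1: insert LLM-retrieved docs starting at rank 6 (ranks 1-5 kept)
#   stage 2: listwise-reorder ranks 2-10, leaving rank 1 fixed
#   stage 3: fine-grained rerank of the resulting top 4

def three_stage(baseline, llm_docs, listwise_rerank, fine_rerank):
    merged = baseline[:5] + llm_docs + baseline[5:]     # stage 1
    head = [merged[0]] + listwise_rerank(merged[1:10])  # stage 2 (ranks 2-10)
    refined = fine_rerank(head[:4]) + head[4:]          # stage 3 (top 4)
    return refined + merged[10:]                        # untouched tail
```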
scrb-tot-04¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-04
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5:
97e4de9fc097ba1165d5a3d69d7c28c0 - Run description: A pipeline composed of Dense Retriever, Reranker, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: reranks the top 2000 results from the retriever.
- For the movie domain: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset.
- For other domains: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, samples from the tomt-kis dataset, and 1766 synthetic examples created by glm-4-plus. The queries of the synthetic data were created from the top-5 docs retrieved by our baseline system.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: We design a three-stage ranking pipeline for the movie domain only. First, the LLM retrieval results are inserted into the candidate list starting from rank 6, while ranks 1–5 are preserved from the baseline ranking. Second, we apply DeepSeek-V3 in a listwise ranking setting to reorder candidates from rank 2 through rank 10. Third, from the resulting ranking, we select the top four titles and conduct a fine-grained reranking using GPT-5 with the analyze-ranking strategy. The final output is obtained from this refined ranking. For other domains, we replace the top 11–20 docs from the reranker with the docs retrieved by the LLM Retriever, and then rerank the top-20 results (deduplicated).
top_model_dense¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: top_model_dense
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5:
5a60d1e381c8b1a40bce83e1e240494e - Run description: First-stage dense retrieval with topic modelling.
webis-bm25-gpt-oss¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: webis-bm25-gpt-oss
- Participant: webis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5:
c8985a0b162f2f7d568dabb1f0a4181f - Run description: We used openai-gpt-oss-120b with 5 long-query reduction prompts that we developed in the previous year. The LLM received the original query as input and was prompted to remove words that do not help with retrieval. We ran the five reduced queries against PyTerrier with BM25 in its default configuration and used reciprocal rank fusion, as implemented in ranx, to fuse the 5 runs.
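The fusion step reduces to reciprocal rank fusion over the five BM25 runs. The run uses the ranx implementation; the self-contained sketch below assumes the common k=60 constant, which is not stated in the description.

```python
# Minimal reciprocal rank fusion: each document scores 1/(k + rank) in each
# run it appears in, and the sums determine the fused ordering.

def rrf(runs, k=60):
    """runs: list of ranked doc-id lists (one per reduction prompt)."""
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by several prompt variants rise to the top, which is the point of fusing five differently-reduced queries.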
webis-bm25-llama¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: webis-bm25-llama
- Participant: webis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5:
2280f49721dcb7526a4d2fb540854d37 - Run description: We used llama-3-3-70b-versatile with 5 long-query reduction prompts that we developed in the previous year. The LLM received the original query as input and was prompted to remove words that do not help with retrieval. We ran the five reduced queries against PyTerrier with BM25 in its default configuration and used reciprocal rank fusion, as implemented in ranx, to fuse the 5 runs.