Runs - Tip of the Tongue (TOT) 2025¶
anserini-bm25¶
Participants | Input | trec_eval | Appendix
- Run ID: anserini-bm25
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: 4f40a9559e7bb4044464de3928e92982
- Run description: We used the ir-datasets integration that we prepared to index the corpus into Anserini and performed BM25 retrieval with default parameter values.
bge-m3¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bge-m3
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-01
- Task: trec2025-tot-main
- MD5: 52af7e65d9d0644d04d99d0d19c6b864
- Run description: This is a dense retrieval run. We directly use the Wikipedia embeddings from https://huggingface.co/datasets/Upstash/wikipedia-2024-06-bge-m3 and the bge-m3 model (https://huggingface.co/BAAI/bge-m3) to embed all the queries; cosine similarity is computed to retrieve the top 1000 passages.
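The scoring step described above (embed queries, rank passages by cosine similarity) can be sketched as follows. This is a minimal illustration with toy 3-dimensional vectors, not the actual bge-m3 embeddings or the real corpus.

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=1000):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]    # indices of the top-k documents
    return order, scores[order]

# Toy example: 3-dimensional "embeddings" for 4 documents.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.7, 0.7, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.2, 0.0])
idx, sims = top_k_cosine(query, docs, k=2)
```

In the actual run, `doc_matrix` would hold the precomputed Wikipedia embeddings and `k` would be 1000.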
bm25-porterblk-test¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25-porterblk-test
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5: bcdf217f3fad8aa82e343cddd01153ff
- Run description: Automatic BM25 baseline using PyTerrier/Terrier over the official TREC ToT 2025 Wikipedia corpus. Index: Terrier with Stopwords + PorterStemmer, EnglishTokeniser, and blocks (positions) enabled. Query processing: remove control chars and punctuation; keep ≤128 tokens (Terrier further truncates). Retrieval: BM25, top-1000 per query. No manual intervention on test; parameters verified on the provided dev splits.
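A query-cleaning step like the one described (remove control characters and punctuation, keep at most 128 tokens) might look like this; the regexes are illustrative, not DUTH's actual code.

```python
import re

def normalize_query(q, max_tokens=128):
    """Parser-safe query normalization for Terrier-style retrieval."""
    q = re.sub(r"[\x00-\x1f]", " ", q)   # remove control characters
    q = re.sub(r"[^\w\s]", " ", q)       # strip punctuation
    tokens = q.split()[:max_tokens]      # keep at most max_tokens tokens
    return " ".join(tokens)
```

Terrier would then truncate the cleaned query further on its own, as the description notes.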
bm25_hedge_aware¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_hedge_aware
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 8be1b496b0c0f3d2aedcc046a0f375b6
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index (title + full text), no field weighting changes.
Software: PyTerrier 0.10.0, Terrier 5.11. Queries executed with controls parse=false (bag-of-words).
Query normalization: strip punctuation that can trip Terrier (slashes, curly quotes, long dashes, ellipses), remove remaining non-alphanumerics, collapse whitespace. No manual per-query edits.
Hedges: removed via fixed list data/hedges.txt (case-insensitive phrase removal; removes only hedge/booster phrases such as “maybe”, “i’m not sure”, “kind of”, “definitely”; content words like names/years/places are kept).
Retrieval: BM25 first-stage only, num_results=1000, applied to the hedges-removed query.
Post-processing: enforce exactly 1000 docs per query; sort by score; TREC format with run_id bm25_hedges.
External resources: none; no official baseline runfiles used. Run type: Automatic.
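The hedge-removal step described above (case-insensitive, phrase-level, longest-first) could be sketched as follows. The `HEDGES` list is a hypothetical stand-in for data/hedges.txt, which is not published here.

```python
import re

# Hypothetical stand-in for data/hedges.txt; the actual lexicon is not published here.
HEDGES = ["i'm not sure", "definitely", "kind of", "maybe"]

def remove_hedges(query, hedges=HEDGES):
    """Delete hedge/booster phrases, longest first, case-insensitively."""
    for phrase in sorted(hedges, key=len, reverse=True):
        query = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", query,
                       flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", query).strip()  # collapse leftover whitespace
```

Matching longest phrases first prevents a short hedge ("not sure") from breaking apart a longer one ("i'm not sure") before it can match.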
bm25_hedges_neg¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_hedges_neg
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: cf7f8087c2e6dd5cb07eebfacc9d4f77
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, parse=false.
Query processing: Parser-safe normalization, then hedge/uncertainty removal using a fixed lexicon (data/hedges.txt). Removal is case-insensitive, phrase-level, longest-first; only matched hedge phrases are deleted and content words are preserved.
Negation detection: The normalized (pre-removal) query text is analyzed for negation cues. We match single-token cues (not, no, never, without, cannot), two-token “aux + not” forms (do/does/did/is/are/was/were/should/could/would/will), and split contractions (e.g., “don t”, “isn t”). We capture up to four subsequent tokens and retain a span only if it includes an attribute head from data/neg_heads.txt (version/remake/year/language/color/cut/etc.).
Retrieval: BM25 first-stage retrieval on the hedges-removed query, returning 1000 documents per query.
Negation-aware re-scoring: After retrieval, we penalize candidates whose title/lead foreground a negated span: −2.0 if matched in the ~first 128 chars; −1.0 if in the early lead (~first 400 chars). No hard filtering; body-only mentions are not penalized.
Ranking/output: Sort by adjusted score; guarantee exactly 1000 per query (top up from base BM25 if needed); TREC format with run_id bm25_hedges_neg.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
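The cue-and-span detection described above might be sketched like this. `NEG_HEADS` is a hypothetical stand-in for data/neg_heads.txt, and the tokenization is a simplification of whatever the actual run does.

```python
import re

NEG_CUES = {"not", "no", "never", "without", "cannot"}
# Hypothetical stand-in for data/neg_heads.txt; the real list is not published here.
NEG_HEADS = {"version", "remake", "year", "language", "color", "cut"}

def negated_spans(query, window=4):
    """Find spans of up to `window` tokens after a negation cue that contain an attribute head."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    spans = []
    for i, tok in enumerate(tokens):
        # Single-token cues; "aux + not" two-token forms are covered since "not" is itself a cue.
        cue = tok in NEG_CUES
        # Split contractions left behind by normalization, e.g. "don t", "isn t".
        cue = cue or (tok == "t" and i > 0 and tokens[i - 1].endswith("n"))
        if cue:
            span = tokens[i + 1 : i + 1 + window]
            if any(t in NEG_HEADS for t in span):  # keep only spans with an attribute head
                spans.append(" ".join(span))
    return spans
```

The attribute-head requirement is what keeps “No …” titles and incidental negations from producing spurious spans.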
bm25_negations¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: bm25_negations
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 94a747046236076c44293944a432d8c0
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, parse=false.
Query processing: Parser-safe normalization only (remove problematic punctuation as described above). We do not remove hedge phrases in this run; the normalized query text is used as-is for retrieval.
Negation detection: We detect negation cues in the normalized query by matching single-token cues (not, no, never, without, cannot), two-token “aux + not” forms (for do/does/did/is/are/was/were/should/could/would/will), and split-contraction forms that arise after normalization (e.g., “don t”, “isn t”). After a cue, we capture up to four subsequent tokens and keep a span only if it contains an attribute head from a fixed vocabulary (data/neg_heads.txt: version/remake/year/language/color/cut/etc.). This focuses on facets users commonly negate and avoids misinterpreting “No …” titles (which rarely include such heads).
Retrieval: BM25 first-stage retrieval on the normalized query, returning 1000 documents per query. We do not remove the negated words from the query.
Negation-aware re-scoring: After retrieval, we examine each candidate’s title plus the first ~1000 characters of the page (lead). If a negated span appears within the first ~128 characters (title-like window), we subtract 2.0 from the score; if it appears within the early lead (~first 400 characters), we subtract 1.0. No hard filtering is applied; body-only mentions are not penalized. This preserves recall and only demotes items that strongly foreground the negated facet.
Ranking/output: Sort by the adjusted score; ensure exactly 1000 documents per query by topping up from the BM25 list if needed; TREC format with run_id bm25_negpen.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
dgMxbaiL01¶
Participants | Input | trec_eval | Appendix
- Run ID: dgMxbaiL01
- Participant: dgthesis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 0b1e4548f0f9b65599875de5c3b606db
- Run description: Building on the PyTerrier BM25 baseline run on the test set provided by the organizing team, I first extracted all documents that appear in that run into a smaller corpus to rerank over. Using the reranker module and the 'mixedbread-ai/mxbai-rerank-large-v1' model, I then reranked the 1000 documents for each query, directly writing each query's reranking results to the runfile.
gemini-retrieval¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gemini-retrieval
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5: 09e8692b9234f6ed6567441d269d6b94
- Run description: LLM retrieval with a prompt that returns up to 20 Wikipedia page titles. LLM name matching uses MIN-SCORE=100 and BM25-K=10; includes redirect search. Model: gemini-2.5-flash, temperature=0.0, max-token=5000.
gm27q-comb-500¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gm27q-comb-500
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5: 4f596268322db6e07462dd83d04e01b5
- Run description: Combine the LLM retrieval results with the top-200 PyTerrier sparse retrieval results and the top-200 dense retrieval results, then rerank them with the quantized Gemma 27B model.
gm27q-LMART-1000¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gm27q-LMART-1000
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5: f522818b330a501da5c0100ab93f769b
- Run description: Use a trained LambdaMART reranker to rerank all retrieval results from LLM, sparse, and dense retrieval and take the top 1000. Then use the quantized Gemma 27B model to rerank the top 500.
gmn-rerank-500¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: gmn-rerank-500
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-01
- Task: trec2025-tot-main
- MD5: 23693da08063f2a393d9b804598796b5
- Run description: This run is based on three first-stage retrieval results: 1) using the gemini-2.5-flash LLM to generate 20 entities that answer the ToT query; 2) the official PyTerrier BM25 results; 3) dense retrieval results from the BGE-M3 model. The 20 docs from 1), the top 500 from 2), and the top 500 from 3) are concatenated to form the first-stage retrieval results. Then the LLM (gemini-2.5-flash) is used to rerank those 1020 documents per query, using the listwise reranking implementation from the rank_llm library.
lambdamart-rerank¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lambdamart-rerank
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-03
- Task: trec2025-tot-main
- MD5: 50888eaf1d897ff53f0fdf8ba4c8540e
- Run description: Reranking all results of LLM (Gemini) retrieval, sparse retrieval, and dense retrieval using a trained LambdaMART model. The model was trained on the official dataset and LLM-generated synthetic queries.
lex-stronger-test¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lex-stronger-test
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5: 7909b9c9154715d9689fc4faa6b8ce7a
- Run description: Index: built over the official TREC ToT 2025 Wikipedia corpus using IterDictIndexer (blocks=false). Retrieval per query: multiple lexical retrievers – BM25 variants (b=0.55,k1=1.5; b=0.45,k1=1.6; b=0.60,k1=1.3), PL2, InL2, DFIC, DPH, BB2, DFRee, DirichletLM(mu=1500), Hiemstra_LM(c=0.7 and c=0.35). PRF: RM3 on BM25 pipelines (fb_terms=40, fb_docs=15, lambda=0.55; plus mid/light variants 30/12/0.60 and 20/8/0.50). Depth: each retriever returns up to 10,000 docs. Fusion: Reciprocal Rank Fusion with k=60. Query processing: remove control chars and normalize quotes/slashes; keep up to 128 tokens (Terrier further limits to ~64 terms). If the Terrier parser raises an error, we retry with a punctuation-stripped query. Output: top 1000 doc ids per query; if fusion yields fewer, we fill from a BM25 fallback to guarantee 1000 per qid. No manual intervention on test; parameters were tuned on the provided dev sets.
lex-stronger-testv2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: lex-stronger-testv2
- Participant: DUTH
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-03
- Task: trec2025-tot-main
- MD5: ae902037e5ea380f8139a992c61e9adb
- Run description: Automatic lexical ensemble using PyTerrier/Terrier over the official TREC ToT 2025 Wikipedia corpus. Indexing: Terrier with Stopwords + PorterStemmer, EnglishTokeniser, and blocks (positions). Document text = title + body (whitespace normalized). Query processing: remove control characters and punctuation, normalize spaces, keep ≤128 tokens (Terrier truncates long queries internally). Retrieval pipelines: BM25, PL2, and DPH (document-level) plus pseudo-relevance feedback with RM3 (fb_terms=50, fb_docs=20, lambda=0.6) on top of BM25. Fusion: RRF (k=60) across base lexical and RM3 branches. Output: top-1000 Wikipedia page ids per query. No manual intervention on test; parameters chosen a-priori/validated only on the provided dev splits.
lightning-ir-dense¶
Participants | Input | trec_eval | Appendix
- Run ID: lightning-ir-dense
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: c501ad7c00548ba99b932e52aafeebcf
- Run description: We used the dense retrieval model that was trained and used as a baseline in the ToT 2023 task, run in lightning-ir (i.e., the model is trained for the movie domain). The document collection was processed via ir-datasets.
llama_norm_fusion_v2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: llama_norm_fusion_v2
- Participant: mst
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: b17814bbb9ef82631569652f33e10fa0
- Run description:
System Overview¶
Team: IRIS
Approach: Multi-Stage Hierarchical Retrieval and Reranking Pipeline
Pipeline Description¶
Our submission employs a sophisticated 5-stage hierarchical retrieval and reranking pipeline specifically designed for tip-of-the-tongue queries:
Stage 1: Sparse Retrieval¶
- Models Used: BM25, DPH (Divergence from Randomness), TF-IDF
- Index: TREC-TOT 2025 corpus (6.4M documents) via PyTerrier
- Query Enhancement: LLaMA-rewritten queries for improved query understanding
- Output: Three ranked lists per query (one per sparse model)
Stage 2: RRF Fusion¶
- Method: Reciprocal Rank Fusion with k=60
- Formula: score(d) = Σ(1/(k + rank_i(d)))
- Purpose: Combines sparse retrieval signals for robust initial ranking
- Output: Single fused ranking per query
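The fusion formula above can be sketched directly (illustrative toy lists, not the actual run output):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_i(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists standing in for the three sparse retrievers.
bm25  = ["d1", "d2", "d3"]
dph   = ["d2", "d1", "d4"]
tfidf = ["d2", "d3", "d1"]
fused = rrf([bm25, dph, tfidf])
```

Documents ranked highly by several lists accumulate the largest scores, which is what makes RRF robust to any single retriever's failure mode.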
Stage 3: Dense Retrieval (Bi-Encoders)¶
- Models:
- sentence-transformers/all-MiniLM-L6-v2 (lightweight semantic matching)
- sentence-transformers/all-mpnet-base-v2 (high-quality representations)
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (QA-optimized)
- Performance Optimizations:
- In-memory PyTerrier index (7.9GB RAM) eliminating disk I/O bottlenecks
- Document caching (50GB RAM) for instant document access
- 90% VRAM utilization with adaptive GPU batching
- Mixed precision (FP16) inference for 2x throughput improvement
- Multi-GPU workload distribution and optimization
- Process: Semantic similarity computation via cosine similarity
- Integration: Weighted combination with sparse scores (70% semantic, 30% sparse)
- Output: Dense retrieval rankings per bi-encoder model
Stage 4: LTR Fusion¶
- Algorithm: LightGBM Learning-to-Rank
- Features: 6-dimensional feature vector (3 sparse + 3 dense scores)
- Training: Pre-trained on training set with TREC relevance judgments
- Application: Applied to test queries using pre-trained model (no test QRELs)
- Output: Optimally fused ranking combining all signals
Stage 5: ColBERT Reranking¶
- Model: sentence-transformers/all-MiniLM-L6-v2 for late interaction
- Document Scope: Top 1000 documents per query from LTR stage
- Score Normalization: Z-score normalization (μ=0, σ=1)
- Fusion Strategy: 50/50 weighted combination of normalized LTR and ColBERT scores
- Output: Final ranking with 1000 documents per query
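The Stage 5 score combination (z-score normalization followed by a 50/50 weighted sum) can be sketched as follows; this is a minimal stand-in, not the submission's code.

```python
import statistics

def zscore(scores):
    """Z-score normalize a list of scores (mean 0, std 1)."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0  # guard against constant score lists
    return [(s - mu) / sigma for s in scores]

def fuse(ltr_scores, colbert_scores, w=0.5):
    """Weighted combination of normalized LTR and ColBERT scores (50/50 by default)."""
    return [w * a + (1 - w) * b
            for a, b in zip(zscore(ltr_scores), zscore(colbert_scores))]

# Toy scores on very different scales; normalization makes them comparable.
fused = fuse([1.0, 2.0, 3.0], [30.0, 10.0, 20.0])
```

Normalizing each score list before mixing is what prevents the scale-mismatch problem the description mentions: raw LTR and ColBERT scores live on different ranges and would otherwise dominate one another.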
Technical Specifications¶
- Query Processing: 622 test queries processed through complete pipeline
- Document Coverage: Up to 1000 documents per query in final ranking
- Score Normalization: Z-score for statistical consistency and outlier handling
- Implementation: PyTerrier + sentence-transformers + LightGBM + custom neural components
Performance Characteristics¶
- Training Performance: 42.58% NDCG@10 on training set (98.2% of LTR baseline)
- Robustness: Multi-stage fusion mitigates individual model weaknesses
- Semantic Understanding: Combines lexical precision with semantic matching
- Scalability: Efficient processing architecture for large-scale retrieval
Innovation Highlights¶
- Hierarchical Architecture: Each stage refines previous ranking with different signal types
- Advanced Score Fusion: Z-score normalization prevents scale mismatch issues
- Comprehensive Features: 6-dimensional sparse+dense feature engineering
- Neural Late Interaction: ColBERT-style semantic reranking for fine-grained relevance
- Query Enhancement: LLaMA rewriting optimized for tip-of-the-tongue scenarios
Implementation Notes¶
- Environment: Python 3.11, PyTerrier 0.11, sentence-transformers, LightGBM
- Hardware: Optimized for multi-core CPU processing + GPU acceleration
- Memory Management:
- In-memory index loading (7.9GB RAM)
- Document caching (50GB RAM)
- 90% GPU memory utilization with adaptive batching
- Performance: Mixed precision (FP16) + multi-GPU distribution
- Reliability: Comprehensive error handling and fallback mechanisms
This approach leverages complementary strengths of sparse retrieval (lexical precision), dense retrieval (semantic understanding), learning-to-rank (optimal feature combination), and neural reranking (fine-grained relevance modeling) to achieve optimal performance on tip-of-the-tongue information retrieval tasks.
llama_norm_fusion_z¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: llama_norm_fusion_z
- Participant: mst
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: b8794b6a4b89017c99b14803209dd2cf
- Run description:
System Overview¶
Team: IRIS
Approach: Multi-Stage Hierarchical Retrieval and Reranking Pipeline
Pipeline Description¶
Our submission employs a sophisticated 5-stage hierarchical retrieval and reranking pipeline specifically designed for tip-of-the-tongue queries:
Stage 1: Sparse Retrieval¶
- Models Used: BM25, DPH (Divergence from Randomness), TF-IDF
- Index: TREC-TOT 2025 corpus (6.4M documents) via PyTerrier
- Query Enhancement: LLaMA-rewritten queries for improved query understanding
- Output: Three ranked lists per query (one per sparse model)
Stage 2: RRF Fusion¶
- Method: Reciprocal Rank Fusion with k=60
- Formula: score(d) = Σ(1/(k + rank_i(d)))
- Purpose: Combines sparse retrieval signals for robust initial ranking
- Output: Single fused ranking per query
Stage 3: Dense Retrieval (Bi-Encoders)¶
- Models:
- sentence-transformers/all-MiniLM-L6-v2 (lightweight semantic matching)
- sentence-transformers/all-mpnet-base-v2 (high-quality representations)
- sentence-transformers/multi-qa-MiniLM-L6-cos-v1 (QA-optimized)
- Performance Optimizations:
- In-memory PyTerrier index (7.9GB RAM) eliminating disk I/O bottlenecks
- Document caching (50GB RAM) for instant document access
- 90% VRAM utilization with adaptive GPU batching
- Mixed precision (FP16) inference for 2x throughput improvement
- Multi-GPU workload distribution and optimization
- Process: Semantic similarity computation via cosine similarity
- Integration: Weighted combination with sparse scores (70% semantic, 30% sparse)
- Output: Dense retrieval rankings per bi-encoder model
Stage 4: LTR Fusion¶
- Algorithm: LightGBM Learning-to-Rank
- Features: 6-dimensional feature vector (3 sparse + 3 dense scores)
- Training: Pre-trained on training set with TREC relevance judgments
- Application: Applied to test queries using pre-trained model (no test QRELs)
- Output: Optimally fused ranking combining all signals
Stage 5: ColBERT Reranking¶
- Model: sentence-transformers/all-MiniLM-L6-v2 for late interaction
- Document Scope: Top 1000 documents per query from LTR stage
- Score Normalization: Z-score normalization (μ=0, σ=1)
- Fusion Strategy: 50/50 weighted combination of normalized LTR and ColBERT scores
- Output: Final ranking with 1000 documents per query
Technical Specifications¶
- Query Processing: 622 test queries processed through complete pipeline
- Document Coverage: Up to 1000 documents per query in final ranking
- Score Normalization: Z-score for statistical consistency and outlier handling
- Implementation: PyTerrier + sentence-transformers + LightGBM + custom neural components
Performance Characteristics¶
- Training Performance: 42.58% NDCG@10 on training set (98.2% of LTR baseline)
- Robustness: Multi-stage fusion mitigates individual model weaknesses
- Semantic Understanding: Combines lexical precision with semantic matching
- Scalability: Efficient processing architecture for large-scale retrieval
Innovation Highlights¶
- Hierarchical Architecture: Each stage refines previous ranking with different signal types
- Advanced Score Fusion: Z-score normalization prevents scale mismatch issues
- Comprehensive Features: 6-dimensional sparse+dense feature engineering
- Neural Late Interaction: ColBERT-style semantic reranking for fine-grained relevance
- Query Enhancement: LLaMA rewriting optimized for tip-of-the-tongue scenarios
Implementation Notes¶
- Environment: Python 3.11, PyTerrier 0.11, sentence-transformers, LightGBM
- Hardware: Optimized for multi-core CPU processing + GPU acceleration
- Memory Management:
- In-memory index loading (7.9GB RAM)
- Document caching (50GB RAM)
- 90% GPU memory utilization with adaptive batching
- Performance: Mixed precision (FP16) + multi-GPU distribution
- Reliability: Comprehensive error handling and fallback mechanisms
This approach leverages complementary strengths of sparse retrieval (lexical precision), dense retrieval (semantic understanding), learning-to-rank (optimal feature combination), and neural reranking (fine-grained relevance modeling) to achieve optimal performance on tip-of-the-tongue information retrieval tasks.
pyterrier-bm25¶
Participants | Input | trec_eval | Appendix
- Run ID: pyterrier-bm25
- Participant: coordinators
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5: 7d959809c188ea7e56e7f7b5219a1a31
- Run description: We used the ir-datasets integration that we created to index the dataset into PyTerrier and performed retrieval with BM25 (all parameters at their defaults).
rm3_hedge_neg¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_hedge_neg
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: ca9a53de6c06268f01ae7ce6397a243b
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin; parse=false.
Query processing: Parser-safe normalization combined with hedge/uncertainty removal (data/hedges.txt), applied as case-insensitive, phrase-level deletion of hedge phrases only (longest-first).
Negation detection: We detect negated spans from the normalized query by matching single-token cues (not/no/never/without/cannot), two-token “aux + not” forms (do/does/did/is/are/was/were/should/could/would/will), and split contractions arising after normalization (“don t”, “isn t”, etc.). We capture up to four subsequent tokens and retain the span only if it contains an attribute head from data/neg_heads.txt (e.g., version/remake/year/language/color/cut).
Retrieval (pseudo-relevance feedback): BM25 with feedback depth 50 on the hedges-removed query → RM3 (fb_docs=10, fb_terms=20) → BM25 final retrieval with 1000 results per query.
Negation-aware re-scoring: After final retrieval, we apply a soft penalty if a negated span appears within a candidate’s title-like window (~first 128 chars, −2.0) or early lead (~first 400 chars, −1.0). No hard filtering and no removal of query terms.
Ranking/output: Sort by adjusted score; guarantee exactly 1000 docs per query; TREC format with run_id rm3_hedges_neg.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
rm3_hedges¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_hedges
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: b6f472ebe30256c2167bafdd2351b882
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin for RM3; parse=false.
Query processing: Parser-safe normalization, followed by hedge/uncertainty removal using a fixed lexicon (data/hedges.txt). Removal is case-insensitive, phrase-level, longest-first; only hedge phrases are deleted and content words are preserved.
Negations: No negation detection or penalties are applied in this run.
Retrieval (pseudo-relevance feedback): Two-stage PRF pipeline on the hedges-removed query: BM25 initial retrieval with feedback depth 50, RM3 with fb_docs=10 and fb_terms=20 to build an expansion, and BM25 final retrieval to return 1000 documents per query.
Ranking/output: Sort by score; enforce exactly 1000 docs per query; TREC format with run_id rm3_hedges.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
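A toy illustration of the RM3 idea behind these PRF runs: estimate an expansion-term distribution from the top feedback documents and interpolate it with the original query. The actual runs use Terrier's terrier-prf implementation; this sketch only shows the shape of the computation.

```python
from collections import Counter

def rm3_expand(query_terms, feedback_docs, fb_terms=20, orig_weight=0.5):
    """Toy RM3-style expansion: feedback term distribution interpolated with the query."""
    counts = Counter(t for doc in feedback_docs for t in doc)
    total = sum(counts.values())
    # Expansion distribution over the fb_terms most frequent feedback terms.
    expansion = {t: c / total for t, c in counts.most_common(fb_terms)}
    # Uniform weights over the original query terms, scaled by orig_weight.
    weights = {t: orig_weight / len(query_terms) for t in query_terms}
    for t, p in expansion.items():
        weights[t] = weights.get(t, 0.0) + (1 - orig_weight) * p
    return weights

docs = [["french", "film", "heist"], ["french", "jazz"]]
w = rm3_expand(["french", "film"], docs)
```

The expanded weights are then used for a second BM25 retrieval pass, as in the pipeline above (fb_docs=10, fb_terms=20 in the actual run).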
rm3_negations¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: rm3_negations
- Participant: UAmsterdam
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: a03fae124ad7fccf5362df32273665a4
- Run description: Corpus and index: TREC ToT 2025 Wikipedia JSONL; PyTerrier/Terrier index over title + full text.
Software/config: PyTerrier 0.10.0, Terrier 5.11, terrier-prf plugin; parse=false.
Query processing: Parser-safe normalization only; no hedge removal in this run.
Negation detection: We analyze the normalized query for negation cues (single-token cues: not/no/never/without/cannot; two-token “aux + not” forms for do/does/did/is/are/was/were/should/could/would/will; split contractions like “don t”, “isn t”). After a cue, we capture up to four subsequent tokens and keep a span only if it contains an attribute head from data/neg_heads.txt (version/remake/year/language/color/cut/etc.).
Retrieval (pseudo-relevance feedback): BM25 with feedback depth 50 → RM3 (fb_docs=10, fb_terms=20) → BM25 final retrieval with 1000 results per query.
Negation-aware re-scoring: After the final BM25 stage, we penalize candidates that highlight a negated span in the title/lead: −2.0 if matched within the ~first 128 chars; −1.0 if matched within the ~first 400 chars. We do not remove terms or filter documents.
Ranking/output: Sort by adjusted score; ensure exactly 1000 results per query; TREC format with run_id rm3_negpen.
External resources/baselines: No LLMs or official baseline runfiles used. Run type: Automatic.
runid1¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid1
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 313d0f36baab05263b64c1567f0f1b88
- Run description: This run was generated using a Direct Preference Optimization (DPO)-based query rewriting system, where a single general-purpose language model was fine-tuned to align its rewrites with the preferences of dense and cross-encoder retrieval systems. The model was applied uniformly across all queries, without domain classification.
- Rewrite Model with DPO: A fixed pool of rewrite candidates was generated for each training query using a base language model. These candidates were ranked by:
- A dense retriever using all-mpnet-base-v2 over the corpus
- A cross-encoder reranker using cross-encoder/ms-marco-MiniLM-L-12-v2
From these scores, preference pairs were derived based on improvements in rank or cross-encoder logit of the target item. These preferences were then used to fine-tune two separate LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct via DPO. Both adapters were trained using the same general-purpose query distribution (no domain filtering), enabling the system to generalize across a wide range of vague, incomplete, or ambiguous user inputs.
- Rewrite Inference: At test time:
- All queries were passed through the same fixed pipeline, with no classification or routing.
- The dense rewrite was generated using the adapter and a prompt instructing the model to expand on vague details and maintain specificity.
- The cross rewrite was generated using the adapter and a prompt encouraging compact, fact-based formulations.
- Retrieval via Tree-of-Thoughts (ToT): The generated rewrites were passed to a Tree-of-Thoughts search module, which simulates iterative refinement of the query through LLM-generated hypotheses ("thoughts") and new rewrites:
- Embedding generation used all-mpnet-base-v2
- Dense retrieval was performed using cosine similarity on a normalized vector index
- Reranking was done with the same cross-encoder used during training
- The search proceeds greedily by expanding nodes that produce higher reranking scores
- Final result aggregation is done at the document level using the reranking score
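The preference-pair derivation described above (comparing rewrites by how well they rank the target item) can be illustrated with toy data. The actual pipeline also used cross-encoder logits and thresholding, which this sketch omits; the candidate strings and ranks below are invented.

```python
def preference_pairs(candidates):
    """Build (chosen, rejected) DPO pairs from rewrite candidates scored by
    the retrieval rank of the target item (lower rank = better rewrite)."""
    pairs = []
    for i, (rw_a, rank_a) in enumerate(candidates):
        for rw_b, rank_b in candidates[i + 1:]:
            if rank_a < rank_b:
                pairs.append((rw_a, rw_b))   # a preferred over b
            elif rank_b < rank_a:
                pairs.append((rw_b, rw_a))   # b preferred over a
    return pairs

# Toy rewrite candidates for one training query: (rewrite, rank of target item).
cands = [("a 1960s french heist film", 3),
         ("old movie", 120),
         ("french new wave heist film with jazz score", 1)]
pairs = preference_pairs(cands)
```

Each resulting (chosen, rejected) pair would then feed the DPO objective used to fine-tune the LoRA adapters.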
runid2¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid2
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 2fba9b59fe54f462539263a41213442e
- Run description: This run was produced using a multi-stage pipeline combining prompt-tuned LLMs, dense and cross-encoder retrieval, and a Tree-of-Thoughts reasoning framework.
- Rewrite Generation via Prompt-Tuned LLaMA-3: We fine-tuned four adapters (using LoRA) on top of meta-llama/Llama-3.1-8B-Instruct, specialized in query rewriting for two domains:
- Movies: movies-dense (optimized for dense retrieval) and movies-cross (optimized for cross-encoder reranking)
- General domain (e.g., people, places, etc.): all-dense (for dense retrieval) and all-cross (for reranking)
Each adapter was trained using prompt tuning to transform vague or partial user queries into precise and informative rewrites. The target rewrites were selected based on their performance (i.e., best retrieval rank) from a pool of LLM-generated candidates. At inference time, each test query was classified as either "movie" or "other" (e.g., "person", "place", etc.) using a query classifier, and routed to the appropriate prompt and adapter. Two rewrites were generated per query: one using the -dense adapter and one using the -cross adapter. Both rewrites were stored and passed to the retrieval module.
- Retrieval with Tree-of-Thoughts (ToT) Framework: We employed a Tree-of-Thoughts architecture that combines LLM-based reasoning with dense retrieval and cross-encoder reranking:
a. Initial Retrieval:
- An embedding encoder (all-mpnet-base-v2) generated query embeddings.
- Dense similarity search was performed over either a general-purpose index or a movie-specific index, depending on the query classification.
b. Reranking:
- Candidates were reranked using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2), scoring each candidate against the original vague query and the rewrite from the *-cross adapter.
c. Tree Expansion:
- A greedy search explored the thought space: at each level, the LLM (LLaMA-3.1-8B with base weights) was prompted to produce new thoughts and rewrites, guided by a specialized multi-turn prompt. Each rewrite was embedded, searched via dense retrieval, and reranked again. The highest-scoring node was expanded until maximum depth or convergence.
This iterative process simulates how a human might refine their query through successive hypotheses. The resulting ranked list was built by aggregating the best reranked scores across all nodes.
- Final Output Construction: For each test query, we returned a ranked list of item IDs, ordered by reranker score.
runid3¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid3
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5: 464bec2e01351c314b0d572a43f1df5c
- Run description: This run was generated using a Dense + Cross-Encoder preference-based rewriting pipeline, in which a LLaMA 3.1-8B model was fine-tuned using Direct Preference Optimization (DPO) to generate reformulations aligned with the behavior of both dense and cross-encoder retrievers.
- Rewrite Preference Modeling via DPO: We began by creating a pool of candidate rewrites for each training query using a pretrained LLM. These rewrites were evaluated using two retrieval modules:
- Dense retriever: MPNet embeddings over the corpus.
- Cross-encoder reranker: cross-encoder/ms-marco-MiniLM-L-12-v2.
For each query, pairwise preferences were derived by comparing rewrites based on their downstream retrieval performance (e.g., higher-ranked retrieved results). We applied a threshold over the NDCG difference or rank delta to filter consistent preference pairs. These preference pairs were then used to train a DPO objective on the base model meta-llama/Llama-3.1-8B-Instruct, resulting in several LoRA adapters specialized for rewrite generation in different domains.
- Prompt-Guided Inference for Rewrites: At inference time, for each test query:
- A domain classifier (movie vs. other) routed the input to the appropriate set of adapters and prompts.
- The LLaMA model loaded the selected LoRA adapter and generated multiple rewrites: three using different dense-focused adapters and prompts, and one cross-encoder-aligned rewrite using the cross adapter.
- Each rewrite was generated with a specialized prompt instructing the model to be precise, short, and factual (for reranking) and to preserve user-specified context and details.
- Retrieval via Tree-of-Thoughts Framework: The output rewrites were passed into a Tree-of-Thoughts (ToT) search module:
- Dense embedding retrieval using MPNet
- Cross-encoder reranking.
- Greedy tree expansion with LLM-generated thoughts and rewrites
- Node evaluation based on reranker score
- Final aggregation of results based on score
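The preference pairs this run trains DPO on are derived by comparing rewrites on downstream retrieval performance with a threshold filter. A rough sketch of the rank-delta variant follows; the function name and the threshold value of 5 are illustrative, not taken from the submission.

```python
# Build (chosen, rejected) DPO pairs from the rank each rewrite achieved
# for the gold document (lower rank = better retrieval).

def build_preference_pairs(rewrites_with_ranks, min_rank_delta=5):
    """rewrites_with_ranks: list of (rewrite_text, gold_doc_rank).
    A pair (a, b) is kept only when a beats b by at least min_rank_delta
    positions, filtering out weak or inconsistent preferences."""
    pairs = []
    for a, rank_a in rewrites_with_ranks:
        for b, rank_b in rewrites_with_ranks:
            if rank_b - rank_a >= min_rank_delta:
                pairs.append((a, b))  # a is "chosen", b is "rejected"
    return pairs
```

Each resulting pair feeds the DPO loss as a (chosen, rejected) example for the corresponding query.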
runid4¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: runid4
- Participant: ufmg
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-10
- Task: trec2025-tot-main
- MD5:
1804c2742ca1f2a10e5b4fb82ecd2039 - Run description: This run was generated using a model to classify the relevance of the top-1 retrieved item from each of the three previous runs. For each query, we applied a relevance classifier built on gpt-5-nano to determine whether the top-1 result from each run is semantically aligned with the user's query. We then selected the result whose top-1 was judged most relevant.
Previous Runs Summary:
- Prompt Tuning – Domain Split: This run used prompt-tuned LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct, with separate models for movie vs. general queries. Each query was first classified by a query-type classifier (movie vs. other), then routed to specialized adapters. Rewrites were generated using domain-specific prompts and evaluated in a Tree-of-Thoughts retrieval framework with dense + cross reranking.
- DPO – Movie vs. General Split: This run used a set of LoRA adapters trained via Direct Preference Optimization (DPO) to align rewrites with dense or cross retriever preferences. The query was first classified, and different adapters were used for the movie vs. general domain. Rewrites were scored based on how well they matched previously learned preferences and used for retrieval via the same ToT framework.
- DPO – General Model (No Classification): This run removed domain-specific routing and used a general-purpose DPO model trained on all query types. A single dense-aligned and a single cross-aligned adapter were used across all queries. Rewrites were generated with consistent prompts and evaluated using the same dense + reranking pipeline without any per-query logic or adaptation.
GPT-5-nano Relevance Scoring: To combine these systems, we used a lightweight GPT-5 classifier to assess the semantic relevance of the top-1 document retrieved by each run for every query. The model received the query and the top-1 result text and was asked to classify the relevance. The classifier produced a three-way decision for each run's top result:
- 2 = Relevant
- 1 = Maybe
- 0 = Not relevant
For each query, we selected the run with the highest relevance score. If two or more runs tied, we applied a fixed priority order: General DPO → Movies DPO → Prompt Tuning.
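The tie-broken selection rule reduces to a few lines. In this sketch, only the 0/1/2 scores and the priority order come from the description above; the run keys and dictionary interface are hypothetical.

```python
# Pick the run whose top-1 result got the highest classifier score
# (2=Relevant, 1=Maybe, 0=Not relevant); ties go to the earlier entry
# in the fixed priority order.

PRIORITY = ["general_dpo", "movies_dpo", "prompt_tuning"]

def select_run(relevance_scores):
    """relevance_scores: dict mapping run name -> classifier score (0/1/2)."""
    best = max(relevance_scores.values())
    for run in PRIORITY:              # earliest priority run with the best score wins
        if relevance_scores.get(run) == best:
            return run
```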
scrb-tot-01¶
Participants | Proceedings | Input | Appendix
- Run ID: scrb-tot-01
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-tot-main
- MD5:
41ceb6c18ad3610283cd66272f956c2b - Run description: A pipeline composed of Dense Retriever, Reranker, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
Listwise Reranker using DeepSeek-V3: reranks the top-20 results from the reranker.
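The cascade above (cue-based dense retrieval, pointwise rerank of the top 2000, listwise LLM rerank of the top 20) can be sketched as a simple orchestration function; all callables are hypothetical stand-ins for the Qwen3-Embedding-8B retriever, the fine-tuned Qwen3-Reranker-8B, and the DeepSeek-V3 listwise reranker.

```python
# Three-stage cascade sketch: each stage narrows the candidate set while
# later, more expensive models only touch the head of the ranking.

def cascade(cues, dense_retrieve, rerank, llm_rerank):
    candidates = dense_retrieve(cues)           # domain-specific dense index
    reranked = rerank(cues, candidates[:2000])  # pointwise rerank of top 2000
    top20 = llm_rerank(cues, reranked[:20])     # listwise LLM rerank of top 20
    return top20 + reranked[20:]                # tail keeps the reranker order
```

The design point is cost: the listwise LLM sees only 20 documents per query, so its price stays bounded regardless of index size.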
scrb-tot-02¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-02
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-25
- Task: trec2025-tot-main
- MD5:
95ea84f74abcee61e0f8a06edffd19db - Run description: A pipeline composed of Dense Retriever, Reranker, LLM Retriever and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: replace the top 11–20 docs from the reranker with the docs retrieved by the LLM Retriever, and then rerank the top-20 results (deduplicated).
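The merge step above (replace ranks 11–20 with the LLM Retriever's docs, then deduplicate before the listwise rerank) might look like this order-preserving sketch; the function name is illustrative.

```python
# Swap ranks 11-20 for LLM-retrieved docs, keeping ranks 1-10 intact,
# then deduplicate while preserving order. The result is the candidate
# list handed to the listwise LLM reranker.

def merge_for_listwise(reranked, llm_docs):
    merged = reranked[:10] + llm_docs[:10]
    seen, deduped = set(), []
    for doc in merged:
        if doc not in seen:   # first occurrence wins, so reranker head is kept
            seen.add(doc)
            deduped.append(doc)
    return deduped
```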
scrb-tot-03¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-03
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-08-31
- Task: trec2025-tot-main
- MD5:
5f45c7d3be9369f22ecb29c37ab8684f - Run description: A pipeline composed of Dense Retriever, Reranker, LLM Retriever, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset. Reranks the top 2000 results from the retriever.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: We design a three-stage ranking pipeline. First, the LLM retrieval results are inserted into the candidate list starting from rank 6, while ranks 1–5 are preserved from the baseline ranking. Second, we apply DeepSeek-V3 in a listwise ranking setting to reorder candidates from rank 2 through rank 10. Third, from the resulting ranking, we select the top four titles and conduct a fine-grained reranking using GPT-5 with the analyze-ranking strategy. The final output is obtained from this refined ranking.
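The three-stage pipeline can be sketched as list surgery over 0-indexed Python lists; `listwise_rerank` and `fine_rerank` are hypothetical stand-ins for the DeepSeek-V3 and GPT-5 calls, and the sketch ignores scores and ties.

```python
# Three-stage merge sketch:
#   stage 1: insert LLM-retrieved docs starting at rank 6 (ranks 1-5 kept)
#   stage 2: listwise-reorder ranks 2-10, leaving rank 1 fixed
#   stage 3: fine-grained rerank of the resulting top 4

def three_stage(baseline, llm_docs, listwise_rerank, fine_rerank):
    merged = baseline[:5] + llm_docs + baseline[5:]     # stage 1
    head = [merged[0]] + listwise_rerank(merged[1:10])  # stage 2 (ranks 2-10)
    refined = fine_rerank(head[:4]) + head[4:]          # stage 3 (top 4)
    return refined + merged[10:]                        # untouched tail
```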
scrb-tot-04¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: scrb-tot-04
- Participant: SRCB
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-08
- Task: trec2025-tot-main
- MD5:
97e4de9fc097ba1165d5a3d69d7c28c0 - Run description: A pipeline composed of Dense Retriever, Reranker, and LLM Reranker
Query processing: all queries are converted to a list of cues by DeepSeek-V3
Dense Retriever based on Qwen/Qwen3-Embedding-8B:
- For the movie domain: fine-tuned on movie data (augmented data based on train, dev1, and 5000 samples from the tomt-kis dataset), creating an index for 500k+ movie docs filtered by Wikidata properties
- For other domains: use the original Qwen3-Embedding-8B to create the index for all docs
Reranker: reranks the top 2000 results from the retriever.
- For the movie domain: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, and 300 samples from the tomt-kis dataset.
- For other domains: fine-tuned Qwen3-Reranker-8B on augmented data based on train, dev1, dev2, samples from the tomt-kis dataset, and 1766 synthetic examples created by glm-4-plus. The queries of the synthetic data were created from the top-5 docs retrieved by our baseline system.
LLM Retriever: use DeepSeek-R1 to retrieve up to 10 Wikipedia entities and align them with doc IDs in the corpus.
Listwise Reranker using DeepSeek-V3: We design a three-stage ranking pipeline for the movie domain only. First, the LLM retrieval results are inserted into the candidate list starting from rank 6, while ranks 1–5 are preserved from the baseline ranking. Second, we apply DeepSeek-V3 in a listwise ranking setting to reorder candidates from rank 2 through rank 10. Third, from the resulting ranking, we select the top four titles and conduct a fine-grained reranking using GPT-5 with the analyze-ranking strategy. The final output is obtained from this refined ranking. For other domains, we replace the top 11–20 docs from the reranker with the docs retrieved by the LLM Retriever, and then rerank the top-20 results (deduplicated).
top_model_dense¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: top_model_dense
- Participant: DS@GT
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-11
- Task: trec2025-tot-main
- MD5:
5a60d1e381c8b1a40bce83e1e240494e - Run description: First-stage dense retrieval with topic modelling.
webis-bm25-gpt-oss¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: webis-bm25-gpt-oss
- Participant: webis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5:
c8985a0b162f2f7d568dabb1f0a4181f - Run description: We used openai-gpt-oss-120b with 5 long-query reduction prompts that we developed in the previous year. The LLM received the original query as input and was prompted to remove words that do not help with retrieval. We ran the five reduced queries against PyTerrier with BM25 in its default configuration and used reciprocal rank fusion, as implemented in ranx, to fuse the 5 runs.
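The fusion step reduces to reciprocal rank fusion over the five BM25 runs. The run uses the ranx implementation; the self-contained sketch below assumes the common k=60 constant, which is not stated in the description.

```python
# Minimal reciprocal rank fusion: each document scores 1/(k + rank) in each
# run it appears in, and the sums determine the fused ordering.

def rrf(runs, k=60):
    """runs: list of ranked doc-id lists (one per reduction prompt)."""
    scores = {}
    for run in runs:
        for rank, doc in enumerate(run, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked highly by several prompt variants rise to the top, which is the point of fusing five differently-reduced queries.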
webis-bm25-llama¶
Participants | Proceedings | Input | trec_eval | Appendix
- Run ID: webis-bm25-llama
- Participant: webis
- Track: Tip of the Tongue (TOT)
- Year: 2025
- Submission: 2025-09-09
- Task: trec2025-tot-main
- MD5:
2280f49721dcb7526a4d2fb540854d37 - Run description: We used llama-3-3-70b-versatile with 5 long-query reduction prompts that we developed in the previous year. The LLM received the original query as input and was prompted to remove words that do not help with retrieval. We ran the five reduced queries against PyTerrier with BM25 in its default configuration and used reciprocal rank fusion, as implemented in ranx, to fuse the 5 runs.