
Proceedings - Tip of the Tongue (TOT) 2025

OVERVIEW OF THE TREC 2025 TIP-OF-THE-TONGUE TRACK

Jaime Arguello, Fernando Diaz, Maik Fröbe, To Eun Kim, Bhaskar Mitra

Abstract

Tip-of-the-tongue (ToT) known-item retrieval involves re-finding an item for which the searcher does not reliably recall an identifier. ToT information requests (or queries) are verbose and tend to include several complex phenomena, making them especially difficult for existing information retrieval systems. The TREC 2025 ToT track focused on a single ad-hoc retrieval task. This year, we extended the track to the general domain and incorporated test queries from diverse sources, namely the MS-ToT dataset, manual topic development, and LLM-based synthetic query generation. This year, 9 groups (including the track coordinators) submitted 32 runs.

Bibtex
@inproceedings{coordinators-trec2025-papers-proc-2,
    title = {OVERVIEW OF THE TREC 2025 TIP-OF-THE-TONGUE TRACK},
    author = {Jaime Arguello and Fernando Diaz and Maik Fröbe and To Eun Kim and Bhaskar Mitra},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Webis at TREC 2025: Tip-of-the-Tongue Track and AutoJudge

Maik Fröbe, Jan Heinrich Merker, Eric Oliver Schmidt, Martin Potthast, Matthias Hagen

Abstract

This paper describes the Webis Group’s participation in the 2025 edition of TREC. We participated in the Tip-of-the-Tongue track and the pilot round of the AutoJudge track. For our participation in the Tip-of-the-Tongue track, we re-executed the query relaxation strategies that we developed in our submissions from previous years (removing terms that likely reduce retrieval effectiveness). For the pilot round of the AutoJudge track, we applied axiomatic thinking, using preferences and features from all 29 axiomatic constraints for retrieval-augmented generation implemented in the ir_axioms package (evaluation is in progress).
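The paper details the actual relaxation strategies; as a rough illustration of the general idea only (systematic term-dropping is an assumption here, not a transcription of the Webis method), one way to enumerate relaxed query variants is:

```python
from itertools import combinations

def relaxation_candidates(query: str, max_drop: int = 2):
    """Generate relaxed variants of a query by dropping up to
    `max_drop` terms; the original query is kept as the first
    variant, and term order is preserved in each variant."""
    terms = query.split()
    variants = [query]
    for k in range(1, max_drop + 1):
        for kept in combinations(terms, len(terms) - k):
            variants.append(" ".join(kept))
    return variants
```

A downstream step would then keep whichever variant maximizes some retrieval-effectiveness estimate, which is where the "terms that likely reduce retrieval effectiveness" judgment comes in.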

Bibtex
@inproceedings{webis-trec2025-papers-proc-1,
    title = {Webis at TREC 2025: Tip-of-the-Tongue Track and AutoJudge},
    author = {Maik Fröbe and Jan Heinrich Merker and Eric Oliver Schmidt and Martin Potthast and Matthias Hagen},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks

Hongyu Li, Yuming Zhang, Junyu Zhou, Yongwei Zhang, Shanshan Jiang, Bin Dong

Abstract

This paper reports the performance of SRCB’s systems in the Million LLMs and Tip-of-the-Tongue tracks. For the Million LLMs track, we mainly rely on powerful LLMs and various methods to construct the missing training labels. We then use the constructed training data to devise several approaches targeting the ranking objective and conduct experiments with them. For the Tip-of-the-Tongue task, we propose a retrieval framework that integrates dense and LLM-based components. Original queries are transformed into cue lists, and additional data are used to fine-tune both the dense retriever and the re-ranker. Furthermore, retrieval results from LLMs are incorporated to supplement the reranker, and the final ranking is produced using an LLM reranker.

Bibtex
@inproceedings{SRCB-trec2025-papers-proc-1,
    title = {SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks},
    author = {Hongyu Li and Yuming Zhang and Junyu Zhou and Yongwei Zhang and Shanshan Jiang and Bin Dong},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Hedge Removal, Query Drift, and the Simulation Gap in Tip-of-the-Tongue Retrieval

Bruno N. Sotic, Jaap Kamps

Abstract

Retrieving known items from vague, verbose queries (the “Tip-of-the-Tongue” or ToT problem) poses a unique challenge for information retrieval. In the TREC 2025 ToT Track, we investigated linguistic preprocessing strategies (such as hedge removal and negation penalties) and hybrid retrieval methods across simulated and human-generated queries. Our experiments reveal a substantial divergence between LLM-simulated development data and the official test set. Hedge removal yielded large gains on verbose synthetic queries (+25.4% NDCG on dev3), but minimal improvement on sparse human queries (+2.7% on test, not significant). Negation penalties produced no measurable effect across all conditions. Pseudo-Relevance Feedback (RM3) consistently degraded performance, amplifying query drift rather than resolving vocabulary mismatch. Analysis of the recall pool reveals a fundamental bottleneck: only 32% of target documents appeared among the top 100 BM25 results on the test set. Within this constraint, hybrid dense reranking improved early precision by 10.6% on hard queries where lexical matching failed, but degraded performance on queries with strong term overlap. We conclude that systems optimized on LLM-simulated ToT data risk overfitting to synthetic linguistic patterns that do not reflect the sparse, fragmented nature of real human memory retrieval.
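The hedge-removal preprocessing the abstract evaluates can be pictured as a lexicon-driven filter. The phrase list below is a small, hypothetical stand-in (the paper's actual hedge lexicon is not reproduced here):

```python
import re

# Illustrative hedge phrases only; a real lexicon would be larger
# and likely derived from the ToT query data itself.
HEDGES = [
    r"\bI think\b", r"\bI believe\b", r"\bmaybe\b",
    r"\bpossibly\b", r"\bI'?m not sure\b", r"\bsome kind of\b",
]
HEDGE_RE = re.compile("|".join(HEDGES), flags=re.IGNORECASE)

def remove_hedges(query: str) -> str:
    """Strip hedge phrases and collapse the leftover whitespace."""
    cleaned = HEDGE_RE.sub(" ", query)
    return re.sub(r"\s+", " ", cleaned).strip()
```

The abstract's finding is that this kind of cleanup helps a lot on verbose synthetic queries but barely moves the needle on the sparser human-written test queries.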

Bibtex
@inproceedings{UAmsterdam-trec2025-papers-proc-2,
    title = {Hedge Removal, Query Drift, and the Simulation Gap in Tip-of-the-Tongue Retrieval},
    author = {Bruno N. Sotic and Jaap Kamps},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking

Wenxin Zhou, Ritesh Mehta, Anthony Miyaguchi

Abstract

We develop a two-stage retrieval system that combines multiple complementary retrieval methods with a learned reranker and LLM-based reranking, to address the TREC Tip-of-the-Tongue (ToT) task. In the first stage, we employ hybrid retrieval that merges LLM-based retrieval, sparse (BM25), and dense (BGE-M3) retrieval methods. We also introduce topic-aware multi-index dense retrieval that partitions the Wikipedia corpus into 24 topical domains. In the second stage, we evaluate both a trained LambdaMART reranker and LLM-based reranking. To support model training, we generate 5000 synthetic ToT queries using LLMs. Our best system achieves recall of 0.66 and NDCG@1000 of 0.41 on the test set by combining hybrid retrieval with Gemini-2.5-flash reranking, demonstrating the effectiveness of fusion retrieval.
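The hybrid first stage merges several ranked lists (LLM-based, BM25, BGE-M3). One standard way to fuse heterogeneous rankings without score calibration is reciprocal rank fusion; whether DS@GT used exactly this fusion rule is not stated in the abstract, so treat this as a generic sketch:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids: each document
    accumulates 1 / (k + rank) over every list it appears in,
    and documents are returned in descending fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The `k` constant damps the influence of any single list's top ranks; 60 is the value from the original RRF paper and is a common default.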

Bibtex
@inproceedings{DS_GT-trec2025-papers-proc-1,
    title = {DS@GT at TREC TOT 2025: Bridging Vague Recollection with Fusion Retrieval and Learned Reranking},
    author = {Wenxin Zhou and Ritesh Mehta and Anthony Miyaguchi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

UFMG at TREC 2025: Retriever-Aligned Query Rewriting for Tip-of-the-Tongue Retrieval

Arthur Pontes Nader, Rodrygo L. T. Santos

Abstract

Tip-of-the-Tongue queries are difficult to rewrite due to vague user descriptions and limited supervised training data. We address this by generating rewrite preference pairs automatically from dense and cross-encoder retrieval scores, enabling a reliable dataset for fine-tuning directly on ranker preferences. We compare prompt tuning, domain-specific DPO, and general DPO models within a Tree-of-Thoughts rewriting and retrieval pipeline. Results on the TREC Tip-of-the-Tongue track show steady gains from prompt tuning to DPO, with a GPT-5-nano ensemble of all runs achieving the best performance among our submissions (NDCG@1000 = 0.277, MRR@1000 = 0.199).
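Turning retrieval scores into DPO training data amounts to pairing up rewrites of the same query by their ranker-assigned quality. A minimal sketch of that pairing step (the `margin` filter is an assumption for illustration, not a parameter taken from the paper):

```python
def preference_pairs(scored_rewrites, margin=0.05):
    """Turn (rewrite, retrieval_score) tuples for one query into
    DPO-style (chosen, rejected) pairs, keeping only pairs whose
    score gap exceeds `margin` so near-ties are not used as
    training signal."""
    ranked = sorted(scored_rewrites, key=lambda x: x[1], reverse=True)
    pairs = []
    for i, (chosen, cs) in enumerate(ranked):
        for rejected, rs in ranked[i + 1:]:
            if cs - rs > margin:
                pairs.append((chosen, rejected))
    return pairs
```

Each resulting pair can feed a standard DPO loss, aligning the rewriter directly with what the downstream dense and cross-encoder rankers reward.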

Bibtex
@inproceedings{ufmg-trec2025-papers-proc-1,
    title = {UFMG at TREC 2025: Retriever-Aligned Query Rewriting for Tip-of-the-Tongue Retrieval},
    author = {Arthur Pontes Nader and Rodrygo L. T. Santos},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval

Debayan Mukhopadhyay, Utshab Kumar Ghosh, Shubham Chatterjee

Abstract

Retrieving known items from vague, partial, or inaccurate descriptions, a phenomenon known as Tip-of-the-Tongue (ToT) retrieval, remains a significant challenge for modern information retrieval systems. Our approach integrates a single call to an 8B-parameter Large Language Model (LLM) with a model-agnostic prompt for query reformulation and controlled expansion, bridging the gap between ill-formed ToT queries and well-specified information needs. This targets scenarios where Pseudo-Relevance Feedback (PRF) based expansion is rendered ineffective: poor first-stage recall and ranking on the raw queries leads to expansion from the wrong documents. Importantly, the LLM employed in our framework was deliberately kept generic: it was neither fine-tuned for ToT queries nor adapted to any specific content domain (e.g., movies, books, landmarks). This design choice underscores that the observed gains stem from the proposed prompting and expansion strategy itself, rather than from task- or domain-specific specialization of the underlying model. Rewritten queries are then processed by a multi-stage retrieval pipeline consisting of an initial sparse retrieval stage (BM25), followed by an ensemble of bi-encoder and late-interaction re-rankers (Contriever, E5-large-v2, and ColBERTv2), cross-encoder re-ranking using monoT5, and a final list-wise re-ranking stage powered by a 72B-parameter LLM (Qwen 2.5 Instruct, 4-bit quantized). Experiments on the datasets provided in the TREC 2025 ToT track show that supplying original user queries directly to this otherwise competitive multi-stage ranking pipeline still yields poor retrieval effectiveness, highlighting the central role of query formulation in ToT scenarios.
In contrast, our lightweight LLM-based pre-retrieval query transformation improves initial recall by a net 20.61%, and subsequent re-ranking with the rewritten queries improves nDCG@10 by 33.88%, MRR by 29.92%, and MAP@10 by 29.98% over the same pipeline operating on raw queries. This indicates that the transformation is a highly cost-effective intervention, unlocking substantial performance gains and enabling downstream retrievers and rankers to realize their full potential in Tip-of-the-Tongue retrieval.
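Since the whole method hinges on one reformulation call, the interesting artifact is the prompt itself. The wording below is a hypothetical stand-in assembled from the abstract's description (keep concrete clues, drop hedges, expand in a controlled way), not the paper's actual prompt:

```python
def build_reformulation_prompt(tot_query: str) -> str:
    """Assemble a single-turn, model-agnostic reformulation prompt.
    The instruction text is illustrative only; the paper's prompt
    is not reproduced here."""
    return (
        "You are helping with known-item retrieval. The user "
        "half-remembers an item and describes it vaguely.\n"
        "Rewrite the description as one concise search query: keep "
        "every concrete clue (names, dates, places, genres), drop "
        "hedges and uncertain guesses, and add at most three likely "
        "synonyms for key terms.\n\n"
        f"Description: {tot_query}\n"
        "Query:"
    )
```

The same string would be sent unchanged to any instruction-following model, which is what makes the setup model-agnostic; everything after this call is conventional multi-stage ranking.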

Bibtex
@inproceedings{mst-trec2025-papers-proc-1,
    title = {Single-Turn LLM Reformulation Powered Multi-Stage Hybrid Re-Ranking for Tip-of-the-Tongue Known-Item Retrieval},
    author = {Debayan Mukhopadhyay and Utshab Kumar Ghosh and Shubham Chatterjee},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Bridging Lexical and Neural Ranking for Topic-Oriented Retrieval

Georgios Arampatzis, Konstantina Safouri, Avi Arampatzis

Abstract

This paper presents the DUTH team’s participation in the TREC Tip-of-the-Tongue (TREC-TOT) 2025 shared task. Although we explored a hybrid retrieval pipeline combining BM25 with Sentence-BERT dense embeddings and a MiniLM-based cross-encoder during development, our official submitted runs rely exclusively on unsupervised lexical retrieval models implemented in the Terrier and PyTerrier frameworks. The submitted systems integrate multiple probabilistic models—including BM25, Divergence From Randomness variants, and query-likelihood language models—with RM3 pseudo-relevance feedback and Reciprocal Rank Fusion. This multi-stage lexical architecture aims to maximize early precision and robust recall for underspecified Tip-of-the-Tongue queries. Experiments on the official TREC-TOT 2025 development split show that the fused lexical pipelines achieve strong performance across early ranking and recall-based metrics, highlighting the competitiveness of carefully tuned lexical ensembles for memory-based retrieval. Hybrid dense re-ranking demonstrated improvements during development but was not part of the official submission.
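All of the submitted lexical models build on term-frequency and document-frequency statistics; BM25 is the most familiar of them. A self-contained scorer over a tiny in-memory corpus, as a generic reference (parameter defaults k1=1.2, b=0.75 are common conventions, not necessarily the tuned values from the DUTH runs):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one document (a token list) against a query with
    classic BM25, computing idf and average document length from
    the given corpus of token lists."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue  # term unseen in corpus contributes nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        denom = tf[t] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[t] * (k1 + 1) / denom
    return score
```

Divergence From Randomness and query-likelihood models plug different weighting formulas into the same tf/df skeleton, which is why fusing them (e.g. via Reciprocal Rank Fusion, as in the submitted runs) tends to be cheap and robust.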

Bibtex
@inproceedings{DUTH-trec2025-papers-proc-4,
    title = {Bridging Lexical and Neural Ranking for Topic-Oriented Retrieval},
    author = {Georgios Arampatzis and Konstantina Safouri and Avi Arampatzis},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}