Proceedings - Million LLMs Track (MLLM) 2025

OVERVIEW OF THE TREC 2025 MILLION LARGE LANGUAGE MODELS TRACK

Evangelos Kanoulas, Panagiotis Eustratiadis, Jamie Callan, Mark Sanderson, Yongkang Li, Jingfen Qiao, Gabrielle Poerwawinata, Vaishali Pal

Abstract

Agentic AI envisions ecosystems of intelligent agents collaboratively solving complex tasks with minimal human intervention. In such ecosystems, each agent possesses specialized expertise, making effective expert selection central to overall system performance. While most current approaches assume a small number of well-documented models, real-world expertise is far more diverse and cannot be adequately captured through static metadata or hand-written descriptions. We anticipate a future with millions of specialized large language models (LLMs), each excelling in different domains or problem types. Rather than relying on predefined capability statements, we propose a retrieval-based paradigm in which an assistant agent infers expertise dynamically by examining models’ observable behavior. Upon receiving a user query, the assistant ranks candidate LLMs based on demonstrated competence, enabling efficient and adaptive expert selection. The TREC Million LLM Track operationalizes this paradigm by shifting the retrieval target from documents to expert LLMs. Participants are given a discovery set consisting of queries, answers, and log-probabilities from more than one thousand LLMs, and are challenged to infer meaningful expertise representations for each model. Given an unseen test query, systems must then rank the LLMs according to their expected performance, providing the first large-scale benchmark for expertise retrieval in agentic AI.
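The paradigm above can be illustrated with a toy scorer: each model's expertise profile is built from the discovery queries it answered confidently, and candidates are ranked by similarity to the test query. The bag-of-words representation, confidence threshold, and all names below are illustrative assumptions, not the track's baseline.

```python
from collections import Counter
import math


def profile(records, conf_threshold=-1.0):
    """Bag-of-words expertise profile from the discovery queries a model
    answered confidently (mean log-probability above a threshold)."""
    bag = Counter()
    for query, mean_logprob in records:
        if mean_logprob >= conf_threshold:
            bag.update(query.lower().split())
    return bag


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_llms(test_query, discovery):
    """discovery maps llm_id -> [(query, mean_logprob), ...].
    Returns llm_ids sorted by similarity of their profile to the query."""
    q = Counter(test_query.lower().split())
    scores = {m: cosine(q, profile(recs)) for m, recs in discovery.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In practice, participant systems replace the term-overlap profile with learned representations, but the interface is the same: observed behavior in, a ranking of candidate LLMs out.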

Bibtex
@inproceedings{coordinators-trec2025-papers-proc-8,
    title = {OVERVIEW OF THE TREC 2025 MILLION LARGE LANGUAGE MODELS TRACK},
    author = {Evangelos Kanoulas and Panagiotis Eustratiadis and Jamie Callan and Mark Sanderson and Yongkang Li and Jingfen Qiao and Gabrielle Poerwawinata and Vaishali Pal},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking

Gabrielle Poerwawinata, Jingfen Qiao

Abstract

The Million LLMs Track (TREC Million LLMs) focuses on methods for ranking large language models (LLMs) based on their expected ability to answer a given query. As tasks increasingly require a combination of both general-purpose and domain-specific models, it is vital to predict which LLM is best suited for a given query without needing to query each model directly. We propose a supervised learning-to-rank approach that exploits next-token log probabilities from pre-generated responses as zero-cost pseudo-relevance signals. For each (query, LLM) pair, we derive a soft relevance label from the mean next-token log probability of the model’s prior response, treating the model’s confidence as a proxy for answer quality. A LightGBM-based LambdaRank model is trained on feature vectors combining query embeddings (Sentence-BERT), categorical LLM identifiers, and global token-level statistics filtered at the top percentile. On the TREC Million LLMs test set, our best configuration achieves NDCG@10 of 0.3695, substantially outperforming tag-based (0.013) and response-based (0.195) baselines. Our ablation analysis shows that LLM-level statistics contribute more to ranking quality than query-specific embeddings, suggesting that global model capability is a dominant signal in the current evaluation setting.
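The soft-label construction can be sketched in a few lines: the relevance signal for a (query, LLM) pair is the mean of the response's next-token log-probabilities, optionally restricted to the highest-probability fraction of tokens. This is a stdlib sketch of the labeling idea only; the paper's actual pipeline (Sentence-BERT features, LightGBM LambdaRank training) is not reproduced here, and the function name and parameter are hypothetical.

```python
def soft_relevance(token_logprobs, keep_top_fraction=1.0):
    """Mean next-token log-probability of a pre-generated response,
    optionally computed over only the top fraction of tokens, used as
    a zero-cost pseudo-relevance label for a (query, LLM) pair."""
    if not token_logprobs:
        return float("-inf")
    kept = sorted(token_logprobs, reverse=True)  # most confident first
    n = max(1, int(len(kept) * keep_top_fraction))
    kept = kept[:n]
    return sum(kept) / len(kept)
```

A more confident response (log-probabilities closer to zero) yields a higher label, which is what lets the LambdaRank objective order models without querying them at test time.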

Bibtex
@inproceedings{uvairlab-trec2025-papers-proc-1,
    title = {Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking},
    author = {Gabrielle Poerwawinata and Jingfen Qiao},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

How to Choose the Right LLM? Exploring Methods for the Million LLMs Track

Catalina Riano, Hui Fang

Abstract

This report explains our submission to the Million LLMs Track and outlines the methods we implemented for the task. Our approaches primarily adapt well-established techniques, tailoring them to the targets of the track and the characteristics of the provided data. In this document, we describe the different strategies implemented and their details, while also providing context on the formulation, datasets, and evaluation of the task. Although official results are not yet available, the methods explored here aim to further our understanding of techniques for selecting the most appropriate LLM for a given query.

Bibtex
@inproceedings{UDInfo-trec2025-papers-proc-1,
    title = {How to Choose the Right LLM? Exploring Methods for the Million LLMs Track},
    author = {Catalina Riano and Hui Fang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks

Hongyu Li, Yuming Zhang, Junyu Zhou, Yongwei Zhang, Shanshan Jiang, Bin Dong

Abstract

This paper reports the performance of SRCB’s system in the Million LLMs and Tip-of-the-Tongue tracks. For the Million LLMs Track, we rely on powerful LLMs and a range of methods to construct the missing training labels; we then use the constructed training data to devise several ranking approaches and conduct experiments. For the Tip-of-the-Tongue task, we propose a retrieval framework that integrates dense and LLM-based components. Original queries are transformed into cue lists, and additional data are used to fine-tune both the dense retriever and the re-ranker. Furthermore, retrieval results from LLMs are incorporated to supplement the reranker’s candidate pool, and the final ranking is produced using an LLM reranker.

Bibtex
@inproceedings{SRCB-trec2025-papers-proc-1,
    title = {SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks},
    author = {Hongyu Li and Yuming Zhang and Junyu Zhou and Yongwei Zhang and Shanshan Jiang and Bin Dong},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track

Reyyan Yeniterzi, Suveyda Yeniterzi

Abstract

We address the Million LLM ranking task by formulating it as an expert retrieval problem and adapting established Information Retrieval techniques to estimate the expertise of large language models. Our approach centers on two families of methods: a profile-based strategy that aggregates all query–response pairs from each LLM into a unified representation, and document-based strategies that operate either at the response level or at the query level. Before applying these models, we introduce a two-stage data filtering pipeline to remove uninformative and low-confidence responses, yielding a cleaner signal for expertise estimation. Experimental results on the development set show that response-based aggregation provides the most fine-grained and reliable ranking of LLMs, outperforming both the profile-based and query-based variants. Guided by these findings, we prepared five submissions combining different retrieval, filtering, and aggregation configurations, including a re-ranking variant using naver-splade-v3. Our study demonstrates that classical expert retrieval methods, when adapted appropriately, can effectively model and rank LLM expertise.
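The contrast between the two families of methods can be sketched as follows: a profile-based scorer matches the query against one concatenated pseudo-document per LLM, while a response-level scorer matches each response individually and then aggregates (here by taking the maximum). The term-overlap scorer and data layout are illustrative assumptions, not the paper's actual retrieval models.

```python
def overlap(query, doc):
    """Toy retrieval score: number of terms shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def profile_score(query, responses):
    """Profile-based: fold all of an LLM's responses into one
    pseudo-document and score the query against it once."""
    return overlap(query, " ".join(responses))


def response_score(query, responses, agg=max):
    """Document-based (response level): score each response separately,
    then aggregate the per-response scores."""
    return agg(overlap(query, r) for r in responses) if responses else 0
```

The aggregation choice matters: the profile score rewards broad coverage across all of a model's responses, while the max-aggregated response score rewards a single strong match, which is one intuition for why response-level aggregation can yield a more fine-grained ranking.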

Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-3,
    title = {Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track},
    author = {Reyyan Yeniterzi and Suveyda Yeniterzi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}