Proceedings - Million LLMs Track (MLLM) 2025

Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking

Gabrielle Poerwawinata, Jingfen Qiao

Abstract

The Million LLMs Track (TREC Million LLMs) focuses on methods for ranking large language models (LLMs) based on their expected ability to answer a given query. As tasks increasingly require a combination of both general-purpose and domain-specific models, it is vital to predict which LLM is best suited for a given query without needing to query each model directly. We propose a supervised learning-to-rank approach that exploits next-token log probabilities from pre-generated responses as zero-cost pseudo-relevance signals. For each (query, LLM) pair, we derive a soft relevance label from the mean next-token log probability of the model’s prior response, treating the model’s confidence as a proxy for answer quality. A LightGBM-based LambdaRank model is trained on feature vectors combining query embeddings (Sentence-BERT), categorical LLM identifiers, and global token-level statistics filtered at the top percentile. On the TREC Million LLMs test set, our best configuration achieves NDCG@10 of 0.3695, substantially outperforming tag-based (0.013) and response-based (0.195) baselines. Our ablation analysis shows that LLM-level statistics contribute more to ranking quality than query-specific embeddings, suggesting that global model capability is a dominant signal in the current evaluation setting.
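The soft-label idea in the abstract can be sketched directly: the mean next-token log probability of a model's stored response serves as a pseudo-relevance score for the (query, LLM) pair, and ranking by that score needs no new model calls. This is a minimal illustration of the signal only, not the authors' LightGBM pipeline; all names and data below are illustrative.

```python
# Soft relevance from pre-generated responses: a model whose token-level
# log probabilities are closer to zero was more confident in its answer.

def soft_relevance(token_logprobs):
    """Mean next-token log probability of a pre-generated response."""
    return sum(token_logprobs) / len(token_logprobs)

def rank_llms(responses):
    """responses: {llm_id: [token log probs of that LLM's response]}.
    Returns LLM ids ordered by descending confidence."""
    scored = {llm: soft_relevance(lp) for llm, lp in responses.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy example: llm-B's log probabilities are closer to zero, so it ranks first.
responses = {
    "llm-A": [-2.3, -1.9, -2.1],
    "llm-B": [-0.4, -0.7, -0.5],
}
print(rank_llms(responses))  # ['llm-B', 'llm-A']
```

In the paper these scores are used as training labels for a LambdaRank model rather than as the final ranking itself.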

Bibtex
@inproceedings{uvairlab-trec2025-papers-proc-1,
    title = {Discovering Expert LLMs via Next-Token Log Probabilities and Supervised Ranking},
    author = {Gabrielle Poerwawinata and Jingfen Qiao},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

How to Choose the Right LLM? Exploring Methods for the Million LLMs Track

Catalina Riano, Hui Fang

Abstract

The Million LLMs Track, part of TREC 2025, addresses the challenge of determining which LLMs are best suited to answer a specific user query. The main goal of the track is to evaluate the ability of a system to predict LLM expertise: given a query and a set of LLM IDs, the system must rank the models by how likely they are to provide a high-quality answer. Unlike traditional IR settings, this ranking must be produced without querying the models at test time. Instead, participants must rely on precomputed discovery data, including LLM responses, metadata, and development labels, to infer each model’s strengths and capabilities. This report describes our submission to the Million LLMs Track and outlines the methods we implemented for the task.

Bibtex
@inproceedings{UDInfo-trec2025-papers-proc-1,
    title = {How to Choose the Right LLM? Exploring Methods for the Million LLMs Track},
    author = {Catalina Riano and Hui Fang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks

Hongyu Li, Yuming Zhang, Junyu Zhou, Yongwei Zhang, Shanshan Jiang, Bin Dong

Abstract

This paper reports the performance of SRCB’s systems in the Million LLMs and Tip-of-the-Tongue tracks. For the Million LLMs Track, we rely primarily on powerful LLMs, together with several labeling strategies, to construct the missing training labels. We then use the constructed training data to devise several ranking approaches and evaluate them experimentally. For the Tip-of-the-Tongue task, we propose a retrieval framework that integrates dense and LLM-based components. Original queries are transformed into cue lists, and additional data are used to fine-tune both the dense retriever and the reranker. Retrieval results from LLMs are further incorporated to supplement the reranker, and the final ranking is produced by an LLM reranker.
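The abstract does not say how the LLM-suggested results are merged with the dense retriever's output before the final reranking pass; reciprocal rank fusion (RRF) is one common way to combine two ranked lists and is shown here purely as an illustrative stand-in, not as the authors' actual fusion method. All names below are hypothetical.

```python
# Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
# so documents ranked highly by either source rise in the fused list.

def rrf_fuse(rankings, k=60):
    """rankings: list of ranked doc-id lists. Returns the fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["d1", "d2", "d3"]   # from the fine-tuned dense retriever
llm_hits = ["d3", "d1", "d4"]     # candidates suggested by an LLM
print(rrf_fuse([dense_hits, llm_hits]))  # ['d1', 'd3', 'd2', 'd4']
```

A fused list like this would then be passed to the LLM reranker for the final ordering.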

Bibtex
@inproceedings{SRCB-trec2025-papers-proc-1,
    title = {SRCB at TREC 2025: Million LLMs and Tip-of-the-Tongue Tracks},
    author = {Hongyu Li and Yuming Zhang and Junyu Zhou and Yongwei Zhang and Shanshan Jiang and Bin Dong},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track

Reyyan Yeniterzi, Suveyda Yeniterzi

Abstract

We address the Million LLM ranking task by formulating it as an expert retrieval problem and adapting established Information Retrieval techniques to estimate the expertise of large language models. Our approach centers on two families of methods: a profile-based strategy that aggregates all query–response pairs from each LLM into a unified representation, and document-based strategies that operate either at the response level or at the query level. Before applying these models, we introduce a two-stage data filtering pipeline to remove uninformative and low-confidence responses, yielding a cleaner signal for expertise estimation. Experimental results on the development set show that response-based aggregation provides the most fine-grained and reliable ranking of LLMs, outperforming both the profile-based and query-based variants. Guided by these findings, we prepared five submissions combining different retrieval, filtering, and aggregation configurations, including a re-ranking variant using naver-splade-v3. Our study demonstrates that classical expert retrieval methods, when adapted appropriately, can effectively model and rank LLM expertise.
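The contrast between the two aggregation strategies can be sketched with a deliberately simple term-overlap score standing in for a real retrieval model such as BM25: profile-based scoring matches the query against one concatenated profile per LLM, while response-based scoring matches each stored response separately and aggregates per LLM. Data and names below are illustrative, not from the paper.

```python
# Toy contrast of profile-based vs. response-based expertise estimation.

def overlap_score(query, text):
    """Fraction of query terms appearing in the text (BM25 stand-in)."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

def profile_based(query, responses_by_llm):
    """Concatenate each LLM's responses into one profile, score once."""
    return {llm: overlap_score(query, " ".join(docs))
            for llm, docs in responses_by_llm.items()}

def response_based(query, responses_by_llm):
    """Score each response separately, keep the best score per LLM."""
    return {llm: max(overlap_score(query, d) for d in docs)
            for llm, docs in responses_by_llm.items()}

responses_by_llm = {
    "llm-A": ["paris is the capital of france", "gravity bends light"],
    "llm-B": ["stock markets fell today"],
}
query = "capital of france"
print(response_based(query, responses_by_llm))  # {'llm-A': 1.0, 'llm-B': 0.0}
```

Because the response-based variant scores each document on its own, one strong on-topic answer is not diluted by a model's unrelated responses, which is consistent with the finding that response-level aggregation gives the most fine-grained ranking.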

Bibtex
@inproceedings{GenAIus-trec2025-papers-proc-3,
    title = {Finding the Right LLM: Expert Retrieval for Model Ranking in the TREC 2025 Million LLM Track},
    author = {Reyyan Yeniterzi and Suveyda Yeniterzi},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}