Proceedings - Ad-hoc Video Search 2025

TREC 2025 Ad-hoc Video Search (AVS) Track Overview

George Awad

Abstract

The Ad-hoc Video Search (AVS) task at TREC continues to serve as a long-running benchmark for measuring progress in open-vocabulary video retrieval. The 2025 cycle builds on more than a decade of work and reflects the rapidly evolving landscape of multimodal and vision-language models. This overview describes the task design, dataset characteristics, evaluation protocol, participating teams, and the general retrieval trends observed during this assessment cycle.

Bibtex
@inproceedings{coordinators-trec2025-papers-proc-3,
    title = {TREC 2025 Ad-hoc Video Search (AVS) Track Overview},
    author = {George Awad},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

WHU-NERCMS AT TRECVID 2025: AD-HOC VIDEO SEARCH (AVS) AND VIDEO QUESTION ANSWERING (VQA) TASKS

Fangyun Duan, Haixiang Ni, Xiusong Wang, Chao Liang

Abstract

The WHU-NERCMS team participated in the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tasks at TRECVID 2025. For the AVS task, we continued to use multiple visual-semantic embedding methods, combined with rank aggregation techniques that integrate the different models and their outputs into the final ranked list of video shots. For the VQA task, we propose using a VLM to generate an answer that serves as a baseline answer. This answer is then embedded in the same vector space as the four candidate options, and the similarities between these vectors are computed to rank the options.
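As a rough illustration of the VQA option-ranking step described in the abstract, the sketch below embeds the VLM-generated baseline answer and the four candidate options in a shared text-embedding space and ranks the options by cosine similarity. The embedding model (all-MiniLM-L6-v2) and the function name are illustrative assumptions, not the team's actual components.

    # Minimal sketch, assuming an off-the-shelf sentence embedder stands in for
    # whichever encoder the team used; the VLM baseline answer is given as a string.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    def rank_options(baseline_answer: str, options: list[str]) -> list[tuple[str, float]]:
        model = SentenceTransformer("all-MiniLM-L6-v2")          # illustrative embedder
        vecs = model.encode([baseline_answer] + options)         # shared vector space
        ans, opts = vecs[0], vecs[1:]
        # cosine similarity between the baseline answer and each candidate option
        sims = opts @ ans / (np.linalg.norm(opts, axis=1) * np.linalg.norm(ans) + 1e-8)
        order = np.argsort(-sims)
        return [(options[int(i)], float(sims[i])) for i in order]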

Bibtex
@inproceedings{WHU-NERCMS-trec2025-papers-proc-1,
    title = {WHU-NERCMS AT TRECVID 2025: AD-HOC VIDEO SEARCH (AVS) AND VIDEO QUESTION ANSWERING (VQA) TASKS},
    author = {Fangyun Duan and Haixiang Ni and Xiusong Wang and Chao Liang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Exploiting Temporal and Semantic Diversity: A Multi-Stage Retrieval–Reranking Pipeline for AVS 2025

Thuyen Tran Doan, Bao Tran, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh

Abstract

With the explosive growth in video content and volume, efficient video retrieval systems have become increasingly essential. However, our previous system still underperformed on queries involving temporal or action-related information. This limitation stems from its reliance on Text-to-Image (T2I) retrieval models, such as BEiT and BLIP, whose architectures are inherently image-based. In contrast, Text-to-Video (T2V) retrieval models, such as CLIP4Clip and TS2Net, are built upon pretrained backbones like CLIP and incorporate simple yet effective temporal modeling mechanisms, which enhance the system’s ability to understand temporal aspects of textual queries. For our participation in the TRECVID 2025 Ad-hoc Video Search (AVS) task, we integrated several T2V models into both the initial retrieval stage and the fusion step, in addition to the existing T2I models. This integration not only boosts the overall average precision (AP) score but also improves system diversity and recall. To further leverage the increased recall, we employ a reranking step using several Large Vision-Language Models (LVLMs). These models, equipped with advanced reasoning capabilities, can better interpret complex or ambiguous query elements, such as exclusion terms, that are often challenging for smaller T2I/T2V models to handle effectively. Evaluated on the AVS 2024 and 2025 main tasks, our system achieves xinfAP scores of 0.4334 and 0.4361, respectively, demonstrating the effectiveness of combining diverse T2V models with multi-LVLM reranking.
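The fusion step described above can be pictured as a weighted combination of per-model similarity scores before the LVLM reranking pass. The sketch below is a generic min-max-normalized score fusion under assumed per-model weights; it is not the authors' actual aggregation scheme.

    # Hedged sketch: normalize each T2I/T2V model's shot scores and combine them.
    import numpy as np

    def fuse_scores(per_model_scores: dict[str, dict[str, float]],
                    weights: dict[str, float]) -> dict[str, float]:
        """per_model_scores maps model name -> {shot_id: similarity score}."""
        fused: dict[str, float] = {}
        for model, scores in per_model_scores.items():
            vals = np.array(list(scores.values()))
            lo, hi = float(vals.min()), float(vals.max())
            for shot, s in scores.items():
                norm = (s - lo) / (hi - lo + 1e-8)               # min-max per model
                fused[shot] = fused.get(shot, 0.0) + weights.get(model, 1.0) * norm
        return fused

    # The top-ranked candidates, e.g. sorted(fused, key=fused.get, reverse=True)[:1000],
    # would then be passed to the LVLM-based reranking stage.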

Bibtex
@inproceedings{NII_UIT-trec2025-papers-proc-1,
    title = {Exploiting Temporal and Semantic Diversity: A Multi-Stage Retrieval–Reranking Pipeline for AVS 2025},
    author = {Thuyen Tran Doan and Bao Tran and Tien Do and Tien-Dung Mai and Thanh Duc Ngo and Duy-Dinh Le and Shin'ichi Satoh},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

VLM-based Binary Judgment Re-ranking for TREC 2025 Ad-hoc Video Search

Kazuya Ueki

Abstract

We participated in the Ad-hoc Video Search (AVS) task at TREC 2025. Building upon our previous system, we aimed to further enhance search performance through a re-ranking approach. Our method employs multiple state-of-the-art Vision-Language Models (VLMs) to verify whether retrieved video shots truly match a given query, enabling more accurate semantic filtering. Among the 29 systems submitted, our four runs achieved the top four ranks, demonstrating the effectiveness of the proposed VLM-based binary judgment strategy. These results confirm the strong potential of recent VLMs to improve large-scale video retrieval.
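The binary-judgment re-ranking can be summarized as asking a VLM, for each retrieved shot, whether it matches the query and promoting the accepted shots. The sketch below assumes a hypothetical vlm_judge(query, shot) wrapper returning True/False and a fixed re-ranking depth; neither reflects the author's exact setup.

    # Illustrative re-rank by VLM yes/no verdicts; `vlm_judge` is a hypothetical stand-in.
    def rerank_by_binary_judgment(query, ranked_shots, vlm_judge, depth=1000):
        head, tail = ranked_shots[:depth], ranked_shots[depth:]
        verdicts = [(shot, vlm_judge(query, shot)) for shot in head]   # one VLM call per shot
        accepted = [shot for shot, ok in verdicts if ok]
        rejected = [shot for shot, ok in verdicts if not ok]
        # accepted shots keep their original relative order and move ahead of rejected ones
        return accepted + rejected + tail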

Bibtex
@inproceedings{meisei-trec2025-papers-proc-1,
    title = {VLM-based Binary Judgment Re-ranking for TREC 2025 Ad-hoc Video Search},
    author = {Kazuya Ueki},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Doshisha University at TREC 2025 AVS Task

Dai Morisaki, Miho Ohsaki, Kimiaki Shirahama

Abstract

This paper presents the results obtained by the Co-creation Informatics Laboratory (ccilab), Doshisha University, on the Ad-hoc Video Search (AVS) task. Our initial plan was to test the performance of a recent vision-language model, namely SigLIP 2 [1]. However, in our preliminary experiments, features extracted by the pre-trained SigLIP 2 model available at [2] did not work at all. Thus, our submitted run F_M_C_D_ccilab.25.1 was obtained using a basic vision-language model, namely OpenAI CLIP [3], specifically the pre-trained model released at [4]. The MAPs of our submitted run are 0.082 and 0.055 when using the 2024 ground truth and the combined 2024 and 2025 ground truths, respectively.
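For orientation, the run described above amounts to scoring keyframes against the query with CLIP embeddings. The sketch below uses the openai/CLIP package with a ViT-B/32 checkpoint; the checkpoint choice and preprocessing are assumptions rather than ccilab's exact configuration.

    # Minimal text-to-keyframe scoring sketch with OpenAI CLIP (assumed ViT-B/32 checkpoint).
    import torch
    import clip
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    def score_keyframes(query: str, image_paths: list[str]) -> list[float]:
        scores = []
        with torch.no_grad():
            text = clip.tokenize([query]).to(device)
            t = model.encode_text(text)
            t = t / t.norm(dim=-1, keepdim=True)
            for path in image_paths:
                img = preprocess(Image.open(path)).unsqueeze(0).to(device)
                v = model.encode_image(img)
                v = v / v.norm(dim=-1, keepdim=True)
                scores.append((v @ t.T).item())                  # cosine similarity
        return scores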

Bibtex
@inproceedings{ccilab-trec2025-papers-proc-1,
    title = {Doshisha University at TREC 2025 AVS Task},
    author = {Dai Morisaki and Miho Ohsaki and Kimiaki Shirahama},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Laboratory for Analytic Sciences in TREC 2025 Ad-hoc Video Search

Edward Sheriff, John Nolan, Yue Wang, Xi Niu

Abstract

This paper describes the Laboratory for Analytic Sciences (LAS) participation in the 2025 TREC Ad-hoc Video Search (AVS) task on the V3C2 collection. Motivated by deployment settings with constrained bandwidth and compute, our systems use a scalable keyframe-based text-to-video retrieval pipeline with dense 1 fps indexing and cosine-similarity search. We profile contrastive vision–language embedding models to select an efficient visual backbone for large-scale indexing over 4.5M keyframes. At query time, we evaluate training-free enhancements aimed at improving recall and ranking, including LLM-based semantic expansion, modality-aware query decomposition, Vision–Language Model (VLM)–based relevance scoring and reranking, and selectively triggered CLAP audio fusion for topics implying non-speech sounds. Our results emphasize the role of Vision–Language Models (VLMs), including large multimodal generative models such as gpt-4.1-mini and phi-3.5-vision, as relevance judges. VLM scores align closely with human judgments on prior topics and provide a useful reranking signal, though some mismatches persist. Disagreements are largely attributable to lexical ambiguity, subjective topic phrasing, and temporal uncertainty from single-keyframe evidence. Across six official runs spanning four workflows, VLM reranking yields the largest gains, with VLM quality emerging as the most influential variable: when operating over identical candidate pools, gpt-4.1-mini improves performance on 19 of 20 topics relative to phi-3.5-vision. Semantic expansion followed by gpt-4.1-mini reranking achieves our best score of 0.399 mean infAP.
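The dense keyframe index with cosine-similarity search can be sketched as an exact inner-product index over L2-normalized embeddings, as below. FAISS and the flat index type are illustrative choices; the abstract does not name the indexing library, and the embedding backbone is out of scope here.

    # Sketch, assuming precomputed keyframe embeddings; inner product over L2-normalized
    # vectors equals cosine similarity.
    import numpy as np
    import faiss

    def build_index(keyframe_embeddings: np.ndarray) -> faiss.IndexFlatIP:
        emb = np.ascontiguousarray(keyframe_embeddings, dtype="float32")
        faiss.normalize_L2(emb)
        index = faiss.IndexFlatIP(emb.shape[1])
        index.add(emb)
        return index

    def search(index: faiss.IndexFlatIP, query_embedding: np.ndarray, k: int = 1000):
        q = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
        faiss.normalize_L2(q)
        scores, ids = index.search(q, k)                         # top-k keyframes
        return ids[0], scores[0]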

Bibtex
@inproceedings{ncsu-las-trec2025-papers-proc-2,
    title = {Laboratory for Analytic Sciences in TREC 2025 Ad-hoc Video Search},
    author = {Edward Sheriff and John Nolan and Yue Wang and Xi Niu},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

AFRL at TREC 2025: Zero-Shot Human-Free Video Search with Caption Expansion

Andrew Young, Emily Conway, Jeremy Gwinnup

Abstract

We describe the Air Force Research Laboratory’s submission to the TREC 2025 Ad-hoc Video Search (AVS) task. Our approach addresses the challenge of indexing and searching video content without human-generated metadata by employing modern multimodal large language models for caption generation. We upgrade from traditional short-caption baselines generating 7-word descriptions to state-of-the-art models producing up to 100-word descriptions with inter-frame context awareness. Our system integrates three components: caption expansion for richer semantic coverage, context-aware learning through weighted concept banks and unlikelihood training for better embeddings, and query decomposition with question-answering models for precise re-ranking. While our official submission encountered critical integration failures resulting in poor performance, we present our methodology, analyze the failures, and demonstrate the viability of our core approach through preliminary validation results.
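The caption-expansion idea can be pictured as embedding the long MLLM-generated captions once and matching queries against them. In the sketch below, generate_caption is a hypothetical wrapper around the captioning model and the sentence embedder is an illustrative stand-in; neither is the submission's actual component.

    # Hedged sketch of caption-based indexing and search.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")           # illustrative embedder

    def index_captions(shot_ids, generate_caption):
        captions = [generate_caption(s) for s in shot_ids]       # up to ~100-word captions
        vecs = embedder.encode(captions, normalize_embeddings=True)
        return np.asarray(vecs)

    def search_captions(query, shot_ids, caption_vecs, k=1000):
        q = embedder.encode([query], normalize_embeddings=True)[0]
        sims = caption_vecs @ q                                  # cosine similarity
        top = np.argsort(-sims)[:k]
        return [(shot_ids[int(i)], float(sims[i])) for i in top]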

Bibtex
@inproceedings{AFRL-trec2025-papers-proc-1,
    title = {AFRL at TREC 2025: Zero-Shot Human-Free Video Search with Caption Expansion},
    author = {Andrew Young and Emily Conway and Jeremy Gwinnup},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025

Andreas Goulas, Damianos Galanopoulos, Ioannis Patras, Vasileios Mezaris

Abstract

This paper presents an overview of the runs submitted by the CERTH-ITI team to the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tracks of TRECVID 2025. For the AVS track, we introduce a two-stage framework built on foundation models. In the first stage, multiple vision-language models (VLMs) encode both the input query, augmented through LLM-generated rephrasings, and the candidate video shots, producing weighted similarity scores for initial retrieval. In the second stage, we utilize a Multimodal LLM (MLLM)-based reranking module that evaluates the semantic alignment between each of the top-N highest-ranked shots and the original query, generating updated relevance scores for reordering these shots. This MLLM-driven reranking significantly improves contextual matching and produces more accurate final rankings without requiring any model training. Regarding the VQA track, we fine-tune an audio-visual MLLM on the provided TRECVID training dataset and implement an inference-time scaling technique to enhance the multimodal understanding capabilities of the MLLM. For the open-ended Answer Generation (AG) task, we aggregate multiple model responses per question via a majority vote. The responses are generated with greedy sampling from different random frame subsets of the video and are ranked by the number of votes. For the Multiple-Choice (MC) task, instead of voting, we use mean pooling on the logits assigned by the fine-tuned model to each candidate response. Through the combination of fine-tuning and frame-subset ensembling, we achieve the highest score across three metrics in the VQA AG task and the second highest in the VQA MC task.
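The VQA frame-subset ensembling described above can be sketched as follows: answers generated from different random frame subsets are aggregated by majority vote for the AG task, and the MC option is chosen by mean-pooling per-option logits. generate_answer and option_logits are hypothetical model wrappers, and the subset sizes are assumptions.

    # Minimal sketch of frame-subset ensembling for AG (majority vote) and MC (mean logits).
    import random
    from collections import Counter
    import numpy as np

    def answer_by_vote(frames, generate_answer, n_subsets=5, subset_size=8, seed=0):
        rng = random.Random(seed)
        answers = []
        for _ in range(n_subsets):
            subset = rng.sample(frames, min(subset_size, len(frames)))
            answers.append(generate_answer(subset))              # greedy decoding per subset
        return Counter(answers).most_common(1)[0][0]             # majority-voted answer

    def choose_option(frames, options, option_logits, n_subsets=5, subset_size=8, seed=0):
        rng = random.Random(seed)
        logits = [option_logits(rng.sample(frames, min(subset_size, len(frames))), options)
                  for _ in range(n_subsets)]
        return options[int(np.argmax(np.mean(logits, axis=0)))]  # mean-pooled logits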

Bibtex
@inproceedings{CERTH-ITI-trec2025-papers-proc-1,
    title = {MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025},
    author = {Andreas Goulas and Damianos Galanopoulos and Ioannis Patras and Vasileios Mezaris},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}