Proceedings - Video Question Answering (VQA) 2025

Video Question Answering (VQA) 2025 Track

George Awad, Sanjay Purushotham, Afzal Godil

Abstract

Recent advancements in large multimodal models have significantly improved AI’s ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, requiring AI systems to integrate visual, auditory, and temporal information to answer questions in a meaningful way. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants developed and tested models that answer a diverse set of questions about video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track serves as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current multimodal AI architectures. By fostering innovation in multimodal learning, this track contributes to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information. The track completed its first pilot year, which included two subtasks: an Answer Generation task and a Multiple Choice task. Based on lessons learned and participant feedback, we plan to run the track again in 2026.

Bibtex
@inproceedings{coordinators-trec2025-papers-proc-1,
    title = {Video Question Answering (VQA) 2025 Track},
    author = {George Awad and Sanjay Purushotham and Afzal Godil},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Nagaoka University of Technology at TREC 2025 Video Question Answering

Isabel Gonzalez, Shungo Kubosaka, Takashi Yukawa

Abstract

This paper details our approach to two tasks in the TREC 2025 Video Question Answering (VQA) challenge, using VideoLLaMA3-2B as the base model. For both the Answer Generation (AG) task and the Multiple Choice (MC) task, the primary training data was the dataset provided by TREC, which was used to finetune the model with LoRA. For the AG task, we generated diverse answers via sampling and ranked them by the model's average log-probability, which proved effective with an NDCG_BERT score of 0.993. Our generated answers were found to be semantically similar to the references (BERT score: 0.893) but lexically different (METEOR: 0.226). We also identified a potential bias in confidence-based ranking that favors shorter answers. For the MC task, our approach was a two-stage process in which the model first generates a "ground truth" answer via greedy decoding, which a Sentence-Transformer then uses to rank the given options by cosine similarity. This approach achieved a Top-1 Accuracy of 0.499 and a Mean Reciprocal Rank (MRR) of 0.686. The system's effectiveness depended on both the accuracy of the "ground truth" generation and the Sentence-Transformer's similarity measurement. We found that this "generate-then-compare" strategy is viable, but its main limitation is error propagation from the first step. This paper outlines these methods, our experimental findings, and their limitations.
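
The confidence-based ranking described above can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation; the candidate structure with a `token_logprobs` field is hypothetical. The example also shows the length bias the paper notes: a short answer with moderately confident tokens can outrank a longer, more descriptive one.

```python
def rank_by_avg_logprob(candidates):
    """Rank sampled candidate answers by mean token log-probability,
    used as a proxy for model confidence (hypothetical data layout)."""
    def avg_lp(cand):
        return sum(cand["token_logprobs"]) / len(cand["token_logprobs"])
    return sorted(candidates, key=avg_lp, reverse=True)

cands = [
    {"text": "a man cooking",
     "token_logprobs": [-0.2, -0.4, -0.3]},                      # avg -0.30
    {"text": "a person is preparing food in a kitchen",
     "token_logprobs": [-0.6, -0.9, -0.5, -0.8, -0.7,
                        -0.6, -0.4, -0.5]},                      # avg -0.625
]
ranked = rank_by_avg_logprob(cands)
print(ranked[0]["text"])  # "a man cooking" — the shorter answer wins
```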

Bibtex
@inproceedings{kslab-trec2025-papers-proc-1,
    title = {Nagaoka University of Technology at TREC 2025 Video Question Answering},
    author = {Isabel Gonzalez and Shungo Kubosaka and Takashi Yukawa},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

WHU-NERCMS AT TRECVID2025: AD-HOC VIDEO SEARCH (AVS) AND VIDEO QUESTION ANSWER (VQA) TASK

Fangyun Duan, Haixiang Ni, Xiusong Wang, Chao Liang

Abstract

The WHU-NERCMS team participated in the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tasks at TRECVID 2025. For the AVS task, we continued to use multiple visual semantic embedding methods, combined with rank aggregation techniques to integrate different models and their outputs into the final ranked list of video shots. For the VQA task, we use a VLM to generate an answer that serves as a baseline. This answer is embedded in the same vector space as the four options, and the similarity of these vectors is computed to rank the results.
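
The option-ranking step described above reduces to cosine similarity between the baseline answer's embedding and each option's embedding. A minimal sketch, assuming precomputed embedding vectors (the toy 3-dimensional vectors below are purely illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_options(baseline_vec, option_vecs):
    """Return option indices sorted by similarity to the baseline answer."""
    sims = [cosine(baseline_vec, v) for v in option_vecs]
    return sorted(range(len(option_vecs)), key=lambda i: sims[i], reverse=True)

baseline = [0.9, 0.1, 0.0]          # embedding of the VLM-generated answer
options = [[0.0, 1.0, 0.0],
           [1.0, 0.2, 0.0],
           [0.1, 0.1, 1.0],
           [0.5, 0.5, 0.5]]
order = rank_options(baseline, options)
print(order[0])  # option 1 is most similar to the baseline answer
```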

Bibtex
@inproceedings{WHU-NERCMS-trec2025-papers-proc-1,
    title = {WHU-NERCMS AT TRECVID2025: AD-HOC VIDEO SEARCH (AVS) AND VIDEO QUESTION ANSWER (VQA) TASK},
    author = {Fangyun Duan and Haixiang Ni and Xiusong Wang and Chao Liang},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

Spatio-Temporal Input Densification for Efficient and Robust Open-Domain Video Question Answering

Bao Tran, Thuyen Tran Doan, Tien Do, Tien-Dung Mai, Thanh Duc Ngo, Duy-Dinh Le, Shin'ichi Satoh

Abstract

Video Question Answering (VQA) requires systems to jointly reason over visual, auditory, and linguistic cues, and remains challenging due to complex temporal dependencies and the diverse, open-ended nature of real-world queries. Recent approaches often depend on supervised finetuning of large vision-language models, which yields strong in-distribution performance but comes with substantial data and computational demands. Furthermore, finetuned systems can struggle with temporal reasoning, small-object recognition, and effective use of audio information, limiting their robustness in open-domain benchmarks such as TRECVID. In this work, we introduce a VQA framework that enhances existing multimodal models without task-specific finetuning. Its core component is a spatio-temporal input densification strategy that reorganizes video evidence using dense frame sampling and spatial tiling, enabling finer visual understanding and more reliable temporal inference. The framework also incorporates lightweight modules for textualized audio integration, question type–aware prompting, and output normalization, contributing to improved robustness and answer consistency. Despite requiring no task-specific finetuning, the proposed system achieves strong results on the TRECVID 2025 VQA test set. Our Multiple Choice submission attains a Top-1 Accuracy of 0.774 and an MRR of 0.859, ranking as the top-performing run. For Answer Generation, the system reaches METEOR 0.173, BERTScore 0.887, STS 0.270, and NDCGBERTScore 0.996, placing it among the highest-ranked submissions. These results demonstrate the effectiveness of inference-time input densification as a scalable alternative to supervised finetuning.
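
The densification idea can be illustrated with a small sketch: densely sample frame indices, then split each sampled frame into spatial tiles so that small objects occupy more of the model's input resolution. This is our reading of the abstract, not the authors' code; the parameter names (`stride`, `tile_grid`) are assumptions.

```python
def densify(num_frames, stride, tile_grid=(2, 2)):
    """Spatio-temporal input densification (sketch): dense temporal
    sampling followed by spatial tiling of each sampled frame."""
    frame_ids = list(range(0, num_frames, stride))      # dense frame sampling
    rows, cols = tile_grid
    # Each tile is identified by (frame index, tile row, tile column);
    # a real pipeline would crop the corresponding image region here.
    tiles = [(f, r, c) for f in frame_ids
             for r in range(rows) for c in range(cols)]
    return frame_ids, tiles

frames, tiles = densify(num_frames=120, stride=10)
print(len(frames), len(tiles))  # 12 sampled frames -> 48 tiles
```

The trade-off is that the token budget grows with the number of tiles, which is why this is an inference-time strategy rather than a training change.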

Bibtex
@inproceedings{NII_UIT-trec2025-papers-proc-2,
    title = {Spatio-Temporal Input Densification for Efficient and Robust Open-Domain Video Question Answering},
    author = {Bao Tran and Thuyen Tran Doan and Tien Do and Tien-Dung Mai and Thanh Duc Ngo and Duy-Dinh Le and Shin'ichi Satoh},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

HLTCOE Evaluation Team at TREC 2025: VQA Track

Dengjia Zhang, Charles Weng, Katherine Guerrerio, Yi Lu, Kenton Murray, Alexander Martin, Reno Kriz, Benjamin Van Durme

Abstract

The HLTCOE Evaluation team participated in TREC VQA’s Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video–question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
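
One plausible reading of the loss described above (pointer-based selection without replacement, rank-dependent weights, and a mask restricting the eligible candidates) can be sketched in plain Python. This is an interpretation for illustration only, not the authors' exact formulation; `rank_weight` and the data layout are assumptions.

```python
import math

def masked_pointer_ce(scores, gold_ranking, mask,
                      rank_weight=lambda r: 1.0 / (r + 1)):
    """Rank-weighted, masked pointer cross-entropy (sketch).

    At each rank step, softmax over the still-eligible candidates,
    take the negative log-probability of the gold pointer, weight it
    by the rank, and remove the selected candidate (no replacement).
    """
    active = [i for i, m in enumerate(mask) if m]  # mask = eligibility
    loss = 0.0
    for r, gold in enumerate(gold_ranking):
        z = sum(math.exp(scores[i]) for i in active)
        p_gold = math.exp(scores[gold]) / z
        loss += rank_weight(r) * -math.log(p_gold)
        active.remove(gold)                        # pointer selection
    return loss

scores = [2.0, 1.0, 0.5]                 # reranker scores for 3 candidates
loss = masked_pointer_ce(scores, gold_ranking=[0, 2],
                         mask=[True, True, True])
print(round(loss, 3))  # 0.951
```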

Bibtex
@inproceedings{HLTCOE-trec2025-papers-proc-3,
    title = {HLTCOE Evaluation Team at TREC 2025: VQA Track},
    author = {Dengjia Zhang and Charles Weng and Katherine Guerrerio and Yi Lu and Kenton Murray and Alexander Martin and Reno Kriz and Benjamin Van Durme},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}

MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025

Andreas Goulas, Damianos Galanopoulos, Ioannis Patras, Vasileios Mezaris

Abstract

This paper presents an overview of the runs submitted by the CERTH-ITI team to the Ad-hoc Video Search (AVS) and Video Question Answering (VQA) tracks of TRECVID 2025. For the AVS track, we introduce a two-stage framework built on foundation models. In the first stage, multiple vision–language models (VLMs) encode both the input query, augmented through LLM-generated rephrasings, and the candidate video shots, producing weighted similarity scores for initial retrieval. In the second stage, we utilize a Multimodal-LLM (MLLM)-based reranking module that evaluates the semantic alignment between each of the top-N highest-ranked shots and the original query, generating updated relevance scores for reordering these shots. This MLLM-driven reranking significantly improves contextual matching and produces more accurate final rankings without requiring any model training. For the VQA track, we fine-tune an audio-visual MLLM on the provided TRECVID training dataset and implement an inference-time scaling technique to enhance the model's multimodal understanding capabilities. For the open-ended Answer Generation (AG) task, we aggregate multiple model responses per question via a majority vote. The responses are generated with greedy sampling from different random frame subsets of the video and are ranked by the number of votes. For the Multiple-Choice (MC) task, instead of voting, we use mean pooling on the logits assigned by the fine-tuned model to each candidate response. Through the combination of fine-tuning and frame subset ensembling, we achieve the highest score across 3 metrics in the VQA AG task and the second highest in the VQA MC task.
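
The two ensembling schemes described above (majority voting over frame-subset responses for AG, mean pooling of per-option logits for MC) can be sketched as follows. The data layouts are assumptions for illustration, not the authors' code.

```python
from collections import Counter

def vote_answers(responses):
    """AG sketch: rank open-ended answers by how many frame-subset
    runs produced each one (majority vote)."""
    return [ans for ans, _ in Counter(responses).most_common()]

def mean_pool_mc(logits_per_subset):
    """MC sketch: average each option's logit across frame subsets,
    then pick the argmax. `logits_per_subset[k][i]` is the logit of
    option i on frame subset k (assumed layout)."""
    n = len(logits_per_subset)
    num_options = len(logits_per_subset[0])
    pooled = [sum(run[i] for run in logits_per_subset) / n
              for i in range(num_options)]
    return max(range(num_options), key=pooled.__getitem__)

ag_ranked = vote_answers(["a dog", "a cat", "a dog"])
mc_choice = mean_pool_mc([[0.1, 2.0, 0.3, 0.0],
                          [0.2, 1.5, 0.4, 0.1]])
print(ag_ranked[0], mc_choice)  # "a dog" 1
```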

Bibtex
@inproceedings{CERTH-ITI-trec2025-papers-proc-1,
    title = {MLLM Frame Subset Ensembling for Audio-Visual Video QA and MLLM-based Reranking for Ad-hoc Video Search in TRECVID 2025},
    author = {Andreas Goulas and Damianos Galanopoulos and Ioannis Patras and Vasileios Mezaris},
    booktitle = {Proceedings of the 34th Text {REtrieval} Conference (TREC 2025)},
    year = {2025},
    address = {Gaithersburg, Maryland},
    series = {NIST SP xxxx}
}