Proceedings - Ad-hoc Video Search 2024¶
TRECVID 2024 - Evaluating video search, captioning, and activity recognition¶
George Awad, Jonathan Fiscus, Afzal Godil, Lukas Diduch, Yvette Graham, Georges Quénot
Abstract
The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, task-based evaluation supported by metrology.
Bibtex
@inproceedings{coordinators-trec2024-papers-proc-6,
title = {TRECVID 2024 - Evaluating video search, captioning, and activity recognition},
author = {George Awad and Jonathan Fiscus and Afzal Godil and Lukas Diduch and Yvette Graham and Georges Quénot},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
Softbank-Meisei at TREC 2024 Ad-hoc Video Search and Video to Text Tasks¶
Kazuya Ueki, Yuma Suzuki, Hiroki Takushima, Haruki Sato, Takumi Takada, Aiswariya Manoj Kumar, Hayato Tanoue, Hiroki Nishihara, Yuki Shibata, Takayuki Hori
- Participant: softbank-meisei
- Paper: https://trec.nist.gov/pubs/trec33/papers/softbank-meisei.avs.vtt.pdf
- Runs: SoftbankMeisei - Main Run 1 | SoftbankMeisei - Main Run 2 | SoftbankMeisei - Main Run 3 | SoftbankMeisei - Main Run 4 | SoftbankMeisei - Progress Run 1 | SoftbankMeisei - Progress Run 2 | SoftbankMeisei - Progress Run 3 | SoftbankMeisei - Progress Run 4
Abstract
The Softbank-Meisei team participated in the ad-hoc video search (AVS) and video-to-text (VTT) tasks at TREC 2024. In this year’s AVS task, we submitted four fully automatic systems for both the main and progress tasks. Our systems utilized pre-trained vision and language models, including CLIP, BLIP, and BLIP-2, along with several other advanced models. We also expanded the original query texts using text-generation and image-generation techniques to enhance data diversity. The integration ratios of these models were optimized on results from previous benchmark test datasets. In this year’s VTT task, as last year, we submitted four main-task methods that use captioning with multiple models, reranking, and generative AI for summarization. For the subtasks, we submitted three methods using the output of each model. On last year’s main-task test data, our methods improved by about 0.04 points in CIDEr-D and about 0.03 points in SPICE, according to the metrics we had on hand.
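The tuned integration ratios are not given in the abstract, but the late-fusion scheme it describes can be illustrated with a minimal Python sketch, assuming per-model query-by-shot similarity matrices have already been computed (e.g., with CLIP, BLIP, and BLIP-2 encoders); the weights below are illustrative placeholders, not the team's optimized ratios.

```python
# Minimal late-fusion sketch: combine similarity matrices from several
# pre-trained vision-language models with per-query normalization.
# The weights are hypothetical stand-ins for the tuned integration ratios.
import numpy as np

def minmax_norm(scores: np.ndarray) -> np.ndarray:
    """Scale each query's scores to [0, 1] so models are comparable."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / (hi - lo + 1e-8)

def fuse(model_scores: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Weighted sum of normalized query-by-shot similarity matrices."""
    return sum(w * minmax_norm(model_scores[name]) for name, w in weights.items())

rng = np.random.default_rng(0)
scores = {m: rng.random((3, 1000)) for m in ("clip", "blip", "blip2")}  # 3 queries, 1000 shots
weights = {"clip": 0.5, "blip": 0.3, "blip2": 0.2}                      # illustrative ratios
top10 = np.argsort(-fuse(scores, weights), axis=1)[:, :10]              # top-10 shots per query
print(top10)
```

In practice the ratios would be tuned on held-out benchmark queries, as the abstract describes, rather than fixed by hand.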
Bibtex
@inproceedings{softbank-meisei-trec2024-papers-proc-3,
title = {Softbank-Meisei at TREC 2024 Ad-hoc Video Search and Video to Text Tasks},
author = {Kazuya Ueki and Yuma Suzuki and Hiroki Takushima and Haruki Sato and Takumi Takada and Aiswariya Manoj Kumar and Hayato Tanoue and Hiroki Nishihara and Yuki Shibata and Takayuki Hori},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
RUC_AIM3 at TRECVID 2024: Ad-hoc Video Search¶
Xueyan Wang, Yang Du, Yuqi Liu, Qin Jin
- Participant: ruc_aim3
- Paper: https://trec.nist.gov/pubs/trec33/papers/ruc_aim3.avs.pdf
- Runs: add_captioning | baseline | add_QArerank | add_captioning_QArerank
Abstract
This report presents our solution for the Ad-hoc Video Search (AVS) task of TRECVID 2024. Building on our baseline AVS model from TRECVID 2023, we further improve search performance by integrating multiple visual-embedding models, generating video captions for topic-to-caption search, and applying a re-ranking strategy to the top search candidates. Our submissions from the improved AVS model rank 3rd on mean average precision (mAP) in the TRECVID 2024 AVS main task, with our best run achieving 36.8.
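As a rough illustration of the pipeline the abstract outlines, the sketch below blends a visual-embedding score with a topic-to-caption text score and leaves a hook for re-ranking the top candidates; the token-overlap scorer and the reranker stub are toy assumptions, not the RUC_AIM3 models.

```python
# Hedged sketch of embedding + caption fusion with optional re-ranking.
# caption_overlap and the reranker hook are toy stand-ins for real models.
import numpy as np

def caption_overlap(query: str, captions: list[str]) -> np.ndarray:
    """Toy topic-to-caption score: fraction of query tokens found in each caption."""
    q = set(query.lower().split())
    return np.array([len(q & set(c.lower().split())) / max(len(q), 1) for c in captions])

def search(query, visual_sim, captions, alpha=0.7, top_k=5, reranker=None):
    """Blend visual and caption scores, then optionally rerank the top-k shots."""
    score = alpha * visual_sim + (1 - alpha) * caption_overlap(query, captions)
    order = np.argsort(-score)[:top_k]
    if reranker is not None:  # e.g., a QA-style model judging query/shot relevance
        order = sorted(order, key=lambda i: -reranker(query, captions[i]))
    return order

rng = np.random.default_rng(1)
captions = ["a dog runs on the beach", "a person rides a bike", "a dog swims in a pool"]
print(search("a dog swimming", rng.random(len(captions)), captions))
```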
Bibtex
@inproceedings{ruc_aim3-trec2024-papers-proc-1,
title = {RUC\_AIM3 at TRECVID 2024: Ad-hoc Video Search},
author = {Xueyan Wang and Yang Du and Yuqi Liu and Qin Jin},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
ITI-CERTH participation in ActEV and AVS Tracks of TRECVID 2024¶
Konstantinos Gkountakos, Damianos Galanopoulos, Antonios Leventakis, Georgios Tsionkis, Klearchos Stavrothanasopoulos, Konstantinos Ioannidis, Stefanos Vrochidis, Vasileios Mezaris, Ioannis Kompatsiaris
- Participant: CERTH-ITI
- Paper: https://trec.nist.gov/pubs/trec33/papers/CERTH-ITI.avs.actev.pdf
- Runs: certh.iti.avs.24.main.run.1 | certh.iti.avs.24.main.run.2 | certh.iti.avs.24.main.run.3 | certh.iti.avs.24.progress.run.1 | certh.iti.avs.24.progress.run.2 | certh.iti.avs.24.progress.run.3
Abstract
This report presents an overview of the ITI-CERTH team's runs for the Ad-hoc Video Search (AVS) and Activities in Extended Video (ActEV) tasks. Our participation in the AVS task involves a collection of five cross-modal deep network architectures and numerous pre-trained models, which are used to calculate the similarities between video shots and queries. These calculated similarities serve as input to a trainable neural network that effectively combines them. During the retrieval stage, we also introduce a normalization step that uses both the current and previous AVS queries to revise the combined video shot-query similarities. For the ActEV task, we adapt our framework to support rule-based classification, overcoming the challenges of detecting and recognizing activities in a multi-label manner while experimenting with two separate activity classifiers.
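One plausible reading of that normalization step is sketched below, under stated assumptions: per-shot statistics are computed over the pooled current and previous AVS queries and used to revise the combined similarities. The fixed linear combiner stands in for the trainable fusion network, and all weights are illustrative.

```python
# Sketch (one interpretation, not the paper's exact method): combine five
# models' similarities, then z-normalize each shot over the query pool.
import numpy as np

def pool_normalize(sims: np.ndarray) -> np.ndarray:
    """Revise similarities with per-shot statistics over the query pool
    (rows = current + previous AVS queries, columns = video shots)."""
    mu = sims.mean(axis=0, keepdims=True)
    sd = sims.std(axis=0, keepdims=True) + 1e-8
    return (sims - mu) / sd

def combine(per_model: list[np.ndarray], weights: np.ndarray) -> np.ndarray:
    """Stand-in for the trainable combiner: weighted sum of per-model scores."""
    return sum(w * s for w, s in zip(weights, per_model))

rng = np.random.default_rng(2)
models = [rng.random((30, 500)) for _ in range(5)]  # 5 architectures, 30 pooled queries
fused = pool_normalize(combine(models, np.full(5, 0.2)))
print(np.argsort(-fused[-1])[:10])                  # top-10 shots for the newest query
```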
Bibtex
@inproceedings{CERTH-ITI-trec2024-papers-proc-1,
title = {ITI-CERTH participation in ActEV and AVS Tracks of TRECVID 2024},
author = {Konstantinos Gkountakos and Damianos Galanopoulos and Antonios Leventakis and Georgios Tsionkis and Klearchos Stavrothanasopoulos and Konstantinos Ioannidis and Stefanos Vrochidis and Vasileios Mezaris and Ioannis Kompatsiaris},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}
WHU-NERCMS AT TRECVID2024: AD-HOC VIDEO SEARCH TASK¶
Heng Liu, Jiangshan He, Zeyuan Zhang, Yuanyuan Xu, Chao Liang
- Participant: WHU-NERCMS
- Paper: https://trec.nist.gov/pubs/trec33/papers/WHU-NERCMS.avs.pdf
- Runs: run4 | run3 | run2 | Manual_run1 | relevance_feedback_run4 | relevance_feedback_run1 | auto_run1 | rf_run2 | RF_run3
Abstract
The WHU-NERCMS team participated in the ad-hoc video search (AVS) task of TRECVID 2024. In this year’s AVS task, we continued to use multiple visual semantic embedding methods, combined with interactive feedback-guided rank aggregation to integrate different models and their outputs into the final ranked list of video shots. We submitted four runs each for the automatic and interactive tasks, along with one attempt at the manual assistance task. Table 1 in the paper shows our results for this year.
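The abstract does not specify the aggregation scheme, so the sketch below uses reciprocal rank fusion with a small bonus for shots marked relevant during interactive feedback; the constants and the feedback mechanism are illustrative assumptions, not the team's method.

```python
# Illustrative feedback-guided rank aggregation via reciprocal rank fusion.
# k and bonus are conventional/hypothetical constants, not tuned values.

def rrf_with_feedback(rankings: list[list[int]], liked: set[int],
                      k: int = 60, bonus: float = 0.05) -> list[int]:
    """Fuse per-model rankings, boosting shots the user marked relevant."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, shot in enumerate(ranking, start=1):
            scores[shot] = scores.get(shot, 0.0) + 1.0 / (k + rank)
    for shot in liked:                       # interactive relevance feedback
        scores[shot] = scores.get(shot, 0.0) + bonus
    return sorted(scores, key=scores.get, reverse=True)

rankings = [[3, 1, 2, 0], [1, 3, 0, 2], [3, 2, 1, 0]]  # toy per-model rankings
print(rrf_with_feedback(rankings, liked={2}))
```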
Bibtex
@inproceedings{WHU-NERCMS-trec2024-papers-proc-1,
title = {WHU-NERCMS AT TRECVID2024: AD-HOC VIDEO SEARCH TASK},
author = {Heng Liu and Jiangshan He and Zeyuan Zhang and Yuanyuan Xu and Chao Liang},
booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
year = {2024},
address = {Gaithersburg, Maryland},
series = {NIST SP 1329}
}