Proceedings - Video-To-Text 2024¶

TRECVID 2024 - Evaluating video search, captioning, and activity recognition¶

George Awad, Jonathan Fiscus, Afzal Godil, Lukas Diduch, Yvette Graham, Georges Quénot

Paper: https://trec.nist.gov/pubs/trec33/papers/Overview_avs.pdf

Abstract

The TREC Video Retrieval Evaluation (TRECVID) is a TREC-style video analysis and retrieval evaluation with the goal of promoting progress in research and development of content-based exploitation and retrieval of information from digital video via open, tasks-based evaluation supported by metrology.

Bibtex

@inproceedings{coordinators-trec2024-papers-proc-6,
    title = {TRECVID 2024 - Evaluating video search, captioning, and activity recognition},
    author = {George Awad and Jonathan Fiscus and Afzal Godil and Lukas Diduch and Yvette Graham and Georges Quénot},
    booktitle = {Proceedings of the 33th Text {REtrieval} Conference (TREC 2024)},
    year = {2024},
    address = {Gaithersburg, Maryland},
    series = {NIST SP 1329}
}

Softbank-Meisei at TREC 2024¶

Kazuya Ueki, Yuma Suzuki, Hiroki Takushima, Haruki Sato, Takumi Takada, Aiswariya Manoj Kumar, Hayato Tanoue, Hiroki Nishihara, Yuki Shibata, Takayuki Hori

Participant: softbank-meisei
Paper: https://trec.nist.gov/pubs/trec33/papers/softbank-meisei.avs.vtt.pdf
Runs: SoftbankMeisei_vtt_main_run1 | SoftbankMeisei_vtt_main_run2 | SoftbankMeisei_vtt_main_run3 | SoftbankMeisei_vtt_main_run4 | SoftbankMeisei_vtt_sub_run2 | SoftbankMeisei_vtt_sub_run3 | SoftbankMeisei_vtt_sub_run1

Abstract

The Softbank-Meisei team participated in the ad-hoc video search (AVS) and video-to-text (VTT) tasks at TREC 2024. In this year’s AVS task, we submitted four fully automatic systems for both the main and progress tasks. Our systems utilized pre-trained vision and language models, including CLIP, BLIP, and BLIP-2, along with several other advanced models. We also expanded the original query texts using text generation and image generation techniques to enhance data diversity. The integration ratios of these models were optimized based on results from previous benchmark test datasets. In this year’s VTT, as last year, we submitted four main task methods using multiple model captioning, reranking, and generative AI for summarization. For the subtasks, we submitted three methods using the output of each model. Last year’s test data for the main task showed improvements of about 0.04 points in CIDEr-D and about 0.03 points in SPICE, based on the indices we had on hand.

Bibtex

@inproceedings{softbank-meisei-trec2024-papers-proc-2,
    title = {Softbank-Meisei at TREC 2024},
    author = {Kazuya Ueki and Yuma Suzuki and Hiroki Takushima and Haruki Sato and Takumi Takada and Aiswariya Manoj Kumar and Hayato Tanoue and Hiroki Nishihara and Yuki Shibata and Takayuki Hori},
    booktitle = {Proceedings of the 33th Text {REtrieval} Conference (TREC 2024)},
    year = {2024},
    address = {Gaithersburg, Maryland},
    series = {NIST SP 1329}
}

Softbank-Meisei at TREC 2024 Ad-hoc Video Search and Video to Text Tasks¶

Kazuya Ueki, Yuma Suzuki, Hiroki Takushima, Haruki Sato, Takumi Takada, Aiswariya Manoj Kumar, Hayato Tanoue, Hiroki Nishihara, Yuki Shibata, Takayuki Hori

Participant: softbank-meisei
Paper: https://trec.nist.gov/pubs/trec33/papers/softbank-meisei.avs.vtt.pdf
Runs: SoftbankMeisei_vtt_main_run1 | SoftbankMeisei_vtt_main_run2 | SoftbankMeisei_vtt_main_run3 | SoftbankMeisei_vtt_main_run4 | SoftbankMeisei_vtt_sub_run2 | SoftbankMeisei_vtt_sub_run3 | SoftbankMeisei_vtt_sub_run1

Abstract

The Softbank-Meisei team participated in the ad-hoc video search (AVS) and video-to-text (VTT) tasks at TREC 2024. In this year’s AVS task, we submitted four fully automatic systems for both the main and progress tasks. Our systems utilized pre-trained vision and language models, including CLIP, BLIP, and BLIP-2, along with several other advanced models. We also expanded the original query texts using text generation and image generation techniques to enhance data diversity. The integration ratios of these models were optimized based on results from previous benchmark test datasets. In this year’s VTT, as last year, we submitted four main task methods using multiple model captioning, reranking, and generative AI for summarization. For the subtasks, we submitted three methods using the output of each model. Last year’s test data for the main task showed improvements of about 0.04 points in CIDEr-D and about 0.03 points in SPICE, based on the indices we had on hand.

Bibtex

@inproceedings{softbank-meisei-trec2024-papers-proc-3,
    title = {Softbank-Meisei at TREC 2024 Ad-hoc Video Search and Video to Text Tasks},
    author = {Kazuya Ueki and Yuma Suzuki and Hiroki Takushima and Haruki Sato and Takumi Takada and Aiswariya Manoj Kumar and Hayato Tanoue and Hiroki Nishihara and Yuki Shibata and Takayuki Hori},
    booktitle = {Proceedings of the 33th Text {REtrieval} Conference (TREC 2024)},
    year = {2024},
    address = {Gaithersburg, Maryland},
    series = {NIST SP 1329}
}