Proceedings - Medical Video Question Answering 2024

Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track

Deepak Gupta, Dina Demner-Fushman

Abstract

One of the key goals of artificial intelligence (AI) is the development of a multimodal system that facilitates communication with the visual world (image and video) using natural language queries. Earlier work on medical question answering focused primarily on textual and visual (image) modalities, which can be insufficient for answering questions that require a demonstration. In recent years, significant progress has been achieved thanks to the introduction of large-scale language-vision datasets and the development of efficient deep neural techniques that bridge the gap between language and visual understanding.

Bibtex
@inproceedings{coordinators-trec2024-papers-proc-2,
    title = {Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track},
    author = {Deepak Gupta and Dina Demner-Fushman},
    booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
    year = {2024},
    address = {Gaithersburg, Maryland},
    series = {NIST SP 1329}
}

Doshisha University, Universität zu Lübeck and German Research Center for Artificial Intelligence at TRECVID 2024: QFISC Task

Zihao Chen, Falco Lentzsch, Nele S. Brügge, Frédéric Li, Miho Ohsaki, Heinz Handels, Marcin Grzegorzek, Kimiaki Shirahama

Abstract

This paper presents the approaches proposed by the DoshishaUzlDfki team to address the Query-Focused Instructional Step Captioning (QFISC) task of TRECVID 2024. Given RGB videos containing stepwise instructions, we explored several techniques to automatically identify the boundaries of each step and provide a caption for it. More specifically, two different types of methods were investigated for temporal video segmentation. The first uses the CoSeg approach proposed by Wang et al. [9], based on Event Segmentation Theory, which hypothesises that video frames at step boundaries are harder to predict since they tend to contain more significant visual changes. In detail, CoSeg detects event boundaries in the RGB video stream by finding local maxima in the reconstruction error of a model trained to reconstruct the temporal contrastive embeddings of video snippets. The second type of approach relies exclusively on the audio modality and is based on the hypothesis that information about step transitions is often semantically contained in the verbal transcripts of the videos. In detail, we used the WhisperX model [3], which isolates speech parts in the audio tracks of the videos and converts them into timestamped text transcripts. These transcripts were then given as input to a Large Language Model (LLM) with a carefully designed prompt requesting the LLM to identify step boundaries. Once the temporal video segmentation was performed, we sent the WhisperX transcripts corresponding to the video segments determined by both methods to an LLM instructed to caption them. The GPT-4o and Mistral Large 2 LLMs were employed in our experiments for both segmentation and captioning. Our results show that the audio-based temporal segmentation methods significantly outperform the video-based one. More specifically, our best results were obtained by the approach using GPT-4o with zero-shot prompting for temporal segmentation: it achieves the top overall performance among all runs submitted to the QFISC task on all evaluation metrics except precision, where the best result is obtained by our run using Mistral Large 2 with chain-of-thought prompting.
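As a concrete illustration of the CoSeg-style boundary detection described above, the following is a minimal Python sketch of its peak-picking step: local maxima in a model's per-snippet reconstruction error are treated as step boundaries. The reconstruction model itself is omitted; the function name, the synthetic error curve, and the peak-picking thresholds are illustrative assumptions, not the authors' implementation.

# Minimal sketch of CoSeg-style boundary detection: local maxima in the
# reconstruction error over video snippets are treated as step boundaries.
# The error curve here is synthetic; in CoSeg it comes from a model trained
# to reconstruct temporal contrastive embeddings of the snippets.
import numpy as np
from scipy.signal import find_peaks

def detect_step_boundaries(recon_error, fps=1.0, min_gap_s=5.0, prominence=0.5):
    """Return boundary timestamps (in seconds) at local maxima of the error.

    recon_error : one reconstruction-error value per video snippet
    fps         : snippets per second (assumed; depends on the snippet stride)
    min_gap_s   : minimum separation between boundaries, to suppress noise
    prominence  : how much a peak must stand out from its surroundings
    """
    peaks, _ = find_peaks(recon_error,
                          distance=max(1, int(min_gap_s * fps)),
                          prominence=prominence)
    return peaks / fps

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(300)                      # 300 snippets at 1 snippet/s
    err = 0.1 * rng.standard_normal(300)    # background noise
    for b in (60, 140, 230):                # true step boundaries
        err += 2.0 * np.exp(-0.5 * ((t - b) / 3.0) ** 2)
    print(detect_step_boundaries(err))      # approximately [60., 140., 230.]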

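The audio-based alternative can be sketched in the same spirit: hand timestamped transcript lines to an LLM and ask it for step boundaries and captions in one pass. The prompt wording, the JSON schema, and the use of the OpenAI Python client are assumptions for illustration only; the authors' actual zero-shot and chain-of-thought prompts are not reproduced here.

# Minimal sketch of transcript-based segmentation with an LLM. Assumes
# WhisperX-style input of (start, end, text) tuples and the openai client
# (OPENAI_API_KEY must be set); the prompt below is a generic placeholder,
# not the authors' prompt.
import json
from openai import OpenAI

client = OpenAI()

def segment_transcript(lines):
    """lines: iterable of (start_s, end_s, text) from a speech recognizer."""
    transcript = "\n".join(f"[{s:.1f}-{e:.1f}] {t}" for s, e, t in lines)
    prompt = (
        "Below is a timestamped transcript of an instructional video. "
        "Identify the instructional steps and return JSON of the form "
        '{"steps": [{"start": ..., "end": ..., "caption": "..."}]}, '
        "one entry per step, covering the whole video.\n\n" + transcript
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # deterministic boundaries, easier to evaluate
    )
    return json.loads(resp.choices[0].message.content)["steps"]
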
Bibtex
@inproceedings{DoshishaUzlDfki-trec2024-papers-proc-1,
    title = {Doshisha University, Universität zu Lübeck and German Research Center for Artificial Intelligence at TRECVID 2024: QFISC Task},
    author = {Zihao Chen and Falco Lentzsch and Nele S. Brügge and Frédéric Li and Miho Ohsaki and Heinz Handels and Marcin Grzegorzek and Kimiaki Shirahama},
    booktitle = {Proceedings of the 33rd Text {REtrieval} Conference (TREC 2024)},
    year = {2024},
    address = {Gaithersburg, Maryland},
    series = {NIST SP 1329}
}