
Overview - Video-To-Text 2024ΒΆ


Automatic annotation of videos with natural language descriptions has been a long-standing goal of computer vision. The task requires understanding many concepts, such as objects, actions, scenes, person-object relations, and the temporal order of events. In recent years, major advances in computer vision techniques have enabled researchers to attempt this problem. Many application scenarios can benefit greatly from such technology, including video summarization in natural language, search and browsing of video archives via textual descriptions, and describing videos to the blind. In addition, learning video interpretation and the temporal relations of events in video will likely contribute to other computer vision tasks, such as predicting future events from video.

Track coordinator(s):

  • George Awad, NIST
  • Yvette Graham, Trinity College Dublin
  • Afzal Godil, NIST

Track Web Page: https://www-nlpir.nist.gov/projects/tv2024/vtt.html