Runs - Adhoc Video Search 2025

BLIP BLIP2 CLIP LaCLIP SLIP diffusion

Participants | Proceedings | Input | sample_eval | Appendix

  • Run ID: BLIP BLIP2 CLIP LaCLIP SLIP diffusion
  • Participant: WHU-NERCMS
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 9132f6cfa8d3e46c38d35939ee46bfc7
  • Run description: 16:4:10:3:3:3

ccilab1

  • Run ID: ccilab1
  • Participant: ccilab
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-20
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 94223a9f27bd7d9b26222373aecbb4f3
  • Run description: This run is obtained by computing similarities between shots and each topic using OpenAI CLIP's image and text encoders.
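
The similarity computation described in this run can be sketched with plain cosine similarity over embedding vectors. The vectors below are toy stand-ins, not real CLIP outputs; in the actual run they would come from OpenAI CLIP's image and text encoders.

```python
import numpy as np

def rank_shots(text_emb, shot_embs):
    """Rank shots by cosine similarity between a topic's text embedding
    and each shot's keyframe image embedding."""
    t = text_emb / np.linalg.norm(text_emb)
    s = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    sims = s @ t                 # one similarity score per shot
    order = np.argsort(-sims)    # best match first
    return order, sims[order]

# Toy 2-D vectors standing in for CLIP embeddings.
text = np.array([1.0, 0.0])
shots = np.array([[0.9, 0.1],    # shot 0: close to the text
                  [0.0, 1.0],    # shot 1: orthogonal
                  [0.7, 0.7]])   # shot 2: in between
order, scores = rank_shots(text, shots)
```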

clap

  • Run ID: clap
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 218b44a5a549e2b4228575f70c3f201e
  • Run description: gpt-4.1-mini decomposes the query into visual and (non-speech) audio components. The visual component is searched using SigLIP2-base-patch16-naflex embeddings, and the audio component is searched on CLAP embeddings. The normalized scores from both search techniques are added together for the final ranking. If the LLM decides there is no audio component, only the SigLIP2 embeddings are used.

decomp

  • Run ID: decomp
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 9d02b728d3a0dbc6058437541a35b868
  • Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is decomposed into visual components with each component expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged and merged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.
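
The first-stage retrieval in this pipeline (averaging variant embeddings into one query vector, then ranking keyframes by similarity) can be sketched as below. The vectors are toy values, not real SigLIP2 or GPT-4.1-mini outputs.

```python
import numpy as np

def build_query_vector(variant_embs):
    """Average the text embeddings of all query variants (the run uses
    100 GPT-4.1-mini rephrasings) into one unit-norm query vector."""
    v = np.mean(variant_embs, axis=0)
    return v / np.linalg.norm(v)

def retrieve(query_vec, keyframe_embs, k):
    """Return indices of the top-k keyframes by cosine similarity."""
    kf = keyframe_embs / np.linalg.norm(keyframe_embs, axis=1, keepdims=True)
    return np.argsort(-(kf @ query_vec))[:k]

variants = np.array([[1.0, 0.1], [0.9, 0.0], [1.1, -0.1]])
q = build_query_vector(variants)
frames = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
top = retrieve(q, frames, k=2)
```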

fg-clip

  • Run ID: fg-clip
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 99c2723085e2afab2f4646c9f065fc32
  • Run description: This run uses FG-CLIP embeddings to retrieve the most relevant keyframes. FG-CLIP is a fine-tuned version of OpenAI's clip-vit-base-patch32, trained on V3C1 keyframes with captions generated by Phi-3-Vision. The fine-tuning used a modified loss function for fine-grained, token-level comparison.

Fuse all sub-models

  • Run ID: Fuse all sub-models
  • Participant: WHU-NERCMS
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 3b11bc9aa8e8948c180dd1335b19e8cc
  • Run description: Fuse all sub-models

gpt

  • Run ID: gpt
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 825e4d3975ac1d1438a66abb5ae6bb7d
  • Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 3 times using GPT-4.1-mini, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.

HPA

  • Run ID: HPA
  • Participant: WHU-NERCMS
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 557faae54948c29961c030be5dbe23cb
  • Run description: BEIT3 BLIP BLIP2 CLIP internVL LaCLIP SLIP diffusion

InternVL3 Baseline

  • Run ID: InternVL3 Baseline
  • Participant: AFRL
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: manual
  • Task: trec2025-avs-main
  • MD5: 8c4492a49502aa210fdcc54e33cd361a
  • Run description: Modified InternVL3 VLLM with basic cosine-similarity distance measurement to establish a baseline.

Paraphrase_T2V_VILA_NVILA_VideoLLaMA3

  • Run ID: Paraphrase_T2V_VILA_NVILA_VideoLLaMA3
  • Participant: NII_UIT
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: manual
  • Task: trec2025-avs-main
  • MD5: ef713cd4afa144a64168eb57b99a4a6d
  • Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B

phi-only

  • Run ID: phi-only
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: a8eea8638d13e26a36f4adbd60e45c6b
  • Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.

phi-subgroup

  • Run ID: phi-subgroup
  • Participant: ncsu-las
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 157c1e2d8a956e4348065814f667579e
  • Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. An overlapping subgroup sort is then applied during re-ranking to limit how far each result can move from its initial rank, and the top 1,000 results are submitted.
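
The "overlapping subgroup sort" is described only briefly; one plausible implementation re-sorts overlapping windows of the initial ranking by the aggregated VLM score, so each shot can only move a bounded distance from its initial rank. The window and stride sizes below are invented for illustration.

```python
def subgroup_sort(ranked_ids, scores, window=4, stride=2):
    """Re-rank within overlapping windows of the initial ranking.
    Each item can move at most a few positions per window, which
    limits how far a VLM score can pull a shot from its initial rank."""
    ids = list(ranked_ids)
    i = 0
    while i < len(ids):
        block = ids[i:i + window]
        block.sort(key=lambda x: -scores[x])  # higher VLM score first
        ids[i:i + window] = block
        i += stride
    return ids

# Item 7 has the best VLM score but starts last; the windows keep it
# from jumping all the way to the top.
scores = [8, 7, 6, 5, 4, 3, 2, 9]
result = subgroup_sort(range(8), scores, window=4, stride=2)
```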

Proportional fusion

  • Run ID: Proportional fusion
  • Participant: WHU-NERCMS
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 22dc56889067adea5825e7578e559812
  • Run description: 6, 16, 4, 10, 5, 3, 3, 3
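
The description is only a weight string. One plausible reading of "proportional fusion" is that each sub-model contributes a share of the final list proportional to its weight; the sketch below is a guess under that assumption, not the team's actual method.

```python
def proportional_fuse(run_lists, weights, depth):
    """Fill the fused list by taking each sub-model's top results in
    proportion to its weight, skipping duplicates."""
    total = sum(weights)
    fused, seen = [], set()
    for run, w in zip(run_lists, weights):
        quota = max(1, round(depth * w / total))
        for item in run[:quota]:
            if item not in seen:
                seen.add(item)
                fused.append(item)
    return fused[:depth]

# Model 1 (weight 3) contributes three results, model 2 (weight 1) one.
runs = [["a", "b", "c", "d"], ["e", "f", "g"]]
fused = proportional_fuse(runs, weights=[3, 1], depth=4)
```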

run1

  • Run ID: run1
  • Participant: SZUAI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: a6156168e8a308bf7df276800d356e6a
  • Run description: rerank run4

run2

  • Run ID: run2
  • Participant: SZUAI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 06640621d51b1cab6363d484cdf527c1
  • Run description: IITV+owl

run3

  • Run ID: run3
  • Participant: SZUAI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 37c050bf8290317957b59ef127179093
  • Run description: IITV+qwen2.5VL

run4

  • Run ID: run4
  • Participant: SZUAI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 7e2837882ae9a8d0d8205cf2ca331499
  • Run description: overlap of IITV, owl3, and qwen
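
"Overlap" of three model rankings suggests preferring shots retrieved by more of the models; the exact combination rule is not stated, so the sketch below (count overlaps, break ties by best rank) is one plausible reading.

```python
from collections import defaultdict

def overlap_fuse(runs):
    """Order shots by how many runs retrieved them (more overlap first),
    breaking ties by the best rank any run gave them."""
    count, best = defaultdict(int), {}
    for run in runs:
        for rank, item in enumerate(run):
            count[item] += 1
            best[item] = min(best.get(item, rank), rank)
    return sorted(count, key=lambda x: (-count[x], best[x]))

# "c" appears in all three toy runs, so it leads the fused list.
runs = [["a", "b", "c"], ["b", "c", "d"], ["c", "e", "a"]]
fused = overlap_fuse(runs)
```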

run_1

  • Run ID: run_1
  • Participant: CERTH-ITI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: e63bf1a69bd9e456beece650bf4954e3
  • Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 4000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.
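
The cross-query normalization can be sketched as z-scoring against similarity statistics gathered from the 2022-2024 queries; the exact scheme is not specified in the description, so the simple z-score below is an assumption.

```python
import numpy as np

def normalize_scores(scores, reference_scores):
    """Z-normalize one query's similarity scores using the mean and
    standard deviation of scores observed for past-year queries,
    making scores comparable across queries."""
    ref = np.asarray(reference_scores, dtype=float)
    return (np.asarray(scores, dtype=float) - ref.mean()) / ref.std()

# Toy reference scores standing in for the 2022-2024 query statistics.
z = normalize_scores([0.5, 0.7], reference_scores=[0.1, 0.3, 0.5, 0.7])
```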

run_2

  • Run ID: run_2
  • Participant: CERTH-ITI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 70d2780a4059ef8ba62709660ade8b26
  • Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 2000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.

run_3

  • Run ID: run_3
  • Participant: CERTH-ITI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 6a83dcf688503a364cf675e02e3c6516
  • Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 1000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.

run_4

  • Run ID: run_4
  • Participant: CERTH-ITI
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 5ca4c9ff36f3b085420cf867b623f1fd
  • Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. No re-ranking is applied. The similarities are normalized with respect to queries from 2022, 2023, and 2024.

T2V_VILA_NVILA_VideoLLaMA3_Aria

  • Run ID: T2V_VILA_NVILA_VideoLLaMA3_Aria
  • Participant: NII_UIT
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-29
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: d40d846e3af71fe8a1d2eec773880d2a
  • Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B Aria-8x3.5B

T2V_VILA_NVILA_VideoLLaMA3_v2

  • Run ID: T2V_VILA_NVILA_VideoLLaMA3_v2
  • Participant: NII_UIT
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 488040411b639f5fa1e616e0feea0623
  • Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B

T2V_VILA_NVILA_VideoLLaMA3_weights

  • Run ID: T2V_VILA_NVILA_VideoLLaMA3_weights
  • Participant: NII_UIT
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 90d016830abfb04740f597b0008726d7
  • Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B

T2V_VILA_v2

  • Run ID: T2V_VILA_v2
  • Participant: NII_UIT
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-28
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 1ef3c2ffcfa4ee51368b046ee507123e
  • Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B

tv25_Meisei_A1

  • Run ID: tv25_Meisei_A1
  • Participant: meisei
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: b22daa76f08d8d38bd68d4ce674d2911
  • Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.
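
The second stage can be sketched as a verification re-rank: candidates the VLM answers YES for keep their first-stage similarity order, and NO candidates are pushed below all of them. The `verify` function here is a toy stand-in for a real VLM call.

```python
def verification_rerank(candidates, sims, verify):
    """Order YES-verified candidates by similarity, ahead of all
    NO-verified candidates (which keep similarity order among themselves)."""
    keyed = [(0 if verify(c) else 1, -sims[c], c) for c in candidates]
    return [c for _, _, c in sorted(keyed)]

sims = {"s1": 0.9, "s2": 0.8, "s3": 0.7}
verify = lambda shot: shot != "s1"   # toy verifier rejects shot s1
order = verification_rerank(["s1", "s2", "s3"], sims, verify)
```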

tv25_Meisei_A2

  • Run ID: tv25_Meisei_A2
  • Participant: meisei
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 42779b09506d3172fb54430a9fa46bdd
  • Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.

tv25_Meisei_A3

  • Run ID: tv25_Meisei_A3
  • Participant: meisei
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 1a061d778cccba5241f934f72e025220
  • Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.

tv25_Meisei_A4

  • Run ID: tv25_Meisei_A4
  • Participant: meisei
  • Track: Adhoc Video Search
  • Year: 2025
  • Submission: 2025-07-27
  • Type: automatic
  • Task: trec2025-avs-main
  • MD5: 99fc92cb3f493b7f8bc9940db3a33590
  • Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.