Runs - Adhoc Video Search 2025¶
BLIP BLIP2 CLIP LaCLIP SLIP diffusion¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: BLIP BLIP2 CLIP LaCLIP SLIP diffusion
- Participant: WHU-NERCMS
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 9132f6cfa8d3e46c38d35939ee46bfc7
- Run description: 16:4:10:3:3:3
ccilab1¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: ccilab1
- Participant: ccilab
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-20
- Type: automatic
- Task: trec2025-avs-main
- MD5: 94223a9f27bd7d9b26222373aecbb4f3
- Run description: This run is obtained by computing similarities between shots and each topic using OpenAI CLIP's image and text encoders.
clap¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: clap
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 218b44a5a549e2b4228575f70c3f201e
- Run description: GPT-4.1-mini decomposes each query into visual and (non-speech) audio components. The visual component is searched using SigLIP2-base-patch16-naflex embeddings, and the audio component is searched using CLAP embeddings. The normalized scores from both search techniques are added together for the final ranking. If the LLM decided there was no audio component, only the SigLIP2 embeddings are used.
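The fusion step in the run above adds normalized scores from two rankers. A minimal sketch of that step follows; the run description does not state which normalization was used, so min-max scaling is an assumption here:

```python
import numpy as np

def min_max_normalize(scores):
    """Scale scores to [0, 1]; one common choice for fusing heterogeneous rankers."""
    s = np.asarray(scores, dtype=float)
    lo, hi = s.min(), s.max()
    if hi == lo:
        return np.zeros_like(s)
    return (s - lo) / (hi - lo)

def fuse(visual_scores, audio_scores=None):
    """Add normalized visual and audio scores per shot; fall back to the
    visual scores alone when no audio component was found (audio_scores is None)."""
    fused = min_max_normalize(visual_scores)
    if audio_scores is not None:
        fused = fused + min_max_normalize(audio_scores)
    return fused
```

Min-max scaling keeps both rankers on an equal footing before addition; z-scoring would be an equally plausible reading of "normalized scores".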
decomp¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: decomp
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: 9d02b728d3a0dbc6058437541a35b868
- Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is decomposed into visual components, with each component expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged and merged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.
fg-clip¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: fg-clip
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 99c2723085e2afab2f4646c9f065fc32
- Run description: This run uses FG-CLIP embeddings to retrieve the most relevant keyframes. FG-CLIP is a fine-tuned version of OpenAI's clip-vit-base-patch32, trained on V3C1 keyframes with captions generated by Phi-3-Vision. The fine-tuning used a modified loss function for fine-grained, token-level comparison.
Fuse all sub-models¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: Fuse all sub-models
- Participant: WHU-NERCMS
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 3b11bc9aa8e8948c180dd1335b19e8cc
- Run description: Fuse all sub-models
gpt¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: gpt
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 825e4d3975ac1d1438a66abb5ae6bb7d
- Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 3 times using GPT-4.1-mini, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.
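Several ncsu-las runs merge 100 expanded-query text embeddings into a single query vector and retrieve by similarity. A minimal sketch of that step, assuming the averaged vector is L2-normalized (the description does not specify this):

```python
import numpy as np

def merge_query_variants(variant_embeddings):
    """Average the text embeddings of all query variants into one query
    vector, then L2-normalize so cosine similarity against unit-norm
    keyframe embeddings reduces to a dot product."""
    v = np.mean(np.asarray(variant_embeddings, dtype=float), axis=0)
    return v / np.linalg.norm(v)

def retrieve(query_vec, keyframe_matrix, k=2500):
    """Return the indices of the top-k keyframes by similarity, descending.
    keyframe_matrix: one embedding per row."""
    sims = keyframe_matrix @ query_vec
    return np.argsort(-sims)[:k]
```

Averaging variant embeddings acts as a crude form of query smoothing: paraphrase-specific noise cancels out while the shared semantics dominate the merged vector.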
HPA¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: HPA
- Participant: WHU-NERCMS
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 557faae54948c29961c030be5dbe23cb
- Run description: BEIT3 BLIP BLIP2 CLIP internVL LaCLIP SLIP diffusion
InternVL3 Baseline¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: InternVL3 Baseline
- Participant: AFRL
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: manual
- Task: trec2025-avs-main
- MD5: 8c4492a49502aa210fdcc54e33cd361a
- Run description: Modified InternVL3 VLLM with basic cosine-similarity distance measurement to establish a baseline.
Paraphrase_T2V_VILA_NVILA_VideoLLaMA3¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: Paraphrase_T2V_VILA_NVILA_VideoLLaMA3
- Participant: NII_UIT
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: manual
- Task: trec2025-avs-main
- MD5: ef713cd4afa144a64168eb57b99a4a6d
- Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B
phi-only¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: phi-only
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: a8eea8638d13e26a36f4adbd60e45c6b
- Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. The final results are re-ranked based on these aggregated judgments, and the top 1,000 are submitted.
phi-subgroup¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: phi-subgroup
- Participant: ncsu-las
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 157c1e2d8a956e4348065814f667579e
- Run description: We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is expanded to 100 variants using GPT-4.1-mini, and their text embeddings are averaged into a single query vector. Initial retrieval is done directly using SigLIP similarity, returning the top 2,500 candidates. Each candidate shot is then evaluated 10 times using Phi-3.5-Vision, and the scores are averaged. An overlapping subgroup sort is then applied during re-ranking to limit how far each result can move from its initial rank, and the top 1,000 results are submitted.
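The "overlapping subgroup sort" in the run above is described only at a high level. One plausible reading is a single pass of windowed re-sorting over the initial ranking; the `window` and `stride` parameters below are hypothetical, not taken from the run:

```python
def overlapping_subgroup_sort(items, scores, window=20, stride=10):
    """Re-rank `items` (given in initial rank order) by `scores`, but only
    within overlapping windows. A high-scoring item can rise at most about
    one window length past a window boundary per pass, though a low-scoring
    item may still sink through successive overlapping windows."""
    order = list(range(len(items)))
    start = 0
    while start < len(order):
        end = min(start + window, len(order))
        # Re-sort just this slice of the current ordering by descending score.
        order[start:end] = sorted(order[start:end], key=lambda i: -scores[i])
        start += stride
    return [items[i] for i in order]
```

The effect is a compromise between trusting the VLM judgments and trusting the initial SigLIP ranking: a shot the VLM loves cannot leap from rank 2,400 to rank 1 on a single noisy judgment.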
Proportional fusion¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: Proportional fusion
- Participant: WHU-NERCMS
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 22dc56889067adea5825e7578e559812
- Run description: 6, 16, 4, 10, 5, 3, 3, 3
run1¶
Participants | Input | sample_eval | Appendix
- Run ID: run1
- Participant: SZUAI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: a6156168e8a308bf7df276800d356e6a
- Run description: rerank run4
run2¶
Participants | Input | sample_eval | Appendix
- Run ID: run2
- Participant: SZUAI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: 06640621d51b1cab6363d484cdf527c1
- Run description: IITV+owl
run3¶
Participants | Input | sample_eval | Appendix
- Run ID: run3
- Participant: SZUAI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: 37c050bf8290317957b59ef127179093
- Run description: IITV+qwen2.5VL
run4¶
Participants | Input | sample_eval | Appendix
- Run ID: run4
- Participant: SZUAI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: 7e2837882ae9a8d0d8205cf2ca331499
- Run description: overlap of IITV, owl3, and qwen
run_1¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: run_1
- Participant: CERTH-ITI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: e63bf1a69bd9e456beece650bf4954e3
- Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 4000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.
run_2¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: run_2
- Participant: CERTH-ITI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 70d2780a4059ef8ba62709660ade8b26
- Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 2000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.
run_3¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: run_3
- Participant: CERTH-ITI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 6a83dcf688503a364cf675e02e3c6516
- Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. Retrieved results are re-ranked using cross-modal similarities computed by Qwen-VL 2.5 across a depth of 1000 videos. The similarities are normalized with respect to queries from 2022, 2023, and 2024.
run_4¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: run_4
- Participant: CERTH-ITI
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 5ca4c9ff36f3b085420cf867b623f1fd
- Run description: Textual queries are expanded using 20 rephrasings generated by the LLaMA 3.2 large language model to enrich semantic understanding. No re-ranking is applied. The similarities are normalized with respect to queries from 2022, 2023, and 2024.
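All four CERTH-ITI runs normalize similarities "with respect to queries from 2022, 2023, and 2024". The descriptions do not state the statistic, so the sketch below assumes z-score normalization against the pooled score distribution of those prior-year queries:

```python
import numpy as np

def normalize_against_history(query_scores, historical_scores):
    """Z-score the current query's similarities using the mean and standard
    deviation of similarities collected for past-year queries, making scores
    comparable across queries. The exact statistic used by the runs is not
    stated; z-scoring is one common choice for this calibration."""
    hist = np.asarray(historical_scores, dtype=float)
    mu, sigma = hist.mean(), hist.std()
    return (np.asarray(query_scores, dtype=float) - mu) / sigma
```

Calibrating against a fixed historical pool keeps a query with generically high raw similarities from dominating any cross-query score fusion.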
T2V_VILA_NVILA_VideoLLaMA3_Aria¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: T2V_VILA_NVILA_VideoLLaMA3_Aria
- Participant: NII_UIT
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-29
- Type: automatic
- Task: trec2025-avs-main
- MD5: d40d846e3af71fe8a1d2eec773880d2a
- Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B Aria-8x3.5B
T2V_VILA_NVILA_VideoLLaMA3_v2¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: T2V_VILA_NVILA_VideoLLaMA3_v2
- Participant: NII_UIT
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 488040411b639f5fa1e616e0feea0623
- Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B
T2V_VILA_NVILA_VideoLLaMA3_weights¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: T2V_VILA_NVILA_VideoLLaMA3_weights
- Participant: NII_UIT
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 90d016830abfb04740f597b0008726d7
- Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B NVILA-15B VideoLLaMA3-7B
T2V_VILA_v2¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: T2V_VILA_v2
- Participant: NII_UIT
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-28
- Type: automatic
- Task: trec2025-avs-main
- MD5: 1ef3c2ffcfa4ee51368b046ee507123e
- Run description: InternVL-G BEiT-3 CLIP-L/14 DataComp CLIP-H/14 Laion2B CLIP-H/14 DFN5b OpenAI RN101 BLIP-2 XCLIP InternVideo2 TeachCLIP Side4Video CLIP4Clip TS2Net VILA-1.5-40B
tv25_Meisei_A1¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: tv25_Meisei_A1
- Participant: meisei
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: b22daa76f08d8d38bd68d4ce674d2911
- Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.
tv25_Meisei_A2¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: tv25_Meisei_A2
- Participant: meisei
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 42779b09506d3172fb54430a9fa46bdd
- Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.
tv25_Meisei_A3¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: tv25_Meisei_A3
- Participant: meisei
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 1a061d778cccba5241f934f72e025220
- Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.
tv25_Meisei_A4¶
Participants | Proceedings | Input | sample_eval | Appendix
- Run ID: tv25_Meisei_A4
- Participant: meisei
- Track: Adhoc Video Search
- Year: 2025
- Submission: 2025-07-27
- Type: automatic
- Task: trec2025-avs-main
- MD5: 99fc92cb3f493b7f8bc9940db3a33590
- Run description: We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification.