InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

22 Mar 2024 · Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our approach employs a progressive training paradigm that unifies three self- or weakly-supervised learning frameworks: masked video token reconstruction, cross-modal contrastive learning, and next token prediction. Each training stage guides the model to capture a different level of structure and semantic information through its own pretext task. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions, which improves the alignment between video and text. We scale both data and model size for InternVideo2. Through extensive experiments, we validate our designs and demonstrate state-of-the-art performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related captioning, dialogue, and long video understanding benchmarks, highlighting its ability to reason over and comprehend long temporal contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo2/.
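Of the three training objectives named in the abstract, cross-modal contrastive learning is the one most directly tied to the retrieval results reported on this page. The sketch below is a generic symmetric InfoNCE-style loss in plain Python, not the authors' implementation: the function names, the `temperature` value of 0.07, and the toy embeddings are all assumptions made for illustration, and it assumes that row `i` of the video embeddings is paired with row `i` of the text embeddings and that no embedding is the zero vector.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Hypothetical helper: video_emb[i] and text_emb[i] are assumed to be
    a matching pair; every other combination is treated as a negative.
    """
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    n = len(videos)
    # Cosine similarities scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
           for v in videos]
    # Video-to-text: each row should peak on its own column.
    v2t = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-video: each column should peak on its own row.
    t2v = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy batch: matched pairs point the same way, mismatched pairs do not.
aligned_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, mismatched_loss)
```

In this toy batch the aligned pairs give a loss close to zero while the swapped pairs give a much larger one, which is the behaviour the objective is meant to reward.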

Code

opengvlab/internvideo2 (official), 157 stars

opengvlab/internvideo, 977 stars

Tasks

Action Classification

Action Recognition

Audio Classification

Contrastive Learning

Moment Retrieval

Temporal Action Localization

Text to Audio Retrieval

Video Grounding

Video Instance Segmentation

Video Retrieval

Video Understanding

Weakly-supervised Learning

Zero-shot Text to Audio Retrieval

Zero-Shot Video Question Answer

Zero-Shot Video Retrieval

Datasets

UCF101

Kinetics

HMDB51

ActivityNet

Kinetics 400

MSR-VTT

ESC-50

THUMOS14

MSVD

Something-Something V2

Charades-STA

DiDeMo

AudioCaps

YouTube-VIS 2019

Clotho

Kinetics-600

LSMDC

VATEX

MiT

Kinetics-700

HACS

EgoSchema

QVHighlights

MVBench

InternVid

FineAction

Results from the Paper

Ranked #1 on Zero-Shot Video Question Answer on MVBench

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Video Retrieval	ActivityNet	InternVideo2-6B	text-to-video R@1	74.1	#1
Video Retrieval	ActivityNet	InternVideo2-6B	video-to-text R@1	69.7	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	text-to-video R@1	63.2	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	video-to-text R@1	56.5	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	text-to-video R@10	92.5	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	text-to-video R@5	85.6	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	video-to-text R@5	82.8	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-6B	video-to-text R@10	90.3	#1
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	text-to-video R@1	60.4	#2
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	video-to-text R@1	54.8	#2
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	text-to-video R@10	90.8	#2
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	text-to-video R@5	83.9	#2
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	video-to-text R@5	81.5	#2
Zero-Shot Video Retrieval	ActivityNet	InternVideo2-1B	video-to-text R@10	89.5	#2
Action Recognition	ActivityNet	InternVideo2-6B	mAP	95.9	#3
Temporal Action Localization	ActivityNet-1.3	InternVideo2-6B	mAP	41.2	#4
Temporal Action Localization	ActivityNet-1.3	InternVideo2-1B	mAP	40.4	#5
Text to Audio Retrieval	AudioCaps	InternVideo2-6B	R@1	55.2	#1
Zero-shot Text to Audio Retrieval	AudioCaps	InternVideo2-6B	Audio-to-text R@1	37.1	#1
Moment Retrieval	Charades-STA	InternVideo2-1B	R@1 IoU=0.5	68.36	#2
Moment Retrieval	Charades-STA	InternVideo2-1B	R@1 IoU=0.7	45.03	#2
Moment Retrieval	Charades-STA	InternVideo2-6B	R@1 IoU=0.5	70.03	#1
Moment Retrieval	Charades-STA	InternVideo2-6B	R@1 IoU=0.7	48.95	#1
Zero-shot Text to Audio Retrieval	Clotho	InternVideo2-6B	text-to-audio R@1	17.4	#1
Text to Audio Retrieval	Clotho	InternVideo2-6B	R@1	27.2	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	text-to-video R@1	57.9	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	text-to-video R@5	80.0	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	text-to-video R@10	84.6	#2
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	video-to-text R@1	57.1	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	video-to-text R@5	79.9	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-6B	video-to-text R@10	85.0	#1
Video Retrieval	DiDeMo	InternVideo2-6B	text-to-video R@1	74.2	#1
Video Retrieval	DiDeMo	InternVideo2-6B	video-to-text R@1	71.9	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	text-to-video R@1	57.0	#2
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	text-to-video R@5	80.0	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	text-to-video R@10	85.1	#1
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	video-to-text R@1	54.3	#2
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	video-to-text R@5	77.2	#2
Zero-Shot Video Retrieval	DiDeMo	InternVideo2-1B	video-to-text R@10	83.5	#3
Zero-Shot Video Question Answer	EgoSchema (fullset)	InternVideo2-6B	Accuracy	41.1	#4
Audio Classification	ESC-50	InternVideo2	Top-1 Accuracy	98.6	#1
Audio Classification	ESC-50	InternVideo2	Pre-training dataset	Multiple	#1
Audio Classification	ESC-50	InternVideo2	Accuracy (5-fold)	98.6	#1
Temporal Action Localization	FineAction	InternVideo2-6B	mAP	27.7	#2
Temporal Action Localization	HACS	InternVideo2-1B	Average-mAP	42.4	#4
Action Recognition	HACS	InternVideo2-6B	Top 1 Accuracy	97.0	#1
Temporal Action Localization	HACS	InternVideo2-6B	Average-mAP	43.3	#2
Action Classification	Kinetics-400	InternVideo2-6B	Acc@1	92.1	#1
Action Classification	Kinetics-400	InternVideo2-1B	Acc@1	91.6	#2
Action Classification	Kinetics-600	InternVideo2-6B	Top-1 Accuracy	91.9	#1
Action Classification	Kinetics-600	InternVideo2-1B	Top-1 Accuracy	91.6	#3
Action Classification	Kinetics-700	InternVideo2-1B	Top-1 Accuracy	85.4	#2
Action Classification	Kinetics-700	InternVideo2-6B	Top-1 Accuracy	85.9	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	text-to-video R@1	33.8	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	video-to-text R@1	30.1	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	text-to-video R@5	55.9	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	text-to-video R@10	62.2	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	video-to-text R@5	47.7	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-6B	video-to-text R@10	54.8	#1
Video Retrieval	LSMDC	InternVideo2-6B	text-to-video R@1	46.4	#1
Video Retrieval	LSMDC	InternVideo2-6B	video-to-text R@1	46.7	#1
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	text-to-video R@1	32.0	#2
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	video-to-text R@1	27.3	#2
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	text-to-video R@5	52.4	#2
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	text-to-video R@10	59.4	#2
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	video-to-text R@5	44.2	#2
Zero-Shot Video Retrieval	LSMDC	InternVideo2-1B	video-to-text R@10	51.6	#2
Action Classification	MiT	InternVideo2-6B	Top 1 Accuracy	51.2	#1
Action Classification	MiT	InternVideo2-1B	Top 1 Accuracy	50.9	#2
Video Retrieval	MSR-VTT	InternVideo2-6B	text-to-video R@1	62.8	#2
Video Retrieval	MSR-VTT	InternVideo2-6B	video-to-text R@1	60.2	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	text-to-video R@1	51.9	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	text-to-video R@5	75.3	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	text-to-video R@10	82.5	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	video-to-text R@1	50.9	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	video-to-text R@5	73.4	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-1B	video-to-text R@10	81.8	#2
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	text-to-video R@1	55.9	#1
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	text-to-video R@5	78.3	#1
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	text-to-video R@10	85.1	#1
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	video-to-text R@1	53.7	#1
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	video-to-text R@5	77.5	#1
Zero-Shot Video Retrieval	MSR-VTT	InternVideo2-6B	video-to-text R@10	84.1	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	text-to-video R@1	59.3	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	video-to-text R@1	83.1	#2
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	text-to-video R@5	84.4	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	text-to-video R@10	89.6	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	video-to-text R@5	94.2	#2
Zero-Shot Video Retrieval	MSVD	InternVideo2-6B	video-to-text R@10	97.0	#2
Video Retrieval	MSVD	InternVideo2-6B	text-to-video R@1	61.4	#1
Video Retrieval	MSVD	InternVideo2-6B	video-to-text R@1	85.2	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	text-to-video R@1	58.1	#2
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	video-to-text R@1	83.3	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	text-to-video R@5	83.0	#2
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	text-to-video R@10	88.4	#2
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	video-to-text R@5	94.3	#1
Zero-Shot Video Retrieval	MSVD	InternVideo2-1B	video-to-text R@10	96.9	#3
Zero-Shot Video Question Answer	MVBench	InternVideo2-1B	Accuracy	60.9	#1
Video Grounding	QVHighlights	InternVideo2-1B	R@1,IoU=0.5	70.00	#2
Video Grounding	QVHighlights	InternVideo2-1B	R@1,IoU=0.7	54.45	#2
Video Grounding	QVHighlights	InternVideo2-6B	R@1,IoU=0.5	71.42	#1
Video Grounding	QVHighlights	InternVideo2-6B	R@1,IoU=0.7	56.45	#1
Action Recognition	Something-Something V2	InternVideo2-6B	Top-1 Accuracy	77.5	#1
Action Recognition	Something-Something V2	InternVideo2-1B	Top-1 Accuracy	77.1	#4
Temporal Action Localization	THUMOS’14	InternVideo2-6B	Avg mAP (0.3:0.7)	72.0	#3
Temporal Action Localization	THUMOS’14	InternVideo2-1B	Avg mAP (0.3:0.7)	69.8	#6
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	text-to-video R@1	70.4	#2
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	video-to-text R@1	85.4	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	text-to-video R@5	93.4	#2
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	text-to-video R@10	96.9	#2
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	video-to-text R@5	97.6	#2
Zero-Shot Video Retrieval	VATEX	InternVideo2-1B	video-to-text R@10	99.1	#2
Video Retrieval	VATEX	InternVideo2-6B	text-to-video R@1	75.5	#3
Video Retrieval	VATEX	InternVideo2-6B	video-to-text R@1	89.3	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	text-to-video R@1	71.5	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	video-to-text R@1	85.3	#2
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	text-to-video R@5	94.0	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	text-to-video R@10	97.1	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	video-to-text R@5	97.9	#1
Zero-Shot Video Retrieval	VATEX	InternVideo2-6B	video-to-text R@10	99.3	#1
Video Instance Segmentation	YouTube-VIS validation	Mask2Former(InternVideo2-6B)	mask AP	64.2	#10
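
Many rows in the table above report R@K (recall at K) for text-to-video and video-to-text retrieval. As a rough guide to how such a number is usually derived, the sketch below computes R@K from a query-by-candidate similarity matrix in plain Python. It is an illustration under stated assumptions, not the evaluation code behind this leaderboard: `recall_at_k`, `transpose`, and the toy matrix are hypothetical, and it assumes exactly one correct candidate per query (query `i` matches candidate `i`), which is a simplification for benchmarks where a video has several captions.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose correct candidate ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j, and
    candidate i is assumed to be the single correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap the query and candidate roles (text-to-video vs. video-to-text)."""
    return [list(col) for col in zip(*sim)]


# Toy 3 x 3 example: rows are text queries, columns are videos.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

t2v_r1 = recall_at_k(sim, 1)             # 2 of 3 queries hit: about 66.67
t2v_r2 = recall_at_k(sim, 2)             # all 3 queries hit: 100.0
v2t_r1 = recall_at_k(transpose(sim), 1)  # 1 of 3 queries hit: about 33.33
print(round(t2v_r1, 2), round(t2v_r2, 2), round(v2t_r1, 2))
```

The same matrix yields different scores in the two directions, which is why the table lists text-to-video and video-to-text recall separately.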

