Expanding Language-Image Pretrained Models for General Video Recognition


Video recognition is used for numerous vision applications, such as micro-video recommendation, sports video analysis, or autonomous driving. Language-image pretraining has shown great potential in addressing this task. However, directly training a language-video model requires large-scale video-text pretraining data.


Image credit: Rawpixel, CC0 Public Domain

A recent paper on arXiv.org proposes a new architecture for video temporal modeling. A novel cross-frame communication attention is introduced for this purpose. It is lightweight and efficient and can be seamlessly plugged into existing language-image pretrained models.
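The article does not spell out the attention design, so the sketch below is only a rough illustration of the idea rather than the authors' implementation: each frame's [CLS] token emits a compact "message" token, the messages attend to one another across frames, and the result is written back, so temporal information is exchanged without full spatio-temporal attention. All names (`CrossFrameMessageAttention`, `to_message`) are hypothetical.

```python
import torch
import torch.nn as nn

class CrossFrameMessageAttention(nn.Module):
    """Rough sketch: exchange information across frames through
    per-frame message tokens instead of full spatio-temporal attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.to_message = nn.Linear(dim, dim)  # [CLS] token -> message token
        self.message_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, tokens, dim); token 0 is [CLS]
        cls_tokens = frame_tokens[:, :, 0]         # (batch, frames, dim)
        msg = self.to_message(cls_tokens)
        msg, _ = self.message_attn(msg, msg, msg)  # frames exchange messages
        out = frame_tokens.clone()
        out[:, :, 0] = self.norm(cls_tokens + msg) # write messages back to [CLS]
        return out

x = torch.randn(2, 8, 50, 512)          # 2 clips, 8 frames, 49 patches + [CLS], width 512
y = CrossFrameMessageAttention(512)(x)  # same shape as x
```

Because only one message token per frame participates in the cross-frame exchange, such a module adds little compute and can sit between the frozen per-frame layers of a pretrained image encoder.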

The researchers also design a video-specific prompting technique that yields instance-level textual representations automatically. Experiments demonstrate the superiority and good generalization ability of the proposed method under various learning configurations.
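Again, the article gives no implementation details of the prompting scheme; the following is a minimal sketch of one plausible reading, in which fixed class-name text embeddings attend to the video's frame features so that each video gets its own adjusted textual representation. Module and parameter names are invented for illustration.

```python
import torch
import torch.nn as nn

class VideoSpecificPrompt(nn.Module):
    """Minimal sketch: condition class-name text embeddings on video
    content via cross-attention, yielding instance-level prompts."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.tensor(0.1))  # gate for the video-aware residual

    def forward(self, text_emb: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, num_classes, dim); frame_feats: (batch, frames, dim)
        delta, _ = self.cross_attn(text_emb, frame_feats, frame_feats)
        return text_emb + self.alpha * delta  # video-conditioned textual representation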

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at this https URL
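For context, recognition in such language-image models reduces to comparing the video embedding against each class's text embedding. A minimal CLIP-style scoring step (a generic illustration, not taken from the paper's code) looks like this:

```python
import torch
import torch.nn.functional as F

def classify(video_emb: torch.Tensor, class_text_embs: torch.Tensor) -> torch.Tensor:
    """CLIP-style zero-shot scoring: cosine similarity between the video
    embedding and each class's text embedding, softmaxed into probabilities."""
    v = F.normalize(video_emb, dim=-1)        # (batch, dim)
    t = F.normalize(class_text_embs, dim=-1)  # (num_classes, dim)
    logits = 100.0 * v @ t.T                  # temperature akin to CLIP's logit scale
    return logits.softmax(dim=-1)             # (batch, num_classes)
```

Because classification is just similarity against text embeddings, new classes can be added at inference time by writing their names as prompts, which is what enables the zero-shot and few-shot results above.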

Research article: Ni, B., et al., "Expanding Language-Image Pretrained Models for General Video Recognition", 2022. Link: https://arxiv.org/abs/2208.02816





