
Google AI Introduces VideoPrism: A General-Purpose Video Encoder that Tackles Diverse Video Understanding Tasks with a Single Frozen Model

Google researchers tackle the challenge of comprehensively understanding diverse video content by introducing a novel encoder model, VideoPrism. Existing video understanding models have struggled with diverse tasks involving complex scenes and motion-centric reasoning, and have shown inconsistent performance across different benchmarks. The researchers aimed to develop a general-purpose video encoder that can effectively handle a wide range of video understanding tasks with minimal adaptation.

Existing video understanding models have made significant progress but still fall short. Some models leverage text associated with videos for learning, while others focus solely on video signals, which limits how effectively both appearance and motion cues are captured. VideoPrism proposes an approach that integrates both video and text modalities during pretraining. It introduces a two-stage pretraining framework that combines contrastive learning with masked video modeling. This strategy enables the model to learn semantic representations from both video-text pairs and video-only data.
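To make the first stage concrete, here is a toy sketch of a symmetric video-text contrastive objective of the kind described above. The function name, batch size, embedding dimension, and temperature value are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def info_nce_loss(video_emb, text_emb, temperature=0.07):
    """Toy symmetric InfoNCE loss over paired video/text embeddings.

    Matching video-text pairs sit on the diagonal of the similarity
    matrix; the loss pulls those pairs together and pushes the rest apart.
    (Hypothetical sketch; hyperparameters are illustrative.)
    """
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matched pairs on the diagonal

    def xent(l):
        # numerically stable softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average the video->text and text->video directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 16))
loss = info_nce_loss(v, v.copy())  # identical pairs -> near-zero loss
```

In a real training loop the embeddings would come from the video encoder and a text encoder, and the loss would be minimized by gradient descent; the sketch only shows the shape of the objective.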

VideoPrism’s architecture is based on the Vision Transformer (ViT), with modifications for space-time factorization. During pretraining, the model first aligns video and text embeddings through contrastive learning, then continues training on video-only data using masked video modeling. This two-stage approach is augmented with global-local distillation and token shuffling to improve performance. Extensive evaluations across diverse video understanding tasks show that VideoPrism achieves state-of-the-art performance on 30 out of 33 benchmarks, demonstrating robust generalizability and effectiveness in capturing both appearance and motion cues.
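The second stage involves masking most space-time tokens and shuffling the visible ones. The sketch below illustrates that preprocessing step only; the function name, mask ratio, and token shapes are illustrative assumptions rather than the paper’s actual implementation.

```python
import numpy as np

def mask_and_shuffle(tokens, mask_ratio=0.75, rng=None):
    """Toy sketch of masked-video-modeling input preparation.

    Drops a high ratio of space-time tokens, then shuffles the
    surviving ones so a decoder cannot rely on token order alone,
    in the spirit of the token-shuffling trick mentioned above.
    `tokens` is (num_tokens, dim); all names are hypothetical.
    """
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_keep = max(1, int(n * (1 - mask_ratio)))   # tokens left visible
    perm = rng.permutation(n)
    keep_idx = perm[:n_keep]                     # random visible subset
    visible = tokens[keep_idx]
    order = rng.permutation(n_keep)              # shuffle visible order
    return visible[order], keep_idx, order

rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 8))                  # 16 space-time tokens, dim 8
vis, keep_idx, order = mask_and_shuffle(toks, mask_ratio=0.75, rng=rng)
```

The model would then be trained to reconstruct representations of the masked tokens from the shuffled visible subset, which is what forces it to learn appearance and motion structure rather than positional shortcuts.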

In summary, Google researchers address the problem of building a foundational video model with their state-of-the-art VideoPrism for comprehensive video understanding. The proposed method combines contrastive learning with masked video modeling in a two-stage pretraining framework, resulting in a model that excels across a wide range of video understanding tasks.


Check out the Paper. All credit for this research goes to the researchers of this project.



Pragati Jhunjhunwala is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Kharagpur. She is a tech enthusiast and has a keen interest in software and data science applications. She is always reading about developments in different fields of AI and ML.

