{"id":746077,"date":"2021-05-17T10:35:34","date_gmt":"2021-05-17T17:35:34","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?p=746077"},"modified":"2021-05-17T15:12:04","modified_gmt":"2021-05-17T22:12:04","slug":"microsoft-and-nvidia-introduce-parameter-efficient-multimodal-transformers-for-video-representation-learning","status":"publish","type":"post","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/microsoft-and-nvidia-introduce-parameter-efficient-multimodal-transformers-for-video-representation-learning\/","title":{"rendered":"Microsoft and NVIDIA introduce parameter-efficient multimodal transformers for video representation learning"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><img decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-scaled.jpg\" alt=\"\"\/><\/figure>\n\n\n\n<p>Understanding video is one of the most challenging problems in AI, and\u00a0an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. 
Recently, transformers have been successful in <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/expanding-scene-and-language-understanding-with-large-scale-pre-training-and-a-unified-architecture\/\">vision-and-language tasks<\/a> such as <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/novel-object-captioning-surpasses-human-performance-on-benchmarks\/\">image captioning<\/a> and <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/\">visual question answering<\/a> due to their ability to learn <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/\">multimodal contextual representations<\/a>. However, training multimodal transformers end-to-end is difficult because of their excessive memory requirements. In fact, most existing vision-and-language transformers rely on pretrained language transformers to train successfully.<br><br>Today, in collaboration with NVIDIA Research, we are excited to announce our work: \u201c<a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/parameter-efficient-multimodal-transformers-for-video-representation-learning\/\" target=\"_blank\" rel=\"noreferrer noopener\">Parameter Efficient Multimodal Transformers for Video Representation Learning<\/a>.\u201d In this paper, which was accepted at the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/iclr.cc\/\">International Conference on Learning Representations<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (ICLR 2021),\u00a0we propose an approach to reduce the size of multimodal transformers by up to 97 percent through weight-sharing mechanisms.\u00a0This allows us to train our model end-to-end on video sequences of 30 seconds (480 frames sampled at 16 frames per 
second), which is a significant departure from most existing video inference models that process clips shorter than 10 seconds, and achieve competitive performance on a variety of video understanding tasks.<\/p>\n\n\n\n<h2 id=\"aggressive-parameter-sharing-within-transformers\">Aggressive parameter sharing within transformers<\/h2>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"522\" height=\"235\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure1_ICLRmultimodal.png\" alt=\"A graphic depicting audio and video content items passing through an audio transformer layer and a video transformer layer, respectively, before being combined while passing through a multimodal transformer layer\" class=\"wp-image-746746\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure1_ICLRmultimodal.png 522w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure1_ICLRmultimodal-300x135.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure1_ICLRmultimodal-16x7.png 16w\" sizes=\"auto, (max-width: 522px) 100vw, 522px\" \/><figcaption>Figure 1. Our model consists of audio and visual convolutional neural networks, audio and visual transformers, and a multimodal transformer.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Our model consists of five components: audio and visual convolutional neural networks (CNNs), audio and visual transformers, and a multimodal transformer. The two CNNs encode audio and visual signals from one-second clips, respectively, while the three transformers encode audio, visual, and audio-visual signals from the entire input sequence (30 seconds). The whole model contains 155 million weight parameters and the three transformers consume 128 million parameters, or 82.6 percent of the total. 
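The data flow through the five components can be sketched with stand-in encoders. This is a minimal, illustrative sketch only: the `nn.Linear` stand-ins for the CNNs, the embedding size, layer counts, and input shapes are our assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

d, T = 512, 30  # illustrative embedding size; 30 one-second clips

# Stand-ins for the audio/visual CNNs: each maps one 1-second clip to a vector.
audio_cnn = nn.Linear(128, d)      # e.g. a pooled spectrogram per clip (toy)
video_cnn = nn.Linear(1024, d)     # e.g. pooled frame features per clip (toy)

def encoder():
    return nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=2)

audio_tf, video_tf, multimodal_tf = encoder(), encoder(), encoder()

a = audio_tf(audio_cnn(torch.randn(1, T, 128)))    # contextualize 30 audio clips
v = video_tf(video_cnn(torch.randn(1, T, 1024)))   # contextualize 30 visual clips
av = multimodal_tf(torch.cat([a, v], dim=1))       # fuse both streams jointly
print(av.shape)  # torch.Size([1, 60, 512])
```

The point of the sketch is the division of labor: per-second CNN encoders produce clip-level vectors, and the three transformers model long-range structure over the full 30-second sequence.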
Training on 30 seconds of video is GPU memory-intensive, requiring small batch sizes and long training times.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"877\" height=\"493\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal.png\" alt=\"A four-part graphic comparing the gradual reduction in parameters from 128 million with no sharing, to 22 million with layer-wise sharing, to 15 million with partial sharing, to 4 million via partial sharing and low-rank factorization\" class=\"wp-image-746749\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal.png 877w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-300x169.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-768x432.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-16x9.png 16w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-655x368.png 655w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-343x193.png 343w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Figure2_Multimodal-640x360.png 640w\" sizes=\"auto, (max-width: 877px) 100vw, 877px\" \/><figcaption>Figure 2. We share parameters across both transformers and layers within each transformer. This results in a 97 percent reduction of parameters.<\/figcaption><\/figure>\n\n\n\n<p>To address this, we reduce the model size by sharing the weight parameters using two strategies. The first strategy shares weights across layers within each transformer, treating a transformer as an unrolled recurrent network. 
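This first strategy can be sketched in a few lines: one transformer layer is applied repeatedly, so depth no longer costs parameters. The dimensions and depth below are hypothetical, and this is ALBERT-style layer sharing in spirit rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Apply ONE transformer layer num_layers times (the unrolled-RNN view)."""
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.layer(x)  # same weights reused at every depth
        return x

params = lambda m: sum(p.numel() for p in m.parameters())

shared = SharedLayerEncoder(num_layers=6)
# Baseline: an unshared stack clones the layer, paying for each copy.
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(512, 8, batch_first=True), num_layers=6)

print(params(unshared) / params(shared))  # ~6.0: parameter count is depth-independent
```

With sharing, growing the model deeper (or wider in sequence length) leaves the parameter budget untouched, which is what makes end-to-end training on long videos feasible.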
As shown in Figure 2(b), this reduces the parameters by 83 percent (from 128 million to 22 million). The second strategy involves partial weight sharing with low-rank factorization, where we factorize each weight matrix of the transformer into the form \\(W=U\u03a3V\\) and share \\(U\\) across transformers while keeping \\(\u03a3 V\\) private to each transformer. This strategy, depicted in Figure 2(c), helps our model capture the underlying dynamics between modalities efficiently: \\(U\\) models modality-shared dynamics, while \\(\u03a3\\) and \\(V\\) model modality-specific dynamics. The factorization and partial sharing achieve an 88 percent parameter reduction. We further reduce the parameters by imposing a low-rank constraint on \\(\u03a3\\), achieving a 97 percent reduction, from 128 million parameters to just 4 million.<\/p>\n\n\n\n<h2 id=\"webdataset-an-efficient-pytorch-i-o-library-for-large-scale-datasets\">WebDataset: An efficient PyTorch I\/O library for large-scale datasets<\/h2>\n\n\n\n<p>With this optimized training procedure, we can now train on petabyte-scale datasets at high speed. To meet the high I\/O rates required by the algorithm, we have developed in parallel <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/tmbdev\/webdataset\">WebDataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a new high-performance I\/O framework for PyTorch. It provides efficient access to datasets stored in POSIX tar archives and uses only sequential\/streaming data access. 
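A WebDataset shard is just a POSIX tar archive whose consecutive files form the samples, so it can be written and read with purely sequential access. A minimal standard-library sketch of that layout (the filenames and metadata fields are illustrative; the webdataset package itself adds decoding and PyTorch DataLoader integration on top):

```python
import io
import json
import tarfile

def write_shard(path, samples):
    """Write (key, metadata) samples as consecutive files in a tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, meta in samples:
            payload = json.dumps(meta).encode()
            info = tarfile.TarInfo(name=f"{key}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Stream samples back in order -- no seeking, so pipes and object stores work."""
    with tarfile.open(path, "r|") as tar:          # "r|" enforces sequential reads
        for member in tar:
            yield member.name, json.loads(tar.extractfile(member).read())

write_shard("shard-000000.tar", [("000000", {"label": "dog_bark"}),
                                 ("000001", {"label": "siren"})])
samples = list(read_shard("shard-000000.tar"))
print(samples[0])  # ('000000.json', {'label': 'dog_bark'})
```

Because nothing here requires random access, the same shards can be streamed from local disk, HTTP, or cloud object storage at full sequential throughput.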
This brings substantial performance advantages in many compute environments, and it is essential for large-scale training on video datasets.<\/p>\n\n\n\n<h2 id=\"content-aware-negative-sampling\">Content-aware negative sampling<\/h2>\n\n\n\n<p>We train our model using contrastive learning objectives. In contrastive learning, finding informative negative samples is essential for the model\u2019s convergence. We develop a novel content-aware negative sampling strategy that favors negatives that are sufficiently similar to the positive instance. Specifically, we calculate normalized pairwise similarities in the CNN embedding space and treat them as sampling probabilities, so that the more similar a sample is to the positive, the higher its chance of being selected as a negative.<\/p>\n\n\n\n<h2 id=\"downstream-performance\">Downstream performance<\/h2>\n\n\n\n<p>Our model achieves competitive performance across several benchmarks that involve audio and\/or visual modalities from short or long clips. Table 1 shows different components of our model generalizing well across tasks: for instance, the visual CNN for short action recognition (UCF-101), the audio CNN for short environmental sound classification (ESC-50), and the transformers for long audio-visual video classification (Charades and Kinetics-Sounds). 
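The content-aware sampling rule described above can be sketched as follows. The softmax normalization, toy embeddings, and function name are our assumptions for illustration; the paper's exact normalization may differ.

```python
import torch
import torch.nn.functional as F

def content_aware_negatives(embeddings, pos_idx, num_neg=4):
    """Sample negatives with probability increasing in similarity to the positive."""
    emb = F.normalize(embeddings, dim=1)      # unit-norm CNN embeddings
    sim = emb @ emb[pos_idx]                  # cosine similarity to the positive
    sim[pos_idx] = float("-inf")              # never pick the positive itself
    probs = torch.softmax(sim, dim=0)         # similarities -> sampling probabilities
    return torch.multinomial(probs, num_neg, replacement=False)

torch.manual_seed(0)
clip_embeddings = torch.randn(32, 128)        # toy stand-in for CNN clip features
negatives = content_aware_negatives(clip_embeddings, pos_idx=0)
print(negatives.shape, 0 in negatives.tolist())  # torch.Size([4]) False
```

Uniform sampling mostly returns easy, dissimilar negatives; weighting by similarity concentrates the contrastive loss on the hard negatives that actually shape the decision boundary.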
Our results demonstrate the versatility of this model: once pretrained, we can select the components appropriate for each downstream scenario.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"A table comparing results of short video classification on UCF-101, short audio classification on ESC-50, and long video classification on charades (mAP) and kinetics-sounds\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Table1_MultiModalBlogMSR.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Table1_MultiModalBlogMSR.png\" alt=\"A table comparing results of short video classification on UCF-101, short audio classification on ESC-50, and long video classification on charades (mAP) and kinetics-sounds\" class=\"wp-image-746752\" width=\"771\" height=\"212\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Table1_MultiModalBlogMSR.png 621w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Table1_MultiModalBlogMSR-300x83.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/Table1_MultiModalBlogMSR-16x4.png 16w\" sizes=\"auto, (max-width: 771px) 100vw, 771px\" \/><\/a><figcaption>Table 1. (a) Short video classification results on UCF-101 (mean accuracy, percentage). (b) Short audio classification results on ESC-50 (mean accuracy, percentage). 
(c) Long video classification results on Charades (mAP) and Kinetics-Sounds (KS; top-1\/5 accuracy, percentage).<\/figcaption><\/figure><\/div>\n\n\n\n<h2 id=\"looking-forward\">Looking forward<\/h2>\n\n\n\n<p>Large-parameter transformer models are producing impressive results on numerous challenging tasks, but there is a growing concern that conducting research in this direction is limited to institutions with large compute resources. In this work, we presented an approach to reduce the size of multimodal transformers by up to 97 percent and still achieve competitive results on standard video benchmarks, making them more accessible to institutions with limited compute resources. To accelerate research in this direction, we have open-sourced the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/sangho-vision\/avbert\">code and will be releasing the pretrained model checkpoints on GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> in the coming weeks. We hope that our research will open up opportunities in large-scale pretraining to institutions with limited compute resources.<\/p>\n\n\n\n<h3 id=\"acknowledgement\">Acknowledgement<\/h3>\n\n\n\n<p>This research was conducted by an amazing team of researchers from Seoul National University (Sangho Lee, Youngjae Yu, and Gunhee Kim), NVIDIA Research (Thomas Breuel and Jan Kautz), and Microsoft Research (Yale Song).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Understanding video is one of the most challenging problems in AI, and\u00a0an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and their long-range statistical dependencies from audio-visual signals. 
Recently, transformers have been successful in vision-and-language tasks such as image captioning and visual question answering due to their ability to [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":746761,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Yale Song","user_id":"37422"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556,13562],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-746077","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[199565],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[694407],"related-projects":[],"related-events":[725710],"related-researchers":[],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-960x540.jpg\" class=\"img-object-cover\" alt=\"A graphic depicting audio and video content items passing through an audio transformer layer and a video transformer layer, respectively, before being combined while passing through a multimodal transformer layer\" decoding=\"async\" loading=\"lazy\" 
srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-960x540.jpg 960w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-300x169.jpg 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-1024x577.jpg 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-768x433.jpg 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-1536x865.jpg 1536w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-2048x1154.jpg 2048w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-16x9.jpg 16w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-1066x600.jpg 1066w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-655x368.jpg 655w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-343x193.jpg 343w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-640x360.jpg 640w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-1280x720.jpg 1280w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/05\/1400x788_ICLR_no_logo_still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"Yale Song","formattedDate":"May 17, 2021","formattedExcerpt":"Understanding video is one of the most challenging problems in AI, and\u00a0an important underlying requirement is learning multimodal representations that capture information about objects, actions, sounds, and 
their long-range statistical dependencies from audio-visual signals. Recently, transformers have been successful in vision-and-language tasks such as image&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/746077","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/comments?post=746077"}],"version-history":[{"count":20,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/746077\/revisions"}],"predecessor-version":[{"id":746872,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/746077\/revisions\/746872"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media\/746761"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=746077"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/categories?post=746077"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/tags?post=746077"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=746077"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=746077"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/newed.
any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=746077"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=746077"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=746077"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=746077"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=746077"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=746077"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}