{"id":861732,"date":"2022-07-13T02:18:48","date_gmt":"2022-07-13T09:18:48","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?post_type=msr-blog-post&#038;p=861732"},"modified":"2022-07-13T02:20:00","modified_gmt":"2022-07-13T09:20:00","slug":"dit-self-supervised-pre-training-for-document-image-transformers","status":"publish","type":"msr-blog-post","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/articles\/dit-self-supervised-pre-training-for-document-image-transformers\/","title":{"rendered":"DiT: Self-supervised Pre-training for Document Image Transformers"},"content":{"rendered":"\n<p>Since 2019, researchers at Microsoft Research Asia have been exploring Document AI and have completed a series of work on LayoutLM\/LayoutLMv2\/LayoutXLM, TrOCR, MarkupLM, etc., achieving breakthroughs in a range of typical tasks such as object detection, information extraction, and document classification. However, most of the vision models used have been those trained on general domains, such as ResNet, ViT, etc., rather than document-specific vision models. This has led to the problem of domain shifts and mismatches when encoding document images.<\/p>\n\n\n\n<p>In the document image domain, large-scale annotated data such as ImageNet does not yet exist, and so large-scale supervised pre-training cannot be carried out. Moreover, although some of the work has attempted to explore weakly supervised training of document understanding models, the datasets used in these approaches have been mostly derived from academic papers with similar templates and layouts and are quite different from the forms, receipts, reports, and so on that are more commonly found in real-world applications. 
Therefore, large-scale unsupervised pre-training models for document images are highly desirable and can bring performance improvements to existing methods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"638\" height=\"422\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture2.png\" alt=\"diagram\" class=\"wp-image-861738\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture2.png 638w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture2-300x198.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture2-240x159.png 240w\" sizes=\"auto, (max-width: 638px) 100vw, 638px\" \/><figcaption>Figure 1: Various kinds of document images<\/figcaption><\/figure>\n\n\n\n<p>To address these issues, researchers at Microsoft Research Asia have developed a new DiT model based on the advanced Vision Transformer architecture; the paper has been accepted by ACM Multimedia 2022. The training of DiT does not rely on any labeled data but follows the unsupervised pre-training method of BEiT, Masked Image Modeling (MIM), to make full use of the large number of unlabeled document images. In this method, the researchers first resize a document image to 224&#215;224 and then slice it into a sequence of 16&#215;16 patches, from which a representation of each patch is obtained via the encoder. In parallel, the image is tokenized with a pre-trained dVAE to obtain the index of each patch in the codebook. 
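<\/p>\n\n\n\n<p>As a rough illustration of this tokenization step, each patch latent can be mapped to the index of its nearest codebook entry. The following generic vector-quantization sketch uses illustrative sizes; the actual dVAE tokenizer uses a learned convolutional encoder and a much larger codebook:<\/p>

```python
import numpy as np

# Generic nearest-neighbor codebook lookup (an illustrative sketch, not
# the real dVAE, which uses a learned encoder).
rng = np.random.default_rng(0)

codebook = rng.random((512, 16))   # 512 visual tokens, 16-dim entries
latents = rng.random((196, 16))    # one latent vector per 16x16 patch

# Squared Euclidean distance from each patch latent to every codebook entry.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
token_ids = dists.argmin(axis=1)   # index of each patch in the codebook

print(token_ids.shape)  # (196,)
```

<p>These discrete indices serve as the prediction targets during pre-training.<\/p>\n\n\n\n<p>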
By randomly masking some patches in the sequence and reconstructing them by predicting their codebook indices from the information in the remaining patches, the model learns a generalized document understanding ability from unlabeled document images.<\/p>\n\n\n\n<p>To validate the performance of DiT, the researchers conducted experiments on four different downstream tasks: document image classification, document layout analysis, table detection, and text detection. Experimental results show that DiT significantly outperforms models trained on general images, demonstrating its effectiveness.<\/p>\n\n\n\n<h2 id=\"model-pre-training\">Model Pre-training<\/h2>\n\n\n\n<p>The model structure of DiT is consistent with ViT and adopts the native Transformer structure as its backbone network. First, the input document image is split into multiple non-overlapping patches, and a sequence of patch embeddings is obtained by a simple linear projection. After adding the one-dimensional position embeddings, the sequence is fed into a series of consecutive Transformer blocks with multi-head attention to obtain the final representation of each patch. Inspired by BEiT, DiT also employs Masked Image Modeling (MIM) for pre-training. In this step, the input image is viewed from two perspectives: as a patch sequence and as visual tokens. The model needs to encode the masked patch sequence and predict the visual tokens of the masked patches. 
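<\/p>\n\n\n\n<p>The patch slicing and random masking described above can be sketched in a few lines of NumPy (the masking ratio and all names here are illustrative assumptions; the Transformer encoder and prediction head are omitted):<\/p>

```python
import numpy as np

# Illustrative sketch of the patching and masking in Masked Image Modeling.
rng = np.random.default_rng(0)

image = rng.random((224, 224, 3))   # a resized document image
patch = 16
n_side = 224 // patch               # 14 patches per side

# Slice the image into a sequence of 16x16 patches, flattened to vectors.
patches = image.reshape(n_side, patch, n_side, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_side * n_side, -1)

# Randomly mask a subset of patch positions; the model must predict the
# dVAE codebook index of each masked patch from the visible patches.
mask_ratio = 0.4                    # an assumed ratio, for illustration
num_masked = int(mask_ratio * len(patches))
masked_pos = rng.choice(len(patches), size=num_masked, replace=False)

print(patches.shape)  # (196, 768)
```

<p>Each 224&#215;224 image thus yields 14&#215;14 = 196 patches of dimension 16&#215;16&#215;3 = 768.<\/p>\n\n\n\n<p>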
Figure 2 illustrates this model structure and the overall pre-training steps.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"572\" height=\"336\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture3.png\" alt=\"diagram\" class=\"wp-image-861741\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture3.png 572w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture3-300x176.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture3-240x141.png 240w\" sizes=\"auto, (max-width: 572px) 100vw, 572px\" \/><figcaption>Figure 2: The model structure and the pre-training procedure<\/figcaption><\/figure>\n\n\n\n<p>The dVAE tokenizer used in the training of BEiT was derived from DALL-E, which was trained from a dataset of 400 million general images. Using the same tokenizer directly for training DiT would lead to domain inconsistency, so the researchers retrained a new dVAE tokenizer on the IIT-CDIP document image dataset. Figure 3 shows the reconstruction results of document images using both the tokenizer trained on IIT-CDIP data and the DALL-E tokenizer. 
It can be seen that the reconstructed images obtained using the new tokenizer have sharper edges, while the images obtained using the DALL-E tokenizer are blurrier.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture4.png\" alt=\"diagram\" class=\"wp-image-861744\" width=\"512\" height=\"382\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture4.png 409w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture4-300x224.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture4-80x60.png 80w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture4-240x180.png 240w\" sizes=\"auto, (max-width: 512px) 100vw, 512px\" \/><figcaption>Figure 3: Reconstruction results of different tokenizers. Left: Raw images, Middle: New tokenizer, Right: DALL-E tokenizer<\/figcaption><\/figure>\n\n\n\n<h2 id=\"fine-tuning-downstream-tasks\">Fine-tuning downstream tasks<\/h2>\n\n\n\n<p>The downstream tasks used to validate DiT include document image classification, document layout analysis, table detection, and text detection. These can be classified into two main task categories: image classification and object detection.<\/p>\n\n\n\n<p>For the image classification task, a simple mean pooling layer is used to integrate the representations of the patch sequence into a global representation, which is fed into a simple linear layer for classification.<\/p>\n\n\n\n<p>For the object detection tasks, as shown in Figure 4, researchers use Mask R-CNN as the default detection framework and then further use the more advanced Cascade R-CNN for some tasks. 
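<\/p>\n\n\n\n<p>The image-classification head described above amounts to mean pooling plus one linear layer. A minimal NumPy sketch (assuming the 768-dimensional patch representations of a ViT-base-sized encoder and the 16 classes of RVL-CDIP; the weights are random placeholders):<\/p>

```python
import numpy as np

# Mean-pool the patch representations into a global vector, then apply a
# single linear layer. Weights here are random placeholders.
rng = np.random.default_rng(0)

patch_repr = rng.random((196, 768))    # encoder output, one row per patch
W = rng.random((768, 16)) * 0.01       # 16 document classes
b = np.zeros(16)

global_repr = patch_repr.mean(axis=0)  # simple mean pooling
logits = global_repr @ W + b
predicted_class = int(logits.argmax())

print(logits.shape)  # (16,)
```

<p>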
Based on existing work, the researchers designed corresponding scaling modules at four different DiT layers to produce the multi-scale feature maps required by the FPN in the detection algorithms. Specifically, if the model has <em>d<\/em> layers, the output of the <em>d\/3<\/em>-th layer is upsampled by a factor of 4, the output of the <em>d\/2<\/em>-th layer is upsampled by a factor of 2, the output of the <em>2d\/3<\/em>-th layer remains the same, and the output of the last layer is downsampled by a factor of 2. These scaled feature maps are then fed into the FPN of the detection framework.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture5.png\" alt=\"diagram\" class=\"wp-image-861747\" width=\"328\" height=\"515\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture5.png 214w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture5-191x300.png 191w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture5-115x180.png 115w\" sizes=\"auto, (max-width: 328px) 100vw, 328px\" \/><figcaption>Figure 4: Applying DiT in various detection frameworks<\/figcaption><\/figure>\n\n\n\n<h2 id=\"experimental-results\">Experimental results<\/h2>\n\n\n\n<h3 id=\"1-pre-training\">1. Pre-training<\/h3>\n\n\n\n<p>DiT has two versions, a base size and a large size, and both were trained on the IIT-CDIP Test Collection 1.0 dataset. A total of 42 million document images were obtained after slicing the multi-page documents in the dataset into single pages. During the pre-training process, the researchers used random scaling and cropping for data augmentation.<\/p>\n\n\n\n<h3 id=\"2-document-image-classification\">2. 
Document Image Classification<\/h3>\n\n\n\n<p>The document image classification task uses RVL-CDIP as the test dataset, containing 400,000 document images, each of which belongs to one of 16 common document types. Experimental results are shown in Table 1. It can be seen that DiT shows a significant performance improvement over previous methods in the single-model scenario. DiT-L has achieved results that are even comparable with previous ensemble models.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture6.png\" alt=\"table\" class=\"wp-image-861750\" width=\"491\" height=\"258\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture6.png 348w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture6-300x158.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture6-240x126.png 240w\" sizes=\"auto, (max-width: 491px) 100vw, 491px\" \/><figcaption>Table 1: Experimental results of DiT on document image classification (RVL-CDIP)<\/figcaption><\/figure>\n\n\n\n<h3 id=\"3-document-layout-analysis\">3. Document layout analysis<\/h3>\n\n\n\n<p>The document layout analysis task uses PubLayNet as the test dataset, containing over 360,000 document images. This task requires the model to detect common document elements in images, such as text, titles, lists, figures, and tables. Experimental results are shown in Table 2, where DiT not only far exceeds existing SOTA methods but also significantly outperforms various vision Transformer baseline models. 
When a more advanced detection framework (e.g., Cascade R-CNN) is used, DiT achieves even more accurate detection results.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture7.png\" alt=\"table\" class=\"wp-image-861753\" width=\"466\" height=\"195\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture7.png 344w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture7-300x126.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture7-240x100.png 240w\" sizes=\"auto, (max-width: 466px) 100vw, 466px\" \/><figcaption>Table 2: Experimental results of DiT on document layout analysis (PubLayNet)<\/figcaption><\/figure>\n\n\n\n<h3 id=\"4-table-detection\">4. Table detection<\/h3>\n\n\n\n<p>The table detection task uses ICDAR 2019 cTDaR &#8211; TrackA as the test dataset. This task requires the model to detect all tables in an image and covers both modern and archival documents. The modern-document branch contains 600 training samples and 240 test samples, consisting of screenshots of various types of PDF files. The archival-document branch consists of 600 training samples and 199 test samples, composed of older handwritten archival images. Because there is a significant difference in backgrounds between the archival files and the pre-training data, adaptive binarization pre-processing needs to be carried out before fine-tuning (as shown in Figure 5). Experimental results are shown in Table 3. 
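<\/p>\n\n\n\n<p>The adaptive binarization mentioned above can be as simple as a block-wise local-mean threshold. The sketch below is only an assumption about the kind of pre-processing involved (OpenCV&#8217;s adaptiveThreshold is a common practical choice; block size and offset are illustrative):<\/p>

```python
import numpy as np

# Simple block-wise local-mean adaptive binarization (illustrative only).
def adaptive_binarize(gray, block=32, offset=10):
    out = np.zeros_like(gray)
    h, w = gray.shape
    for y in range(0, h, block):
        for x in range(0, w, block):
            tile = gray[y:y + block, x:x + block]
            thresh = tile.mean() - offset   # local threshold per block
            out[y:y + block, x:x + block] = np.where(tile > thresh, 255, 0)
    return out

rng = np.random.default_rng(0)
page = rng.integers(0, 256, size=(128, 128)).astype(np.uint8)
binary = adaptive_binarize(page)

print(sorted(np.unique(binary).tolist()))  # [0, 255]
```

<p>This suppresses background variation so that the archival images better match the distribution of the pre-training data.<\/p>\n\n\n\n<p>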
DiT achieves higher scores than previous SOTA and baseline models on both branches of the task and is able to perform better after switching to the more advanced Cascade R-CNN.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture8.png\" alt=\"diagram\" class=\"wp-image-861756\" width=\"413\" height=\"293\"\/><figcaption>Figure 5: Adaptive binarization pre-processing for archival documents<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"476\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture9.png\" alt=\"table\" class=\"wp-image-861759\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture9.png 384w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture9-242x300.png 242w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture9-145x180.png 145w\" sizes=\"auto, (max-width: 384px) 100vw, 384px\" \/><figcaption>Table 3: Experimental results of DiT on table detection (ICDAR 2019 cTDaR \u2013 TrackA).<\/figcaption><\/figure>\n\n\n\n<h3 id=\"5-text-detection\">5. Text detection<\/h3>\n\n\n\n<p>The text detection task uses FUNSD&#8217;s OCR text recognition branch as the test dataset and requires the model to detect text bounding boxes in document images. This task contains 150 training cases and 49 test cases. Experimental results are shown in Table 4. DiT\u2019s detection results are much better than those of DBNet, a commonly used online OCR engine, and an existing commercial OCR engine, and it also performs significantly better than the chosen baseline models. 
Researchers additionally trained DiT on a synthetic dataset containing 1 million document images, which further improved the model's text detection performance.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"337\" height=\"214\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture10.png\" alt=\"table\" class=\"wp-image-861762\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture10.png 337w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture10-300x191.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2022\/07\/Picture10-240x152.png 240w\" sizes=\"auto, (max-width: 337px) 100vw, 337px\" \/><figcaption>Table 4: Experimental results of DiT on text detection (FUNSD).<\/figcaption><\/figure>\n\n\n\n<h2 id=\"future-work\">Future work<\/h2>\n\n\n\n<p>To fill the current gap in unsupervised pre-trained document vision models, researchers at Microsoft Research Asia have proposed and trained the DiT model, which leverages large-scale and diverse unlabeled document image data and therefore serves as an ideal vision backbone for a variety of downstream document tasks. DiT outperforms several state-of-the-art baseline models on document image classification, document layout analysis, table detection, and text detection, establishing new SOTA results on all four tasks. The DiT model and related code have been open-sourced (code link: https:\/\/aka.ms\/msdit) to facilitate further research in the field of document AI.<\/p>\n\n\n\n<p>In the future, researchers at Microsoft Research Asia will try to train DiT on larger datasets and further improve performance on various downstream tasks. 
In addition, new visually rich document understanding models, such as LayoutLMv3, may also adopt DiT as their basic vision models, thereby building a unified framework for computer vision and natural language understanding applications in the field of document AI.<\/p>\n\n\n\n<p>Paper: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2203.02378\">https:\/\/arxiv.org\/abs\/2203.02378<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n\n\n\n<p>Code&Models: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/msdit\">https:\/\/aka.ms\/msdit<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since 2019, researchers at Microsoft Research Asia have been exploring Document AI and have completed a series of work on LayoutLM\/LayoutLMv2\/LayoutXLM, TrOCR, MarkupLM, etc., achieving breakthroughs in a range of typical tasks such as object detection, information extraction, and document classification. 
However, most of the vision models used have been those trained on general domains, [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":861741,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-861732","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/861732","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":6,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/861732\/revisions"}],"predecessor-version":[{"id":861780,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/861732\/revisions\/861780"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media\/861741"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=861732"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=861732"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=861732"},{"taxonomy":"msr-post-option"
,"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=861732"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}