# MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[[`arXiv`](https://arxiv.org/abs/2203.16527)] [[`BibTeX`](#CitingMViTv2)]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the [MViTv2 repo](https://github.com/facebookresearch/mvit).

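As a quick orientation, the sketch below shows one way these lazy configs could be loaded and run on a single image with Detectron2's `LazyConfig` API. It is illustrative only; the config, checkpoint, and image paths are placeholders, not files shipped with this directory.

```python
# An illustrative sketch (placeholder paths) of building a detector from one
# of the lazy configs in ./configs and running it on a single image.
import torch
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate
from detectron2.data.detection_utils import read_image

cfg = LazyConfig.load("configs/mask_rcnn_mvitv2_t_3x.py")
model = instantiate(cfg.model).to(cfg.train.device).eval()
DetectionCheckpointer(model).load("/path/to/model_checkpoint")  # e.g. a .pkl from the table below

image = read_image("input.jpg", format="BGR")  # HWC uint8 in BGR, the default input format
inputs = [{"image": torch.as_tensor(image.astype("float32").transpose(2, 0, 1))}]
with torch.no_grad():
    instances = model(inputs)[0]["instances"]  # predicted boxes, masks, and scores
```
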
## Results and Pretrained Models

### COCO

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom">Name</th>
<th valign="bottom">pre-train</th>
<th valign="bottom">Method</th>
<th valign="bottom">epochs</th>
<th valign="bottom">box<br/>AP</th>
<th valign="bottom">mask<br/>AP</th>
<th valign="bottom">#params</th>
<th valign="bottom">FLOPs</th>
<th valign="bottom">model id</th>
<th valign="bottom">download</th>
<!-- TABLE BODY -->
<!-- ROW: mask_rcnn_mvitv2_t_3x -->
<tr><td align="left"><a href="configs/mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Mask R-CNN</td>
<td align="center">36</td>
<td align="center">48.3</td>
<td align="center">43.8</td>
<td align="center">44M</td>
<td align="center">279G</td>
<td align="center">307611773</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/mask_rcnn_mvitv2_t_3x/f307611773/model_final_1a1c30.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_t_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">52.2</td>
<td align="center">45.0</td>
<td align="center">76M</td>
<td align="center">701G</td>
<td align="center">308344828</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_t_3x/f308344828/model_final_c6967a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_s_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_s_3x.py">MViTV2-S</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">53.2</td>
<td align="center">46.0</td>
<td align="center">87M</td>
<td align="center">748G</td>
<td align="center">308344647</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_s_3x/f308344647/model_final_279baf.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_3x.py">MViTV2-B</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.1</td>
<td align="center">46.7</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">308109448</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_3x/f308109448/model_final_421a91.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_in21k_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_in21k_3x.py">MViTV2-B</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.9</td>
<td align="center">47.4</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">309003202</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_in12k_3x/f309003202/model_final_be5168.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep.py">MViTV2-L</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">50</td>
<td align="center">55.8</td>
<td align="center">48.3</td>
<td align="center">270M</td>
<td align="center">1519G</td>
<td align="center">308099658</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_l_in12k_lsj_50ep/f308099658/model_final_c41c5a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x.py">MViTV2-H</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">56.1</td>
<td align="center">48.5</td>
<td align="center">718M</td>
<td align="center">3084G</td>
<td align="center">309013744</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_h_in12k_lsj_3x/f309013744/model_final_30d36b.pkl">model</a></td>
</tr>
</tbody></table>

Note that the above models were trained and measured on 8 nodes with 64 NVIDIA A100 GPUs in total. The ImageNet pre-trained model weights are obtained from the [MViTv2 repo](https://github.com/facebookresearch/mvit).

## Training
All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```
By default, we use 64 GPUs with a total batch size of 64 for training.

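Since these are lazy configs, they can also be scaled down for smaller setups. Below is a hedged sketch of adapting the 64-GPU defaults to a single 8-GPU machine; the config choice, the linear learning-rate scaling, and the checkpoint path are illustrative assumptions, not settings shipped with this project.

```python
# A hedged sketch of scaling the default 64-GPU recipe down to one 8-GPU
# machine. The config choice, the linear LR scaling rule, and the checkpoint
# path are illustrative assumptions, not values shipped with this project.
from detectron2.config import LazyConfig

cfg = LazyConfig.load("configs/cascade_mask_rcnn_mvitv2_b_3x.py")

scale = 8 / 64  # one 8-GPU machine instead of the default 64 GPUs
cfg.dataloader.train.total_batch_size = int(cfg.dataloader.train.total_batch_size * scale)
cfg.optimizer.lr = cfg.optimizer.lr * scale  # linear learning-rate scaling

# The training script applies "key=value" overrides the same way, e.g. to
# point at an ImageNet checkpoint downloaded from the MViTv2 repo:
cfg = LazyConfig.apply_overrides(cfg, ["train.init_checkpoint=/path/to/imagenet_pretrained.pyth"])
```

The same `key=value` overrides can also be appended directly to the `lazyconfig_train_net.py` command above, together with `--num-gpus` for single-machine runs.
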
## Evaluation
Model evaluation can be done similarly:
```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```

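For reference, here is a minimal Python sketch of what the `--eval-only` path does, assuming the COCO dataloader and evaluator defined by the config; the config and checkpoint paths below are placeholders.

```python
# A minimal sketch of the --eval-only path: build the model from a lazy
# config, load a checkpoint, and score it with the config's COCO evaluator.
# Config and checkpoint paths below are placeholders.
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate
from detectron2.evaluation import inference_on_dataset

cfg = LazyConfig.load("configs/cascade_mask_rcnn_mvitv2_b_3x.py")
model = instantiate(cfg.model).to(cfg.train.device).eval()
DetectionCheckpointer(model).load("/path/to/model_checkpoint")

# Runs the model over the config's test set and reports box/mask AP.
results = inference_on_dataset(
    model, instantiate(cfg.dataloader.test), instantiate(cfg.dataloader.evaluator)
)
print(results)
```
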

## <a name="CitingMViTv2"></a>Citing MViTv2

If you use MViTv2, please use the following BibTeX entry.

```BibTeX
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
```