
Commit c9cf7c9

lyttonhao authored and facebook-github-bot committed

MViTv2 README & configs

Reviewed By: feichtenhofer
Differential Revision: D36550846
fbshipit-source-id: 7e387001763faf5310fb9a38bda4f07c9d3a8ba7

1 parent 9957012

12 files changed, +386 -1

README.md (+1, -1)

```diff
@@ -24,7 +24,7 @@
 
 ## What's New
 * Includes new capabilities such as panoptic segmentation, Densepose, Cascade R-CNN, rotated bounding boxes, PointRend,
-  DeepLab, ViTDet, etc.
+  DeepLab, ViTDet, MViTv2 etc.
 * Used as a library to support building [research projects](projects/) on top of it.
 * Models can be exported to TorchScript format or Caffe2 format for deployment.
 * It [trains much faster](https://detectron2.readthedocs.io/notes/benchmarks.html).
```

projects/MViTv2/README.md (new file, +142 lines)
# MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer*

[[`arXiv`](https://arxiv.org/abs/2203.16527)] [[`BibTeX`](#CitingMViTv2)]

In this repository, we provide detection configs and models for MViTv2 (CVPR 2022) in Detectron2. For image classification tasks, please refer to the [MViTv2 repo](https://github.com/facebookresearch/mvit).
## Results and Pretrained Models

### COCO

<table><tbody>
<!-- START TABLE -->
<!-- TABLE HEADER -->
<th valign="bottom">Name</th>
<th valign="bottom">pre-train</th>
<th valign="bottom">Method</th>
<th valign="bottom">epochs</th>
<th valign="bottom">box<br/>AP</th>
<th valign="bottom">mask<br/>AP</th>
<th valign="bottom">#params</th>
<th valign="bottom">FLOPs</th>
<th valign="bottom">model id</th>
<th valign="bottom">download</th>
<!-- TABLE BODY -->
<!-- ROW: mask_rcnn_mvitv2_t_3x -->
<tr><td align="left"><a href="configs/mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Mask R-CNN</td>
<td align="center">36</td>
<td align="center">48.3</td>
<td align="center">43.8</td>
<td align="center">44M</td>
<td align="center">279G</td>
<td align="center">307611773</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/mask_rcnn_mvitv2_t_3x/f307611773/model_final_1a1c30.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_t_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_t_3x.py">MViTV2-T</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">52.2</td>
<td align="center">45.0</td>
<td align="center">76M</td>
<td align="center">701G</td>
<td align="center">308344828</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_t_3x/f308344828/model_final_c6967a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_s_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_s_3x.py">MViTV2-S</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">53.2</td>
<td align="center">46.0</td>
<td align="center">87M</td>
<td align="center">748G</td>
<td align="center">308344647</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_s_3x/f308344647/model_final_279baf.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_3x.py">MViTV2-B</a></td>
<td align="center">IN1K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.1</td>
<td align="center">46.7</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">308109448</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_3x/f308109448/model_final_421a91.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_b_in21k_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_b_in21k_3x.py">MViTV2-B</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">54.9</td>
<td align="center">47.4</td>
<td align="center">103M</td>
<td align="center">814G</td>
<td align="center">309003202</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_b_in12k_3x/f309003202/model_final_be5168.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep.py">MViTV2-L</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">50</td>
<td align="center">55.8</td>
<td align="center">48.3</td>
<td align="center">270M</td>
<td align="center">1519G</td>
<td align="center">308099658</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_l_in12k_lsj_50ep/f308099658/model_final_c41c5a.pkl">model</a></td>
</tr>
<!-- ROW: cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x -->
<tr><td align="left"><a href="configs/cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x.py">MViTV2-H</a></td>
<td align="center">IN21K</td>
<td align="center">Cascade Mask R-CNN</td>
<td align="center">36</td>
<td align="center">56.1</td>
<td align="center">48.5</td>
<td align="center">718M</td>
<td align="center">3084G</td>
<td align="center">309013744</td>
<td align="center"><a href="https://dl.fbaipublicfiles.com/detectron2/MViTv2/cascade_mask_rcnn_mvitv2_h_in12k_lsj_3x/f309013744/model_final_30d36b.pkl">model</a></td>
</tr>
</tbody></table>
Note that the above models were trained and measured on 8 nodes with a total of 64 NVIDIA A100 GPUs. The ImageNet pre-trained model weights are obtained from the [MViTv2 repo](https://github.com/facebookresearch/mvit).
## Training

All configs can be trained with:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py
```

By default, we use 64 GPUs with a total batch size of 64 for training.
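Since these are detectron2 lazy configs, individual fields can also be overridden without editing any file, either as trailing `key=value` arguments to `lazyconfig_train_net.py` (as in the evaluation example below) or programmatically. A minimal sketch, assuming the standard `LazyConfig` API; the override values here are our illustration, not part of this project:

```python
# minimal sketch: load one of the configs above and override fields before training
from detectron2.config import LazyConfig

cfg = LazyConfig.load("projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_t_3x.py")
cfg = LazyConfig.apply_overrides(
    cfg,
    [
        "dataloader.train.total_batch_size=8",  # scaled down from the default 64
        "train.output_dir=./output_mvitv2",  # hypothetical output location
    ],
)
```

Note that when the total batch size is reduced this way, the learning rate and schedule generally need rescaling as well; the provided configs assume the 64-image default.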
## Evaluation

Model evaluation can be done similarly:

```
../../tools/lazyconfig_train_net.py --config-file configs/path/to/config.py --eval-only train.init_checkpoint=/path/to/model_checkpoint
```
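Because `train.init_checkpoint` accepts remote paths, the released checkpoints from the table above can be evaluated directly by URL. As a sketch (ours, for illustration), the MViTV2-T Mask R-CNN weights can also be loaded in Python:

```python
# sketch: build the model from its config and load released weights by URL;
# DetectionCheckpointer resolves remote paths through detectron2's PathManager.
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import LazyConfig, instantiate

cfg = LazyConfig.load("projects/MViTv2/configs/mask_rcnn_mvitv2_t_3x.py")
model = instantiate(cfg.model)
DetectionCheckpointer(model).load(
    "https://dl.fbaipublicfiles.com/detectron2/MViTv2/mask_rcnn_mvitv2_t_3x/f307611773/model_final_1a1c30.pkl"
)
```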
## <a name="CitingMViTv2"></a>Citing MViTv2

If you use MViTv2, please use the following BibTeX entry.

```BibTeX
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
```
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_b_3x.py (new file, +8 lines)

```python
from .cascade_mask_rcnn_mvitv2_t_3x import model, dataloader, optimizer, lr_multiplier, train


# MViTv2-B: 24 blocks in stages of 2/3/16/3
model.backbone.bottom_up.depth = 24
model.backbone.bottom_up.last_block_indexes = (1, 4, 20, 23)
model.backbone.bottom_up.drop_path_rate = 0.4

train.init_checkpoint = "detectron2://ImageNetPretrained/mvitv2/MViTv2_B_in1k.pyth"
```
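The `last_block_indexes` tuple marks the final block of each of the four backbone stages. As an illustrative check (ours, not part of the commit), the stage sizes it implies for MViTv2-B can be recovered like so:

```python
# stage sizes implied by last_block_indexes for MViTv2-B (depth 24)
last_block_indexes = (1, 4, 20, 23)
stage_sizes = [last_block_indexes[0] + 1] + [
    b - a for a, b in zip(last_block_indexes, last_block_indexes[1:])
]
assert stage_sizes == [2, 3, 16, 3]
assert sum(stage_sizes) == 24  # matches model.backbone.bottom_up.depth
```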
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_b_in21k_3x.py (new file, +3 lines)

```python
from .cascade_mask_rcnn_mvitv2_b_3x import model, dataloader, optimizer, lr_multiplier, train

# same recipe as the IN1K Base config, initialized from IN21K pre-trained weights
train.init_checkpoint = "detectron2://ImageNetPretrained/mvitv2/MViTv2_B_in21k.pyth"
```
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_h_in21k_lsj_3x.py (new file, +12 lines)

```python
from .cascade_mask_rcnn_mvitv2_b_3x import model, optimizer, train, lr_multiplier
from .common.coco_loader_lsj import dataloader


# MViTv2-H: 80 blocks in stages of 4/8/60/8, with a wider embedding
model.backbone.bottom_up.embed_dim = 192
model.backbone.bottom_up.depth = 80
model.backbone.bottom_up.num_heads = 3
model.backbone.bottom_up.last_block_indexes = (3, 11, 71, 79)
model.backbone.bottom_up.drop_path_rate = 0.6
# activation checkpointing trades recomputation for memory on the 80-block backbone
model.backbone.bottom_up.use_act_checkpoint = True

train.init_checkpoint = "detectron2://ImageNetPretrained/mvitv2/MViTv2_H_in21k.pyth"
```
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_l_in21k_lsj_50ep.py (new file, +31 lines)

```python
from fvcore.common.param_scheduler import MultiStepParamScheduler

from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler

from .cascade_mask_rcnn_mvitv2_b_3x import model, optimizer, train
from .common.coco_loader_lsj import dataloader


# MViTv2-L: 48 blocks in stages of 2/6/36/4
model.backbone.bottom_up.embed_dim = 144
model.backbone.bottom_up.depth = 48
model.backbone.bottom_up.num_heads = 2
model.backbone.bottom_up.last_block_indexes = (1, 7, 43, 47)
model.backbone.bottom_up.drop_path_rate = 0.5

train.init_checkpoint = "detectron2://ImageNetPretrained/mvitv2/MViTv2_L_in21k.pyth"

# Schedule
# 50ep = 184375 // 2 iters * 64 images/iter / 118000 images/ep
train.max_iter = 184375 // 2
lr_multiplier = L(WarmupParamScheduler)(
    scheduler=L(MultiStepParamScheduler)(
        values=[1.0, 0.1, 0.01],
        milestones=[163889 // 2, 177546 // 2],
        num_updates=train.max_iter,
    ),
    warmup_length=250 / train.max_iter,
    warmup_factor=0.001,
)

optimizer.lr = 1e-4
```
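The schedule comment above can be checked directly; a small illustrative calculation (ours):

```python
# verifying the "50ep" comment: iterations * images/iteration / images/epoch
max_iter = 184375 // 2  # 92187 iterations
images_per_iter = 64  # total training batch size
images_per_epoch = 118000  # approximate size of COCO train2017
print(max_iter * images_per_iter / images_per_epoch)  # ~50.0 epochs
```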
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_s_3x.py (new file, +7 lines)

```python
from .cascade_mask_rcnn_mvitv2_t_3x import model, dataloader, optimizer, lr_multiplier, train


# MViTv2-S: 16 blocks in stages of 1/2/11/2
model.backbone.bottom_up.depth = 16
model.backbone.bottom_up.last_block_indexes = (0, 2, 13, 15)

train.init_checkpoint = "detectron2://ImageNetPretrained/mvitv2/MViTv2_S_in1k.pyth"
```
projects/MViTv2/configs/cascade_mask_rcnn_mvitv2_t_3x.py (new file, +48 lines)

```python
from detectron2.config import LazyCall as L
from detectron2.layers import ShapeSpec
from detectron2.modeling.box_regression import Box2BoxTransform
from detectron2.modeling.matcher import Matcher
from detectron2.modeling.roi_heads import FastRCNNOutputLayers, FastRCNNConvFCHead, CascadeROIHeads
from detectron2.layers.batch_norm import NaiveSyncBatchNorm

from .mask_rcnn_mvitv2_t_3x import model, dataloader, optimizer, lr_multiplier, train


# arguments that don't exist for Cascade R-CNN
[model.roi_heads.pop(k) for k in ["box_head", "box_predictor", "proposal_matcher"]]

# three cascade stages: IoU thresholds 0.5/0.6/0.7 with progressively tighter box-delta weights
model.roi_heads.update(
    _target_=CascadeROIHeads,
    box_heads=[
        L(FastRCNNConvFCHead)(
            input_shape=ShapeSpec(channels=256, height=7, width=7),
            conv_dims=[256, 256, 256, 256],
            fc_dims=[1024],
            conv_norm=lambda c: NaiveSyncBatchNorm(c, stats_mode="N"),
        )
        for _ in range(3)
    ],
    box_predictors=[
        L(FastRCNNOutputLayers)(
            input_shape=ShapeSpec(channels=1024),
            test_score_thresh=0.05,
            box2box_transform=L(Box2BoxTransform)(weights=(w1, w1, w2, w2)),
            cls_agnostic_bbox_reg=True,
            num_classes="${...num_classes}",
        )
        for (w1, w2) in [(10, 5), (20, 10), (30, 15)]
    ],
    proposal_matchers=[
        L(Matcher)(thresholds=[th], labels=[0, 1], allow_low_quality_matches=False)
        for th in [0.5, 0.6, 0.7]
    ],
)

# Using NaiveSyncBatchNorm because heads may have empty input. That is not supported by
# torch.nn.SyncBatchNorm. We can remove this after
# https://github.com/pytorch/pytorch/issues/36530 is fixed.
model.roi_heads.mask_head.conv_norm = lambda c: NaiveSyncBatchNorm(c, stats_mode="N")

# 2conv in RPN:
# https://github.com/tensorflow/tpu/blob/b24729de804fdb751b06467d3dce0637fa652060/models/official/detection/modeling/architecture/heads.py#L95-L97  # noqa: E501, B950
# -1 keeps each conv's output width equal to its input channels
model.proposal_generator.head.conv_dims = [-1, -1]
```
projects/MViTv2/configs/common/coco_loader.py (new file, +59 lines)

```python
from omegaconf import OmegaConf

import detectron2.data.transforms as T
from detectron2.config import LazyCall as L
from detectron2.data import (
    DatasetMapper,
    build_detection_test_loader,
    build_detection_train_loader,
    get_detection_dataset_dicts,
)
from detectron2.evaluation import COCOEvaluator

dataloader = OmegaConf.create()

# train-time augmentation: optional resize+crop (applied with prob 0.5),
# then multi-scale resize and horizontal flip
dataloader.train = L(build_detection_train_loader)(
    dataset=L(get_detection_dataset_dicts)(names="coco_2017_train"),
    mapper=L(DatasetMapper)(
        is_train=True,
        augmentations=[
            L(T.RandomApply)(
                tfm_or_aug=L(T.AugmentationList)(
                    augs=[
                        L(T.ResizeShortestEdge)(
                            short_edge_length=[400, 500, 600], sample_style="choice"
                        ),
                        L(T.RandomCrop)(crop_type="absolute_range", crop_size=(384, 600)),
                    ]
                ),
                prob=0.5,
            ),
            L(T.ResizeShortestEdge)(
                short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800),
                sample_style="choice",
                max_size=1333,
            ),
            L(T.RandomFlip)(horizontal=True),
        ],
        image_format="RGB",
        use_instance_mask=True,
    ),
    total_batch_size=16,
    num_workers=4,
)

dataloader.test = L(build_detection_test_loader)(
    dataset=L(get_detection_dataset_dicts)(names="coco_2017_val", filter_empty=False),
    mapper=L(DatasetMapper)(
        is_train=False,
        augmentations=[
            L(T.ResizeShortestEdge)(short_edge_length=800, max_size=1333),
        ],
        image_format="${...train.mapper.image_format}",
    ),
    num_workers=4,
)

dataloader.evaluator = L(COCOEvaluator)(
    dataset_name="${..test.dataset.names}",
)
```
projects/MViTv2/configs/common/coco_loader_lsj.py (new file, +19 lines)

```python
import detectron2.data.transforms as T
from detectron2.config import LazyCall as L

from .coco_loader import dataloader

# Data using LSJ
image_size = 1024
dataloader.train.mapper.augmentations = [
    L(T.RandomFlip)(horizontal=True),  # flip first
    L(T.ResizeScale)(
        min_scale=0.1, max_scale=2.0, target_height=image_size, target_width=image_size
    ),
    L(T.FixedSizeCrop)(crop_size=(image_size, image_size)),
]
dataloader.train.mapper.image_format = "RGB"
dataloader.train.total_batch_size = 64
# recompute boxes due to cropping
dataloader.train.mapper.recompute_boxes = True
```
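For reference, the LSJ (large-scale jitter) pipeline above can be exercised on its own. An illustrative sketch (the dummy image and printed shape are ours), using detectron2's transform API:

```python
# run the LSJ augmentations on a dummy image to show the fixed 1024x1024 output
import numpy as np

import detectron2.data.transforms as T

augs = T.AugmentationList(
    [
        T.RandomFlip(horizontal=True),
        T.ResizeScale(min_scale=0.1, max_scale=2.0, target_height=1024, target_width=1024),
        T.FixedSizeCrop(crop_size=(1024, 1024)),
    ]
)
aug_input = T.AugInput(np.zeros((480, 640, 3), dtype=np.uint8))
augs(aug_input)  # mutates aug_input in place and returns the applied TransformList
print(aug_input.image.shape)  # (1024, 1024, 3) regardless of the sampled scale
```

The `recompute_boxes` flag matters here because `FixedSizeCrop` can cut through objects; regenerating boxes from the cropped instance masks keeps them tight.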
