Commit 3591fe9
Merge branch 'main' into fast_load
2 parents: ad788be + 131518c

25 files changed: +949 -300 lines

README.md  +15 -2
@@ -12,6 +12,19 @@
 
 ## What's New
 
+## Dec 31, 2024
+* `convnext_nano` 384x384 ImageNet-12k pretrain & fine-tune. https://huggingface.co/models?search=convnext_nano%20r384
+* Add AIM-v2 encoders from https://github.com/apple/ml-aim, see on Hub: https://huggingface.co/models?search=timm%20aimv2
+* Add PaliGemma2 encoders from https://github.com/google-research/big_vision to existing PaliGemma, see on Hub: https://huggingface.co/models?search=timm%20pali2
+* Add missing L/14 DFN2B 39B CLIP ViT, `vit_large_patch14_clip_224.dfn2b_s39b`
+* Fix existing RmsNorm layer to match standard formulation, use PT 2.5 impl when possible. Move old impl to `SimpleNorm` layer, it's LN w/o centering or bias. There were only two `timm` models using it, and they have been updated.
+* Allow override of `cache_dir` arg for model creation
+* Pass through `trust_remote_code` for HF datasets wrapper
+* `inception_next_atto` model added by creator
+* Adan optimizer caution, and Lamb decoupled weight decay options
+* Some feature_info metadata fixed by https://github.com/brianhou0208
+* All OpenCLIP and JAX (CLIP, SigLIP, Pali, etc) model weights that used load time remapping were given their own HF Hub instances so that they work with `hf-hub:` based loading, and thus will work with new Transformers `TimmWrapperModel`
+
 ## Nov 28, 2024
 * More optimizers
 * Add MARS optimizer (https://arxiv.org/abs/2411.10438, https://github.com/AGI-Arena/MARS)
@@ -248,7 +261,7 @@ Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weight
 ### April 11, 2024
 * Prepping for a long overdue 1.0 release, things have been stable for a while now.
 * Significant feature that's been missing for a while, `features_only=True` support for ViT models with flat hidden states or non-std module layouts (so far covering `'vit_*', 'twins_*', 'deit*', 'beit*', 'mvitv2*', 'eva*', 'samvit_*', 'flexivit*'`)
-* Above feature support achieved through a new `forward_intermediates()` API that can be used with a feature wrapping module or direclty.
+* Above feature support achieved through a new `forward_intermediates()` API that can be used with a feature wrapping module or directly.
 ```python
 model = timm.create_model('vit_base_patch16_224')
 final_feat, intermediates = model.forward_intermediates(input)
@@ -486,7 +499,7 @@ Included optimizers available via `timm.optim.create_optimizer_v2` factory method
 * `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075
 * `mars` MARS optimizer from https://github.com/AGI-Arena/MARS - https://arxiv.org/abs/2411.10438
 * `nadam` an implementation of Adam w/ Nesterov momentum
-* `nadamw` an impementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency
+* `nadamw` an implementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency
 * `novograd` by [Masashi Kimura](https://github.com/convergence-lab/novograd) - https://arxiv.org/abs/1905.11286
 * `radam` by [Liyuan Liu](https://github.com/LiyuanLucasLiu/RAdam) - https://arxiv.org/abs/1908.03265
 * `rmsprop_tf` adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour

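As a side note on two of the README items above (the `cache_dir` override and the `forward_intermediates()` API), a minimal usage sketch might look like the following; the model name and cache path are illustrative and not part of this commit:

```python
import torch
import timm

# Create a pretrained model, overriding where Hub weights are cached
# (the cache_dir value here is just an example path).
model = timm.create_model(
    'vit_base_patch16_224',
    pretrained=True,
    cache_dir='/tmp/timm-cache',
).eval()

# forward_intermediates() returns the final features plus per-block intermediates.
x = torch.randn(1, 3, 224, 224)
final_feat, intermediates = model.forward_intermediates(x)
print(final_feat.shape, [t.shape for t in intermediates])
```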
UPGRADING.md  +3 -3

@@ -1,10 +1,10 @@
 # Upgrading from previous versions
 
-I generally try to maintain code interface and especially model weight compability across many `timm` versions. Sometimes there are exceptions.
+I generally try to maintain code interface and especially model weight compatibility across many `timm` versions. Sometimes there are exceptions.
 
 ## Checkpoint remapping
 
-Pretrained weight remapping is handled by `checkpoint_filter_fn` in a model implementation module. This remaps old pretrained checkpoints to new, and also 3rd party (original) checkpoints to `timm` format if the model was modified when brough into `timm`.
+Pretrained weight remapping is handled by `checkpoint_filter_fn` in a model implementation module. This remaps old pretrained checkpoints to new, and also 3rd party (original) checkpoints to `timm` format if the model was modified when brought into `timm`.
 
 The `checkpoint_filter_fn` is automatically called when loading pretrained weights via `pretrained=True`, but they can be called manually if you call the fn directly with the current model instance and old state dict.
 

@@ -19,6 +19,6 @@ Many changes were made since the 0.6.x stable releases. They were previewed in 0
 * The pretrained_tag is the specific weight variant (different head) for the architecture.
 * Using only `architecture` defaults to the first weights in the default_cfgs for that model architecture.
 * In adding pretrained tags, many model names that existed to differentiate were renamed to use the tag (ex: `vit_base_patch16_224_in21k` -> `vit_base_patch16_224.augreg_in21k`). There are deprecation mappings for these.
-* A number of models had their checkpoints remaped to match architecture changes needed to better support `features_only=True`, there are `checkpoint_filter_fn` methods in any model module that was remapped. These can be passed to `timm.models.load_checkpoint(..., filter_fn=timm.models.swin_transformer_v2.checkpoint_filter_fn)` to remap your existing checkpoint.
+* A number of models had their checkpoints remapped to match architecture changes needed to better support `features_only=True`, there are `checkpoint_filter_fn` methods in any model module that was remapped. These can be passed to `timm.models.load_checkpoint(..., filter_fn=timm.models.swin_transformer_v2.checkpoint_filter_fn)` to remap your existing checkpoint.
 * The Hugging Face Hub (https://huggingface.co/timm) is now the primary source for `timm` weights. Model cards include link to papers, original source, license.
 * Previous 0.6.x can be cloned from [0.6.x](https://github.com/rwightman/pytorch-image-models/tree/0.6.x) branch or installed via pip with version.

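For reference, the manual remapping path described in UPGRADING.md can be exercised roughly like this; the checkpoint path and model choice are illustrative:

```python
import timm
from timm.models import load_checkpoint
from timm.models.swin_transformer_v2 import checkpoint_filter_fn

# Build the current architecture, then load an old/original checkpoint,
# letting the module's checkpoint_filter_fn remap keys to the new layout.
model = timm.create_model('swinv2_tiny_window8_256', pretrained=False)
load_checkpoint(model, 'old_swinv2_checkpoint.pth', filter_fn=checkpoint_filter_fn)
```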
hfdocs/source/quickstart.mdx  +8 -8

@@ -164,14 +164,14 @@ First we'll need an image to do inference on. Here we load a picture of a leaf f
 >>> import requests
 >>> from PIL import Image
 >>> from io import BytesIO
->>> url = 'https://datasets-server.huggingface.co/assets/imagenet-1k/--/default/test/12/image/image.jpg'
+>>> url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/timm/cat.jpg'
 >>> image = Image.open(requests.get(url, stream=True).raw)
 >>> image
 ```
 
 Here's the image we loaded:
 
-<img src="https://datasets-server.huggingface.co/assets/imagenet-1k/--/default/test/12/image/image.jpg" alt="An Image from a link" width="300"/>
+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/timm/cat.jpg" alt="An Image from a link" width="300"/>
 
 Now, we'll create our model and transforms again. This time, we make sure to set our model in evaluation mode.
 

@@ -211,7 +211,7 @@ Now we'll find the top 5 predicted class indexes and values using `torch.topk`.
 ```py
 >>> values, indices = torch.topk(probabilities, 5)
 >>> indices
-tensor([162, 166, 161, 164, 167])
+tensor([281, 282, 285, 673, 670])
 ```
 
 If we check the imagenet labels for the top index, we can see what the model predicted...

@@ -220,9 +220,9 @@ If we check the imagenet labels for the top index, we can see what the model pre
 >>> IMAGENET_1k_URL = 'https://storage.googleapis.com/bit_models/ilsvrc2012_wordnet_lemmas.txt'
 >>> IMAGENET_1k_LABELS = requests.get(IMAGENET_1k_URL).text.strip().split('\n')
 >>> [{'label': IMAGENET_1k_LABELS[idx], 'value': val.item()} for val, idx in zip(values, indices)]
-[{'label': 'beagle', 'value': 0.8486220836639404},
- {'label': 'Walker_hound, Walker_foxhound', 'value': 0.03753996267914772},
- {'label': 'basset, basset_hound', 'value': 0.024628572165966034},
- {'label': 'bluetick', 'value': 0.010317106731235981},
- {'label': 'English_foxhound', 'value': 0.006958036217838526}]
+[{'label': 'tabby, tabby_cat', 'value': 0.5101025700569153},
+ {'label': 'tiger_cat', 'value': 0.22490699589252472},
+ {'label': 'Egyptian_cat', 'value': 0.1835290789604187},
+ {'label': 'mouse, computer_mouse', 'value': 0.006752475164830685},
+ {'label': 'motor_scooter, scooter', 'value': 0.004942195490002632}]
 ```

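For orientation, the surrounding quickstart flow that produces `probabilities` is only partially visible in these hunks; a condensed sketch following the quickstart's earlier steps (model name and transform helpers assumed from that context) looks roughly like this:

```python
import requests
import timm
import torch
from PIL import Image

url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/timm/cat.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Pretrained classifier in eval mode, with its matching preprocessing transform.
model = timm.create_model('mobilenetv3_large_100', pretrained=True).eval()
config = timm.data.resolve_data_config({}, model=model)
transform = timm.data.create_transform(**config)

with torch.no_grad():
    out = model(transform(image).unsqueeze(0))
probabilities = torch.nn.functional.softmax(out[0], dim=0)
values, indices = torch.topk(probabilities, 5)
```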
tests/test_layers.py  +40 -1

@@ -1,7 +1,8 @@
+import pytest
 import torch
 import torch.nn as nn
 
-from timm.layers import create_act_layer, set_layer_config, get_act_layer, get_act_fn
+from timm.layers import create_act_layer, set_layer_config, get_act_layer, get_act_fn, Attention2d, MultiQueryAttentionV2
 
 import importlib
 import os

@@ -119,3 +120,41 @@ def test_get_act_fn_none():
     assert get_act_fn(None) is None
     assert get_act_fn('') is None
 
+
+@pytest.mark.parametrize("dim", [128])
+@pytest.mark.parametrize("dim_out", [128, 256])
+@pytest.mark.parametrize("use_m", [True, False])
+def test_mqa_v2(dim, dim_out, use_m):
+    mqa = MultiQueryAttentionV2(dim, dim_out)
+
+    x = torch.randn(1, dim, 32, 48)
+    if use_m:
+        m = torch.randn(1, dim, 16, 24)
+    else:
+        m = None
+
+    y = mqa(x, m=m)
+
+    assert (y.shape) == (1, dim_out, 32, 48)
+
+
+@pytest.mark.parametrize("bias", [True, False])
+@pytest.mark.parametrize("expand_first", [True, False])
+@pytest.mark.parametrize("head_first", [True, False])
+@pytest.mark.parametrize("attn_mask", [True, False])
+def test_attn2d(bias, expand_first, head_first, attn_mask):
+    x = torch.randn(1, 128, 32, 48)
+    attn = Attention2d(
+        128, 128, num_heads=4, bias=bias, expand_first=expand_first, head_first=head_first
+    )
+
+    if attn_mask:
+        mask = torch.randint(0, 1, size=(32 * 48, 32 * 48), dtype=torch.float32)
+    else:
+        mask = None
+
+    o1 = attn(x, mask)
+    attn.fused_attn = False
+    o2 = attn(x, mask)
+
+    assert torch.allclose(o1, o2, atol=1e-5), f"{torch.abs(o1 - o2).max()}"

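One point worth spelling out (not part of the test itself): the mask accepted by `Attention2d` is an additive float mask, per the "assumes mask is float and in correct shape" note in the layer. A realistic mask would typically be built as zeros with `-inf` at blocked positions, e.g.:

```python
import torch

# Additive float attention mask of shape (L, L): 0 where attention is allowed,
# -inf where it should be blocked. `keep` here is a placeholder boolean pattern.
L = 32 * 48
keep = torch.ones(L, L, dtype=torch.bool)
mask = torch.zeros(L, L, dtype=torch.float32).masked_fill(~keep, float('-inf'))
```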
tests/test_models.py  +5 -5

@@ -53,13 +53,13 @@
     'vision_transformer', 'vision_transformer_sam', 'vision_transformer_hybrid', 'vision_transformer_relpos',
     'beit', 'mvitv2', 'eva', 'cait', 'xcit', 'volo', 'twins', 'deit', 'swin_transformer', 'swin_transformer_v2',
     'swin_transformer_v2_cr', 'maxxvit', 'efficientnet', 'mobilenetv3', 'levit', 'efficientformer', 'resnet',
-    'regnet', 'byobnet', 'byoanet', 'mlp_mixer', 'hiera', 'fastvit', 'hieradet_sam2'
+    'regnet', 'byobnet', 'byoanet', 'mlp_mixer', 'hiera', 'fastvit', 'hieradet_sam2', 'aimv2*'
 ]
 
 # transformer / hybrid models don't support full set of spatial / feature APIs and/or have spatial output.
 NON_STD_FILTERS = [
     'vit_*', 'tnt_*', 'pit_*', 'coat_*', 'cait_*', '*mixer_*', 'gmlp_*', 'resmlp_*', 'twins_*',
-    'convit_*', 'levit*', 'visformer*', 'deit*', 'xcit_*', 'crossvit_*', 'beit*',
+    'convit_*', 'levit*', 'visformer*', 'deit*', 'xcit_*', 'crossvit_*', 'beit*', 'aimv2*',
     'poolformer_*', 'volo_*', 'sequencer2d_*', 'mvitv2*', 'gcvit*', 'efficientformer*', 'sam_hiera*',
     'eva_*', 'flexivit*', 'eva02*', 'samvit_*', 'efficientvit_m*', 'tiny_vit_*', 'hiera_*', 'vitamin*', 'test_vit*',
 ]

@@ -72,11 +72,11 @@
         '*efficientnet_l2*', '*resnext101_32x48d', '*in21k', '*152x4_bitm', '*101x3_bitm', '*50x3_bitm',
         '*nfnet_f3*', '*nfnet_f4*', '*nfnet_f5*', '*nfnet_f6*', '*nfnet_f7*', '*efficientnetv2_xl*',
         '*resnetrs350*', '*resnetrs420*', 'xcit_large_24_p8*', '*huge*', '*giant*', '*gigantic*',
-        '*enormous*', 'maxvit_xlarge*', 'regnet*1280', 'regnet*2560']
-    NON_STD_EXCLUDE_FILTERS = ['*huge*', '*giant*', '*gigantic*', '*enormous*']
+        '*enormous*', 'maxvit_xlarge*', 'regnet*1280', 'regnet*2560', '*_1b_*', '*_3b_*']
+    NON_STD_EXCLUDE_FILTERS = ['*huge*', '*giant*', '*gigantic*', '*enormous*', '*_1b_*', '*_3b_*']
 else:
     EXCLUDE_FILTERS = ['*enormous*']
-    NON_STD_EXCLUDE_FILTERS = ['*gigantic*', '*enormous*']
+    NON_STD_EXCLUDE_FILTERS = ['*gigantic*', '*enormous*', '*_3b_*']
 
 EXCLUDE_JIT_FILTERS = ['hiera_*']
 

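The `'*_1b_*'` / `'*_3b_*'` patterns above are fnmatch-style globs; outside the test suite, the same style of filtering can be reproduced with `timm.list_models` (a small sketch, not part of the commit):

```python
import timm

# List aimv2 model names while skipping the large 1B/3B variants,
# mirroring the exclude filters used by the tests above.
names = timm.list_models('aimv2*', exclude_filters=['*_1b_*', '*_3b_*'])
print(names[:5])
```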
timm/layers/__init__.py  +1 -1

@@ -34,7 +34,7 @@
 from .mixed_conv2d import MixedConv2d
 from .mlp import Mlp, GluMlp, GatedMlp, SwiGLU, SwiGLUPacked, ConvMlp, GlobalResponseNormMlp
 from .non_local_attn import NonLocalAttn, BatNonLocalAttn
-from .norm import GroupNorm, GroupNorm1, LayerNorm, LayerNorm2d, RmsNorm, RmsNorm2d
+from .norm import GroupNorm, GroupNorm1, LayerNorm, LayerNorm2d, RmsNorm, RmsNorm2d, SimpleNorm, SimpleNorm2d
 from .norm_act import BatchNormAct2d, GroupNormAct, GroupNorm1Act, LayerNormAct, LayerNormAct2d,\
     SyncBatchNormAct, convert_sync_batchnorm, FrozenBatchNormAct2d, freeze_batch_norm_2d, unfreeze_batch_norm_2d
 from .padding import get_padding, get_same_padding, pad_same

timm/layers/attention2d.py  +9 -9

@@ -59,24 +59,24 @@ def _reshape_input(self, t):
 
     def forward(self, x, m: Optional[torch.Tensor] = None):
         """Run layer computation."""
-        s = x.shape
-        m = m or x
+        b, _, h, w = x.shape
+        m = m if m is not None else x
 
         reshaped_x = self._reshape_input(x)
         reshaped_m = self._reshape_input(m)
 
         q = torch.einsum('bnd,hkd->bnhk', reshaped_x, self.query_proj)
         k = torch.einsum('bmd,dk->bmk', reshaped_m, self.key_proj)
 
-        attn = torch.einsum('bnhk,bmk->bnhm', q, k)
+        attn = torch.einsum('bnhk,bmk->bnhm', q, k) * self.scale
         attn = attn.softmax(dim=-1)
         attn = self.attn_drop(attn)
 
         v = torch.einsum('bmd,dv->bmv', reshaped_m, self.value_proj)
         o = torch.einsum('bnhm,bmv->bnhv', attn, v)
-        result = torch.einsum('bnhv,dhv->bnd', o, self.out_proj)
+        result = torch.einsum('bnhv,dhv->bdn', o, self.out_proj)
         result = self.proj_drop(result)
-        return result.reshape(s)
+        return result.reshape(b, -1, h, w)
 
 
 class MultiQueryAttention2d(nn.Module):

@@ -312,7 +312,6 @@ def __init__(
         self.num_heads = num_heads
         self.dim_head = dim_attn // num_heads
         self.head_first = head_first
-        self.scale = num_heads ** -0.5
         self.fused_attn = use_fused_attn()
 
         self.qkv = nn.Conv2d(dim, dim_attn * 3, 1, bias=bias)

@@ -337,14 +336,15 @@ def forward(self, x, attn_mask: Optional[torch.Tensor] = None):
                 dropout_p=self.attn_drop.p if self.training else 0.,
             ).transpose(-1, -2).reshape(B, -1, H, W)
         else:
-            q = q * self.scale
-            attn = q.transpose(-2, -1) @ k
+            q = q.transpose(-1, -2)
+            v = v.transpose(-1, -2)
+            attn = q @ k * q.size(-1) ** -0.5
             if attn_mask is not None:
                 # NOTE: assumes mask is float and in correct shape
                 attn = attn + attn_mask
             attn = attn.softmax(dim=-1)
             attn = self.attn_drop(attn)
-            x = (v @ attn.transpose(-2, -1)).view(B, -1, H, W)
+            x = (attn @ v).transpose(-1, -2).reshape(B, -1, H, W)
 
         x = self.proj(x)
         x = self.proj_drop(x)

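To make the reworked unfused branch easier to follow, here is a standalone sketch (not code from this commit): with channels-first `(B, heads, dim_head, N)` tensors, transposing `q` and `v` and scaling by `dim_head ** -0.5` reproduces what `F.scaled_dot_product_attention` computes in the fused branch:

```python
import torch
import torch.nn.functional as F

B, H, D, N = 2, 4, 32, 49  # batch, heads, head dim, tokens (H*W)
q = torch.randn(B, H, D, N)
k = torch.randn(B, H, D, N)
v = torch.randn(B, H, D, N)

# Unfused path: transpose to (B, H, N, D), scale, softmax over keys, apply to values.
qt, vt = q.transpose(-1, -2), v.transpose(-1, -2)
attn = (qt @ k * qt.size(-1) ** -0.5).softmax(dim=-1)
manual = (attn @ vt).transpose(-1, -2)

# Fused path: SDPA with all tensors in (B, H, N, D) layout.
fused = F.scaled_dot_product_attention(qt, k.transpose(-1, -2), vt).transpose(-1, -2)
print(torch.allclose(manual, fused, atol=1e-5))  # matches up to numeric tolerance
```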
timm/layers/create_norm.py  +3 -1

@@ -10,7 +10,7 @@
 
 import torch.nn as nn
 
-from .norm import GroupNorm, GroupNorm1, LayerNorm, LayerNorm2d, RmsNorm, RmsNorm2d
+from .norm import GroupNorm, GroupNorm1, LayerNorm, LayerNorm2d, RmsNorm, RmsNorm2d, SimpleNorm, SimpleNorm2d
 from torchvision.ops.misc import FrozenBatchNorm2d
 
 _NORM_MAP = dict(

@@ -23,6 +23,8 @@
     layernorm2d=LayerNorm2d,
     rmsnorm=RmsNorm,
     rmsnorm2d=RmsNorm2d,
+    simplenorm=SimpleNorm,
+    simplenorm2d=SimpleNorm2d,
     frozenbatchnorm2d=FrozenBatchNorm2d,
 )
 _NORM_TYPES = {m for n, m in _NORM_MAP.items()}

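With the two new registry entries, the simple-norm layers become reachable by name through this module's factory helpers (a brief sketch; assumes the `create_norm_layer` factory defined alongside `_NORM_MAP`):

```python
import torch
from timm.layers import create_norm_layer

# Request the new 2d variant by its registry key; num_features sized for a conv feature map.
norm = create_norm_layer('simplenorm2d', 64)
y = norm(torch.randn(2, 64, 7, 7))
```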
timm/layers/fast_norm.py  +62 -6

@@ -24,6 +24,8 @@
 has_apex_rmsnorm = False
 
 
+has_torch_rms_norm = hasattr(F, 'rms_norm')
+
 # fast (ie lower precision LN) can be disabled with this flag if issues crop up
 _USE_FAST_NORM = False  # defaulting to False for now
 

@@ -75,7 +77,6 @@ def fast_group_norm(
     if is_autocast_enabled(x.device.type):
         # normally native AMP casts GN inputs to float32
         # here we use the low precision autocast dtype
-        # FIXME what to do re CPU autocast?
         dt = get_autocast_dtype(x.device.type)
         x, weight, bias = x.to(dt), weight.to(dt), bias.to(dt) if bias is not None else None
 

@@ -101,7 +102,6 @@ def fast_layer_norm(
         # normally native AMP casts LN inputs to float32
         # apex LN does not, this is behaving like Apex
         dt = get_autocast_dtype(x.device.type)
-        # FIXME what to do re CPU autocast?
         x, weight, bias = x.to(dt), weight.to(dt), bias.to(dt) if bias is not None else None
 
     with torch.amp.autocast(device_type=x.device.type, enabled=False):

@@ -115,15 +115,16 @@ def rms_norm(
     eps: float = 1e-5,
 ):
     norm_ndim = len(normalized_shape)
+    v = x.pow(2)
     if torch.jit.is_scripting():
         # ndim = len(x.shape)
         # dims = list(range(ndim - norm_ndim, ndim))  # this doesn't work on pytorch <= 1.13.x
         # NOTE -ve dims cause torchscript to crash in some cases, out of options to work around
         assert norm_ndim == 1
-        v = torch.var(x, dim=-1).unsqueeze(-1)  # ts crashes with -ve dim + keepdim=True
+        v = torch.mean(v, dim=-1).unsqueeze(-1)  # ts crashes with -ve dim + keepdim=True
     else:
         dims = tuple(range(-1, -norm_ndim - 1, -1))
-        v = torch.var(x, dim=dims, keepdim=True)
+        v = torch.mean(v, dim=dims, keepdim=True)
     x = x * torch.rsqrt(v + eps)
     if weight is not None:
         x = x * weight

@@ -146,5 +147,60 @@ def fast_rms_norm(
         else:
             return fused_rms_norm_affine(x, weight, normalized_shape, eps)
 
-    # fallback
-    return rms_norm(x, normalized_shape, weight, eps)
+    if is_autocast_enabled(x.device.type):
+        # normally native AMP casts LN inputs to float32
+        # apex LN does not, this is behaving like Apex
+        dt = get_autocast_dtype(x.device.type)
+        x, weight = x.to(dt), weight.to(dt)
+
+    with torch.amp.autocast(device_type=x.device.type, enabled=False):
+        if has_torch_rms_norm:
+            x = F.rms_norm(x, normalized_shape, weight, eps)
+        else:
+            x = rms_norm(x, normalized_shape, weight, eps)
+
+    return x
+
+
+def simple_norm(
+    x: torch.Tensor,
+    normalized_shape: List[int],
+    weight: Optional[torch.Tensor] = None,
+    eps: float = 1e-5,
+):
+    norm_ndim = len(normalized_shape)
+    if torch.jit.is_scripting():
+        # ndim = len(x.shape)
+        # dims = list(range(ndim - norm_ndim, ndim))  # this doesn't work on pytorch <= 1.13.x
+        # NOTE -ve dims cause torchscript to crash in some cases, out of options to work around
+        assert norm_ndim == 1
+        v = torch.var(x, dim=-1).unsqueeze(-1)  # ts crashes with -ve dim + keepdim=True
+    else:
+        dims = tuple(range(-1, -norm_ndim - 1, -1))
+        v = torch.var(x, dim=dims, keepdim=True)
+    x = x * torch.rsqrt(v + eps)
+    if weight is not None:
+        x = x * weight
+    return x
+
+
+def fast_simple_norm(
+    x: torch.Tensor,
+    normalized_shape: List[int],
+    weight: Optional[torch.Tensor] = None,
+    eps: float = 1e-5,
+) -> torch.Tensor:
+    if torch.jit.is_scripting():
+        # this must be by itself, cannot merge with has_apex_rmsnorm
+        return simple_norm(x, normalized_shape, weight, eps)
+
+    if is_autocast_enabled(x.device.type):
+        # normally native AMP casts LN inputs to float32
+        # apex LN does not, this is behaving like Apex
+        dt = get_autocast_dtype(x.device.type)
+        x, weight = x.to(dt), weight.to(dt)
+
+    with torch.amp.autocast(device_type=x.device.type, enabled=False):
+        x = simple_norm(x, normalized_shape, weight, eps)
+    return x
+
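The functional difference between the fixed `rms_norm` and the preserved-old-behaviour `simple_norm` comes down to mean-of-squares vs. variance (a standalone illustration, not part of the commit):

```python
import torch

x = torch.randn(2, 8, 16)
eps = 1e-5

# Fixed RMSNorm: normalize by the root mean square of x over the last dim.
rms = x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

# SimpleNorm (the old RmsNorm behaviour): divide by the std (variance is computed
# around the mean), but never subtract the mean from x itself, i.e. LN without
# centering or bias.
simple = x * torch.rsqrt(x.var(dim=-1, keepdim=True) + eps)

# The two only coincide when x is zero-mean along the normalized dimension.
print(torch.allclose(rms, simple, atol=1e-4))  # generally False
```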