diff --git a/README.md b/README.md index 30be3cb157..84a370ae11 100644 --- a/README.md +++ b/README.md @@ -64,9 +64,9 @@ ## Nov 28, 2024 * More optimizers - * Add MARS optimizer (https://arxiv.org/abs/2411.10438, https://github.com/AGI-Arena/MARS) - * Add LaProp optimizer (https://arxiv.org/abs/2002.04839, https://github.com/Z-T-WANG/LaProp-Optimizer) - * Add masking from 'Cautious Optimizers' (https://arxiv.org/abs/2411.16085, https://github.com/kyleliang919/C-Optim) to Adafactor, Adafactor Big Vision, AdamW (legacy), Adopt, Lamb, LaProp, Lion, NadamW, RMSPropTF, SGDW + * Add MARS optimizer (https://huggingface.co/papers/2411.10438, https://github.com/AGI-Arena/MARS) + * Add LaProp optimizer (https://huggingface.co/papers/2002.04839, https://github.com/Z-T-WANG/LaProp-Optimizer) + * Add masking from 'Cautious Optimizers' (https://huggingface.co/papers/2411.16085, https://github.com/kyleliang919/C-Optim) to Adafactor, Adafactor Big Vision, AdamW (legacy), Adopt, Lamb, LaProp, Lion, NadamW, RMSPropTF, SGDW * Cleanup some docstrings and type annotations re optimizers and factory * Add MobileNet-V4 Conv Medium models pretrained on in12k and fine-tuned in1k @ 384x384 * https://huggingface.co/timm/mobilenetv4_conv_medium.e250_r384_in12k_ft_in1k @@ -173,7 +173,7 @@ Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weight |hiera_small_abswin_256.sbb2_pd_e200_in12k_ft_in1k |84.560|97.106|35.01 | ### Aug 8, 2024 -* Add RDNet ('DenseNets Reloaded', https://arxiv.org/abs/2403.19588), thanks [Donghyun Kim](https://github.com/dhkim0225) +* Add RDNet ('DenseNets Reloaded', https://huggingface.co/papers/2403.19588), thanks [Donghyun Kim](https://github.com/dhkim0225) ### July 28, 2024 * Add `mobilenet_edgetpu_v2_m` weights w/ `ra4` mnv4-small based recipe. 80.1% top-1 @ 224 and 80.7 @ 256. @@ -258,8 +258,8 @@ Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weight | [mobilenetv4_conv_small.e2400_r224_in1k](http://hf.co/timm/mobilenetv4_conv_small.e2400_r224_in1k) |73.756|26.244 |91.422|8.578 |3.77 |224 | | [mobilenetv4_conv_small.e1200_r224_in1k](http://hf.co/timm/mobilenetv4_conv_small.e1200_r224_in1k) |73.454|26.546 |91.34 |8.66 |3.77 |224 | -* Apple MobileCLIP (https://arxiv.org/pdf/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support). -* ViTamin (https://arxiv.org/abs/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support). +* Apple MobileCLIP (https://huggingface.co/papers/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support). +* ViTamin (https://huggingface.co/papers/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support). * OpenAI CLIP Modified ResNet image tower modelling & weight support (via ByobNet). Refactor AttentionPool2d. ### May 14, 2024 @@ -358,161 +358,161 @@ The work of many others is present here. I've tried to make sure all source mate All model architecture families include variants with pretrained weights. There are specific model variants without any weights, it is NOT a bug. Help training new or better weights is always appreciated. 
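As context for the model list below: a minimal sketch of discovering which variants ship pretrained weights via the public `timm` API (the wildcard filter is illustrative, not part of this diff):

```python
import timm

# List all model names that have at least one pretrained weight available.
pretrained_names = timm.list_models(pretrained=True)

# Narrow the search with a wildcard filter, e.g. ConvNeXt variants only.
convnext_pretrained = timm.list_models('convnext*', pretrained=True)

# Create a model with its default pretrained weights loaded.
model = timm.create_model('resnet50', pretrained=True)
```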
-* Aggregating Nested Transformers - https://arxiv.org/abs/2105.12723 -* BEiT - https://arxiv.org/abs/2106.08254 -* Big Transfer ResNetV2 (BiT) - https://arxiv.org/abs/1912.11370 -* Bottleneck Transformers - https://arxiv.org/abs/2101.11605 -* CaiT (Class-Attention in Image Transformers) - https://arxiv.org/abs/2103.17239 -* CoaT (Co-Scale Conv-Attentional Image Transformers) - https://arxiv.org/abs/2104.06399 -* CoAtNet (Convolution and Attention) - https://arxiv.org/abs/2106.04803 -* ConvNeXt - https://arxiv.org/abs/2201.03545 -* ConvNeXt-V2 - http://arxiv.org/abs/2301.00808 -* ConViT (Soft Convolutional Inductive Biases Vision Transformers)- https://arxiv.org/abs/2103.10697 -* CspNet (Cross-Stage Partial Networks) - https://arxiv.org/abs/1911.11929 -* DeiT - https://arxiv.org/abs/2012.12877 -* DeiT-III - https://arxiv.org/pdf/2204.07118.pdf -* DenseNet - https://arxiv.org/abs/1608.06993 -* DLA - https://arxiv.org/abs/1707.06484 -* DPN (Dual-Path Network) - https://arxiv.org/abs/1707.01629 -* EdgeNeXt - https://arxiv.org/abs/2206.10589 -* EfficientFormer - https://arxiv.org/abs/2206.01191 +* Aggregating Nested Transformers - https://huggingface.co/papers/2105.12723 +* BEiT - https://huggingface.co/papers/2106.08254 +* Big Transfer ResNetV2 (BiT) - https://huggingface.co/papers/1912.11370 +* Bottleneck Transformers - https://huggingface.co/papers/2101.11605 +* CaiT (Class-Attention in Image Transformers) - https://huggingface.co/papers/2103.17239 +* CoaT (Co-Scale Conv-Attentional Image Transformers) - https://huggingface.co/papers/2104.06399 +* CoAtNet (Convolution and Attention) - https://huggingface.co/papers/2106.04803 +* ConvNeXt - https://huggingface.co/papers/2201.03545 +* ConvNeXt-V2 - https://huggingface.co/papers/2301.00808 +* ConViT (Soft Convolutional Inductive Biases Vision Transformers)- https://huggingface.co/papers/2103.10697 +* CspNet (Cross-Stage Partial Networks) - https://huggingface.co/papers/1911.11929 +* DeiT - https://huggingface.co/papers/2012.12877 +* DeiT-III - https://huggingface.co/papers/2204.07118 +* DenseNet - https://huggingface.co/papers/1608.06993 +* DLA - https://huggingface.co/papers/1707.06484 +* DPN (Dual-Path Network) - https://huggingface.co/papers/1707.01629 +* EdgeNeXt - https://huggingface.co/papers/2206.10589 +* EfficientFormer - https://huggingface.co/papers/2206.01191 * EfficientNet (MBConvNet Family) - * EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252 - * EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665 - * EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946 + * EfficientNet NoisyStudent (B0-B7, L2) - https://huggingface.co/papers/1911.04252 + * EfficientNet AdvProp (B0-B8) - https://huggingface.co/papers/1911.09665 + * EfficientNet (B0-B7) - https://huggingface.co/papers/1905.11946 * EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html - * EfficientNet V2 - https://arxiv.org/abs/2104.00298 - * FBNet-C - https://arxiv.org/abs/1812.03443 - * MixNet - https://arxiv.org/abs/1907.09595 - * MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626 - * MobileNet-V2 - https://arxiv.org/abs/1801.04381 - * Single-Path NAS - https://arxiv.org/abs/1904.02877 - * TinyNet - https://arxiv.org/abs/2010.14819 -* EfficientViT (MIT) - https://arxiv.org/abs/2205.14756 -* EfficientViT (MSRA) - https://arxiv.org/abs/2305.07027 -* EVA - https://arxiv.org/abs/2211.07636 -* EVA-02 - https://arxiv.org/abs/2303.11331 -* FastViT - 
https://arxiv.org/abs/2303.14189 -* FlexiViT - https://arxiv.org/abs/2212.08013 -* FocalNet (Focal Modulation Networks) - https://arxiv.org/abs/2203.11926 -* GCViT (Global Context Vision Transformer) - https://arxiv.org/abs/2206.09959 -* GhostNet - https://arxiv.org/abs/1911.11907 -* GhostNet-V2 - https://arxiv.org/abs/2211.12905 -* gMLP - https://arxiv.org/abs/2105.08050 -* GPU-Efficient Networks - https://arxiv.org/abs/2006.14090 -* Halo Nets - https://arxiv.org/abs/2103.12731 + * EfficientNet V2 - https://huggingface.co/papers/2104.00298 + * FBNet-C - https://huggingface.co/papers/1812.03443 + * MixNet - https://huggingface.co/papers/1907.09595 + * MNASNet B1, A1 (Squeeze-Excite), and Small - https://huggingface.co/papers/1807.11626 + * MobileNet-V2 - https://huggingface.co/papers/1801.04381 + * Single-Path NAS - https://huggingface.co/papers/1904.02877 + * TinyNet - https://huggingface.co/papers/2010.14819 +* EfficientViT (MIT) - https://huggingface.co/papers/2205.14756 +* EfficientViT (MSRA) - https://huggingface.co/papers/2305.07027 +* EVA - https://huggingface.co/papers/2211.07636 +* EVA-02 - https://huggingface.co/papers/2303.11331 +* FastViT - https://huggingface.co/papers/2303.14189 +* FlexiViT - https://huggingface.co/papers/2212.08013 +* FocalNet (Focal Modulation Networks) - https://huggingface.co/papers/2203.11926 +* GCViT (Global Context Vision Transformer) - https://huggingface.co/papers/2206.09959 +* GhostNet - https://huggingface.co/papers/1911.11907 +* GhostNet-V2 - https://huggingface.co/papers/2211.12905 +* gMLP - https://huggingface.co/papers/2105.08050 +* GPU-Efficient Networks - https://huggingface.co/papers/2006.14090 +* Halo Nets - https://huggingface.co/papers/2103.12731 * HGNet / HGNet-V2 - TBD -* HRNet - https://arxiv.org/abs/1908.07919 -* InceptionNeXt - https://arxiv.org/abs/2303.16900 -* Inception-V3 - https://arxiv.org/abs/1512.00567 -* Inception-ResNet-V2 and Inception-V4 - https://arxiv.org/abs/1602.07261 -* Lambda Networks - https://arxiv.org/abs/2102.08602 -* LeViT (Vision Transformer in ConvNet's Clothing) - https://arxiv.org/abs/2104.01136 -* MambaOut - https://arxiv.org/abs/2405.07992 -* MaxViT (Multi-Axis Vision Transformer) - https://arxiv.org/abs/2204.01697 -* MetaFormer (PoolFormer-v2, ConvFormer, CAFormer) - https://arxiv.org/abs/2210.13452 -* MLP-Mixer - https://arxiv.org/abs/2105.01601 -* MobileCLIP - https://arxiv.org/abs/2311.17049 -* MobileNet-V3 (MBConvNet w/ Efficient Head) - https://arxiv.org/abs/1905.02244 - * FBNet-V3 - https://arxiv.org/abs/2006.02049 - * HardCoRe-NAS - https://arxiv.org/abs/2102.11646 - * LCNet - https://arxiv.org/abs/2109.15099 -* MobileNetV4 - https://arxiv.org/abs/2404.10518 -* MobileOne - https://arxiv.org/abs/2206.04040 -* MobileViT - https://arxiv.org/abs/2110.02178 -* MobileViT-V2 - https://arxiv.org/abs/2206.02680 -* MViT-V2 (Improved Multiscale Vision Transformer) - https://arxiv.org/abs/2112.01526 -* NASNet-A - https://arxiv.org/abs/1707.07012 -* NesT - https://arxiv.org/abs/2105.12723 -* Next-ViT - https://arxiv.org/abs/2207.05501 -* NFNet-F - https://arxiv.org/abs/2102.06171 -* NF-RegNet / NF-ResNet - https://arxiv.org/abs/2101.08692 -* PNasNet - https://arxiv.org/abs/1712.00559 -* PoolFormer (MetaFormer) - https://arxiv.org/abs/2111.11418 -* Pooling-based Vision Transformer (PiT) - https://arxiv.org/abs/2103.16302 -* PVT-V2 (Improved Pyramid Vision Transformer) - https://arxiv.org/abs/2106.13797 -* RDNet (DenseNets Reloaded) - https://arxiv.org/abs/2403.19588 -* RegNet - https://arxiv.org/abs/2003.13678 
-* RegNetZ - https://arxiv.org/abs/2103.06877 -* RepVGG - https://arxiv.org/abs/2101.03697 -* RepGhostNet - https://arxiv.org/abs/2211.06088 -* RepViT - https://arxiv.org/abs/2307.09283 -* ResMLP - https://arxiv.org/abs/2105.03404 +* HRNet - https://huggingface.co/papers/1908.07919 +* InceptionNeXt - https://huggingface.co/papers/2303.16900 +* Inception-V3 - https://huggingface.co/papers/1512.00567 +* Inception-ResNet-V2 and Inception-V4 - https://huggingface.co/papers/1602.07261 +* Lambda Networks - https://huggingface.co/papers/2102.08602 +* LeViT (Vision Transformer in ConvNet's Clothing) - https://huggingface.co/papers/2104.01136 +* MambaOut - https://huggingface.co/papers/2405.07992 +* MaxViT (Multi-Axis Vision Transformer) - https://huggingface.co/papers/2204.01697 +* MetaFormer (PoolFormer-v2, ConvFormer, CAFormer) - https://huggingface.co/papers/2210.13452 +* MLP-Mixer - https://huggingface.co/papers/2105.01601 +* MobileCLIP - https://huggingface.co/papers/2311.17049 +* MobileNet-V3 (MBConvNet w/ Efficient Head) - https://huggingface.co/papers/1905.02244 + * FBNet-V3 - https://huggingface.co/papers/2006.02049 + * HardCoRe-NAS - https://huggingface.co/papers/2102.11646 + * LCNet - https://huggingface.co/papers/2109.15099 +* MobileNetV4 - https://huggingface.co/papers/2404.10518 +* MobileOne - https://huggingface.co/papers/2206.04040 +* MobileViT - https://huggingface.co/papers/2110.02178 +* MobileViT-V2 - https://huggingface.co/papers/2206.02680 +* MViT-V2 (Improved Multiscale Vision Transformer) - https://huggingface.co/papers/2112.01526 +* NASNet-A - https://huggingface.co/papers/1707.07012 +* NesT - https://huggingface.co/papers/2105.12723 +* Next-ViT - https://huggingface.co/papers/2207.05501 +* NFNet-F - https://huggingface.co/papers/2102.06171 +* NF-RegNet / NF-ResNet - https://huggingface.co/papers/2101.08692 +* PNasNet - https://huggingface.co/papers/1712.00559 +* PoolFormer (MetaFormer) - https://huggingface.co/papers/2111.11418 +* Pooling-based Vision Transformer (PiT) - https://huggingface.co/papers/2103.16302 +* PVT-V2 (Improved Pyramid Vision Transformer) - https://huggingface.co/papers/2106.13797 +* RDNet (DenseNets Reloaded) - https://huggingface.co/papers/2403.19588 +* RegNet - https://huggingface.co/papers/2003.13678 +* RegNetZ - https://huggingface.co/papers/2103.06877 +* RepVGG - https://huggingface.co/papers/2101.03697 +* RepGhostNet - https://huggingface.co/papers/2211.06088 +* RepViT - https://huggingface.co/papers/2307.09283 +* ResMLP - https://huggingface.co/papers/2105.03404 * ResNet/ResNeXt - * ResNet (v1b/v1.5) - https://arxiv.org/abs/1512.03385 - * ResNeXt - https://arxiv.org/abs/1611.05431 - * 'Bag of Tricks' / Gluon C, D, E, S variations - https://arxiv.org/abs/1812.01187 - * Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://arxiv.org/abs/1805.00932 - * Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://arxiv.org/abs/1905.00546 - * ECA-Net (ECAResNet) - https://arxiv.org/abs/1910.03151v4 - * Squeeze-and-Excitation Networks (SEResNet) - https://arxiv.org/abs/1709.01507 - * ResNet-RS - https://arxiv.org/abs/2103.07579 -* Res2Net - https://arxiv.org/abs/1904.01169 -* ResNeSt - https://arxiv.org/abs/2004.08955 -* ReXNet - https://arxiv.org/abs/2007.00992 -* SelecSLS - https://arxiv.org/abs/1907.00837 -* Selective Kernel Networks - https://arxiv.org/abs/1903.06586 -* Sequencer2D - https://arxiv.org/abs/2205.01972 -* SigLIP (image encoder) - https://arxiv.org/abs/2303.15343 -* SigLIP 2 (image 
encoder) - https://arxiv.org/abs/2502.14786 -* Swin S3 (AutoFormerV2) - https://arxiv.org/abs/2111.14725 -* Swin Transformer - https://arxiv.org/abs/2103.14030 -* Swin Transformer V2 - https://arxiv.org/abs/2111.09883 -* Transformer-iN-Transformer (TNT) - https://arxiv.org/abs/2103.00112 -* TResNet - https://arxiv.org/abs/2003.13630 -* Twins (Spatial Attention in Vision Transformers) - https://arxiv.org/pdf/2104.13840.pdf -* Visformer - https://arxiv.org/abs/2104.12533 -* Vision Transformer - https://arxiv.org/abs/2010.11929 -* ViTamin - https://arxiv.org/abs/2404.02132 -* VOLO (Vision Outlooker) - https://arxiv.org/abs/2106.13112 -* VovNet V2 and V1 - https://arxiv.org/abs/1911.06667 -* Xception - https://arxiv.org/abs/1610.02357 -* Xception (Modified Aligned, Gluon) - https://arxiv.org/abs/1802.02611 -* Xception (Modified Aligned, TF) - https://arxiv.org/abs/1802.02611 -* XCiT (Cross-Covariance Image Transformers) - https://arxiv.org/abs/2106.09681 + * ResNet (v1b/v1.5) - https://huggingface.co/papers/1512.03385 + * ResNeXt - https://huggingface.co/papers/1611.05431 + * 'Bag of Tricks' / Gluon C, D, E, S variations - https://huggingface.co/papers/1812.01187 + * Weakly-supervised (WSL) Instagram pretrained / ImageNet tuned ResNeXt101 - https://huggingface.co/papers/1805.00932 + * Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet/ResNeXts - https://huggingface.co/papers/1905.00546 + * ECA-Net (ECAResNet) - https://huggingface.co/papers/1910.03151v4 + * Squeeze-and-Excitation Networks (SEResNet) - https://huggingface.co/papers/1709.01507 + * ResNet-RS - https://huggingface.co/papers/2103.07579 +* Res2Net - https://huggingface.co/papers/1904.01169 +* ResNeSt - https://huggingface.co/papers/2004.08955 +* ReXNet - https://huggingface.co/papers/2007.00992 +* SelecSLS - https://huggingface.co/papers/1907.00837 +* Selective Kernel Networks - https://huggingface.co/papers/1903.06586 +* Sequencer2D - https://huggingface.co/papers/2205.01972 +* SigLIP (image encoder) - https://huggingface.co/papers/2303.15343 +* SigLIP 2 (image encoder) - https://huggingface.co/papers/2502.14786 +* Swin S3 (AutoFormerV2) - https://huggingface.co/papers/2111.14725 +* Swin Transformer - https://huggingface.co/papers/2103.14030 +* Swin Transformer V2 - https://huggingface.co/papers/2111.09883 +* Transformer-iN-Transformer (TNT) - https://huggingface.co/papers/2103.00112 +* TResNet - https://huggingface.co/papers/2003.13630 +* Twins (Spatial Attention in Vision Transformers) - https://huggingface.co/papers/2104.13840 +* Visformer - https://huggingface.co/papers/2104.12533 +* Vision Transformer - https://huggingface.co/papers/2010.11929 +* ViTamin - https://huggingface.co/papers/2404.02132 +* VOLO (Vision Outlooker) - https://huggingface.co/papers/2106.13112 +* VovNet V2 and V1 - https://huggingface.co/papers/1911.06667 +* Xception - https://huggingface.co/papers/1610.02357 +* Xception (Modified Aligned, Gluon) - https://huggingface.co/papers/1802.02611 +* Xception (Modified Aligned, TF) - https://huggingface.co/papers/1802.02611 +* XCiT (Cross-Covariance Image Transformers) - https://huggingface.co/papers/2106.09681 ### Optimizers To see full list of optimizers w/ descriptions: `timm.optim.list_optimizers(with_description=True)` Included optimizers available via `timm.optim.create_optimizer_v2` factory method: -* `adabelief` an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://arxiv.org/abs/2010.07468 -* `adafactor` adapted from [FAIRSeq 
impl](https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py) - https://arxiv.org/abs/1804.04235 -* `adafactorbv` adapted from [Big Vision](https://github.com/google-research/big_vision/blob/main/big_vision/optax.py) - https://arxiv.org/abs/2106.04560 -* `adahessian` by [David Samuel](https://github.com/davda54/ada-hessian) - https://arxiv.org/abs/2006.00719 -* `adamp` and `sgdp` by [Naver ClovAI](https://github.com/clovaai) - https://arxiv.org/abs/2006.08217 -* `adan` an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://arxiv.org/abs/2208.06677 -* `adopt` ADOPT adapted from https://github.com/iShohei220/adopt - https://arxiv.org/abs/2411.02853 +* `adabelief` an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://huggingface.co/papers/2010.07468 +* `adafactor` adapted from [FAIRSeq impl](https://github.com/pytorch/fairseq/blob/master/fairseq/optim/adafactor.py) - https://huggingface.co/papers/1804.04235 +* `adafactorbv` adapted from [Big Vision](https://github.com/google-research/big_vision/blob/main/big_vision/optax.py) - https://huggingface.co/papers/2106.04560 +* `adahessian` by [David Samuel](https://github.com/davda54/ada-hessian) - https://huggingface.co/papers/2006.00719 +* `adamp` and `sgdp` by [Naver ClovAI](https://github.com/clovaai) - https://huggingface.co/papers/2006.08217 +* `adan` an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://huggingface.co/papers/2208.06677 +* `adopt` ADOPT adapted from https://github.com/iShohei220/adopt - https://huggingface.co/papers/2411.02853 * `kron` PSGD w/ Kronecker-factored preconditioner from https://github.com/evanatyourservice/kron_torch - https://sites.google.com/site/lixilinx/home/psgd -* `lamb` an implementation of Lamb and LambC (w/ trust-clipping) cleaned up and modified to support use with XLA - https://arxiv.org/abs/1904.00962 -* `laprop` optimizer from https://github.com/Z-T-WANG/LaProp-Optimizer - https://arxiv.org/abs/2002.04839 -* `lars` an implementation of LARS and LARC (w/ trust-clipping) - https://arxiv.org/abs/1708.03888 -* `lion` and implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://arxiv.org/abs/2302.06675 -* `lookahead` adapted from impl by [Liam](https://github.com/alphadl/lookahead.pytorch) - https://arxiv.org/abs/1907.08610 -* `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075 -* `mars` MARS optimizer from https://github.com/AGI-Arena/MARS - https://arxiv.org/abs/2411.10438 +* `lamb` an implementation of Lamb and LambC (w/ trust-clipping) cleaned up and modified to support use with XLA - https://huggingface.co/papers/1904.00962 +* `laprop` optimizer from https://github.com/Z-T-WANG/LaProp-Optimizer - https://huggingface.co/papers/2002.04839 +* `lars` an implementation of LARS and LARC (w/ trust-clipping) - https://huggingface.co/papers/1708.03888 +* `lion` and implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://huggingface.co/papers/2302.06675 +* `lookahead` adapted from impl by [Liam](https://github.com/alphadl/lookahead.pytorch) - https://huggingface.co/papers/1907.08610 +* `madgrad` an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://huggingface.co/papers/2101.11075 +* `mars` MARS optimizer from https://github.com/AGI-Arena/MARS - https://huggingface.co/papers/2411.10438 * 
`nadam` an implementation of Adam w/ Nesterov momentum * `nadamw` an implementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency -* `novograd` by [Masashi Kimura](https://github.com/convergence-lab/novograd) - https://arxiv.org/abs/1905.11286 -* `radam` by [Liyuan Liu](https://github.com/LiyuanLucasLiu/RAdam) - https://arxiv.org/abs/1908.03265 +* `novograd` by [Masashi Kimura](https://github.com/convergence-lab/novograd) - https://huggingface.co/papers/1905.11286 +* `radam` by [Liyuan Liu](https://github.com/LiyuanLucasLiu/RAdam) - https://huggingface.co/papers/1908.03265 * `rmsprop_tf` adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour * `sgdw` and implementation of SGD w/ decoupled weight-decay * `fused` optimizers by name with [NVIDIA Apex](https://github.com/NVIDIA/apex/tree/master/apex/optimizers) installed * `bnb` optimizers by name with [BitsAndBytes](https://github.com/TimDettmers/bitsandbytes) installed -* `cadamw`, `clion`, and more 'Cautious' optimizers from https://github.com/kyleliang919/C-Optim - https://arxiv.org/abs/2411.16085 +* `cadamw`, `clion`, and more 'Cautious' optimizers from https://github.com/kyleliang919/C-Optim - https://huggingface.co/papers/2411.16085 * `adam`, `adamw`, `rmsprop`, `adadelta`, `adagrad`, and `sgd` pass through to `torch.optim` implementations ### Augmentations -* Random Erasing from [Zhun Zhong](https://github.com/zhunzhong07/Random-Erasing/blob/master/transforms.py) - https://arxiv.org/abs/1708.04896) -* Mixup - https://arxiv.org/abs/1710.09412 -* CutMix - https://arxiv.org/abs/1905.04899 -* AutoAugment (https://arxiv.org/abs/1805.09501) and RandAugment (https://arxiv.org/abs/1909.13719) ImageNet configurations modeled after impl for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py) -* AugMix w/ JSD loss, JSD w/ clean + augmented mixing support works with AutoAugment and RandAugment as well - https://arxiv.org/abs/1912.02781 +* Random Erasing from [Zhun Zhong](https://github.com/zhunzhong07/Random-Erasing/blob/master/transforms.py) - https://huggingface.co/papers/1708.04896 +* Mixup - https://huggingface.co/papers/1710.09412 +* CutMix - https://huggingface.co/papers/1905.04899 +* AutoAugment (https://huggingface.co/papers/1805.09501) and RandAugment (https://huggingface.co/papers/1909.13719) ImageNet configurations modeled after impl for EfficientNet training (https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/autoaugment.py) +* AugMix w/ JSD loss, JSD w/ clean + augmented mixing support works with AutoAugment and RandAugment as well - https://huggingface.co/papers/1912.02781 * SplitBachNorm - allows splitting batch norm layers between clean and augmented (auxiliary batch norm) data ### Regularization -* DropPath aka "Stochastic Depth" - https://arxiv.org/abs/1603.09382 -* DropBlock - https://arxiv.org/abs/1810.12890 -* Blur Pooling - https://arxiv.org/abs/1904.11486 +* DropPath aka "Stochastic Depth" - https://huggingface.co/papers/1603.09382 +* DropBlock - https://huggingface.co/papers/1810.12890 +* Blur Pooling - https://huggingface.co/papers/1904.11486 ### Other @@ -538,25 +538,25 @@ Several (less common) features that I often utilize in my projects are included. 
* Ideas adopted from * [AllenNLP schedulers](https://github.com/allenai/allennlp/tree/master/allennlp/training/learning_rate_schedulers) * [FAIRseq lr_scheduler](https://github.com/pytorch/fairseq/tree/master/fairseq/optim/lr_scheduler) - * SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983) + * SGDR: Stochastic Gradient Descent with Warm Restarts (https://huggingface.co/papers/1608.03983) * Schedulers include `step`, `cosine` w/ restarts, `tanh` w/ restarts, `plateau` -* Space-to-Depth by [mrT23](https://github.com/mrT23/TResNet/blob/master/src/models/tresnet/layers/space_to_depth.py) (https://arxiv.org/abs/1801.04590) -- original paper? -* Adaptive Gradient Clipping (https://arxiv.org/abs/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets) +* Space-to-Depth by [mrT23](https://github.com/mrT23/TResNet/blob/master/src/models/tresnet/layers/space_to_depth.py) (https://huggingface.co/papers/1801.04590) -- original paper? +* Adaptive Gradient Clipping (https://huggingface.co/papers/2102.06171, https://github.com/deepmind/deepmind-research/tree/master/nfnets) * An extensive selection of channel and/or spatial attention modules: - * Bottleneck Transformer - https://arxiv.org/abs/2101.11605 - * CBAM - https://arxiv.org/abs/1807.06521 - * Effective Squeeze-Excitation (ESE) - https://arxiv.org/abs/1911.06667 - * Efficient Channel Attention (ECA) - https://arxiv.org/abs/1910.03151 - * Gather-Excite (GE) - https://arxiv.org/abs/1810.12348 - * Global Context (GC) - https://arxiv.org/abs/1904.11492 - * Halo - https://arxiv.org/abs/2103.12731 - * Involution - https://arxiv.org/abs/2103.06255 - * Lambda Layer - https://arxiv.org/abs/2102.08602 - * Non-Local (NL) - https://arxiv.org/abs/1711.07971 - * Squeeze-and-Excitation (SE) - https://arxiv.org/abs/1709.01507 - * Selective Kernel (SK) - (https://arxiv.org/abs/1903.06586 - * Split (SPLAT) - https://arxiv.org/abs/2004.08955 - * Shifted Window (SWIN) - https://arxiv.org/abs/2103.14030 + * Bottleneck Transformer - https://huggingface.co/papers/2101.11605 + * CBAM - https://huggingface.co/papers/1807.06521 + * Effective Squeeze-Excitation (ESE) - https://huggingface.co/papers/1911.06667 + * Efficient Channel Attention (ECA) - https://huggingface.co/papers/1910.03151 + * Gather-Excite (GE) - https://huggingface.co/papers/1810.12348 + * Global Context (GC) - https://huggingface.co/papers/1904.11492 + * Halo - https://huggingface.co/papers/2103.12731 + * Involution - https://huggingface.co/papers/2103.06255 + * Lambda Layer - https://huggingface.co/papers/2102.08602 + * Non-Local (NL) - https://huggingface.co/papers/1711.07971 + * Squeeze-and-Excitation (SE) - https://huggingface.co/papers/1709.01507 + * Selective Kernel (SK) - https://huggingface.co/papers/1903.06586 + * Split (SPLAT) - https://huggingface.co/papers/2004.08955 + * Shifted Window (SWIN) - https://huggingface.co/papers/2103.14030 ## Results diff --git a/hfdocs/source/changes.mdx b/hfdocs/source/changes.mdx index 741b13a2d6..b7226b1c1d 100644 --- a/hfdocs/source/changes.mdx +++ b/hfdocs/source/changes.mdx @@ -33,9 +33,9 @@ ## Nov 28, 2024 * More optimizers - * Add MARS optimizer (https://arxiv.org/abs/2411.10438, https://github.com/AGI-Arena/MARS) - * Add LaProp optimizer (https://arxiv.org/abs/2002.04839, https://github.com/Z-T-WANG/LaProp-Optimizer) - * Add masking from 'Cautious Optimizers' (https://arxiv.org/abs/2411.16085, https://github.com/kyleliang919/C-Optim) to Adafactor, Adafactor Big Vision, AdamW (legacy), Adopt, Lamb, 
LaProp, Lion, NadamW, RMSPropTF, SGDW + * Add MARS optimizer (https://huggingface.co/papers/2411.10438, https://github.com/AGI-Arena/MARS) + * Add LaProp optimizer (https://huggingface.co/papers/2002.04839, https://github.com/Z-T-WANG/LaProp-Optimizer) + * Add masking from 'Cautious Optimizers' (https://huggingface.co/papers/2411.16085, https://github.com/kyleliang919/C-Optim) to Adafactor, Adafactor Big Vision, AdamW (legacy), Adopt, Lamb, LaProp, Lion, NadamW, RMSPropTF, SGDW * Cleanup some docstrings and type annotations re optimizers and factory * Add MobileNet-V4 Conv Medium models pretrained on in12k and fine-tuned in1k @ 384x384 * https://huggingface.co/timm/mobilenetv4_conv_medium.e250_r384_in12k_ft_in1k @@ -142,7 +142,7 @@ Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weight |hiera_small_abswin_256.sbb2_pd_e200_in12k_ft_in1k |84.560|97.106|35.01 | ### Aug 8, 2024 -* Add RDNet ('DenseNets Reloaded', https://arxiv.org/abs/2403.19588), thanks [Donghyun Kim](https://github.com/dhkim0225) +* Add RDNet ('DenseNets Reloaded', https://huggingface.co/papers/2403.19588), thanks [Donghyun Kim](https://github.com/dhkim0225) ### July 28, 2024 * Add `mobilenet_edgetpu_v2_m` weights w/ `ra4` mnv4-small based recipe. 80.1% top-1 @ 224 and 80.7 @ 256. @@ -227,8 +227,8 @@ Add a set of new very well trained ResNet & ResNet-V2 18/34 (basic block) weight | [mobilenetv4_conv_small.e2400_r224_in1k](http://hf.co/timm/mobilenetv4_conv_small.e2400_r224_in1k) |73.756|26.244 |91.422|8.578 |3.77 |224 | | [mobilenetv4_conv_small.e1200_r224_in1k](http://hf.co/timm/mobilenetv4_conv_small.e1200_r224_in1k) |73.454|26.546 |91.34 |8.66 |3.77 |224 | -* Apple MobileCLIP (https://arxiv.org/pdf/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support). -* ViTamin (https://arxiv.org/abs/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support). +* Apple MobileCLIP (https://huggingface.co/papers/2311.17049, FastViT and ViT-B) image tower model support & weights added (part of OpenCLIP support). +* ViTamin (https://huggingface.co/papers/2404.02132) CLIP image tower model & weights added (part of OpenCLIP support). * OpenAI CLIP Modified ResNet image tower modelling & weight support (via ByobNet). Refactor AttentionPool2d. 
### May 14, 2024 @@ -373,13 +373,13 @@ Datasets & transform refactoring ### Aug 25, 2023 * Many new models since last release - * FastViT - https://arxiv.org/abs/2303.14189 - * MobileOne - https://arxiv.org/abs/2206.04040 - * InceptionNeXt - https://arxiv.org/abs/2303.16900 - * RepGhostNet - https://arxiv.org/abs/2211.06088 (thanks https://github.com/ChengpengChen) - * GhostNetV2 - https://arxiv.org/abs/2211.12905 (thanks https://github.com/yehuitang) - * EfficientViT (MSRA) - https://arxiv.org/abs/2305.07027 (thanks https://github.com/seefun) - * EfficientViT (MIT) - https://arxiv.org/abs/2205.14756 (thanks https://github.com/seefun) + * FastViT - https://huggingface.co/papers/2303.14189 + * MobileOne - https://huggingface.co/papers/2206.04040 + * InceptionNeXt - https://huggingface.co/papers/2303.16900 + * RepGhostNet - https://huggingface.co/papers/2211.06088 (thanks https://github.com/ChengpengChen) + * GhostNetV2 - https://huggingface.co/papers/2211.12905 (thanks https://github.com/yehuitang) + * EfficientViT (MSRA) - https://huggingface.co/papers/2305.07027 (thanks https://github.com/seefun) + * EfficientViT (MIT) - https://huggingface.co/papers/2205.14756 (thanks https://github.com/seefun) * Add `--reparam` arg to `benchmark.py`, `onnx_export.py`, and `validate.py` to trigger layer reparameterization / fusion for models with any one of `reparameterize()`, `switch_to_deploy()` or `fuse()` * Including FastViT, MobileOne, RepGhostNet, EfficientViT (MSRA), RepViT, RepVGG, and LeViT * Preparing 0.9.6 'back to school' release @@ -396,7 +396,7 @@ Datasets & transform refactoring ### July 27, 2023 * Added timm trained `seresnextaa201d_32x8d.sw_in12k_ft_in1k_384` weights (and `.sw_in12k` pretrain) with 87.3% top-1 on ImageNet-1k, best ImageNet ResNet family model I'm aware of. -* RepViT model and weights (https://arxiv.org/abs/2307.09283) added by [wangao](https://github.com/jameslahm) +* RepViT model and weights (https://huggingface.co/papers/2307.09283) added by [wangao](https://github.com/jameslahm) * I-JEPA ViT feature weights (no classifier) added by [SeeFun](https://github.com/seefun) * SAM-ViT (segment anything) feature weights (no classifier) added by [SeeFun](https://github.com/seefun) * Add support for alternative feat extraction methods and -ve indices to EfficientNet @@ -506,9 +506,9 @@ Datasets & transform refactoring ### Feb 16, 2023 * `safetensor` checkpoint support added -* Add ideas from 'Scaling Vision Transformers to 22 B. Params' (https://arxiv.org/abs/2302.05442) -- qk norm, RmsNorm, parallel block +* Add ideas from 'Scaling Vision Transformers to 22 B. 
Params' (https://huggingface.co/papers/2302.05442) -- qk norm, RmsNorm, parallel block * Add F.scaled_dot_product_attention support (PyTorch 2.0 only) to `vit_*`, `vit_relpos*`, `coatnet` / `maxxvit` (to start) -* Lion optimizer (w/ multi-tensor option) added (https://arxiv.org/abs/2302.06675) +* Lion optimizer (w/ multi-tensor option) added (https://huggingface.co/papers/2302.06675) * gradient checkpointing works with `features_only=True` ### Feb 7, 2023 @@ -596,11 +596,11 @@ Datasets & transform refactoring ### Jan 5, 2023 * ConvNeXt-V2 models and weights added to existing `convnext.py` - * Paper: [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) + * Paper: [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://huggingface.co/papers/2301.00808) * Reference impl: https://github.com/facebookresearch/ConvNeXt-V2 (NOTE: weights currently CC-BY-NC) @dataclass ### Dec 23, 2022 🎄☃ -* Add FlexiViT models and weights from https://github.com/google-research/big_vision (check out paper at https://arxiv.org/abs/2212.08013) +* Add FlexiViT models and weights from https://github.com/google-research/big_vision (check out paper at https://huggingface.co/papers/2212.08013) * NOTE currently resizing is static on model creation, on-the-fly dynamic / train patch size sampling is a WIP * Many more models updated to multi-weight and downloadable via HF hub now (convnext, efficientnet, mobilenet, vision_transformer*, beit) * More model pretrained tag and adjustments, some model names changed (working on deprecation translations, consider main branch DEV branch right now, use 0.6.x for stable use) @@ -624,7 +624,7 @@ Datasets & transform refactoring ### Dec 6, 2022 * Add 'EVA g', BEiT style ViT-g/14 model weights w/ both MIM pretrain and CLIP pretrain to `beit.py`. 
* original source: https://github.com/baaivision/EVA - * paper: https://arxiv.org/abs/2211.07636 + * paper: https://huggingface.co/papers/2211.07636 | model | top1 | param_count | gmac | macts | hub | |:-----------------------------------------|-------:|--------------:|-------:|--------:|:----------------------------------------| @@ -738,7 +738,7 @@ Datasets & transform refactoring * `maxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.6 @ 320 (T) ### Aug 26, 2022 -* CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) `timm` original models +* CoAtNet (https://huggingface.co/papers/2106.04803) and MaxVit (https://huggingface.co/papers/2204.01697) `timm` original models * both found in [`maxxvit.py`](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/maxxvit.py) model def, contains numerous experiments outside scope of original papers * an unfinished Tensorflow version from MaxVit authors can be found https://github.com/google-research/maxvit * Initial CoAtNet and MaxVit timm pretrained weights (working on more): @@ -834,7 +834,7 @@ More models, more fixes * `vit_relpos_base_patch16_gapcls_224` - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake) * Bring 512 dim, 8-head 'medium' ViT model variant back to life (after using in a pre DeiT 'small' model for first ViT impl back in 2020) * Add ViT relative position support for switching btw existing impl and some additions in official Swin-V2 impl for future trials -* Sequencer2D impl (https://arxiv.org/abs/2205.01972), added via PR from author (https://github.com/okojoalg) +* Sequencer2D impl (https://huggingface.co/papers/2205.01972), added via PR from author (https://github.com/okojoalg) ### May 2, 2022 * Vision Transformer experiments adding Relative Position (Swin-V2 log-coord) (`vision_transformer_relpos.py`) and Residual Post-Norm branches (from Swin-V2) (`vision_transformer*.py`) @@ -851,7 +851,7 @@ More models, more fixes * `seresnextaa101d_32x8d` (anti-aliased w/ AvgPool2d) - 83.85 @ 224, 84.57 @ 288 ### March 23, 2022 -* Add `ParallelBlock` and `LayerScale` option to base vit models to support model configs in [Three things everyone should know about ViT](https://arxiv.org/abs/2203.09795) +* Add `ParallelBlock` and `LayerScale` option to base vit models to support model configs in [Three things everyone should know about ViT](https://huggingface.co/papers/2203.09795) * `convnext_tiny_hnf` (head norm first) weights trained with (close to) A2 recipe, 82.2% top-1, could do better with more epochs. 
### March 21, 2022 @@ -908,11 +908,11 @@ More models, more fixes ### Jan 5, 2023 * ConvNeXt-V2 models and weights added to existing `convnext.py` - * Paper: [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](http://arxiv.org/abs/2301.00808) + * Paper: [ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders](https://huggingface.co/papers/2301.00808) * Reference impl: https://github.com/facebookresearch/ConvNeXt-V2 (NOTE: weights currently CC-BY-NC) ### Dec 23, 2022 🎄☃ -* Add FlexiViT models and weights from https://github.com/google-research/big_vision (check out paper at https://arxiv.org/abs/2212.08013) +* Add FlexiViT models and weights from https://github.com/google-research/big_vision (check out paper at https://huggingface.co/papers/2212.08013) * NOTE currently resizing is static on model creation, on-the-fly dynamic / train patch size sampling is a WIP * Many more models updated to multi-weight and downloadable via HF hub now (convnext, efficientnet, mobilenet, vision_transformer*, beit) * More model pretrained tag and adjustments, some model names changed (working on deprecation translations, consider main branch DEV branch right now, use 0.6.x for stable use) @@ -936,7 +936,7 @@ More models, more fixes ### Dec 6, 2022 * Add 'EVA g', BEiT style ViT-g/14 model weights w/ both MIM pretrain and CLIP pretrain to `beit.py`. * original source: https://github.com/baaivision/EVA - * paper: https://arxiv.org/abs/2211.07636 + * paper: https://huggingface.co/papers/2211.07636 | model | top1 | param_count | gmac | macts | hub | |:-----------------------------------------|-------:|--------------:|-------:|--------:|:----------------------------------------| @@ -1050,7 +1050,7 @@ More models, more fixes * `maxvit_rmlp_nano_rw_256` - 83.0 @ 256, 83.6 @ 320 (T) ### Aug 26, 2022 -* CoAtNet (https://arxiv.org/abs/2106.04803) and MaxVit (https://arxiv.org/abs/2204.01697) `timm` original models +* CoAtNet (https://huggingface.co/papers/2106.04803) and MaxVit (https://huggingface.co/papers/2204.01697) `timm` original models * both found in [`maxxvit.py`](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/maxxvit.py) model def, contains numerous experiments outside scope of original papers * an unfinished Tensorflow version from MaxVit authors can be found https://github.com/google-research/maxvit * Initial CoAtNet and MaxVit timm pretrained weights (working on more): @@ -1147,7 +1147,7 @@ More models, more fixes * `vit_relpos_base_patch16_gapcls_224` - 82.8 @ 224, 83.9 @ 320 -- rel pos, layer scale, class token, avg pool (by mistake) * Bring 512 dim, 8-head 'medium' ViT model variant back to life (after using in a pre DeiT 'small' model for first ViT impl back in 2020) * Add ViT relative position support for switching btw existing impl and some additions in official Swin-V2 impl for future trials -* Sequencer2D impl (https://arxiv.org/abs/2205.01972), added via PR from author (https://github.com/okojoalg) +* Sequencer2D impl (https://huggingface.co/papers/2205.01972), added via PR from author (https://github.com/okojoalg) ### May 2, 2022 * Vision Transformer experiments adding Relative Position (Swin-V2 log-coord) (`vision_transformer_relpos.py`) and Residual Post-Norm branches (from Swin-V2) (`vision_transformer*.py`) @@ -1164,7 +1164,7 @@ More models, more fixes * `seresnextaa101d_32x8d` (anti-aliased w/ AvgPool2d) - 83.85 @ 224, 84.57 @ 288 ### March 23, 2022 -* Add `ParallelBlock` and `LayerScale` option to base vit models to support 
model configs in [Three things everyone should know about ViT](https://arxiv.org/abs/2203.09795) +* Add `ParallelBlock` and `LayerScale` option to base vit models to support model configs in [Three things everyone should know about ViT](https://huggingface.co/papers/2203.09795) * `convnext_tiny_hnf` (head norm first) weights trained with (close to) A2 recipe, 82.2% top-1, could do better with more epochs. ### March 21, 2022 diff --git a/hfdocs/source/models.mdx b/hfdocs/source/models.mdx index 97ff00b9ec..c061ce1493 100644 --- a/hfdocs/source/models.mdx +++ b/hfdocs/source/models.mdx @@ -15,77 +15,77 @@ A more exciting view (with pretty pictures) of the models within `timm` can be f ## Big Transfer ResNetV2 (BiT) * Implementation: [resnetv2.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnetv2.py) -* Paper: `Big Transfer (BiT): General Visual Representation Learning` - https://arxiv.org/abs/1912.11370 +* Paper: `Big Transfer (BiT): General Visual Representation Learning` - https://huggingface.co/papers/1912.11370 * Reference code: https://github.com/google-research/big_transfer ## Cross-Stage Partial Networks * Implementation: [cspnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/cspnet.py) -* Paper: `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - https://arxiv.org/abs/1911.11929 +* Paper: `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - https://huggingface.co/papers/1911.11929 * Reference impl: https://github.com/WongKinYiu/CrossStagePartialNetworks ## DenseNet * Implementation: [densenet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/densenet.py) -* Paper: `Densely Connected Convolutional Networks` - https://arxiv.org/abs/1608.06993 +* Paper: `Densely Connected Convolutional Networks` - https://huggingface.co/papers/1608.06993 * Code: https://github.com/pytorch/vision/tree/master/torchvision/models ## DLA * Implementation: [dla.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/dla.py) -* Paper: `Deep Layer Aggregation` - https://arxiv.org/abs/1707.06484 +* Paper: `Deep Layer Aggregation` - https://huggingface.co/papers/1707.06484 * Code: https://github.com/ucbdrive/dla ## Dual-Path Networks * Implementation: [dpn.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/dpn.py) -* Paper: `Dual Path Networks` - https://arxiv.org/abs/1707.01629 +* Paper: `Dual Path Networks` - https://huggingface.co/papers/1707.01629 * My PyTorch code: https://github.com/rwightman/pytorch-dpn-pretrained * Reference code: https://github.com/cypw/DPNs ## GPU-Efficient Networks * Implementation: [byobnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/byobnet.py) -* Paper: `Neural Architecture Design for GPU-Efficient Networks` - https://arxiv.org/abs/2006.14090 +* Paper: `Neural Architecture Design for GPU-Efficient Networks` - https://huggingface.co/papers/2006.14090 * Reference code: https://github.com/idstcv/GPU-Efficient-Networks ## HRNet * Implementation: [hrnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/hrnet.py) -* Paper: `Deep High-Resolution Representation Learning for Visual Recognition` - https://arxiv.org/abs/1908.07919 +* Paper: `Deep High-Resolution Representation Learning for Visual Recognition` - https://huggingface.co/papers/1908.07919 * Code: https://github.com/HRNet/HRNet-Image-Classification ## Inception-V3 * Implementation: 
[inception_v3.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/inception_v3.py) -* Paper: `Rethinking the Inception Architecture for Computer Vision` - https://arxiv.org/abs/1512.00567 +* Paper: `Rethinking the Inception Architecture for Computer Vision` - https://huggingface.co/papers/1512.00567 * Code: https://github.com/pytorch/vision/tree/master/torchvision/models ## Inception-V4 * Implementation: [inception_v4.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/inception_v4.py) -* Paper: `Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning` - https://arxiv.org/abs/1602.07261 +* Paper: `Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning` - https://huggingface.co/papers/1602.07261 * Code: https://github.com/Cadene/pretrained-models.pytorch * Reference code: https://github.com/tensorflow/models/tree/master/research/slim/nets ## Inception-ResNet-V2 * Implementation: [inception_resnet_v2.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/inception_resnet_v2.py) -* Paper: `Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning` - https://arxiv.org/abs/1602.07261 +* Paper: `Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning` - https://huggingface.co/papers/1602.07261 * Code: https://github.com/Cadene/pretrained-models.pytorch * Reference code: https://github.com/tensorflow/models/tree/master/research/slim/nets ## NASNet-A * Implementation: [nasnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/nasnet.py) -* Paper: `Learning Transferable Architectures for Scalable Image Recognition` - https://arxiv.org/abs/1707.07012 +* Paper: `Learning Transferable Architectures for Scalable Image Recognition` - https://huggingface.co/papers/1707.07012 * Code: https://github.com/Cadene/pretrained-models.pytorch * Reference code: https://github.com/tensorflow/models/tree/master/research/slim/nets/nasnet ## PNasNet-5 * Implementation: [pnasnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/pnasnet.py) -* Paper: `Progressive Neural Architecture Search` - https://arxiv.org/abs/1712.00559 +* Paper: `Progressive Neural Architecture Search` - https://huggingface.co/papers/1712.00559 * Code: https://github.com/Cadene/pretrained-models.pytorch * Reference code: https://github.com/tensorflow/models/tree/master/research/slim/nets/nasnet @@ -93,34 +93,34 @@ A more exciting view (with pretty pictures) of the models within `timm` can be f * Implementation: [efficientnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/efficientnet.py) * Papers: - * EfficientNet NoisyStudent (B0-B7, L2) - https://arxiv.org/abs/1911.04252 - * EfficientNet AdvProp (B0-B8) - https://arxiv.org/abs/1911.09665 - * EfficientNet (B0-B7) - https://arxiv.org/abs/1905.11946 + * EfficientNet NoisyStudent (B0-B7, L2) - https://huggingface.co/papers/1911.04252 + * EfficientNet AdvProp (B0-B8) - https://huggingface.co/papers/1911.09665 + * EfficientNet (B0-B7) - https://huggingface.co/papers/1905.11946 * EfficientNet-EdgeTPU (S, M, L) - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html - * MixNet - https://arxiv.org/abs/1907.09595 - * MNASNet B1, A1 (Squeeze-Excite), and Small - https://arxiv.org/abs/1807.11626 - * MobileNet-V2 - https://arxiv.org/abs/1801.04381 - * FBNet-C - https://arxiv.org/abs/1812.03443 - * Single-Path NAS - 
https://arxiv.org/abs/1904.02877 + * MixNet - https://huggingface.co/papers/1907.09595 + * MNASNet B1, A1 (Squeeze-Excite), and Small - https://huggingface.co/papers/1807.11626 + * MobileNet-V2 - https://huggingface.co/papers/1801.04381 + * FBNet-C - https://huggingface.co/papers/1812.03443 + * Single-Path NAS - https://huggingface.co/papers/1904.02877 * My PyTorch code: https://github.com/rwightman/gen-efficientnet-pytorch * Reference code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet ## MobileNet-V3 * Implementation: [mobilenetv3.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/mobilenetv3.py) -* Paper: `Searching for MobileNetV3` - https://arxiv.org/abs/1905.02244 +* Paper: `Searching for MobileNetV3` - https://huggingface.co/papers/1905.02244 * Reference code: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet ## RegNet * Implementation: [regnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/regnet.py) -* Paper: `Designing Network Design Spaces` - https://arxiv.org/abs/2003.13678 +* Paper: `Designing Network Design Spaces` - https://huggingface.co/papers/2003.13678 * Reference code: https://github.com/facebookresearch/pycls/blob/master/pycls/models/regnet.py ## RepVGG * Implementation: [byobnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/byobnet.py) -* Paper: `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 +* Paper: `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 * Reference code: https://github.com/DingXiaoH/RepVGG ## ResNet, ResNeXt @@ -128,55 +128,55 @@ A more exciting view (with pretty pictures) of the models within `timm` can be f * Implementation: [resnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnet.py) * ResNet (V1B) - * Paper: `Deep Residual Learning for Image Recognition` - https://arxiv.org/abs/1512.03385 + * Paper: `Deep Residual Learning for Image Recognition` - https://huggingface.co/papers/1512.03385 * Code: https://github.com/pytorch/vision/tree/master/torchvision/models * ResNeXt - * Paper: `Aggregated Residual Transformations for Deep Neural Networks` - https://arxiv.org/abs/1611.05431 + * Paper: `Aggregated Residual Transformations for Deep Neural Networks` - https://huggingface.co/papers/1611.05431 * Code: https://github.com/pytorch/vision/tree/master/torchvision/models * 'Bag of Tricks' / Gluon C, D, E, S ResNet variants - * Paper: `Bag of Tricks for Image Classification with CNNs` - https://arxiv.org/abs/1812.01187 + * Paper: `Bag of Tricks for Image Classification with CNNs` - https://huggingface.co/papers/1812.01187 * Code: https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/resnetv1b.py * Instagram pretrained / ImageNet tuned ResNeXt101 - * Paper: `Exploring the Limits of Weakly Supervised Pretraining` - https://arxiv.org/abs/1805.00932 + * Paper: `Exploring the Limits of Weakly Supervised Pretraining` - https://huggingface.co/papers/1805.00932 * Weights: https://pytorch.org/hub/facebookresearch_WSL-Images_resnext (NOTE: CC BY-NC 4.0 License, NOT commercial friendly) * Semi-supervised (SSL) / Semi-weakly Supervised (SWSL) ResNet and ResNeXts - * Paper: `Billion-scale semi-supervised learning for image classification` - https://arxiv.org/abs/1905.00546 + * Paper: `Billion-scale semi-supervised learning for image classification` - https://huggingface.co/papers/1905.00546 * Weights: 
https://github.com/facebookresearch/semi-supervised-ImageNet1K-models (NOTE: CC BY-NC 4.0 License, NOT commercial friendly) * Squeeze-and-Excitation Networks - * Paper: `Squeeze-and-Excitation Networks` - https://arxiv.org/abs/1709.01507 + * Paper: `Squeeze-and-Excitation Networks` - https://huggingface.co/papers/1709.01507 * Code: Added to ResNet base, this is current version going forward, old `senet.py` is being deprecated * ECAResNet (ECA-Net) - * Paper: `ECA-Net: Efficient Channel Attention for Deep CNN` - https://arxiv.org/abs/1910.03151v4 + * Paper: `ECA-Net: Efficient Channel Attention for Deep CNN` - https://huggingface.co/papers/1910.03151v4 * Code: Added to ResNet base, ECA module contributed by @VRandme, reference https://github.com/BangguWu/ECANet ## Res2Net * Implementation: [res2net.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/res2net.py) -* Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://arxiv.org/abs/1904.01169 +* Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://huggingface.co/papers/1904.01169 * Code: https://github.com/gasvn/Res2Net ## ResNeSt * Implementation: [resnest.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/resnest.py) -* Paper: `ResNeSt: Split-Attention Networks` - https://arxiv.org/abs/2004.08955 +* Paper: `ResNeSt: Split-Attention Networks` - https://huggingface.co/papers/2004.08955 * Code: https://github.com/zhanghang1989/ResNeSt ## ReXNet * Implementation: [rexnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/rexnet.py) -* Paper: `ReXNet: Diminishing Representational Bottleneck on CNN` - https://arxiv.org/abs/2007.00992 +* Paper: `ReXNet: Diminishing Representational Bottleneck on CNN` - https://huggingface.co/papers/2007.00992 * Code: https://github.com/clovaai/rexnet ## Selective-Kernel Networks * Implementation: [sknet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/sknet.py) -* Paper: `Selective-Kernel Networks` - https://arxiv.org/abs/1903.06586 +* Paper: `Selective-Kernel Networks` - https://huggingface.co/papers/1903.06586 * Code: https://github.com/implus/SKNet, https://github.com/clovaai/assembled-cnn ## SelecSLS * Implementation: [selecsls.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/selecsls.py) -* Paper: `XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera` - https://arxiv.org/abs/1907.00837 +* Paper: `XNect: Real-time Multi-Person 3D Motion Capture with a Single RGB Camera` - https://huggingface.co/papers/1907.00837 * Code: https://github.com/mehtadushy/SelecSLS-Pytorch ## Squeeze-and-Excitation Networks @@ -184,47 +184,47 @@ A more exciting view (with pretty pictures) of the models within `timm` can be f * Implementation: [senet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/senet.py) NOTE: I am deprecating this version of the networks, the new ones are part of `resnet.py` -* Paper: `Squeeze-and-Excitation Networks` - https://arxiv.org/abs/1709.01507 +* Paper: `Squeeze-and-Excitation Networks` - https://huggingface.co/papers/1709.01507 * Code: https://github.com/Cadene/pretrained-models.pytorch ## TResNet * Implementation: [tresnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/tresnet.py) -* Paper: `TResNet: High Performance GPU-Dedicated Architecture` - https://arxiv.org/abs/2003.13630 +* Paper: `TResNet: High Performance GPU-Dedicated Architecture` - 
https://huggingface.co/papers/2003.13630 * Code: https://github.com/mrT23/TResNet ## VGG * Implementation: [vgg.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vgg.py) -* Paper: `Very Deep Convolutional Networks For Large-Scale Image Recognition` - https://arxiv.org/pdf/1409.1556.pdf +* Paper: `Very Deep Convolutional Networks For Large-Scale Image Recognition` - https://huggingface.co/papers/1409.1556 * Reference code: https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py ## Vision Transformer * Implementation: [vision_transformer.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py) -* Paper: `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - https://arxiv.org/abs/2010.11929 +* Paper: `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - https://huggingface.co/papers/2010.11929 * Reference code and pretrained weights: https://github.com/google-research/vision_transformer ## VovNet V2 and V1 * Implementation: [vovnet.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vovnet.py) -* Paper: `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://arxiv.org/abs/1911.06667 +* Paper: `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://huggingface.co/papers/1911.06667 * Reference code: https://github.com/youngwanLEE/vovnet-detectron2 ## Xception * Implementation: [xception.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/xception.py) -* Paper: `Xception: Deep Learning with Depthwise Separable Convolutions` - https://arxiv.org/abs/1610.02357 +* Paper: `Xception: Deep Learning with Depthwise Separable Convolutions` - https://huggingface.co/papers/1610.02357 * Code: https://github.com/Cadene/pretrained-models.pytorch ## Xception (Modified Aligned, Gluon) * Implementation: [gluon_xception.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/gluon_xception.py) -* Paper: `Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation` - https://arxiv.org/abs/1802.02611 +* Paper: `Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation` - https://huggingface.co/papers/1802.02611 * Reference code: https://github.com/dmlc/gluon-cv/tree/master/gluoncv/model_zoo, https://github.com/jfzhang95/pytorch-deeplab-xception/ ## Xception (Modified Aligned, TF) * Implementation: [aligned_xception.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/aligned_xception.py) -* Paper: `Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation` - https://arxiv.org/abs/1802.02611 +* Paper: `Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation` - https://huggingface.co/papers/1802.02611 * Reference code: https://github.com/tensorflow/models/tree/master/research/deeplab diff --git a/results/README.md b/results/README.md index 81f30061c3..2bca4b467f 100644 --- a/results/README.md +++ b/results/README.md @@ -13,28 +13,28 @@ The test set results include rank and top-1/top-5 differences from clean validat The standard 50,000 image ImageNet-1k validation set. Model selection during training utilizes this validation set, so it is not a true test set. Question: Does anyone have the official ImageNet-1k test set classification labels now that challenges are done? 
* Source: http://image-net.org/challenges/LSVRC/2012/index -* Paper: "ImageNet Large Scale Visual Recognition Challenge" - https://arxiv.org/abs/1409.0575 +* Paper: "ImageNet Large Scale Visual Recognition Challenge" - https://huggingface.co/papers/1409.0575 ### ImageNet-"Real Labels" - [`results-imagenet-real.csv`](results-imagenet-real.csv) The usual ImageNet-1k validation set with a fresh new set of labels intended to improve on mistakes in the original annotation process. * Source: https://github.com/google-research/reassessed-imagenet -* Paper: "Are we done with ImageNet?" - https://arxiv.org/abs/2006.07159 +* Paper: "Are we done with ImageNet?" - https://huggingface.co/papers/2006.07159 ### ImageNetV2 Matched Frequency - [`results-imagenetv2-matched-frequency.csv`](results-imagenetv2-matched-frequency.csv) An ImageNet test set of 10,000 images sampled from new images roughly 10 years after the original. Care was taken to replicate the original ImageNet curation/sampling process. * Source: https://github.com/modestyachts/ImageNetV2 -* Paper: "Do ImageNet Classifiers Generalize to ImageNet?" - https://arxiv.org/abs/1902.10811 +* Paper: "Do ImageNet Classifiers Generalize to ImageNet?" - https://huggingface.co/papers/1902.10811 ### ImageNet-Sketch - [`results-sketch.csv`](results-sketch.csv) 50,000 non photographic (or photos of such) images (sketches, doodles, mostly monochromatic) covering all 1000 ImageNet classes. * Source: https://github.com/HaohanWang/ImageNet-Sketch -* Paper: "Learning Robust Global Representations by Penalizing Local Predictive Power" - https://arxiv.org/abs/1905.13549 +* Paper: "Learning Robust Global Representations by Penalizing Local Predictive Power" - https://huggingface.co/papers/1905.13549 ### ImageNet-Adversarial - [`results-imagenet-a.csv`](results-imagenet-a.csv) @@ -43,7 +43,7 @@ A collection of 7500 images covering 200 of the 1000 ImageNet classes. Images ar For clean validation with same 200 classes, see [`results-imagenet-a-clean.csv`](results-imagenet-a-clean.csv) * Source: https://github.com/hendrycks/natural-adv-examples -* Paper: "Natural Adversarial Examples" - https://arxiv.org/abs/1907.07174 +* Paper: "Natural Adversarial Examples" - https://huggingface.co/papers/1907.07174 ### ImageNet-Rendition - [`results-imagenet-r.csv`](results-imagenet-r.csv) @@ -52,7 +52,7 @@ Renditions of 200 ImageNet classes resulting in 30,000 images for testing robust For clean validation with same 200 classes, see [`results-imagenet-r-clean.csv`](results-imagenet-r-clean.csv) * Source: https://github.com/hendrycks/imagenet-r -* Paper: "The Many Faces of Robustness" - https://arxiv.org/abs/2006.16241 +* Paper: "The Many Faces of Robustness" - https://huggingface.co/papers/2006.16241 ### TODO * Explore adding a reduced version of ImageNet-C (Corruptions) and ImageNet-P (Perturbations) from https://github.com/hendrycks/robustness. The originals are huge and image size specific. diff --git a/timm/data/auto_augment.py b/timm/data/auto_augment.py index 36dd08fa05..ac453e5687 100644 --- a/timm/data/auto_augment.py +++ b/timm/data/auto_augment.py @@ -12,11 +12,11 @@ 3-Augment based on: https://github.com/facebookresearch/deit/blob/main/README_revenge.md Papers: - AutoAugment: Learning Augmentation Policies from Data - https://arxiv.org/abs/1805.09501 - Learning Data Augmentation Strategies for Object Detection - https://arxiv.org/abs/1906.11172 - RandAugment: Practical automated data augmentation... 
- https://arxiv.org/abs/1909.13719 - AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty - https://arxiv.org/abs/1912.02781 - 3-Augment: DeiT III: Revenge of the ViT - https://arxiv.org/abs/2204.07118 + AutoAugment: Learning Augmentation Policies from Data - https://huggingface.co/papers/1805.09501 + Learning Data Augmentation Strategies for Object Detection - https://huggingface.co/papers/1906.11172 + RandAugment: Practical automated data augmentation... - https://huggingface.co/papers/1909.13719 + AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty - https://huggingface.co/papers/1912.02781 + 3-Augment: DeiT III: Revenge of the ViT - https://huggingface.co/papers/2204.07118 Hacked together by / Copyright 2019, Ross Wightman """ @@ -472,7 +472,7 @@ def auto_augment_policy_v0r(hparams): def auto_augment_policy_original(hparams): - # ImageNet policy from https://arxiv.org/abs/1805.09501 + # ImageNet policy from https://huggingface.co/papers/1805.09501 policy = [ [('PosterizeOriginal', 0.4, 8), ('Rotate', 0.6, 9)], [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], @@ -505,7 +505,7 @@ def auto_augment_policy_original(hparams): def auto_augment_policy_originalr(hparams): - # ImageNet policy from https://arxiv.org/abs/1805.09501 with research posterize variation + # ImageNet policy from https://huggingface.co/papers/1805.09501 with research posterize variation policy = [ [('PosterizeIncreasing', 0.4, 8), ('Rotate', 0.6, 9)], [('Solarize', 0.6, 5), ('AutoContrast', 0.6, 5)], @@ -879,7 +879,7 @@ class AugMixAugment: """ AugMix Transform Adapted and improved from impl here: https://github.com/google-research/augmix/blob/master/imagenet.py From paper: 'AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty - - https://arxiv.org/abs/1912.02781 + https://huggingface.co/papers/1912.02781 """ def __init__(self, ops, alpha=1., width=3, depth=-1, blended=False): self.ops = ops diff --git a/timm/data/mixup.py b/timm/data/mixup.py index 26dc239152..9afa9e0ed9 100644 --- a/timm/data/mixup.py +++ b/timm/data/mixup.py @@ -1,9 +1,9 @@ """ Mixup and Cutmix Papers: -mixup: Beyond Empirical Risk Minimization (https://arxiv.org/abs/1710.09412) +mixup: Beyond Empirical Risk Minimization (https://huggingface.co/papers/1710.09412) -CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (https://arxiv.org/abs/1905.04899) +CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features (https://huggingface.co/papers/1905.04899) Code Reference: CutMix: https://github.com/clovaai/CutMix-PyTorch diff --git a/timm/data/random_erasing.py b/timm/data/random_erasing.py index 1dee5f86a2..f07dbffb1e 100644 --- a/timm/data/random_erasing.py +++ b/timm/data/random_erasing.py @@ -26,7 +26,7 @@ def _get_pixels(per_pixel, rand_color, patch_size, dtype=torch.float32, device=' class RandomErasing: """ Randomly selects a rectangle region in an image and erases its pixels. 'Random Erasing Data Augmentation' by Zhong et al. - See https://arxiv.org/pdf/1708.04896.pdf + See https://huggingface.co/papers/1708.04896 This variant of RandomErasing is intended to be applied to either a batch or single image tensor after it has been normalized by dataset mean and std. 
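The `RandomErasing` docstring in the hunk above describes selecting a random rectangle in an image tensor and erasing its pixels after the tensor has been normalized. As a rough illustration of that idea only (a minimal sketch in plain PyTorch, not the timm `RandomErasing` class, which also supports per-pixel/per-channel fill modes, erase-count ranges, and batched input; the area/aspect ranges below are assumed values):

```python
# Minimal sketch of erasing one random rectangle in an already-normalized (C, H, W) tensor.
import math
import random

import torch


def erase_random_rect(img, area_frac=(0.02, 0.33), aspect=(0.3, 3.3)):
    """Erase one random rectangle in a (C, H, W) tensor with normal noise."""
    c, h, w = img.shape
    for _ in range(10):  # retry a few times if the sampled rectangle doesn't fit
        target_area = random.uniform(*area_frac) * h * w
        ar = math.exp(random.uniform(math.log(aspect[0]), math.log(aspect[1])))
        eh = int(round(math.sqrt(target_area * ar)))
        ew = int(round(math.sqrt(target_area / ar)))
        if 0 < eh < h and 0 < ew < w:
            top = random.randint(0, h - eh)
            left = random.randint(0, w - ew)
            # The image is already normalized, so zero-mean / unit-std noise is a
            # reasonable fill value for the erased region.
            img[:, top:top + eh, left:left + ew] = torch.randn(c, eh, ew)
            return img
    return img  # no valid rectangle found, leave unchanged


x = torch.randn(3, 224, 224)  # stand-in for a normalized image tensor
x = erase_random_rect(x)
```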
diff --git a/timm/data/real_labels.py b/timm/data/real_labels.py index 20f1b319a5..1b68f1f38f 100644 --- a/timm/data/real_labels.py +++ b/timm/data/real_labels.py @@ -1,5 +1,5 @@ """ Real labels evaluator for ImageNet -Paper: `Are we done with ImageNet?` - https://arxiv.org/abs/2006.07159 +Paper: `Are we done with ImageNet?` - https://huggingface.co/papers/2006.07159 Based on Numpy example at https://github.com/google-research/reassessed-imagenet Hacked together by / Copyright 2020 Ross Wightman diff --git a/timm/layers/activations.py b/timm/layers/activations.py index a863e6964b..ae3fadca84 100644 --- a/timm/layers/activations.py +++ b/timm/layers/activations.py @@ -12,7 +12,7 @@ def swish(x, inplace: bool = False): - """Swish - Described in: https://arxiv.org/abs/1710.05941 + """Swish - Described in: https://huggingface.co/papers/1710.05941 """ return x.mul_(x.sigmoid()) if inplace else x.mul(x.sigmoid()) @@ -27,14 +27,14 @@ def forward(self, x): def mish(x, inplace: bool = False): - """Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://arxiv.org/abs/1908.08681 + """Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://huggingface.co/papers/1908.08681 NOTE: I don't have a working inplace variant """ return x.mul(F.softplus(x).tanh()) class Mish(nn.Module): - """Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://arxiv.org/abs/1908.08681 + """Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://huggingface.co/papers/1908.08681 """ def __init__(self, inplace: bool = False): super(Mish, self).__init__() diff --git a/timm/layers/activations_me.py b/timm/layers/activations_me.py index b0ddd5cb0d..4b051ce356 100644 --- a/timm/layers/activations_me.py +++ b/timm/layers/activations_me.py @@ -66,7 +66,7 @@ def mish_bwd(x, grad_output): class MishAutoFn(torch.autograd.Function): - """ Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://arxiv.org/abs/1908.08681 + """ Mish: A Self Regularized Non-Monotonic Neural Activation Function - https://huggingface.co/papers/1908.08681 A memory efficient variant of Mish """ @staticmethod diff --git a/timm/layers/attention2d.py b/timm/layers/attention2d.py index 6a542828bc..0827a39ebd 100644 --- a/timm/layers/attention2d.py +++ b/timm/layers/attention2d.py @@ -14,7 +14,7 @@ class MultiQueryAttentionV2(nn.Module): """Multi Query Attention. 
Fast Transformer Decoding: One Write-Head is All You Need - https://arxiv.org/pdf/1911.02150.pdf + https://huggingface.co/papers/1911.02150 This is an acceletor optimized version - removing multiple unnecessary tensor transpose by re-arranging indices according to the following rules: 1) diff --git a/timm/layers/bottleneck_attn.py b/timm/layers/bottleneck_attn.py index c3db464e5a..f31178d595 100644 --- a/timm/layers/bottleneck_attn.py +++ b/timm/layers/bottleneck_attn.py @@ -1,6 +1,6 @@ """ Bottleneck Self Attention (Bottleneck Transformers) -Paper: `Bottleneck Transformers for Visual Recognition` - https://arxiv.org/abs/2101.11605 +Paper: `Bottleneck Transformers for Visual Recognition` - https://huggingface.co/papers/2101.11605 @misc{2101.11605, Author = {Aravind Srinivas and Tsung-Yi Lin and Niki Parmar and Jonathon Shlens and Pieter Abbeel and Ashish Vaswani}, @@ -29,7 +29,7 @@ def rel_logits_1d(q, rel_k, permute_mask: List[int]): """ Compute relative logits along one dimension As per: https://gist.github.com/aravindsrinivas/56359b79f0ce4449bcb04ab4b56a57a2 - Originally from: `Attention Augmented Convolutional Networks` - https://arxiv.org/abs/1904.09925 + Originally from: `Attention Augmented Convolutional Networks` - https://huggingface.co/papers/1904.09925 Args: q: (batch, heads, height, width, dim) @@ -56,7 +56,7 @@ def rel_logits_1d(q, rel_k, permute_mask: List[int]): class PosEmbedRel(nn.Module): """ Relative Position Embedding As per: https://gist.github.com/aravindsrinivas/56359b79f0ce4449bcb04ab4b56a57a2 - Originally from: `Attention Augmented Convolutional Networks` - https://arxiv.org/abs/1904.09925 + Originally from: `Attention Augmented Convolutional Networks` - https://huggingface.co/papers/1904.09925 """ def __init__(self, feat_size, dim_head, scale): super().__init__() @@ -83,7 +83,7 @@ def forward(self, q): class BottleneckAttn(nn.Module): """ Bottleneck Attention - Paper: `Bottleneck Transformers for Visual Recognition` - https://arxiv.org/abs/2101.11605 + Paper: `Bottleneck Transformers for Visual Recognition` - https://huggingface.co/papers/2101.11605 The internal dimensions of the attention module are controlled by the interaction of several arguments. * the output dimension of the module is specified by dim_out, which falls back to input dim if not set diff --git a/timm/layers/cbam.py b/timm/layers/cbam.py index 576a8306d9..3af762dfff 100644 --- a/timm/layers/cbam.py +++ b/timm/layers/cbam.py @@ -1,6 +1,6 @@ """ CBAM (sort-of) Attention -Experimental impl of CBAM: Convolutional Block Attention Module: https://arxiv.org/abs/1807.06521 +Experimental impl of CBAM: Convolutional Block Attention Module: https://huggingface.co/papers/1807.06521 WARNING: Results with these attention layers have been mixed. They can significantly reduce performance on some tasks, especially fine-grained it seems. I may end up removing this impl. 
diff --git a/timm/layers/cond_conv2d.py b/timm/layers/cond_conv2d.py index 43654c5972..df498a9f4a 100644 --- a/timm/layers/cond_conv2d.py +++ b/timm/layers/cond_conv2d.py @@ -1,7 +1,7 @@ """ PyTorch Conditionally Parameterized Convolution (CondConv) Paper: CondConv: Conditionally Parameterized Convolutions for Efficient Inference -(https://arxiv.org/abs/1904.04971) +(https://huggingface.co/papers/1904.04971) Hacked together by / Copyright 2020 Ross Wightman """ diff --git a/timm/layers/drop.py b/timm/layers/drop.py index 289245f5ad..086e84b0fa 100644 --- a/timm/layers/drop.py +++ b/timm/layers/drop.py @@ -3,9 +3,9 @@ PyTorch implementations of DropBlock and DropPath (Stochastic Depth) regularization layers. Papers: -DropBlock: A regularization method for convolutional networks (https://arxiv.org/abs/1810.12890) +DropBlock: A regularization method for convolutional networks (https://huggingface.co/papers/1810.12890) -Deep Networks with Stochastic Depth (https://arxiv.org/abs/1603.09382) +Deep Networks with Stochastic Depth (https://huggingface.co/papers/1603.09382) Code: DropBlock impl inspired by two Tensorflow impl that I liked: @@ -30,7 +30,7 @@ def drop_block_2d( inplace: bool = False, batchwise: bool = False ): - """ DropBlock. See https://arxiv.org/pdf/1810.12890.pdf + """ DropBlock. See https://huggingface.co/papers/1810.12890 DropBlock with an experimental gaussian noise option. This layer has been tested on a few training runs with success, but needs further validation and possibly optimization for lower runtime impact. @@ -83,7 +83,7 @@ def drop_block_fast_2d( with_noise: bool = False, inplace: bool = False, ): - """ DropBlock. See https://arxiv.org/pdf/1810.12890.pdf + """ DropBlock. See https://huggingface.co/papers/1810.12890 DropBlock with an experimental gaussian noise option. Simplied from above without concern for valid block mask at edges. @@ -115,7 +115,7 @@ def drop_block_fast_2d( class DropBlock2d(nn.Module): - """ DropBlock. See https://arxiv.org/pdf/1810.12890.pdf + """ DropBlock. See https://huggingface.co/papers/1810.12890 """ def __init__( diff --git a/timm/layers/eca.py b/timm/layers/eca.py index e29be6ac3c..c8391ffecd 100644 --- a/timm/layers/eca.py +++ b/timm/layers/eca.py @@ -2,7 +2,7 @@ ECA module from ECAnet paper: ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks -https://arxiv.org/abs/1910.03151 +https://huggingface.co/papers/1910.03151 Original ECA model borrowed from https://github.com/BangguWu/ECANet @@ -49,7 +49,7 @@ class EcaModule(nn.Module): channels: Number of channels of the input feature map for use in adaptive kernel sizes for actual calculations according to channel. gamma, beta: when channel is given parameters of mapping function - refer to original paper https://arxiv.org/pdf/1910.03151.pdf + refer to original paper https://huggingface.co/papers/1910.03151 (default=None. if channel size not given, use k_size given for kernel size.) kernel_size: Adaptive selection of kernel size (default=3) gamm: used in kernel_size calc, see above @@ -109,7 +109,7 @@ class CecaModule(nn.Module): channels: Number of channels of the input feature map for use in adaptive kernel sizes for actual calculations according to channel. gamma, beta: when channel is given parameters of mapping function - refer to original paper https://arxiv.org/pdf/1910.03151.pdf + refer to original paper https://huggingface.co/papers/1910.03151 (default=None. if channel size not given, use k_size given for kernel size.) 
kernel_size: Adaptive selection of kernel size (default=3) gamm: used in kernel_size calc, see above diff --git a/timm/layers/evo_norm.py b/timm/layers/evo_norm.py index ea77620712..2cdf3d2375 100644 --- a/timm/layers/evo_norm.py +++ b/timm/layers/evo_norm.py @@ -1,6 +1,6 @@ """ EvoNorm in PyTorch -Based on `Evolving Normalization-Activation Layers` - https://arxiv.org/abs/2004.02967 +Based on `Evolving Normalization-Activation Layers` - https://huggingface.co/papers/2004.02967 @inproceedings{NEURIPS2020, author = {Liu, Hanxiao and Brock, Andy and Simonyan, Karen and Le, Quoc}, booktitle = {Advances in Neural Information Processing Systems}, diff --git a/timm/layers/filter_response_norm.py b/timm/layers/filter_response_norm.py index a66a1cd493..babb822e59 100644 --- a/timm/layers/filter_response_norm.py +++ b/timm/layers/filter_response_norm.py @@ -1,6 +1,6 @@ """ Filter Response Norm in PyTorch -Based on `Filter Response Normalization Layer` - https://arxiv.org/abs/1911.09737 +Based on `Filter Response Normalization Layer` - https://huggingface.co/papers/1911.09737 Hacked together by / Copyright 2021 Ross Wightman """ diff --git a/timm/layers/gather_excite.py b/timm/layers/gather_excite.py index 2d60dc961e..14de814594 100644 --- a/timm/layers/gather_excite.py +++ b/timm/layers/gather_excite.py @@ -1,6 +1,6 @@ """ Gather-Excite Attention Block -Paper: `Gather-Excite: Exploiting Feature Context in CNNs` - https://arxiv.org/abs/1810.12348 +Paper: `Gather-Excite: Exploiting Feature Context in CNNs` - https://huggingface.co/papers/1810.12348 Official code here, but it's only partial impl in Caffe: https://github.com/hujie-frank/GENet diff --git a/timm/layers/global_context.py b/timm/layers/global_context.py index de7fb5c15f..bd8274e2d4 100644 --- a/timm/layers/global_context.py +++ b/timm/layers/global_context.py @@ -1,7 +1,7 @@ """ Global Context Attention Block Paper: `GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond` - - https://arxiv.org/abs/1904.11492 + - https://huggingface.co/papers/1904.11492 Official code consulted as reference: https://github.com/xvjiarui/GCNet diff --git a/timm/layers/grn.py b/timm/layers/grn.py index ae71e013fc..600cf80648 100644 --- a/timm/layers/grn.py +++ b/timm/layers/grn.py @@ -1,7 +1,7 @@ """ Global Response Normalization Module Based on the GRN layer presented in -`ConvNeXt-V2 - Co-designing and Scaling ConvNets with Masked Autoencoders` - https://arxiv.org/abs/2301.00808 +`ConvNeXt-V2 - Co-designing and Scaling ConvNets with Masked Autoencoders` - https://huggingface.co/papers/2301.00808 This implementation * works for both NCHW and NHWC tensor layouts diff --git a/timm/layers/halo_attn.py b/timm/layers/halo_attn.py index f2ac64f85e..eb9464c121 100644 --- a/timm/layers/halo_attn.py +++ b/timm/layers/halo_attn.py @@ -1,7 +1,7 @@ """ Halo Self Attention Paper: `Scaling Local Self-Attention for Parameter Efficient Visual Backbones` - - https://arxiv.org/abs/2103.12731 + - https://huggingface.co/papers/2103.12731 @misc{2103.12731, Author = {Ashish Vaswani and Prajit Ramachandran and Aravind Srinivas and Niki Parmar and Blake Hechtman and @@ -31,7 +31,7 @@ def rel_logits_1d(q, rel_k, permute_mask: List[int]): """ Compute relative logits along one dimension As per: https://gist.github.com/aravindsrinivas/56359b79f0ce4449bcb04ab4b56a57a2 - Originally from: `Attention Augmented Convolutional Networks` - https://arxiv.org/abs/1904.09925 + Originally from: `Attention Augmented Convolutional Networks` - https://huggingface.co/papers/1904.09925 
Args: q: (batch, height, width, dim) @@ -61,7 +61,7 @@ def rel_logits_1d(q, rel_k, permute_mask: List[int]): class PosEmbedRel(nn.Module): """ Relative Position Embedding As per: https://gist.github.com/aravindsrinivas/56359b79f0ce4449bcb04ab4b56a57a2 - Originally from: `Attention Augmented Convolutional Networks` - https://arxiv.org/abs/1904.09925 + Originally from: `Attention Augmented Convolutional Networks` - https://huggingface.co/papers/1904.09925 """ def __init__(self, block_size, win_size, dim_head, scale): @@ -98,7 +98,7 @@ class HaloAttn(nn.Module): """ Halo Attention Paper: `Scaling Local Self-Attention for Parameter Efficient Visual Backbones` - - https://arxiv.org/abs/2103.12731 + - https://huggingface.co/papers/2103.12731 The internal dimensions of the attention module are controlled by the interaction of several arguments. * the output dimension of the module is specified by dim_out, which falls back to input dim if not set diff --git a/timm/layers/lambda_layer.py b/timm/layers/lambda_layer.py index 9192e266e6..88a3d4f0cc 100644 --- a/timm/layers/lambda_layer.py +++ b/timm/layers/lambda_layer.py @@ -1,7 +1,7 @@ """ Lambda Layer Paper: `LambdaNetworks: Modeling Long-Range Interactions Without Attention` - - https://arxiv.org/abs/2102.08602 + - https://huggingface.co/papers/2102.08602 @misc{2102.08602, Author = {Irwan Bello}, @@ -42,7 +42,7 @@ class LambdaLayer(nn.Module): """Lambda Layer Paper: `LambdaNetworks: Modeling Long-Range Interactions Without Attention` - - https://arxiv.org/abs/2102.08602 + - https://huggingface.co/papers/2102.08602 NOTE: intra-depth parameter 'u' is fixed at 1. It did not appear worth the complexity to add. diff --git a/timm/layers/mixed_conv2d.py b/timm/layers/mixed_conv2d.py index fa0ce565c0..5473c67781 100644 --- a/timm/layers/mixed_conv2d.py +++ b/timm/layers/mixed_conv2d.py @@ -1,6 +1,6 @@ """ PyTorch Mixed Convolution -Paper: MixConv: Mixed Depthwise Convolutional Kernels (https://arxiv.org/abs/1907.09595) +Paper: MixConv: Mixed Depthwise Convolutional Kernels (https://huggingface.co/papers/1907.09595) Hacked together by / Copyright 2020 Ross Wightman """ diff --git a/timm/layers/mlp.py b/timm/layers/mlp.py index 188c6b530b..98f52df255 100644 --- a/timm/layers/mlp.py +++ b/timm/layers/mlp.py @@ -52,7 +52,7 @@ def forward(self, x): class GluMlp(nn.Module): """ MLP w/ GLU style gating - See: https://arxiv.org/abs/1612.08083, https://arxiv.org/abs/2002.05202 + See: https://huggingface.co/papers/1612.08083, https://huggingface.co/papers/2002.05202 NOTE: When use_conv=True, expects 2D NCHW tensors, otherwise N*C expected. 
""" diff --git a/timm/layers/patch_dropout.py b/timm/layers/patch_dropout.py index 4428fe042f..5bd324909a 100644 --- a/timm/layers/patch_dropout.py +++ b/timm/layers/patch_dropout.py @@ -6,7 +6,7 @@ class PatchDropout(nn.Module): """ - https://arxiv.org/abs/2212.00794 and https://arxiv.org/pdf/2208.07220 + https://huggingface.co/papers/2212.00794 and https://huggingface.co/papers/2208.07220 """ return_indices: torch.jit.Final[bool] diff --git a/timm/layers/pos_embed_rel.py b/timm/layers/pos_embed_rel.py index 4fcb111e99..07653d1bcf 100644 --- a/timm/layers/pos_embed_rel.py +++ b/timm/layers/pos_embed_rel.py @@ -331,7 +331,7 @@ def gen_relative_log_coords( class RelPosMlp(nn.Module): """ Log-Coordinate Relative Position MLP - Based on ideas presented in Swin-V2 paper (https://arxiv.org/abs/2111.09883) + Based on ideas presented in Swin-V2 paper (https://huggingface.co/papers/2111.09883) This impl covers the 'swin' implementation as well as two timm specific modes ('cr', and 'rw') """ diff --git a/timm/layers/selective_kernel.py b/timm/layers/selective_kernel.py index ec8ee6ce27..0aae7a52db 100644 --- a/timm/layers/selective_kernel.py +++ b/timm/layers/selective_kernel.py @@ -1,6 +1,6 @@ """ Selective Kernel Convolution/Attention -Paper: Selective Kernel Networks (https://arxiv.org/abs/1903.06586) +Paper: Selective Kernel Networks (https://huggingface.co/papers/1903.06586) Hacked together by / Copyright 2020 Ross Wightman """ @@ -53,7 +53,7 @@ def __init__(self, in_channels, out_channels=None, kernel_size=None, stride=1, d act_layer=nn.ReLU, norm_layer=nn.BatchNorm2d, aa_layer=None, drop_layer=None): """ Selective Kernel Convolution Module - As described in Selective Kernel Networks (https://arxiv.org/abs/1903.06586) with some modifications. + As described in Selective Kernel Networks (https://huggingface.co/papers/1903.06586) with some modifications. Largest change is the input split, which divides the input channels across each convolution path, this can be viewed as a grouping of sorts, but the output channel counts expand to the module level value. This keeps diff --git a/timm/layers/split_attn.py b/timm/layers/split_attn.py index ac54f8988a..565a5c2541 100644 --- a/timm/layers/split_attn.py +++ b/timm/layers/split_attn.py @@ -1,6 +1,6 @@ """ Split Attention Conv2d (for ResNeSt Models) -Paper: `ResNeSt: Split-Attention Networks` - /https://arxiv.org/abs/2004.08955 +Paper: `ResNeSt: Split-Attention Networks` - /https://huggingface.co/papers/2004.08955 Adapted from original PyTorch impl at https://github.com/zhanghang1989/ResNeSt diff --git a/timm/layers/squeeze_excite.py b/timm/layers/squeeze_excite.py index 4fe568fe8f..27d57c9957 100644 --- a/timm/layers/squeeze_excite.py +++ b/timm/layers/squeeze_excite.py @@ -3,10 +3,10 @@ An SE implementation originally based on PyTorch SE-Net impl. Has since evolved with additional functionality / configuration. -Paper: `Squeeze-and-Excitation Networks` - https://arxiv.org/abs/1709.01507 +Paper: `Squeeze-and-Excitation Networks` - https://huggingface.co/papers/1709.01507 Also included is Effective Squeeze-Excitation (ESE). 
-Paper: `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://arxiv.org/abs/1911.06667 +Paper: `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://huggingface.co/papers/1911.06667 Hacked together by / Copyright 2021 Ross Wightman """ @@ -54,7 +54,7 @@ def forward(self, x): class EffectiveSEModule(nn.Module): """ 'Effective Squeeze-Excitation - From `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://arxiv.org/abs/1911.06667 + From `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://huggingface.co/papers/1911.06667 """ def __init__(self, channels, add_maxpool=False, gate_layer='hard_sigmoid', **_): super(EffectiveSEModule, self).__init__() diff --git a/timm/layers/std_conv.py b/timm/layers/std_conv.py index d896ba5c2f..ad752b19c1 100644 --- a/timm/layers/std_conv.py +++ b/timm/layers/std_conv.py @@ -11,7 +11,7 @@ ScaledStdConv: Paper: `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 Official Deepmind JAX code: https://github.com/deepmind/deepmind-research/tree/master/nfnets Hacked together by / copyright Ross Wightman, 2021. @@ -27,7 +27,7 @@ class StdConv2d(nn.Conv2d): """Conv2d with Weight Standardization. Used for BiT ResNet-V2 models. Paper: `Micro-Batch Training with Batch-Channel Normalization and Weight Standardization` - - https://arxiv.org/abs/1903.10520v2 + https://huggingface.co/papers/1903.10520v2 """ def __init__( self, in_channel, out_channels, kernel_size, stride=1, padding=None, @@ -51,7 +51,7 @@ class StdConv2dSame(nn.Conv2d): """Conv2d with Weight Standardization. TF compatible SAME padding. Used for ViT Hybrid model. Paper: `Micro-Batch Training with Batch-Channel Normalization and Weight Standardization` - - https://arxiv.org/abs/1903.10520v2 + https://huggingface.co/papers/1903.10520v2 """ def __init__( self, in_channel, out_channels, kernel_size, stride=1, padding='SAME', @@ -77,7 +77,7 @@ class ScaledStdConv2d(nn.Conv2d): """Conv2d layer with Scaled Weight Standardization. Paper: `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + https://huggingface.co/papers/2101.08692 NOTE: the operations used in this impl differ slightly from the DeepMind Haiku impl. The impact is minor. """ @@ -106,7 +106,7 @@ class ScaledStdConv2dSame(nn.Conv2d): """Conv2d layer with Scaled Weight Standardization and Tensorflow-like SAME padding support Paper: `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + https://huggingface.co/papers/2101.08692 NOTE: the operations used in this impl differ slightly from the DeepMind Haiku impl. The impact is minor. 
""" diff --git a/timm/loss/jsd.py b/timm/loss/jsd.py index dd64e156c2..be499e78f0 100644 --- a/timm/loss/jsd.py +++ b/timm/loss/jsd.py @@ -10,7 +10,7 @@ class JsdCrossEntropy(nn.Module): Based on impl here: https://github.com/google-research/augmix/blob/master/imagenet.py From paper: 'AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty - - https://arxiv.org/abs/1912.02781 + https://huggingface.co/papers/1912.02781 Hacked together by / Copyright 2020 Ross Wightman """ diff --git a/timm/models/_efficientnet_blocks.py b/timm/models/_efficientnet_blocks.py index 6ac2f8cd6d..4c0c57cd43 100644 --- a/timm/models/_efficientnet_blocks.py +++ b/timm/models/_efficientnet_blocks.py @@ -201,11 +201,11 @@ def forward(self, x): class InvertedResidual(nn.Module): """ Inverted residual block w/ optional SE - Originally used in MobileNet-V2 - https://arxiv.org/abs/1801.04381v4, this layer is often + Originally used in MobileNet-V2 - https://huggingface.co/papers/1801.04381v4, this layer is often referred to as 'MBConv' for (Mobile inverted bottleneck conv) and is also used in - * MNasNet - https://arxiv.org/abs/1807.11626 - * EfficientNet - https://arxiv.org/abs/1905.11946 - * MobileNet-V3 - https://arxiv.org/abs/1905.02244 + * MNasNet - https://huggingface.co/papers/1807.11626 + * EfficientNet - https://huggingface.co/papers/1905.11946 + * MobileNet-V3 - https://huggingface.co/papers/1905.02244 """ def __init__( @@ -487,7 +487,7 @@ def __init__( self.has_query_stride = any([s > 1 for s in self.query_strides]) # This CPE is different than the one suggested in the original paper. - # https://arxiv.org/abs/2102.10882 + # https://huggingface.co/papers/2102.10882 # 1. Rather than adding one CPE before the attention blocks, we add a CPE # into every attention block. # 2. We replace the expensive Conv2D by a Separable DW Conv. 
@@ -632,9 +632,9 @@ class EdgeResidual(nn.Module): - https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html This layer is also called FusedMBConv in the MobileDet, EfficientNet-X, and EfficientNet-V2 papers - * MobileDet - https://arxiv.org/abs/2004.14525 - * EfficientNet-X - https://arxiv.org/abs/2102.05610 - * EfficientNet-V2 - https://arxiv.org/abs/2104.00298 + * MobileDet - https://huggingface.co/papers/2004.14525 + * EfficientNet-X - https://huggingface.co/papers/2102.05610 + * EfficientNet-V2 - https://huggingface.co/papers/2104.00298 """ def __init__( diff --git a/timm/models/beit.py b/timm/models/beit.py index 5123a60627..5032a64da8 100644 --- a/timm/models/beit.py +++ b/timm/models/beit.py @@ -1,4 +1,4 @@ -""" BEiT: BERT Pre-Training of Image Transformers (https://arxiv.org/abs/2106.08254) +""" BEiT: BERT Pre-Training of Image Transformers (https://huggingface.co/papers/2106.08254) Model from official source: https://github.com/microsoft/unilm/tree/master/beit @@ -27,7 +27,7 @@ Modifications by / Copyright 2021 Ross Wightman, original copyrights below """ # -------------------------------------------------------- -# BEIT: BERT Pre-Training of Image Transformers (https://arxiv.org/abs/2106.08254) +# BEIT: BERT Pre-Training of Image Transformers (https://huggingface.co/papers/2106.08254) # Github source: https://github.com/microsoft/unilm/tree/master/beit # Copyright (c) 2021 Microsoft # Licensed under The MIT License [see LICENSE for details] diff --git a/timm/models/byobnet.py b/timm/models/byobnet.py index 764d5ad5eb..1fc32221fd 100644 --- a/timm/models/byobnet.py +++ b/timm/models/byobnet.py @@ -5,15 +5,15 @@ This model is currently used to implement the following networks: GPU Efficient (ResNets) - gernet_l/m/s (original versions called genet, but this was already used (by SENet author)). -Paper: `Neural Architecture Design for GPU-Efficient Networks` - https://arxiv.org/abs/2006.14090 +Paper: `Neural Architecture Design for GPU-Efficient Networks` - https://huggingface.co/papers/2006.14090 Code and weights: https://github.com/idstcv/GPU-Efficient-Networks, licensed Apache 2.0 RepVGG - repvgg_* -Paper: `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 +Paper: `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 Code and weights: https://github.com/DingXiaoH/RepVGG, licensed MIT MobileOne - mobileone_* -Paper: `MobileOne: An Improved One millisecond Mobile Backbone` - https://arxiv.org/abs/2206.04040 +Paper: `MobileOne: An Improved One millisecond Mobile Backbone` - https://huggingface.co/papers/2206.04040 Code and weights: https://github.com/apple/ml-mobileone, licensed MIT In all cases the models have been modified to fit within the design of ByobNet. I've remapped @@ -553,7 +553,7 @@ def forward(self, x): def reparameterize(self): """ Following works like `RepVGG: Making VGG-style ConvNets Great Again` - - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched + https://huggingface.co/papers/2101.03697. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference. 
""" @@ -649,7 +649,7 @@ class MobileOneBlock(nn.Module): and plain-CNN style architecture at inference time For more details, please refer to our paper: `An Improved One millisecond Mobile Backbone` - - https://arxiv.org/pdf/2206.04040.pdf + https://huggingface.co/papers/2206.04040 """ def __init__( @@ -738,7 +738,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: def reparameterize(self): """ Following works like `RepVGG: Making VGG-style ConvNets Great Again` - - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched + https://huggingface.co/papers/2101.03697. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference. """ @@ -2386,7 +2386,7 @@ def _cfgr(url='', **kwargs): @register_model def gernet_l(pretrained=False, **kwargs) -> ByobNet: """ GEResNet-Large (GENet-Large from official impl) - `Neural Architecture Design for GPU-Efficient Networks` - https://arxiv.org/abs/2006.14090 + `Neural Architecture Design for GPU-Efficient Networks` - https://huggingface.co/papers/2006.14090 """ return _create_byobnet('gernet_l', pretrained=pretrained, **kwargs) @@ -2394,7 +2394,7 @@ def gernet_l(pretrained=False, **kwargs) -> ByobNet: @register_model def gernet_m(pretrained=False, **kwargs) -> ByobNet: """ GEResNet-Medium (GENet-Normal from official impl) - `Neural Architecture Design for GPU-Efficient Networks` - https://arxiv.org/abs/2006.14090 + `Neural Architecture Design for GPU-Efficient Networks` - https://huggingface.co/papers/2006.14090 """ return _create_byobnet('gernet_m', pretrained=pretrained, **kwargs) @@ -2402,7 +2402,7 @@ def gernet_m(pretrained=False, **kwargs) -> ByobNet: @register_model def gernet_s(pretrained=False, **kwargs) -> ByobNet: """ EResNet-Small (GENet-Small from official impl) - `Neural Architecture Design for GPU-Efficient Networks` - https://arxiv.org/abs/2006.14090 + `Neural Architecture Design for GPU-Efficient Networks` - https://huggingface.co/papers/2006.14090 """ return _create_byobnet('gernet_s', pretrained=pretrained, **kwargs) @@ -2410,7 +2410,7 @@ def gernet_s(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_a0(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-A0 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_a0', pretrained=pretrained, **kwargs) @@ -2418,7 +2418,7 @@ def repvgg_a0(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_a1(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-A1 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_a1', pretrained=pretrained, **kwargs) @@ -2426,7 +2426,7 @@ def repvgg_a1(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_a2(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-A2 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_a2', pretrained=pretrained, **kwargs) @@ -2434,7 +2434,7 @@ def repvgg_a2(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b0(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B0 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - 
https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b0', pretrained=pretrained, **kwargs) @@ -2442,7 +2442,7 @@ def repvgg_b0(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b1(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B1 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b1', pretrained=pretrained, **kwargs) @@ -2450,7 +2450,7 @@ def repvgg_b1(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b1g4(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B1g4 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b1g4', pretrained=pretrained, **kwargs) @@ -2458,7 +2458,7 @@ def repvgg_b1g4(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b2(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B2 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b2', pretrained=pretrained, **kwargs) @@ -2466,7 +2466,7 @@ def repvgg_b2(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b2g4(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B2g4 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b2g4', pretrained=pretrained, **kwargs) @@ -2474,7 +2474,7 @@ def repvgg_b2g4(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b3(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B3 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b3', pretrained=pretrained, **kwargs) @@ -2482,7 +2482,7 @@ def repvgg_b3(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_b3g4(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-B3g4 - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_b3g4', pretrained=pretrained, **kwargs) @@ -2490,7 +2490,7 @@ def repvgg_b3g4(pretrained=False, **kwargs) -> ByobNet: @register_model def repvgg_d2se(pretrained=False, **kwargs) -> ByobNet: """ RepVGG-D2se - `Making VGG-style ConvNets Great Again` - https://arxiv.org/abs/2101.03697 + `Making VGG-style ConvNets Great Again` - https://huggingface.co/papers/2101.03697 """ return _create_byobnet('repvgg_d2se', pretrained=pretrained, **kwargs) diff --git a/timm/models/cait.py b/timm/models/cait.py index 28e14ec756..318b6a8a19 100644 --- a/timm/models/cait.py +++ b/timm/models/cait.py @@ -1,6 +1,6 @@ """ Class-Attention in Image Transformers (CaiT) -Paper: 'Going deeper with Image Transformers' - https://arxiv.org/abs/2103.17239 +Paper: 'Going deeper with Image Transformers' - https://huggingface.co/papers/2103.17239 Original code and weights from https://github.com/facebookresearch/deit, copyright below @@ -116,7 +116,7 @@ def forward(self, x, x_cls): class TalkingHeadAttn(nn.Module): # taken from 
https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py - # with slight modifications to add Talking Heads Attention (https://arxiv.org/pdf/2003.02436v1.pdf) + # with slight modifications to add Talking Heads Attention (https://huggingface.co/papers/2003.02436) def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0.): super().__init__() diff --git a/timm/models/coat.py b/timm/models/coat.py index 906ecb9083..2f8dca5766 100644 --- a/timm/models/coat.py +++ b/timm/models/coat.py @@ -1,7 +1,7 @@ """ CoaT architecture. -Paper: Co-Scale Conv-Attentional Image Transformers - https://arxiv.org/abs/2104.06399 +Paper: Co-Scale Conv-Attentional Image Transformers - https://huggingface.co/papers/2104.06399 Official CoaT code at: https://github.com/mlpc-ucsd/CoaT diff --git a/timm/models/convit.py b/timm/models/convit.py index cbe3b51ece..5eb9724c8b 100644 --- a/timm/models/convit.py +++ b/timm/models/convit.py @@ -7,7 +7,7 @@ year={2021} } -Paper link: https://arxiv.org/abs/2103.10697 +Paper link: https://huggingface.co/papers/2103.10697 Original code: https://github.com/facebookresearch/convit, original copyright below Modifications and additions for timm hacked together by / Copyright 2021, Ross Wightman diff --git a/timm/models/convnext.py b/timm/models/convnext.py index e2eb48d37f..26c51cc069 100644 --- a/timm/models/convnext.py +++ b/timm/models/convnext.py @@ -1,7 +1,7 @@ """ ConvNeXt Papers: -* `A ConvNet for the 2020s` - https://arxiv.org/pdf/2201.03545.pdf +* `A ConvNet for the 2020s` - https://huggingface.co/papers/2201.03545 @Article{liu2022convnet, author = {Zhuang Liu and Hanzi Mao and Chao-Yuan Wu and Christoph Feichtenhofer and Trevor Darrell and Saining Xie}, title = {A ConvNet for the 2020s}, @@ -9,7 +9,7 @@ year = {2022}, } -* `ConvNeXt-V2 - Co-designing and Scaling ConvNets with Masked Autoencoders` - https://arxiv.org/abs/2301.00808 +* `ConvNeXt-V2 - Co-designing and Scaling ConvNets with Masked Autoencoders` - https://huggingface.co/papers/2301.00808 @article{Woo2023ConvNeXtV2, title={ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders}, author={Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon and Saining Xie}, @@ -265,7 +265,7 @@ def _get_norm_layers(norm_layer: Union[Callable, str], conv_mlp: bool, norm_eps: class ConvNeXt(nn.Module): r""" ConvNeXt - A PyTorch impl of : `A ConvNet for the 2020s` - https://arxiv.org/pdf/2201.03545.pdf + A PyTorch impl of : `A ConvNet for the 2020s` - https://huggingface.co/papers/2201.03545 """ def __init__( diff --git a/timm/models/crossvit.py b/timm/models/crossvit.py index f3d52f8e49..e8d57d1f6a 100644 --- a/timm/models/crossvit.py +++ b/timm/models/crossvit.py @@ -8,7 +8,7 @@ year={2021} } -Paper link: https://arxiv.org/abs/2103.14899 +Paper link: https://huggingface.co/papers/2103.14899 Original code: https://github.com/IBM/CrossViT/blob/main/models/crossvit.py NOTE: model names have been renamed from originals to represent actual input res all *_224 -> *_240 and *_384 -> *_408 diff --git a/timm/models/cspnet.py b/timm/models/cspnet.py index 81d11a0654..c6cb4c75bd 100644 --- a/timm/models/cspnet.py +++ b/timm/models/cspnet.py @@ -6,7 +6,7 @@ * CSPDarkNet53 * and DarkNet53 for good measure -Based on paper `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - https://arxiv.org/abs/1911.11929 +Based on paper `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - 
https://huggingface.co/papers/1911.11929 Reference impl via darknet cfg files at https://github.com/WongKinYiu/CrossStagePartialNetworks @@ -616,7 +616,7 @@ def create_csp_stages( class CspNet(nn.Module): """Cross Stage Partial base model. - Paper: `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - https://arxiv.org/abs/1911.11929 + Paper: `CSPNet: A New Backbone that can Enhance Learning Capability of CNN` - https://huggingface.co/papers/1911.11929 Ref Impl: https://github.com/WongKinYiu/CrossStagePartialNetworks NOTE: There are differences in the way I handle the 1x1 'expansion' conv in this impl vs the diff --git a/timm/models/davit.py b/timm/models/davit.py index f538ecca84..3f7db8d4db 100644 --- a/timm/models/davit.py +++ b/timm/models/davit.py @@ -1,6 +1,6 @@ """ DaViT: Dual Attention Vision Transformers -As described in https://arxiv.org/abs/2204.03645 +As described in https://huggingface.co/papers/2204.03645 Input size invariant transformer architecture that combines channel and spacial attention in each block. The attention mechanisms used are linear in complexity. @@ -501,7 +501,7 @@ def forward(self, x: Tensor): class DaVit(nn.Module): r""" DaViT - A PyTorch implementation of `DaViT: Dual Attention Vision Transformers` - https://arxiv.org/abs/2204.03645 + A PyTorch implementation of `DaViT: Dual Attention Vision Transformers` - https://huggingface.co/papers/2204.03645 Supports arbitrary input sizes and pyramid feature extraction Args: diff --git a/timm/models/deit.py b/timm/models/deit.py index 0072013bf6..271734652e 100644 --- a/timm/models/deit.py +++ b/timm/models/deit.py @@ -2,9 +2,9 @@ DeiT model defs and weights from https://github.com/facebookresearch/deit, original copyright below -paper: `DeiT: Data-efficient Image Transformers` - https://arxiv.org/abs/2012.12877 +paper: `DeiT: Data-efficient Image Transformers` - https://huggingface.co/papers/2012.12877 -paper: `DeiT III: Revenge of the ViT` - https://arxiv.org/abs/2204.07118 +paper: `DeiT III: Revenge of the ViT` - https://huggingface.co/papers/2204.07118 Modifications copyright 2021, Ross Wightman """ @@ -29,7 +29,7 @@ class VisionTransformerDistilled(VisionTransformer): """ Vision Transformer w/ Distillation Token and Head Distillation token & head support for `DeiT: Data-efficient Image Transformers` - - https://arxiv.org/abs/2012.12877 + - https://huggingface.co/papers/2012.12877 """ def __init__(self, *args, **kwargs): @@ -243,7 +243,7 @@ def _cfg(url='', **kwargs): @register_model def deit_tiny_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-tiny model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-tiny model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=192, depth=12, num_heads=3) @@ -253,7 +253,7 @@ def deit_tiny_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit_small_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-small model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-small model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. 
""" model_args = dict(patch_size=16, embed_dim=384, depth=12, num_heads=6) @@ -263,7 +263,7 @@ def deit_small_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit_base_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT base model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT base model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -273,7 +273,7 @@ def deit_base_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit_base_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT base model @ 384x384 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT base model @ 384x384 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -283,7 +283,7 @@ def deit_base_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit_tiny_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransformerDistilled: - """ DeiT-tiny distilled model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-tiny distilled model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=192, depth=12, num_heads=3) @@ -294,7 +294,7 @@ def deit_tiny_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransfo @register_model def deit_small_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransformerDistilled: - """ DeiT-small distilled model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-small distilled model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=384, depth=12, num_heads=6) @@ -305,7 +305,7 @@ def deit_small_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransf @register_model def deit_base_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransformerDistilled: - """ DeiT-base distilled model @ 224x224 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-base distilled model @ 224x224 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -316,7 +316,7 @@ def deit_base_distilled_patch16_224(pretrained=False, **kwargs) -> VisionTransfo @register_model def deit_base_distilled_patch16_384(pretrained=False, **kwargs) -> VisionTransformerDistilled: - """ DeiT-base distilled model @ 384x384 from paper (https://arxiv.org/abs/2012.12877). + """ DeiT-base distilled model @ 384x384 from paper (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -327,7 +327,7 @@ def deit_base_distilled_patch16_384(pretrained=False, **kwargs) -> VisionTransfo @register_model def deit3_small_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 small model @ 224x224 from paper (https://arxiv.org/abs/2204.07118). 
+ """ DeiT-3 small model @ 224x224 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=384, depth=12, num_heads=6, no_embed_class=True, init_values=1e-6) @@ -337,7 +337,7 @@ def deit3_small_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_small_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 small model @ 384x384 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 small model @ 384x384 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=384, depth=12, num_heads=6, no_embed_class=True, init_values=1e-6) @@ -347,7 +347,7 @@ def deit3_small_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_medium_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 medium model @ 224x224 (https://arxiv.org/abs/2012.12877). + """ DeiT-3 medium model @ 224x224 (https://huggingface.co/papers/2012.12877). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=512, depth=12, num_heads=8, no_embed_class=True, init_values=1e-6) @@ -357,7 +357,7 @@ def deit3_medium_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_base_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 base model @ 224x224 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 base model @ 224x224 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, no_embed_class=True, init_values=1e-6) @@ -367,7 +367,7 @@ def deit3_base_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_base_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 base model @ 384x384 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 base model @ 384x384 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, no_embed_class=True, init_values=1e-6) @@ -377,7 +377,7 @@ def deit3_base_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_large_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 large model @ 224x224 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 large model @ 224x224 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=16, embed_dim=1024, depth=24, num_heads=16, no_embed_class=True, init_values=1e-6) @@ -387,7 +387,7 @@ def deit3_large_patch16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_large_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 large model @ 384x384 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 large model @ 384x384 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. 
""" model_args = dict(patch_size=16, embed_dim=1024, depth=24, num_heads=16, no_embed_class=True, init_values=1e-6) @@ -397,7 +397,7 @@ def deit3_large_patch16_384(pretrained=False, **kwargs) -> VisionTransformer: @register_model def deit3_huge_patch14_224(pretrained=False, **kwargs) -> VisionTransformer: - """ DeiT-3 base model @ 384x384 from paper (https://arxiv.org/abs/2204.07118). + """ DeiT-3 base model @ 384x384 from paper (https://huggingface.co/papers/2204.07118). ImageNet-1k weights from https://github.com/facebookresearch/deit. """ model_args = dict(patch_size=14, embed_dim=1280, depth=32, num_heads=16, no_embed_class=True, init_values=1e-6) diff --git a/timm/models/densenet.py b/timm/models/densenet.py index d522965907..e7bdcc8c7a 100644 --- a/timm/models/densenet.py +++ b/timm/models/densenet.py @@ -145,7 +145,7 @@ def __init__( class DenseNet(nn.Module): r"""Densenet-BC model class, based on - `"Densely Connected Convolutional Networks" `_ + `"Densely Connected Convolutional Networks" `_ Args: growth_rate (int) - how many filters to add each layer (`k` in paper) @@ -156,7 +156,7 @@ class DenseNet(nn.Module): proj_drop_rate (float) - dropout rate after each dense layer num_classes (int) - number of classification classes memory_efficient (bool) - If True, uses checkpointing. Much more memory efficient, - but slower. Default: *False*. See `"paper" `_ + but slower. Default: *False*. See `"paper" `_ """ def __init__( @@ -361,7 +361,7 @@ def _cfg(url='', **kwargs): @register_model def densenet121(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-121 model from - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=32, block_config=(6, 12, 24, 16)) model = _create_densenet('densenet121', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -371,7 +371,7 @@ def densenet121(pretrained=False, **kwargs) -> DenseNet: @register_model def densenetblur121d(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-121 w/ blur-pooling & 3-layer 3x3 stem - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=32, block_config=(6, 12, 24, 16), stem_type='deep', aa_layer=BlurPool2d) model = _create_densenet('densenetblur121d', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -381,7 +381,7 @@ def densenetblur121d(pretrained=False, **kwargs) -> DenseNet: @register_model def densenet169(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-169 model from - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=32, block_config=(6, 12, 32, 32)) model = _create_densenet('densenet169', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -391,7 +391,7 @@ def densenet169(pretrained=False, **kwargs) -> DenseNet: @register_model def densenet201(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-201 model from - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=32, block_config=(6, 12, 48, 32)) model = _create_densenet('densenet201', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -401,7 +401,7 @@ def densenet201(pretrained=False, **kwargs) -> DenseNet: @register_model def densenet161(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-161 model from - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=48, 
block_config=(6, 12, 36, 24)) model = _create_densenet('densenet161', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -411,7 +411,7 @@ def densenet161(pretrained=False, **kwargs) -> DenseNet: @register_model def densenet264d(pretrained=False, **kwargs) -> DenseNet: r"""Densenet-264 model from - `"Densely Connected Convolutional Networks" ` + `"Densely Connected Convolutional Networks" ` """ model_args = dict(growth_rate=48, block_config=(6, 12, 64, 48), stem_type='deep') model = _create_densenet('densenet264d', pretrained=pretrained, **dict(model_args, **kwargs)) diff --git a/timm/models/dla.py b/timm/models/dla.py index 666acd9d9c..6637d80252 100644 --- a/timm/models/dla.py +++ b/timm/models/dla.py @@ -1,9 +1,9 @@ """ Deep Layer Aggregation and DLA w/ Res2Net DLA original adapted from Official Pytorch impl at: https://github.com/ucbdrive/dla -DLA Paper: `Deep Layer Aggregation` - https://arxiv.org/abs/1707.06484 +DLA Paper: `Deep Layer Aggregation` - https://huggingface.co/papers/1707.06484 Res2Net additions from: https://github.com/gasvn/Res2Net/ -Res2Net Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://arxiv.org/abs/1904.01169 +Res2Net Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://huggingface.co/papers/1904.01169 """ import math from typing import List, Optional diff --git a/timm/models/edgenext.py b/timm/models/edgenext.py index e21be9713b..1470214128 100644 --- a/timm/models/edgenext.py +++ b/timm/models/edgenext.py @@ -1,7 +1,7 @@ """ EdgeNeXt Paper: `EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications` - - https://arxiv.org/abs/2206.10589 + - https://huggingface.co/papers/2206.10589 Original code and weights from https://github.com/mmaaz60/EdgeNeXt diff --git a/timm/models/efficientnet.py b/timm/models/efficientnet.py index b5bc35c036..06a7d30ce6 100644 --- a/timm/models/efficientnet.py +++ b/timm/models/efficientnet.py @@ -3,28 +3,28 @@ An implementation of EfficienNet that covers variety of related models with efficient architectures: * EfficientNet-V2 - - `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + - `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 * EfficientNet (B0-B8, L2 + Tensorflow pretrained AutoAug/RandAug/AdvProp/NoisyStudent weight ports) - - EfficientNet: Rethinking Model Scaling for CNNs - https://arxiv.org/abs/1905.11946 - - CondConv: Conditionally Parameterized Convolutions for Efficient Inference - https://arxiv.org/abs/1904.04971 - - Adversarial Examples Improve Image Recognition - https://arxiv.org/abs/1911.09665 - - Self-training with Noisy Student improves ImageNet classification - https://arxiv.org/abs/1911.04252 + - EfficientNet: Rethinking Model Scaling for CNNs - https://huggingface.co/papers/1905.11946 + - CondConv: Conditionally Parameterized Convolutions for Efficient Inference - https://huggingface.co/papers/1904.04971 + - Adversarial Examples Improve Image Recognition - https://huggingface.co/papers/1911.09665 + - Self-training with Noisy Student improves ImageNet classification - https://huggingface.co/papers/1911.04252 * MixNet (Small, Medium, and Large) - - MixConv: Mixed Depthwise Convolutional Kernels - https://arxiv.org/abs/1907.09595 + - MixConv: Mixed Depthwise Convolutional Kernels - https://huggingface.co/papers/1907.09595 * MNasNet B1, A1 (SE), Small - - MnasNet: Platform-Aware Neural Architecture Search for Mobile - https://arxiv.org/abs/1807.11626 + - MnasNet: 
Platform-Aware Neural Architecture Search for Mobile - https://huggingface.co/papers/1807.11626 * FBNet-C - - FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable NAS - https://arxiv.org/abs/1812.03443 + - FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable NAS - https://huggingface.co/papers/1812.03443 * Single-Path NAS Pixel1 - - Single-Path NAS: Designing Hardware-Efficient ConvNets - https://arxiv.org/abs/1904.02877 + - Single-Path NAS: Designing Hardware-Efficient ConvNets - https://huggingface.co/papers/1904.02877 * TinyNet - - Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets - https://arxiv.org/abs/2010.14819 + - Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets - https://huggingface.co/papers/2010.14819 - Definitions & weights borrowed from https://github.com/huawei-noah/CV-Backbones/tree/master/tinynet_pytorch * And likely more... @@ -389,7 +389,7 @@ def _gen_mnasnet_a1(variant, channel_multiplier=1.0, pretrained=False, **kwargs) """Creates a mnasnet-a1 model. Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet - Paper: https://arxiv.org/pdf/1807.11626.pdf. + Paper: https://huggingface.co/papers/1807.11626. Args: channel_multiplier: multiplier to number of channels per layer. @@ -425,7 +425,7 @@ def _gen_mnasnet_b1(variant, channel_multiplier=1.0, pretrained=False, **kwargs) """Creates a mnasnet-b1 model. Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet - Paper: https://arxiv.org/pdf/1807.11626.pdf. + Paper: https://huggingface.co/papers/1807.11626. Args: channel_multiplier: multiplier to number of channels per layer. @@ -461,7 +461,7 @@ def _gen_mnasnet_small(variant, channel_multiplier=1.0, pretrained=False, **kwar """Creates a mnasnet-b1 model. Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet - Paper: https://arxiv.org/pdf/1807.11626.pdf. + Paper: https://huggingface.co/papers/1807.11626. Args: channel_multiplier: multiplier to number of channels per layer. @@ -492,7 +492,7 @@ def _gen_mobilenet_v1( ): """ Ref impl: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py - Paper: https://arxiv.org/abs/1801.04381 + Paper: https://huggingface.co/papers/1801.04381 """ arch_def = [ ['dsa_r1_k3_s1_c64'], @@ -528,7 +528,7 @@ def _gen_mobilenet_v2( ): """ Generate MobileNet-V2 network Ref impl: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet_v2.py - Paper: https://arxiv.org/abs/1801.04381 + Paper: https://huggingface.co/papers/1801.04381 """ arch_def = [ ['ds_r1_k3_s1_c16'], @@ -562,7 +562,7 @@ def _gen_mobilenet_v2( def _gen_fbnetc(variant, channel_multiplier=1.0, pretrained=False, **kwargs): """ FBNet-C - Paper: https://arxiv.org/abs/1812.03443 + Paper: https://huggingface.co/papers/1812.03443 Ref Impl: https://github.com/facebookresearch/maskrcnn-benchmark/blob/master/maskrcnn_benchmark/modeling/backbone/fbnet_modeldef.py NOTE: the impl above does not relate to the 'C' variant here, that was derived from paper, @@ -592,7 +592,7 @@ def _gen_fbnetc(variant, channel_multiplier=1.0, pretrained=False, **kwargs): def _gen_spnasnet(variant, channel_multiplier=1.0, pretrained=False, **kwargs): """Creates the Single-Path NAS model from search targeted for Pixel1 phone. - Paper: https://arxiv.org/abs/1904.02877 + Paper: https://huggingface.co/papers/1904.02877 Args: channel_multiplier: multiplier to number of channels per layer. 
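The `_gen_*` builders in these hunks scale per-stage channel counts by `channel_multiplier` and round the result to a hardware-friendly multiple. A self-contained sketch of that rounding rule (an illustrative `scale_channels` helper under the common round-to-8, keep-at-least-90% convention; not timm's exact internal function):

```python
def scale_channels(channels: int, multiplier: float = 1.0, divisor: int = 8) -> int:
    """Scale a channel count and round to the nearest multiple of `divisor`."""
    if multiplier == 1.0:
        return channels
    scaled = channels * multiplier
    rounded = max(divisor, int(scaled + divisor / 2) // divisor * divisor)
    if rounded < 0.9 * scaled:
        rounded += divisor  # never shrink a stage by more than ~10%
    return rounded

print(scale_channels(32, 1.4))   # 48
print(scale_channels(40, 0.75))  # 32
print(scale_channels(24, 0.75))  # 24 (16 would drop >10% of 18, so bumped by one divisor)
```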
@@ -631,7 +631,7 @@ def _gen_efficientnet( """Creates an EfficientNet model. Ref impl: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_model.py - Paper: https://arxiv.org/abs/1905.11946 + Paper: https://huggingface.co/papers/1905.11946 EfficientNet params name: (channel_multiplier, depth_multiplier, resolution, dropout_rate) @@ -742,7 +742,7 @@ def _gen_efficientnet_lite(variant, channel_multiplier=1.0, depth_multiplier=1.0 """Creates an EfficientNet-Lite model. Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet/lite - Paper: https://arxiv.org/abs/1905.11946 + Paper: https://huggingface.co/papers/1905.11946 EfficientNet params name: (channel_multiplier, depth_multiplier, resolution, dropout_rate) @@ -785,7 +785,7 @@ def _gen_efficientnetv2_base( """ Creates an EfficientNet-V2 base model Ref impl: https://github.com/google/automl/tree/master/efficientnetv2 - Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 """ arch_def = [ ['cn_r1_k3_s1_e1_c16_skip'], @@ -815,7 +815,7 @@ def _gen_efficientnetv2_s( """ Creates an EfficientNet-V2 Small model Ref impl: https://github.com/google/automl/tree/master/efficientnetv2 - Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 NOTE: `rw` flag sets up 'small' variant to behave like my initial v2 small model, before ref the impl was released. @@ -855,7 +855,7 @@ def _gen_efficientnetv2_m( """ Creates an EfficientNet-V2 Medium model Ref impl: https://github.com/google/automl/tree/master/efficientnetv2 - Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 """ arch_def = [ @@ -887,7 +887,7 @@ def _gen_efficientnetv2_l( """ Creates an EfficientNet-V2 Large model Ref impl: https://github.com/google/automl/tree/master/efficientnetv2 - Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 """ arch_def = [ @@ -919,7 +919,7 @@ def _gen_efficientnetv2_xl( """ Creates an EfficientNet-V2 Xtra-Large model Ref impl: https://github.com/google/automl/tree/master/efficientnetv2 - Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://arxiv.org/abs/2104.00298 + Paper: `EfficientNetV2: Smaller Models and Faster Training` - https://huggingface.co/papers/2104.00298 """ arch_def = [ @@ -952,7 +952,7 @@ def _gen_efficientnet_x( """Creates an EfficientNet model. Ref impl: https://github.com/tensorflow/tpu/blob/master/models/official/efficientnet/efficientnet_model.py - Paper: https://arxiv.org/abs/1905.11946 + Paper: https://huggingface.co/papers/1905.11946 EfficientNet params name: (channel_multiplier, depth_multiplier, resolution, dropout_rate) @@ -1032,7 +1032,7 @@ def _gen_mixnet_s(variant, channel_multiplier=1.0, pretrained=False, **kwargs): """Creates a MixNet Small model. 
Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet/mixnet - Paper: https://arxiv.org/abs/1907.09595 + Paper: https://huggingface.co/papers/1907.09595 """ arch_def = [ # stage 0, 112x112 in @@ -1065,7 +1065,7 @@ def _gen_mixnet_m(variant, channel_multiplier=1.0, depth_multiplier=1.0, pretrai """Creates a MixNet Medium-Large model. Ref impl: https://github.com/tensorflow/tpu/tree/master/models/official/mnasnet/mixnet - Paper: https://arxiv.org/abs/1907.09595 + Paper: https://huggingface.co/papers/1907.09595 """ arch_def = [ # stage 0, 112x112 in @@ -2231,7 +2231,7 @@ def efficientnet_lite4(pretrained=False, **kwargs) -> EfficientNet: @register_model def efficientnet_b1_pruned(pretrained=False, **kwargs) -> EfficientNet: - """ EfficientNet-B1 Pruned. The pruning has been obtained using https://arxiv.org/pdf/2002.08258.pdf """ + """ EfficientNet-B1 Pruned. The pruning has been obtained using https://huggingface.co/papers/2002.08258 """ kwargs.setdefault('bn_eps', BN_EPS_TF_DEFAULT) kwargs.setdefault('pad_type', 'same') variant = 'efficientnet_b1_pruned' @@ -2242,7 +2242,7 @@ def efficientnet_b1_pruned(pretrained=False, **kwargs) -> EfficientNet: @register_model def efficientnet_b2_pruned(pretrained=False, **kwargs) -> EfficientNet: - """ EfficientNet-B2 Pruned. The pruning has been obtained using https://arxiv.org/pdf/2002.08258.pdf """ + """ EfficientNet-B2 Pruned. The pruning has been obtained using https://huggingface.co/papers/2002.08258 """ kwargs.setdefault('bn_eps', BN_EPS_TF_DEFAULT) kwargs.setdefault('pad_type', 'same') model = _gen_efficientnet( @@ -2253,7 +2253,7 @@ def efficientnet_b2_pruned(pretrained=False, **kwargs) -> EfficientNet: @register_model def efficientnet_b3_pruned(pretrained=False, **kwargs) -> EfficientNet: - """ EfficientNet-B3 Pruned. The pruning has been obtained using https://arxiv.org/pdf/2002.08258.pdf """ + """ EfficientNet-B3 Pruned. 
The pruning has been obtained using https://huggingface.co/papers/2002.08258 """ kwargs.setdefault('bn_eps', BN_EPS_TF_DEFAULT) kwargs.setdefault('pad_type', 'same') model = _gen_efficientnet( diff --git a/timm/models/efficientvit_mit.py b/timm/models/efficientvit_mit.py index 27872310e0..d9294fd61c 100644 --- a/timm/models/efficientvit_mit.py +++ b/timm/models/efficientvit_mit.py @@ -1,7 +1,7 @@ """ EfficientViT (by MIT Song Han's Lab) Paper: `Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition` - - https://arxiv.org/abs/2205.14756 + - https://huggingface.co/papers/2205.14756 Adapted from official impl at https://github.com/mit-han-lab/efficientvit """ diff --git a/timm/models/efficientvit_msra.py b/timm/models/efficientvit_msra.py index 91caaa5a4d..98a3f3d948 100644 --- a/timm/models/efficientvit_msra.py +++ b/timm/models/efficientvit_msra.py @@ -1,7 +1,7 @@ """ EfficientViT (by MSRA) Paper: `EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention` - - https://arxiv.org/abs/2305.07027 + - https://huggingface.co/papers/2305.07027 Adapted from official impl at https://github.com/microsoft/Cream/tree/main/EfficientViT """ diff --git a/timm/models/eva.py b/timm/models/eva.py index 99a77bfdec..3c5daf3f07 100644 --- a/timm/models/eva.py +++ b/timm/models/eva.py @@ -1,6 +1,6 @@ """ EVA -EVA from https://github.com/baaivision/EVA , paper: https://arxiv.org/abs/2211.07636 +EVA from https://github.com/baaivision/EVA , paper: https://huggingface.co/papers/2211.07636 @article{EVA, title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale}, @@ -10,7 +10,7 @@ year={2022} } -EVA-02: A Visual Representation for Neon Genesis - https://arxiv.org/abs/2303.11331 +EVA-02: A Visual Representation for Neon Genesis - https://huggingface.co/papers/2303.11331 @article{EVA02, title={EVA-02: A Visual Representation for Neon Genesis}, author={Fang, Yuxin and Sun, Quan and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue}, @@ -1204,7 +1204,7 @@ def _pe_cfg(url='', **kwargs): @register_model def eva_giant_patch14_224(pretrained=False, **kwargs) -> Eva: - """ EVA-g model https://arxiv.org/abs/2211.07636 """ + """ EVA-g model https://huggingface.co/papers/2211.07636 """ model_args = dict(patch_size=14, embed_dim=1408, depth=40, num_heads=16, mlp_ratio=6144 / 1408) model = _create_eva('eva_giant_patch14_224', pretrained=pretrained, **dict(model_args, **kwargs)) return model @@ -1212,7 +1212,7 @@ def eva_giant_patch14_224(pretrained=False, **kwargs) -> Eva: @register_model def eva_giant_patch14_336(pretrained=False, **kwargs) -> Eva: - """ EVA-g model https://arxiv.org/abs/2211.07636 """ + """ EVA-g model https://huggingface.co/papers/2211.07636 """ model_args = dict(patch_size=14, embed_dim=1408, depth=40, num_heads=16, mlp_ratio=6144 / 1408) model = _create_eva('eva_giant_patch14_336', pretrained=pretrained, **dict(model_args, **kwargs)) return model @@ -1220,7 +1220,7 @@ def eva_giant_patch14_336(pretrained=False, **kwargs) -> Eva: @register_model def eva_giant_patch14_560(pretrained=False, **kwargs) -> Eva: - """ EVA-g model https://arxiv.org/abs/2211.07636 """ + """ EVA-g model https://huggingface.co/papers/2211.07636 """ model_args = dict(patch_size=14, embed_dim=1408, depth=40, num_heads=16, mlp_ratio=6144 / 1408) model = _create_eva('eva_giant_patch14_560', pretrained=pretrained, **dict(model_args, **kwargs)) return model diff --git a/timm/models/fastvit.py b/timm/models/fastvit.py index 
96e1d593c6..8110476de5 100644 --- a/timm/models/fastvit.py +++ b/timm/models/fastvit.py @@ -38,7 +38,7 @@ class MobileOneBlock(nn.Module): and plain-CNN style architecture at inference time For more details, please refer to our paper: `An Improved One millisecond Mobile Backbone` - - https://arxiv.org/pdf/2206.04040.pdf + https://huggingface.co/papers/2206.04040 """ def __init__( @@ -160,7 +160,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: def reparameterize(self): """Following works like `RepVGG: Making VGG-style ConvNets Great Again` - - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched + https://huggingface.co/papers/2101.03697. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference. """ @@ -276,7 +276,7 @@ class ReparamLargeKernelConv(nn.Module): """Building Block of RepLKNet This class defines overparameterized large kernel conv block - introduced in `RepLKNet `_ + introduced in `RepLKNet `_ Reference: https://github.com/DingXiaoH/RepLKNet-pytorch """ @@ -379,7 +379,7 @@ def get_kernel_bias(self) -> Tuple[torch.Tensor, torch.Tensor]: def reparameterize(self) -> None: """ Following works like `RepVGG: Making VGG-style ConvNets Great Again` - - https://arxiv.org/pdf/2101.03697.pdf. We re-parameterize multi-branched + https://huggingface.co/papers/2101.03697. We re-parameterize multi-branched architecture used at training time to obtain a plain CNN-like structure for inference. """ @@ -602,7 +602,7 @@ class RepMixer(nn.Module): """Reparameterizable token mixer. For more details, please refer to our paper: - `FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization `_ + `FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization `_ """ def __init__( @@ -766,7 +766,7 @@ class RepConditionalPosEnc(nn.Module): """Implementation of conditional positional encoding. For more details refer to paper: - `Conditional Positional Encodings for Vision Transformers `_ + `Conditional Positional Encodings for Vision Transformers `_ In our implementation, we can reparameterize this module to eliminate a skip connection. """ @@ -882,7 +882,7 @@ class RepMixerBlock(nn.Module): """Implementation of Metaformer block with RepMixer as token mixer. For more details on Metaformer structure, please refer to: - `MetaFormer Is Actually What You Need for Vision `_ + `MetaFormer Is Actually What You Need for Vision `_ """ def __init__( @@ -940,7 +940,7 @@ class AttentionBlock(nn.Module): """Implementation of metaformer block with MHSA as token mixer. 
For more details on Metaformer structure, please refer to: - `MetaFormer Is Actually What You Need for Vision `_ + `MetaFormer Is Actually What You Need for Vision `_ """ def __init__( @@ -1096,7 +1096,7 @@ class FastVit(nn.Module): fork_feat: torch.jit.Final[bool] """ - This class implements `FastViT architecture `_ + This class implements `FastViT architecture `_ """ def __init__( diff --git a/timm/models/focalnet.py b/timm/models/focalnet.py index ec7cd1cff1..e58f5e317f 100644 --- a/timm/models/focalnet.py +++ b/timm/models/focalnet.py @@ -1,6 +1,6 @@ """ FocalNet -As described in `Focal Modulation Networks` - https://arxiv.org/abs/2203.11926 +As described in `Focal Modulation Networks` - https://huggingface.co/papers/2203.11926 Significant modifications and refactoring from the original impl at https://github.com/microsoft/FocalNet diff --git a/timm/models/gcvit.py b/timm/models/gcvit.py index c862dc4a20..7699519604 100644 --- a/timm/models/gcvit.py +++ b/timm/models/gcvit.py @@ -2,7 +2,7 @@ From scratch implementation of GCViT in the style of timm swin_transformer_v2_cr.py -Global Context Vision Transformers -https://arxiv.org/abs/2206.09959 +Global Context Vision Transformers -https://huggingface.co/papers/2206.09959 @article{hatamizadeh2022global, title={Global Context Vision Transformers}, diff --git a/timm/models/ghostnet.py b/timm/models/ghostnet.py index d73276d4e4..b708658b74 100644 --- a/timm/models/ghostnet.py +++ b/timm/models/ghostnet.py @@ -1,6 +1,6 @@ """ An implementation of GhostNet & GhostNetV2 Models as defined in: -GhostNet: More Features from Cheap Operations. https://arxiv.org/abs/1911.11907 +GhostNet: More Features from Cheap Operations. https://huggingface.co/papers/1911.11907 GhostNetV2: Enhance Cheap Operation with Long-Range Attention. https://proceedings.neurips.cc/paper_files/paper/2022/file/40b60852a4abdaa696b5a1a78da34635-Paper-Conference.pdf The train script & code of models at: diff --git a/timm/models/hardcorenas.py b/timm/models/hardcorenas.py index 459c1a3db8..dc90c641db 100644 --- a/timm/models/hardcorenas.py +++ b/timm/models/hardcorenas.py @@ -17,7 +17,7 @@ def _gen_hardcorenas(pretrained, variant, arch_def, **kwargs): """Creates a hardcorenas model Ref impl: https://github.com/Alibaba-MIIL/HardCoReNAS - Paper: https://arxiv.org/abs/2102.11646 + Paper: https://huggingface.co/papers/2102.11646 """ num_features = 1280 diff --git a/timm/models/hiera.py b/timm/models/hiera.py index 2c16a9d63e..e290645e4d 100644 --- a/timm/models/hiera.py +++ b/timm/models/hiera.py @@ -16,7 +16,7 @@ # Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, # Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer. 
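The MobileOne / FastViT blocks in the fastvit.py hunks a little further up are trained with parallel over-parameterized branches and folded into plain convolutions for deployment via their `reparameterize()` methods. A minimal usage sketch, assuming a recent timm install and using `fastvit_t8` purely as an example variant:

```python
import torch
import timm

model = timm.create_model('fastvit_t8', pretrained=False)
model.eval()

# Fold training-time multi-branch structure into single convs where a block supports it.
for module in model.modules():
    if hasattr(module, 'reparameterize'):
        module.reparameterize()

with torch.no_grad():
    out = model(torch.randn(1, 3, 256, 256))
print(out.shape)
```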
# -# Paper: https://arxiv.org/abs/2306.00989/ +# Paper: https://huggingface.co/papers/2306.00989/ # # References: # slowfast: https://github.com/facebookresearch/SlowFast @@ -644,7 +644,7 @@ def get_random_mask(self, x: torch.Tensor, mask_ratio: float) -> torch.Tensor: def _pos_embed(self, x) -> torch.Tensor: if self.pos_embed_win is not None: # absolute win position embedding, from - # Window Attention is Bugged: How not to Interpolate Position Embeddings (https://arxiv.org/abs/2311.05613) + # Window Attention is Bugged: How not to Interpolate Position Embeddings (https://huggingface.co/papers/2311.05613) pos_embed_win = self.pos_embed_win.tile(self.mask_spatial_shape) pos_embed = F.interpolate( self.pos_embed, diff --git a/timm/models/hieradet_sam2.py b/timm/models/hieradet_sam2.py index 6cd2592a95..b910947476 100644 --- a/timm/models/hieradet_sam2.py +++ b/timm/models/hieradet_sam2.py @@ -243,7 +243,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class HieraDet(nn.Module): """ - Reference: https://arxiv.org/abs/2306.00989 + Reference: https://huggingface.co/papers/2306.00989 """ def __init__( @@ -320,7 +320,7 @@ def __init__( # Which blocks have global att? self.global_att_blocks = global_att_blocks - # Windowed positional embedding (https://arxiv.org/abs/2311.05613) + # Windowed positional embedding (https://huggingface.co/papers/2311.05613) self.global_pos_size = global_pos_size self.pos_embed = nn.Parameter(torch.zeros(1, embed_dim, *self.global_pos_size)) self.pos_embed_window = nn.Parameter(torch.zeros(1, embed_dim, self.window_spec[0], self.window_spec[0])) diff --git a/timm/models/inception_next.py b/timm/models/inception_next.py index 2fcf123ffa..966a1e2212 100644 --- a/timm/models/inception_next.py +++ b/timm/models/inception_next.py @@ -1,5 +1,5 @@ """ -InceptionNeXt paper: https://arxiv.org/abs/2303.16900 +InceptionNeXt paper: https://huggingface.co/papers/2303.16900 Original implementation & weights from: https://github.com/sail-sg/inceptionnext """ @@ -234,7 +234,7 @@ def forward(self, x): class MetaNeXt(nn.Module): r""" MetaNeXt - A PyTorch impl of : `InceptionNeXt: When Inception Meets ConvNeXt` - https://arxiv.org/abs/2303.16900 + A PyTorch impl of : `InceptionNeXt: When Inception Meets ConvNeXt` - https://huggingface.co/papers/2303.16900 Args: in_chans (int): Number of input image channels. 
Default: 3 diff --git a/timm/models/inception_resnet_v2.py b/timm/models/inception_resnet_v2.py index 7fdfee41ed..6fd0676a92 100644 --- a/timm/models/inception_resnet_v2.py +++ b/timm/models/inception_resnet_v2.py @@ -319,7 +319,7 @@ def _create_inception_resnet_v2(variant, pretrained=False, **kwargs): 'mean': IMAGENET_INCEPTION_MEAN, 'std': IMAGENET_INCEPTION_STD, 'first_conv': 'conv2d_1a.conv', 'classifier': 'classif', }, - # As per https://arxiv.org/abs/1705.07204 and + # As per https://huggingface.co/papers/1705.07204 and # ported from http://download.tensorflow.org/models/ens_adv_inception_resnet_v2_2017_08_18.tar.gz 'inception_resnet_v2.tf_ens_adv_in1k': { 'hf_hub_id': 'timm/', diff --git a/timm/models/levit.py b/timm/models/levit.py index 577fc5f2d7..f869a948b5 100644 --- a/timm/models/levit.py +++ b/timm/models/levit.py @@ -1,7 +1,7 @@ """ LeViT Paper: `LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference` - - https://arxiv.org/abs/2104.01136 + - https://huggingface.co/papers/2104.01136 @article{graham2021levit, title={LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference}, diff --git a/timm/models/mambaout.py b/timm/models/mambaout.py index 71d12fe672..005f6c6c1f 100644 --- a/timm/models/mambaout.py +++ b/timm/models/mambaout.py @@ -177,12 +177,12 @@ def forward(self, x, pre_logits: bool = False): class GatedConvBlock(nn.Module): - r""" Our implementation of Gated CNN Block: https://arxiv.org/pdf/1612.08083 + r""" Our implementation of Gated CNN Block: https://huggingface.co/papers/1612.08083 Args: conv_ratio: control the number of channels to conduct depthwise convolution. Conduct convolution on partial channels can improve paraitcal efficiency. - The idea of partial channels is from ShuffleNet V2 (https://arxiv.org/abs/1807.11164) and - also used by InceptionNeXt (https://arxiv.org/abs/2303.16900) and FasterNet (https://arxiv.org/abs/2303.03667) + The idea of partial channels is from ShuffleNet V2 (https://huggingface.co/papers/1807.11164) and + also used by InceptionNeXt (https://huggingface.co/papers/2303.16900) and FasterNet (https://huggingface.co/papers/2303.03667) """ def __init__( @@ -283,7 +283,7 @@ def forward(self, x): class MambaOut(nn.Module): r""" MetaFormer A PyTorch impl of : `MetaFormer Baselines for Vision` - - https://arxiv.org/abs/2210.13452 + https://huggingface.co/papers/2210.13452 Args: in_chans (int): Number of input image channels. Default: 3. diff --git a/timm/models/maxxvit.py b/timm/models/maxxvit.py index b7d4e7e44c..3c484911b6 100644 --- a/timm/models/maxxvit.py +++ b/timm/models/maxxvit.py @@ -14,7 +14,7 @@ Papers: -MaxViT: Multi-Axis Vision Transformer - https://arxiv.org/abs/2204.01697 +MaxViT: Multi-Axis Vision Transformer - https://huggingface.co/papers/2204.01697 @article{tu2022maxvit, title={MaxViT: Multi-Axis Vision Transformer}, author={Tu, Zhengzhong and Talebi, Hossein and Zhang, Han and Yang, Feng and Milanfar, Peyman and Bovik, Alan and Li, Yinxiao}, @@ -22,7 +22,7 @@ year={2022}, } -CoAtNet: Marrying Convolution and Attention for All Data Sizes - https://arxiv.org/abs/2106.04803 +CoAtNet: Marrying Convolution and Attention for All Data Sizes - https://huggingface.co/papers/2106.04803 @article{DBLP:journals/corr/abs-2106-04803, author = {Zihang Dai and Hanxiao Liu and Quoc V. 
Le and Mingxing Tan}, title = {CoAtNet: Marrying Convolution and Attention for All Data Sizes}, diff --git a/timm/models/metaformer.py b/timm/models/metaformer.py index 490852cfe4..7c700a7e49 100644 --- a/timm/models/metaformer.py +++ b/timm/models/metaformer.py @@ -1,8 +1,8 @@ """ -Poolformer from MetaFormer is Actually What You Need for Vision https://arxiv.org/abs/2111.11418 +Poolformer from MetaFormer is Actually What You Need for Vision https://huggingface.co/papers/2111.11418 IdentityFormer, RandFormer, PoolFormerV2, ConvFormer, and CAFormer -from MetaFormer Baselines for Vision https://arxiv.org/abs/2210.13452 +from MetaFormer Baselines for Vision https://huggingface.co/papers/2210.13452 All implemented models support feature extraction and variable input resolution. @@ -121,7 +121,7 @@ def forward(self, x): class SquaredReLU(nn.Module): """ - Squared ReLU: https://arxiv.org/abs/2109.08668 + Squared ReLU: https://huggingface.co/papers/2109.08668 """ def __init__(self, inplace=False): @@ -158,7 +158,7 @@ def forward(self, x): class Attention(nn.Module): """ - Vanilla self-attention from Transformer: https://arxiv.org/abs/1706.03762. + Vanilla self-attention from Transformer: https://huggingface.co/papers/1706.03762. Modified from timm. """ fused_attn: Final[bool] @@ -239,7 +239,7 @@ def __init__(self, num_channels, **kwargs): class SepConv(nn.Module): r""" - Inverted separable convolution from MobileNetV2: https://arxiv.org/abs/1801.04381. + Inverted separable convolution from MobileNetV2: https://huggingface.co/papers/1801.04381. """ def __init__( @@ -274,7 +274,7 @@ def forward(self, x): class Pooling(nn.Module): """ - Implementation of pooling for PoolFormer: https://arxiv.org/abs/2111.11418 + Implementation of pooling for PoolFormer: https://huggingface.co/papers/2111.11418 """ def __init__(self, pool_size=3, **kwargs): @@ -448,7 +448,7 @@ def forward(self, x: Tensor): class MetaFormer(nn.Module): r""" MetaFormer A PyTorch impl of : `MetaFormer Baselines for Vision` - - https://arxiv.org/abs/2210.13452 + https://huggingface.co/papers/2210.13452 Args: in_chans (int): Number of input image channels. @@ -462,9 +462,9 @@ class MetaFormer(nn.Module): drop_path_rate (float): Stochastic depth rate. drop_rate (float): Dropout rate. layer_scale_init_values (list, tuple, float or None): Init value for Layer Scale. - None means not use the layer scale. Form: https://arxiv.org/abs/2103.17239. + None means not use the layer scale. Form: https://huggingface.co/papers/2103.17239. res_scale_init_values (list, tuple, float or None): Init value for res Scale on residual connections. - None means not use the res scale. From: https://arxiv.org/abs/2110.09456. + None means not use the res scale. From: https://huggingface.co/papers/2110.09456. downsample_norm (nn.Module): Norm layer used in stem and downsampling layers. norm_layers (list, tuple or norm_fcn): Norm layers for each stage. output_norm: Norm layer before classifier head. 
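Two of the smallest pieces referenced in the metaformer.py hunks above are easy to spell out. A minimal sketch of the Squared ReLU activation and the PoolFormer-style pooling token mixer, assuming the standard formulations from the linked papers (the surrounding block adds the residual back, hence the `- x`):

```python
import torch
import torch.nn as nn

class SquaredReLU(nn.Module):
    """relu(x) ** 2."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.square(torch.relu(x))


class Pooling(nn.Module):
    """PoolFormer token mixer: local average pooling minus the identity."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) - x  # the block's residual connection re-adds x


x = torch.randn(2, 64, 14, 14)
print(Pooling()(x).shape, SquaredReLU()(x).shape)
```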
diff --git a/timm/models/mlp_mixer.py b/timm/models/mlp_mixer.py index 25cde6a67c..8af7c9631f 100644 --- a/timm/models/mlp_mixer.py +++ b/timm/models/mlp_mixer.py @@ -4,7 +4,7 @@ Official JAX impl: https://github.com/google-research/vision_transformer/blob/linen/vit_jax/models_mixer.py -Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 +Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 @article{tolstikhin2021, title={MLP-Mixer: An all-MLP Architecture for Vision}, @@ -17,7 +17,7 @@ Also supporting ResMlp, and a preliminary (not verified) implementations of gMLP Code: https://github.com/facebookresearch/deit -Paper: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 +Paper: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 @misc{touvron2021resmlp, title={ResMLP: Feedforward networks for image classification with data-efficient training}, author={Hugo Touvron and Piotr Bojanowski and Mathilde Caron and Matthieu Cord and Alaaeldin El-Nouby and @@ -26,7 +26,7 @@ eprint={2105.03404}, } -Paper: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 +Paper: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 @misc{liu2021pay, title={Pay Attention to MLPs}, author={Hanxiao Liu and Zihang Dai and David R. So and Quoc V. Le}, @@ -57,7 +57,7 @@ class MixerBlock(nn.Module): """ Residual Block w/ token mixing and channel MLPs - Based on: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Based on: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ def __init__( self, @@ -97,7 +97,7 @@ def forward(self, x): class ResBlock(nn.Module): """ Residual MLP block w/ LayerScale and Affine 'norm' - Based on: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 + Based on: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 """ def __init__( self, @@ -130,7 +130,7 @@ def forward(self, x): class SpatialGatingUnit(nn.Module): """ Spatial Gating Unit - Based on: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 + Based on: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 """ def __init__(self, dim, seq_len, norm_layer=nn.LayerNorm): super().__init__() @@ -153,7 +153,7 @@ def forward(self, x): class SpatialGatingBlock(nn.Module): """ Residual Block w/ Spatial Gating - Based on: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 + Based on: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 """ def __init__( self, @@ -531,7 +531,7 @@ def _cfg(url='', **kwargs): @register_model def mixer_s32_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-S/32 224x224 - Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=32, num_blocks=8, embed_dim=512, **kwargs) model = _create_mixer('mixer_s32_224', pretrained=pretrained, **model_args) @@ -541,7 +541,7 @@ def mixer_s32_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def mixer_s16_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-S/16 224x224 - Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture 
for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=16, num_blocks=8, embed_dim=512, **kwargs) model = _create_mixer('mixer_s16_224', pretrained=pretrained, **model_args) @@ -551,7 +551,7 @@ def mixer_s16_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def mixer_b32_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-B/32 224x224 - Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=32, num_blocks=12, embed_dim=768, **kwargs) model = _create_mixer('mixer_b32_224', pretrained=pretrained, **model_args) @@ -561,7 +561,7 @@ def mixer_b32_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def mixer_b16_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-B/16 224x224. ImageNet-1k pretrained weights. - Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=16, num_blocks=12, embed_dim=768, **kwargs) model = _create_mixer('mixer_b16_224', pretrained=pretrained, **model_args) @@ -571,7 +571,7 @@ def mixer_b16_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def mixer_l32_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-L/32 224x224. - Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=32, num_blocks=24, embed_dim=1024, **kwargs) model = _create_mixer('mixer_l32_224', pretrained=pretrained, **model_args) @@ -581,7 +581,7 @@ def mixer_l32_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def mixer_l16_224(pretrained=False, **kwargs) -> MlpMixer: """ Mixer-L/16 224x224. ImageNet-1k pretrained weights. 
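The Mixer variants registered in this file differ only in `patch_size`, `num_blocks`, and `embed_dim`; the block itself alternates a token-mixing MLP applied across patches with a channel MLP applied per patch. A compact, illustrative sketch of that block (pre-norm formulation; not a drop-in for timm's `MixerBlock`):

```python
import torch
import torch.nn as nn

class TinyMixerBlock(nn.Module):
    def __init__(self, dim: int, seq_len: int, tokens_ratio: float = 0.5, channels_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mlp_tokens = nn.Sequential(
            nn.Linear(seq_len, int(seq_len * tokens_ratio)), nn.GELU(),
            nn.Linear(int(seq_len * tokens_ratio), seq_len),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp_channels = nn.Sequential(
            nn.Linear(dim, int(dim * channels_ratio)), nn.GELU(),
            nn.Linear(int(dim * channels_ratio), dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, C)
        x = x + self.mlp_tokens(self.norm1(x).transpose(1, 2)).transpose(1, 2)  # mix across tokens
        x = x + self.mlp_channels(self.norm2(x))  # mix across channels
        return x


x = torch.randn(2, 196, 768)  # 224 / 16 -> 14 * 14 = 196 patches
print(TinyMixerBlock(768, 196)(x).shape)  # torch.Size([2, 196, 768])
```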
- Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://arxiv.org/abs/2105.01601 + Paper: 'MLP-Mixer: An all-MLP Architecture for Vision' - https://huggingface.co/papers/2105.01601 """ model_args = dict(patch_size=16, num_blocks=24, embed_dim=1024, **kwargs) model = _create_mixer('mixer_l16_224', pretrained=pretrained, **model_args) @@ -615,7 +615,7 @@ def gmixer_24_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def resmlp_12_224(pretrained=False, **kwargs) -> MlpMixer: """ ResMLP-12 - Paper: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 + Paper: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 """ model_args = dict( patch_size=16, num_blocks=12, embed_dim=384, mlp_ratio=4, block_layer=ResBlock, norm_layer=Affine, **kwargs) @@ -626,7 +626,7 @@ def resmlp_12_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def resmlp_24_224(pretrained=False, **kwargs) -> MlpMixer: """ ResMLP-24 - Paper: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 + Paper: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 """ model_args = dict( patch_size=16, num_blocks=24, embed_dim=384, mlp_ratio=4, @@ -638,7 +638,7 @@ def resmlp_24_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def resmlp_36_224(pretrained=False, **kwargs) -> MlpMixer: """ ResMLP-36 - Paper: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 + Paper: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 """ model_args = dict( patch_size=16, num_blocks=36, embed_dim=384, mlp_ratio=4, @@ -650,7 +650,7 @@ def resmlp_36_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def resmlp_big_24_224(pretrained=False, **kwargs) -> MlpMixer: """ ResMLP-B-24 - Paper: `ResMLP: Feedforward networks for image classification...` - https://arxiv.org/abs/2105.03404 + Paper: `ResMLP: Feedforward networks for image classification...` - https://huggingface.co/papers/2105.03404 """ model_args = dict( patch_size=8, num_blocks=24, embed_dim=768, mlp_ratio=4, @@ -662,7 +662,7 @@ def resmlp_big_24_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def gmlp_ti16_224(pretrained=False, **kwargs) -> MlpMixer: """ gMLP-Tiny - Paper: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 + Paper: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 """ model_args = dict( patch_size=16, num_blocks=30, embed_dim=128, mlp_ratio=6, block_layer=SpatialGatingBlock, @@ -674,7 +674,7 @@ def gmlp_ti16_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def gmlp_s16_224(pretrained=False, **kwargs) -> MlpMixer: """ gMLP-Small - Paper: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 + Paper: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 """ model_args = dict( patch_size=16, num_blocks=30, embed_dim=256, mlp_ratio=6, block_layer=SpatialGatingBlock, @@ -686,7 +686,7 @@ def gmlp_s16_224(pretrained=False, **kwargs) -> MlpMixer: @register_model def gmlp_b16_224(pretrained=False, **kwargs) -> MlpMixer: """ gMLP-Base - Paper: `Pay Attention to MLPs` - https://arxiv.org/abs/2105.08050 + Paper: `Pay Attention to MLPs` - https://huggingface.co/papers/2105.08050 """ model_args = dict( patch_size=16, num_blocks=30, embed_dim=512, mlp_ratio=6, block_layer=SpatialGatingBlock, diff --git 
a/timm/models/mobilenetv3.py b/timm/models/mobilenetv3.py index 08dcb064fa..7f174a25a6 100644 --- a/timm/models/mobilenetv3.py +++ b/timm/models/mobilenetv3.py @@ -2,7 +2,7 @@ A PyTorch impl of MobileNet-V3, compatible with TF weights from official impl. -Paper: Searching for MobileNetV3 - https://arxiv.org/abs/1905.02244 +Paper: Searching for MobileNetV3 - https://huggingface.co/papers/1905.02244 Hacked together by / Copyright 2019, Ross Wightman """ @@ -33,13 +33,13 @@ class MobileNetV3(nn.Module): 'efficient head', where global pooling is done before the head convolution without a final batch-norm layer before the classifier. - Paper: `Searching for MobileNetV3` - https://arxiv.org/abs/1905.02244 + Paper: `Searching for MobileNetV3` - https://huggingface.co/papers/1905.02244 Other architectures utilizing MobileNet-V3 efficient head that are supported by this impl include: - * HardCoRe-NAS - https://arxiv.org/abs/2102.11646 (defn in hardcorenas.py uses this class) - * FBNet-V3 - https://arxiv.org/abs/2006.02049 - * LCNet - https://arxiv.org/abs/2109.15099 - * MobileNet-V4 - https://arxiv.org/abs/2404.10518 + * HardCoRe-NAS - https://huggingface.co/papers/2102.11646 (defn in hardcorenas.py uses this class) + * FBNet-V3 - https://huggingface.co/papers/2006.02049 + * LCNet - https://huggingface.co/papers/2109.15099 + * MobileNet-V4 - https://huggingface.co/papers/2404.10518 """ def __init__( @@ -417,7 +417,7 @@ def _gen_mobilenet_v3_rw( """Creates a MobileNet-V3 model. Ref impl: ? - Paper: https://arxiv.org/abs/1905.02244 + Paper: https://huggingface.co/papers/1905.02244 Args: channel_multiplier: multiplier to number of channels per layer. @@ -458,7 +458,7 @@ def _gen_mobilenet_v3( """Creates a MobileNet-V3 model. Ref impl: ? - Paper: https://arxiv.org/abs/1905.02244 + Paper: https://huggingface.co/papers/1905.02244 Args: channel_multiplier: multiplier to number of channels per layer. @@ -554,7 +554,7 @@ def _gen_mobilenet_v3( def _gen_fbnetv3(variant: str, channel_multiplier: float = 1.0, pretrained: bool = False, **kwargs): """ FBNetV3 Paper: `FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining` - - https://arxiv.org/abs/2006.02049 + - https://huggingface.co/papers/2006.02049 FIXME untested, this is a preliminary impl of some FBNet-V3 variants. """ vl = variant.split('_')[-1] @@ -616,7 +616,7 @@ def _gen_lcnet(variant: str, channel_multiplier: float = 1.0, pretrained: bool = """ LCNet Essentially a MobileNet-V3 crossed with a MobileNet-V1 - Paper: `PP-LCNet: A Lightweight CPU Convolutional Neural Network` - https://arxiv.org/abs/2109.15099 + Paper: `PP-LCNet: A Lightweight CPU Convolutional Neural Network` - https://huggingface.co/papers/2109.15099 Args: channel_multiplier: multiplier to number of channels per layer. @@ -656,7 +656,7 @@ def _gen_mobilenet_v4( """Creates a MobileNet-V4 model. Ref impl: ? - Paper: https://arxiv.org/abs/1905.02244 + Paper: https://huggingface.co/papers/1905.02244 Args: channel_multiplier: multiplier to number of channels per layer. 
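Since mobilenetv3.py also backs the HardCoRe-NAS, FBNet-V3, LCNet and MobileNet-V4 variants listed in the class docstring, a short sketch of pulling multi-scale features from one of them (any registered name works; `mobilenetv3_large_100` is just an example):

```python
import torch
import timm

backbone = timm.create_model('mobilenetv3_large_100', pretrained=False, features_only=True)
feats = backbone(torch.randn(1, 3, 224, 224))
for f, stride in zip(feats, backbone.feature_info.reduction()):
    print(tuple(f.shape), 'stride', stride)
```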
diff --git a/timm/models/mobilevit.py b/timm/models/mobilevit.py index 9c84871e6d..6b8229c6bc 100644 --- a/timm/models/mobilevit.py +++ b/timm/models/mobilevit.py @@ -1,8 +1,8 @@ """ MobileViT Paper: -V1: `MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer` - https://arxiv.org/abs/2110.02178 -V2: `Separable Self-attention for Mobile Vision Transformers` - https://arxiv.org/abs/2206.02680 +V1: `MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer` - https://huggingface.co/papers/2110.02178 +V2: `Separable Self-attention for Mobile Vision Transformers` - https://huggingface.co/papers/2206.02680 MobileVitBlock and checkpoints adapted from https://github.com/apple/ml-cvnets (original copyright below) License: https://github.com/apple/ml-cvnets/blob/main/LICENSE (Apple open source) @@ -164,7 +164,7 @@ def _mobilevitv2_cfg(multiplier=1.0): @register_notrace_module class MobileVitBlock(nn.Module): """ MobileViT block - Paper: https://arxiv.org/abs/2110.02178?context=cs.LG + Paper: https://huggingface.co/papers/2110.02178?context=cs.LG """ def __init__( self, @@ -271,7 +271,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class LinearSelfAttention(nn.Module): """ - This layer applies a self-attention with linear complexity, as described in `https://arxiv.org/abs/2206.02680` + This layer applies a self-attention with linear complexity, as described in `https://huggingface.co/papers/2206.02680` This layer can be used for self- as well as cross-attention. Args: embed_dim (int): :math:`C` from an expected input of size :math:`(N, C, H, W)` diff --git a/timm/models/mvitv2.py b/timm/models/mvitv2.py index c048a07277..a48b5a1cd6 100644 --- a/timm/models/mvitv2.py +++ b/timm/models/mvitv2.py @@ -692,12 +692,12 @@ class MultiScaleVit(nn.Module): Improved Multiscale Vision Transformers for Classification and Detection Yanghao Li*, Chao-Yuan Wu*, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer* - https://arxiv.org/abs/2112.01526 + https://huggingface.co/papers/2112.01526 Multiscale Vision Transformers Haoqi Fan*, Bo Xiong*, Karttikeya Mangalam*, Yanghao Li*, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer* - https://arxiv.org/abs/2104.11227 + https://huggingface.co/papers/2104.11227 """ def __init__( diff --git a/timm/models/nest.py b/timm/models/nest.py index 9a423a9776..1083807fa5 100644 --- a/timm/models/nest.py +++ b/timm/models/nest.py @@ -3,7 +3,7 @@ A PyTorch implement of Aggregating Nested Transformers as described in: 'Aggregating Nested Transformers' - - https://arxiv.org/abs/2105.12723 + - https://huggingface.co/papers/2105.12723 The official Jax code is released and available at https://github.com/google-research/nested-transformer. 
The weights have been converted with convert/convert_nest_flax.py @@ -248,7 +248,7 @@ class Nest(nn.Module): """ Nested Transformer (NesT) A PyTorch impl of : `Aggregating Nested Transformers` - - https://arxiv.org/abs/2105.12723 + - https://huggingface.co/papers/2105.12723 """ def __init__( diff --git a/timm/models/nextvit.py b/timm/models/nextvit.py index 2f232e2990..d8321e8291 100644 --- a/timm/models/nextvit.py +++ b/timm/models/nextvit.py @@ -1,6 +1,6 @@ """ Next-ViT -As described in https://arxiv.org/abs/2207.05501 +As described in https://huggingface.co/papers/2207.05501 Next-ViT model defs and weights adapted from https://github.com/bytedance/Next-ViT, original copyright below """ diff --git a/timm/models/nfnet.py b/timm/models/nfnet.py index 68e92128f3..b94831e51f 100644 --- a/timm/models/nfnet.py +++ b/timm/models/nfnet.py @@ -1,10 +1,10 @@ """ Normalization Free Nets. NFNet, NF-RegNet, NF-ResNet (pre-activation) Models Paper: `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 Paper: `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 Official Deepmind JAX code: https://github.com/deepmind/deepmind-research/tree/master/nfnets @@ -272,9 +272,9 @@ class NormFreeNet(nn.Module): As described in : `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 and - `High-Performance Large-Scale Image Recognition Without Normalization` - https://arxiv.org/abs/2102.06171 + `High-Performance Large-Scale Image Recognition Without Normalization` - https://huggingface.co/papers/2102.06171 This model aims to cover both the NFRegNet-Bx models as detailed in the paper's code snippets and the (preact) ResNet models described earlier in the paper. 
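The F0-F7 wrappers that follow are identical apart from the variant string. A quick way to see which normalization-free variants are registered and to instantiate one (sketch only; variant names taken from the functions in this file):

```python
import timm

print(timm.list_models('*nfnet_f*'))    # dm_nfnet_f0..f6, nfnet_f0..f7, ...
print(timm.list_models('nf_regnet_*'))  # nf_regnet_b0..b5

model = timm.create_model('dm_nfnet_f0', pretrained=False)
print(sum(p.numel() for p in model.parameters()) / 1e6, 'M params')
```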
@@ -747,7 +747,7 @@ def _dcfg(url='', **kwargs): def dm_nfnet_f0(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F0 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f0', pretrained=pretrained, **kwargs) @@ -756,7 +756,7 @@ def dm_nfnet_f0(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f1(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F1 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f1', pretrained=pretrained, **kwargs) @@ -765,7 +765,7 @@ def dm_nfnet_f1(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f2(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F2 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f2', pretrained=pretrained, **kwargs) @@ -774,7 +774,7 @@ def dm_nfnet_f2(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f3(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F3 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f3', pretrained=pretrained, **kwargs) @@ -783,7 +783,7 @@ def dm_nfnet_f3(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f4(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F4 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f4', pretrained=pretrained, **kwargs) @@ -792,7 +792,7 @@ def dm_nfnet_f4(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f5(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F5 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f5', pretrained=pretrained, **kwargs) @@ -801,7 +801,7 @@ def dm_nfnet_f5(pretrained=False, **kwargs) -> NormFreeNet: def dm_nfnet_f6(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F6 (DeepMind weight compatible) `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('dm_nfnet_f6', pretrained=pretrained, **kwargs) @@ -810,7 +810,7 @@ def dm_nfnet_f6(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f0(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F0 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f0', pretrained=pretrained, **kwargs) @@ -819,7 +819,7 @@ def nfnet_f0(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f1(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F1 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ 
return _create_normfreenet('nfnet_f1', pretrained=pretrained, **kwargs) @@ -828,7 +828,7 @@ def nfnet_f1(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f2(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F2 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f2', pretrained=pretrained, **kwargs) @@ -837,7 +837,7 @@ def nfnet_f2(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f3(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F3 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f3', pretrained=pretrained, **kwargs) @@ -846,7 +846,7 @@ def nfnet_f3(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f4(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F4 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f4', pretrained=pretrained, **kwargs) @@ -855,7 +855,7 @@ def nfnet_f4(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f5(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F5 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f5', pretrained=pretrained, **kwargs) @@ -864,7 +864,7 @@ def nfnet_f5(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f6(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F6 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f6', pretrained=pretrained, **kwargs) @@ -873,7 +873,7 @@ def nfnet_f6(pretrained=False, **kwargs) -> NormFreeNet: def nfnet_f7(pretrained=False, **kwargs) -> NormFreeNet: """ NFNet-F7 `High-Performance Large-Scale Image Recognition Without Normalization` - - https://arxiv.org/abs/2102.06171 + - https://huggingface.co/papers/2102.06171 """ return _create_normfreenet('nfnet_f7', pretrained=pretrained, **kwargs) @@ -922,7 +922,7 @@ def eca_nfnet_l3(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b0(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B0 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_regnet_b0', pretrained=pretrained, **kwargs) @@ -931,7 +931,7 @@ def nf_regnet_b0(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b1(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B1 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_regnet_b1', pretrained=pretrained, **kwargs) @@ -940,7 +940,7 @@ def nf_regnet_b1(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b2(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B2 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return 
_create_normfreenet('nf_regnet_b2', pretrained=pretrained, **kwargs) @@ -949,7 +949,7 @@ def nf_regnet_b2(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b3(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B3 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_regnet_b3', pretrained=pretrained, **kwargs) @@ -958,7 +958,7 @@ def nf_regnet_b3(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b4(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B4 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_regnet_b4', pretrained=pretrained, **kwargs) @@ -967,7 +967,7 @@ def nf_regnet_b4(pretrained=False, **kwargs) -> NormFreeNet: def nf_regnet_b5(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free RegNet-B5 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_regnet_b5', pretrained=pretrained, **kwargs) @@ -976,7 +976,7 @@ def nf_regnet_b5(pretrained=False, **kwargs) -> NormFreeNet: def nf_resnet26(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free ResNet-26 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_resnet26', pretrained=pretrained, **kwargs) @@ -985,7 +985,7 @@ def nf_resnet26(pretrained=False, **kwargs) -> NormFreeNet: def nf_resnet50(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free ResNet-50 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_resnet50', pretrained=pretrained, **kwargs) @@ -994,7 +994,7 @@ def nf_resnet50(pretrained=False, **kwargs) -> NormFreeNet: def nf_resnet101(pretrained=False, **kwargs) -> NormFreeNet: """ Normalization-Free ResNet-101 `Characterizing signal propagation to close the performance gap in unnormalized ResNets` - - https://arxiv.org/abs/2101.08692 + - https://huggingface.co/papers/2101.08692 """ return _create_normfreenet('nf_resnet101', pretrained=pretrained, **kwargs) diff --git a/timm/models/pit.py b/timm/models/pit.py index 1d5386a921..feea9319ab 100644 --- a/timm/models/pit.py +++ b/timm/models/pit.py @@ -1,7 +1,7 @@ """ Pooling-based Vision Transformer (PiT) in PyTorch A PyTorch implement of Pooling-based Vision Transformers as described in -'Rethinking Spatial Dimensions of Vision Transformers' - https://arxiv.org/abs/2103.16302 +'Rethinking Spatial Dimensions of Vision Transformers' - https://huggingface.co/papers/2103.16302 This code was adapted from the original version at https://github.com/naver-ai/pit, original copyright below. 
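Like the other classifiers in these hunks, the pooling-based transformer can be used as a plain embedding extractor by dropping its head. A minimal sketch, assuming `pit_s_224` as an example of the variants registered in this file:

```python
import torch
import timm

model = timm.create_model('pit_s_224', pretrained=False, num_classes=0)  # num_classes=0 -> classifier removed
model.eval()
with torch.no_grad():
    emb = model(torch.randn(1, 3, 224, 224))
print(emb.shape)  # pooled feature vector, e.g. (1, 576) for pit_s_224
```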
@@ -144,7 +144,7 @@ class PoolingVisionTransformer(nn.Module): """ Pooling-based Vision Transformer A PyTorch implement of 'Rethinking Spatial Dimensions of Vision Transformers' - - https://arxiv.org/abs/2103.16302 + - https://huggingface.co/papers/2103.16302 """ def __init__( self, diff --git a/timm/models/pnasnet.py b/timm/models/pnasnet.py index 20d17945b5..a0ba3defd0 100644 --- a/timm/models/pnasnet.py +++ b/timm/models/pnasnet.py @@ -372,7 +372,7 @@ def _create_pnasnet(variant, pretrained=False, **kwargs): def pnasnet5large(pretrained=False, **kwargs) -> PNASNet5Large: r"""PNASNet-5 model architecture from the `"Progressive Neural Architecture Search" - `_ paper. + `_ paper. """ model_kwargs = dict(pad_type='same', **kwargs) return _create_pnasnet('pnasnet5large', pretrained, **model_kwargs) diff --git a/timm/models/regnet.py b/timm/models/regnet.py index 49a19aa16c..7d17bcb495 100644 --- a/timm/models/regnet.py +++ b/timm/models/regnet.py @@ -1,9 +1,9 @@ """RegNet X, Y, Z, and more -Paper: `Designing Network Design Spaces` - https://arxiv.org/abs/2003.13678 +Paper: `Designing Network Design Spaces` - https://huggingface.co/papers/2003.13678 Original Impl: https://github.com/facebookresearch/pycls/blob/master/pycls/models/regnet.py -Paper: `Fast and Accurate Model Scaling` - https://arxiv.org/abs/2103.06877 +Paper: `Fast and Accurate Model Scaling` - https://huggingface.co/papers/2103.06877 Original Impl: None Based on original PyTorch impl linked above, but re-wrote to use my own blocks (adapted from ResNet here) @@ -372,7 +372,7 @@ def forward(self, x): class RegNet(nn.Module): """RegNet-X, Y, and Z Models - Paper: https://arxiv.org/abs/2003.13678 + Paper: https://huggingface.co/papers/2003.13678 Original Impl: https://github.com/facebookresearch/pycls/blob/master/pycls/models/regnet.py """ diff --git a/timm/models/repghost.py b/timm/models/repghost.py index 77fc35d59e..9c43ef0f59 100644 --- a/timm/models/repghost.py +++ b/timm/models/repghost.py @@ -1,6 +1,6 @@ """ An implementation of RepGhostNet Model as defined in: -RepGhost: A Hardware-Efficient Ghost Module via Re-parameterization. https://arxiv.org/abs/2211.06088 +RepGhost: A Hardware-Efficient Ghost Module via Re-parameterization. 
https://huggingface.co/papers/2211.06088 Original implementation: https://github.com/ChengpengChen/RepGhost """ diff --git a/timm/models/repvit.py b/timm/models/repvit.py index ddcfed55c8..ae22bac874 100644 --- a/timm/models/repvit.py +++ b/timm/models/repvit.py @@ -1,7 +1,7 @@ """ RepViT Paper: `RepViT: Revisiting Mobile CNN From ViT Perspective` - - https://arxiv.org/abs/2307.09283 + - https://huggingface.co/papers/2307.09283 @misc{wang2023repvit, title={RepViT: Revisiting Mobile CNN From ViT Perspective}, diff --git a/timm/models/res2net.py b/timm/models/res2net.py index 691f929b91..0ba9e43d86 100644 --- a/timm/models/res2net.py +++ b/timm/models/res2net.py @@ -1,6 +1,6 @@ """ Res2Net and Res2NeXt Adapted from Official Pytorch impl at: https://github.com/gasvn/Res2Net/ -Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://arxiv.org/abs/1904.01169 +Paper: `Res2Net: A New Multi-scale Backbone Architecture` - https://huggingface.co/papers/1904.01169 """ import math diff --git a/timm/models/resnest.py b/timm/models/resnest.py index 5b1438017e..7a2c0479f9 100644 --- a/timm/models/resnest.py +++ b/timm/models/resnest.py @@ -1,6 +1,6 @@ """ ResNeSt Models -Paper: `ResNeSt: Split-Attention Networks` - https://arxiv.org/abs/2004.08955 +Paper: `ResNeSt: Split-Attention Networks` - https://huggingface.co/papers/2004.08955 Adapted from original PyTorch impl w/ weights at https://github.com/zhanghang1989/ResNeSt by Hang Zhang @@ -183,7 +183,7 @@ def resnest26d(pretrained=False, **kwargs) -> ResNet: @register_model def resnest50d(pretrained=False, **kwargs) -> ResNet: - """ ResNeSt-50d model. Matches paper ResNeSt-50 model, https://arxiv.org/abs/2004.08955 + """ ResNeSt-50d model. Matches paper ResNeSt-50 model, https://huggingface.co/papers/2004.08955 Since this codebase supports all possible variations, 'd' for deep stem, stem_width 32, avg in downsample. """ model_kwargs = dict( @@ -195,7 +195,7 @@ def resnest50d(pretrained=False, **kwargs) -> ResNet: @register_model def resnest101e(pretrained=False, **kwargs) -> ResNet: - """ ResNeSt-101e model. Matches paper ResNeSt-101 model, https://arxiv.org/abs/2004.08955 + """ ResNeSt-101e model. Matches paper ResNeSt-101 model, https://huggingface.co/papers/2004.08955 Since this codebase supports all possible variations, 'e' for deep stem, stem_width 64, avg in downsample. """ model_kwargs = dict( @@ -207,7 +207,7 @@ def resnest101e(pretrained=False, **kwargs) -> ResNet: @register_model def resnest200e(pretrained=False, **kwargs) -> ResNet: - """ ResNeSt-200e model. Matches paper ResNeSt-200 model, https://arxiv.org/abs/2004.08955 + """ ResNeSt-200e model. Matches paper ResNeSt-200 model, https://huggingface.co/papers/2004.08955 Since this codebase supports all possible variations, 'e' for deep stem, stem_width 64, avg in downsample. """ model_kwargs = dict( @@ -219,7 +219,7 @@ def resnest200e(pretrained=False, **kwargs) -> ResNet: @register_model def resnest269e(pretrained=False, **kwargs) -> ResNet: - """ ResNeSt-269e model. Matches paper ResNeSt-269 model, https://arxiv.org/abs/2004.08955 + """ ResNeSt-269e model. Matches paper ResNeSt-269 model, https://huggingface.co/papers/2004.08955 Since this codebase supports all possible variations, 'e' for deep stem, stem_width 64, avg in downsample. 
""" model_kwargs = dict( diff --git a/timm/models/resnet.py b/timm/models/resnet.py index dd6aa1e8b2..dba64f63d5 100644 --- a/timm/models/resnet.py +++ b/timm/models/resnet.py @@ -360,7 +360,7 @@ class ResNet(nn.Module): This ResNet impl supports a number of stem and downsample options based on the v1c, v1d, v1e, and v1s variants included in the MXNet Gluon ResNetV1b model. The C and D variants are also discussed in the - 'Bag of Tricks' paper: https://arxiv.org/pdf/1812.01187. The B variant is equivalent to torchvision default. + 'Bag of Tricks' paper: https://huggingface.co/papers/1812.01187. The B variant is equivalent to torchvision default. ResNet variants (the same modifications can be used in SE/ResNeXt models as well): * normal, b - 7x7 stem, stem_width = 64, same as torchvision ResNet, NVIDIA ResNet 'v1.5', Gluon v1b @@ -1622,7 +1622,7 @@ def ecaresnet50d(pretrained: bool = False, **kwargs) -> ResNet: @register_model def ecaresnet50d_pruned(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-50-D model pruned with eca. - The pruning has been obtained using https://arxiv.org/pdf/2002.08258.pdf + The pruning has been obtained using https://huggingface.co/papers/2002.08258 """ model_args = dict( block=Bottleneck, layers=(3, 4, 6, 3), stem_width=32, stem_type='deep', avg_down=True, @@ -1664,7 +1664,7 @@ def ecaresnet101d(pretrained: bool = False, **kwargs) -> ResNet: @register_model def ecaresnet101d_pruned(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-101-D model pruned with eca. - The pruning has been obtained using https://arxiv.org/pdf/2002.08258.pdf + The pruning has been obtained using https://huggingface.co/papers/2002.08258 """ model_args = dict( block=Bottleneck, layers=(3, 4, 23, 3), stem_width=32, stem_type='deep', avg_down=True, @@ -1963,7 +1963,7 @@ def seresnextaa201d_32x8d(pretrained: bool = False, **kwargs): @register_model def resnetrs50(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-50 model. - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -1976,7 +1976,7 @@ def resnetrs50(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs101(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-101 model. - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -1989,7 +1989,7 @@ def resnetrs101(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs152(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-152 model. - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -2002,7 +2002,7 @@ def resnetrs152(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs200(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-200 model. 
- Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -2015,7 +2015,7 @@ def resnetrs200(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs270(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-270 model. - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -2029,7 +2029,7 @@ def resnetrs270(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs350(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-350 model. - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) @@ -2042,7 +2042,7 @@ def resnetrs350(pretrained: bool = False, **kwargs) -> ResNet: @register_model def resnetrs420(pretrained: bool = False, **kwargs) -> ResNet: """Constructs a ResNet-RS-420 model - Paper: Revisiting ResNets - https://arxiv.org/abs/2103.07579 + Paper: Revisiting ResNets - https://huggingface.co/papers/2103.07579 Pretrained weights from https://github.com/tensorflow/tpu/tree/bee9c4f6/models/official/resnet/resnet_rs """ attn_layer = partial(get_attn('se'), rd_ratio=0.25) diff --git a/timm/models/resnetv2.py b/timm/models/resnetv2.py index 5cc164ae1b..5f22fc6dba 100644 --- a/timm/models/resnetv2.py +++ b/timm/models/resnetv2.py @@ -9,9 +9,9 @@ https://github.com/google-research/vision_transformer Thanks to the Google team for the above two repositories and associated papers: -* Big Transfer (BiT): General Visual Representation Learning - https://arxiv.org/abs/1912.11370 -* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - https://arxiv.org/abs/2010.11929 -* Knowledge distillation: A good teacher is patient and consistent - https://arxiv.org/abs/2106.05237 +* Big Transfer (BiT): General Visual Representation Learning - https://huggingface.co/papers/1912.11370 +* An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale - https://huggingface.co/papers/2010.11929 +* Knowledge distillation: A good teacher is patient and consistent - https://huggingface.co/papers/2106.05237 Original copyright of Google code below, modifications by Ross Wightman, Copyright 2020. 
""" @@ -719,7 +719,7 @@ def _cfg(url='', **kwargs): default_cfgs = generate_default_cfgs({ - # Paper: Knowledge distillation: A good teacher is patient and consistent - https://arxiv.org/abs/2106.05237 + # Paper: Knowledge distillation: A good teacher is patient and consistent - https://huggingface.co/papers/2106.05237 'resnetv2_50x1_bit.goog_distilled_in1k': _cfg( hf_hub_id='timm/', interpolation='bicubic', custom_load=True), diff --git a/timm/models/rexnet.py b/timm/models/rexnet.py index dd3cb4f32f..bbe6100526 100644 --- a/timm/models/rexnet.py +++ b/timm/models/rexnet.py @@ -1,7 +1,7 @@ """ ReXNet A PyTorch impl of `ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network` - -https://arxiv.org/abs/2007.00992 +https://huggingface.co/papers/2007.00992 Adapted from original impl at https://github.com/clovaai/rexnet Copyright (c) 2020-present NAVER Corp. MIT license diff --git a/timm/models/selecsls.py b/timm/models/selecsls.py index fdfa16c318..3a707dccb6 100644 --- a/timm/models/selecsls.py +++ b/timm/models/selecsls.py @@ -4,7 +4,7 @@ SelecSLS (core) Network Architecture as proposed in "XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera, Mehta et al." -https://arxiv.org/abs/1907.00837 +https://huggingface.co/papers/1907.00837 Based on ResNet implementation in https://github.com/rwightman/pytorch-image-models and SelecSLS Net implementation in https://github.com/mehtadushy/SelecSLS-Pytorch diff --git a/timm/models/sequencer.py b/timm/models/sequencer.py index 86c4b1df4d..f2cbc2529f 100644 --- a/timm/models/sequencer.py +++ b/timm/models/sequencer.py @@ -1,6 +1,6 @@ """ Sequencer -Paper: `Sequencer: Deep LSTM for Image Classification` - https://arxiv.org/pdf/2205.01972.pdf +Paper: `Sequencer: Deep LSTM for Image Classification` - https://huggingface.co/papers/2205.01972 """ # Copyright (c) 2022. Yuki Tatsunami diff --git a/timm/models/sknet.py b/timm/models/sknet.py index b12df2319f..3da9dc3744 100644 --- a/timm/models/sknet.py +++ b/timm/models/sknet.py @@ -1,8 +1,8 @@ """ Selective Kernel Networks (ResNet base) -Paper: Selective Kernel Networks (https://arxiv.org/abs/1903.06586) +Paper: Selective Kernel Networks (https://huggingface.co/papers/1903.06586) -This was inspired by reading 'Compounding the Performance Improvements...' (https://arxiv.org/abs/2001.06268) +This was inspired by reading 'Compounding the Performance Improvements...' (https://huggingface.co/papers/2001.06268) and a streamlined impl at https://github.com/clovaai/assembled-cnn but I ended up building something closer to the original paper with some modifications of my own to better balance param count vs accuracy. 
diff --git a/timm/models/swin_transformer.py b/timm/models/swin_transformer.py index 54d57f8148..8e64f0d4e1 100644 --- a/timm/models/swin_transformer.py +++ b/timm/models/swin_transformer.py @@ -1,10 +1,10 @@ """ Swin Transformer A PyTorch impl of : `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows` - - https://arxiv.org/pdf/2103.14030 + - https://huggingface.co/papers/2103.14030 Code/weights from https://github.com/microsoft/Swin-Transformer, original copyright/license info below -S3 (AutoFormerV2, https://arxiv.org/abs/2111.14725) Swin weights from +S3 (AutoFormerV2, https://huggingface.co/papers/2111.14725) Swin weights from - https://github.com/microsoft/Cream/tree/main/AutoFormerV2 Modifications and additions for timm hacked together by / Copyright 2021, Ross Wightman @@ -564,7 +564,7 @@ class SwinTransformer(nn.Module): """ Swin Transformer A PyTorch impl of : `Swin Transformer: Hierarchical Vision Transformer using Shifted Windows` - - https://arxiv.org/pdf/2103.14030 + https://huggingface.co/papers/2103.14030 """ def __init__( @@ -1036,7 +1036,7 @@ def swin_large_patch4_window12_384(pretrained=False, **kwargs) -> SwinTransforme @register_model def swin_s3_tiny_224(pretrained=False, **kwargs) -> SwinTransformer: - """ Swin-S3-T @ 224x224, https://arxiv.org/abs/2111.14725 + """ Swin-S3-T @ 224x224, https://huggingface.co/papers/2111.14725 """ model_args = dict( patch_size=4, window_size=(7, 7, 14, 7), embed_dim=96, depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24)) @@ -1045,7 +1045,7 @@ def swin_s3_tiny_224(pretrained=False, **kwargs) -> SwinTransformer: @register_model def swin_s3_small_224(pretrained=False, **kwargs) -> SwinTransformer: - """ Swin-S3-S @ 224x224, https://arxiv.org/abs/2111.14725 + """ Swin-S3-S @ 224x224, https://huggingface.co/papers/2111.14725 """ model_args = dict( patch_size=4, window_size=(14, 14, 14, 7), embed_dim=96, depths=(2, 2, 18, 2), num_heads=(3, 6, 12, 24)) @@ -1054,7 +1054,7 @@ def swin_s3_small_224(pretrained=False, **kwargs) -> SwinTransformer: @register_model def swin_s3_base_224(pretrained=False, **kwargs) -> SwinTransformer: - """ Swin-S3-B @ 224x224, https://arxiv.org/abs/2111.14725 + """ Swin-S3-B @ 224x224, https://huggingface.co/papers/2111.14725 """ model_args = dict( patch_size=4, window_size=(7, 7, 14, 7), embed_dim=96, depths=(2, 2, 30, 2), num_heads=(3, 6, 12, 24)) diff --git a/timm/models/swin_transformer_v2.py b/timm/models/swin_transformer_v2.py index 6a3330f476..fe73b37fc1 100644 --- a/timm/models/swin_transformer_v2.py +++ b/timm/models/swin_transformer_v2.py @@ -1,6 +1,6 @@ """ Swin Transformer V2 A PyTorch impl of : `Swin Transformer V2: Scaling Up Capacity and Resolution` - - https://arxiv.org/abs/2111.09883 + - https://huggingface.co/papers/2111.09883 Code/weights from https://github.com/microsoft/Swin-Transformer, original copyright/license info below @@ -581,7 +581,7 @@ class SwinTransformerV2(nn.Module): """ Swin Transformer V2 A PyTorch impl of : `Swin Transformer V2: Scaling Up Capacity and Resolution` - - https://arxiv.org/abs/2111.09883 + - https://huggingface.co/papers/2111.09883 """ def __init__( diff --git a/timm/models/swin_transformer_v2_cr.py b/timm/models/swin_transformer_v2_cr.py index d8d247cdeb..a4f9268dc9 100644 --- a/timm/models/swin_transformer_v2_cr.py +++ b/timm/models/swin_transformer_v2_cr.py @@ -1,7 +1,7 @@ """ Swin Transformer V2 A PyTorch impl of : `Swin Transformer V2: Scaling Up Capacity and Resolution` - - https://arxiv.org/pdf/2111.09883 + - 
https://huggingface.co/papers/2111.09883 Code adapted from https://github.com/ChristophReich1996/Swin-Transformer-V2, original copyright/license info below @@ -613,7 +613,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class SwinTransformerV2Cr(nn.Module): r""" Swin Transformer V2 A PyTorch impl of : `Swin Transformer V2: Scaling Up Capacity and Resolution` - - https://arxiv.org/pdf/2111.09883 + https://huggingface.co/papers/2111.09883 Args: img_size: Input resolution. diff --git a/timm/models/tiny_vit.py b/timm/models/tiny_vit.py index d238fa5b2d..e864e4390a 100644 --- a/timm/models/tiny_vit.py +++ b/timm/models/tiny_vit.py @@ -1,7 +1,7 @@ """ TinyViT Paper: `TinyViT: Fast Pretraining Distillation for Small Vision Transformers` - - https://arxiv.org/abs/2207.10666 + - https://huggingface.co/papers/2207.10666 Adapted from official impl at https://github.com/microsoft/Cream/tree/main/TinyViT """ diff --git a/timm/models/tnt.py b/timm/models/tnt.py index fa6e1fc9e7..71a9cd5bf9 100644 --- a/timm/models/tnt.py +++ b/timm/models/tnt.py @@ -1,7 +1,7 @@ """ Transformer in Transformer (TNT) in PyTorch A PyTorch implement of TNT as described in -'Transformer in Transformer' - https://arxiv.org/abs/2103.00112 +'Transformer in Transformer' - https://huggingface.co/papers/2103.00112 The official mindspore code is released and available at https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/TNT @@ -216,7 +216,7 @@ def forward(self, x: torch.Tensor, pixel_pos: torch.Tensor) -> torch.Tensor: class TNT(nn.Module): - """ Transformer in Transformer - https://arxiv.org/abs/2103.00112 + """ Transformer in Transformer - https://huggingface.co/papers/2103.00112 """ def __init__( diff --git a/timm/models/tresnet.py b/timm/models/tresnet.py index 0fb76fa40c..743d9ba7a0 100644 --- a/timm/models/tresnet.py +++ b/timm/models/tresnet.py @@ -1,6 +1,6 @@ """ TResNet: High Performance GPU-Dedicated Architecture -https://arxiv.org/pdf/2003.13630.pdf +https://huggingface.co/papers/2003.13630 Original model: https://github.com/mrT23/TResNet diff --git a/timm/models/twins.py b/timm/models/twins.py index 74d22e24c1..9983c7d4a7 100644 --- a/timm/models/twins.py +++ b/timm/models/twins.py @@ -1,6 +1,6 @@ """ Twins A PyTorch impl of : `Twins: Revisiting the Design of Spatial Attention in Vision Transformers` - - https://arxiv.org/pdf/2104.13840.pdf + - https://huggingface.co/papers/2104.13840 Code/weights from https://github.com/Meituan-AutoML/Twins, original copyright/license info below @@ -230,7 +230,7 @@ def forward(self, x, size: Size_): class PosConv(nn.Module): - # PEG from https://arxiv.org/abs/2102.10882 + # PEG from https://huggingface.co/papers/2102.10882 def __init__(self, in_chans, embed_dim=768, stride=1): super(PosConv, self).__init__() self.proj = nn.Sequential( diff --git a/timm/models/vgg.py b/timm/models/vgg.py index a4cfbffdff..3bddedf3a7 100644 --- a/timm/models/vgg.py +++ b/timm/models/vgg.py @@ -229,7 +229,7 @@ def _cfg(url='', **kwargs): @register_model def vgg11(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 11-layer model (configuration "A") from - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(**kwargs) return _create_vgg('vgg11', pretrained=pretrained, **model_args) @@ -238,7 +238,7 @@ def vgg11(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg11_bn(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 11-layer 
model (configuration "A") with batch normalization - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(norm_layer=nn.BatchNorm2d, **kwargs) return _create_vgg('vgg11_bn', pretrained=pretrained, **model_args) @@ -247,7 +247,7 @@ def vgg11_bn(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg13(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 13-layer model (configuration "B") - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(**kwargs) return _create_vgg('vgg13', pretrained=pretrained, **model_args) @@ -256,7 +256,7 @@ def vgg13(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg13_bn(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 13-layer model (configuration "B") with batch normalization - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(norm_layer=nn.BatchNorm2d, **kwargs) return _create_vgg('vgg13_bn', pretrained=pretrained, **model_args) @@ -265,7 +265,7 @@ def vgg13_bn(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg16(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 16-layer model (configuration "D") - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(**kwargs) return _create_vgg('vgg16', pretrained=pretrained, **model_args) @@ -274,7 +274,7 @@ def vgg16(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg16_bn(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 16-layer model (configuration "D") with batch normalization - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(norm_layer=nn.BatchNorm2d, **kwargs) return _create_vgg('vgg16_bn', pretrained=pretrained, **model_args) @@ -283,7 +283,7 @@ def vgg16_bn(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg19(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 19-layer model (configuration "E") - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(**kwargs) return _create_vgg('vgg19', pretrained=pretrained, **model_args) @@ -292,7 +292,7 @@ def vgg19(pretrained: bool = False, **kwargs: Any) -> VGG: @register_model def vgg19_bn(pretrained: bool = False, **kwargs: Any) -> VGG: r"""VGG 19-layer model (configuration 'E') with batch normalization - `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ + `"Very Deep Convolutional Networks For Large-Scale Image Recognition" `._ """ model_args = dict(norm_layer=nn.BatchNorm2d, **kwargs) return _create_vgg('vgg19_bn', pretrained=pretrained, **model_args) diff --git a/timm/models/visformer.py b/timm/models/visformer.py index 2ed3be5da8..8915c3532a 100644 --- a/timm/models/visformer.py +++ b/timm/models/visformer.py @@ -1,6 +1,6 @@ """ Visformer -Paper: Visformer: The Vision-friendly Transformer - https://arxiv.org/abs/2104.12533 +Paper: Visformer: The Vision-friendly Transformer - 
https://huggingface.co/papers/2104.12533 From original at https://github.com/danczs/Visformer diff --git a/timm/models/vision_transformer.py b/timm/models/vision_transformer.py index 3c7b9a2277..9a89c19a74 100644 --- a/timm/models/vision_transformer.py +++ b/timm/models/vision_transformer.py @@ -3,13 +3,13 @@ A PyTorch implement of Vision Transformers as described in: 'An Image Is Worth 16 x 16 Words: Transformers for Image Recognition at Scale' - - https://arxiv.org/abs/2010.11929 + - https://huggingface.co/papers/2010.11929 `How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers` - - https://arxiv.org/abs/2106.10270 + - https://huggingface.co/papers/2106.10270 `FlexiViT: One Model for All Patch Sizes` - - https://arxiv.org/abs/2212.08013 + - https://huggingface.co/papers/2212.08013 The official jax code is released and available at * https://github.com/google-research/vision_transformer @@ -231,7 +231,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class ParallelScalingBlock(nn.Module): """ Parallel ViT block (MLP & Attention in parallel) Based on: - 'Scaling Vision Transformers to 22 Billion Parameters` - https://arxiv.org/abs/2302.05442 + 'Scaling Vision Transformers to 22 Billion Parameters` - https://huggingface.co/papers/2302.05442 """ fused_attn: Final[bool] @@ -327,7 +327,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor: class ParallelThingsBlock(nn.Module): """ Parallel ViT block (N parallel attention followed by N parallel MLP) Based on: - `Three things everyone should know about Vision Transformers` - https://arxiv.org/abs/2203.09795 + `Three things everyone should know about Vision Transformers` - https://huggingface.co/papers/2203.09795 """ def __init__( self, @@ -426,7 +426,7 @@ class VisionTransformer(nn.Module): """ Vision Transformer A PyTorch impl of : `An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale` - - https://arxiv.org/abs/2010.11929 + - https://huggingface.co/papers/2010.11929 """ dynamic_img_size: Final[bool] @@ -1414,7 +1414,7 @@ def _cfg(url: str = '', **kwargs) -> Dict[str, Any]: hf_hub_id='timm/', custom_load=True, num_classes=21843), - # SAM trained models (https://arxiv.org/abs/2106.01548) + # SAM trained models (https://huggingface.co/papers/2106.01548) 'vit_base_patch32_224.sam_in1k': _cfg( url='https://storage.googleapis.com/vit_models/sam/ViT-B_32.npz', custom_load=True, hf_hub_id='timm/'), @@ -1422,7 +1422,7 @@ def _cfg(url: str = '', **kwargs) -> Dict[str, Any]: url='https://storage.googleapis.com/vit_models/sam/ViT-B_16.npz', custom_load=True, hf_hub_id='timm/'), - # DINO pretrained - https://arxiv.org/abs/2104.14294 (no classifier head, for fine-tune only) + # DINO pretrained - https://huggingface.co/papers/2104.14294 (no classifier head, for fine-tune only) 'vit_small_patch16_224.dino': _cfg( url='https://dl.fbaipublicfiles.com/dino/dino_deitsmall16_pretrain/dino_deitsmall16_pretrain.pth', hf_hub_id='timm/', @@ -1440,7 +1440,7 @@ def _cfg(url: str = '', **kwargs) -> Dict[str, Any]: hf_hub_id='timm/', mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, num_classes=0), - # DINOv2 pretrained - https://arxiv.org/abs/2304.07193 (no classifier head, for fine-tune/features only) + # DINOv2 pretrained - https://huggingface.co/papers/2304.07193 (no classifier head, for fine-tune/features only) 'vit_small_patch14_dinov2.lvd142m': _cfg( url='https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_pretrain.pth', hf_hub_id='timm/', @@ -1466,7 +1466,7 @@ def _cfg(url: str = '', 
**kwargs) -> Dict[str, Any]: mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD, num_classes=0, input_size=(3, 518, 518), crop_pct=1.0), - # DINOv2 pretrained w/ registers - https://arxiv.org/abs/2309.16588 (no classifier head, for fine-tune/features only) + # DINOv2 pretrained w/ registers - https://huggingface.co/papers/2309.16588 (no classifier head, for fine-tune/features only) 'vit_small_patch14_reg4_dinov2.lvd142m': _cfg( url='https://dl.fbaipublicfiles.com/dinov2/dinov2_vits14/dinov2_vits14_reg4_pretrain.pth', hf_hub_id='timm/', @@ -2479,7 +2479,7 @@ def vit_small_patch8_224(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_base_patch32_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base (ViT-B/32) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=32, embed_dim=768, depth=12, num_heads=12) @@ -2489,7 +2489,7 @@ def vit_base_patch32_224(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_base_patch32_384(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base model (ViT-B/32) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base model (ViT-B/32) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 384x384, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=32, embed_dim=768, depth=12, num_heads=12) @@ -2499,7 +2499,7 @@ def vit_base_patch32_384(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_base_patch16_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base (ViT-B/16) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 224x224, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -2509,7 +2509,7 @@ def vit_base_patch16_224(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_base_patch16_384(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base model (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base model (ViT-B/16) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 384x384, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12) @@ -2519,7 +2519,7 @@ def vit_base_patch16_384(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_base_patch8_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base (ViT-B/8) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base (ViT-B/8) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 224x224, source https://github.com/google-research/vision_transformer. 
""" model_args = dict(patch_size=8, embed_dim=768, depth=12, num_heads=12) @@ -2529,7 +2529,7 @@ def vit_base_patch8_224(pretrained: bool = False, **kwargs) -> VisionTransformer @register_model def vit_large_patch32_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Large model (ViT-L/32) from original paper (https://arxiv.org/abs/2010.11929). No pretrained weights. + """ ViT-Large model (ViT-L/32) from original paper (https://huggingface.co/papers/2010.11929). No pretrained weights. """ model_args = dict(patch_size=32, embed_dim=1024, depth=24, num_heads=16) model = _create_vision_transformer('vit_large_patch32_224', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -2538,7 +2538,7 @@ def vit_large_patch32_224(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def vit_large_patch32_384(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Large model (ViT-L/32) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Large model (ViT-L/32) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 384x384, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=32, embed_dim=1024, depth=24, num_heads=16) @@ -2548,7 +2548,7 @@ def vit_large_patch32_384(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def vit_large_patch16_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Large model (ViT-L/16) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 224x224, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=16, embed_dim=1024, depth=24, num_heads=16) @@ -2558,7 +2558,7 @@ def vit_large_patch16_224(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def vit_large_patch16_384(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Large model (ViT-L/16) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Large model (ViT-L/16) from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 384x384, source https://github.com/google-research/vision_transformer. """ model_args = dict(patch_size=16, embed_dim=1024, depth=24, num_heads=16) @@ -2577,7 +2577,7 @@ def vit_large_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def vit_huge_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Huge model (ViT-H/14) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Huge model (ViT-H/14) from original paper (https://huggingface.co/papers/2010.11929). 
""" model_args = dict(patch_size=14, embed_dim=1280, depth=32, num_heads=16) model = _create_vision_transformer('vit_huge_patch14_224', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -2586,7 +2586,7 @@ def vit_huge_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransforme @register_model def vit_giant_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Giant (little-g) model (ViT-g/14) from `Scaling Vision Transformers` - https://arxiv.org/abs/2106.04560 + """ ViT-Giant (little-g) model (ViT-g/14) from `Scaling Vision Transformers` - https://huggingface.co/papers/2106.04560 """ model_args = dict(patch_size=14, embed_dim=1408, mlp_ratio=48/11, depth=40, num_heads=16) model = _create_vision_transformer('vit_giant_patch14_224', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -2595,7 +2595,7 @@ def vit_giant_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def vit_gigantic_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Gigantic (big-G) model (ViT-G/14) from `Scaling Vision Transformers` - https://arxiv.org/abs/2106.04560 + """ ViT-Gigantic (big-G) model (ViT-G/14) from `Scaling Vision Transformers` - https://huggingface.co/papers/2106.04560 """ model_args = dict(patch_size=14, embed_dim=1664, mlp_ratio=64/13, depth=48, num_heads=16) model = _create_vision_transformer( @@ -2605,7 +2605,7 @@ def vit_gigantic_patch14_224(pretrained: bool = False, **kwargs) -> VisionTransf @register_model def vit_base_patch16_224_miil(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Base (ViT-B/16) from original paper (https://arxiv.org/abs/2010.11929). + """ ViT-Base (ViT-B/16) from original paper (https://huggingface.co/papers/2010.11929). Weights taken from: https://github.com/Alibaba-MIIL/ImageNet21K """ model_args = dict(patch_size=16, embed_dim=768, depth=12, num_heads=12, qkv_bias=False) @@ -2871,7 +2871,7 @@ def vit_huge_patch14_clip_378(pretrained: bool = False, **kwargs) -> VisionTrans @register_model def vit_giant_patch14_clip_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-Giant (little-g) model (ViT-g/14) from `Scaling Vision Transformers` - https://arxiv.org/abs/2106.04560 + """ ViT-Giant (little-g) model (ViT-g/14) from `Scaling Vision Transformers` - https://huggingface.co/papers/2106.04560 Pretrained weights from CLIP image tower. """ model_args = dict( @@ -2883,7 +2883,7 @@ def vit_giant_patch14_clip_224(pretrained: bool = False, **kwargs) -> VisionTran @register_model def vit_gigantic_patch14_clip_224(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ ViT-bigG model (ViT-G/14) from `Scaling Vision Transformers` - https://arxiv.org/abs/2106.04560 + """ ViT-bigG model (ViT-G/14) from `Scaling Vision Transformers` - https://huggingface.co/papers/2106.04560 Pretrained weights from CLIP image tower. """ model_args = dict( @@ -3014,7 +3014,7 @@ def vit_base_patch16_rpn_224(pretrained: bool = False, **kwargs) -> VisionTransf @register_model def vit_small_patch16_36x1_224(pretrained: bool = False, **kwargs) -> VisionTransformer: """ ViT-Base w/ LayerScale + 36 x 1 (36 block serial) config. Experimental, may remove. - Based on `Three things everyone should know about Vision Transformers` - https://arxiv.org/abs/2203.09795 + Based on `Three things everyone should know about Vision Transformers` - https://huggingface.co/papers/2203.09795 Paper focuses on 24x2 + 48x1 for 'Small' width but those are extremely slow. 
""" model_args = dict(patch_size=16, embed_dim=384, depth=36, num_heads=6, init_values=1e-5) @@ -3026,7 +3026,7 @@ def vit_small_patch16_36x1_224(pretrained: bool = False, **kwargs) -> VisionTran @register_model def vit_small_patch16_18x2_224(pretrained: bool = False, **kwargs) -> VisionTransformer: """ ViT-Small w/ LayerScale + 18 x 2 (36 block parallel) config. Experimental, may remove. - Based on `Three things everyone should know about Vision Transformers` - https://arxiv.org/abs/2203.09795 + Based on `Three things everyone should know about Vision Transformers` - https://huggingface.co/papers/2203.09795 Paper focuses on 24x2 + 48x1 for 'Small' width but those are extremely slow. """ model_args = dict( @@ -3039,7 +3039,7 @@ def vit_small_patch16_18x2_224(pretrained: bool = False, **kwargs) -> VisionTran @register_model def vit_base_patch16_18x2_224(pretrained: bool = False, **kwargs) -> VisionTransformer: """ ViT-Base w/ LayerScale + 18 x 2 (36 block parallel) config. Experimental, may remove. - Based on `Three things everyone should know about Vision Transformers` - https://arxiv.org/abs/2203.09795 + Based on `Three things everyone should know about Vision Transformers` - https://huggingface.co/papers/2203.09795 """ model_args = dict( patch_size=16, embed_dim=768, depth=18, num_heads=12, init_values=1e-5, block_fn=ParallelThingsBlock) @@ -3050,7 +3050,7 @@ def vit_base_patch16_18x2_224(pretrained: bool = False, **kwargs) -> VisionTrans @register_model def eva_large_patch14_196(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ EVA-large model https://arxiv.org/abs/2211.07636 /via MAE MIM pretrain""" + """ EVA-large model https://huggingface.co/papers/2211.07636 /via MAE MIM pretrain""" model_args = dict(patch_size=14, embed_dim=1024, depth=24, num_heads=16, global_pool='avg') model = _create_vision_transformer( 'eva_large_patch14_196', pretrained=pretrained, **dict(model_args, **kwargs)) @@ -3059,7 +3059,7 @@ def eva_large_patch14_196(pretrained: bool = False, **kwargs) -> VisionTransform @register_model def eva_large_patch14_336(pretrained: bool = False, **kwargs) -> VisionTransformer: - """ EVA-large model https://arxiv.org/abs/2211.07636 via MAE MIM pretrain""" + """ EVA-large model https://huggingface.co/papers/2211.07636 via MAE MIM pretrain""" model_args = dict(patch_size=14, embed_dim=1024, depth=24, num_heads=16, global_pool='avg') model = _create_vision_transformer('eva_large_patch14_336', pretrained=pretrained, **dict(model_args, **kwargs)) return model diff --git a/timm/models/vision_transformer_hybrid.py b/timm/models/vision_transformer_hybrid.py index 4cf3a7664b..26e20123bd 100644 --- a/timm/models/vision_transformer_hybrid.py +++ b/timm/models/vision_transformer_hybrid.py @@ -3,10 +3,10 @@ A PyTorch implement of the Hybrid Vision Transformers as described in: 'An Image Is Worth 16 x 16 Words: Transformers for Image Recognition at Scale' - - https://arxiv.org/abs/2010.11929 + - https://huggingface.co/papers/2010.11929 `How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers` - - https://arxiv.org/abs/2106.10270 + - https://huggingface.co/papers/2106.10270 NOTE These hybrid model definitions depend on code in vision_transformer.py. They were moved here to keep file sizes sane. 
@@ -291,7 +291,7 @@ def vit_base_r26_s32_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def vit_base_r50_s16_224(pretrained=False, **kwargs) -> VisionTransformer: - """ R50+ViT-B/S16 hybrid from original paper (https://arxiv.org/abs/2010.11929). + """ R50+ViT-B/S16 hybrid from original paper (https://huggingface.co/papers/2010.11929). """ backbone = _resnetv2((3, 4, 9), **kwargs) model_args = dict(embed_dim=768, depth=12, num_heads=12) @@ -302,7 +302,7 @@ def vit_base_r50_s16_224(pretrained=False, **kwargs) -> VisionTransformer: @register_model def vit_base_r50_s16_384(pretrained=False, **kwargs) -> VisionTransformer: - """ R50+ViT-B/16 hybrid from original paper (https://arxiv.org/abs/2010.11929). + """ R50+ViT-B/16 hybrid from original paper (https://huggingface.co/papers/2010.11929). ImageNet-1k weights fine-tuned from in21k @ 384x384, source https://github.com/google-research/vision_transformer. """ backbone = _resnetv2((3, 4, 9), **kwargs) diff --git a/timm/models/vision_transformer_sam.py b/timm/models/vision_transformer_sam.py index 75bb12e56f..f6fe7057d1 100644 --- a/timm/models/vision_transformer_sam.py +++ b/timm/models/vision_transformer_sam.py @@ -3,7 +3,7 @@ A PyTorch implement of Vision Transformers as described in: 'Exploring Plain Vision Transformer Backbones for Object Detection' - - https://arxiv.org/abs/2203.16527 + - https://huggingface.co/papers/2203.16527 'Segment Anything Model (SAM)' - https://github.com/facebookresearch/segment-anything/ @@ -320,7 +320,7 @@ class VisionTransformerSAM(nn.Module): """ Vision Transformer for Segment-Anything Model(SAM) A PyTorch impl of : `Exploring Plain Vision Transformer Backbones for Object Detection` or `Segment Anything Model (SAM)` - - https://arxiv.org/abs/2010.11929 + - https://huggingface.co/papers/2010.11929 """ def __init__( diff --git a/timm/models/volo.py b/timm/models/volo.py index 46be778f67..c5957a52b1 100644 --- a/timm/models/volo.py +++ b/timm/models/volo.py @@ -1,6 +1,6 @@ """ Vision OutLOoker (VOLO) implementation -Paper: `VOLO: Vision Outlooker for Visual Recognition` - https://arxiv.org/abs/2106.13112 +Paper: `VOLO: Vision Outlooker for Visual Recognition` - https://huggingface.co/papers/2106.13112 Code adapted from official impl at https://github.com/sail-sg/volo, original copyright in comment below diff --git a/timm/models/vovnet.py b/timm/models/vovnet.py index 0b48a34c14..9092ba54c4 100644 --- a/timm/models/vovnet.py +++ b/timm/models/vovnet.py @@ -1,8 +1,8 @@ """ VoVNet (V1 & V2) Papers: -* `An Energy and GPU-Computation Efficient Backbone Network` - https://arxiv.org/abs/1904.09730 -* `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://arxiv.org/abs/1911.06667 +* `An Energy and GPU-Computation Efficient Backbone Network` - https://huggingface.co/papers/1904.09730 +* `CenterMask : Real-Time Anchor-Free Instance Segmentation` - https://huggingface.co/papers/1911.06667 Looked at https://github.com/youngwanLEE/vovnet-detectron2 & https://github.com/stigma0617/VoVNet.pytorch/blob/master/models_vovnet/vovnet.py diff --git a/timm/models/xception.py b/timm/models/xception.py index e1f92abfa0..516c3b76ec 100644 --- a/timm/models/xception.py +++ b/timm/models/xception.py @@ -8,7 +8,7 @@ Francois Chollet Xception: Deep Learning with Depthwise Separable Convolutions -https://arxiv.org/pdf/1610.02357.pdf +https://huggingface.co/papers/1610.02357 This weights ported from the Keras implementation. 
Achieves the following performance on the validation set: @@ -93,7 +93,7 @@ def forward(self, inp): class Xception(nn.Module): """ Xception optimized for the ImageNet dataset, as specified in - https://arxiv.org/pdf/1610.02357.pdf + https://huggingface.co/papers/1610.02357 """ def __init__(self, num_classes=1000, in_chans=3, drop_rate=0., global_pool='avg'): diff --git a/timm/models/xcit.py b/timm/models/xcit.py index 250749f1cf..010fafd1b8 100644 --- a/timm/models/xcit.py +++ b/timm/models/xcit.py @@ -1,7 +1,7 @@ """ Cross-Covariance Image Transformer (XCiT) in PyTorch Paper: - - https://arxiv.org/abs/2106.09681 + - https://huggingface.co/papers/2106.09681 Same as the official implementation, with some minor adaptations, original copyright below - https://github.com/facebookresearch/xcit/blob/master/xcit.py @@ -144,7 +144,7 @@ def forward(self, x, H: int, W: int): class ClassAttentionBlock(nn.Module): - """Class Attention Layer as in CaiT https://arxiv.org/abs/2103.17239""" + """Class Attention Layer as in CaiT https://huggingface.co/papers/2103.17239""" def __init__( self, diff --git a/timm/optim/adafactor.py b/timm/optim/adafactor.py index c426e30a17..1049b5949b 100644 --- a/timm/optim/adafactor.py +++ b/timm/optim/adafactor.py @@ -22,7 +22,7 @@ class Adafactor(torch.optim.Optimizer): """Implements Adafactor algorithm. This implementation is based on: `Adafactor: Adaptive Learning Rates with Sublinear Memory Cost` - (see https://arxiv.org/abs/1804.04235) + (see https://huggingface.co/papers/1804.04235) Note that this optimizer internally adjusts the learning rate depending on the *scale_parameter*, *relative_step* and *warmup_init* options. @@ -212,7 +212,7 @@ def _remove_dim(shape, dim): exp_avg = state['exp_avg'] exp_avg.mul_(group['beta1']).add_(update, alpha=1 - group['beta1']) if group['caution']: - # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085 + # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085 mask = (exp_avg * grad > 0).to(grad.dtype) mask.div_(mask.mean().clamp_(min=1e-3)) update = exp_avg * mask diff --git a/timm/optim/adafactor_bv.py b/timm/optim/adafactor_bv.py index 298d43bb7e..6ed5af3d6f 100644 --- a/timm/optim/adafactor_bv.py +++ b/timm/optim/adafactor_bv.py @@ -2,7 +2,7 @@ Adapted from the implementation in big vision: https://github.com/google-research/big_vision -Described in 'Scaling Vision Transformers': https://arxiv.org/abs/2106.04560 +Described in 'Scaling Vision Transformers': https://huggingface.co/papers/2106.04560 Adaptation and PyTorch modifications by Ross Wightman """ @@ -274,7 +274,7 @@ def _single_tensor_adafactor( update = exp_avg.clone() if caution: - # apply caution as per 'Cautious Optimizers': https://arxiv.org/abs/2411.16085 + # apply caution as per 'Cautious Optimizers': https://huggingface.co/papers/2411.16085 mask = (update * grad > 0).to(grad.dtype) mask.div_(mask.mean().clamp_(min=1e-3)) update.mul_(mask) diff --git a/timm/optim/adamp.py b/timm/optim/adamp.py index 5a9ac3395d..9e5394adc3 100644 --- a/timm/optim/adamp.py +++ b/timm/optim/adamp.py @@ -1,7 +1,7 @@ """ AdamP Optimizer Implementation copied from https://github.com/clovaai/AdamP/blob/master/adamp/adamp.py -Paper: `Slowing Down the Weight Norm Increase in Momentum-based Optimizers` - https://arxiv.org/abs/2006.08217 +Paper: `Slowing Down the Weight Norm Increase in Momentum-based Optimizers` - https://huggingface.co/papers/2006.08217 Code: https://github.com/clovaai/AdamP Copyright (c) 2020-present NAVER Corp. 
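The 'caution' branches touched in the Adafactor hunks above (and in several optimizers below) all implement the same step from the Cautious Optimizers paper: zero out update components whose sign disagrees with the current gradient, then rescale so the surviving entries keep roughly the original overall magnitude. A standalone sketch mirroring the in-tree code; the function name is illustrative, not part of the timm API:

import torch

def apply_caution(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Keep only components whose sign agrees with the gradient...
    mask = (update * grad > 0).to(grad.dtype)
    # ...and rescale by the surviving fraction (clamped so a nearly-empty mask
    # does not blow the step up).
    mask.div_(mask.mean().clamp_(min=1e-3))
    return update * mask

u = torch.tensor([0.5, -0.2, 0.1])
g = torch.tensor([1.0, 0.3, -0.4])
print(apply_caution(u, g))  # tensor([1.5000, 0.0000, 0.0000]): one of three components survives, scaled by 3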
diff --git a/timm/optim/adamw.py b/timm/optim/adamw.py index 07299ad63e..34c1fdc2ba 100644 --- a/timm/optim/adamw.py +++ b/timm/optim/adamw.py @@ -18,8 +18,8 @@ class AdamWLegacy(Optimizer): NOTE: This impl has been deprecated in favour of torch.optim.NAdam and remains as a reference References: - - Adam: A Method for Stochastic Optimization: https://arxiv.org/abs/1412.6980 - - Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101 + - Adam: A Method for Stochastic Optimization: https://huggingface.co/papers/1412.6980 + - Decoupled Weight Decay Regularization: https://huggingface.co/papers/1711.05101 - On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ Args: @@ -130,7 +130,7 @@ def step(self, closure=None): step_size = group['lr'] / bias_correction1 if group['caution']: - # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085 + # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085 mask = (exp_avg * grad > 0).to(grad.dtype) mask.div_(mask.mean().clamp_(min=1e-3)) exp_avg = exp_avg * mask diff --git a/timm/optim/adan.py b/timm/optim/adan.py index 4db62e9cf0..1f455b833b 100644 --- a/timm/optim/adan.py +++ b/timm/optim/adan.py @@ -1,7 +1,7 @@ """ Adan Optimizer Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models[J]. arXiv preprint arXiv:2208.06677, 2022. - https://arxiv.org/abs/2208.06677 + https://huggingface.co/papers/2208.06677 Implementation adapted from https://github.com/sail-sg/Adan """ @@ -47,7 +47,7 @@ class Adan(Optimizer): """ Implements a pytorch variant of Adan. Adan was proposed in Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models - https://arxiv.org/abs/2208.06677 + https://huggingface.co/papers/2208.06677 Arguments: params: Iterable of parameters to optimize or dicts defining parameter groups. 
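Since the Adan hunk above only shows the docstring, a quick usage note: these timm optimizers follow the standard torch.optim interface and can be constructed directly on a model's parameters. A small sketch, assuming `Adan` is exported from `timm.optim` as in recent releases:

import torch
from timm.optim import Adan  # assumed export; adjust the import if your version differs

model = torch.nn.Linear(10, 2)
optimizer = Adan(model.parameters(), lr=1e-3, weight_decay=0.02)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()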
@@ -244,7 +244,7 @@ def _single_tensor_adan(
         step_size = lr / bias_correction1

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             mask = (exp_avg * grad > 0).to(grad.dtype)
             mask.div_(mask.mean().clamp_(min=1e-3))
             exp_avg = exp_avg * mask
@@ -306,7 +306,7 @@ def _multi_tensor_adan(
     step_size = lr / bias_correction1

     if caution:
-        # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+        # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
         masks = torch._foreach_mul(exp_avgs, grads)
         masks = [(m > 0).to(g.dtype) for m, g in zip(masks, grads)]
         mask_scale = [m.mean() for m in masks]
diff --git a/timm/optim/adopt.py b/timm/optim/adopt.py
index 6192990ed6..f502d33632 100644
--- a/timm/optim/adopt.py
+++ b/timm/optim/adopt.py
@@ -1,6 +1,6 @@
 """ ADOPT PyTorch Optimizer

-ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate: https://arxiv.org/abs/2411.02853
+ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate: https://huggingface.co/papers/2411.02853

 Modified for reduced dependencies on PyTorch internals from original at: https://github.com/iShohei220/adopt

@@ -54,7 +54,7 @@ def _get_value(x):

 class Adopt(Optimizer):
     """
-    ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate: https://arxiv.org/abs/2411.02853
+    ADOPT: Modified Adam Can Converge with Any β2 with the Optimal Rate: https://huggingface.co/papers/2411.02853
     """

     def __init__(
@@ -311,7 +311,7 @@ def _single_tensor_adopt(
         exp_avg.lerp_(normed_grad, 1 - beta1)

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             mask = (exp_avg * grad > 0).to(grad.dtype)
             mask.div_(mask.mean().clamp_(min=1e-3))
             exp_avg = exp_avg * mask
@@ -425,7 +425,7 @@ def _multi_tensor_adopt(
         torch._foreach_lerp_(device_exp_avgs, normed_grad, 1 - beta1)

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             masks = torch._foreach_mul(device_exp_avgs, device_grads)
             masks = [(m > 0).to(g.dtype) for m, g in zip(masks, device_grads)]
             mask_scale = [m.mean() for m in masks]
diff --git a/timm/optim/lamb.py b/timm/optim/lamb.py
index fa86757441..726714fba6 100644
--- a/timm/optim/lamb.py
+++ b/timm/optim/lamb.py
@@ -65,7 +65,7 @@ class Lamb(Optimizer):
     reference: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/Transformer-XL/pytorch/lamb.py

     LAMB was proposed in:
-    - Large Batch Optimization for Deep Learning - Training BERT in 76 minutes: https://arxiv.org/abs/1904.00962
+    - Large Batch Optimization for Deep Learning - Training BERT in 76 minutes: https://huggingface.co/papers/1904.00962
     - On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ

     Args:
@@ -195,7 +195,7 @@ def step(self, closure=None):
                 update = (exp_avg / bias_correction1).div_(denom)

                 if group['caution']:
-                    # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                    # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                     mask = (update * grad > 0).to(grad.dtype)
                     mask.div_(mask.mean().clamp_(min=1e-3))
                     update.mul_(mask)
diff --git a/timm/optim/laprop.py b/timm/optim/laprop.py
index a17c81e6ad..baf8f7d385 100644
--- a/timm/optim/laprop.py
+++ b/timm/optim/laprop.py
@@ -2,7 +2,7 @@

 Code simplified from https://github.com/Z-T-WANG/LaProp-Optimizer, MIT License

-Paper: LaProp: Separating Momentum and Adaptivity in Adam, https://arxiv.org/abs/2002.04839
+Paper: LaProp: Separating Momentum and Adaptivity in Adam, https://huggingface.co/papers/2002.04839

 @article{ziyin2020laprop,
   title={LaProp: a Better Way to Combine Momentum with Adaptive Gradient},
@@ -23,7 +23,7 @@ class LaProp(Optimizer):
     """ LaProp Optimizer

-    Paper: LaProp: Separating Momentum and Adaptivity in Adam, https://arxiv.org/abs/2002.04839
+    Paper: LaProp: Separating Momentum and Adaptivity in Adam, https://huggingface.co/papers/2002.04839
     """
     def __init__(
             self,
@@ -108,7 +108,7 @@ def step(self, closure=None):
                 exp_avg.mul_(beta1).add_(step_of_this_grad, alpha=group['lr'] * one_minus_beta1)

                 if group['caution']:
-                    # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                    # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                     mask = (exp_avg * grad > 0).to(grad.dtype)
                     mask.div_(mask.mean().clamp_(min=1e-3))
                     exp_avg = exp_avg * mask
diff --git a/timm/optim/lars.py b/timm/optim/lars.py
index d49efc6d0e..9dece694b7 100644
--- a/timm/optim/lars.py
+++ b/timm/optim/lars.py
@@ -17,7 +17,7 @@ class Lars(Optimizer):
     """ LARS for PyTorch

-    Paper: `Large batch training of Convolutional Networks` - https://arxiv.org/pdf/1708.03888.pdf
+    Paper: `Large batch training of Convolutional Networks` - https://huggingface.co/papers/1708.03888

     Args:
         params (iterable): iterable of parameters to optimize or dicts defining parameter groups.
diff --git a/timm/optim/lion.py b/timm/optim/lion.py
index 1860723203..6a79a826c0 100644
--- a/timm/optim/lion.py
+++ b/timm/optim/lion.py
@@ -1,5 +1,5 @@
 """ Lion Optimizer
-Paper: `Symbolic Discovery of Optimization Algorithms` - https://arxiv.org/abs/2302.06675
+Paper: `Symbolic Discovery of Optimization Algorithms` - https://huggingface.co/papers/2302.06675
 Original Impl: https://github.com/google/automl/tree/master/lion
 """
 # Copyright 2023 Google Research. All Rights Reserved.
@@ -196,7 +196,7 @@ def _single_tensor_lion(
         update = exp_avg.mul(beta1).add_(grad, alpha=1 - beta1).sign_()

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             mask = (update * grad > 0).to(grad.dtype)
             mask.div_(mask.mean().clamp_(min=1e-3))
             update.mul_(mask)
@@ -238,7 +238,7 @@ def _multi_tensor_lion(
     updates = [u.sign_() for u in updates]

     if caution:
-        # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+        # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
         masks = torch._foreach_mul(updates, grads)
         masks = [(m > 0).to(g.dtype) for m, g in zip(masks, grads)]
         mask_scale = [m.mean() for m in masks]
diff --git a/timm/optim/lookahead.py b/timm/optim/lookahead.py
index 1c0f1c91d8..0ee2418757 100644
--- a/timm/optim/lookahead.py
+++ b/timm/optim/lookahead.py
@@ -1,6 +1,6 @@
 """ Lookahead Optimizer Wrapper.
 Implementation modified from: https://github.com/alphadl/lookahead.pytorch
-Paper: `Lookahead Optimizer: k steps forward, 1 step back` - https://arxiv.org/abs/1907.08610
+Paper: `Lookahead Optimizer: k steps forward, 1 step back` - https://huggingface.co/papers/1907.08610

 Hacked together by / Copyright 2020 Ross Wightman
 """
diff --git a/timm/optim/madgrad.py b/timm/optim/madgrad.py
index 8e449dce3d..1e41364f13 100644
--- a/timm/optim/madgrad.py
+++ b/timm/optim/madgrad.py
@@ -1,6 +1,6 @@
 """ PyTorch MADGRAD optimizer

-MADGRAD: https://arxiv.org/abs/2101.11075
+MADGRAD: https://huggingface.co/papers/2101.11075

 Code from: https://github.com/facebookresearch/madgrad
 """
@@ -26,7 +26,7 @@ class MADGRAD(torch.optim.Optimizer):
     MADGRAD_: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic
     Optimization.

-    .. _MADGRAD: https://arxiv.org/abs/2101.11075
+    .. _MADGRAD: https://huggingface.co/papers/2101.11075

     MADGRAD is a general purpose optimizer that can be used in place of SGD or
     Adam may converge faster and generalize better. Currently GPU-only.
diff --git a/timm/optim/mars.py b/timm/optim/mars.py
index 1068ee9141..bcfbd12c49 100644
--- a/timm/optim/mars.py
+++ b/timm/optim/mars.py
@@ -2,7 +2,7 @@

 Code simplified from https://github.com/AGI-Arena/MARS

-Paper: MARS: Unleashing the Power of Variance Reduction for Training Large Models - https://arxiv.org/abs/2411.10438
+Paper: MARS: Unleashing the Power of Variance Reduction for Training Large Models - https://huggingface.co/papers/2411.10438

 @article{yuan2024mars,
     title={MARS: Unleashing the Power of Variance Reduction for Training Large Models},
@@ -56,7 +56,7 @@ def _mars_single_tensor_step(
         exp_avg.mul_(beta1).add_(c_t, alpha=one_minus_beta1)

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             mask = (exp_avg * grad > 0).to(grad.dtype)
             mask.div_(mask.mean().clamp_(min=1e-3))
             exp_avg = exp_avg * mask
@@ -92,7 +92,7 @@ class Mars(Optimizer):
     """ MARS Optimizer

     Paper: MARS: Unleashing the Power of Variance Reduction for Training Large Models
-        https://arxiv.org/abs/2411.10438
+        https://huggingface.co/papers/2411.10438
     """

     def __init__(
diff --git a/timm/optim/nadamw.py b/timm/optim/nadamw.py
index d9933026c6..236d7d8b48 100644
--- a/timm/optim/nadamw.py
+++ b/timm/optim/nadamw.py
@@ -17,12 +17,12 @@ class NAdamW(torch.optim.Optimizer):
     """ Implements NAdamW algorithm.

-    See Table 1 in https://arxiv.org/abs/1910.05446 for the implementation of
+    See Table 1 in https://huggingface.co/papers/1910.05446 for the implementation of
     the NAdam algorithm (there is also a comment in the code which highlights
     the only difference of NAdamW and AdamW).

     For further details regarding the algorithm we refer to
-    - Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101
+    - Decoupled Weight Decay Regularization: https://huggingface.co/papers/1711.05101
     - On the Convergence of Adam and Beyond: https://openreview.net/forum?id=ryQu7f-RZ

     Args:
@@ -250,7 +250,7 @@ def _single_tensor_nadamw(
            denom = (exp_avg_sq.sqrt() / (bias_correction2_sqrt * step_size_neg)).add_(eps / step_size_neg)

            if caution:
-                # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                # FIXME not 100% sure if this remains capturable?
                mask = (exp_avg * grad > 0).to(grad.dtype)
                mask.div_(mask.mean().clamp_(min=1e-3))
@@ -270,7 +270,7 @@ def _single_tensor_nadamw(
            denom = (exp_avg_sq.sqrt() / bias_correction2_sqrt).add_(eps)

            if caution:
-                # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                mask = (exp_avg * grad > 0).to(grad.dtype)
                mask.div_(mask.mean().clamp_(min=1e-3))
                exp_avg.mul_(mask)
@@ -355,7 +355,7 @@ def _multi_tensor_nadamw(
         denom = torch._foreach_add(exp_avg_sq_sqrt, eps_over_step_size)

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             masks = torch._foreach_mul(exp_avgs, grads)
             masks = [(m > 0).to(g.dtype) for m, g in zip(masks, grads)]  # capturable?
             mask_scale = [m.mean() for m in masks]
@@ -382,7 +382,7 @@ def _multi_tensor_nadamw(
         denom = torch._foreach_add(exp_avg_sq_sqrt, eps)

         if caution:
-            # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+            # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
             masks = torch._foreach_mul(exp_avgs, grads)
             masks = [(m > 0).to(g.dtype) for m, g in zip(masks, grads)]
             mask_scale = [m.mean() for m in masks]
diff --git a/timm/optim/nvnovograd.py b/timm/optim/nvnovograd.py
index 068e5aa2c1..f3d43bf507 100644
--- a/timm/optim/nvnovograd.py
+++ b/timm/optim/nvnovograd.py
@@ -2,7 +2,7 @@
 Original impl by Nvidia from Jasper example:
     - https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper
 Paper: `Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks`
-    - https://arxiv.org/abs/1905.11286
+    - https://huggingface.co/papers/1905.11286
 """

 import torch
diff --git a/timm/optim/radam.py b/timm/optim/radam.py
index 9b12b98a59..648dec72eb 100644
--- a/timm/optim/radam.py
+++ b/timm/optim/radam.py
@@ -1,6 +1,6 @@
 """RAdam Optimizer.
 Implementation lifted from: https://github.com/LiyuanLucasLiu/RAdam
-Paper: `On the Variance of the Adaptive Learning Rate and Beyond` - https://arxiv.org/abs/1908.03265
+Paper: `On the Variance of the Adaptive Learning Rate and Beyond` - https://huggingface.co/papers/1908.03265

 NOTE: This impl has been deprecated in favour of torch.optim.RAdam and remains as a reference
 """
diff --git a/timm/optim/rmsprop_tf.py b/timm/optim/rmsprop_tf.py
index 07b0279c85..a335e2010b 100644
--- a/timm/optim/rmsprop_tf.py
+++ b/timm/optim/rmsprop_tf.py
@@ -28,7 +28,7 @@ class RMSpropTF(Optimizer):

     `course <https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf>`_.
     The centered version first appears in `Generating Sequences
-    With Recurrent Neural Networks <https://arxiv.org/pdf/1308.0850v5.pdf>`_.
+    With Recurrent Neural Networks <https://huggingface.co/papers/1308.0850>`_.

     Args:
         params: iterable of parameters to optimize or dicts defining parameter groups
@@ -38,7 +38,7 @@ class RMSpropTF(Optimizer):
         eps: term added to the denominator to improve numerical stability
         centered: if ``True``, compute the centered RMSProp, the gradient is normalized by an estimation of its variance
         weight_decay: weight decay (L2 penalty) (default: 0)
-        decoupled_decay: decoupled weight decay as per https://arxiv.org/abs/1711.05101
+        decoupled_decay: decoupled weight decay as per https://huggingface.co/papers/1711.05101
         lr_in_momentum: learning rate scaling is included in the momentum buffer update as per defaults in Tensorflow
         caution: apply caution
     """
@@ -146,7 +146,7 @@ def step(self, closure=None):
                    buf.mul_(group['momentum'])

                    def _apply_caution(_m, _g):
-                        # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                        # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                        mask = (_m * _g > 0).to(_g.dtype)
                        mask.div_(mask.mean().clamp_(min=1e-3))
                        return _m * mask
diff --git a/timm/optim/sgdp.py b/timm/optim/sgdp.py
index 87b89f6f0b..26bb481a57 100644
--- a/timm/optim/sgdp.py
+++ b/timm/optim/sgdp.py
@@ -1,7 +1,7 @@
 """
 SGDP Optimizer Implementation copied from https://github.com/clovaai/AdamP/blob/master/adamp/sgdp.py

-Paper: `Slowing Down the Weight Norm Increase in Momentum-based Optimizers` - https://arxiv.org/abs/2006.08217
+Paper: `Slowing Down the Weight Norm Increase in Momentum-based Optimizers` - https://huggingface.co/papers/2006.08217
 Code: https://github.com/clovaai/AdamP

 Copyright (c) 2020-present NAVER Corp.
diff --git a/timm/optim/sgdw.py b/timm/optim/sgdw.py
index b771c43c67..3ef3df5145 100644
--- a/timm/optim/sgdw.py
+++ b/timm/optim/sgdw.py
@@ -209,7 +209,7 @@ def _single_tensor_sgdw(
            if caution:
                if nesterov:
                    buf = grad.add(buf, alpha=momentum)
-                # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                mask = (buf * grad > 0).to(grad.dtype)
                mask.div_(mask.mean().clamp_(min=1e-3))
                grad = buf * mask
@@ -279,7 +279,7 @@ def _multi_tensor_sgdw(
                if nesterov:
                    # Can't do nesterov in-place if we want to compare against orig grad for caution
                    bufs = torch._foreach_add(device_grads, bufs, alpha=momentum)
-                # Apply caution as per 'Cautious Optimizers' - https://arxiv.org/abs/2411.16085
+                # Apply caution as per 'Cautious Optimizers' - https://huggingface.co/papers/2411.16085
                masks = torch._foreach_mul(bufs, device_grads)
                masks = [(m > 0).to(g.dtype) for m, g in zip(masks, device_grads)]
                mask_scale = [m.mean() for m in masks]
diff --git a/timm/scheduler/cosine_lr.py b/timm/scheduler/cosine_lr.py
index 00dd9357d9..1dee9831cf 100644
--- a/timm/scheduler/cosine_lr.py
+++ b/timm/scheduler/cosine_lr.py
@@ -19,12 +19,12 @@ class CosineLRScheduler(Scheduler):
     """
     Cosine decay with restarts.

-    This is described in the paper https://arxiv.org/abs/1608.03983.
+    This is described in the paper https://huggingface.co/papers/1608.03983.

     Inspiration from
     https://github.com/allenai/allennlp/blob/master/allennlp/training/learning_rate_schedulers/cosine.py

-    k-decay option based on `k-decay: A New Method For Learning Rate Schedule` - https://arxiv.org/abs/2004.05909
+    k-decay option based on `k-decay: A New Method For Learning Rate Schedule` - https://huggingface.co/papers/2004.05909
     """

     def __init__(
diff --git a/timm/scheduler/poly_lr.py b/timm/scheduler/poly_lr.py
index f7971302ed..80dc46f6b7 100644
--- a/timm/scheduler/poly_lr.py
+++ b/timm/scheduler/poly_lr.py
@@ -19,7 +19,7 @@ class PolyLRScheduler(Scheduler):
     """ Polynomial LR Scheduler w/ warmup, noise, and k-decay

-    k-decay option based on `k-decay: A New Method For Learning Rate Schedule` - https://arxiv.org/abs/2004.05909
+    k-decay option based on `k-decay: A New Method For Learning Rate Schedule` - https://huggingface.co/papers/2004.05909
     """

     def __init__(
diff --git a/timm/scheduler/tanh_lr.py b/timm/scheduler/tanh_lr.py
index 932229262e..c7a1d5ee2a 100644
--- a/timm/scheduler/tanh_lr.py
+++ b/timm/scheduler/tanh_lr.py
@@ -19,7 +19,7 @@ class TanhLRScheduler(Scheduler):
     """
     Hyberbolic-Tangent decay with restarts.

-    This is described in the paper https://arxiv.org/abs/1806.01593
+    This is described in the paper https://huggingface.co/papers/1806.01593
     """

     def __init__(
diff --git a/timm/utils/agc.py b/timm/utils/agc.py
index f51401726f..80b26cc6c9 100644
--- a/timm/utils/agc.py
+++ b/timm/utils/agc.py
@@ -1,6 +1,6 @@
 """ Adaptive Gradient Clipping

-An impl of AGC, as per (https://arxiv.org/abs/2102.06171):
+An impl of AGC, as per (https://huggingface.co/papers/2102.06171):

 @article{brock2021high,
   author={Andrew Brock and Soham De and Samuel L. Smith and Karen Simonyan},
diff --git a/timm/utils/model.py b/timm/utils/model.py
index 90412b741c..523c149729 100644
--- a/timm/utils/model.py
+++ b/timm/utils/model.py
@@ -98,7 +98,7 @@ def extract_spp_stats(
     """Extract average square channel mean and variance of activations during
     forward pass to plot Signal Propagation Plots (SPP).

-    Paper: https://arxiv.org/abs/2101.08692
+    Paper: https://huggingface.co/papers/2101.08692

     Example Usage: https://gist.github.com/amaarora/6e56942fcb46e67ba203f3009b30d950
     """
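Every optimizer hunk in this patch touches the same 'caution' masking pattern from the Cautious Optimizers paper (https://huggingface.co/papers/2411.16085); only the reference URL in the comment changes. For reviewers unfamiliar with the pattern, below is a minimal standalone Python sketch of that masking step, illustrative only and not part of the patch: the helper name apply_caution and the example tensors are hypothetical, but the sign-agreement mask and the 1e-3 mean clamp mirror the lines quoted in the hunks above.

import torch

def apply_caution(update: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    # Zero the components of the update that disagree in sign with the raw gradient,
    # then rescale by the surviving fraction (clamped at 1e-3) so the overall step
    # magnitude stays comparable when most components are masked out.
    mask = (update * grad > 0).to(grad.dtype)
    mask.div_(mask.mean().clamp_(min=1e-3))
    return update * mask

# Example usage with dummy tensors standing in for a momentum buffer and a gradient.
update = torch.randn(8)
grad = torch.randn(8)
cautious_update = apply_caution(update, grad)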