Abstract

1. Paper Introduction

Research background: Vision Transformers have shown great potential in computer vision. They can capture long-range dependencies and are highly parallel, which benefits the training and inference of large models.
Existing problems: Although many works have designed efficient attention patterns, queries are not served by key-value pairs drawn from their own semantic regions; forcing all queries to attend to an insufficient set of tokens may not produce optimal results. Bi-level routing attention does serve queries with semantically relevant key-value pairs, but it may not be optimal in every case.
Purpose of the paper: Propose DeBiFormer, a vision Transformer with Deformable Bi-level Routing Attention (DBRA), which optimizes query-key-value interactions and adaptively selects semantically relevant regions.

2. Innovations

Deformable Bi-level Routing Attention (DBRA): an attention-in-attention architecture that achieves more efficient and meaningful attention allocation through deformable points and a bi-level routing mechanism.
Deformable-point-aware region partitioning: ensures that each deformable point interacts with only a small subset of key-value pairs, balancing attention between important and less important regions.
Region-to-region routing: builds a directed graph of attention relations and uses a top-k operator with a routing index matrix to keep the top-k connections of each region.

3. Method

Deformable attention module: contains an offset network that generates offsets for reference points to create deformable points. These points move toward important regions with high flexibility and efficiency, capturing more informative features.
Bi-level token-to-deformable-token attention: using the region routing matrix, each deformable query token in a region attends to all key-value pairs located in the top-k routed regions.
DeBiFormer architecture: a four-stage pyramid with overlapping patch embedding, patch merging modules, and DeBiFormer blocks, which reduce the spatial resolution of the input while increasing the number of channels, modeling cross-position relations and per-position embeddings.

4. Module Roles

Deformable Bi-level Routing Attention (DBRA) module: optimizes query-key-value interactions and adaptively selects semantically relevant regions for more efficient and meaningful attention. Through deformable points and bi-level routing, it increases the model's focus on important regions while reducing attention on less important ones.
3x3 depthwise convolution: used at the beginning of each DeBiFormer block to implicitly encode relative position information and strengthen local sensitivity.
2-ConvFFN module: used for per-position embedding, expanding the model's feature representation capacity.

5. Experimental Results

Image classification: models trained from scratch on ImageNet-1K validate DeBiFormer's effectiveness.
Semantic segmentation: fine-tuning pretrained backbones on ADE20K, DeBiFormer performs strongly, demonstrating its capability on dense prediction tasks.
Object detection and instance segmentation: with DeBiFormer as the backbone of Mask R-CNN and RetinaNet, evaluated on COCO 2017. Despite limited resources, DeBiFormer outperforms some of the most competitive existing methods on large objects.
Ablation studies: verify the effectiveness of DBRA and of DeBiFormer's top-k selection, confirming the contribution of deformable bi-level routing attention to model performance.

Summary: DeBiFormer is a new hierarchical vision Transformer designed for image classification and dense prediction tasks. With the proposed Deformable Bi-level Routing Attention (DBRA), it optimizes query-key-value interactions and adaptively selects semantically relevant regions, achieving more efficient and meaningful attention. Experiments show that DeBiFormer performs well across multiple computer vision tasks and offers insights for designing flexible, semantics-aware attention mechanisms.

This article uses the DeBiFormer model for an image classification task. With the debi_tiny variant, the model reaches 82% ACC on a plant-seedling classification task.

By reading this article carefully, you will be able to master the following key skills and knowledge:

Data augmentation strategies: basic augmentation with PyTorch's transforms library, plus advanced techniques such as CutOut, MixUp, and CutMix, which can significantly improve model generalization.
Training DeBiFormer: how to build and train DeBiFormer (or other deep models) from scratch, covering model definition, data loading, and the training loop.
Mixed precision training: how to use PyTorch's built-in mixed precision to speed up training while reducing memory consumption.
Gradient clipping: applying gradient clipping to prevent gradient explosion and keep training stable.
Data parallel (DP) training: using PyTorch's data parallelism in multi-GPU environments to accelerate large-scale model training.
Visualizing training: plotting loss and accuracy curves to monitor the model's learning progress.
Evaluation and reporting: evaluating the model on a validation set and generating a detailed report with metrics such as ACC.
Test scripts: writing a test script to run predictions on the test set and assess real-world performance.
Learning-rate scheduling: understanding and applying cosine annealing to adjust the learning rate dynamically.
Custom statistics tools: using an AverageMeter class (or similar) to track ACC, loss, and other key metrics during training for later analysis.
ACC1 vs. ACC5: the meaning and computation of Top-1 (ACC1) and Top-5 (ACC5) accuracy in image classification.
Exponential Moving Average (EMA): applying EMA during training to further improve performance on the test set.

If you are new to any of these topics and find them hard to follow, see my column "Classic Backbone Networks: In-Depth Explanations and Practice," which covers all of the above step by step from scratch.

Installing packages

Install timm with pip:

```
pip install timm
```

Both mixup augmentation and EMA use timm.

Install einops:

```
pip install einops
```

Data augmentation: Cutout and Mixup

To improve the model's generalization and performance, I added the Cutout and Mixup augmentation techniques to the preprocessing stage. Cutout randomly masks part of an image, forcing the model to learn more robust features, while Mixup blends two images and their labels to generate new training samples, increasing data diversity. Both require torchtoolbox:

```
pip install torchtoolbox
```

Cutout is applied inside transforms:

```python
from torchtoolbox.transform import Cutout

# Data preprocessing
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    Cutout(),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
])
```

For Mixup, import `from timm.data.mixup import Mixup` and define Mixup together with SoftTargetCrossEntropy:

```python
mixup_fn = Mixup(
    mixup_alpha=0.8, cutmix_alpha=1.0, cutmix_minmax=None,
    prob=0.1, switch_prob=0.5, mode='batch',
    label_smoothing=0.1, num_classes=12)

criterion_train = SoftTargetCrossEntropy()
```

Mixup is a data augmentation technique commonly used in image classification: it generates new data and labels by linearly combining two images and their corresponding labels.

Parameter details:

mixup_alpha (float): mixup alpha value; mixup is active if > 0.
cutmix_alpha (float): cutmix alpha value; cutmix is active if > 0.
cutmix_minmax (List[float]): cutmix min/max image ratio; cutmix is active and uses this instead of alpha if not None. If cutmix_minmax is set, cutmix_alpha defaults to 1.0.
prob (float): probability of applying mixup or cutmix per batch or element.
switch_prob (float): probability of switching between cutmix and mixup when both are active.
mode (str): how to apply mixup/cutmix parameters: per 'batch', per 'pair' (pairs of elements), or per 'elem' (element).
correct_lam (bool): apply lambda correction when the cutmix bbox is clipped by image borders.
label_smoothing (float): label smoothing applied to the mixed target tensor.
num_classes (int): number of target classes.
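To make the wiring above concrete, here is a minimal sketch of a training step that uses mixup_fn with SoftTargetCrossEntropy; the model, loader, and optimizer names are placeholders, not the actual train.py of this project:

```python
import torch
from timm.loss import SoftTargetCrossEntropy

def train_one_epoch(model, train_loader, optimizer, mixup_fn, device):
    # Hypothetical loop: only the mixup/loss wiring matters here.
    criterion_train = SoftTargetCrossEntropy()
    model.train()
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        # Mixup/CutMix blends the images and returns soft (mixed one-hot) targets,
        # which is why the soft-target loss replaces nn.CrossEntropyLoss.
        images, targets = mixup_fn(images, targets)
        outputs = model(images)
        loss = criterion_train(outputs, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```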
EMA

EMA (Exponential Moving Average) is a technique for model parameter optimization in deep learning: it smooths the learning process by keeping an exponential moving average of the parameters. It helps improve the model's stability and generalization, especially late in training. A summary of EMA follows.

Overview

EMA is a weighted moving average in which each new average is a weighted sum of the previous average and the current value. In deep learning, EMA is applied to parameter updates to damp rapid fluctuations during training, yielding smoother and more stable model behavior.

How it works

During training, in addition to the current model parameters, a separate copy of EMA parameters is maintained. At each training step (or every few steps), the EMA parameters are updated from the current model parameters with exponential decay. The update rule is typically:

$$\text{EMA}_{\text{new}} = \text{decay} \times \text{EMA}_{\text{old}} + (1 - \text{decay}) \times \text{model\_parameters}$$

where decay is a hyperparameter between 0 and 1 that controls the weighting between the old EMA value and the new parameter value. A larger decay means the EMA update relies more on the old value, i.e., stronger smoothing.

Advantages

Stability: EMA smooths the parameter-update process and reduces fluctuations during training, making the model more stable.
Generalization: since the EMA parameters are a smoothed version of the parameter history, they tend to capture the global trend of training; evaluating with EMA parameters often yields better generalization.
Faster convergence: although EMA does not directly accelerate training, by stabilizing the parameters it can indirectly help the model converge to a better solution sooner.

Use cases

EMA is widely used in deep learning, particularly in tasks that demand high stability and good generalization, such as image classification and object detection. It is especially useful when training large models, as it helps reduce the risk of overfitting and improves performance on unseen data.
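As a quick sanity check on the update rule above, this minimal sketch applies it to a single scalar parameter; the decay value is just an example:

```python
# Numeric demo of the EMA update rule (example values, not from the article).
decay = 0.99
ema, param = 1.0, 0.0  # old EMA value and the (frozen) current parameter

for step in range(5):
    ema = decay * ema + (1 - decay) * param
    print(f"step {step}: ema = {ema:.5f}")
# The EMA drifts slowly toward the parameter; a larger decay drifts more slowly.
```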
The concrete implementation is as follows:

```python
import logging
from collections import OrderedDict
from copy import deepcopy

import torch
import torch.nn as nn

_logger = logging.getLogger(__name__)


class ModelEma:
    def __init__(self, model, decay=0.9999, device='', resume=''):
        # make a copy of the model for accumulating moving average of weights
        self.ema = deepcopy(model)
        self.ema.eval()
        self.decay = decay
        self.device = device  # perform ema on different device from model if set
        if device:
            self.ema.to(device=device)
        self.ema_has_module = hasattr(self.ema, 'module')
        if resume:
            self._load_checkpoint(resume)
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def _load_checkpoint(self, checkpoint_path):
        checkpoint = torch.load(checkpoint_path, map_location='cpu')
        assert isinstance(checkpoint, dict)
        if 'state_dict_ema' in checkpoint:
            new_state_dict = OrderedDict()
            for k, v in checkpoint['state_dict_ema'].items():
                # ema model may have been wrapped by DataParallel, and need module prefix
                if self.ema_has_module:
                    name = 'module.' + k if not k.startswith('module') else k
                else:
                    name = k
                new_state_dict[name] = v
            self.ema.load_state_dict(new_state_dict)
            _logger.info("Loaded state_dict_ema")
        else:
            _logger.warning("Failed to find state_dict_ema, starting from loaded model weights")

    def update(self, model):
        # correct a mismatch in state dict keys
        needs_module = hasattr(model, 'module') and not self.ema_has_module
        with torch.no_grad():
            msd = model.state_dict()
            for k, ema_v in self.ema.state_dict().items():
                if needs_module:
                    k = 'module.' + k
                model_v = msd[k].detach()
                if self.device:
                    model_v = model_v.to(device=self.device)
                ema_v.copy_(ema_v * self.decay + (1. - self.decay) * model_v)
```

Adding it to the model:

```python
# Initialization
if use_ema:
    model_ema = ModelEma(
        model_ft,
        decay=model_ema_decay,
        device='cpu',
        resume=resume)

# During training, update the shadow weights right after the parameter update
def train():
    optimizer.step()
    if model_ema is not None:
        model_ema.update(model)

# Pass model_ema into the validation function
val(model_ema.ema, DEVICE, test_loader)
```

Note: for models without pretrained weights, EMA often fails to improve the score; keep this in mind.

Project structure

```
DeBiFormer_Demo
├─data1
│  ├─Black-grass
│  ├─Charlock
│  ├─Cleavers
│  ├─Common Chickweed
│  ├─Common wheat
│  ├─Fat Hen
│  ├─Loose Silky-bent
│  ├─Maize
│  ├─Scentless Mayweed
│  ├─Shepherds Purse
│  ├─Small-flowered Cranesbill
│  └─Sugar beet
├─models
│  └─debiformer.py
├─mean_std.py
├─makedata.py
├─train.py
└─test.py
```

mean_std.py: computes the mean and std values.
makedata.py: generates the dataset.
train.py: trains the DeBiFormer model under the models folder.
models: from the official code.

Computing mean and std

In deep learning, especially when working with image data, computing the mean and standard deviation (std) of the data and normalizing it is one of the key steps for accelerating convergence and improving model performance. Here is a detailed explanation of both concepts and how they help the model learn.

Mean

The mean is the sum of all values divided by their count. In image processing, we usually compute the mean for each color channel (e.g., the three channels of an RGB image) separately: across all images in the dataset, we compute the mean of the pixel values in the R channel, and likewise for the G and B channels.

Standard deviation (Std)

The standard deviation measures how spread out the data is, i.e., how far data points deviate from the mean. It is also computed per color channel. A channel with a larger std has more variable pixel values, while a channel with a smaller std is relatively stable.

Normalization

Normalization rescales the data into a small, specific interval, typically [0, 1] or [-1, 1]. In image processing, we usually normalize with the computed mean and std:

$$\text{Normalized Value} = \frac{\text{Original Value} - \text{Mean}}{\text{Std}}$$

Note: in some cases, to simplify computation and keep the data non-negative, we may instead scale the data into [0, 1] using min-max normalization rather than mean/std normalization. Here we focus on mean/std normalization because it preserves the distributional properties of the data.

Why normalize?

Faster convergence: normalized data has a similar scale across features, which helps gradient descent find the optimum faster, since gradient updates for different features stay on the same order of magnitude, avoiding slow training or gradient vanishing/explosion caused by features with overly large or small scales.
Better accuracy: normalization improves generalization because the model can more easily learn the relative relationships between features rather than being dominated by their absolute magnitudes.
Stability: normalized data is more stable, reducing fluctuations during training and helping the model converge more steadily.

How to compute and use mean and std

Compute global mean and std: compute them over the entire dataset, usually before training starts, and use these values to normalize the training, validation, and test sets.
Use library functions: many deep learning frameworks (PyTorch, TensorFlow, etc.) provide convenient functions to compute mean and std and apply them directly when normalizing a dataset.
Dynamic adjustment: in some cases, especially when the dataset is very large or continuously updated, the mean and std may need to be computed dynamically, typically using a moving average (such as EMA) to update these statistics during training.

Computing the data's mean and std and normalizing with them is a basic yet essential preprocessing step in deep learning, and it matters for convergence speed, performance, and stability. Create mean_std.py and insert the code:

```python
from torchvision.datasets import ImageFolder
import torch
from torchvision import transforms


def get_mean_and_std(train_data):
    train_loader = torch.utils.data.DataLoader(
        train_data, batch_size=1, shuffle=False, num_workers=0,
        pin_memory=True)
    mean = torch.zeros(3)
    std = torch.zeros(3)
    for X, _ in train_loader:
        for d in range(3):
            mean[d] += X[:, d, :, :].mean()
            std[d] += X[:, d, :, :].std()
    mean.div_(len(train_data))
    std.div_(len(train_data))
    return list(mean.numpy()), list(std.numpy())


if __name__ == '__main__':
    train_dataset = ImageFolder(root=r'data1', transform=transforms.ToTensor())
    print(get_mean_and_std(train_dataset))
```

Running it on the dataset structure above gives:

```
([0.3281186, 0.28937867, 0.20702125], [0.09407319, 0.09732835, 0.106712654])
```

Record this result; it will be used later.
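For example (a sketch; the exact transform pipeline in train.py may differ), the recorded statistics replace the placeholder 0.5 values in transforms.Normalize:

```python
from torchvision import transforms

# Plug the dataset-specific statistics computed by mean_std.py into Normalize.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.3281186, 0.28937867, 0.20702125],
                         [0.09407319, 0.09732835, 0.106712654])
])
```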
Generating the dataset

Our curated image classification dataset has this structure:

```
data
├─Black-grass
├─Charlock
├─Cleavers
├─Common Chickweed
├─Common wheat
├─Fat Hen
├─Loose Silky-bent
├─Maize
├─Scentless Mayweed
├─Shepherds Purse
├─Small-flowered Cranesbill
└─Sugar beet
```

The default loading convention in PyTorch and Keras is the ImageNet dataset format:

```
├─data
│  ├─val
│  │  ├─Black-grass
│  │  ├─Charlock
│  │  ├─Cleavers
│  │  ├─Common Chickweed
│  │  ├─Common wheat
│  │  ├─Fat Hen
│  │  ├─Loose Silky-bent
│  │  ├─Maize
│  │  ├─Scentless Mayweed
│  │  ├─Shepherds Purse
│  │  ├─Small-flowered Cranesbill
│  │  └─Sugar beet
│  └─train
│     ├─Black-grass
│     ├─Charlock
│     ├─Cleavers
│     ├─Common Chickweed
│     ├─Common wheat
│     ├─Fat Hen
│     ├─Loose Silky-bent
│     ├─Maize
│     ├─Scentless Mayweed
│     ├─Shepherds Purse
│     ├─Small-flowered Cranesbill
│     └─Sugar beet
```

Add a format-conversion script makedata.py and insert the code:

```python
import glob
import os
import shutil

image_list = glob.glob('data1/*/*.png')
print(image_list)
file_dir = 'data'
if os.path.exists(file_dir):
    print('true')
    # os.rmdir(file_dir)
    shutil.rmtree(file_dir)  # delete, then recreate
    os.makedirs(file_dir)
else:
    os.makedirs(file_dir)

from sklearn.model_selection import train_test_split

trainval_files, val_files = train_test_split(image_list, test_size=0.3, random_state=42)

train_dir = 'train'
val_dir = 'val'
train_root = os.path.join(file_dir, train_dir)
val_root = os.path.join(file_dir, val_dir)

for file in trainval_files:
    file_class = file.replace("\\", "/").split('/')[-2]
    file_name = file.replace("\\", "/").split('/')[-1]
    file_class = os.path.join(train_root, file_class)
    if not os.path.isdir(file_class):
        os.makedirs(file_class)
    shutil.copy(file, file_class + '/' + file_name)

for file in val_files:
    file_class = file.replace("\\", "/").split('/')[-2]
    file_name = file.replace("\\", "/").split('/')[-1]
    file_class = os.path.join(val_root, file_class)
    if not os.path.isdir(file_class):
        os.makedirs(file_class)
    shutil.copy(file, file_class + '/' + file_name)
```

With the steps above complete, training and testing can begin.
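Once the train/val folders exist, the standard way to consume them is torchvision's ImageFolder; this is a sketch (train.py's actual loader settings may differ), reusing the `transform` defined earlier:

```python
import torch
from torchvision import datasets

# ImageFolder maps each subdirectory name to a class index automatically.
train_dataset = datasets.ImageFolder(root='data/train', transform=transform)
val_dataset = datasets.ImageFolder(root='data/val', transform=transform)

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_dataset, batch_size=16, shuffle=False)
print(train_dataset.class_to_idx)  # verify the 12 seedling classes were picked up
```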
DeBiFormer code

```python
import numbers
from collections import OrderedDict, defaultdict
from typing import Tuple

import gc
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch import Tensor
from torchvision import datasets, transforms

from einops import rearrange
from einops.layers.torch import Rearrange
from fairscale.nn.checkpoint import checkpoint_wrapper
from timm.models import register_model
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from timm.models.vision_transformer import _cfg


class LayerNorm2d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        x = rearrange(x, 'N C H W -> N H W C')
        x = self.ln(x)
        x = rearrange(x, 'N H W C -> N C H W')
        return x


def init_linear(m):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.LayerNorm):
        nn.init.constant_(m.bias, 0)
        nn.init.constant_(m.weight, 1.0)


def to_4d(x, h, w):
    return rearrange(x, 'b (h w) c -> b c h w', h=h, w=w)

# def to_4d(x, s, h, w):
#     return rearrange(x, 'b (s h w) c -> b c s h w', s=s, h=h, w=w)


def to_3d(x):
    return rearrange(x, 'b c h w -> b (h w) c')

# def to_3d(x):
#     return rearrange(x, 'b c s h w -> b (s h w) c')


class Partial:
    def __init__(self, module, *args, **kwargs):
        self.module = module
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *args_c, **kwargs_c):
        return self.module(*args_c, *self.args, **kwargs_c, **self.kwargs)


class LayerNormChannels(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        x = x.transpose(1, -1)
        x = self.norm(x)
        x = x.transpose(-1, 1)
        return x


class LayerNormProxy(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        x = rearrange(x, 'b c h w -> b h w c')
        x = self.norm(x)
        return rearrange(x, 'b h w c -> b c h w')


class BiasFree_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(BiasFree_LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)
        assert len(normalized_shape) == 1
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        sigma = x.var(-1, keepdim=True, unbiased=False)
        return x / torch.sqrt(sigma + 1e-5) * self.weight


class WithBias_LayerNorm(nn.Module):
    def __init__(self, normalized_shape):
        super(WithBias_LayerNorm, self).__init__()
        if isinstance(normalized_shape, numbers.Integral):
            normalized_shape = (normalized_shape,)
        normalized_shape = torch.Size(normalized_shape)
        assert len(normalized_shape) == 1
        self.weight = nn.Parameter(torch.ones(normalized_shape))
        self.bias = nn.Parameter(torch.zeros(normalized_shape))
        self.normalized_shape = normalized_shape

    def forward(self, x):
        mu = x.mean(-1, keepdim=True)
        sigma = x.var(-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(sigma + 1e-5) * self.weight + self.bias


class LayerNorm(nn.Module):
    def __init__(self, dim, LayerNorm_type):
        super(LayerNorm, self).__init__()
        if LayerNorm_type == 'BiasFree':
            self.body = BiasFree_LayerNorm(dim)
        else:
            self.body = WithBias_LayerNorm(dim)

    def forward(self, x):
        h, w = x.shape[-2:]
        return to_4d(self.body(to_3d(x)), h, w)

# 3D variant kept from the original source, commented out:
# class LayerNorm(nn.Module):
#     def __init__(self, dim, LayerNorm_type):
#         super(LayerNorm, self).__init__()
#         if LayerNorm_type == 'BiasFree':
#             self.body = BiasFree_LayerNorm(dim)
#         else:
#             self.body = WithBias_LayerNorm(dim)
#     def forward(self, x):
#         s, h, w = x.shape[-3:]
#         return to_4d(self.body(to_3d(x)), s, h, w)


class DWConv(nn.Module):
    def __init__(self, dim=768):
        super(DWConv, self).__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, 1, 1, bias=True, groups=dim)

    def forward(self, x):
        """x: NHWC tensor"""
        x = x.permute(0, 3, 1, 2)  # NCHW
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # NHWC
        return x


class ConvFFN(nn.Module):
    def __init__(self, dim=768):
        super(ConvFFN, self).__init__()  # fixed: original mistakenly called super(DWConv, ...)
        self.dwconv = nn.Conv2d(dim, dim, 1, 1, 0)

    def forward(self, x):
        """x: NHWC tensor"""
        x = x.permute(0, 3, 1, 2)  # NCHW
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # NHWC
        return x


class Attention(nn.Module):
    """vanilla attention"""
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x):
        """
        args:
            x: NHWC tensor
        return:
            NHWC tensor
        """
        _, H, W, _ = x.size()
        x = rearrange(x, 'n h w c -> n (h w) c')
        #######################################
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        #######################################
        x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
        return x


class AttentionLePE(nn.Module):
    """vanilla attention with LePE"""
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = qk_scale or head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv // 2, groups=dim) if side_dwconv > 0 else \
            lambda x: torch.zeros_like(x)

    def forward(self, x):
        """
        args:
            x: NHWC tensor
        return:
            NHWC tensor
        """
        _, H, W, _ = x.size()
        x = rearrange(x, 'n h w c -> n (h w) c')
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]
        lepe = self.lepe(rearrange(x, 'n (h w) c -> n c h w', h=H, w=W))
        lepe = rearrange(lepe, 'n c h w -> n (h w) c')
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = x + lepe
        x = self.proj(x)
        x = self.proj_drop(x)
        x = rearrange(x, 'n (h w) c -> n h w c', h=H, w=W)
        return x


class nchwAttentionLePE(nn.Module):
    """Attention with LePE, takes nchw input"""
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0., side_dwconv=5):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = qk_scale or self.head_dim ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=qkv_bias)
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_drop = nn.Dropout(proj_drop)
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv // 2, groups=dim) if side_dwconv > 0 else \
            lambda x: torch.zeros_like(x)

    def forward(self, x: torch.Tensor):
        """
        args:
            x: NCHW tensor
        return:
            NCHW tensor
        """
        B, C, H, W = x.size()
        q, k, v = self.qkv.forward(x).chunk(3, dim=1)  # B, C, H, W
        attn = q.view(B, self.num_heads, self.head_dim, H * W).transpose(-1, -2) @ \
            k.view(B, self.num_heads, self.head_dim, H * W)
        attn = torch.softmax(attn * self.scale, dim=-1)
        attn = self.attn_drop(attn)
        # (B, nhead, HW, HW) @ (B, nhead, HW, head_dim) -> (B, nhead, HW, head_dim)
        output: torch.Tensor = attn @ v.view(B, self.num_heads, self.head_dim, H * W).transpose(-1, -2)
        output = output.permute(0, 1, 3, 2).reshape(B, C, H, W)
        output = output + self.lepe(v)
        output = self.proj_drop(self.proj(output))
        return output


class TopkRouting(nn.Module):
    """
    differentiable topk routing with scaling
    Args:
        qk_dim: int, feature dimension of query and key
        topk: int, the topk
        qk_scale: int or None, temperature (multiply) of softmax activation
        with_param: bool, wether inorporate learnable params in routing unit
        diff_routing: bool, wether make routing differentiable
        soft_routing: bool, wether make output value multiplied by routing weights
    """
    def __init__(self, qk_dim, topk=4, qk_scale=None, param_routing=False, diff_routing=False):
        super().__init__()
        self.topk = topk
        self.qk_dim = qk_dim
        self.scale = qk_scale or qk_dim ** -0.5
        self.diff_routing = diff_routing
        # TODO: norm layer before/after linear?
        self.emb = nn.Linear(qk_dim, qk_dim) if param_routing else nn.Identity()
        # routing activation
        self.routing_act = nn.Softmax(dim=-1)

    def forward(self, query: Tensor, key: Tensor) -> Tuple[Tensor]:
        """
        Args:
            q, k: (n, p^2, c) tensor
        Return:
            r_weight, topk_index: (n, p^2, topk) tensor
        """
        if not self.diff_routing:
            query, key = query.detach(), key.detach()
        query_hat, key_hat = self.emb(query), self.emb(key)  # per-window pooling -> (n, p^2, c)
        attn_logit = (query_hat * self.scale) @ key_hat.transpose(-2, -1)  # (n, p^2, p^2)
        topk_attn_logit, topk_index = torch.topk(attn_logit, k=self.topk, dim=-1)  # (n, p^2, k), (n, p^2, k)
        r_weight = self.routing_act(topk_attn_logit)  # (n, p^2, k)
        return r_weight, topk_index


class KVGather(nn.Module):
    def __init__(self, mul_weight='none'):
        super().__init__()
        assert mul_weight in ['none', 'soft', 'hard']
        self.mul_weight = mul_weight

    def forward(self, r_idx: Tensor, r_weight: Tensor, kv: Tensor):
        """
        r_idx: (n, p^2, topk) tensor
        r_weight: (n, p^2, topk) tensor
        kv: (n, p^2, w^2, c_kq+c_v)
        Return:
            (n, p^2, topk, w^2, c_kq+c_v) tensor
        """
        # select kv according to routing index
        n, p2, w2, c_kv = kv.size()
        topk = r_idx.size(-1)
        # print(r_idx.size(), r_weight.size())
        # FIXME: gather consumes much memory (topk times redundancy), write cuda kernel?
        topk_kv = torch.gather(kv.view(n, 1, p2, w2, c_kv).expand(-1, p2, -1, -1, -1),  # (n, p^2, p^2, w^2, c_kv) without mem cpy
                               dim=2,
                               index=r_idx.view(n, p2, topk, 1, 1).expand(-1, -1, -1, w2, c_kv))  # (n, p^2, k, w^2, c_kv)
        if self.mul_weight == 'soft':
            topk_kv = r_weight.view(n, p2, topk, 1, 1) * topk_kv  # (n, p^2, k, w^2, c_kv)
        elif self.mul_weight == 'hard':
            raise NotImplementedError('differentiable hard routing TBA')
        # else: 'none', do nothing
        return topk_kv


class QKVLinear(nn.Module):
    def __init__(self, dim, qk_dim, bias=True):
        super().__init__()
        self.dim = dim
        self.qk_dim = qk_dim
        self.qkv = nn.Linear(dim, qk_dim + qk_dim + dim, bias=bias)

    def forward(self, x):
        q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim + self.dim], dim=-1)
        return q, kv
        # q, k, v = self.qkv(x).split([self.qk_dim, self.qk_dim, self.dim], dim=-1)
        # return q, k, v


class QKVConv(nn.Module):
    def __init__(self, dim, qk_dim, bias=True):
        super().__init__()
        self.dim = dim
        self.qk_dim = qk_dim
        self.qkv = nn.Conv2d(dim, qk_dim + qk_dim + dim, 1, 1, 0)

    def forward(self, x):
        q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim + self.dim], dim=1)
        return q, kv


class BiLevelRoutingAttention(nn.Module):
    """
    n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
    kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window.
                Similar to n_win, the actual number is kv_per_win*kv_per_win.
    topk: topk for window filtering
    param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
    param_routing: extra linear for routing
    diff_routing: wether to set routing differentiable
    soft_routing: wether to multiply soft routing weights
    """
    def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False,
                 side_dwconv=3, auto_pad=False):
        super().__init__()
        # local attention setting
        self.dim = dim
        self.n_win = n_win  # Wh, Ww
        self.num_heads = num_heads
        self.qk_dim = qk_dim or dim
        assert self.qk_dim % num_heads == 0 and self.dim % num_heads == 0, 'qk_dim and dim must be divisible by num_heads!'
        self.scale = qk_scale or self.qk_dim ** -0.5

        ################ side_dwconv (i.e. LCE in ShuntedTransformer) ###########
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv // 2, groups=dim) if side_dwconv > 0 else \
            lambda x: torch.zeros_like(x)

        ################ global routing setting #################
        self.topk = topk
        self.param_routing = param_routing
        self.diff_routing = diff_routing
        self.soft_routing = soft_routing
        # router
        assert not (self.param_routing and not self.diff_routing)  # cannot be with_param=True and diff_routing=False
        self.router = TopkRouting(qk_dim=self.qk_dim,
                                  qk_scale=self.scale,
                                  topk=self.topk,
                                  diff_routing=self.diff_routing,
                                  param_routing=self.param_routing)
        if self.soft_routing:  # soft routing, always diffrentiable (if no detach)
            mul_weight = 'soft'
        elif self.diff_routing:  # hard differentiable routing
            mul_weight = 'hard'
        else:  # hard non-differentiable routing
            mul_weight = 'none'
        self.kv_gather = KVGather(mul_weight=mul_weight)

        # qkv mapping (shared by both global routing and local attention)
        self.param_attention = param_attention
        if self.param_attention == 'qkvo':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Linear(dim, dim)
        elif self.param_attention == 'qkv':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Identity()
        else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')

        self.kv_downsample_mode = kv_downsample_mode
        self.kv_per_win = kv_per_win
        self.kv_downsample_ratio = kv_downsample_ratio
        self.kv_downsample_kenel = kv_downsample_kernel
        if self.kv_downsample_mode == 'ada_avgpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'ada_maxpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'maxpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'avgpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'identity':  # no kv downsampling
            self.kv_down = nn.Identity()
        elif self.kv_downsample_mode == 'fracpool':
            # TODO: fracpool
            # 1. kernel size should be input size dependent
            # 2. there is a random factor, need to avoid independent sampling for k and v
            raise NotImplementedError('fracpool policy is not implemented yet!')
        elif kv_downsample_mode == 'conv':
            # TODO: need to consider the case where k != v so that need two downsample modules
            raise NotImplementedError('conv policy is not implemented yet!')
        else:
            raise ValueError(f'kv_down_sample_mode {self.kv_downsample_mode} is not supported!')

        # softmax for local attention
        self.attn_act = nn.Softmax(dim=-1)
        self.auto_pad = auto_pad

    def forward(self, x, ret_attn_mask=False):
        """
        x: NHWC tensor
        Return:
            NHWC tensor
        """
        # NOTE: use padding for semantic segmentation
        ###################################################
        if self.auto_pad:
            N, H_in, W_in, C = x.size()
            pad_l = pad_t = 0
            pad_r = (self.n_win - W_in % self.n_win) % self.n_win
            pad_b = (self.n_win - H_in % self.n_win) % self.n_win
            x = F.pad(x, (0, 0,  # dim=-1
                          pad_l, pad_r,  # dim=-2
                          pad_t, pad_b))  # dim=-3
            _, H, W, _ = x.size()  # padded size
        else:
            N, H, W, C = x.size()
            # assert H % self.n_win == 0 and W % self.n_win == 0
        ###################################################

        # patchify, (n, p^2, w, w, c), keep 2d window as we need 2d pooling to reduce kv size
        x = rearrange(x, "n (j h) (i w) c -> n (j i) h w c", j=self.n_win, i=self.n_win)

        ################# qkv projection ###################
        # q: (n, p^2, w, w, c_qk)
        # kv: (n, p^2, w, w, c_qk+c_v)
        # NOTE: separte kv if there were memory leak issue caused by gather
        q, kv = self.qkv(x)  # pixel-wise qkv
        # q_pix: (n, p^2, w^2, c_qk)
        # kv_pix: (n, p^2, h_kv*w_kv, c_qk+c_v)
        q_pix = rearrange(q, 'n p2 h w c -> n p2 (h w) c')
        kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
        kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)

        q_win, k_win = q.mean([2, 3]), kv[..., 0:self.qk_dim].mean([2, 3])  # window-wise qk, (n, p^2, c_qk)

        ################## side_dwconv (lepe) ##################
        # NOTE: call contiguous to avoid gradient warning when using ddp
        lepe = self.lepe(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())
        lepe = rearrange(lepe, 'n c (j h) (i w) -> n (j h) (i w) c', j=self.n_win, i=self.n_win)

        ############ gather q dependent k/v #################
        r_weight, r_idx = self.router(q_win, k_win)  # both are (n, p^2, topk) tensors
        kv_pix_sel = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix)  # (n, p^2, topk, h_kv*w_kv, c_qk+c_v)
        k_pix_sel, v_pix_sel = kv_pix_sel.split([self.qk_dim, self.dim], dim=-1)
        # k_pix_sel: (n, p^2, topk, h_kv*w_kv, c_qk)
        # v_pix_sel: (n, p^2, topk, h_kv*w_kv, c_v)

        ######### do attention as normal ####################
        k_pix_sel = rearrange(k_pix_sel, 'n p2 k w2 (m c) -> (n p2) m c (k w2)', m=self.num_heads)  # (n*p^2, m, c_kq//m, topk*h_kv*w_kv)
        v_pix_sel = rearrange(v_pix_sel, 'n p2 k w2 (m c) -> (n p2) m (k w2) c', m=self.num_heads)  # (n*p^2, m, topk*h_kv*w_kv, c_v//m)
        q_pix = rearrange(q_pix, 'n p2 w2 (m c) -> (n p2) m w2 c', m=self.num_heads)  # (n*p^2, m, w^2, c_qk//m)

        # param-free multihead attention
        attn_weight = (q_pix * self.scale) @ k_pix_sel  # (n*p^2, m, w^2, topk*h_kv*w_kv)
        attn_weight = self.attn_act(attn_weight)
        out = attn_weight @ v_pix_sel  # (n*p^2, m, w^2, c)
        out = rearrange(out, '(n j i) m (h w) c -> n (j h) (i w) (m c)', j=self.n_win, i=self.n_win,
                        h=H // self.n_win, w=W // self.n_win)

        out = out + lepe
        # output linear
        out = self.wo(out)

        # NOTE: use padding for semantic segmentation
        # crop padded region
        if self.auto_pad and (pad_r > 0 or pad_b > 0):
            out = out[:, :H_in, :W_in, :].contiguous()

        if ret_attn_mask:
            return out, r_weight, r_idx, attn_weight
        else:
            return out


class TransformerMLPWithConv(nn.Module):
    def __init__(self, channels, expansion, drop):
        super().__init__()
        self.dim1 = channels
        self.dim2 = channels * expansion
        self.linear1 = nn.Sequential(
            nn.Conv2d(self.dim1, self.dim2, 1, 1, 0),
            # nn.GELU(),
            # nn.BatchNorm2d(self.dim2, eps=1e-5)
        )
        self.drop1 = nn.Dropout(drop, inplace=True)
        self.act = nn.GELU()
        # self.bn = nn.BatchNorm2d(self.dim2, eps=1e-5)
        self.linear2 = nn.Sequential(
            nn.Conv2d(self.dim2, self.dim1, 1, 1, 0),
            # nn.BatchNorm2d(self.dim1, eps=1e-5)
        )
        self.drop2 = nn.Dropout(drop, inplace=True)
        self.dwc = nn.Conv2d(self.dim2, self.dim2, 3, 1, 1, groups=self.dim2)

    def forward(self, x):
        x = self.linear1(x)
        x = self.drop1(x)
        x = x + self.dwc(x)
        x = self.act(x)
        # x = self.bn(x)
        x = self.linear2(x)
        x = self.drop2(x)
        return x


class DeBiLevelRoutingAttention(nn.Module):
    """
    n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
    kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window.
                Similar to n_win, the actual number is kv_per_win*kv_per_win.
    topk: topk for window filtering
    param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
    param_routing: extra linear for routing
    diff_routing: wether to set routing differentiable
    soft_routing: wether to multiply soft routing weights
    """
    def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False,
                 side_dwconv=3, auto_pad=False, param_size='small'):
        super().__init__()
        # local attention setting
        self.dim = dim
        self.n_win = n_win  # Wh, Ww
        self.num_heads = num_heads
        self.qk_dim = qk_dim or dim

        # per-stage deformable settings (groups, kernel, stride, expansion, query size)
        #############################################################
        if param_size == 'tiny':
            if self.dim == 64:
                self.n_groups = 1
                self.top_k_def = 16  # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size = to_2tuple(56)
            if self.dim == 128:
                self.n_groups = 2
                self.top_k_def = 16  # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size = to_2tuple(28)
            if self.dim == 256:
                self.n_groups = 4
                self.top_k_def = 4  # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size = to_2tuple(14)
            if self.dim == 512:
                self.n_groups = 8
                self.top_k_def = 49  # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 3
                self.q_size = to_2tuple(7)
        #############################################################
        if param_size == 'small':
            if self.dim == 64:
                self.n_groups = 1
                self.top_k_def = 16  # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size = to_2tuple(56)
            if self.dim == 128:
                self.n_groups = 2
                self.top_k_def = 16  # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size = to_2tuple(28)
            if self.dim == 256:
                self.n_groups = 4
                self.top_k_def = 4  # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size = to_2tuple(14)
            if self.dim == 512:
                self.n_groups = 8
                self.top_k_def = 49  # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 1
                self.q_size = to_2tuple(7)
        #############################################################
        if param_size == 'base':
            if self.dim == 96:
                self.n_groups = 1
                self.top_k_def = 16  # 2    128
                self.kk = 9
                self.stride_def = 8
                self.expain_ratio = 3
                self.q_size = to_2tuple(56)
            if self.dim == 192:
                self.n_groups = 2
                self.top_k_def = 16  # 4    256
                self.kk = 7
                self.stride_def = 4
                self.expain_ratio = 3
                self.q_size = to_2tuple(28)
            if self.dim == 384:
                self.n_groups = 3
                self.top_k_def = 4  # 8    512
                self.kk = 5
                self.stride_def = 2
                self.expain_ratio = 3
                self.q_size = to_2tuple(14)
            if self.dim == 768:
                self.n_groups = 6
                self.top_k_def = 49  # 8    512
                self.kk = 3
                self.stride_def = 1
                self.expain_ratio = 3
                self.q_size = to_2tuple(7)

        self.q_h, self.q_w = self.q_size
        self.kv_h, self.kv_w = self.q_h // self.stride_def, self.q_w // self.stride_def
        self.n_group_channels = self.dim // self.n_groups
        self.n_group_heads = self.num_heads // self.n_groups
        self.offset_range_factor = -1
        self.head_channels = dim // num_heads

        # assert self.qk_dim % num_heads == 0 and self.dim % num_heads == 0, 'qk_dim and dim must be divisible by num_heads!'
        self.scale = qk_scale or self.qk_dim ** -0.5

        self.rpe_table = nn.Parameter(torch.zeros(self.num_heads, self.q_h * 2 - 1, self.q_w * 2 - 1))
        trunc_normal_(self.rpe_table, std=0.01)

        ################ side_dwconv (i.e. LCE in ShuntedTransformer) ###########
        self.lepe1 = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=self.stride_def, padding=side_dwconv // 2, groups=dim) if side_dwconv > 0 else \
            lambda x: torch.zeros_like(x)

        ################ global routing setting #################
        self.topk = topk
        self.param_routing = param_routing
        self.diff_routing = diff_routing
        self.soft_routing = soft_routing
        # router
        # assert not (self.param_routing and not self.diff_routing)  # cannot be with_param=True and diff_routing=False
        self.router = TopkRouting(qk_dim=self.qk_dim,
                                  qk_scale=self.scale,
                                  topk=self.topk,
                                  diff_routing=self.diff_routing,
                                  param_routing=self.param_routing)
        if self.soft_routing:  # soft routing, always diffrentiable (if no detach)
            mul_weight = 'soft'
        elif self.diff_routing:  # hard differentiable routing
            mul_weight = 'hard'
        else:  # hard non-differentiable routing
            mul_weight = 'none'
        self.kv_gather = KVGather(mul_weight=mul_weight)

        # qkv mapping (shared by both global routing and local attention)
        self.param_attention = param_attention
        if self.param_attention == 'qkvo':
            # self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.qkv_conv = QKVConv(self.dim, self.qk_dim)
            # self.wo = nn.Linear(dim, dim)
        elif self.param_attention == 'qkv':
            # self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.qkv_conv = QKVConv(self.dim, self.qk_dim)
            # self.wo = nn.Identity()
        else:
            raise ValueError(f'param_attention mode {self.param_attention} is not supported!')

        self.kv_downsample_mode = kv_downsample_mode
        self.kv_per_win = kv_per_win
        self.kv_downsample_ratio = kv_downsample_ratio
        self.kv_downsample_kenel = kv_downsample_kernel
        if self.kv_downsample_mode == 'ada_avgpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'ada_maxpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'maxpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'avgpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'identity':  # no kv downsampling
            self.kv_down = nn.Identity()
        elif self.kv_downsample_mode == 'fracpool':
            raise NotImplementedError('fracpool policy is not implemented yet!')
        elif kv_downsample_mode == 'conv':
            raise NotImplementedError('conv policy is not implemented yet!')
        else:
            raise ValueError(f'kv_down_sample_mode {self.kv_downsample_mode} is not supported!')

        self.attn_act = nn.Softmax(dim=-1)
        self.auto_pad = auto_pad

        ##########################################################################
        self.proj_q = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_k = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_v = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)
        self.unifyheads1 = nn.Conv2d(dim, dim, kernel_size=1, stride=1, padding=0)

        # offset network: depthwise conv -> LN -> GELU -> pointwise conv
        self.conv_offset_q = nn.Sequential(
            nn.Conv2d(self.n_group_channels, self.n_group_channels, (self.kk, self.kk),
                      (self.stride_def, self.stride_def), (self.kk // 2, self.kk // 2),
                      groups=self.n_group_channels, bias=False),
            LayerNormProxy(self.n_group_channels),
            nn.GELU(),
            nn.Conv2d(self.n_group_channels, 1, 1, 1, 0, bias=False),
        )

        ### FFN
        self.norm = nn.LayerNorm(dim, eps=1e-6)
        self.norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp = TransformerMLPWithConv(dim, self.expain_ratio, 0.)

    @torch.no_grad()
    def _get_ref_points(self, H_key, W_key, B, dtype, device):
        ref_y, ref_x = torch.meshgrid(
            torch.linspace(0.5, H_key - 0.5, H_key, dtype=dtype, device=device),
            torch.linspace(0.5, W_key - 0.5, W_key, dtype=dtype, device=device))
        ref = torch.stack((ref_y, ref_x), -1)
        ref[..., 1].div_(W_key).mul_(2).sub_(1)
        ref[..., 0].div_(H_key).mul_(2).sub_(1)
        ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1)  # B * g H W 2
        return ref

    @torch.no_grad()
    def _get_q_grid(self, H, W, B, dtype, device):
        ref_y, ref_x = torch.meshgrid(
            torch.arange(0, H, dtype=dtype, device=device),
            torch.arange(0, W, dtype=dtype, device=device),
            indexing='ij')
        ref = torch.stack((ref_y, ref_x), -1)
        ref[..., 1].div_(W - 1.0).mul_(2.0).sub_(1.0)
        ref[..., 0].div_(H - 1.0).mul_(2.0).sub_(1.0)
        ref = ref[None, ...].expand(B * self.n_groups, -1, -1, -1)  # B * g H W 2
        return ref

    def forward(self, x, ret_attn_mask=False):
        """
        x: NHWC tensor
        Return:
            NHWC tensor
        """
        dtype, device = x.dtype, x.device

        # NOTE: use padding for semantic segmentation
        ###################################################
        if self.auto_pad:
            N, H_in, W_in, C = x.size()
            pad_l = pad_t = 0
            pad_r = (self.n_win - W_in % self.n_win) % self.n_win
            pad_b = (self.n_win - H_in % self.n_win) % self.n_win
            x = F.pad(x, (0, 0,  # dim=-1
                          pad_l, pad_r,  # dim=-2
                          pad_t, pad_b))  # dim=-3
            _, H, W, _ = x.size()  # padded size
        else:
            N, H, W, C = x.size()
            assert H % self.n_win == 0 and W % self.n_win == 0
        ###################################################

        # q = self.proj_q_def(x)
        x_res = rearrange(x, "n h w c -> n c h w")

        ################# qkv projection ###################
        q, kv = self.qkv_conv(x.permute(0, 3, 1, 2))
        q_bi = rearrange(q, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)
        kv = rearrange(kv, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win)

        q_pix = rearrange(q_bi, 'n p2 h w c -> n p2 (h w) c')
        kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
        kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)

        ################## side_dwconv (lepe) ##################
        # NOTE: call contiguous to avoid gradient warning when using ddp
        lepe1 = self.lepe1(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())

        ################################################################
        # Offset Q
        q_off = rearrange(q, 'b (g c) h w -> (b g) c h w', g=self.n_groups, c=self.n_group_channels)
        offset_q = self.conv_offset_q(q_off).contiguous()  # B * g 2 Sg HWg
        Hk, Wk = offset_q.size(2), offset_q.size(3)
        n_sample = Hk * Wk

        if self.offset_range_factor >= 0:
            offset_range = torch.tensor([1.0 / Hk, 1.0 / Wk], device=device).reshape(1, 2, 1, 1)
            offset_q = offset_q.tanh().mul(offset_range).mul(self.offset_range_factor)

        offset_q = rearrange(offset_q, 'b p h w -> b h w p')  # B * g 2 Hg Wg -> B * g Hg Wg 2
        reference = self._get_ref_points(Hk, Wk, N, dtype, device)

        if self.offset_range_factor >= 0:
            pos_k = offset_q + reference
        else:
            pos_k = (offset_q + reference).clamp(-1., +1.)

        x_sampled_q = F.grid_sample(
            input=x_res.reshape(N * self.n_groups, self.n_group_channels, H, W),
            grid=pos_k[..., (1, 0)],  # y, x -> x, y
            mode='bilinear', align_corners=True)  # B * g, Cg, Hg, Wg

        q_sampled = x_sampled_q.reshape(N, C, Hk, Wk)
        Hg, Wg = Hk, Wk  # fixed: keep the sampled grid size defined when auto_pad is False

        ########  Bi-LEVEL Gathering
        if self.auto_pad:
            q_sampled = q_sampled.permute(0, 2, 3, 1)
            Ng, Hg, Wg, Cg = q_sampled.size()
            pad_l = pad_t = 0
            pad_rg = (self.n_win - Wg % self.n_win) % self.n_win
            pad_bg = (self.n_win - Hg % self.n_win) % self.n_win
            q_sampled = F.pad(q_sampled, (0, 0,  # dim=-1
                                          pad_l, pad_rg,  # dim=-2
                                          pad_t, pad_bg))  # dim=-3
            _, Hg, Wg, _ = q_sampled.size()  # padded size
            q_sampled = q_sampled.permute(0, 3, 1, 2)

            lepe1 = F.pad(lepe1.permute(0, 2, 3, 1), (0, 0,  # dim=-1
                                                      pad_l, pad_rg,  # dim=-2
                                                      pad_t, pad_bg))  # dim=-3
            lepe1 = lepe1.permute(0, 3, 1, 2)
            pos_k = F.pad(pos_k, (0, 0,  # dim=-1
                                  pad_l, pad_rg,  # dim=-2
                                  pad_t, pad_bg))  # dim=-3

        queries_def = self.proj_q(q_sampled)  # linear projection
        queries_def = rearrange(queries_def, "n c (j h) (i w) -> n (j i) h w c", j=self.n_win, i=self.n_win).contiguous()

        q_win, k_win = queries_def.mean([2, 3]), kv[..., 0:self.qk_dim].mean([2, 3])
        r_weight, r_idx = self.router(q_win, k_win)
        kv_gather = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix)  # (n, p^2, topk, h_kv*w_kv, c)
        k_gather, v_gather = kv_gather.split([self.qk_dim, self.dim], dim=-1)

        ### Bi-level Routing MHA
        k = rearrange(k_gather, 'n p2 k hw (m c) -> (n p2) m c (k hw)', m=self.num_heads)
        v = rearrange(v_gather, 'n p2 k hw (m c) -> (n p2) m (k hw) c', m=self.num_heads)
        q_def = rearrange(queries_def, 'n p2 h w (m c) -> (n p2) m (h w) c', m=self.num_heads)

        attn_weight = (q_def * self.scale) @ k
        attn_weight = self.attn_act(attn_weight)
        out = attn_weight @ v
        out_def = rearrange(out, '(n j i) m (h w) c -> n (m c) (j h) (i w)', j=self.n_win, i=self.n_win,
                            h=Hg // self.n_win, w=Wg // self.n_win).contiguous()

        out_def = out_def + lepe1
        out_def = self.unifyheads1(out_def)
        out_def = q_sampled + out_def
        out_def = out_def + self.mlp(self.norm2(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2))  # (N, C, H, W)

        ############################################################################
        #   Deformable Gathering
        ############################################################################
        out_def = self.norm(out_def.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        k = self.proj_k(out_def)
        v = self.proj_v(out_def)

        k_pix_sel = rearrange(k, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
        v_pix_sel = rearrange(v, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)
        q_pix = rearrange(q, 'n (m c) h w -> (n m) c (h w)', m=self.num_heads)

        attn = torch.einsum('b c m, b c n -> b m n', q_pix, k_pix_sel)  # B * h, HW, Ns
        attn = attn.mul(self.scale)

        ### Bias
        rpe_table = self.rpe_table
        rpe_bias = rpe_table[None, ...].expand(N, -1, -1, -1)
        q_grid = self._get_q_grid(H, W, N, dtype, device)
        displacement = (q_grid.reshape(N * self.n_groups, H * W, 2).unsqueeze(2) -
                        pos_k.reshape(N * self.n_groups, Hg * Wg, 2).unsqueeze(1)).mul(0.5)
        attn_bias = F.grid_sample(
            input=rearrange(rpe_bias, 'b (g c) h w -> (b g) c h w', c=self.n_group_heads, g=self.n_groups),
            grid=displacement[..., (1, 0)],
            mode='bilinear', align_corners=True)  # B * g, h_g, HW, Ns
        attn_bias = attn_bias.reshape(N * self.num_heads, H * W, Hg * Wg)
        attn = attn + attn_bias

        ###
        attn = F.softmax(attn, dim=2)
        out = torch.einsum('b m n, b c n -> b c m', attn, v_pix_sel)
        out = out.reshape(N, C, H, W).contiguous()
        out = self.proj_out(out).permute(0, 2, 3, 1)
        ############################################################################

        # NOTE: use padding for semantic segmentation
        # crop padded region
        if self.auto_pad and (pad_r > 0 or pad_b > 0):
            out = out[:, :H_in, :W_in, :].contiguous()

        if ret_attn_mask:
            return out, r_weight, r_idx, attn_weight
        else:
            return out


def get_pe_layer(emb_dim, pe_dim=None, name='none'):
    if name == 'none':
        return nn.Identity()
    else:
        raise ValueError(f'PE name {name} is not supported!')


class Block(nn.Module):
    def __init__(self, dim, drop_path=0., layer_scale_init_value=-1,
                 num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='ada_avgpool',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False,
                 mlp_ratio=4, param_size='small', mlp_dwconv=False,
                 side_dwconv=5, before_attn_dwconv=3, pre_norm=True, auto_pad=False):
        super().__init__()
        qk_dim = qk_dim or dim

        # modules
        if before_attn_dwconv > 0:
            self.pos_embed1 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv, padding=1, groups=dim)
            self.pos_embed2 = nn.Conv2d(dim, dim, kernel_size=before_attn_dwconv, padding=1, groups=dim)
        else:
            self.pos_embed1 = self.pos_embed2 = lambda x: 0  # fixed: forward uses pos_embed1/pos_embed2
        self.norm1 = nn.LayerNorm(dim, eps=1e-6)  # important to avoid attention collapsing

        # if topk > 0:
        if topk == 4:
            self.attn1 = BiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                 qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                 kv_downsample_ratio=kv_downsample_ratio,
                                                 kv_downsample_kernel=kv_downsample_kernel,
                                                 kv_downsample_mode=kv_downsample_mode,
                                                 topk=1, param_attention=param_attention, param_routing=param_routing,
                                                 diff_routing=diff_routing, soft_routing=soft_routing,
                                                 side_dwconv=side_dwconv, auto_pad=auto_pad)
            self.attn2 = DeBiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                   qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                   kv_downsample_ratio=kv_downsample_ratio,
                                                   kv_downsample_kernel=kv_downsample_kernel,
                                                   kv_downsample_mode=kv_downsample_mode,
                                                   topk=topk, param_attention=param_attention, param_routing=param_routing,
                                                   diff_routing=diff_routing, soft_routing=soft_routing,
                                                   side_dwconv=side_dwconv, auto_pad=auto_pad, param_size=param_size)
        elif topk == 8:
            self.attn1 = BiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                 qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                 kv_downsample_ratio=kv_downsample_ratio,
                                                 kv_downsample_kernel=kv_downsample_kernel,
                                                 kv_downsample_mode=kv_downsample_mode,
                                                 topk=4, param_attention=param_attention, param_routing=param_routing,
                                                 diff_routing=diff_routing, soft_routing=soft_routing,
                                                 side_dwconv=side_dwconv, auto_pad=auto_pad)
            self.attn2 = DeBiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                   qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                   kv_downsample_ratio=kv_downsample_ratio,
                                                   kv_downsample_kernel=kv_downsample_kernel,
                                                   kv_downsample_mode=kv_downsample_mode,
                                                   topk=topk, param_attention=param_attention, param_routing=param_routing,
                                                   diff_routing=diff_routing, soft_routing=soft_routing,
                                                   side_dwconv=side_dwconv, auto_pad=auto_pad,  # fixed typo: uto_pad
                                                   param_size=param_size)
        elif topk == 16:
            self.attn1 = BiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                 qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                 kv_downsample_ratio=kv_downsample_ratio,
                                                 kv_downsample_kernel=kv_downsample_kernel,
                                                 kv_downsample_mode=kv_downsample_mode,
                                                 topk=16, param_attention=param_attention, param_routing=param_routing,
                                                 diff_routing=diff_routing, soft_routing=soft_routing,
                                                 side_dwconv=side_dwconv, auto_pad=auto_pad)
            self.attn2 = DeBiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                   qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                   kv_downsample_ratio=kv_downsample_ratio,
                                                   kv_downsample_kernel=kv_downsample_kernel,
                                                   kv_downsample_mode=kv_downsample_mode,
                                                   topk=topk, param_attention=param_attention, param_routing=param_routing,
                                                   diff_routing=diff_routing, soft_routing=soft_routing,
                                                   side_dwconv=side_dwconv, auto_pad=auto_pad,  # fixed typo: uto_pad
                                                   param_size=param_size)
        elif topk == -1:
            self.attn = Attention(dim=dim)
        elif topk == -2:
            self.attn1 = DeBiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                   qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                   kv_downsample_ratio=kv_downsample_ratio,
                                                   kv_downsample_kernel=kv_downsample_kernel,
                                                   kv_downsample_mode=kv_downsample_mode,
                                                   topk=49, param_attention=param_attention, param_routing=param_routing,
                                                   diff_routing=diff_routing, soft_routing=soft_routing,
                                                   side_dwconv=side_dwconv, auto_pad=auto_pad,  # fixed typo: uto_pad
                                                   param_size=param_size)
            self.attn2 = DeBiLevelRoutingAttention(dim=dim, num_heads=num_heads, n_win=n_win, qk_dim=qk_dim,
                                                   qk_scale=qk_scale, kv_per_win=kv_per_win,
                                                   kv_downsample_ratio=kv_downsample_ratio,
                                                   kv_downsample_kernel=kv_downsample_kernel,
                                                   kv_downsample_mode=kv_downsample_mode,
                                                   topk=49, param_attention=param_attention, param_routing=param_routing,
                                                   diff_routing=diff_routing, soft_routing=soft_routing,
                                                   side_dwconv=side_dwconv, auto_pad=auto_pad,  # fixed typo: uto_pad
                                                   param_size=param_size)
        elif topk == 0:
            self.attn = nn.Sequential(
                Rearrange('n h w c -> n c h w'),  # compatiability
                nn.Conv2d(dim, dim, 1),  # pseudo qkv linear
                nn.Conv2d(dim, dim, 5, padding=2, groups=dim),  # pseudo attention
                nn.Conv2d(dim, dim, 1),  # pseudo out linear
                Rearrange('n c h w -> n h w c'))

        self.norm2 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp1 = TransformerMLPWithConv(dim, mlp_ratio, 0.)
        self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
        self.norm3 = nn.LayerNorm(dim, eps=1e-6)
        self.norm4 = nn.LayerNorm(dim, eps=1e-6)
        self.mlp2 = TransformerMLPWithConv(dim, mlp_ratio, 0.)

        # tricks: layer scale & pre_norm/post_norm
        if layer_scale_init_value > 0:
            self.use_layer_scale = True
            self.gamma1 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma2 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma3 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
            self.gamma4 = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
        else:
            self.use_layer_scale = False
        self.pre_norm = pre_norm

    def forward(self, x):
        """x: NCHW tensor"""
        # conv pos embedding
        x = x + self.pos_embed1(x)
        # permute to NHWC tensor for attention & mlp
        x = x.permute(0, 2, 3, 1)  # (N, C, H, W) -> (N, H, W, C)

        # attention & mlp
        if self.pre_norm:
            if self.use_layer_scale:
                x = x + self.drop_path1(self.gamma1 * self.attn1(self.norm1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self.gamma2 * self.mlp1(self.norm2(x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = x + self.drop_path2(self.gamma3 * self.attn2(self.norm3(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self.gamma4 * self.mlp2(self.norm4(x)))  # (N, H, W, C)
            else:
                x = x + self.drop_path1(self.attn1(self.norm1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self.mlp1(self.norm2(x).permute(0, 3, 1, 2)).permute(0, 2, 3, 1))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = x + self.drop_path2(self.attn2(self.norm3(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self.mlp2(self.norm4(x).permute(0, 3, 1, 2)).permute(0, 2, 3, 1))  # (N, H, W, C)
        else:  # https://kexue.fm/archives/9009
            if self.use_layer_scale:
                x = self.norm1(x + self.drop_path1(self.gamma1 * self.attn1(x)))  # (N, H, W, C)
                x = self.norm2(x + self.drop_path1(self.gamma2 * self.mlp1(x)))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = self.norm3(x + self.drop_path2(self.gamma3 * self.attn2(x)))  # (N, H, W, C)
                x = self.norm4(x + self.drop_path2(self.gamma4 * self.mlp2(x)))  # (N, H, W, C)
            else:
                x = self.norm1(x + self.drop_path1(self.attn1(x)))  # (N, H, W, C)
                x = x + self.drop_path1(self.mlp1(self.norm2(x).permute(0, 3, 1, 2)).permute(0, 2, 3, 1))  # (N, H, W, C)
                # conv pos embedding
                x = x + self.pos_embed2(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
                x = self.norm3(x + self.drop_path2(self.attn2(x)))  # (N, H, W, C)
                x = x + self.drop_path2(self.mlp2(self.norm4(x).permute(0, 3, 1, 2)).permute(0, 2, 3, 1))  # (N, H, W, C)

        # permute back
        x = x.permute(0, 3, 1, 2)  # (N, H, W, C) -> (N, C, H, W)
        return x


class DeBiFormer(nn.Module):
    def __init__(self, depth=[3, 4, 8, 3], in_chans=3, num_classes=1000, embed_dim=[64, 128, 320, 512],
                 head_dim=64, qk_scale=None, representation_size=None,
                 drop_path_rate=0., drop_rate=0.,
                 use_checkpoint_stages=[],
                 ########
                 n_win=7,
                 kv_downsample_mode='ada_avgpool',
                 kv_per_wins=[2, 2, -1, -1],
                 topks=[8, 8, -1, -1],
                 side_dwconv=5,
                 layer_scale_init_value=-1,
                 qk_dims=[None, None, None, None],
                 param_routing=False, diff_routing=False, soft_routing=False,
                 pre_norm=True,
                 pe=None,
                 pe_stages=[0],
                 before_attn_dwconv=3,
                 auto_pad=False,
                 #-----------------------
                 kv_downsample_kernels=[4, 2, 1, 1],
                 kv_downsample_ratios=[4, 2, 1, 1],  # -> kv_per_win = [2, 2, 2, 1]
                 mlp_ratios=[4, 4, 4, 4],
                 param_attention='qkvo',
                 param_size='small',
                 mlp_dwconv=False):
        """
        Args:
            depth (list): depth of each stage
            img_size (int, tuple): input image size
            in_chans (int): number of input channels
            num_classes (int): number of classes for classification head
            embed_dim (list): embedding dimension of each stage
            head_dim (int): head dimension
            mlp_ratio (int): ratio of mlp hidden dim to embedding dim
            qkv_bias (bool): enable bias for qkv if True
            qk_scale (float): override default qk scale of head_dim ** -0.5 if set
            representation_size (Optional[int]): enable and set representation layer (pre-logits) to this value if set
            drop_rate (float): dropout rate
            attn_drop_rate (float): attention dropout rate
            drop_path_rate (float): stochastic depth rate
            norm_layer (nn.Module): normalization layer
            conv_stem (bool): whether use overlapped patch stem
        """
        super().__init__()
        self.num_classes = num_classes
        self.num_features = self.embed_dim = embed_dim  # num_features for consistency with other models

        ############ downsample layers (patch embeddings) ######################
        self.downsample_layers = nn.ModuleList()
        # NOTE: uniformer uses two 3*3 conv, while in many other transformers this is one 7*7 conv
        stem = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim[0] // 2, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
            nn.BatchNorm2d(embed_dim[0] // 2),
            nn.GELU(),
            nn.Conv2d(embed_dim[0] // 2, embed_dim[0], kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
            nn.BatchNorm2d(embed_dim[0]),
        )
        if (pe is not None) and 0 in pe_stages:
            stem.append(get_pe_layer(emb_dim=embed_dim[0], name=pe))
        if use_checkpoint_stages:
            stem = checkpoint_wrapper(stem)
        self.downsample_layers.append(stem)

        for i in range(3):
            downsample_layer = nn.Sequential(
                nn.Conv2d(embed_dim[i], embed_dim[i + 1], kernel_size=(3, 3), stride=(2, 2), padding=(1, 1)),
                nn.BatchNorm2d(embed_dim[i + 1])
            )
            if (pe is not None) and i + 1 in pe_stages:
                downsample_layer.append(get_pe_layer(emb_dim=embed_dim[i + 1], name=pe))
            if use_checkpoint_stages:
                downsample_layer = checkpoint_wrapper(downsample_layer)
            self.downsample_layers.append(downsample_layer)
        ##########################################################################

        self.stages = nn.ModuleList()  # 4 feature resolution stages, each consisting of multiple residual blocks
        nheads = [dim // head_dim for dim in qk_dims]
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depth))]
        cur = 0
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=embed_dim[i], drop_path=dp_rates[cur + j],
                        layer_scale_init_value=layer_scale_init_value,
                        topk=topks[i],
                        num_heads=nheads[i],
                        n_win=n_win,
                        qk_dim=qk_dims[i],
                        qk_scale=qk_scale,
                        kv_per_win=kv_per_wins[i],
                        kv_downsample_ratio=kv_downsample_ratios[i],
                        kv_downsample_kernel=kv_downsample_kernels[i],
                        kv_downsample_mode=kv_downsample_mode,
                        param_attention=param_attention,
                        param_size=param_size,
                        param_routing=param_routing,
                        diff_routing=diff_routing,
                        soft_routing=soft_routing,
                        mlp_ratio=mlp_ratios[i],
                        mlp_dwconv=mlp_dwconv,
                        side_dwconv=side_dwconv,
                        before_attn_dwconv=before_attn_dwconv,
                        pre_norm=pre_norm,
                        auto_pad=auto_pad) for j in range(depth[i])],
            )
            if i in use_checkpoint_stages:
                stage = checkpoint_wrapper(stage)
            self.stages.append(stage)
            cur += depth[i]
        ##########################################################################

        self.norm = nn.BatchNorm2d(embed_dim[-1])
        # Representation layer
        if representation_size:
            self.num_features = representation_size
            self.pre_logits = nn.Sequential(OrderedDict([
                ('fc', nn.Linear(embed_dim, representation_size)),
                ('act', nn.Tanh())
            ]))
        else:
            self.pre_logits = nn.Identity()

        # Classifier head
        self.head = nn.Linear(embed_dim[-1], num_classes) if num_classes > 0 else nn.Identity()
        self.reset_parameters()

    def reset_parameters(self):
        # fixed: iterating parameters() never matches module types; iterate modules() instead
        for m in self.modules():
            if isinstance(m, (nn.Linear, nn.Conv2d)):
                nn.init.kaiming_normal_(m.weight)
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    @torch.jit.ignore
    def no_weight_decay(self):
        return {'pos_embed', 'cls_token'}

    def get_classifier(self):
        return self.head

    def reset_classifier(self, num_classes, global_pool=''):
        self.num_classes = num_classes
        self.head = nn.Linear(self.embed_dim, num_classes) if num_classes > 0 else nn.Identity()

    def forward_features(self, x):
        for i in range(4):
            x = self.downsample_layers[i](x)  # res = (56, 28, 14, 7), wins = (64, 16, 4, 1)
            x = self.stages[i](x)
        x = self.norm(x)
        x = self.pre_logits(x)
        return x

    def forward(self, x):
        x = self.forward_features(x)
        x = x.flatten(2).mean(-1)
        x = self.head(x)
        return x


@register_model
def debi_tiny(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[1, 1, 4, 1],
        embed_dim=[64, 128, 256, 512],
        mlp_ratios=[3, 3, 3, 3],
        param_size='tiny',
        drop_path_rate=0.,  # drop rate
        #------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


@register_model
def debi_small(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[2, 2, 9, 3],
        embed_dim=[64, 128, 256, 512],
        mlp_ratios=[3, 3, 3, 2],
        param_size='small',
        drop_path_rate=0.3,  # drop rate
        #------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


@register_model
def debi_base(pretrained=False, pretrained_cfg=None, **kwargs):
    model = DeBiFormer(
        depth=[2, 2, 9, 2],
        embed_dim=[96, 192, 384, 768],
        mlp_ratios=[3, 3, 3, 3],
        param_size='base',
        drop_path_rate=0.4,  # drop rate
        #------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[96, 192, 384, 768],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    return model


if __name__ == '__main__':
    from mmcv.cnn.utils import flops_counter
    model = DeBiFormer(
        depth=[2, 2, 9, 1],
        embed_dim=[64, 128, 256, 512],
        mlp_ratios=[3, 3, 3, 2],
        #------------------------------
        n_win=7,
        kv_downsample_mode='identity',
        kv_per_wins=[-1, -1, -1, -1],
        topks=[4, 8, 16, -2],
        side_dwconv=5,
        before_attn_dwconv=3,
        layer_scale_init_value=-1,
        qk_dims=[64, 128, 256, 512],
        head_dim=32,
        param_routing=False, diff_routing=False, soft_routing=False,
        pre_norm=True,
        pe=None)
    input_shape = (3, 224, 224)
    flops_counter.get_model_complexity_info(model, input_shape)
```
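As a quick smoke test of the model file above (a sketch; it assumes the file is importable as models.debiformer, per the project structure), debi_tiny can be instantiated and run on a dummy batch:

```python
import torch
from models.debiformer import debi_tiny  # path per the project structure above

model = debi_tiny()
model.eval()
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))  # one dummy 224x224 RGB image
print(logits.shape)  # expected: torch.Size([1, 1000]); train.py swaps in a 12-class head
```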