

news 2025/12/28 3:09:31

This post is a study log from the 365-Day Deep Learning Training Camp. Original author: K同学啊

Task: load the Week N1 .txt file and build word embeddings for it with EmbeddingBag and Embedding.

The Week N1 file is named "任务文件.txt". Its content is in Chinese and translates as:

    A fairly intuitive encoding is the dictionary sequence mentioned above. For example, for a problem with three classes, 1, 2 and 3 can represent the three classes. However, this encoding has a problem: the model may wrongly assume that some order or distance relationship exists between the classes, when in fact such relationships may not exist or may carry no practical meaning.
    To avoid this problem, one-hot encoding was introduced. Its basic idea is to map each class to a vector in which exactly one element is 1 and all other elements are 0. Each class is then independent of the others, with no order or distance relationship. For example, for three classes the following one-hot encoding can be used:

Word embedding is a natural language processing (NLP) technique that represents words as numbers so that a computer can process them. Put simply, it is a way of turning text into numeric input. The dictionary-sequence and one-hot encodings described in "Week N1: one-hot encoding case study" are the earliest word-embedding methods.

Embedding and EmbeddingBag are the PyTorch tools for word embedding. They map discrete vocabulary items into a low-dimensional continuous vector space, in which semantic relationships between words can be expressed.

1. Embedding

Embedding is the most basic word-embedding operation in PyTorch (TensorFlow provides an equivalent layer). It maps each discrete word index to a vector in a low-dimensional continuous space. The vectors are learned, so after training, related words can end up close to each other. The input is an integer tensor in which each integer is a word index; the output is a floating-point tensor holding the embedding vector of each index.

● Input shape: [batch, seqSize] (seqSize is the length of a single text)
● Output shape: [batch, seqSize, embed_dim] (embed_dim is the embedding dimension)

The embedding layer is initialized with random weights and learns an embedding for every word in the dataset. It is a flexible layer that can be used in several ways, for example:

● As part of a deep learning model, where the embedding is learned together with the model.
● To load a pretrained word-embedding model.

The embedding layer is defined as the first hidden layer of the network. Function prototype:

    torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None,
                       norm_type=2.0, scale_grad_by_freq=False, sparse=False, _weight=None,
                       _freeze=False, device=None, dtype=None)

Official API reference: Embedding (PyTorch 2.0 documentation)

Common parameters:

1. num_embeddings: vocabulary size, i.e. the largest integer index + 1.
2. embedding_dim: dimension of each word vector.

Below is a simple example that uses Embedding to convert a few variable-length samples into embedding vectors.

1.1. Custom dataset class

    import torch
    from torch import nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import DataLoader, Dataset

    class MyDataset(Dataset):
        def __init__(self, texts, labels):
            self.texts = texts
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            texts = self.texts[idx]
            labels = self.labels[idx]
            return texts, labels

1.2. Padding function

    def collate_batch(batch):
        texts, labels = zip(*batch)
        # Right-pad every sample with 0 up to the longest sample in the batch
        max_len = max(len(text) for text in texts)
        padded_texts = [F.pad(text, (0, max_len - len(text)), value=0) for text in texts]
        padded_texts = torch.stack(padded_texts)
        labels = torch.tensor(labels, dtype=torch.float).unsqueeze(1)
        return padded_texts, labels

1.3. Data and data loader

    # Three samples, each made of a different number of word indices
    text_data = [
        torch.tensor([1, 1, 1, 1], dtype=torch.long),  # sample 1
        torch.tensor([2, 2, 2], dtype=torch.long),     # sample 2
        torch.tensor([3, 3], dtype=torch.long)         # sample 3
    ]

    # Matching labels. These are placeholders: BCEWithLogitsLoss, used in 1.5,
    # expects targets in [0, 1].
    labels = torch.tensor([4, 5, 6], dtype=torch.float)

    # Create the dataset and the data loader
    my_dataset = MyDataset(text_data, labels)
    data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

    for batch in data_loader:
        print(batch)

Output:

    (tensor([[1, 1, 1, 1],
            [2, 2, 2, 0]]), tensor([[4.],
            [5.]]))
    (tensor([[3, 3]]), tensor([[6.]]))

1.4. Model

    class EmbeddingModel(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            super(EmbeddingModel, self).__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

        def forward(self, text):
            print("embedding input text:", text)
            print("embedding input text shape:", text.shape)
            embedding = self.embedding(text)
            embedding_mean = embedding.mean(dim=1)  # average the embedding vectors of each sample
            print("embedding output shape:", embedding_mean.shape)
            return self.fc(embedding_mean)

Note: with the statement embedding_mean = embedding.mean(dim=1), the embedding vectors of each sample are averaged and the result has shape [batch, embed_dim]. Without it, the embedding keeps its shape [batch, seqSize, embed_dim]. The mean is taken over padded positions as well, because the padding index 0 is looked up like any other index.

1.5. Training

    # Example vocabulary size and embedding dimension
    vocab_size = 10
    embed_dim = 6

    # Create the model
    model = EmbeddingModel(vocab_size, embed_dim)

    # A simple loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Train the model
    for epoch in range(1):  # train for 1 epoch
        for batch in data_loader:
            texts, labels = batch

            # Forward pass
            outputs = model(texts)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output:

    embedding input text: tensor([[3, 3, 0],
            [2, 2, 2]])
    embedding input text shape: torch.Size([2, 3])
    embedding output shape: torch.Size([2, 6])
    embedding input text: tensor([[1, 1, 1, 1]])
    embedding input text shape: torch.Size([1, 4])
    embedding output shape: torch.Size([1, 6])
    Epoch 1, Loss: 1.4471522569656372

2. EmbeddingBag

EmbeddingBag is a further optimization built on Embedding. Its core idea is to pool the embedding vectors of each input sequence into a single vector. It handles variable-length sequences without padding, reduces compute and memory cost because the per-token embeddings are not materialized, and returns the mean or sum of the embedding vectors of all words in a sentence. In PyTorch, the input to EmbeddingBag is an integer tensor plus an offsets tensor: each integer is a word index, and each offset is the start position of one sample in the flattened index tensor. The output is a floating-point tensor in which each row is the mean or sum of the embedding vectors of one sentence.

● Input shape: [seqsSize] (seqsSize is the total length of all texts in one batch)
● Output shape: [batch, embed_dim] (embed_dim is the embedding dimension)

Assume the raw input data is [[1, 1, 1, 1], [2, 2, 2], [3, 3]].

1. Input
○ The input is a flattened tensor of word indices (input), for example [2, 2, 2, 1, 1, 1, 1].
○ The matching offsets, for example [0, 3], give the start position of each sample in the flattened tensor.
2. Pooling
○ The embedding vectors are pooled per sample according to the offsets.
○ Pooling can be sum, mean or max; the default is mean.

Function prototype:

    torch.nn.EmbeddingBag(num_embeddings, embedding_dim, max_norm=None, norm_type=2.0,
                          scale_grad_by_freq=False, mode='mean', sparse=False, _weight=None,
                          include_last_offset=False, padding_idx=None, device=None, dtype=None)

Main parameters:

● num_embeddings (int): size of the dictionary.
● embedding_dim (int): dimension of each word vector, i.e. the length of an embedding vector.
● mode (str): how the embedding vectors are aggregated. Allowed values are 'sum', 'mean' and 'max'.
○ Suppose there is a sequence [2, 3, 1] in which each number is the index of a discrete feature, and the corresponding embedding vectors are [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.3, 0.4, 0.5]].
○ 'sum': sums all embedding vectors; the pooled vector is [0.6, 0.9, 1.2].
○ 'mean': averages all embedding vectors; the pooled vector is [0.2, 0.3, 0.4].
○ 'max': takes the element-wise maximum of all embedding vectors; the pooled vector is [0.3, 0.4, 0.5].

Below is a simple example that uses EmbeddingBag to convert the same samples into embedding vectors and compute their per-sample mean.

2.1. Custom dataset class

The imports and the MyDataset class are the same as in section 1.1.

2.2. Data and data loader

    # Three samples, each made of a different number of word indices
    text_data = [
        torch.tensor([1, 1, 1, 1], dtype=torch.long),  # sample 1
        torch.tensor([2, 2, 2], dtype=torch.long),     # sample 2
        torch.tensor([3, 3], dtype=torch.long)         # sample 3
    ]

    # Matching labels (placeholders, see the note in 1.3)
    labels = torch.tensor([4, 5, 6], dtype=torch.float)

    # Create the dataset and the data loader; the identity collate_fn keeps
    # each batch as a list of (text, label) pairs, so no padding is applied
    my_dataset = MyDataset(text_data, labels)
    data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=lambda x: x)

    for batch in data_loader:
        print(batch)

Output:

    [(tensor([1, 1, 1, 1]), tensor(4.)), (tensor([2, 2, 2]), tensor(5.))]
    [(tensor([3, 3]), tensor(6.))]

2.3. Model

    class EmbeddingBagModel(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            super(EmbeddingBagModel, self).__init__()
            self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
            self.fc = nn.Linear(embed_dim, 1)  # assume a binary classification task

        def forward(self, text, offsets):
            print("embedding_bag input text:", text)
            print("embedding_bag input text shape:", text.shape)
            embedded = self.embedding_bag(text, offsets)
            print("embedding_bag output shape:", embedded.shape)
            return self.fc(embedded)

2.4. Training

    # Example vocabulary size and embedding dimension
    vocab_size = 10
    embed_dim = 6

    # Create the model
    model = EmbeddingBagModel(vocab_size, embed_dim)

    # A simple loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Train the model
    for epoch in range(1):  # train for 1 epoch
        for batch in data_loader:
            # Flatten the batch and compute the offsets
            texts, labels = zip(*batch)
            offsets = [0] + [len(text) for text in texts[:-1]]
            offsets = torch.tensor(offsets).cumsum(dim=0)
            texts = torch.cat(texts)
            labels = torch.tensor(labels).unsqueeze(1)

            # Forward pass
            outputs = model(texts, offsets)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output:

    embedding_bag input text: tensor([1, 1, 1, 1, 2, 2, 2])
    embedding_bag input text shape: torch.Size([7])
    embedding_bag output shape: torch.Size([2, 6])
    embedding_bag input text: tensor([3, 3])
    embedding_bag input text shape: torch.Size([2])
    embedding_bag output shape: torch.Size([1, 6])
    Epoch 1, Loss: 11.76957893371582

3. Processing the txt file with Embedding

3.1. Custom dataset class

The imports and the MyDataset class are the same as in section 1.1.

3.2. Padding function

The collate_batch function is the same as in section 1.2.

3.3. Data and data loader

    import torch
    import torch.nn.functional as F
    import jieba

    # Open the txt file
    file_name = './N1/任务文件.txt'
    with open(file_name, 'r', encoding='utf-8') as file:
        context = file.read()

    # Split on whitespace: the file yields two texts
    texts = context.split()

    # Tokenize with jieba
    tokenized_texts = [list(jieba.cut(text)) for text in texts]

    # Build the vocabulary. Iteration order of a set of strings is not fixed,
    # so the indices can differ from run to run.
    word_index = {}
    index_word = {}
    for i, word in enumerate(set([word for text in tokenized_texts for word in text])):
        word_index[word] = i
        index_word[i] = word

    # Convert the texts to integer sequences
    sequences = [[word_index[word] for word in text] for text in tokenized_texts]

    # Vocabulary size
    vocab_size = len(word_index)

    # Convert the integer sequences to a multi-hot encoding (one row per text)
    one_hot_results = torch.zeros(len(texts), vocab_size)
    for i, seq in enumerate(sequences):
        one_hot_results[i, seq] = 1

    # Print the results
    print("Vocabulary:")
    print(word_index)
    print("\nTexts:")
    print(texts)
    print("\nTokenized texts:")
    print(tokenized_texts)
    print("\nText sequences:")
    print(sequences)
    print("\nOne-Hot encoding:")
    print(one_hot_results)

Output:

    Vocabulary:
    {'one': 0, '了': 1, '基本': 2, '类别': 3, '如下': 4, '实际上': 5, '-': 6, '其中': 7, '避免': 8, '情况': 9, '就是': 10, '存在': 11, '独立': 12, ')': 13, '实际意义': 14, '的': 15, '到': 16, '2': 17, '使用': 18, '提到': 19, '不同': 20, '一个': 21, '用': 22, '可以': 23, '例如': 24, '会': 25, '问题': 26, '相互': 27, '不': 28, '可能': 29, '之间': 30, '这些': 31, '编码方式': 32, '是': 33, '顺序': 34, '思想': 35, '将': 36, '分别': 37, '地': 38, '1': 39, '值': 40, '模型': 41, '有': 42, '比较': 43, '。': 44, '为': 45, '、': 46, '或': 47, '这种': 48, '0': 49, '映射': 50, '上面': 51, '只有': 52, '元素': 53, '独热': 54, '和': 55, '认为': 56, '距离': 57, '对于': 58, '称': 59, '其余': 60, '具有': 61, '编码': 62, '引入': 63, '关系': 64, '这样': 65, '为了': 66, '直观': 67, '也': 68, '字典': 69, '或者': 70, ',': 71, '三个': 72, '向量': 73, '错误': 74, '3': 75, 'hot': 76, '但是': 77, '采用': 78, '(': 79, '每个': 80, ':': 81, '序列': 82, '表示': 83, '而': 84, '一些': 85, '这': 86}

    Texts:
    ['比较直观的编码方式是采用上面提到的字典序列。例如,对于一个有三个类别的问题,可以用1、2和3分别表示这三个类别。但是,这种编码方式存在一个问题,就是模型可能会错误地认为不同类别之间存在一些顺序或距离关系,而实际上这些关系可能是不存在的或者不具有实际意义的。', '为了避免这种问题,引入了one-hot编码(也称独热编码)。one-hot编码的基本思想是将每个类别映射到一个向量,其中只有一个元素的值为1,其余元素的值为0。这样,每个类别之间就是相互独立的,不存在顺序或距离关系。例如,对于三个类别的情况,可以使用如下的one-hot编码:']

    Tokenized texts:
    [['比较', '直观', '的', '编码方式', '是', '采用', '上面', '提到', '的', '字典', '序列', '。', '例如', ',', '对于', '一个', '有', '三个', '类别', '的', '问题', ',', '可以', '用', '1', '、', '2', '和', '3', '分别', '表示', '这', '三个', '类别', '。', '但是', ',', '这种', '编码方式', '存在', '一个', '问题', ',', '就是', '模型', '可能', '会', '错误', '地', '认为', '不同', '类别', '之间', '存在', '一些', '顺序', '或', '距离', '关系', ',', '而', '实际上', '这些', '关系', '可能', '是', '不', '存在', '的', '或者', '不', '具有', '实际意义', '的', '。'], ['为了', '避免', '这种', '问题', ',', '引入', '了', 'one', '-', 'hot', '编码', '(', '也', '称', '独热', '编码', ')', '。', 'one', '-', 'hot', '编码', '的', '基本', '思想', '是', '将', '每个', '类别', '映射', '到', '一个', '向量', ',', '其中', '只有', '一个', '元素', '的', '值', '为', '1', ',', '其余', '元素', '的', '值', '为', '0', '。', '这样', ',', '每个', '类别', '之间', '就是', '相互', '独立', '的', ',', '不', '存在', '顺序', '或', '距离', '关系', '。', '例如', ',', '对于', '三个', '类别', '的', '情况', ',', '可以', '使用', '如下', '的', 'one', '-', 'hot', '编码', ':']]

    Text sequences:
    [[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72, 3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77, 71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11, 85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61, 14, 15, 44], [66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44, 0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52, 21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3, 30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3, 15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]]

    One-Hot encoding:
    tensor([[0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1.,
             0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0.,
             0., 1., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 0.,
             0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1.,
             1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1.],
            [1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0.,
             1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 0., 0., 1., 1., 1.,
             1., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 1., 1.,
             1., 0., 0., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1.,
             1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.]])

Next, the sequences are converted to tensors:

    # Convert the text sequences to PyTorch tensors
    text_data = [torch.tensor(seq, dtype=torch.long) for seq in sequences]

    # Assume the labels are some floating-point values; define them to suit the actual task
    labels = torch.tensor([1.0, 2.0], dtype=torch.float)

    # Print the results
    print("Text Data:", text_data)
    print("Labels:", labels)

Output:

    Text Data: [tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
             3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77,
            71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11,
            85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
            14, 15, 44]), tensor([66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
             0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52,
            21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3,
            30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3,
            15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81])]
    Labels: tensor([1., 2.])

Then the dataset and the data loader are created:

    # Create the dataset and the data loader
    my_dataset = MyDataset(text_data, labels)
    data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

    for batch in data_loader:
        print(batch)

Output:

    (tensor([[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
              3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77,
             71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11,
             85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
             14, 15, 44, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
              0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52,
             21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3,
             30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3,
             15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]]), tensor([[1.],
            [2.]]))

The first text has 75 tokens and is padded with nine zeros to the length of the second (84). The padding value 0 is also the index of the word 'one' in this vocabulary, so the model cannot tell padding from that word.

3.4. Model

The EmbeddingModel class is the same as in section 1.4.

3.5. Training

    # Vocabulary size and embedding dimension
    vocab_size = vocab_size  # reuse the vocabulary size computed in 3.3
    embed_dim = 10

    # Create the model
    model = EmbeddingModel(vocab_size, embed_dim)

    # A simple loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Train the model
    for epoch in range(1):  # train for 1 epoch
        for batch in data_loader:
            texts, labels = batch

            # Forward pass
            outputs = model(texts)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output:

    embedding input text: tensor([[43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
              3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77,
             71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11,
             85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
             14, 15, 44, 0, 0, 0, 0, 0, 0, 0, 0, 0],
            [66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
              0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52,
             21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3,
             30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3,
             15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]])
    embedding input text shape: torch.Size([2, 84])
    embedding output shape: torch.Size([2, 10])
    Epoch 1, Loss: 0.9843546152114868

4. Processing the txt file with EmbeddingBag

4.1. Custom dataset class

    import torch
    from torch import nn
    import torch.nn.functional as F
    import torch.optim as optim
    from torch.utils.data import DataLoader, Dataset

    class MyDataset(Dataset):
        def __init__(self, texts, labels):
            self.texts = texts
            self.labels = labels

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            # Here the raw lists are converted to tensors on access
            texts = torch.tensor(self.texts[idx], dtype=torch.long)
            labels = torch.tensor(self.labels[idx], dtype=torch.float)
            return texts, labels

4.2. Data and data loader

Loading the file, tokenizing it, building the vocabulary and the printed results are identical to section 3.3, so sequences and vocab_size are reused from there.

    # Index-encoded texts and their labels
    text_data = sequences

    # Assume there are two labels
    labels = [0, 1]

    # Create the dataset and the data loader
    my_dataset = MyDataset(text_data, labels)
    data_loader = DataLoader(my_dataset, batch_size=2, shuffle=True, collate_fn=lambda x: x)

    for batch in data_loader:
        print(batch)

Output:

    [(tensor([66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54, 62, 13, 44,
             0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73, 71, 7, 52,
            21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65, 71, 80, 3,
            30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71, 58, 72, 3,
            15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81]), tensor(1.)), (tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
             3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77,
            71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11,
            85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
            14, 15, 44]), tensor(0.))]

4.3. Model

The EmbeddingBagModel class is the same as in section 2.3.

4.4. Training

    # Vocabulary size and embedding dimension
    vocab_size = vocab_size  # reuse the vocabulary size computed in 3.3
    embed_dim = 6

    # Create the model
    model = EmbeddingBagModel(vocab_size, embed_dim)

    # A simple loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01)

    # Train the model
    for epoch in range(1):  # train for 1 epoch
        for batch in data_loader:
            # Flatten the batch and compute the offsets
            texts, labels = zip(*batch)
            offsets = [0] + [len(text) for text in texts[:-1]]
            offsets = torch.tensor(offsets).cumsum(dim=0)
            texts = torch.cat(texts)
            labels = torch.tensor(labels).unsqueeze(1)

            # Forward pass
            outputs = model(texts, offsets)
            loss = criterion(outputs, labels)

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output:

    embedding_bag input text: tensor([43, 67, 15, 32, 33, 78, 51, 19, 15, 69, 82, 44, 24, 71, 58, 21, 42, 72,
             3, 15, 26, 71, 23, 22, 39, 46, 17, 55, 75, 37, 83, 86, 72, 3, 44, 77,
            71, 48, 32, 11, 21, 26, 71, 10, 41, 29, 25, 74, 38, 56, 20, 3, 30, 11,
            85, 34, 47, 57, 64, 71, 84, 5, 31, 64, 29, 33, 28, 11, 15, 70, 28, 61,
            14, 15, 44, 66, 8, 48, 26, 71, 63, 1, 0, 6, 76, 62, 79, 68, 59, 54,
            62, 13, 44, 0, 6, 76, 62, 15, 2, 35, 33, 36, 80, 3, 50, 16, 21, 73,
            71, 7, 52, 21, 53, 15, 40, 45, 39, 71, 60, 53, 15, 40, 45, 49, 44, 65,
            71, 80, 3, 30, 10, 27, 12, 15, 71, 28, 11, 34, 47, 57, 64, 44, 24, 71,
            58, 72, 3, 15, 9, 71, 23, 18, 4, 15, 0, 6, 76, 62, 81])
    embedding_bag input text shape: torch.Size([159])
    embedding_bag output shape: torch.Size([2, 6])
    Epoch 1, Loss: 0.711330235004425
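
The offsets and mode semantics described in section 2 can also be traced by hand. The following is a minimal plain-Python sketch of what the pooling computes, not PyTorch's implementation. The helper names compute_offsets and pool_bags are hypothetical, and the small lookup table reuses the three example vectors from the mode description, with an all-zero row 0 added here as an assumed stand-in for a padding vector. The last lines show why averaging a zero-padded row, as the Embedding model's mean(dim=1) does, can differ from the per-bag mean.

```python
from itertools import accumulate


def compute_offsets(samples):
    """Return the start position of each sample in the flattened index list."""
    lengths = [len(sample) for sample in samples]
    return [0] + list(accumulate(lengths[:-1]))


def pool_bags(flat_indices, offsets, table, mode="mean"):
    """Pool the table rows of each bag with 'sum', 'mean' or 'max'."""
    ends = offsets[1:] + [len(flat_indices)]
    pooled = []
    for start, end in zip(offsets, ends):
        rows = [table[i] for i in flat_indices[start:end]]
        columns = list(zip(*rows))  # regroup values by embedding dimension
        if mode == "sum":
            pooled.append([sum(col) for col in columns])
        elif mode == "mean":
            pooled.append([sum(col) / len(col) for col in columns])
        elif mode == "max":
            pooled.append([max(col) for col in columns])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pooled


# Hypothetical lookup table: row i is the vector for index i.
# Rows 2, 3 and 1 hold the three example vectors from the mode description;
# row 0 is an assumed all-zero padding row.
table = [
    [0.0, 0.0, 0.0],
    [0.3, 0.4, 0.5],
    [0.1, 0.2, 0.3],
    [0.2, 0.3, 0.4],
]

samples = [[2, 2, 2], [1, 1, 1, 1]]
flat = [index for sample in samples for index in sample]
offsets = compute_offsets(samples)
print(flat, offsets)  # [2, 2, 2, 1, 1, 1, 1] [0, 3]

# One bag holding the sequence [2, 3, 1] from the mode description.
print(pool_bags([2, 3, 1], [0], table, mode="max"))  # [[0.3, 0.4, 0.5]]

# Mean over the real tokens only versus mean over a zero-padded row.
bag_mean = pool_bags([3, 3], [0], table, mode="mean")[0]
padded_mean = pool_bags([3, 3, 0], [0], table, mode="mean")[0]
print(bag_mean, padded_mean)
```

The two means differ because the padded position is counted in the denominator. This is one reason the Embedding and EmbeddingBag models in this post are not numerically equivalent on batches that contain padding.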
