网站开发预算表,微信公众号建设公司,网站后台页面进不去,成都医院做网站建设深入解析 Loss 减少方式#xff1a;mean 和 sum 的区别及其在大语言模型中的应用
在训练大语言模型#xff08;Large Language Models, LLM#xff09;时#xff0c;损失函数#xff08;Loss Function#xff09;的处理方式对模型的性能和优化过程有显著影响。本文以 re…深入解析 Loss 减少方式mean 和 sum 的区别及其在大语言模型中的应用
在训练大语言模型Large Language Models, LLM时损失函数Loss Function的处理方式对模型的性能和优化过程有显著影响。本文以 reduce_loss 参数为例详细探讨 mean 和 sum 两种方式的定义、适用场景及其对对话模型性能的潜在提升原因并通过代码实例加深理解。 1. 什么是 reduce_loss
reduce_loss 决定了在每个 batch 中如何对 token-level 的损失进行归一化或累加处理。常见的选项是
mean: 取每个 token 损失的平均值。sum: 将每个 token 损失直接累加。
参数定义示例在代码中通过 dataclass 定义参考来源https://github.com/allenai/open-instruct
from dataclasses import dataclass, fielddataclass
class TrainingArguments:reduce_loss: str field(defaultmean,metadata{help: (How to reduce loss over tokens. Options are mean or sum.Using sum can improve chat model performance.)},)2. mean 和 sum 的定义
2.1 mean 模式
定义将 batch 中所有 token 的损失值取平均。公式 Loss mean ∑ i 1 N Loss i N \text{Loss}_{\text{mean}} \frac{\sum_{i1}^{N} \text{Loss}_i}{N} LossmeanN∑i1NLossi 其中 ( N N N) 是当前 batch 中的 token 总数。特性每个 token 的损失对最终的 loss 贡献相等损失值与 batch 中的 token 数无关。
2.2 sum 模式
定义将 batch 中所有 token 的损失值直接累加。公式 Loss sum ∑ i 1 N Loss i \text{Loss}_{\text{sum}} \sum_{i1}^{N} \text{Loss}_i Losssumi1∑NLossi特性长序列更多 token的损失对总 loss 的贡献更大损失值直接与 token 数成正比。 3. mean 和 sum 的区别
模式特点优点缺点mean损失对 token 数归一化独立于 batch size。稳定性强适用于 token 数差异大的批次。长序列与短序列对损失的贡献相同可能弱化长序列的重要性。sum损失值与 token 总数成正比长序列贡献更大。在注重长序列表现的任务中效果更好如对话生成。损失值随 batch size 变化波动需要动态调整学习率。 4. 适用场景分析
4.1 mean
适用任务大多数语言建模任务如 GPT 或 BERT 的预训练。适用场景当训练数据中序列长度差异较大时mean 可以避免因长序列的损失值过大而导致梯度更新不均衡。
4.2 sum
适用任务对长序列表现要求较高的任务如对话生成Chat Models和长文本生成。适用场景长序列的损失占比更高从而使优化过程更加关注全局上下文的建模。 5. 为什么 sum 能提升对话模型性能
对话模型Chat Models的训练中长序列往往包含丰富的上下文信息而短序列则可能无法体现模型的上下文理解能力。在 sum 模式下
长序列的重要性增加长序列的损失对总损失的贡献更大这促使模型更关注上下文的建模。对全局一致性更敏感sum 模式下模型的优化方向更倾向于全序列的一致性特别适合需要长距离依赖的任务。
示例 假设一个 batch 包含以下两个样本
样本 A: 长度为 10损失总和为 5。样本 B: 长度为 50损失总和为 25。
计算损失贡献
mean 模式 Loss mean 5 25 10 50 0.5 \text{Loss}_{\text{mean}} \frac{5 25}{10 50} 0.5 Lossmean10505250.5 样本 A 和 B 的贡献权重相同。sum 模式 Loss sum 5 25 30 \text{Loss}_{\text{sum}} 5 25 30 Losssum52530 样本 B 的贡献权重显著增加优化更关注长序列。 6. 实战代码
以下是一个完整的训练脚本展示如何在 Hugging Face 的 transformers 框架中使用 reduce_loss 参数。
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch# 模型和数据集
model_name meta-llama/Llama-3.1-8B
dataset_name allenai/tulu-3-sft-mixturemodel AutoModelForCausalLM.from_pretrained(model_name)
tokenizer AutoTokenizer.from_pretrained(model_name, use_fastFalse)dataset load_dataset(dataset_name)
tokenized_dataset dataset.map(lambda x: tokenizer(x[text], truncationTrue, paddingmax_length), batchedTrue)
train_loader DataLoader(tokenized_dataset[train], batch_size2, shuffleTrue)# 训练设置
reduce_loss sum # 改为 mean 可对比效果
optimizer torch.optim.AdamW(model.parameters(), lr5e-6)
device torch.device(cuda if torch.cuda.is_available() else cpu)
model.to(device)# 训练循环
for epoch in range(2):for batch in train_loader:inputs torch.tensor(batch[input_ids]).to(device)labels inputs.clone()outputs model(inputs, labelslabels)if reduce_loss sum:loss outputs.loss.sum()else: # 默认 meanloss outputs.loss.mean()loss.backward()optimizer.step()optimizer.zero_grad()print(fEpoch: {epoch}, Loss: {loss.item()})7. 注意事项与优化建议 动态调整学习率 使用 sum 时由于损失值放大建议适配学习率如降低到 mean 模式的 ( 1 / N 1/N 1/N )。配合学习率调度器如 linear优化训练。 对长短序列的平衡 若长序列权重过大导致模型性能退化可结合 curriculum learning 或混合训练策略如对长短序列按比例采样。 性能评估 在验证集上关注长序列和短序列的生成性能对比。 8. 总结
reduce_loss 的选择对模型性能有直接影响
mean 更通用适合大多数语言建模任务。sum 在对话生成等长序列敏感任务中表现更优。
希望本文能为 LLM 研究人员提供思路和参考在具体任务中灵活选择合适的损失归一化方式从而提升模型性能。
Understanding the Difference Between mean and sum Loss Reduction in LLM Training
When training large language models (LLMs), the way token-level loss is reduced across a batch can significantly impact optimization and model performance. This article delves into the reduce_loss parameter, exploring the differences between mean and sum reduction modes, their definitions, use cases, and why sum might improve the performance of chat-oriented models. Practical code examples are also provided for clarity. 1. What is reduce_loss?
The reduce_loss parameter determines how the token-level loss values in a batch are aggregated. The two most common options are:
mean: Averages the loss over all tokens in a batch.sum: Sums the loss of all tokens in a batch.
Example definition (from the codebase using Python dataclass):
from dataclasses import dataclass, fielddataclass
class TrainingArguments:reduce_loss: str field(defaultmean,metadata{help: (How to reduce loss over tokens. Options are mean or sum.Using sum can improve chat model performance.)},)2. Definitions of mean and sum
2.1 mean
Definition: Averages the loss across all tokens in a batch.Formula: Loss mean ∑ i 1 N Loss i N \text{Loss}_{\text{mean}} \frac{\sum_{i1}^{N} \text{Loss}_i}{N} LossmeanN∑i1NLossi where ( N N N ) is the total number of tokens in the batch.Characteristics: The contribution of each token to the final loss is normalized, making the loss independent of the batch’s token count.
2.2 sum
Definition: Sums up the loss across all tokens in a batch.Formula: Loss sum ∑ i 1 N Loss i \text{Loss}_{\text{sum}} \sum_{i1}^{N} \text{Loss}_i Losssumi1∑NLossiCharacteristics: The total loss is proportional to the number of tokens, giving longer sequences more weight in the optimization process. 3. Key Differences Between mean and sum
Reduction ModeCharacteristicsAdvantagesDisadvantagesmeanNormalizes the loss by token count.Stable and robust for datasets with variable-length sequences.Long sequences are underweighted relative to short ones.sumLoss scales with the number of tokens.Places greater emphasis on longer sequences, improving performance in tasks requiring context modeling.Loss values vary with batch size, necessitating dynamic learning rate adjustment. 4. Use Cases for mean and sum
4.1 mean
Best Suited For: Pretraining or general language modeling tasks like GPT or BERT.Scenario: When the dataset contains sequences of widely varying lengths, mean ensures that longer sequences do not disproportionately influence gradient updates.
4.2 sum
Best Suited For: Tasks requiring high performance on long sequences, such as dialogue generation or document-level text generation.Scenario: Encourages the model to prioritize sequences with richer contexts, as their loss contributes more to the overall optimization. 5. Why Does sum Improve Chat Model Performance?
In chat-oriented models, sequences are typically longer and require the model to understand and generate coherent responses over extended contexts. Using sum mode:
Enhances Long Sequence Weighting: Longer sequences contribute more to the total loss, emphasizing the importance of context modeling.Encourages Global Consistency: By assigning more weight to longer contexts, the model better captures dependencies across the entire sequence.Balances Token Importance: Since chat models are often evaluated on dialogue-level coherence, sum ensures that tokens from the context and the response are proportionally weighted.
Example: Consider a batch with two samples:
Sample A: Sequence length 10, loss 5.Sample B: Sequence length 50, loss 25.
Loss calculations:
mean mode: Loss mean 5 25 10 50 0.5 \text{Loss}_{\text{mean}} \frac{5 25}{10 50} 0.5 Lossmean10505250.5 Both samples contribute equally to the loss.sum mode: Loss sum 5 25 30 \text{Loss}_{\text{sum}} 5 25 30 Losssum52530 Sample B contributes much more to the total loss, focusing the optimization on longer contexts. 6. Practical Implementation
Here’s a practical training script that demonstrates the use of reduce_loss in both modes.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch# Model and dataset
model_name meta-llama/Llama-3.1-8B
dataset_name allenai/tulu-3-sft-mixturemodel AutoModelForCausalLM.from_pretrained(model_name)
tokenizer AutoTokenizer.from_pretrained(model_name, use_fastFalse)dataset load_dataset(dataset_name)
tokenized_dataset dataset.map(lambda x: tokenizer(x[text], truncationTrue, paddingmax_length), batchedTrue)
train_loader DataLoader(tokenized_dataset[train], batch_size2, shuffleTrue)# Training setup
reduce_loss sum # Change to mean to compare effects
optimizer torch.optim.AdamW(model.parameters(), lr5e-6)
device torch.device(cuda if torch.cuda.is_available() else cpu)
model.to(device)# Training loop
for epoch in range(2):for batch in train_loader:inputs torch.tensor(batch[input_ids]).to(device)labels inputs.clone()outputs model(inputs, labelslabels)if reduce_loss sum:loss outputs.loss.sum()else: # Default: meanloss outputs.loss.mean()loss.backward()optimizer.step()optimizer.zero_grad()print(fEpoch: {epoch}, Loss: {loss.item()})7. Practical Considerations Learning Rate Adjustment: When using sum, the loss magnitude increases with batch size, so you may need to adjust the learning rate (e.g., scale it down by ( 1 / N 1/N 1/N )). Balancing Long and Short Sequences: Overweighting long sequences can sometimes harm generalization. Using curriculum learning or sampling strategies (e.g., proportional sampling) can help mitigate this. Validation: Evaluate model performance on both short and long sequences to confirm improvements in the intended metrics. 8. Conclusion
The choice between mean and sum loss reduction modes depends on the specific task and dataset:
Use mean for general-purpose language modeling tasks where sequence lengths vary significantly.Use sum for tasks that prioritize long-sequence performance, such as chat models or long-text generation.
Understanding and experimenting with these settings can lead to better-optimized models, particularly in the nuanced field of LLM fine-tuning.
后记
2024年12月3日16点04分于上海在GPT4o大模型辅助下完成。