

news 2025/12/29 17:58:44

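Suppose you have just learned a new classification algorithm and want to explore it further, try different hyperparameters, and evaluate model performance, but you cannot find a good dataset to experiment with. Fortunately, scikit-learn provides make_classification(), which can generate many kinds of datasets: binary, multiclass, balanced or imbalanced, hard to classify, and so on. This article walks through examples and checks each dataset with a random forest classifier.

The make_classification function

First, the most commonly used parameters and their defaults:

- n_samples: number of samples to generate. Default 100.
- n_features: total number of numeric features. Default 20.
- n_informative: number of informative features; only these carry signal for the classification. Default 2.
- n_classes: number of class labels. Default 2.

The function returns a tuple of two NumPy arrays: the features X and the labels y. Other parameters are explained when they are used.

Generating a binary classification dataset

We start with a binary dataset, meaning the label has only two possible values, 0 and 1, so n_classes is set to 2. We want 1000 samples with 5 features, of which three are informative and the other two are redundant.

```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,    # 1000 observations
    n_features=5,      # 5 total features
    n_informative=3,   # 3 useful features
    n_classes=2,       # binary target/label
    random_state=999   # if you want the same results as mine
)
```

Next, convert the return value of make_classification into a pandas DataFrame, which is easier to analyze than raw NumPy arrays.

```python
import pandas as pd

# Create DataFrame with features as columns
dataset = pd.DataFrame(X)
# give custom names to the features
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
# Now add the label as a column
dataset['y'] = y

dataset.info()
```

Output:

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   X1      1000 non-null   float64
 1   X2      1000 non-null   float64
 2   X3      1000 non-null   float64
 3   X4      1000 non-null   float64
 4   X5      1000 non-null   float64
 5   y       1000 non-null   int64
dtypes: float64(5), int64(1)
memory usage: 47.0 KB
```

As expected, the dataset has 1000 samples with 5 features and the corresponding target label. We set **n_informative** to 3, so only three of the five features carry signal; the other two are redundant (n_redundant defaults to 2, and redundant features are linear combinations of the informative ones). Note that make_classification shuffles the feature columns by default (shuffle=True), so the informative features are not necessarily X1, X2, and X3; pass shuffle=False to keep them in the first columns. Now let's check the distinct values of the label y and their counts:

```python
dataset['y'].value_counts()
```

```
1    502
0    498
Name: y, dtype: int64
```

The label has only two possible values, so this is a binary classification dataset. The two counts are roughly equal, so the classes are fairly balanced. Next, look at the first 5 samples:

```python
dataset.head()
```

```
         X1        X2        X3        X4        X5  y
0  2.501284 -0.159155  0.672438  3.469991  0.949268  0
1  2.203247 -0.331271  0.794319  3.259963  0.832451  0
2 -1.524573 -0.870737  1.004304 -1.028624 -0.717383  1
3  1.801498  3.106336  1.490633 -0.297404 -0.607484  0
4 -0.125146  0.987915  0.880293 -0.937299 -0.626822  0
```

Classification example

The generated dataset looks good. Next, create a random forest classifier with default hyperparameters and measure its performance with cross-validation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# initialize classifier
classifier = RandomForestClassifier()

# Run cross validation with 10 folds
scores = cross_validate(
    classifier, X, y, cv=10,
    # measure score for a list of classification metrics
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

scores = pd.DataFrame(scores)
scores.mean().round(4)
```

The output is shown below. Accuracy, precision, recall, and F1 score are all close to 88%, which is decent for a model with no hyperparameter tuning.

```
fit_time          0.1201
score_time        0.0072
test_accuracy     0.8820
test_precision    0.8829
test_recall       0.8844
test_f1           0.8827
dtype: float64
```

A dataset that is hard to classify

Next, let's try to create a dataset that is not easy to classify. The following **make_classification()** parameters control the difficulty:

- flip_y: the fraction of samples whose label is reassigned at random, which adds label noise. For example, a few samples labeled 0 end up labeled 1, and vice versa. The larger the value, the more noise. Default 0.01.
- class_sep: controls the separation between classes. It is a factor that multiplies the size of the hypercube on whose vertices the class clusters are placed; the default is 1.0. The smaller the value, the harder the classification.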
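As a rough check of how these two parameters change the difficulty, the sketch below sweeps a few values of class_sep and flip_y and reports mean cross-validated accuracy. It is an illustration added here, not part of the original walkthrough: the helper name `mean_accuracy`, the grid values, the 5-fold setting, `n_estimators=50`, and the seed are all arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def mean_accuracy(class_sep, flip_y, seed=0):
    """Hypothetical helper: mean 5-fold accuracy of a random forest on a
    dataset generated with the given class_sep and flip_y."""
    X, y = make_classification(
        n_samples=1000, n_features=5, n_informative=3, n_classes=2,
        class_sep=class_sep, flip_y=flip_y, random_state=seed,
    )
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()


# Small grid: three separations, with and without heavy label noise
results = {
    (sep, flip): mean_accuracy(sep, flip)
    for sep in (0.1, 1.0, 3.0)
    for flip in (0.0, 0.3)
}

for (sep, flip), acc in sorted(results.items()):
    print(f"class_sep={sep:<4} flip_y={flip:<4} accuracy={acc:.3f}")
```

Exact numbers depend on the seed, but accuracy should generally rise as class_sep grows and fall as flip_y grows.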
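The code below uses a higher flip_y and a lower class_sep to create a challenging dataset:

```python
X, y = make_classification(
    # same as the previous section
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_classes=2,
    # flip_y - high value to add more noise
    flip_y=0.1,
    # class_sep - low value to reduce space between classes
    class_sep=0.5
)

# Check label class distribution
pd.DataFrame(y).value_counts()
```

```
1    508
0    492
dtype: int64
```

Labels 0 and 1 have almost the same number of samples, so the classes are still fairly balanced.

Classifying the harder dataset

We build a random forest model again with default hyperparameters, this time on the harder dataset:

```python
classifier = RandomForestClassifier()

scores = cross_validate(
    classifier, X, y, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

scores = pd.DataFrame(scores)
scores.mean()
```

```
fit_time          0.138662
score_time        0.007333
test_accuracy     0.756000
test_precision    0.764619
test_recall       0.760196
test_f1           0.759281
dtype: float64
```

Accuracy, precision, recall, and F1 score are now around 75 to 76%, a clear drop from the earlier 88%. The **flip_y** and **class_sep** values had the intended effect: the resulting dataset really is harder to classify.

Imbalanced dataset

The datasets created so far have roughly the same number of samples in each class. Sometimes we need an imbalanced dataset, where one class is rare. The weights parameter controls the proportion of each class. The code below uses make_classification to assign 97% of the samples to label 0 and the remaining 3% to label 1:

```python
X, y = make_classification(
    # the usual parameters
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_classes=2,
    # Set label 0 for 97% and 1 for rest 3% of observations
    weights=[0.97],
)

pd.DataFrame(y).value_counts()
```

```
0    964
1     36
dtype: int64
```

The result shows that **make_classification()** assigned about 3% of the samples (36 of 1000) to label 1, so the dataset is indeed imbalanced. The proportion does not match weights exactly because the default flip_y=0.01 randomly reassigns a few labels.

Classifying the imbalanced dataset

As in the previous section, we train a random forest with default hyperparameters on the imbalanced dataset:

```python
classifier = RandomForestClassifier()

scores = cross_validate(
    classifier, X, y, cv=10,
    scoring=['accuracy', 'precision', 'recall', 'f1']
)

scores = pd.DataFrame(scores)
scores.mean()
```

```
fit_time          0.101848
score_time        0.006896
test_accuracy     0.964000
test_precision    0.250000
test_recall       0.083333
test_f1           0.123333
dtype: float64
```

The result is interesting: accuracy is very high (96%), but precision and recall are very low (25% and 8%). This is the classic accuracy paradox, which often appears with imbalanced data: a model that almost always predicts the majority class is right most of the time yet rarely finds the minority class.

Multiclass dataset

So far the labels have had only two possible values. If you need multiclass data for your experiments, the label must take more than two values. The n_classes parameter does this:

```python
X, y = make_classification(
    # same parameters as usual
    n_samples=1000,
    n_features=5,
    n_informative=3,
    # create target label with 3 classes
    n_classes=3,
)

pd.DataFrame(y).value_counts()
```

```
1    334
2    333
0    333
dtype: int64
```

The three classes have roughly the same number of samples, so the dataset is balanced.

Imbalanced multiclass dataset

An imbalanced multiclass dataset is just as easy to create by combining n_classes and weights:

```python
X, y = make_classification(
    # same parameters as usual
    n_samples=1000,
    n_features=5,
    n_informative=3,
    # create target label with 3 classes
    n_classes=3,
    # assign 4% of rows to class 0, 48% to class 1
    # and the rest to class 2
    weights=[0.04, 0.48]
)

pd.DataFrame(y).value_counts()
```

Class 0 gets 4% of the rows, class 1 gets 48%, and the rest go to class 2. The result:

```
2    479
1    477
0     44
dtype: int64
```

Of the 1000 samples, only 44 have label 0, which matches expectations.

Summary

You have now seen how to use scikit-learn's make_classification function to generate different kinds of datasets: binary or multiclass, imbalanced, and hard to classify. For more parameters, see the official documentation. This article is based on "How to Generate Datasets Using make_classification | Proclus Academy".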
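The accuracy paradox from the imbalanced-dataset section can be made concrete by scoring a classifier that always predicts the majority class. The sketch below is an addition to the article: it assumes scikit-learn's DummyClassifier as the baseline and adds the `balanced_accuracy` scorer, and the names `X_imb`, `y_imb`, `baseline_means` and the `random_state=0` seed are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Imbalanced binary dataset: roughly 97% label 0, 3% label 1
X_imb, y_imb = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_classes=2,
    weights=[0.97], random_state=0,
)

# Baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")

baseline_scores = cross_validate(
    baseline, X_imb, y_imb, cv=10,
    scoring=["accuracy", "recall", "balanced_accuracy"],
)

# Keep only the metric columns and average them over the 10 folds
baseline_means = {
    name: values.mean()
    for name, values in baseline_scores.items()
    if name.startswith("test_")
}
print(baseline_means)
```

A baseline like this reaches high accuracy while never detecting the minority class (recall 0, balanced accuracy 0.5), which is why accuracy alone says little on imbalanced data and metrics such as recall, F1, or balanced accuracy are worth reporting alongside it.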
