
news 2026/4/9 15:07:19

Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest)

As data science and machine learning mature, more companies are using these techniques to improve operational efficiency. In this post I share how to use machine learning models to predict potential credit card customers. The project is based on code and files I put together and covers the full workflow: data preprocessing, data visualization, model training, prediction, and saving the results.

Project overview

The goal is to predict which customers are most likely to become credit card leads. We use three main models: XGBoost, LightGBM, and Random Forest. The main steps are:

1. Data preprocessing
2. Data visualization
3. Model training
4. Model prediction
5. Model saving

1. Data preprocessing

Data preprocessing is a crucial step in a machine learning project. Cleaning and preparing the data can improve the performance and accuracy of the model.

    import numpy as np  # linear algebra
    import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Load the dataset and tag every row with its origin
    df_train = pd.read_csv("dataset/train_s3TEQDk.csv")
    df_train["source"] = "train"
    df_test = pd.read_csv("dataset/test_mSzZ8RL.csv")
    df_test["source"] = "test"
    df = pd.concat([df_train, df_test], ignore_index=True)
    df.head()

Output:

       ID        Gender  Age  Region_Code  Occupation     Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead  source
    0  NNVBBKZB  Female  73   RG268        Other          X3            43       No              1045696              No         0.0      train
    1  IDD62UNG  Female  30   RG277        Salaried       X1            32       No              581988               No         0.0      train
    2  HD3DSEMC  Female  56   RG268        Self_Employed  X3            26       No              1484315              Yes        0.0      train
    3  BF3NC7KV  Male    34   RG270        Salaried       X1            19       No              470454               No         0.0      train
    4  TEASRWXV  Female  30   RG282        Salaried       X1            33       No              886787               No         0.0      train

Checking and cleaning the dataset

    # Check the columns of the dataset
    df.columns

Output:

    Index(['ID', 'Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code',
           'Vintage', 'Credit_Product', 'Avg_Account_Balance', 'Is_Active',
           'Is_Lead', 'source'],
          dtype='object')

    # Check the shape
    df.shape

Output:

    (351037, 12)

    # Check the number of unique values per column
    df.nunique()

Output:

    ID                     351037
    Gender                      2
    Age                        63
    Region_Code                35
    Occupation                  4
    Channel_Code                4
    Vintage                    66
    Credit_Product              2
    Avg_Account_Balance    162137
    Is_Active                   2
    Is_Lead                     2
    source                      2
    dtype: int64

    # Check for null values
    df.isnull().sum()

Output:

    ID                          0
    Gender                      0
    Age                         0
    Region_Code                 0
    Occupation                  0
    Channel_Code                0
    Vintage                     0
    Credit_Product          41847
    Avg_Account_Balance         0
    Is_Active                   0
    Is_Lead                105312
    source                      0
    dtype: int64

Observation: null values are present in the Credit_Product column. The 105312 nulls in Is_Lead are the test rows, which have no label.

    # Fill null values in the Credit_Product feature with a separate "NA" category
    df["Credit_Product"] = df["Credit_Product"].fillna("NA")

    # Check for null values again
    df.isnull().sum()

Output:

    ID                          0
    Gender                      0
    Age                         0
    Region_Code                 0
    Occupation                  0
    Channel_Code                0
    Vintage                     0
    Credit_Product              0
    Avg_Account_Balance         0
    Is_Active                   0
    Is_Lead                105312
    source                      0
    dtype: int64

    # Check data types and general info
    df.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 351037 entries, 0 to 351036
    Data columns (total 12 columns):
     #   Column               Non-Null Count   Dtype
    ---  ------               --------------   -----
     0   ID                   351037 non-null  object
     1   Gender               351037 non-null  object
     2   Age                  351037 non-null  int64
     3   Region_Code          351037 non-null  object
     4   Occupation           351037 non-null  object
     5   Channel_Code         351037 non-null  object
     6   Vintage              351037 non-null  int64
     7   Credit_Product       351037 non-null  object
     8   Avg_Account_Balance  351037 non-null  int64
     9   Is_Active            351037 non-null  object
     10  Is_Lead              245725 non-null  float64
     11  source               351037 non-null  object
    dtypes: float64(1), int64(3), object(8)
    memory usage: 32.1 MB

    # Map Yes/No in the Is_Active column to 1/0 and store the result as float
    df["Is_Active"] = df["Is_Active"].map({"Yes": 1.0, "No": 0.0}).astype(float)
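The two cleaning steps above (treating a missing Credit_Product as its own "NA" category, and turning Yes/No into 1.0/0.0) can be sketched without pandas. This is only an illustration on made-up rows; the helper name `clean_row` and the toy data are assumptions, not part of the project code.

```python
# Toy rows standing in for the real dataset (values are illustrative only).
rows = [
    {"Credit_Product": "No", "Is_Active": "No"},
    {"Credit_Product": None, "Is_Active": "Yes"},
    {"Credit_Product": "Yes", "Is_Active": "No"},
]


def clean_row(row):
    """Return a cleaned copy: missing Credit_Product -> "NA", Is_Active -> 1.0/0.0."""
    cleaned = dict(row)  # copy, so the input row is left untouched
    if cleaned["Credit_Product"] is None:
        cleaned["Credit_Product"] = "NA"
    cleaned["Is_Active"] = 1.0 if cleaned["Is_Active"] == "Yes" else 0.0
    return cleaned


cleaned_rows = [clean_row(r) for r in rows]
missing_after = sum(r["Credit_Product"] is None for r in cleaned_rows)

print(cleaned_rows)
print("missing after fill:", missing_after)
```

Keeping "NA" as a category, rather than dropping those rows, preserves the information that the value was missing, which a tree-based model may be able to use.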
After the conversion, df.head() shows Is_Active as a float column:

       ID        Gender  Age  Region_Code  Occupation     Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead  source
    0  NNVBBKZB  Female  73   RG268        Other          X3            43       No              1045696              0.0        0.0      train
    1  IDD62UNG  Female  30   RG277        Salaried       X1            32       No              581988               0.0        0.0      train
    2  HD3DSEMC  Female  56   RG268        Self_Employed  X3            26       No              1484315              1.0        0.0      train
    3  BF3NC7KV  Male    34   RG270        Salaried       X1            19       No              470454               0.0        0.0      train
    4  TEASRWXV  Female  30   RG282        Salaried       X1            33       No              886787               0.0        0.0      train

    # Convert all categorical columns to numeric form using label encoding
    from sklearn.preprocessing import LabelEncoder

    cat_col = ["Gender", "Region_Code", "Occupation", "Channel_Code", "Credit_Product"]
    le = LabelEncoder()
    for col in cat_col:
        df[col] = le.fit_transform(df[col])
    df_2 = df
    df_2.head()

Output:

       ID        Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead  source
    0  NNVBBKZB  0       73   18           1           2             43       1               1045696              0.0        0.0      train
    1  IDD62UNG  0       30   27           2           0             32       1               581988               0.0        0.0      train
    2  HD3DSEMC  0       56   18           3           2             26       1               1484315              1.0        0.0      train
    3  BF3NC7KV  1       34   20           2           0             19       1               470454               0.0        0.0      train
    4  TEASRWXV  0       30   32           2           0             33       1               886787               0.0        0.0      train

    # Separate the train and test rows again (copies, so later edits do not touch df_2)
    df_train = df_2.loc[df_2["source"] == "train"].copy()
    df_test = df_2.loc[df_2["source"] == "test"].copy()

    # ID and source carry no predictive information, so drop them from the training data
    df_1 = df_train.drop(columns=["ID", "source"])
    df_1.head()

Output:

       Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
    0  0       73   18           1           2             43       1               1045696              0.0        0.0
    1  0       30   27           2           0             32       1               581988               0.0        0.0
    2  0       56   18           3           2             26       1               1484315              1.0        0.0
    3  1       34   20           2           0             19       1               470454               0.0        0.0
    4  0       30   32           2           0             33       1               886787               0.0        0.0

2. Data visualization

Data visualization helps us understand the distribution and characteristics of the data. Some commonly used plots follow.

    import warnings
    warnings.filterwarnings("ignore")

    plt.rcParams["figure.figsize"] = (10, 6)
    plt.rcParams["font.size"] = 16
    sns.set_style("whitegrid")

    # Distribution of Age (histplot replaces the deprecated distplot)
    sns.histplot(df["Age"], kde=True)
    plt.show()

    # Distribution of Avg_Account_Balance
    sns.histplot(df["Avg_Account_Balance"], kde=True)
    plt.show()

    # Count plot for the Gender feature
    sns.countplot(x=df["Gender"], palette="Accent")
    plt.show()

    # Count plot for the target variable, Is_Lead
    target = "Is_Lead"
    sns.countplot(x=df[target], palette="hls")
    print(df[target].value_counts())

Output:

    0.0    187437
    1.0     58288
    Name: Is_Lead, dtype: int64

    plt.rcParams["figure.figsize"] = (12, 6)

    # Leads by occupation
    sns.countplot(x="Occupation", hue="Is_Lead", data=df, palette="magma")
    plt.show()

    # Customer activity in the last 3 months by occupation (bar height is mean Age)
    sns.catplot(y="Age", x="Occupation", hue="Is_Active", data=df, kind="bar", palette="Oranges")
    plt.show()

3. Model training

We train three models: XGBoost, LightGBM, and Random Forest. Because the classes are imbalanced, the majority class is undersampled first.

    # To balance the dataset, apply undersampling
    from sklearn.utils import resample

    # Separate the majority and minority classes
    df_majority = df_1[df_1["Is_Lead"] == 0]
    df_minority = df_1[df_1["Is_Lead"] == 1]

    print(" The majority class values are", len(df_majority))
    print(" The minority class values are", len(df_minority))
    print(" The ratio of both classes are", len(df_majority) / len(df_minority))

Output:

     The majority class values are 187437
     The minority class values are 58288
     The ratio of both classes are 3.215704776283283

    # Undersample the majority class.
    # Note: replace=True samples with replacement, so some majority rows can repeat;
    # replace=False is the usual choice for undersampling.
    df_majority_undersampled = resample(df_majority, replace=True,
                                        n_samples=len(df_minority), random_state=0)

    # Combine the minority class with the undersampled majority class
    df_undersampled = pd.concat([df_minority, df_majority_undersampled])
    df_undersampled["Is_Lead"].value_counts()
    df_1 = df_undersampled

    # Display the new class counts
    print(" The undersamples class values count is:", len(df_undersampled))
    print(" The ratio of both classes are",
          len(df_undersampled[df_undersampled["Is_Lead"] == 0])
          / len(df_undersampled[df_undersampled["Is_Lead"] == 1]))

Output:

     The undersamples class values count is: 116576
     The ratio of both classes are 1.0

    # Split features (xc) and target (yc)
    xc = df_1.drop(columns=["Is_Lead"])
    yc = df_1[["Is_Lead"]]
    df_1.head()

Output:

        Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
    6   1       62   32           1           2             20       0               1056750              1.0        1.0
    15  1       33   18           3           1             69       0               517063               1.0        1.0
    16  0       46   18           1           2             97       2               2282502              0.0        1.0
    17  0       59   33           1           2             15       2               2384692              0.0        1.0
    20  1       44   19           3           1             19       2               1001650              0.0        1.0

    # Import the libraries used for modelling and evaluation
    # (plot_roc_curve was removed from recent scikit-learn releases and is not needed here)
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
    from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                                 roc_auc_score, roc_curve, confusion_matrix,
                                 classification_report)
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB

    import warnings
    warnings.filterwarnings("ignore")

    # Standardize the features to zero mean and unit variance
    # (this rescales each column; it does not make the data normally distributed)
    sc = StandardScaler()
    df_xc = pd.DataFrame(sc.fit_transform(xc), columns=xc.columns)
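Standardization subtracts a column's mean and divides by its standard deviation. The following dependency-free sketch shows that arithmetic on made-up numbers (not project data); the variable names are illustrative. It uses the population standard deviation, which is what StandardScaler uses as well. The scaled values end up with mean 0 and standard deviation 1, but the shape of the distribution, including its skew, is unchanged.

```python
from statistics import mean, pstdev

# Made-up, right-skewed values standing in for one numeric column.
balances = [1.0, 2.0, 3.0, 4.0, 10.0]

mu = mean(balances)
sigma = pstdev(balances)  # population standard deviation (divide by n)

# z-score: subtract the mean, divide by the standard deviation.
scaled = [(value - mu) / sigma for value in balances]

print("scaled values:", [round(z, 4) for z in scaled])
print("mean after scaling:", round(mean(scaled), 10))
print("std after scaling:", round(pstdev(scaled), 10))

# The largest value is still much further from the centre than the smallest
# one, so the skew of the original column is preserved.
print("max z:", round(max(scaled), 4), "min z:", round(min(scaled), 4))
```

Tree-based models such as Random Forest, XGBoost, and LightGBM are largely insensitive to this kind of monotonic rescaling; it matters more for models like logistic regression.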
The first rows of the standardized features, df_xc.head():

       Gender     Age        Region_Code  Occupation  Channel_Code  Vintage    Credit_Product  Avg_Account_Balance  Is_Active
    0   0.871922   1.102987   1.080645    -1.254310    1.098078     -0.961192  -1.495172       -0.118958             1.192880
    1   0.871922  -0.895316  -0.208525     1.008933   -0.052828      0.484610  -1.495172       -0.748273             1.192880
    2  -1.146891   0.000475  -0.208525    -1.254310    1.098078      1.310784   1.220096        1.310361            -0.838307
    3  -1.146891   0.896266   1.172729    -1.254310    1.098078     -1.108723   1.220096        1.429522            -0.838307
    4   0.871922  -0.137339  -0.116441     1.008933   -0.052828     -0.990699   1.220096       -0.183209            -0.838307

    # Fit a model, report its ROC AUC on a hold-out split, and cross-validate it
    def max_accuracy_scr(names, model_c, df_xc, yc):
        roc_scr_max = 0
        train_xc, test_xc, train_yc, test_yc = train_test_split(
            df_xc, yc, random_state=42, test_size=0.2, stratify=yc)
        model_c.fit(train_xc, train_yc)
        pred = model_c.predict_proba(test_xc)[:, 1]
        roc_score = roc_auc_score(test_yc, pred)
        if roc_score > roc_scr_max:
            roc_scr_max = roc_score
            final_model = model_c
        # Run 5-fold cross-validation once and reuse the scores
        cross_val = cross_val_score(final_model, df_xc, yc, cv=5, scoring="accuracy")
        mean_acc = cross_val.mean()
        std_dev = cross_val.std()
        print("*" * 50)
        print("Results for model : ", names, "\n",
              "max roc score correspond to random state ", roc_scr_max, "\n",
              "Mean accuracy score is : ", mean_acc, "\n",
              "Std deviation score is : ", std_dev, "\n",
              "Cross validation scores are : ", cross_val)
        print(f"roc_auc_score: {roc_score}")
        print("*" * 50)

    # Compare several algorithms to see which performs best on this dataset
    models = []
    models.append(("Logistic Regression", LogisticRegression()))
    models.append(("Random Forest", RandomForestClassifier()))
    models.append(("Decision Tree Classifier", DecisionTreeClassifier()))
    models.append(("GausianNB", GaussianNB()))

    for names, model_c in models:
        max_accuracy_scr(names, model_c, df_xc, yc)

Output:

    **************************************************
    Results for model :  Logistic Regression
     max roc score correspond to random state  0.727315712597147
     Mean accuracy score is :  0.6696918411779096
     Std deviation score is :  0.0030322593046897828
     Cross validation scores are :  [0.67361469 0.66566588 0.66703839 0.67239974 0.66974051]
    roc_auc_score: 0.727315712597147
    **************************************************
    **************************************************
    Results for model :  Random Forest
     max roc score correspond to random state  0.8792762631904103
     Mean accuracy score is :  0.8117279862602139
     Std deviation score is :  0.002031698139189051
     Cross validation scores are :  [0.81043061 0.81162342 0.81158053 0.81115162 0.81616985]
    roc_auc_score: 0.8792762631904103
    **************************************************
    **************************************************
    Results for model :  Decision Tree Classifier
     max roc score correspond to random state  0.7397495282209642
     Mean accuracy score is :  0.7426399792028343
     Std deviation score is :  0.0025271129138200485
     Cross validation scores are :  [0.74288043 0.74162556 0.74149689 0.73870899 0.74462792]
    roc_auc_score: 0.7397495282209642
    **************************************************
    **************************************************
    Results for model :  GausianNB
     max roc score correspond to random state  0.7956111563031266
     Mean accuracy score is :  0.7158677336619202
     Std deviation score is :  0.0015884106712636206
     Cross validation scores are :  [0.71894836 0.71550504 0.71546215 0.71443277 0.71499035]
    roc_auc_score: 0.7956111563031266
    **************************************************

Random Forest gives the best ROC AUC of the four baselines, so it is tuned first.

First attempt: Random Forest Classifier

    # Estimate the best n_estimators for the Random Forest with a grid search
    parameters = {"n_estimators": [1, 10, 100]}
    rf_clf = RandomForestClassifier()
    clf = GridSearchCV(rf_clf, parameters, cv=5, scoring="roc_auc")
    clf.fit(df_xc, yc)
    print("Best parameter : ", clf.best_params_,
          "\nBest Estimator : ", clf.best_estimator_,
          "\nBest Score : ", clf.best_score_)

Output:

    Best parameter :  {'n_estimators': 100}
    Best Estimator :  RandomForestClassifier()
    Best Score :  0.8810508979668068

    # Run the Random Forest again with n_estimators=100
    rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
    max_accuracy_scr("RandomForest Classifier", rf_clf, df_xc, yc)

Output:

    **************************************************
    Results for model :  RandomForest Classifier
     max roc score correspond to random state  0.879415808805665
     Mean accuracy score is :  0.8115392510996895
     Std deviation score is :  0.0008997445291505284
     Cross validation scores are :  [0.81180305 0.81136607 0.81106584 0.81037958 0.81308171]
    roc_auc_score: 0.879415808805665
    **************************************************

    xc_train, xc_test, yc_train, yc_test = train_test_split(
        df_xc, yc, random_state=80, test_size=0.20, stratify=yc)
    rf_clf.fit(xc_train, yc_train)
    yc_pred = rf_clf.predict(xc_test)

    plt.rcParams["figure.figsize"] = (12, 8)

    # Random Forest Classifier results
    pred_pb = rf_clf.predict_proba(xc_test)[:, 1]
    Fpr, Tpr, thresholds = roc_curve(yc_test, pred_pb, pos_label=1)
    auc_score = roc_auc_score(yc_test, pred_pb)

    print(" ROC_AUC score is ", auc_score)
    print("accuracy score is : ", accuracy_score(yc_test, yc_pred))
    print("Precision is : ", precision_score(yc_test, yc_pred))
    print("Recall is: ", recall_score(yc_test, yc_pred))
    print("F1 Score is : ", f1_score(yc_test, yc_pred))
    print("classification report \n", classification_report(yc_test, yc_pred))

    # Plot the confusion matrix
    cnf = confusion_matrix(yc_test, yc_pred)
    sns.heatmap(cnf, annot=True, cmap="magma")

Output:

     ROC_AUC score is  0.8804566893762799
    accuracy score is :  0.8127466117687425
    Precision is :  0.8397949673811743
    Recall is:  0.7729456167438669
    F1 Score is :  0.8049848132928354
    classification report
                   precision    recall  f1-score   support

             0.0       0.79      0.85      0.82     11658
             1.0       0.84      0.77      0.80     11658

        accuracy                           0.81     23316
       macro avg       0.81      0.81      0.81     23316
    weighted avg       0.81      0.81      0.81     23316

    plt.rcParams["figure.figsize"] = (12, 6)

    # Plot the ROC curve; the dashed diagonal is the chance level
    plt.plot([0, 1], [0, 1], "g--")
    plt.plot(Fpr, Tpr)
    plt.xlabel("False_Positive_Rate")
    plt.ylabel("True_Positive_Rate")
    plt.title("Random Forest Classifier")
    plt.show()

Second attempt: XGBoost Classifier

    from sklearn.utils import class_weight
    from xgboost import XGBClassifier

    # Per-class weights, then one weight per training row
    classes_weights = class_weight.compute_class_weight(
        class_weight="balanced",
        classes=np.unique(yc_train["Is_Lead"]),
        y=yc_train["Is_Lead"])
    weights = np.ones(yc_train.shape[0], dtype=float)
    for i, val in enumerate(yc_train["Is_Lead"]):
        weights[i] = classes_weights[int(val)]

    # XGBClassifier has no class_weight argument, so the weights are passed
    # as sample_weight. The classes are already balanced after undersampling,
    # so every weight is 1.0 here.
    clf2 = XGBClassifier().fit(xc_train, yc_train["Is_Lead"], sample_weight=weights)
    xg_pred = clf2.predict(xc_test)

When class_weight="balanced" is passed to the XGBClassifier constructor instead, XGBoost prints two warnings: that the parameter class_weight "might not be used", and that starting in XGBoost 1.3.0 the default evaluation metric for the objective binary:logistic changed from error to logloss.

    plt.rcParams["figure.figsize"] = (12, 8)

    # XGBoost results
    xg_pred_2 = clf2.predict_proba(xc_test)[:, 1]
    Fpr, Tpr, thresholds = roc_curve(yc_test, xg_pred_2, pos_label=1)
    auc_score = roc_auc_score(yc_test, xg_pred_2)

    print(" ROC_AUC score is ", auc_score)
    print("accuracy score is : ", accuracy_score(yc_test, xg_pred))
    print("Precision is : ", precision_score(yc_test, xg_pred))
    print("Recall is: ", recall_score(yc_test, xg_pred))
    print("F1 Score is : ", f1_score(yc_test, xg_pred))
    print("classification report \n", classification_report(yc_test, xg_pred))

    cnf = confusion_matrix(yc_test, xg_pred)
    sns.heatmap(cnf, annot=True, cmap="magma")

Output:

     ROC_AUC score is  0.8706238059470456
    accuracy score is :  0.8033968090581575
    Precision is :  0.8246741325500275
    Recall is:  0.7706296105678504
    F1 Score is :  0.7967364313586378
    classification report
                   precision    recall  f1-score   support

             0.0       0.78      0.84      0.81     11658
             1.0       0.82      0.77      0.80     11658

        accuracy                           0.80     23316
       macro avg       0.80      0.80      0.80     23316
    weighted avg       0.80      0.80      0.80     23316

    plt.rcParams["figure.figsize"] = (12, 6)

    # Plot the ROC curve; the dashed diagonal is the chance level
    plt.plot([0, 1], [0, 1], "g--")
    plt.plot(Fpr, Tpr)
    plt.xlabel("False_Positive_Rate")
    plt.ylabel("True_Positive_Rate")
    plt.title("XG_Boost Classifier")
    plt.show()

Third attempt: LGBM model with stratified folds

    # Stratified k-fold training
    from sklearn.model_selection import StratifiedKFold
    import lightgbm as lgb

    def cross_val(xc, yc, model, params, folds=10):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
        for fold, (train_idx, test_idx) in enumerate(skf.split(xc, yc)):
            print(f"Fold: {fold}")
            xc_train, yc_train = xc.iloc[train_idx], yc.iloc[train_idx]
            xc_test, yc_test = xc.iloc[test_idx], yc.iloc[test_idx]
            model_c = model(**params)
            # Early stopping and periodic logging are passed as callbacks;
            # recent LightGBM releases no longer accept them as fit() arguments
            model_c.fit(xc_train, yc_train.values.ravel(),
                        eval_set=[(xc_test, yc_test.values.ravel())],
                        callbacks=[lgb.early_stopping(100), lgb.log_evaluation(300)])
            pred_y = model_c.predict_proba(xc_test)[:, 1]
            roc_score = roc_auc_score(yc_test, pred_y)
            print(f"roc_auc_score: {roc_score}")
            print("-" * 50)
        # Only the model from the last fold is returned
        return model_c

    # Apply the LGBM model with 10 stratified folds
    from lightgbm import LGBMClassifier

    lgb_params = {"learning_rate": 0.045, "n_estimators": 10000, "max_bin": 84,
                  "num_leaves": 10, "max_depth": 20, "reg_alpha": 8.457,
                  "reg_lambda": 6.853, "subsample": 0.749}
    lgb_model = cross_val(xc, yc, LGBMClassifier, lgb_params)

Output, summarized per fold. Each fold trains until the validation score does not improve for 100 rounds; valid_0's binary_logloss is logged every 300 iterations.

    Fold  Logloss at [300] / [600] / [900] / [1200]    Best iteration  Best logloss  roc_auc_score
    0     0.433821 / 0.433498                          599             0.433487      0.8748638095718249
    1     0.434881 / 0.43445                           569             0.43442       0.8755631159104413
    2     0.431872 / 0.43125 / 0.430984                1013            0.430841      0.877077541404848
    3     0.442048 / 0.44142 / 0.441142                895             0.44114       0.8721270953106521
    4     0.439466 / 0.438899                          782             0.438824      0.8709229804739002
    5     0.427545                                     445             0.42739       0.8792290845510382
    6     0.440554 / 0.439762 / 0.439505 / 0.439264    1247            0.439142      0.872610593872283
    7     0.423764                                     414             0.423534      0.8806521642373888
    8     0.440673                                     409             0.440262      0.8708570312002339
    9     0.441536 / 0.441034                          661             0.440952      0.8713195377336685

    # LGBM results: refit on the training split, then score the hold-out split.
    # The probabilities must come from lgb_model; the original code took them
    # from clf2, the XGBoost model.
    lgb_model.fit(xc_train, yc_train["Is_Lead"])
    lgb_pred_2 = lgb_model.predict_proba(xc_test)[:, 1]
    Fpr, Tpr, thresholds = roc_curve(yc_test, lgb_pred_2, pos_label=1)
    auc_score = roc_auc_score(yc_test, lgb_pred_2)

    print(" ROC_AUC score is ", auc_score)
    lgb_pred = lgb_model.predict(xc_test)
    print("accuracy score is : ", accuracy_score(yc_test, lgb_pred))
    print("Precision is : ", precision_score(yc_test, lgb_pred))
    print("Recall is: ", recall_score(yc_test, lgb_pred))
    print("F1 Score is : ", f1_score(yc_test, lgb_pred))
    print("classification report \n", classification_report(yc_test, lgb_pred))

    cnf = confusion_matrix(yc_test, lgb_pred)
    sns.heatmap(cnf, annot=True, cmap="magma")

Output. The ROC_AUC value printed here is identical to the XGBoost result because the original run computed it from clf2, the XGBoost model; a ROC AUC for the LightGBM model itself was not reported.

     ROC_AUC score is  0.8706238059470456
    accuracy score is :  0.8030965860353405
    Precision is :  0.8258784469242829
    Recall is:  0.7681420483787956
    F1 Score is :  0.7959646237944981
    classification report
                   precision    recall  f1-score   support

             0.0       0.78      0.84      0.81     11658
             1.0       0.83      0.77      0.80     11658

        accuracy                           0.80     23316
       macro avg       0.80      0.80      0.80     23316
    weighted avg       0.80      0.80      0.80     23316

    plt.rcParams["figure.figsize"] = (12, 6)

    # Plot the ROC curve; the dashed diagonal is the chance level
    plt.plot([0, 1], [0, 1], "g--")
    plt.plot(Fpr, Tpr)
    plt.xlabel("False_Positive_Rate")
    plt.ylabel("True_Positive_Rate")
    plt.title("LGB Classifier model")
    plt.show()

4. Model prediction

Once training is complete, we use the test data to make predictions.

    # The source column is no longer needed on the test rows
    df_test = df_test.drop(columns=["source"])
    df_3 = df_test
    df_3.head()

Output:

            ID        Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
    245725  VBENBARO  1       29   4            1           0             25       2               742366               0.0        NaN
    245726  CCMEWNKY  1       43   18           1           1             49       0               925537               0.0        NaN
    245727  VK3KGA9M  1       31   20           2           0             14       1               215949               0.0        NaN
    245728  TT8RPZVC  1       29   22           1           0             33       1               868070               0.0        NaN
    245729  SHQZEYTZ  0       29   20           1           0             19       1               657087               0.0        NaN

    # Features for prediction: drop the empty target column and the ID
    xc_pred = df_3.drop(columns=["Is_Lead", "ID"])

    # Apply the scaler that was fitted on the training features. The original
    # code refit a new StandardScaler on the test set, which is how the
    # probabilities printed below were produced.
    df_xc_pred = pd.DataFrame(sc.transform(xc_pred), columns=xc_pred.columns)

    lead_pred_xg = clf2.predict_proba(df_xc_pred)[:, 1]
    lead_pred_lgb = lgb_model.predict_proba(df_xc_pred)[:, 1]
    lead_pred_rf = rf_clf.predict_proba(df_xc_pred)[:, 1]
    print(lead_pred_xg, lead_pred_lgb, lead_pred_rf)

Output:

    [0.09673516 0.9428428  0.12728807 ... 0.31698707 0.1821623  0.17593904]
    [0.14278614 0.94357392 0.13603912 ... 0.22251432 0.24186564 0.16873483]
    [0.17 0.97 0.09 ... 0.5  0.09 0.15]

    # DataFrames for the lead predictions
    lead_pred_lgb = pd.DataFrame(lead_pred_lgb, columns=["Is_Lead"])
    lead_pred_xg = pd.DataFrame(lead_pred_xg, columns=["Is_Lead"])
    lead_pred_rf = pd.DataFrame(lead_pred_rf, columns=["Is_Lead"])

    # Reset the index so the IDs line up with the 0-based prediction frames
    df_test = df_test.reset_index()
    df_test.head()

Output:

       index   ID        Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
    0  245725  VBENBARO  1       29   4            1           0             25       2               742366               0.0        NaN
    1  245726  CCMEWNKY  1       43   18           1           1             49       0               925537               0.0        NaN
    2  245727  VK3KGA9M  1       31   20           2           0             14       1               215949               0.0        NaN
    3  245728  TT8RPZVC  1       29   22           1           0             33       1               868070               0.0        NaN
    4  245729  SHQZEYTZ  0       29   20           1           0             19       1               657087               0.0        NaN

    # Save ID and prediction to a CSV file for the XGBoost model
    df_pred_xg = pd.concat([df_test["ID"], lead_pred_xg], axis=1, ignore_index=True)
    df_pred_xg.columns = ["ID", "Is_Lead"]
    print(df_pred_xg.head())
    df_pred_xg.to_csv("Credit_Card_Lead_Predictions_final_xg.csv", index=False)

    # Save ID and prediction to a CSV file for the LGBM model
    df_pred_lgb = pd.concat([df_test["ID"], lead_pred_lgb], axis=1, ignore_index=True)
    df_pred_lgb.columns = ["ID", "Is_Lead"]
    print(df_pred_lgb.head())
    df_pred_lgb.to_csv("Credit_Card_Lead_Predictions_final_lgb.csv", index=False)

    # Save ID and prediction to a CSV file for the Random Forest model
    df_pred_rf = pd.concat([df_test["ID"], lead_pred_rf], axis=1, ignore_index=True)
    df_pred_rf.columns = ["ID", "Is_Lead"]
    print(df_pred_rf.head())
    df_pred_rf.to_csv("Credit_Card_Lead_Predictions_final_rf.csv", index=False)

Output (XGBoost, LGBM, Random Forest):

             ID   Is_Lead
    0  VBENBARO  0.096735
    1  CCMEWNKY  0.942843
    2  VK3KGA9M  0.127288
    3  TT8RPZVC  0.052260
    4  SHQZEYTZ  0.057762

             ID   Is_Lead
    0  VBENBARO  0.142786
    1  CCMEWNKY  0.943574
    2  VK3KGA9M  0.136039
    3  TT8RPZVC  0.084144
    4  SHQZEYTZ  0.055887

             ID  Is_Lead
    0  VBENBARO     0.17
    1  CCMEWNKY     0.97
    2  VK3KGA9M     0.09
    3  TT8RPZVC     0.12
    4  SHQZEYTZ     0.09

5. Model saving

So that the trained model can be loaded and reused later, we save it to disk as a pickle file.

    import joblib

    # Save the model to a pickle file
    joblib.dump(lgb_model, "lgb_model.pkl")

Output:

    ['lgb_model.pkl']
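A saved model is only useful if it can be read back; for a file written with joblib.dump, the counterpart is joblib.load. The round trip itself can be illustrated with the standard library's pickle module. In this sketch a plain dictionary stands in for a trained model, and the file name toy_model.pkl and the dictionary contents are made up for the example.

```python
import os
import pickle
import tempfile

# A plain dict standing in for a trained model (hypothetical contents).
toy_model = {
    "name": "toy_model",
    "threshold": 0.5,
    "feature_names": ["Age", "Vintage", "Avg_Account_Balance"],
}

# Write the object to a temporary directory so the example leaves no
# files behind in the working directory.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(toy_model, fh)

# Read it back; the restored object is an equal but separate copy.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print("file written:", os.path.exists(path))
print("restored equals original:", restored == toy_model)
```

Pickle-based files should only be loaded from sources you trust, and a model is best loaded with the same library versions that were used to save it.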
