Posts: 6
Threads: 1
Joined: Aug 2018
Aug-29-2018, 04:25 PM
(This post was last modified: Aug-29-2018, 04:25 PM by sandy49992.)
Hello, my friends. This is my first post about Python. My English is not very good, sorry...
I'm trying to practice more, and I'll do my best to express my question.
This code uses loan data to predict whether a loan defaults or not.
I use XGBoost for the prediction.
I need to write for loops that randomly select rows or columns and test the accuracy on each selection.
I'm not sure how to write them, or whether it can even be done.
Please give me some advice, thank you very much~~
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

df = pd.read_csv('LendingClubLoanData.csv', encoding='big5')
df = df.fillna(df.mean())

# part (1)
#df = df.sample(n=50000)  # choose rows; 60000 total, test every 1000 rows: 1000, 2000, ..., 60000
#for i in range(1000, 60000, 1000):  # I try to write the for loop here
#    print(df.sample(i))

X = df.drop('loan_status', axis=1)

# part (2)
#X = X.sample(20, axis=1)  # choose columns; 38 total, test every 10: 10, 20, 30, 38

Y = df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=7)
model = XGBClassifier()
model.fit(X_train, y_train)
print(model)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred.round())
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Outcome: part (1)
size      Accuracy
1000      0.67
2000      0.68
3000      0.71
....      ....

Outcome: part (2)
features  Accuracy
10        0.66
20        0.74
30        0.73
38        0.71
The accuracy values are not actual results, only assumptions.
If there is anything wrong in the code, please tell me~
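For illustration, the row-subsampling skeleton from part (1) can be tried on its own, with a small synthetic DataFrame standing in for the real CSV (the columns and sizes below are made up, not the actual LendingClub data):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the loan data (made-up values, 6000 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6000, 5)),
                  columns=['col%d' % i for i in range(5)])
df['loan_status'] = rng.integers(0, 2, size=6000)

sizes = []
for n in range(1000, 6001, 1000):  # 1000, 2000, ..., 6000
    sub = df.sample(n)             # random subset of n rows
    sizes.append(len(sub))
print(sizes)
```

Each `sub` could then be split with `train_test_split` and fed to the classifier exactly as in the code above.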
Posts: 817
Threads: 1
Joined: Mar 2018
Aug-30-2018, 01:09 AM
(This post was last modified: Aug-30-2018, 01:09 AM by scidam.)
import numpy as np

model = XGBClassifier()
np.random.seed(32)  # this makes the results reproducible; comment this line out if needed
accuracies = []
for counter in range(100):
    # if the random_state parameter is omitted, train_test_split draws from numpy's global random state
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred.round())
    accuracies.append(accuracy)
print('Total number of splittings: ', len(accuracies),
      'Mean accuracy score: ', np.mean(accuracies),
      'std.dev.: ', np.std(accuracies))
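The averaging bookkeeping above works with any estimator, so here is a self-contained sketch that replaces XGBoost with a trivial majority-class "model" on synthetic data, just to show the repeated-split loop in isolation:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(32)
X = rng.normal(size=(200, 3))        # synthetic features
Y = rng.integers(0, 2, size=200)     # synthetic binary target

accuracies = []
for counter in range(100):
    idx = rng.permutation(len(X))    # a fresh random split each iteration
    test, train = idx[:60], idx[60:] # 30% test, like test_size=0.3
    majority = Counter(Y[train]).most_common(1)[0][0]  # dummy "model"
    accuracies.append(float(np.mean(Y[test] == majority)))
print(len(accuracies), np.mean(accuracies), np.std(accuracies))
```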
Posts: 6
Threads: 1
Joined: Aug 2018
Hello, scidam.
I've combined your code with mine, thank you so much.
Posts: 6
Threads: 1
Joined: Aug 2018
Hello, scidam.
Your code is really perfect and solved my problem, thank you so much.
I changed it a little to select the sample data.
However, I still have a question:
I'm not sure how to make it run 20 times, take the mean, and then fill in the accuracy.
I guess I need to write a nested for loop inside, but I have little idea how to write it.
df = pd.read_csv('LoanStats_2017+2016-39VARIABLES.csv', encoding='big5')
df = df.fillna(df.mean())
accuracies = []
acc = []
i = []
for counter in range(1000, 220000, 5000):
    j = 20  # I guess the "run 20 times" part goes here
    i.append(counter)
    c = df.sample(counter)
    X = c.drop('loan_status', axis=1)
    #X = X.sample(counter, axis=1)
    Y = c['loan_status']
    model = XGBClassifier()
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred.round())[j]  # run j times?
    acc.append(accuracy)
    accuracy = np.mean(acc)  # the part I need to write
    accuracies.append(accuracy)
print('Total number of splittings: ', len(accuracies), 'Mean accuracy score: ', np.mean(accuracies), 'std.dev.: ', np.std(accuracies))

I'm not sure how to write this code; I think it has some mistakes.
Please give me some advice~~ thank you very much.
Posts: 6
Threads: 1
Joined: Aug 2018
And I want to append the values from the 20 runs as new columns of each row in the dataframe.
I've tried to think how to do it, but I still run into some difficulties.
I hope I have explained it well.
Thanks for the hard work!
Expected outcome:
size Accuracy1 Accuracy2 Accuracy3
1000 0.67 0.68 0.70
2000 0.68 0.73 0.72
3000 0.71 0.72 0.78
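A wide table like that expected outcome can be assembled from a dict mapping each size to its list of per-run accuracies; the numbers below are just the illustrative values from the table above, not real results:

```python
import pandas as pd

# per-size accuracy lists (illustrative values only)
acc = {1000: [0.67, 0.68, 0.70],
       2000: [0.68, 0.73, 0.72],
       3000: [0.71, 0.72, 0.78]}

wide = pd.DataFrame.from_dict(acc, orient='index',
                              columns=['Accuracy1', 'Accuracy2', 'Accuracy3'])
wide.index.name = 'size'
print(wide.reset_index())
```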
Posts: 817
Threads: 1
Joined: Mar 2018
I think... I understand what you are trying to get...
You are trying to draw subsamples of predefined sizes from the main dataset, train
the classifier on each subsample, and estimate its accuracy in each case.
I hope this commented code helps you get what you want...
import pandas as pd
from collections import defaultdict

# note: pandas imports numpy as np, so it is accessible as pd.np.*
pd.np.random.seed(123)  # set some value for results reproducibility... or remove this line to be `completely` random...
df = pd.read_csv('LoanStats_2017+2016-39VARIABLES.csv', encoding='big5')
df = df.fillna(df.mean())
accuracies = defaultdict(list)  # this is our container where we will store the results (accuracies)
model = XGBClassifier()  # we don't need to redefine the model each time in the loop, just refit it
N_replications = 20
for counter in range(1000, 220000, 5000):
    c = df.sample(counter)  # the behaviour of this function is controlled by numpy's RandomState, which we seeded above
    X = c.drop('loan_status', axis=1)
    Y = c['loan_status']
    for i in range(N_replications):
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracies[counter].append(accuracy_score(y_test, y_pred.round()))

# Here we have a dictionary of the following structure:
# accuracies = {1000:  [a list (of len = N_replications) of accuracy values],
#               6000:  [a list (of len = N_replications) of accuracy values],
#               11000: [a list (of len = N_replications) of accuracy values],
#               ...}   # keys are equal to the counter values
# Now we are ready to convert the accuracies container to a dataframe...
RESULT = pd.DataFrame([{'size': k, 'mean': pd.np.mean(v), 'std': pd.np.std(v)} for k, v in accuracies.items()])
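The final container-to-DataFrame step can be checked in isolation by filling `accuracies` with fake values (no CSV or model needed; the numbers below are random placeholders):

```python
import numpy as np
import pandas as pd
from collections import defaultdict

rng = np.random.default_rng(123)
accuracies = defaultdict(list)
for counter in (1000, 6000, 11000):
    for _ in range(20):                       # N_replications fake runs
        accuracies[counter].append(rng.uniform(0.6, 0.8))

RESULT = pd.DataFrame([{'size': k, 'mean': np.mean(v), 'std': np.std(v)}
                       for k, v in accuracies.items()])
print(RESULT)
```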
Posts: 6
Threads: 1
Joined: Aug 2018
Sep-21-2018, 01:47 AM
(This post was last modified: Sep-21-2018, 01:47 AM by sandy49992.)
from collections import defaultdict
from sklearn.metrics import precision_score, recall_score

pd.np.random.seed(123)
df = pd.read_csv('LoanStats_2017+2016-39orgin.csv', encoding='big5')
df = df.fillna(df.mean())
accuracies = defaultdict(list)
f1err = defaultdict(list)
f2err = defaultdict(list)
dummy_code = pd.get_dummies(df[['home_ownership', 'purpose', 'verification_status', 'grade']])
df = df.drop(['home_ownership', 'purpose', 'mths_since_last_delinq', 'verification_status', 'grade'], axis=1)
df = df.join(dummy_code)
N_replications = 10
for counter in range(73000, 240000, 12000):
    c = df.sample(counter)  # the behaviour of this function is controlled by numpy's RandomState, which we seeded above
    X = c.drop('loan_status', axis=1)
    #dtrain = xgb.DMatrix(X, label=[0, 1])
    #X = X.sample(counter, axis=1)
    Y = c['loan_status']
    for i in range(N_replications):
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
        model = XGBClassifier(max_depth=5, objective='binary:logistic')
        model.fit(X_train.round(), y_train.round())
        y_train = y_train.tolist()
        y_pred = model.predict(X_test)
        y_pre = [int(item > 0) for item in y_pred]
        predictions = [round(value) for value in y_pre]
        accuracies[counter].append(accuracy_score(y_test, predictions))
        f1err[counter].append(1 - precision_score(y_test, predictions))
        f2err[counter].append(1 - recall_score(y_test, predictions))

Hello, scidam.
Thank you for your help.
Now I get an error in my code, but I'm not sure why it happens.
---> error on: accuracies[counter].append((accuracy_score(y_test, predictions)))
ValueError: Classification metrics can't handle a mix of continuous and binary targets
Could you help me? Where is it wrong?
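For reference, sklearn raises this error whenever `y_true` contains non-integer floats while the predictions are binary. A tiny reproduction (toy numbers, not the actual loan data) and the usual rounding fix:

```python
from sklearn.metrics import accuracy_score

y_pred = [0, 1, 1]
try:
    accuracy_score([0.2, 0.9, 1.0], y_pred)   # continuous y_true -> ValueError
except ValueError as e:
    print(e)

# rounding/casting the targets to integers avoids the type mismatch
fixed = accuracy_score([round(v) for v in [0.2, 0.9, 1.0]], y_pred)
print(fixed)
```

One plausible way `loan_status` becomes continuous here is `df.fillna(df.mean())` filling missing target values with a fractional mean, but that is only a guess without seeing the data.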