
Evaluating the performance of algorithms


We know that machine learning algorithms are used to predict unknown outputs, often with better performance than humans, when provided with data to learn from. But it is also important to know whether the predicted output is correct, or how effective our algorithm has been in making predictions. Simply put, we need to know how well it performs on unseen data. This is where techniques for evaluating model performance come into play, so that we can make the necessary changes to improve it.
We don't generally train an ML algorithm on the data and then use its predictions on that same data to evaluate the model, because of over-fitting: the score looks good, but there is no guarantee of performance on new data. Once we evaluate, we can make changes and retrain the model. So in this article we will first see the approaches to evaluate the model, and in the next article we'll check out strategies to improve it.

Train-Test splitting

To begin with, let us see what training and test sets are and why we use them. As said above, when you train an algorithm on some data and use its predictions on that same data to evaluate it, there is a chance of over-fitting, and the model then doesn't work well on unseen data. Train-test splitting is one of the simplest resampling methods; all the methods in this article are resampling techniques that help us estimate performance better. We split the given data into two parts, a training set and a test set, in a particular ratio using the train_test_split function. The ratio is chosen with the test_size parameter. Specifying the random_state parameter makes it possible to obtain the same split on any system, provided you use the same data. Take a look at the illustration below. Here we are using the iris dataset imported from the sklearn datasets package.

#importing the libraries

import pandas as pd
import numpy as np

#importing the dataset using scikitlearn

from sklearn.datasets import load_iris
dataset=load_iris()

#separating into independent and dependent variables

X = dataset.data
y = dataset.target

#train-test splitting

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)

First we import the libraries and the dataset. Then we separate the data into X and y, and split each of them into training and test sets.

This method is fast and, for large datasets, tends to give a performance estimate with little bias.
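To see why the held-out test set matters, here is a minimal sketch that fits a model on the training split and scores it on both splits. The choice of DecisionTreeClassifier and the variable names are assumptions made for illustration; any scikit-learn classifier could be used in its place.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same dataset and split settings as in the snippet above
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# DecisionTreeClassifier is an assumed example model, not part of the original post
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)

# Accuracy on data the model has already seen vs. data it has never seen
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("Train accuracy:", train_accuracy)
print("Test accuracy:", test_accuracy)
```

The training accuracy tends to be optimistic because the model has already seen those rows; the test accuracy is the more honest estimate of how it will do on new data.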


The K-Fold Cross Validation

The limited data is split into K groups. This kind of resampling usually gives a performance estimate with less variation than a single train-test split.
The data is randomly split into K folds (splits). At a time, one fold is held out as the test set while the remaining K-1 folds are used for training, and the error in the predictions is recorded. This is repeated until each fold has had its turn as the test set. Finally we end up with K different error or score values, which we summarize using the mean and standard deviation. This gives a more reliable estimate of performance, since the model has been trained and tested multiple times on different folds. Usually K=5 or K=10 is used to carry out cross validation. Check the code below. Here we are using the Boston house price dataset from the sklearn datasets package.

We import KFold from the sklearn model_selection module to perform this.
#importing libraries
import pandas as pd
import numpy as np

#loading dataset
#note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
dataset=load_boston()

#splitting into independent and dependent variable
X = dataset.data
y = dataset.target

#feature scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
X = sc.fit_transform(X)

#importing decisiontree regressor for creating the model
from sklearn.tree import DecisionTreeRegressor
dtr = DecisionTreeRegressor()
scores = []

#applying cross validation
from sklearn.model_selection import KFold
#random_state only applies when shuffle=True
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for train_index, test_index in cv.split(X):
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index)
    X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
    dtr.fit(X_train, y_train)
    scores.append(dtr.score(X_test, y_test))
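
The loop above collects one score per fold; the last step is to summarize them. Here is a minimal sketch of that summary using cross_val_score, which runs the same fit-and-score loop internally. It assumes the diabetes dataset (load_diabetes) as a stand-in regression dataset, since it ships with current scikit-learn releases; it is not the dataset used in the post.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Assumed stand-in dataset for illustration
X, y = load_diabetes(return_X_y=True)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = DecisionTreeRegressor(random_state=0)

# One score per fold; for regressors the default score is R^2
scores = cross_val_score(model, X, y, cv=cv)

print("Scores per fold:", np.round(scores, 3))
print("Mean score:", scores.mean())
print("Standard deviation:", scores.std())
```

A large standard deviation relative to the mean suggests the model's performance depends heavily on which rows it happens to be trained on.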
    
 
K-fold CV gives a more reliable estimate of performance on unseen data, especially when the data is limited.
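
To make the fold mechanics concrete, the sketch below runs KFold on ten made-up samples (X_toy is a hypothetical array, not real data) and checks that every sample lands in the test set exactly once and never appears in both sets of the same fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten hypothetical samples, just to show how the indices are partitioned
X_toy = np.arange(10).reshape(-1, 1)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_counts = np.zeros(len(X_toy), dtype=int)
fold_sizes = []
for fold, (train_index, test_index) in enumerate(cv.split(X_toy), start=1):
    # A sample is never in the training and test set of the same fold
    assert len(np.intersect1d(train_index, test_index)) == 0
    test_counts[test_index] += 1
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"Fold {fold}: train={train_index}, test={test_index}")

print("Times each sample was used for testing:", test_counts)
```

With K=5 and ten samples, each fold trains on eight samples and tests on the remaining two.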

That was about two different resampling techniques for evaluating model performance. In a coming article we will discuss methods for improving model performance. You can see our ML series here.

Keep reading!!!

