
Decision Tree Regression - A Practice Problem


In today's section we are going to learn about Decision Tree Regression. Decision trees are an important family of algorithms used in both classification and regression analysis. What is decision tree regression? Let's find out.

Decision tree regression is a form of regression in which the dataset is split into smaller and smaller subsets, forming a tree-like structure. Before diving into the concept of decision trees, we should learn some key terms used here (a short code sketch after this list shows them on a fitted tree):

  • Splitting - The process of dividing the dataset into smaller sub-units.
  • Parent and Child Node - A node that gets divided into sub-nodes is a parent node, and the sub-nodes formed are called child nodes. If a parent node is the starting point of the entire splitting, it is called the Root Node.
  • Subtree / Branch - If a sub-node splits again into further sub-nodes, that entire part (one parent-child section) is called a subtree. It is a part of the whole tree.
  • Decision Node - A sub-node that splits into further sub-nodes is called a decision node.
  • Terminal / Leaf Node - A bottom-level node that does not split any further is called a terminal or leaf node.
  • Pruning - The opposite of splitting; the process of removing sub-nodes.
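
To see these terms on a real tree, scikit-learn can print a fitted tree as text with export_text. A minimal sketch, assuming nothing more than a made-up toy dataset (the arrays below are ours, purely for illustration):

# Minimal sketch: print a small fitted tree to see the node types.
# The toy arrays are made up purely for illustration.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

X_toy = np.array([[1], [2], [3], [4]])   # one feature
y_toy = np.array([10, 20, 80, 90])       # responses

tree = DecisionTreeRegressor(random_state=0).fit(X_toy, y_toy)
print(export_text(tree, feature_names=['x']))

In the printed output, the first split is the root node, the inner splits are decision nodes, and the lines ending in "value: [...]" are the terminal / leaf nodes.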


HOW DOES THE TREE SPLIT? - Concept

There are specific criteria for splitting in trees. Where to split is determined by the algorithm, and splitting is stopped when there is too little information left to take into account (you can't split any more).

Take a number of data points scattered in a plot. Initially, take into account a condition (the first split) between any two data points, and you get two categories: say one with only a single data point on the left, and the other with many data points on the right. Any new point that falls to the left is assigned the value of the single data point that lies there. For the right category, the average of the responses of its data points is taken as the predicted output for any new data point introduced there. If the predicted value comes out the same as the actual value, you leave it aside (that will be your output value for the regression). If it is not the same, you compute the squared residual: the sum of squares of the differences between the actual and predicted values over all the categories (left + right). Then, inside the right category, you again consider two data points, as we did at the beginning, and repeat the steps above.

Each candidate split between two data points therefore produces its own squared residual, so in the end you get many squared residuals, one per split. The split whose categorical average values give the minimum squared residual is taken as the ROOT of the TREE.
    
Suppose we now have two points to our left: take their average, compare the actual values with this average, and if they are not the same, compute the squared residual; if they are the same, leave it aside, and that average value is your regressor's output value for new data points falling there. This process of splitting and assigning average values to each category goes on and on until no further useful split can be made. The sketch below shows this split search in code.
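
The split search described above fits in a few lines. A minimal sketch, assuming a single feature; the function name best_split and the toy arrays are ours, not part of any library:

import numpy as np

# Minimal sketch of the split search described above.
def best_split(x, y):
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_ssr, best_threshold = np.inf, None
    # Candidate thresholds are the midpoints between consecutive points.
    for i in range(1, len(x)):
        t = (x[i - 1] + x[i]) / 2
        left, right = y[x <= t], y[x > t]
        # Squared residual: sum of squared differences from each side's average.
        ssr = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ssr < best_ssr:
            best_ssr, best_threshold = ssr, t
    return best_threshold, best_ssr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 12.0, 30.0, 33.0, 35.0])
print(best_split(x, y))   # the threshold with the minimum squared residual

The winning threshold becomes the root split; the same search is then repeated inside each resulting category.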

**Note that this may not literally be left and right for every category; it is just for your understanding**


EXAMPLE

Let's go through an example problem to give you an overview of splitting and predicting values. Here we are using the same dataset (download dataset) that we used in the polynomial regression article. As we discussed there, it is a dataset of the salaries of different positions in a company, and it is beginner friendly.

So let's go through the example.

Have a look at the dataset.
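
If you are following along, you can print the first few rows yourself. This assumes Position_Salaries.csv sits in your working directory and has the Position, Level and Salary columns, as in the polynomial regression article:

import pandas as pd

dataset = pd.read_csv('Position_Salaries.csv')
print(dataset.head(10))   # Position, Level and Salary for ten positions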


Now let's go to our code part. First of all import all the required libraries.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next we have to import the dataset and slice it into the independent (X) and dependent (y) variables. Column 0 holds the position title as text, so X keeps only the numeric Level column and y takes the Salary column.
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
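Printing the sliced arrays confirms the split; the values in the comments assume the standard dataset from the earlier article:

print(X[:3])   # e.g. [[1] [2] [3]] - the numeric Level column
print(y[:3])   # e.g. [45000 50000 60000] - the Salary column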
Now it is time to create our Decision Tree Regression model. Note that we are not using a train/test split, since our dataset is small.
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
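With the default parameters the tree keeps splitting until every leaf is pure, so no extra arguments are needed here; random_state just makes the result reproducible. If you want to inspect what was learned, scikit-learn exposes two helpers on the fitted regressor:

print(regressor.get_depth())     # depth of the fully grown tree
print(regressor.get_n_leaves())  # number of terminal / leaf nodes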
 
We have created our model. Next we should analyse it graphically, so let's create the prediction graph using the matplotlib package. Predicting over a fine grid (X_grid) rather than only at the ten data points makes the step shape of the tree visible.
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.style.use('dark_background')
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
 
The graph will be:



Notice that the curve is a staircase: within each interval the model predicts one constant value, the average salary of the training points falling in that leaf. Here the fully grown tree puts each training point in its own leaf, so the curve passes through every red point. Let's predict with some input value:
 regressor.predict([[7.5]])
 
The output is 200000, the salary stored in the leaf that level 7.5 falls into.
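
Because the prediction inside each leaf is constant, nearby inputs that land in the same interval get exactly the same salary. You can check this with the fitted regressor; with this dataset, 7.2 and 7.5 should fall into the same leaf:

print(regressor.predict([[7.2], [7.5]]))  # same leaf, same predicted salary
print(regressor.predict([[8.0]]))         # a different leaf, a different salary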

Full code:
 
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

# Training the Decision Tree Regression model on the whole dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)

# Visualising the Decision Tree Regression results (higher resolution)
X_grid = np.arange(min(X), max(X), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

# Predicting a new result
regressor.predict([[7.5]])
 
You can also view this tree the way you would a classification tree. Take, for example, 700000 as the root: that gives you just two categories, above this value and below it. Any person with a salary above 700000 is given the position of CEO. At the node below that value you still have a number of choices, say a split at 40000: above it you have a C-level employee, and below it another set of cases, until you finally reach the lowest position and the tree terminates by stopping the splitting. In our regression problem, what we do is check whether the salary and position match, to check whether a job applicant is speaking the truth about their previous position and salary. Hence, giving the position as input, we try to predict the person's salary. Our splits are based on the positions.
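
Put in code, that salary check is one comparison, using the regressor fitted above. A minimal sketch; the applicant's claimed numbers and the 10% tolerance are made up for illustration:

# Hypothetical claim from a job applicant (made-up numbers).
claimed_level, claimed_salary = 7.5, 160000

predicted_salary = regressor.predict([[claimed_level]])[0]
# The 10% tolerance is an arbitrary choice for this sketch.
if abs(claimed_salary - predicted_salary) / predicted_salary < 0.10:
    print('Claim looks consistent with the position')
else:
    print('Claim looks inconsistent with the position')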

That was the concept behind the decision tree implementation. In many cases the decision tree algorithm gives you good accuracy, which is why we find it an effective algorithm. We will be back with more concepts and practice problems; kindly work out the practice problems.

KEEP READING !!!
