
Introduction to Regression techniques in machine learning - Part 1



This is the third part of the Machine Learning series.
Let's start with one of the most commonly used families of algorithms for predicting simple outputs: regression, a type of supervised learning.

We train machines in order to automate tasks. In machine learning, we use various kinds of algorithms that let a machine learn the relationships within the data provided and make predictions from them. When the output to be predicted is a continuous numerical value, the task is called a regression problem. Regression analysis revolves around fairly simple algorithms, which are often used in finance, investing and other fields, and which establish the relationship between a single dependent variable and one or more independent variables. Predicting the price of a house or the salary of an employee are among the most common regression problems.

We will first briefly discuss the types of regression algorithms and then move on to an example. These algorithms may be linear or non-linear.

Linear ML algorithms

Linear Regression

It is a commonly used algorithm; in scikit-learn, for example, it can be imported as the LinearRegression class. In its simplest form, a single input variable (the significant one) is used to predict a single output variable. It is represented as:
                                                y = b*x + c
where y is the dependent variable, x is the independent variable, b is the slope of the best-fit line and c is its intercept. Unless there is an exact line that relates the dependent and independent variables, the predictions will carry some error. This error is usually measured as the square of the difference between the predicted and actual output, which is the loss function.
When you use more than one independent variable to predict the output, it is termed multiple linear regression. In that case the input variables are assumed not to be strongly correlated with each other.
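
To make the equation and the loss concrete, here is a minimal sketch in plain Python that fits y = b*x + c by least squares and computes the squared loss. The function names (fit_line, squared_loss) and the data are made up for illustration and do not come from any library.

```python
def fit_line(xs, ys):
    """Return slope b and intercept c of the least-squares line y = b*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    c = mean_y - b * mean_x
    return b, c


def squared_loss(xs, ys, b, c):
    """Sum of squared differences between predicted and actual outputs."""
    return sum((b * x + c - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 39, 41, 48, 55]

b, c = fit_line(years, salary)
loss = squared_loss(years, salary, b, c)
print(f"y = {b:.2f}*x + {c:.2f}, loss = {loss:.2f}")
```

Because the points do not lie exactly on one line, the loss stays above zero even for the best-fit line.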


Ridge Regression - The L2 Norm

Ridge regression is an extension of linear regression, typically applied to multiple regression data. Besides minimizing the loss, it adds a penalty on the sum of the squared coefficients (the L2 norm). Its coefficients are therefore not estimated by ordinary least squares (OLS) but by the ridge estimator, which is biased but has lower variance than the OLS estimator.
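
The shrinking effect of ridge can be seen even with a single input variable. The sketch below assumes the loss is the sum of squared errors plus alpha times the squared slope, with the intercept left unpenalized. The name fit_ridge_line, the parameter alpha and the data are illustrative assumptions.

```python
def fit_ridge_line(xs, ys, alpha):
    """Slope and intercept minimizing squared error + alpha * b**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty alpha is added to the denominator, pulling b towards zero.
    b = s_xy / (s_xx + alpha)
    c = mean_y - b * mean_x
    return b, c


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 39, 41, 48, 55]

ols_slope, _ = fit_ridge_line(years, salary, alpha=0)     # alpha=0 is plain OLS
ridge_slope, _ = fit_ridge_line(years, salary, alpha=10)  # penalized slope
print(ols_slope, ridge_slope)
```

With alpha set to 0 the result matches ordinary least squares. A larger alpha gives a smaller slope, which is the bias traded for lower variance.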

LASSO Regression - The L1 Norm

LASSO stands for Least Absolute Shrinkage and Selection Operator. It penalizes the sum of the absolute values of the coefficients (the L1 norm) while minimizing the prediction error. This causes the regression coefficients of some variables to shrink to exactly zero, so it also acts as a form of variable selection. In scikit-learn it can be constructed using the Lasso class.
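
Why coefficients can reach exactly zero is easiest to see with one input variable. The sketch below assumes the loss is half the sum of squared errors plus alpha times the absolute slope; libraries may scale the penalty differently, so alpha values here are not comparable with theirs. The names soft_threshold and fit_lasso_slope and the data are illustrative.

```python
def soft_threshold(value, alpha):
    """Move value towards zero by alpha, stopping at zero."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def fit_lasso_slope(xs, ys, alpha):
    """Slope minimizing 0.5 * squared error + alpha * |b| for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return soft_threshold(s_xy, alpha) / s_xx


# Hypothetical data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [32, 39, 41, 48, 55]

slopes = [fit_lasso_slope(years, salary, alpha) for alpha in (0, 20, 60)]
print(slopes)
```

As alpha grows, the slope first shrinks and then becomes exactly zero, which is how LASSO drops variables from the model.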


Non-Linear ML algorithms

Decision Tree Regression 

It breaks down a data set into smaller and smaller subsets by splitting, resulting in a tree with decision nodes and leaf nodes. The idea is to predict a value for any new data point by following the splits down to the leaf it falls into. How the splits are made is determined by the algorithm and its parameters, and splitting stops when a further split would add less than a minimum amount of information.
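
A single split is the building block of such a tree. The sketch below finds the one threshold that minimizes the total squared error of the two resulting subsets, then predicts the mean of the subset a new point falls into. It is a one-level tree only, and the names (sse, best_split, predict) and the data are made up for illustration.

```python
def sse(values):
    """Sum of squared errors of the values around their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) for the lowest-error split."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict(stump, x):
    """Follow the split and return the mean of the matching leaf."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Hypothetical data: house size (square metres) vs. price (in thousands).
sizes = [50, 60, 70, 120, 130, 140]
prices = [100, 110, 105, 200, 210, 205]

stump = best_split(sizes, prices)
print(stump, predict(stump, 65), predict(stump, 125))
```

A full decision tree would apply the same search again inside each subset until a stopping rule is met.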

Random Forest

The idea behind random forest regression is that it uses multiple decision trees to find the output. The steps involved are:
    - Pick K random data points from the training set.
    - Build a decision tree on these data points.
    - Choose the number of trees to build (provided as an argument) and repeat the above steps for each tree.
    - For a new data point, make each tree predict the value of the dependent variable for the given input.
    - Take the average of the predicted values as the final output.

This is similar to guessing the number of balls in a box. Suppose we note the guesses made by many people and then take their average as our estimate of the number of balls in the box.
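
The steps listed above can be sketched in plain Python. To keep it short, each "tree" here is only a one-split tree, and the K points are drawn with replacement; both are simplifying assumptions. The names build_stump, build_forest and forest_predict and the data are illustrative.

```python
import random


def build_stump(points):
    """Fit a one-split regression tree to a list of (x, y) points."""
    points = sorted(points)
    ys = [y for _, y in points]
    overall_mean = sum(ys) / len(ys)
    # Fallback when no split is possible: predict the overall mean everywhere.
    best = (float("inf"), float("inf"), overall_mean, overall_mean)
    for i in range(1, len(points)):
        if points[i - 1][0] == points[i][0]:
            continue  # no threshold fits between equal x values
        left, right = ys[:i], ys[i:]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if error < best[0]:
            threshold = (points[i - 1][0] + points[i][0]) / 2
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def build_forest(points, n_trees, k, seed=0):
    """Build n_trees stumps, each on its own random sample of k points."""
    rng = random.Random(seed)
    return [build_stump(rng.choices(points, k=k)) for _ in range(n_trees)]


def forest_predict(forest, x):
    """Average the predictions of all trees for the input x."""
    predictions = [left if x <= threshold else right
                   for threshold, left, right in forest]
    return sum(predictions) / len(predictions)


# Hypothetical data: house size (square metres) vs. price (in thousands).
data = list(zip([50, 60, 70, 120, 130, 140], [100, 110, 105, 200, 210, 205]))
forest = build_forest(data, n_trees=25, k=6)
print(forest_predict(forest, 65), forest_predict(forest, 125))
```

Each tree sees a slightly different sample, so the individual predictions differ, and averaging them smooths out the guesses of any single tree.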

K Nearest Neighbors (KNN model)

In scikit-learn it is available as the KNeighborsRegressor class. The algorithm is simple and easy to implement. For a new input, K nearest neighbors finds the k most similar instances in the training set. Either the average or the median of the neighbors' values is taken as the value for that input. The distance metric used to find the neighbors can be given as an argument; its default is "minkowski", a generalization of the "euclidean" and "manhattan" distances.
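
Here is a minimal sketch of the idea in plain Python, using the average of the neighbors' values. The Minkowski distance takes a power p: p=1 gives the Manhattan distance and p=2 gives the Euclidean distance. The function names and the data are made up for illustration.

```python
def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_x, train_y, query, k=3, p=2):
    """Average the targets of the k training points closest to the query."""
    distances = sorted(
        (minkowski(point, query, p), target)
        for point, target in zip(train_x, train_y)
    )
    nearest = [target for _, target in distances[:k]]
    return sum(nearest) / len(nearest)


# Hypothetical data: (size in square metres, rooms) -> price in thousands.
train_x = [(50, 2), (60, 2), (70, 3), (120, 4), (130, 5), (140, 5)]
train_y = [100, 110, 105, 200, 210, 205]

print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance
print(knn_predict(train_x, train_y, query=(65, 2), k=3))
```

In practice the features are usually scaled first, since a feature with large values (here the size) otherwise dominates the distance.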

Support Vector Machines (SVM)

It can solve both linear and non-linear regression problems. In scikit-learn we create such a model using the SVR class. In a multi-dimensional space, when more than one variable determines the output, each data point is no longer a point as in 2D but a vector. The method relies on the most extreme data points, the support vectors, to position the model. Conceptually, you separate the classes and assign them values, and the separation is based on the max-margin concept (a hyperplane with the widest possible margin).



As a rule of thumb, if the amount of training data is much larger than the number of features, KNN tends to do better than SVM. SVM tends to outperform KNN when there are many features and less training data.

We have come to the end of the article. We have briefly discussed the theory behind the main kinds of regression algorithms. In part 2 on regression, we will go through implementations of these algorithms and show you their output. Stay tuned!







