Multiple Linear Regression-Practice problem


In the previous blog we briefly discussed simple linear regression. Now it's time to have a look at a multiple linear regression problem statement.

Multiple Linear Regression

   Multiple linear regression predicts the output by taking several features into account. Selecting these independent features is itself an important step. This approach comes into play when the output variable cannot be predicted from a single feature, because other factors also affect the output.
                                
                                y = b0 + b1X1 + b2X2 + ... + bnXn

Here b0 is the intercept and b1 to bn are the coefficients of the independent features X1 to Xn.

   Not all the features in a dataset affect the output, so we have to find the significant features that do. There are several ways to do this; one of them is backward elimination: the step-wise selection of features by removing the statistically least significant feature one at a time, based on its p-value, which is the probability that the null hypothesis (the assumption that there is no correlation between the variables) is true. A code sketch of this procedure follows the steps below.

Different steps in Backward Elimination:
  • Step 1 - Select a significance level (here we choose 0.05).
  • Step 2 - Fit the model with all possible predictors.
  • Step 3 - Consider the predictor with the highest p-value: if its p-value is greater than the significance level, go to Step 4; otherwise finish the process.
  • Step 4 - Eliminate that predictor.
  • Step 5 - Fit the model without the eliminated predictor and return to Step 3.
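To make these steps concrete, here is a minimal sketch of the same loop written as a reusable function with statsmodels. The function name backward_elimination and its argument names are illustrative, and it assumes the feature matrix X already contains the column of 1's for the intercept (we add that column later in this post).

import numpy as np
import statsmodels.regression.linear_model as sm

def backward_elimination(X, y, significance_level=0.05):
    columns = list(range(X.shape[1]))            # start with all predictors
    while True:
        X_current = np.array(X[:, columns], dtype=float)
        model = sm.OLS(endog=y, exog=X_current).fit()
        pvalues = np.asarray(model.pvalues)
        worst = int(np.argmax(pvalues))          # least significant predictor
        if pvalues[worst] > significance_level:
            del columns[worst]                   # Step 4: eliminate it and refit
        else:
            return columns, model                # every remaining p-value is significant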

       Have a look at the dataset (download dataset), which shows data for a number of startups. Applying simple logic, we can see that the output cannot depend on just one of the features, so we select the significant features and then move on to the necessary data pre-processing steps. In this case the profit is what we need as the output, taking into account factors such as the amount spent on R&D, Administration and Marketing, and also the state where the startup is established.
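Before jumping into the code, it helps to take a quick look at the raw file so the column positions used below make sense. This is just an illustrative peek; the column names follow the usual 50_Startups layout (R&D Spend, Administration, Marketing Spend, State, Profit).

import pandas as pd
data = pd.read_csv('50_Startups.csv')
print(data.head())    # first five startups
print(data.dtypes)    # 'State' is the only categorical (non-numeric) column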



 Remember to import the required libraries first.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
   Next we import the dataset and slice it into the independent variables x and the dependent variable y.
data=pd.read_csv('50_Startups.csv')
x=data.iloc[:,:-1].values
y=data.iloc[:,4]
  Some data pre-processing is needed because the feature set contains a categorical variable (State). To convert it to numerical values we apply label encoding followed by one-hot encoding, using the LabelEncoder class and OneHotEncoder applied through a ColumnTransformer.
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
lab_x=LabelEncoder()
x[:,3]=lab_x.fit_transform(x[:,3])
ct=ColumnTransformer([('State',OneHotEncoder(),[3])],remainder='passthrough')
x=np.array(ct.fit_transform(x))
   Next, to avoid the dummy variable trap (the situation where the dummy columns are perfectly correlated, since one of them can always be predicted from the others), we drop one of the encoded columns.
x=x[:,1:]
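As a side note, in recent versions of scikit-learn the LabelEncoder step and the manual column drop can both be skipped: OneHotEncoder accepts string columns directly, and its drop='first' option removes one dummy column for you. A minimal alternative sketch (x_raw below is an illustrative name standing for the un-encoded feature matrix data.iloc[:,:-1].values; it is not used elsewhere in this post):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct_alt = ColumnTransformer(
    [('State', OneHotEncoder(drop='first'), [3])],   # drop='first' avoids the dummy variable trap
    remainder='passthrough')
x_alt = np.array(ct_alt.fit_transform(x_raw))        # x_raw: un-encoded features (illustrative name)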
    As we discussed earlier, we have to find which features in the dataset are the most significant, and for this purpose we use the statsmodels module. Before that, look at the equation of y again: it contains the term b0, whose corresponding x value is 1. Since statsmodels' OLS does not add this intercept automatically, we append a column of 1's to x.
x=np.append(arr=np.ones((50,1)).astype(int),values=x,axis=1)
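As an aside, statsmodels provides a helper that does the same job as the np.append call above: add_constant prepends a column of ones for the intercept. An equivalent one-liner, shown with separate names so it does not clash with the x used below (x_encoded is an illustrative name for the encoded features before the ones column is added):

import statsmodels.api as smapi
x_with_const = smapi.add_constant(x_encoded.astype(float))   # prepends a column of 1.0s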
Then we can proceed to the selection of significant features, i.e. backward elimination. Follow the steps given at the beginning of the article.
import statsmodels.regression.linear_model as sm
x_opt=x[:,[0,1,2,3,4,5]]   # start with all predictors (column 0 is the constant)
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()
By analyzing the summary we can choose which predictors should be removed.
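Besides reading the printed summary, the fitted results object exposes the p-values directly, which is handy when deciding which predictor to drop next:

import numpy as np
pvals = np.asarray(reg_OLS.pvalues)
print(pvals)                     # one p-value per column of x_opt
print(int(np.argmax(pvals)))     # index of the least significant predictor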


We can omit x2 (a state dummy), since it has a large p-value (0.990), and continue removing predictors until all the remaining p-values are less than 0.05.
x_opt=x[:,[0,1,3,4,5]]   # column 2 (a state dummy) removed
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,1,3,5]]     # column 4 (Administration) removed
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,3,5]]       # column 1 (the other state dummy) removed
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,3]]         # column 5 (Marketing Spend) removed; only the constant and R&D Spend remain
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()
Finally our independent variables contain only the constant and R&D Spend. So let's split this reduced data set (x_opt) into training and test sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_opt, y, test_size = 0.2,random_state = 0)
Now let's create the model.
from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(x_train,y_train)
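Once the model is fitted we can inspect the learned parameters. Note that the column of 1's in x_opt gets a coefficient of about zero, because LinearRegression fits its own intercept by default.

print(reg.intercept_)   # learned intercept (b0)
print(reg.coef_)        # one coefficient per column of the training matrix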
Finally we predict the output using the predict function.
y_predict=reg.predict(x_test)
Now let's compare the actual and predicted values.
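One simple way to put the actual and predicted values side by side, and to score the model, is sketched below (the DataFrame is only for display; r2_score is scikit-learn's coefficient of determination):

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

comparison = pd.DataFrame({'Actual': np.asarray(y_test), 'Predicted': y_predict})
print(comparison.head(10))
print('R^2 on the test set:', r2_score(y_test, y_predict))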

                                 



Full Code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn

data=pd.read_csv('50_Startups.csv')
x=data.iloc[:,:-1].values
y=data.iloc[:,4]

#Encoding categorical data
#Encoding Independent variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn.compose import ColumnTransformer
lab_x=LabelEncoder()
x[:,3]=lab_x.fit_transform(x[:,3])
ct=ColumnTransformer([('State',OneHotEncoder(),[3])],remainder='passthrough')
x=np.array(ct.fit_transform(x))

#Avoiding dummy variable trap
x=x[:,1:]

#Building the optimal model using Backward elimination
import statsmodels.regression.linear_model as sm
#append a column of ones at the beginning of the feature matrix (for the intercept term b0)
x=np.append(arr=np.ones((50,1)).astype(int),values=x,axis=1)
x_opt=x[:,[0,1,2,3,4,5]]
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()


x_opt=x[:,[0,1,3,4,5]]
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,1,3,5]]
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,3,5]]
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

x_opt=x[:,[0,3]]
x_opt=np.array(x_opt,dtype=float)
reg_OLS=sm.OLS(endog=y,exog=x_opt).fit()
reg_OLS.summary()

#train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_opt, y, test_size = 0.2,random_state = 0)

#Fitting multiple linear regression to the training set
from sklearn.linear_model import LinearRegression
reg=LinearRegression()
reg.fit(x_train,y_train)

#prediction
y_predict=reg.predict(x_test)

         Ah, that was quite a bit to understand. The science of ML algorithms is indeed difficult, but it gradually becomes less complex once you have sufficient mathematical and analytical skills. In later posts we will again go through regression algorithms, but this time non-linear ones.
Keep reading and keep practicing.
