
Data Preprocessing in Machine Learning


This is Part 2 of the Machine Learning series. In the previous blog, we discussed What is Machine Learning.

Hoping you have a good understanding of the Python programming language, let's proceed. Note that it's also necessary to have a basic understanding of the NumPy, Pandas, and Matplotlib libraries, as well as scikit-learn.
           
One of the major steps in machine learning is DATA PREPROCESSING. Data should be handled carefully in order to get the best machine learning model, and preprocessing comprises several steps. Cleaning your data is what pre-processing is all about, because poor data will affect your model. Now, what do we mean by poor data? Let us learn more.

Poor data is data that contains missing values, categorical variables, false values, or values that aren't properly scaled. One of the important steps of preprocessing is feature selection, also called feature extraction, which can directly impact the model. Dimensionality reduction becomes important when the number of attributes keeps growing: due to the curse of dimensionality, the model gets tougher and tougher to train as the number of attributes increases. If you have text data, there are many methods to extract information from it, such as the count vectorizer, TF-IDF, word vectors, and bag of words; dealing with text is an important part of data analytics. The most used libraries for data preprocessing are scikit-learn, pandas, and NumPy.
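As a quick illustration of text feature extraction, here is a minimal sketch using scikit-learn's CountVectorizer (bag of words) and TfidfVectorizer; the two sample sentences are made up for the example:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Two made-up sentences, just for illustration
docs = ["data preprocessing is important", "poor data hurts the model"]

# Bag of words: one column per word, values are raw counts
print(CountVectorizer().fit_transform(docs).toarray())

# TF-IDF: the same counts reweighted so that common words matter less
print(TfidfVectorizer().fit_transform(docs).toarray())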

Analyze the given data set:

[Image: the sample data set]

We have to perform some modifications on the given data; otherwise, the machine learning model we create will not be efficient enough to predict on the input test data.

So what are the modifications that we have to perform? Let's analyze:
   
  * Slicing the dataset into independent and dependent variables 
  * Taking care of missing data (nan values)
  * Encoding the categorical data
  * Feature scaling
  * Splitting the data set into training and testing sets


Let's go through the above-mentioned steps one by one.

Slicing the data set into Independent and Dependent variables:
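The snippets in this and the following steps assume the data set has already been loaded with pandas; a minimal sketch, where the filename Data.csv is a placeholder:

import pandas as pd

# 'Data.csv' is a hypothetical filename; use your own data file here
dataset = pd.read_csv('Data.csv')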
 
# All columns except the last one are the features (independent variables)
X = dataset.iloc[:, :-1].values
# The last column is the target (dependent variable)
y = dataset.iloc[:, -1].values

The independent variable is the input data and the dependent variable is the output data. The Python library used for slicing is Pandas: iloc[:, :-1] selects every column except the last one, while iloc[:, -1] selects only the last column.

Taking care of missing values:
import numpy as np
from sklearn.impute import SimpleImputer

# Replace every nan in the numeric columns (indices 1 and 2) with that column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)

Output will be:

[Image: X after imputation]
In the above example, missing values are replaced by the mean of the elements of the column.
Other strategies available are 'mean', 'median', 'most_frequent', and 'constant'.
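A quick sketch of the 'constant' and 'most_frequent' strategies on tiny made-up arrays (the numbers are arbitrary):

import numpy as np
from sklearn.impute import SimpleImputer

# 'constant' fills every nan with a value you choose via fill_value
const_imputer = SimpleImputer(strategy='constant', fill_value=0)
print(const_imputer.fit_transform([[1.0, np.nan], [np.nan, 4.0]]))

# 'most_frequent' replaces nan with the most common value in each column,
# which also makes it usable on categorical columns
freq_imputer = SimpleImputer(strategy='most_frequent')
print(freq_imputer.fit_transform([[1.0, 2.0], [1.0, np.nan], [3.0, 2.0]]))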

Encoding the categorical data:
# Encoding categorical data
# Encoding the Independent Variable: one-hot encode the categorical column 0
# and keep the remaining columns unchanged ('passthrough')
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
# Encoding the Dependent Variable: map each class label to an integer
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)

Both LabelEncoder and OneHotEncoder are used to convert non-numeric data into numeric data. LabelEncoder maps each category to a single integer and suits the target column, while OneHotEncoder creates one binary column per category so the model doesn't read a false ordering into the values.

Output of the independent variable (X):

[Image: X after one-hot encoding]

Output of the dependent variable (y):

[Image: y after label encoding]
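If you prefer to stay inside pandas, get_dummies gives a similar one-hot encoding; a small sketch on a made-up column:

import pandas as pd

# Made-up categorical column, just for illustration
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One binary column per category, e.g. color_blue, color_green, color_red
print(pd.get_dummies(df, columns=['color']))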



Feature Scaling:
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
# Fit on the training set only, then apply the same scaling to the test set,
# so that no information from the test set leaks into training
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

Features with very different ranges can hurt the model, so every numeric column is rescaled to a small, comparable range; StandardScaler does this by giving each column zero mean and unit variance. Note that X_train and X_test come from the train-test split described in the next step, and the scaler is fit on the training set only.
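Standardization isn't the only option; min-max normalization instead squeezes each column into the [0, 1] range. A minimal sketch with scikit-learn's MinMaxScaler, on a made-up array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix, just for illustration
data = np.array([[20.0, 40000.0],
                 [30.0, 60000.0],
                 [40.0, 80000.0]])

# Each column is mapped linearly so its minimum becomes 0 and its maximum becomes 1
print(MinMaxScaler().fit_transform(data))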

Splitting the data set into training and testing set:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In a train-test split, instead of testing on the same data used to train the model, we split the given data in two, a training set and a test set. A model can show good accuracy on data it has already learned from and still perform poorly on new test data; that gap (high variance) tells us the model hasn't learned well, so holding out a test set is what lets us detect overfitting.
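One more detail worth knowing: for classification problems, passing stratify=y keeps the class proportions the same in both sets, so a rare class doesn't end up entirely in one of them. A minimal sketch:

from sklearn.model_selection import train_test_split

# stratify=y preserves the class balance of y in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)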
Now that we have learned the necessity of data preprocessing, we will be back next time to continue the topic.
