This is Part 2 of the Machine Learning series. In the previous blog we discussed what Machine Learning is.
Assuming you have a good understanding of the Python programming language, let's proceed. Note that a basic understanding of the NumPy, Pandas, and Matplotlib libraries, as well as scikit-learn, is also necessary.
One of the major steps in machine learning is DATA PREPROCESSING. Data should be handled carefully in order to build the best possible machine learning model. Preprocessing, which comprises several steps, is essentially the cleaning of your data: poor data will hurt your model. So what exactly do we mean by poor data? Let us learn more.
Poor data is data that contains missing values, categorical variables, false values, or values that aren't properly scaled. One important preprocessing step is feature selection (also called feature extraction), which can directly impact the model. Dimensionality reduction also becomes important as the number of attributes grows: because of the curse of dimensionality, the model gets harder and harder to train as the number of attributes increases. If you have text data, there are many methods to extract information from it, such as count vectorization, TF-IDF, word vectors, and bag-of-words; dealing with text is an important part of data analytics. The most used libraries for data preprocessing are scikit-learn, pandas, and NumPy.
Analyze the given data set. We have to perform some modifications to it; otherwise, the machine learning model we build will not be efficient enough at predicting on the test data. So what are the modifications we have to perform? Let's analyze:
* Slicing the dataset into independent and dependent variables
* Taking care of missing data (nan values)
* Encoding the categorical data
* Feature scaling
* Splitting the data set into training and testing set
Let's go through each of the above-mentioned steps.
Slicing the data set into Independent and Dependent variables:
```python
import pandas as pd

# Load the data set (file name assumed for illustration)
dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values  # independent variables: all columns but the last
y = dataset.iloc[:, -1].values   # dependent variable: the last column
```
The independent variables are the input data and the dependent variable is the output data. The Python library used for slicing here is Pandas.
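To make the slicing concrete, here is a self-contained sketch on a tiny made-up DataFrame (the column names are hypothetical):

```python
import pandas as pd

# Tiny made-up data set: two feature columns and one target column
dataset = pd.DataFrame({
    'Age':       [25, 32, 47],
    'Salary':    [50000, 60000, 80000],
    'Purchased': ['No', 'Yes', 'Yes'],
})

X = dataset.iloc[:, :-1].values  # all columns except the last
y = dataset.iloc[:, -1].values   # the last column only

print(X.shape)  # (3, 2)
print(y)        # ['No' 'Yes' 'Yes']
```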
Taking care of missing values:
```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaNs in columns 1 and 2 with the mean of each column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
print(X)
```
In the above example, missing values are replaced by the mean of the elements of the column.
The other strategies available are 'mean', 'median', 'most_frequent', and 'constant'.
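A small self-contained sketch of mean imputation (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# 3x2 array with one missing value in each column
data = np.array([[1.0,    2.0],
                 [np.nan, 3.0],
                 [7.0,    np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(data)

print(filled)
# The NaN in column 0 becomes (1 + 7) / 2 = 4.0,
# and the NaN in column 1 becomes (2 + 3) / 2 = 2.5
```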
Encoding the categorical data:
```python
# Encoding the independent variable
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],  # one-hot encode column 0
    remainder='passthrough')                           # keep the other columns as-is
X = np.array(ct.fit_transform(X))
print(X)

# Encoding the dependent variable
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
print(y)
```
After this, the categorical column of X is replaced by one one-hot column per category, and y becomes an array of integer labels.
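Here is a minimal self-contained sketch of both encoders on made-up data:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Made-up feature matrix: a country column followed by a numeric column
X = np.array([['France', 44.0],
              ['Spain',  27.0],
              ['France', 30.0]], dtype=object)

ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [0])],
    remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)  # France -> [1, 0], Spain -> [0, 1]; the numeric column is kept

# Made-up target labels
y = np.array(['No', 'Yes', 'No'])
y = LabelEncoder().fit_transform(y)
print(y)  # [0 1 0]
```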
Feature Scaling:
```python
# Feature Scaling
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # scale the numeric columns
```
Features with large ranges of values can dominate the model and reduce its accuracy, so they are rescaled to a small, comparable range (here, standardized to zero mean and unit variance).
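Standardization replaces each value x with (x − mean) / std. A small self-contained sketch (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature column with a large range
values = np.array([[10.0], [20.0], [30.0]])

sc = StandardScaler()
scaled = sc.fit_transform(values)

print(scaled.ravel())
# mean = 20, std = sqrt(((10-20)^2 + 0 + (30-20)^2) / 3) ~= 8.165,
# so the scaled values are roughly [-1.2247, 0.0, 1.2247]
```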
Splitting the data set into training and testing set:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)  # 80% train, 20% test
```
In a train-test split, instead of testing on the same data used to train the model, we split the given data into two sets, a training set and a test set, which helps us detect overfitting. A model may show good accuracy on the data it has already learned from, yet perform poorly (high variance) on new test data, which shows that it hasn't generalized well.
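A minimal self-contained sketch of the split on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

print(len(X_train), len(X_test))  # 8 2  (an 80/20 split of 10 samples)
```

Fixing `random_state` makes the shuffle reproducible, so you get the same split every run.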
Now that we have learned the necessity of data preprocessing, we will be back next time to continue the topic.