Data Science #2: Data Preprocessing using Scikit Learn

Dhruv Dalsania
4 min read · Sep 7, 2021

What is data preprocessing?

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step when creating a machine learning model: before doing any operation with data, we must clean it and put it in a consistent format.

There are many preprocessing methods, but we will mainly focus on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Dataset

The dataset is about credit card approval prediction. Credit score cards are a common risk-control method in the financial industry. They use personal information and data submitted by credit card applicants to predict the probability of future defaults and credit card borrowings, which lets the bank decide whether to issue a credit card to the applicant. Credit scores can objectively quantify the magnitude of risk. The dataset has 19 columns.

You can download the dataset from here.
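To follow along, here is a minimal loading sketch (the file name application_record.csv is an assumption; use whatever name your downloaded file has):

```python
import pandas as pd

# Load the credit card application records
# (file name is an assumption; adjust to your download)
df = pd.read_csv("application_record.csv")

print(df.shape)   # the dataset should have 19 columns
print(df.head())
```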

Encoding:

Encoding is a required pre-processing step when working with categorical data for machine learning algorithms. There are two types of encoders we will discuss here.

1. LabelEncoder: when a dataset contains categorical features, we can use a LabelEncoder to convert those categories into numerical labels.

Here, the Female and Male labels will be converted into 0 and 1.

process of LabelEncoder
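A minimal sketch of how this works (the toy gender list is made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

genders = ["Female", "Male", "Male", "Female"]

le = LabelEncoder()
encoded = le.fit_transform(genders)

print(le.classes_)  # ['Female' 'Male'] -> classes are sorted alphabetically
print(encoded)      # [0 1 1 0]
```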

2. OneHotEncoder: a one-hot encoder does the same job but in a different way. A LabelEncoder assigns a number to each category, whereas a one-hot encoder assigns a whole new binary column to each category.

OneHotEncoder
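Roughly like this, on a made-up housing-type column (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

housing = np.array([["Rented"], ["Owned"], ["Rented"], ["With parents"]])

# sparse_output=False returns a dense array
# (on scikit-learn older than 1.2, use sparse=False instead)
ohe = OneHotEncoder(sparse_output=False)
onehot = ohe.fit_transform(housing)

print(ohe.categories_)  # one binary column per category
print(onehot)
```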

Normalization: Data normalization is a basic step in data mining and machine learning. It means transforming the data, namely rescaling the source values into a common range (typically [0, 1]) so that they can be processed effectively. The main purpose of normalization is to put features with different units and magnitudes on a comparable scale, so that no single feature dominates simply because its values are larger.

Before Normalization

After Normalization
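A sketch using scikit-learn's MinMaxScaler, which rescales each feature to the [0, 1] range (the income values below are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

income = np.array([[27000.0], [42000.0], [150000.0], [58000.0]])

scaler = MinMaxScaler()                 # maps min -> 0 and max -> 1
normalized = scaler.fit_transform(income)

print(normalized.ravel())
```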

Standardization: Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.

scaling data using standardization
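A sketch with StandardScaler (the age values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[23.0], [35.0], [47.0], [59.0]])

scaler = StandardScaler()               # z = (x - mean) / std
standardized = scaler.fit_transform(ages)

print(standardized.ravel())
print(standardized.mean(), standardized.std())  # ~0 and ~1
```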

Imputing the missing values: In statistics, imputation is the process of replacing missing data with substituted values. When one or more values are missing for a case, most statistical packages default to discarding the whole case, which may introduce bias or affect the representativeness of the results; imputation lets us keep such cases by filling in plausible values instead.

Remove the Year Employed Column
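A sketch with SimpleImputer, assuming we fill the missing entries of a years-employed column with its mean (the values and the mean strategy are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

years_employed = np.array([[3.0], [np.nan], [7.0], [1.0], [np.nan]])

imputer = SimpleImputer(strategy="mean")   # replace NaN with the column mean
imputed = imputer.fit_transform(years_employed)

print(imputed.ravel())
```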

Discretization : Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns. There are several methods that you can use to discretize data.

Discretization using Uniform

Discretization using Kmeans
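Both strategies shown above are available through scikit-learn's KBinsDiscretizer; a sketch on made-up income values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

incomes = np.array([[20000.0], [35000.0], [50000.0], [90000.0], [200000.0]])

# 'uniform': bins of equal width over the value range
uniform = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
print(uniform.fit_transform(incomes).ravel())

# 'kmeans': bin edges chosen by 1-D k-means clustering
kmeans = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
print(kmeans.fit_transform(incomes).ravel())
```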
