
Scaling and Preprocessing Data with Scikit-Learn for Machine Learning

Data scaling is a preprocessing step for numerical features. Many machine learning algorithms, such as gradient-descent-based methods, the KNN algorithm, linear and logistic regression, and Principal Component Analysis (PCA), require data scaling to produce good results.

Explore more at https://scikit-learn.org

  1. Standard Scaling (StandardScaler)
  2. Min-Max Scaling (MinMaxScaler)
  3. Max-Abs Scaling (MaxAbsScaler)
  4. Robust Scaling (RobustScaler)
  5. Power Transformation (Yeo-Johnson) PowerTransformer(method="yeo-johnson")
  6. Power Transformation (Box-Cox) PowerTransformer(method="box-cox")
  7. Quantile Transformation (Uniform pdf) QuantileTransformer(output_distribution="uniform")
  8. Quantile Transformation (Gaussian pdf) QuantileTransformer(output_distribution="normal")
  9. Sample-wise L2 Normalizing (Normalizer)
  10. Binarization (Binarizer)
  11. Spline Transformation (SplineTransformer)
import os
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.gridspec import GridSpec
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

from sklearn.datasets import fetch_california_housing
os.chdir("/Users/lrobinson/Python_Projects")
# plotting utilities from a local helper module (not part of scikit-learn)
from normalize_plot_0624 import make_plot
from normalize_plot_0624 import getdata

Loading California Housing

For this study, let's take the California housing dataset as an example. The goal is to compare the data with and without scaling and to observe how outliers affect each method.

california_housing = fetch_california_housing(as_frame=True)
print("\n\n========================DATA TYPE==============================\n\n")
print(type(california_housing))
print("\n\n========================DESCRIPTION==============================\n\n")
print(dir(california_housing))
========================DATA TYPE==============================


<class 'sklearn.utils._bunch.Bunch'>


========================DESCRIPTION==============================


['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
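
Because as_frame=True was passed, the Bunch exposes the data as pandas objects. A few quick checks of the attributes listed above:

print(california_housing.frame.shape)    # (20640, 9): 8 feature columns plus the target
print(california_housing.feature_names)  # the 8 predictive attributes
print(california_housing.target_names)   # ['MedHouseVal']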

Data Insights

Data Description and Summary

print("\n\n========================DATA DESCRIPTION==============================\n\n")
print(california_housing.DESCR)
========================DATA DESCRIPTION==============================


.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
data = pd.DataFrame(data= np.c_[california_housing['data'], california_housing['target']],
                     columns= california_housing['feature_names'] + california_housing['target_names'])
data.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
print("\n\n==================================DATA SUMMARY==============================\n\n")
data.describe()
==================================DATA SUMMARY==============================
             MedInc      HouseAge      AveRooms     AveBedrms    Population      AveOccup      Latitude     Longitude   MedHouseVal
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000
mean       3.870671     28.639486      5.429000      1.096675   1425.476744      3.070655     35.631861   -119.569704      2.068558
std        1.899822     12.585558      2.474173      0.473911   1132.462122     10.386050      2.135952      2.003532      1.153956
min        0.499900      1.000000      0.846154      0.333333      3.000000      0.692308     32.540000   -124.350000      0.149990
25%        2.563400     18.000000      4.440716      1.006079    787.000000      2.429741     33.930000   -121.800000      1.196000
50%        3.534800     29.000000      5.229129      1.048780   1166.000000      2.818116     34.260000   -118.490000      1.797000
75%        4.743250     37.000000      6.052381      1.099526   1725.000000      3.282261     37.710000   -118.010000      2.647250
max       15.000100     52.000000    141.909091     34.066667  35682.000000   1243.333333     41.950000   -114.310000      5.000010

Data Visualization with seaborn

Original Data

interest_attr = ["MedInc","AveRooms", "AveBedrms", "AveOccup", "Population", "MedHouseVal"]
subset = data[interest_attr]
subset.describe()
_ = sns.pairplot(data=subset, hue="MedHouseVal", palette="plasma", plot_kws=dict(linewidth=0))

Quantizing the target MedHouseVal and using interval midpoints

# Quantize the target MedHouseVal into 6 equal-frequency bins
subset = subset.copy()  # avoid SettingWithCopyWarning when reassigning the column below
subset["MedHouseVal"] = pd.qcut(subset["MedHouseVal"], q=6, precision=0)
# use the midpoint of each interval as the hue value
subset["MedHouseVal"] = subset["MedHouseVal"].apply(lambda x: x.mid)
_ = sns.pairplot(data=subset, hue="MedHouseVal", palette="plasma", plot_kws=dict(linewidth=0))

Scaling Processing

The Original Data with and without extreme values

I use "Median income" and "Average house occupancy", with the point colors mapped to the target (median house value), to demonstrate the effect of each scaling method.

values = ["Median income", "Median house age", "Average number of rooms", "Average number of bedrooms", 
           "Population", "Average house occupancy", "House latitude", "House longitude"]
keys = california_housing.feature_names
feature_mapping = dict(zip(keys, values))
#interest_attr = ["MedInc","AveRooms", "AveBedrms", "AveOccup", "Population", "MedHouseVal"]
make_plot(0, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "ocean_r")
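
The make_plot helper comes from my local normalize_plot_0624 module, which is not shown in this post. For readers who want to reproduce similar figures, here is a simplified sketch of the idea, relying on the imports at the top of this post; it assumes the first argument indexes the transformers in the same order as the numbered list above (0 for the unscaled data), and it keeps the plotting intentionally minimal.

# Simplified sketch of the helper; not the actual normalize_plot_0624 implementation.
def make_plot(idx, dataset, feature_mapping, xvar, yvar, color):
    transformers = [
        None,                                                # 0: original data
        StandardScaler(),                                    # 1
        MinMaxScaler(),                                      # 2
        MaxAbsScaler(),                                      # 3
        RobustScaler(quantile_range=(25, 75)),               # 4
        PowerTransformer(method="yeo-johnson"),              # 5
        PowerTransformer(method="box-cox"),                  # 6: strictly positive features only
        QuantileTransformer(output_distribution="uniform"),  # 7
        QuantileTransformer(output_distribution="normal"),   # 8
        Normalizer(),                                        # 9
    ]
    X = dataset.data[[xvar, yvar]]
    y = dataset.target
    if transformers[idx] is not None:
        X = pd.DataFrame(transformers[idx].fit_transform(X), columns=[xvar, yvar])
    fig, ax = plt.subplots(figsize=(6, 5))
    points = ax.scatter(X[xvar], X[yvar], c=y, cmap=color, s=5)
    ax.set_xlabel(feature_mapping[xvar])
    ax.set_ylabel(feature_mapping[yvar])
    fig.colorbar(points, ax=ax, label="Median house value ($100k)")
    plt.show()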

1. Standard Scaling (StandardScaler)

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
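As a quick check of the formula above, this minimal snippet fits a StandardScaler on the two features used in the plot below (MedInc and AveOccup) and verifies that the result has zero mean and unit variance:

scaler = StandardScaler()
X = california_housing.data[["MedInc", "AveOccup"]]
X_scaled = scaler.fit_transform(X)       # applies z = (x - u) / s column by column

print(scaler.mean_)                      # per-feature mean u learned during fit
print(scaler.scale_)                     # per-feature standard deviation s
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))  # approximately 0 and 1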

make_plot(1, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")

2. Min-Max Scaling (MinMaxScaler)

If your data consists of attributes with different scales, many machine learning algorithms can benefit from rescaling the attributes so that they are all on the same scale.

Normalization using MinMaxScaler rescales each attribute to the range between 0 and 1. This is useful for optimization algorithms used at the core of machine learning, such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and algorithms that use distance measures, such as K nearest neighbors.
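
A minimal example with MinMaxScaler on the two features plotted below (MedInc and Population); after fitting, every column lies in [0, 1]:

scaler = MinMaxScaler()                  # default feature_range=(0, 1)
X = california_housing.data[["MedInc", "Population"]]
X_scaled = scaler.fit_transform(X)       # (x - min) / (max - min), per feature

print(scaler.data_min_, scaler.data_max_)          # per-feature min and max from fit
print(X_scaled.min(axis=0), X_scaled.max(axis=0))  # [0. 0.] [1. 1.]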

make_plot(2, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")

3. Max-Abs Scaling (MaxAbsScaler)

Scale each feature by its maximum absolute value.

This estimator scales each feature individually so that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
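
A minimal example with MaxAbsScaler on MedInc and AveOccup; each column is simply divided by its maximum absolute value, so no centering takes place:

scaler = MaxAbsScaler()
X = california_housing.data[["MedInc", "AveOccup"]]
X_scaled = scaler.fit_transform(X)       # x / max(|x|), per feature

print(scaler.max_abs_)                   # per-feature maximum absolute value
print(np.abs(X_scaled).max(axis=0))      # [1. 1.]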

make_plot(3, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")

4. Robust Scaling (RobustScaler)

Scale features using statistics that are robust to outliers.

This scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
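A minimal example with RobustScaler on MedInc and Population, with the default IQR-based quantile range spelled out:

scaler = RobustScaler(quantile_range=(25.0, 75.0))   # defaults, shown explicitly
X = california_housing.data[["MedInc", "Population"]]
X_scaled = scaler.fit_transform(X)       # (x - median) / IQR, per feature

print(scaler.center_)                    # per-feature median
print(scaler.scale_)                     # per-feature interquartile range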

make_plot(4, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "summer_r")

5. Power Transformation (Yeo-Johnson)

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.
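A minimal example with PowerTransformer on MedInc and Population; lambdas_ holds the exponent estimated for each feature, and the same call with method="box-cox" works here only because both features are strictly positive:

pt = PowerTransformer(method="yeo-johnson", standardize=True)   # standardize=True is the default
X = california_housing.data[["MedInc", "Population"]]
X_trans = pt.fit_transform(X)

print(pt.lambdas_)                                  # lambda estimated per feature by maximum likelihood
print(X_trans.mean(axis=0), X_trans.std(axis=0))    # ~0 and ~1 because standardize=True

pt_boxcox = PowerTransformer(method="box-cox")
X_trans_bc = pt_boxcox.fit_transform(X)             # valid only for strictly positive features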

make_plot(5, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")

6. Power Transformation (Box-Cox)

make_plot(6, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")

7. Quantile Transformation (Uniform pdf)

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
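A minimal example with QuantileTransformer on MedInc and AveOccup, showing both output distributions used in this section and the next:

qt_uniform = QuantileTransformer(output_distribution="uniform", n_quantiles=1000, random_state=0)
X = california_housing.data[["MedInc", "AveOccup"]]
X_uniform = qt_uniform.fit_transform(X)  # ranks mapped into [0, 1] via the empirical CDF

qt_normal = QuantileTransformer(output_distribution="normal", n_quantiles=1000, random_state=0)
X_normal = qt_normal.fit_transform(X)    # same ranks pushed through the Gaussian quantile function
# random_state fixes the row subsampling used to estimate the quantiles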

make_plot(7, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")

8. Quantile Transformation (Gaussian pdf)

make_plot(8, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "ocean_r")

9. Sample-wise L2 Normalizing (Normalizer)

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
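A minimal example with Normalizer on MedInc and AveOccup; note that it works row by row (per sample), not column by column like the scalers above:

normalizer = Normalizer(norm="l2")
X = california_housing.data[["MedInc", "AveOccup"]]
X_norm = normalizer.fit_transform(X)     # each row divided by its own Euclidean norm

print(np.linalg.norm(X_norm, axis=1))    # every row now has unit L2 norm (all 1.0)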

make_plot(9, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup",  color = "ocean_r")

Thank you for reading


This post is licensed under CC BY 4.0 by the author.
