Data scaling is a preprocessing step for numerical features. Many machine learning algorithms, such as gradient-descent-based methods, the KNN algorithm, linear and logistic regression, and Principal Component Analysis (PCA), require data scaling for good results.
Explore more at https://scikit-learn.org
- Standard Scaling (StandardScaler)
- Min-Max Scaling (MinMaxScaler)
- Max-Abs Scaling (MaxAbsScaler)
- Robust Scaling (RobustScaler)
- Power Transformation (Yeo-Johnson) PowerTransformer(method="yeo-johnson")
- Power Transformation (Box-Cox) PowerTransformer(method="box-cox")
- Quantile Transformation (Uniform pdf) QuantileTransformer(output_distribution="uniform")
- Quantile Transformation (Gaussian pdf) QuantileTransformer(output_distribution="normal")
- Sample-wise L2 Normalizing (Normalizer)
- Binarize (Binarizer)
- Spline Transformation (SplineTransformer)
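The list above maps directly to scikit-learn transformer classes. As a quick reference, here is a minimal sketch that instantiates each one; the dictionary keys and parameter choices are illustrative, not prescribed by the library (SplineTransformer needs scikit-learn >= 1.0, and Box-Cox only accepts strictly positive input).

# A minimal sketch: one instance of each transformer listed above.
# The dictionary keys are illustrative labels, not part of any library API.
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, MaxAbsScaler, RobustScaler,
    PowerTransformer, QuantileTransformer, Normalizer, Binarizer, SplineTransformer,
)

scalers = {
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "max-abs": MaxAbsScaler(),
    "robust": RobustScaler(quantile_range=(25.0, 75.0)),
    "yeo-johnson": PowerTransformer(method="yeo-johnson"),
    "box-cox": PowerTransformer(method="box-cox"),  # requires strictly positive input
    "quantile-uniform": QuantileTransformer(output_distribution="uniform"),
    "quantile-normal": QuantileTransformer(output_distribution="normal"),
    "l2-normalizer": Normalizer(norm="l2"),
    "binarizer": Binarizer(threshold=0.0),
    "spline": SplineTransformer(degree=3, n_knots=5),
}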
import os

# Data handling
import pandas as pd
import numpy as np

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.gridspec import GridSpec
import seaborn as sns

# Scalers and transformers compared in this post
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import minmax_scale
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

# Example dataset
from sklearn.datasets import fetch_california_housing

# Local plotting helpers
os.chdir("/Users/lrobinson/Python_Projects")
from normalize_plot_0624 import make_plot
from normalize_plot_0624 import getdata
Loading California Housing
For this study, let's take the California housing dataset as an example. The goal is to compare the data with and without scaling and to observe how the presence of outliers affects each scaling method.
california_housing = fetch_california_housing(as_frame=True)
print("\n\n========================DATA TYPE==============================\n\n")
print(type(california_housing))
print("\n\n========================DESCRIPTION==============================\n\n")
print(dir(california_housing))
========================DATA TYPE==============================
<class 'sklearn.utils._bunch.Bunch'>
========================DESCRIPTION==============================
['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
Data Insights
Data Description and Summary
print("\n\n========================DATA DESCRIPTION==============================\n\n")
print(california_housing.DESCR)
========================DATA DESCRIPTION==============================
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
data = pd.DataFrame(data= np.c_[california_housing['data'], california_housing['target']],
columns= california_housing['feature_names'] + california_housing['target_names'])
data.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
print("\n\n==================================DATA SUMMARY==============================\n\n")
data.describe()
==================================DATA SUMMARY==============================
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 | 0.149990 |
| 25% | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 | 1.196000 |
| 50% | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 | 1.797000 |
| 75% | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 | 2.647250 |
| max | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 | 5.000010 |
Data Visualization with seaborn
Original Data
interest_attr = ["MedInc", "AveRooms", "AveBedrms", "AveOccup", "Population", "MedHouseVal"]
subset = data[interest_attr].copy()  # copy so the later .loc assignments do not trigger SettingWithCopyWarning
subset.describe()
_ = sns.pairplot(data=subset, hue="MedHouseVal", palette="plasma", plot_kws=dict(linewidth=0))
Quantizing the target MedHouseVal and using the interval midpoints
# Quantize the target MedHouseVal
subset.loc[:,"MedHouseVal"] = pd.qcut(subset["MedHouseVal"], q = 6, precision=0)
# using midpoint
subset.loc[:,"MedHouseVal"] = subset["MedHouseVal"].apply(lambda x: x.mid)
_ = sns.pairplot(data=subset, hue="MedHouseVal", palette="plasma", plot_kws=dict(linewidth=0))
Scaling Processing
The Original Data with and without extreme values
I used "Median income" and "Average house occupancy" with mapping color of target values Median House Values to demonstrate the affect of scaling processing
values = ["Median income", "Median house age", "Average number of rooms", "Average number of bedrooms",
"Population", "Average house occupancy", "House latitude", "House longitude"]
keys = california_housing.feature_names
feature_mapping = dict(zip(keys, values))
#interest_attr = ["MedInc","AveRooms", "AveBedrms", "AveOccup", "Population", "MedHouseVal"]
make_plot(0, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "ocean_r")
1. Standard Scaling (StandardScaler)
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
make_plot(1, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")
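For a quick check, here is a minimal sketch (using the data frame built above; the variable names are illustrative, and fitting on the full dataset is done purely for demonstration, whereas in practice you would fit on the training set only) that applies StandardScaler directly and verifies the zero-mean, unit-variance result:

# Minimal sketch: standardize the two plotted features directly.
scaler = StandardScaler()
X = data[["MedInc", "AveOccup"]]
X_std = scaler.fit_transform(X)           # z = (x - u) / s, per feature
print(X_std.mean(axis=0).round(3))        # approximately [0, 0]
print(X_std.std(axis=0).round(3))         # approximately [1, 1]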
2. Min-Max Scaling (MinMaxScaler)
If your data consists of attributes with different scales, many machine learning algorithms can benefit from rescaling the attributes so that they are all on the same scale.
Normalization with MinMaxScaler rescales each attribute to the range 0 to 1. This is useful for optimization algorithms used at the core of machine learning algorithms, such as gradient descent. It is also useful for algorithms that weight inputs, such as regression and neural networks, and for algorithms that use distance measures, such as K-nearest neighbors.
make_plot(2, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")
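A minimal sketch of applying MinMaxScaler to the same features (again fitting on the full dataset only for illustration; variable names are mine):

# Minimal sketch: rescale the plotted features to the [0, 1] range.
scaler = MinMaxScaler(feature_range=(0, 1))
X = data[["MedInc", "Population"]]
X_minmax = scaler.fit_transform(X)                   # (x - min) / (max - min), per feature
print(X_minmax.min(axis=0), X_minmax.max(axis=0))    # [0. 0.] [1. 1.]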
3. Max-Abs Scaling (MaxAbsScaler)
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
make_plot(3, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")
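A minimal sketch of MaxAbsScaler on the same features (illustrative variable names; since these features are non-negative, this amounts to dividing by each feature's maximum):

# Minimal sketch: divide each feature by its maximum absolute value.
scaler = MaxAbsScaler()
X = data[["MedInc", "AveOccup"]]
X_maxabs = scaler.fit_transform(X)        # x / max(|x|), per feature
print(np.abs(X_maxabs).max(axis=0))       # [1. 1.]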
4. Robust Scaling (RobustScaler)
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform method.
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
make_plot(4, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "summer_r")
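A minimal sketch of RobustScaler on the same features (illustrative variable names; fit on the full dataset only for demonstration):

# Minimal sketch: center on the median and scale by the interquartile range.
scaler = RobustScaler(quantile_range=(25.0, 75.0))
X = data[["MedInc", "Population"]]
X_robust = scaler.fit_transform(X)               # (x - median) / IQR, per feature
print(np.median(X_robust, axis=0).round(3))      # approximately [0, 0]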
5. Power Transformation (Yeo-Johnson)
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
make_plot(5, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")
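A minimal sketch of the Yeo-Johnson transform on the same features (illustrative variable names; the fitted lambdas are estimated by maximum likelihood as described above):

# Minimal sketch: make the skewed features more Gaussian-like with Yeo-Johnson.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X = data[["MedInc", "Population"]]
X_yj = pt.fit_transform(X)
print(pt.lambdas_)    # fitted lambda per feature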
6. Power Transformation (Box-Cox)
make_plot(6, california_housing, feature_mapping, xvar = "MedInc", yvar = "Population", color = "ocean_r")
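Because Box-Cox only accepts strictly positive input, a minimal sketch (illustrative variable names) would guard the data before transforming; both features used here are strictly positive in this dataset:

# Minimal sketch: Box-Cox works only on strictly positive features,
# so check the input before transforming.
X = data[["MedInc", "Population"]]
assert (X > 0).all().all(), "Box-Cox requires strictly positive input"
pt = PowerTransformer(method="box-cox", standardize=True)
X_bc = pt.fit_transform(X)
print(pt.lambdas_)    # fitted lambda per feature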
7. Quantile Transformation (Uniform pdf)
Transform features using quantiles information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
make_plot(7, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "plasma_r")
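A minimal sketch of the uniform quantile transform on the same features (illustrative variable names; the random_state only affects subsampling of the quantile estimate):

# Minimal sketch: map each feature to a uniform distribution via its empirical CDF.
qt = QuantileTransformer(output_distribution="uniform", n_quantiles=1000, random_state=0)
X = data[["MedInc", "AveOccup"]]
X_uniform = qt.fit_transform(X)
print(X_uniform.min(axis=0), X_uniform.max(axis=0))   # bounded to [0, 1]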
8. Quantile Transformation (Gaussian pdf)
make_plot(8, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "ocean_r")
9. Sample-wise L2 Normalizing (Normalizer)
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.
This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).
Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
make_plot(9, california_housing, feature_mapping, xvar = "MedInc", yvar = "AveOccup", color = "ocean_r")
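Unlike the feature-wise scalers above, Normalizer works per sample. A minimal sketch (illustrative variable names) that rescales each row to unit L2 norm and verifies the result:

# Minimal sketch: rescale each row (sample) to unit L2 norm.
normalizer = Normalizer(norm="l2")
X = data[["MedInc", "AveOccup"]]
X_l2 = normalizer.fit_transform(X)
print(np.linalg.norm(X_l2, axis=1)[:5])   # every row norm is 1.0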