Transforming Data
Data Transformation in a statistics context means the application of a mathematical expression to each point in the data. In contrast, in a Data Engineering context, Transformation can also mean converting data from one format to another as part of an extract, transform, load (ETL) process.
It is important to know what we are talking about when we use the term transformation. Transformation, normalization and standardization are often used interchangeably and wrongly so.
Normalization is the process of scaling with respect to the entire data range so that the data has a range from 0 to 1.
Standardization is the process of transforming with respect to the mean and standard deviation of the entire data so that the data has a mean of 0 and a standard deviation of 1. If the data was normally distributed to begin with, its distribution is now a Standard Normal Distribution.
Transformation is the application of the same calculation to every point of the data separately.
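As a concrete illustration, the two definitions above could be written as small R functions (the function names are just placeholders, not from any package):

```r
# Normalization: rescale to the range [0, 1] using the data's min and max
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

# Standardization: center to mean 0 and scale to standard deviation 1
standardize <- function(x) (x - mean(x)) / sd(x)
```

In base R, scale(x) performs the same standardization as the second function.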
Standardization transforms the data to follow a Standard Normal Distribution (left graph). Normalization and Standardization can be seen as special cases of Transformation. To demonstrate the difference between a standard normal distribution and a normal distribution, we simulate data and graph it:
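One way such a simulation might look in base R (the seed and the parameters of the second distribution are arbitrary choices):

```r
set.seed(42)
std_normal <- rnorm(10000, mean = 0, sd = 1)    # standard normal: mean 0, sd 1
normal     <- rnorm(10000, mean = 50, sd = 10)  # normal, but not standard

# Plot both densities side by side to compare them
par(mfrow = c(1, 2))
plot(density(std_normal), main = "Standard normal (mean 0, sd 1)")
plot(density(normal), main = "Normal (mean 50, sd 10)")
```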
To get insights, data is most often transformed so that it closely follows a normal distribution, either to meet statistical assumptions or to detect linear relationships with other variables. One of the first steps for those techniques is to check how closely the variables already follow a normal distribution.
It is common to inspect your data visually and/or check the assumption of normality with a statistical test.
To visually explore the distribution of your data, we will look at the density plot as well as a simple QQ-plot. The QQ-plot is an excellent tool for inspecting various properties of your data distribution and assessing if and how you need to transform your data. Here the quantiles of a perfect normal distribution are plotted against the quantiles of your data. Quantiles measure the data point at which a certain percentage of the data is included. For example, the data point of the 0.2 quantile is the point where 20% of the data is below and 80% is above. A reference line is drawn which indicates how the plot would look if your variable followed a perfect normal distribution. The closer the points in the QQ-plot are to this line, the more likely it is that your data follows a normal distribution and does not need additional transformation.
The following diagrams show simulated data with the density distribution and the corresponding QQ-plot. Four strong and typical deviations from a normal distribution are shown. Only for the normally distributed data is an additional statistical test for normality shown in the code snippet, for completeness.
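For the normally distributed case, the plots and the normality test could be produced in base R roughly like this (seed and sample size are arbitrary):

```r
set.seed(1)
x <- rnorm(500)  # simulated, normally distributed data

par(mfrow = c(1, 2))
plot(density(x), main = "Density")  # density plot of the data
qqnorm(x, main = "QQ-plot")         # data quantiles vs. normal quantiles
qqline(x)                           # reference line for a perfect normal fit

# Shapiro-Wilk test: a p-value above 0.05 means normality is not rejected
shapiro.test(x)
```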
To transform the data, apply one of the following expressions to every data point. The expressions are sorted from weakest to strongest effect. If your transformation of choice is too strong, you will end up with data skewed in the other direction.
Right (positive) skewed data:
Root ⁿ√x. Weakest transformation; it becomes stronger with a higher-order root. For negative numbers, special care needs to be taken with the sign when transforming (see the code sketch after this list).
Logarithm log(x). Commonly used transformation; its strength can be somewhat altered by the base of the logarithm. It cannot be used on negative numbers or 0; in that case you need to shift the entire data by adding at least |min(x)| + 1.
Reciprocal 1/x. Strongest transformation; it becomes stronger with higher exponents, e.g. 1/x³. It should not be applied to negative numbers or numbers close to zero, so the data should be shifted similarly to the log transform.
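A sketch of the three manual transformations, assuming a right-skewed numeric vector x (the simulated exponential data is only a stand-in):

```r
set.seed(1)
x <- rexp(1000, rate = 0.5)  # simulated right-skewed data

# Root: keep the sign so the transformation also works for negative values
x_root <- sign(x) * abs(x)^(1/3)

# Logarithm: shift by |min(x)| + 1 so every value is strictly positive
x_log <- log(x + abs(min(x)) + 1)

# Reciprocal: shift away from zero first, then invert
x_recip <- 1 / (x + abs(min(x)) + 1)
```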
There are various implementations of automatic transformations in R that choose the optimal transformation expression for you. They determine a lambda value, the power coefficient used to transform your data as close to a normal distribution as possible.
Lambert W x Gaussian transform. The R package LambertW has an implementation for automatically transforming heavy- or light-tailed data with Gaussianize().
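A minimal sketch, assuming the LambertW package is installed (the simulated Cauchy sample is only a stand-in for heavy-tailed data):

```r
library(LambertW)

set.seed(1)
y <- rcauchy(1000)         # heavy-tailed data
y_gauss <- Gaussianize(y)  # estimates the transformation and returns Gaussianized data
```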
Tukey’s Ladder of Powers. For skewed data, transformTukey() from the R package rcompanion uses Shapiro-Wilk tests iteratively to find the lambda value at which the data is closest to normality and transforms it. Left-skewed data should first be reflected to a right skew, and there should be no negative values.
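A minimal sketch, assuming the rcompanion package is installed (the exponential sample is a stand-in for right-skewed, non-negative data):

```r
library(rcompanion)

set.seed(1)
x <- rexp(1000, rate = 0.5)                   # right-skewed, strictly positive data
x_tukey <- transformTukey(x, plotit = FALSE)  # searches lambda via Shapiro-Wilk tests
```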
Box-Cox Transformation. BoxCox.lambda() from the R package forecast iteratively finds a lambda value that maximizes the log-likelihood of a linear model; it can also be used on a single variable with the model formula x ~ 1. The transformation with the resulting lambda value can then be applied with forecast's BoxCox(). There is also an implementation in the R package MASS. Standard Box-Cox cannot be used with negative values; the two-parameter Box-Cox, however, can.
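A minimal sketch with the forecast package, assuming it is installed (method = "loglik" selects the log-likelihood based estimate; the simulated data is a stand-in):

```r
library(forecast)

set.seed(1)
x <- rexp(1000, rate = 0.5)                    # strictly positive, right-skewed data
lambda <- BoxCox.lambda(x, method = "loglik")  # estimate the optimal lambda
x_bc <- BoxCox(x, lambda)                      # apply the transformation
```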
Yeo-Johnson Transformation. This can be seen as a useful extension of Box-Cox. It is the same as Box-Cox for non-negative values and handles negative and 0 values as well. There are various implementations in R packages as well as in the meta machine-learning framework tidymodels.
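A minimal sketch using step_YeoJohnson() from the recipes package (part of tidymodels), assuming it is installed; the data frame and column are made up for illustration:

```r
library(recipes)

set.seed(1)
df <- data.frame(x = c(rexp(500, rate = 0.5), -rexp(500, rate = 2)))  # mixed-sign data

rec <- recipe(~ x, data = df)
rec <- step_YeoJohnson(rec, x)             # estimate lambda and transform x
df_yj <- bake(prep(rec), new_data = NULL)  # prep() fits the step, bake() applies it
```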