Transforming Data (R)
Many analyses depend on data being normally distributed, and one common problem with your data is that it is skewed. Let's focus on the gapminder data from 2007 to see whether the GDP and life expectancy data are skewed, and how this could be addressed.
library(gapminder)
library(ggplot2)
# create a new data frame that only focuses on data from 2007
gapminder_2007 <- subset(
  gapminder, # the data set
  year == 2007
)
# Skewness and kurtosis and their standard errors as implemented by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
spssSkewKurtosis <- function(x) {
  w = length(x)
  m1 = mean(x)
  m2 = sum((x - m1)^2)
  m3 = sum((x - m1)^3)
  m4 = sum((x - m1)^4)
  s1 = sd(x)
  skew = w * m3 / (w - 1) / (w - 2) / s1^3
  sdskew = sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
  kurtosis = (w * (w + 1) * m4 - 3 * m2^2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * s1^4)
  sdkurtosis = sqrt(4 * (w^2 - 1) * sdskew^2 / ((w - 3) * (w + 5)))

  ## z-scores added by reading-psych
  zskew = skew / sdskew
  zkurtosis = kurtosis / sdkurtosis

  mat = matrix(c(skew, kurtosis, sdskew, sdkurtosis, zskew, zkurtosis), 2,
               dimnames = list(c("skew", "kurtosis"), c("estimate", "se", "zScore")))
  return(mat)
}

spssSkewKurtosis(gapminder_2007$gdpPercap)
estimate se zScore
skew 1.2241977 0.2034292 6.0178067
kurtosis 0.3500942 0.4041614 0.8662238
spssSkewKurtosis(gapminder_2007$lifeExp)
estimate se zScore
skew -0.6887771 0.2034292 -3.385832
kurtosis -0.8298204 0.4041614 -2.053191
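For reference, these are the SPSS formulas the function above implements, where \(W\) is the sample size, \(M_k = \sum_i (x_i - \bar{x})^k\), and \(s\) is the sample standard deviation:
\[
\text{skew} = \frac{W M_3}{(W-1)(W-2)\,s^3}, \qquad
se_{\text{skew}} = \sqrt{\frac{6W(W-1)}{(W-2)(W+1)(W+3)}}
\]
\[
\text{kurtosis} = \frac{W(W+1)M_4 - 3M_2^2(W-1)}{(W-1)(W-2)(W-3)\,s^4}, \qquad
se_{\text{kurtosis}} = \sqrt{\frac{4(W^2-1)\,se_{\text{skew}}^2}{(W-3)(W+5)}}
\]
The z-scores are simply each estimate divided by its standard error.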
import pandas as pd
import numpy as np
from gapminder import gapminder
# Filter data for the year 2007
gapminder_2007 = gapminder.loc[gapminder['year'] == 2007].copy()  # .copy() avoids a SettingWithCopyWarning when we add columns later
# Define a function to calculate skewness and kurtosis with their standard errors
def spssSkewKurtosis(x):
    w = len(x)
    m1 = np.mean(x)
    m2 = np.sum((x - m1) ** 2)
    m3 = np.sum((x - m1) ** 3)
    m4 = np.sum((x - m1) ** 4)
    s1 = np.std(x, ddof=1)  # sample sd (ddof=1), to match R's sd()
    skew = w * m3 / ((w - 1) * (w - 2) * s1 ** 3)
    sdskew = np.sqrt(6 * w * (w - 1) / ((w - 2) * (w + 1) * (w + 3)))
    kurtosis = (w * (w + 1) * m4 - 3 * m2 ** 2 * (w - 1)) / ((w - 1) * (w - 2) * (w - 3) * s1 ** 4)
    sdkurtosis = np.sqrt(4 * (w ** 2 - 1) * sdskew ** 2 / ((w - 3) * (w + 5)))

    # Calculate z-scores
    zskew = skew / sdskew
    zkurtosis = kurtosis / sdkurtosis

    # Create a DataFrame for the results
    result_df = pd.DataFrame({
        "estimate": [skew, kurtosis],
        "se": [sdskew, sdkurtosis],
        "zScore": [zskew, zkurtosis]
    }, index=["skew", "kurtosis"])
    return result_df
# Calculate skewness and kurtosis for 'gdpPercap' and 'lifeExp' in 2007
gdpPercap_results = spssSkewKurtosis(gapminder_2007['gdpPercap'])
lifeExp_results = spssSkewKurtosis(gapminder_2007['lifeExp'])
print("Skewness and Kurtosis for 'gdpPercap' in 2007:")
print(gdpPercap_results)
print("\nSkewness and Kurtosis for 'lifeExp' in 2007:")
print(lifeExp_results)
So it looks like both GDP and life expectancy are skewed, as the absolute values of their skew z-scores are greater than 1.96 (the two-tailed cut-off for significance at p < .05). Let's double-check with a quick plot:
plot(
  gapminder_2007$gdpPercap,
  gapminder_2007$lifeExp
)
import matplotlib.pyplot as plt
# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'], alpha=0.6)
plt.title('Scatter Plot of GDP per Capita vs. Life Expectancy (2007)')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.grid(True)
# Show the plot
plt.show()
It's relatively easy to see the skewness of GDP, but the skew in life expectancy is a bit more subtle. As the data are skewed, we may want to transform them to make them less skewed.
We can apply a logarithmic transformation to reduce the skewness, so let's do that to both variables and then replot the data:
gapminder_2007$gdpPercap_log <- log(gapminder_2007$gdpPercap)
gapminder_2007$lifeExp_log <- log(gapminder_2007$lifeExp)

plot(
  gapminder_2007$gdpPercap_log,
  gapminder_2007$lifeExp_log
)
# Calculate the logarithms of 'gdpPercap' and 'lifeExp'
gapminder_2007['gdpPercap_log'] = np.log(gapminder_2007['gdpPercap'])
gapminder_2007['lifeExp_log'] = np.log(gapminder_2007['lifeExp'])

# Create a scatter plot of the log-transformed variables
plt.figure(figsize=(10, 6))
plt.scatter(gapminder_2007['gdpPercap_log'], gapminder_2007['lifeExp_log'], alpha=0.6)
plt.title('Scatter Plot of Log(GDP per Capita) vs. Log(Life Expectancy) (2007)')
plt.xlabel('Log(GDP per Capita)')
plt.ylabel('Log(Life Expectancy)')
plt.grid(True)
# Show the plot
plt.show()
Let's check whether the skewness has changed for GDP:
# original gdp
spssSkewKurtosis(gapminder_2007$gdpPercap)
estimate se zScore
skew 1.2241977 0.2034292 6.0178067
kurtosis 0.3500942 0.4041614 0.8662238
# transformed gdp (log)
spssSkewKurtosis(gapminder_2007$gdpPercap_log)
estimate se zScore
skew -0.1540524 0.2034292 -0.7572778
kurtosis -1.1256815 0.4041614 -2.7852277
# original gdp
spssSkewKurtosis(gapminder_2007['gdpPercap'])

# transformed gdp (log)
spssSkewKurtosis(gapminder_2007['gdpPercap_log'])
So transforming GDP did reduce skewness but made the kurtosis more extreme, so beware that applying a transformation may cause other problems! Let's check whether the log transformation reduced skewness for life expectancy:
# original life expectancy
spssSkewKurtosis(gapminder_2007$lifeExp)
estimate se zScore
skew -0.6887771 0.2034292 -3.385832
kurtosis -0.8298204 0.4041614 -2.053191
# transformed life expectancy (log)
spssSkewKurtosis(gapminder_2007$lifeExp_log)
estimate se zScore
skew -0.9043617 0.2034292 -4.445584
kurtosis -0.4136699 0.4041614 -1.023527
# original life expectancy
spssSkewKurtosis(gapminder_2007['lifeExp'])

# transformed life expectancy (log)
spssSkewKurtosis(gapminder_2007['lifeExp_log'])
Seems like the answer is no; if anything, the skew z-score became more extreme. Because life expectancy is negatively skewed, the log (which compresses high values) pulls the distribution even further in the wrong direction.
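A commonly suggested remedy for negatively skewed data, not part of the original analysis here, is to reflect the variable so its long tail points to the right and then take the log (lifeExp_reflected is our own illustrative name). You could check the result with the same function (output not shown):
# reflect so the highest value becomes 1, then log - a sketch for negatively skewed data
lifeExp_reflected <- max(gapminder_2007$lifeExp) + 1 - gapminder_2007$lifeExp
spssSkewKurtosis(log(lifeExp_reflected))
Bear in mind that reflecting a variable reverses the direction of any correlations involving it.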
An important question is whether the associations between your variables change after transformation, so let’s check that next:
# correlation on original data
cor.test(
  gapminder_2007$gdpPercap,
  gapminder_2007$lifeExp
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap and gapminder_2007$lifeExp
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# correlation on transformed data
cor.test(
  gapminder_2007$gdpPercap_log,
  gapminder_2007$lifeExp_log
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap_log and gapminder_2007$lifeExp_log
t = 14.752, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7060729 0.8372165
sample estimates:
cor
0.7800706
from scipy import stats
# Perform a correlation test on the original data
correlation_original = stats.pearsonr(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'])

# Perform a correlation test on the transformed data
correlation_transformed = stats.pearsonr(gapminder_2007['gdpPercap_log'], gapminder_2007['lifeExp_log'])
# Print the correlation results
print("Correlation on Original Data:")
print("Pearson correlation coefficient:", correlation_original[0])
print("p-value:", correlation_original[1])
print("\nCorrelation on Transformed Data:")
print("Pearson correlation coefficient:", correlation_transformed[0])
print("p-value:", correlation_transformed[1])
The log-transformed variables are more strongly associated with each other than the original variables. However, not all transformations will change associations between variables.
Linear vs. non-linear transformations
Linear transformations include adding a constant to a variable, subtracting one from it, or multiplying or dividing it by one. These transformations change the absolute values, but not the shape of the variable's distribution. Let's use life expectancy to illustrate how linear transformations change the absolute values without changing the distribution.
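In symbols: if \(a\) and \(b\) are constants and \(X\) is a variable, then
\[ \text{mean}(a + bX) = a + b \cdot \text{mean}(X), \qquad \text{SD}(a + bX) = |b| \cdot \text{SD}(X), \]
so a linear transformation slides and stretches the distribution but cannot change its shape.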
Additive transformations
If you added 100 to the life expectancy for all countries, you would change the absolute value:
# before transformation
mean(gapminder_2007$lifeExp)
[1] 67.00742
# after transformation
mean(gapminder_2007$lifeExp + 100)
[1] 167.0074
# before transformation
gapminder_2007['lifeExp'].mean()

# after transformation
np.mean(gapminder_2007['lifeExp'] + 100)
67.00742253521126
167.00742253521128
There's a big difference between the means, but all we've done is shift the distribution up by 100; we haven't made it wider or thinner:
# before transformation
sd(gapminder_2007$lifeExp)
[1] 12.07302
# after transformation
sd(gapminder_2007$lifeExp + 100)
[1] 12.07302
# before transformation [use ddof =1 for sample sd, and ddof=0 for population sd]
np.std(gapminder_2007['lifeExp'], ddof=1)

# after transformation [use ddof=1 for sample sd, and ddof=0 for population sd]
np.std(gapminder_2007['lifeExp'] + 100, ddof=1)
12.07302050222512
12.07302050222512
If we were to visualise this transformation:
life_exp_before_after <- data.frame(
  life_exp = c(gapminder_2007$lifeExp, gapminder_2007$lifeExp + 100),
  transformed = c(rep("before", each = 142), rep("after", each = 142))
)
ggplot(life_exp_before_after, aes(x = life_exp, fill = transformed)) +
  geom_histogram(binwidth = 2, alpha = .5, position = "identity") +
  ggtitle("Before vs. after additive transformation")
# Create a DataFrame for 'lifeExp' before and after the transformation
life_exp_before_after = pd.DataFrame({
    'life_exp': np.concatenate([gapminder_2007['lifeExp'], gapminder_2007['lifeExp'] + 100]),
    'transformed': np.concatenate([np.repeat('before', len(gapminder_2007)), np.repeat('after', len(gapminder_2007))])
})

# Create a histogram
plt.figure(figsize=(10, 6))
plt.hist(life_exp_before_after[life_exp_before_after['transformed'] == 'before']['life_exp'],
         bins=range(0, 120, 2), alpha=0.5, label='Before Transformation', color='blue')
plt.hist(life_exp_before_after[life_exp_before_after['transformed'] == 'after']['life_exp'],
         bins=range(0, 220, 2), alpha=0.5, label='After Transformation', color='green')
plt.title("Before vs. After Additive Transformation")
plt.xlabel("Life Expectancy")
plt.legend()
# Show the histogram
plt.show()
We can see above that there is no difference in the shape of the distributions, only a shift. You would get the same shifted pattern if you had subtracted from the original data instead. As a result, any association between the transformed variable and another variable will be the same as before the transformation, because the shapes of the distributions are unchanged.
Multiplicative transformations
If you multiplied the life expectancy by 1.5 then you would change both the mean
# before transformation
mean(gapminder_2007$lifeExp)
[1] 67.00742
# after transformation
mean(gapminder_2007$lifeExp * 1.5)
[1] 100.5111
# before transformation
gapminder_2007['lifeExp'].mean()

# after transformation
np.mean(gapminder_2007['lifeExp'] * 1.5)
67.00742253521126
100.51113380281689
and SD of life expectancy
# before transformation
sd(gapminder_2007$lifeExp)
[1] 12.07302
# after transformation
sd(gapminder_2007$lifeExp * 1.5)
[1] 18.10953
# before transformation [use ddof =1 for sample sd, and ddof=0 for population sd]
np.std(gapminder_2007['lifeExp'], ddof=1)

# after transformation [use ddof=1 for sample sd, and ddof=0 for population sd]
np.std(gapminder_2007['lifeExp'] * 1.5, ddof=1)
12.07302050222512
18.109530753337683
We established above that changing the mean isn't sufficient to change the shape of a distribution, but would changing the standard deviation change the shape of the distribution? Let's put histograms of the two distributions side by side to evaluate this:
par(mfrow = c(1,2),
mar = c(0,0,2,1))
hist(gapminder_2007$lifeExp, breaks = seq(min(gapminder_2007$lifeExp), max(gapminder_2007$lifeExp), length.out = 11), main = "Original")
hist(gapminder_2007$lifeExp*1.5, breaks = seq(min(gapminder_2007$lifeExp*1.5), max(gapminder_2007$lifeExp*1.5), length.out = 11), main = "Original * 1.5")
# Create subplots with two histograms
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the original 'lifeExp' histogram
axs[0].hist(gapminder_2007['lifeExp'], bins=range(0, 120, 2), alpha=0.5, color='blue')
axs[0].set_title("Original")
axs[0].set_xlabel("Life Expectancy")
axs[0].set_ylabel("Frequency")

# Plot the 'lifeExp' * 1.5 histogram
axs[1].hist(gapminder_2007['lifeExp'] * 1.5, bins=range(0, 120, 2), color='green', alpha=0.5)
axs[1].set_title("Original * 1.5")
axs[1].set_xlabel("Life Expectancy")
axs[1].set_ylabel("Frequency")
# Adjust layout
plt.tight_layout()
# Show the histograms
plt.show()
We can see that the shape/pattern of the distribution is the same, and so the association between the transformed variable and other variables will stay the same after transformation. This is because associations between variables ignore the scale of either variable.
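This can be stated more formally: for constants \(a\), \(b\), \(c\), \(d\) (with \(b \neq 0\) and \(d \neq 0\)), Pearson's correlation satisfies
\[ r(a + bX,\ c + dY) = \text{sign}(bd) \cdot r(X, Y), \]
so any linear transformation leaves the correlation untouched, except that multiplying one variable by a negative constant flips the sign.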
Non-linear transformations
Unlike linear transformations, non-linear transformations change the shape of distributions. There are a wide variety of non-linear transformations you could apply to a variable, such as…
Square (\(^2\))
par(mfrow = c(1,2),
mar = c(0,0,2,1))
hist(gapminder_2007$gdpPercap, main = "Original")
hist(gapminder_2007$gdpPercap^2, main = "Squared")
# Create subplots with two histograms
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the original 'gdpPercap' histogram
axs[0].hist(gapminder_2007['gdpPercap'], alpha=0.5, color='blue')
axs[0].set_title("Original")
axs[0].set_xlabel("GDP per Capita")
axs[0].set_ylabel("Frequency")

# Plot the squared 'gdpPercap' histogram
axs[1].hist(np.square(gapminder_2007['gdpPercap']), color='green', alpha=0.5)
axs[1].set_title("Squared")
axs[1].set_xlabel("GDP per Capita")
axs[1].set_ylabel("Frequency")
# Adjust layout
plt.tight_layout()
# Show the histograms
plt.show()
Squaring data is likely to make the distribution more extreme, and so it is rarely a pragmatic way to make your data less skewed.
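If you want to quantify this rather than just eyeball the histograms, you could reuse the spssSkewKurtosis() function defined earlier (output not shown here):
# skewness and kurtosis of the squared variable
spssSkewKurtosis(gapminder_2007$gdpPercap^2)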
Square root (\(\sqrt{}\))
par(mfrow = c(1,2),
mar = c(0,0,2,1))
hist(gapminder_2007$gdpPercap, main = "Original")
hist(sqrt(gapminder_2007$gdpPercap), main = "Square root")
# Create subplots with two histograms
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the original 'gdpPercap' histogram
axs[0].hist(gapminder_2007['gdpPercap'], alpha=0.5, color='blue')
axs[0].set_title("Original")
axs[0].set_xlabel("GDP per Capita")
axs[0].set_ylabel("Frequency")

# Plot the square root 'gdpPercap' histogram
axs[1].hist(np.sqrt(gapminder_2007['gdpPercap']), color='green', alpha=0.5)
axs[1].set_title("Square root")
axs[1].set_xlabel("GDP per Capita")
axs[1].set_ylabel("Frequency")
# Adjust layout
plt.tight_layout()
# Show the histograms
plt.show()
This transformation appears to have reduced the skewness of the distribution. Calculating the square root of a variable will disproportionately reduce extreme values compared to less extreme values. This might be more clearly shown by looking at the change in the individual data points:
# focusing on 5 countries to make it visually easier
gapminder_sqrt <- data.frame(
  country = gapminder_2007$country[1:5],
  transformed = c(
    rep("Original", 5),
    rep("Square Root", 5)
  ),
  # gdp has been divided by 500 to make the comparisons more visible
  gdp = c(gapminder_2007$gdpPercap[1:5]/500, sqrt(gapminder_2007$gdpPercap[1:5]/500))
)
ggplot(gapminder_sqrt, aes(x = country, y = gdp, color = transformed)) +
  geom_point(size = 5) +
  xlab("Country index") +
  ylab("GDP (before and after transformation)")
# Sample data for 5 countries
data = pd.DataFrame({
    'Country': gapminder_2007['country'].iloc[:5].tolist() * 2,
    'Transformation': ['Original'] * 5 + ['Square Root'] * 5,
    'GDP': (gapminder_2007['gdpPercap'].iloc[:5] / 500).tolist() + (np.sqrt(gapminder_2007['gdpPercap'].iloc[:5] / 500)).tolist()
})

# Create a scatter plot
plt.figure(figsize=(12, 5))
colors = ['blue', 'green']
for i, transformation in enumerate(['Original', 'Square Root']):
    subset = data[data['Transformation'] == transformation]
    plt.scatter(subset['Country'], subset['GDP'], label=transformation, color=colors[i], s=100)

plt.title("Comparison of GDP (Original vs. Square Root Transformation)")
plt.xlabel("Country Index")
plt.ylabel("GDP (before and after transformation)")
plt.legend(title='Transformation')
plt.grid(True)
# Show the plot
plt.show()
As you can see above, the higher original values (pink) are reduced much more heavily by the square root transformation than the lower original values.
Logarithmic (\(\log\))
par(mfrow = c(1,2),
mar = c(0,0,2,1))
hist(gapminder_2007$gdpPercap, main = "Original")
hist(log(gapminder_2007$gdpPercap), main = "Logarithmic")
# Create subplots with two histograms
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

# Plot the original 'gdpPercap' histogram
axs[0].hist(gapminder_2007['gdpPercap'], alpha=0.5, color='blue')
axs[0].set_title("Original")
axs[0].set_xlabel("GDP per Capita")
axs[0].set_ylabel("Frequency")

# Plot the log-transformed 'gdpPercap' histogram
axs[1].hist(np.log(gapminder_2007['gdpPercap']), color='green', alpha=0.5)
axs[1].set_title("Logarithmic")
axs[1].set_xlabel("GDP per Capita")
axs[1].set_ylabel("Frequency")
# Adjust layout
plt.tight_layout()
# Show the histograms
plt.show()
This transformation seems very successful in making the distribution's shape less skewed. Let's see if the log transformation follows a similar pattern to the square root, disproportionately impacting larger values more than smaller ones.
# focusing on 5 countries to make it visually easier
gapminder_log <- data.frame(
  country = gapminder_2007$country[1:5],
  transformed = c(
    rep("Original", 5),
    rep("Log Transformation", 5)
  ),
  # gdp has been divided by 500 to make the comparisons more visible
  gdp = c(gapminder_2007$gdpPercap[1:5]/500, log(gapminder_2007$gdpPercap[1:5]/500))
)
ggplot(gapminder_log, aes(x = country, y = gdp, color = transformed)) +
  geom_point(size = 5) +
  xlab("Country index") +
  ylab("GDP (before and after transformation)")
# Sample data for 5 countries
data = pd.DataFrame({
    'Country': gapminder_2007['country'].iloc[:5].tolist() * 2,
    'Transformation': ['Original'] * 5 + ['Log Transformation'] * 5,
    'GDP': (gapminder_2007['gdpPercap'].iloc[:5] / 500).tolist() + (np.log(gapminder_2007['gdpPercap'].iloc[:5] / 500)).tolist()
})

# Create a scatter plot
plt.figure(figsize=(12, 5))
colors = ['blue', 'green']
for i, transformation in enumerate(['Original', 'Log Transformation']):
    subset = data[data['Transformation'] == transformation]
    plt.scatter(subset['Country'], subset['GDP'], label=transformation, color=colors[i], s=100)

plt.title("Comparison of GDP (Original vs. Log Transformation)")
plt.xlabel("Country Index")
plt.ylabel("GDP (before and after transformation)")
plt.legend(title='Transformation')
plt.grid(True)
# Show the plot
plt.show()
Yep, log also reduces skewness by disproportionately reducing higher values.
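One way to see why the log behaves like this is that logarithms turn multiplicative differences into additive ones:
\[ \log(ab) = \log(a) + \log(b), \]
so two countries whose GDPs differ by the same ratio end up the same distance apart after the transformation, regardless of how large their raw values are.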
Linear transformations will not change the association between variables
You can transform a variable by adding to it or multiplying it by a constant, but as these are linear transformations they do not change the shape of the distribution of the original variable, and thus do not change the association between variables. For example:
# correlation with original data
cor.test(
  gapminder_2007$gdpPercap,
  gapminder_2007$lifeExp
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap and gapminder_2007$lifeExp
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# correlation with 5 added to both variables (an additive change)
cor.test(
  gapminder_2007$gdpPercap + 5,
  gapminder_2007$lifeExp + 5
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap + 5 and gapminder_2007$lifeExp + 5
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# correlation with 10 subtracted from both variables (an additive change)
cor.test(
  gapminder_2007$gdpPercap - 10,
  gapminder_2007$lifeExp - 10
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap - 10 and gapminder_2007$lifeExp - 10
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# correlation with both variables multiplied by 5 (a multiplicative change)
cor.test(
  gapminder_2007$gdpPercap * 5,
  gapminder_2007$lifeExp * 5
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap * 5 and gapminder_2007$lifeExp * 5
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# grid comparing the 4 transformations
par(mfrow = c(2,2),
mar = c(2,2,2,1))
plot(gapminder_2007$gdpPercap, gapminder_2007$lifeExp, main="original correlation")
plot(gapminder_2007$gdpPercap + 5, gapminder_2007$lifeExp + 5, main = "added 5 to both")
plot(gapminder_2007$gdpPercap - 10, gapminder_2007$lifeExp - 10, main = "took 10 away from both")
plot(gapminder_2007$gdpPercap * 5, gapminder_2007$lifeExp * 5, main = "multiplied both by 5")
from scipy.stats import pearsonr
# Define the transformations
gdp_plus_5 = gapminder_2007['gdpPercap'] + 5
gdp_minus_10 = gapminder_2007['gdpPercap'] - 10
gdp_times_5 = gapminder_2007['gdpPercap'] * 5

life_plus_5 = gapminder_2007['lifeExp'] + 5
life_minus_10 = gapminder_2007['lifeExp'] - 10
life_times_5 = gapminder_2007['lifeExp'] * 5

# correlation with original data
correlation_original, pvalue_original = pearsonr(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'])
print("Correlation with original data:", correlation_original)
print("p-value of correlation with original data:", pvalue_original)

# correlation with original data + 5 to both variables (an additive change)
correlation_plus_5, pvalue_plus_5 = pearsonr(gdp_plus_5, life_plus_5)
print("Correlation with original data + 5:", correlation_plus_5)
print("p-value of correlation with original data + 5:", pvalue_plus_5)

# correlation with original data - 10 to both variables (an additive change)
correlation_minus_10, pvalue_minus_10 = pearsonr(gdp_minus_10, life_minus_10)
print("Correlation with original data - 10:", correlation_minus_10)
print("p-value of correlation with original data - 10:", pvalue_minus_10)

# correlation with multiplication of 5 to both variables (multiplicative)
correlation_times_5, pvalue_times_5 = pearsonr(gdp_times_5, life_times_5)
print("Correlation with multiplication of 5:", correlation_times_5)
print("p-value of correlation with multiplication of 5:", pvalue_times_5)
# Create a grid of scatter plots
plt.figure(figsize=(12, 6))

plt.subplot(2, 2, 1)
plt.scatter(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'])
plt.title("Original Correlation")

plt.subplot(2, 2, 2)
plt.scatter(gdp_plus_5, life_plus_5)
plt.title("Added 5")

plt.subplot(2, 2, 3)
plt.scatter(gdp_minus_10, life_minus_10)
plt.title("Took 10 Away")

plt.subplot(2, 2, 4)
plt.scatter(gdp_times_5, life_times_5)
plt.title("Multiplied by 5")

plt.tight_layout()
plt.show()
You can see that these linear transformations haven't changed the nature of the associations.
Non-linear transformations do change associations
If you apply non-linear transformations to one or both variables, this can change the strength (and potentially the direction) of the associations. Below are some examples when you transform both variables:
# correlation with original data
cor.test(
  gapminder_2007$gdpPercap,
  gapminder_2007$lifeExp
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap and gapminder_2007$lifeExp
t = 10.933, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5786217 0.7585843
sample estimates:
cor
0.6786624
# correlation with log of variables
cor.test(
log(gapminder_2007$gdpPercap),
log(gapminder_2007$lifeExp)
)
Pearson's product-moment correlation
data: log(gapminder_2007$gdpPercap) and log(gapminder_2007$lifeExp)
t = 14.752, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7060729 0.8372165
sample estimates:
cor
0.7800706
# correlation with both variables squared
cor.test(
  gapminder_2007$gdpPercap ^ 2,
  gapminder_2007$lifeExp ^ 2
)
Pearson's product-moment correlation
data: gapminder_2007$gdpPercap^2 and gapminder_2007$lifeExp^2
t = 8.6437, df = 140, p-value = 1.123e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4709220 0.6877841
sample estimates:
cor
0.5898894
# correlation with square root of both variables
cor.test(
sqrt(gapminder_2007$gdpPercap),
sqrt(gapminder_2007$lifeExp)
)
Pearson's product-moment correlation
data: sqrt(gapminder_2007$gdpPercap) and sqrt(gapminder_2007$lifeExp)
t = 12.981, df = 140, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6539524 0.8057025
sample estimates:
cor
0.7390648
# grid comparing the 4 transformations
par(mfrow = c(2,2),
mar = c(2,2,2,1))
plot(gapminder_2007$gdpPercap, gapminder_2007$lifeExp, main = "original correlation")
plot(log(gapminder_2007$gdpPercap),log(gapminder_2007$lifeExp), main = "log applied")
plot(gapminder_2007$gdpPercap ^ 2,gapminder_2007$lifeExp ^ 2, main = "data squared")
plot(sqrt(gapminder_2007$gdpPercap),sqrt(gapminder_2007$lifeExp), main = "square root of data")
from scipy.stats import pearsonr
# Define the transformations
gdp_log = np.log(gapminder_2007['gdpPercap'])
gdp_sqrt = np.sqrt(gapminder_2007['gdpPercap'])
gdp_squared = np.square(gapminder_2007['gdpPercap'])

life_log = np.log(gapminder_2007['lifeExp'])
life_sqrt = np.sqrt(gapminder_2007['lifeExp'])
life_squared = np.square(gapminder_2007['lifeExp'])

# correlation with original data
correlation_original, pvalue_original = pearsonr(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'])
print("Correlation with original data:", correlation_original)
print("p-value of correlation with original data:", pvalue_original)

# correlation with both variables log transformed
correlation_log, pvalue_log = pearsonr(gdp_log, life_log)
print("Correlation with both variables log transformed:", correlation_log)
print("p-value of correlation with both variables log transformed:", pvalue_log)

# correlation with both variables square rooted
correlation_sqrt, pvalue_sqrt = pearsonr(gdp_sqrt, life_sqrt)
print("Correlation with both variables square rooted:", correlation_sqrt)
print("p-value of correlation with both variables square rooted:", pvalue_sqrt)

# correlation with both variables squared
correlation_squared, pvalue_squared = pearsonr(gdp_squared, life_squared)
print("Correlation with both variables squared:", correlation_squared)
print("p-value of correlation with both variables squared:", pvalue_squared)
# Create a grid of scatter plots
plt.figure(figsize=(12, 6))

plt.subplot(2, 2, 1)
plt.scatter(gapminder_2007['gdpPercap'], gapminder_2007['lifeExp'])
plt.title("Original Correlation")

plt.subplot(2, 2, 2)
plt.scatter(gdp_log, life_log)
plt.title("Log Transformation")

plt.subplot(2, 2, 3)
plt.scatter(gdp_sqrt, life_sqrt)
plt.title("Square Root Transformation")

plt.subplot(2, 2, 4)
plt.scatter(gdp_squared, life_squared)
plt.title("Squared Transformation")

plt.tight_layout()
plt.show()
Question 1
Which types of transformations might make a distribution normal?
Question 2
Which of the following transformations is least likely to result in a normal distribution?