Skewness (R,Python)
Parametric analyses are based on the assumption that the data you are analysing is normally distributed (see below):
library(ggplot2)
# https://stackoverflow.com/a/12429538
norm_x <- seq(-4, 4, 0.001)
norm_data_frame <- data.frame(x = norm_x, y = dnorm(norm_x, 0, 1))
p <- ggplot(
  data = norm_data_frame,
  aes(
    x = x,
    y = y
  )
) + geom_line()
p +
  xlim(c(-3, 3)) +
geom_col(aes(fill=x)) +
scale_fill_gradient2(low = "red",
high = "blue",
mid = "white",
midpoint = median(norm_data_frame$x))+
annotate("text", x=-2.3, y=0.01, label= "13.6%") +
annotate("text", x=-1.5, y=0.01, label= "34.1%") +
annotate("text", x=-0.6, y=0.01, label= "50%") +
annotate("text", x=0.5, y=0.01, label= "84.1%") +
annotate("text", x=1.5, y=0.01, label= "97.7%") +
annotate("text", x=2.25, y=0.01, label= "100%") +
geom_vline(xintercept = -2) +
geom_vline(xintercept = -1) +
geom_vline(xintercept = 0) +
geom_vline(xintercept = 1) +
geom_vline(xintercept = 2) +
xlab("Z-score") +
ylab("Frequency") +
theme(legend.position="none")
import numpy as np
import matplotlib as mpl
from scipy.stats import norm
import matplotlib.pyplot as plt
norm_x = np.arange(-4.0, 4.1, 0.1)
norm_y = norm.pdf(norm_x, loc=0, scale=1)
colourmap = mpl.cm.get_cmap('jet')
fig, ax = plt.subplots()
fig.set_figheight(7)
fig.set_figwidth(10)
ax.plot(norm_x, norm_y, color='black', alpha=1.00)
normalize = mpl.colors.Normalize(vmin=norm_y.min(), vmax=norm_y.max())
npts = 80
for i in range(npts - 1):
    plt.fill_between([norm_x[i], norm_x[i+1]],
                     [norm_y[i], norm_y[i+1]],
                     color=colourmap(normalize(norm_y[i])),
                     alpha=0.6)
plt.text(2.05, 0.001, '100%', fontsize = 10)
plt.text(1.2, 0.001, '97.7%', fontsize = 10)
plt.text(0.2, 0.001, '84.1%', fontsize = 10)
plt.text(-0.7, 0.001, '50%', fontsize = 10)
plt.text(-1.8, 0.001, '34.1%', fontsize = 10)
plt.text(-2.7, 0.001, '13.6%', fontsize = 10)
plt.axvline(x=-2, color='black', ls='-', lw=1)
plt.axvline(x=-1, color='black', ls='-', lw=1)
plt.axvline(x=0, color='black', ls='-', lw=1)
plt.axvline(x=1, color='black', ls='-', lw=1)
plt.axvline(x=2, color='black', ls='-', lw=1)
# add title on the x-axis
plt.xlabel("Z-score")
# add title on the y-axis
plt.ylabel("Frequency")
# show the plot
plt.show()
White in the above figure represents the median. Note that the mean and median overlap in a normal distribution.
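These cumulative percentages can be checked directly from the normal cumulative distribution function. As a minimal sketch, using scipy (the same library used in the plotting code on this page):

```python
from scipy.stats import norm

# Cumulative proportion of a normal population scoring below each
# z-score, e.g. 84.1% of scores fall below z = +1 and 97.7% fall
# below z = +2, matching the cumulative labels in the figure above.
for z in (-2, -1, 0, 1, 2):
    print(f"below z = {z:+d}: {norm.cdf(z):.1%}")
```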
If your data fit a normal distribution, then you can draw conclusions based on known facts about this distribution, e.g. the fact that 97.7% of your population should have a score lower than 2 standard deviations above the mean (because Z-scores represent standard deviations from the mean). These conclusions no longer hold, however, if your data is skewed:
library(fGarch)
skew_data_frame <- data.frame(
  x = norm_x,
  y = dsnorm(
    x = norm_x,
    mean = 0,
    sd = 1,
    xi = 2
  )
)
ggplot(
  data = skew_data_frame,
  aes(
    x = x,
    y = y
  )
) + geom_line() +
geom_col(aes(fill=x)) +
scale_fill_gradient2(low = "red",
high = "blue",
mid = "white",
midpoint = -.15)+
theme(legend.position="none") +
xlim(-3,3) +
ylab("Frequency") +
xlab("Z-score")
from scipy.stats import skewnorm
norm_x = np.arange(-3.0, 3.1, 0.1)
norm_y = skewnorm.pdf(norm_x, a=-2)
colourmap = mpl.cm.get_cmap('jet')
fig, ax = plt.subplots()
fig.set_figheight(7)
fig.set_figwidth(10)
ax.plot(norm_x, norm_y, color='black', alpha=1.00)
normalize = mpl.colors.Normalize(vmin=norm_y.min(), vmax=norm_y.max())
npts = 60
for i in range(npts - 1):
    plt.fill_between([norm_x[i], norm_x[i+1]],
                     [norm_y[i], norm_y[i+1]],
                     color=colourmap(normalize(norm_y[i])),
                     alpha=0.5)
plt.axvline(x=-0.5, color='white', ls='-', lw=1)
# add title on the x-axis
plt.xlabel("Z-score")
# add title on the y-axis
plt.ylabel("Frequency")
# show the plot
plt.show()
White represents the median in the figure above. As you can see from the skewed distribution above, the median no longer overlaps with the mean, which is what happens when data are skewed. Importantly, the assumptions we can make about what proportion of the population falls 1 standard deviation above or below the mean are no longer valid, as more than half the population can sit on one side of the mean. This suggests that non-parametric analyses may be more appropriate if your data is skewed.
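To make the mean–median gap concrete, here is a small illustrative sketch (not part of the original figures) using scipy's skewnorm with a = 2, i.e. a right-skewed distribution mirroring the xi = 2 used in the R code above:

```python
import numpy as np
from scipy.stats import skewnorm

# Draw a large right-skewed sample and compare its mean and median.
sample = skewnorm.rvs(a=2, size=100_000, random_state=42)
mean, median = sample.mean(), np.median(sample)
print(f"mean = {mean:.3f}, median = {median:.3f}")
# The long right tail drags the mean above the median, so more than
# half of the scores fall below the mean.
print(f"proportion of scores below the mean: {np.mean(sample < mean):.1%}")
```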
So now that we know what skewed distributions look like, we need to quantify how skewed the data actually are.
The next section is a breakdown of the formula for those interested in it (but this is not crucial).
Optional
If we want to manually calculate skewness:
\[ \tilde{\mu}_{3} = \sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^{3} * \frac{N}{(N-1) * (N-2)} \] To do this in R you could calculate it manually. We’ll use Gapminder’s data from 2007:
library(gapminder)
# create a new data frame that only focuses on data from 2007
gapminder_2007 <- subset(
  gapminder, # the data set
  year == 2007
)
hist(gapminder_2007$lifeExp)
positive_skew_n = length(gapminder_2007$lifeExp)
positive_skewness = sum(((gapminder_2007$lifeExp - mean(gapminder_2007$lifeExp))/sd(gapminder_2007$lifeExp))^3) * (
  positive_skew_n / ((positive_skew_n - 1) * (positive_skew_n - 2))
)
# to show the output:
positive_skewness
[1] -0.6887771
import matplotlib.pyplot as plt
import numpy as np
import math
# load the gapminder module and import the gapminder dataset
from gapminder import gapminder
# create a new data frame that only focuses on data from 2007
gapminder_2007 = gapminder.loc[gapminder['year'] == 2007]
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(gapminder_2007['lifeExp'], bins=10, color='grey', edgecolor="black", alpha=0.8)
# add title on the x-axis
plt.xlabel("gapminder_2007['lifeExp']")
# add title on the y-axis
plt.ylabel("Frequency")
# show the plot
plt.show()
positive_skew_n = len(gapminder_2007['lifeExp'])
positive_skewness = sum(((gapminder_2007['lifeExp'] - gapminder_2007['lifeExp'].mean())/gapminder_2007['lifeExp'].std())**3) * (
    positive_skew_n / ((positive_skew_n - 1) * (positive_skew_n - 2))
)
# to show the output:
positive_skewness
-0.68877706485775
… or just
Use the code from https://stackoverflow.com/a/54369572 to get values for skewness (this answer has been chosen because it gives skewness and its standard error as calculated by major software such as SPSS and JASP):
# Skewness and kurtosis and their standard errors as implemented by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
spssSkewKurtosis = function(x) {
  w = length(x)
  m1 = mean(x)
  m2 = sum((x-m1)^2)
  m3 = sum((x-m1)^3)
  m4 = sum((x-m1)^4)
  s1 = sd(x)
  skew = w*m3/(w-1)/(w-2)/s1^3
  sdskew = sqrt( 6*w*(w-1) / ((w-2)*(w+1)*(w+3)) )
  kurtosis = (w*(w+1)*m4 - 3*m2^2*(w-1)) / ((w-1)*(w-2)*(w-3)*s1^4)
  sdkurtosis = sqrt( 4*(w^2-1) * sdskew^2 / ((w-3)*(w+5)) )

  ## z-scores added by reading-psych
  zskew = skew/sdskew
  zkurtosis = kurtosis/sdkurtosis

  mat = matrix(c(skew, kurtosis, sdskew, sdkurtosis, zskew, zkurtosis), 2,
               dimnames = list(c("skew", "kurtosis"), c("estimate", "se", "zScore")))
  return(mat)
}
spssSkewKurtosis(gapminder_2007$lifeExp)
estimate se zScore
skew -0.6887771 0.2034292 -3.385832
kurtosis -0.8298204 0.4041614 -2.053191
# Skewness and kurtosis and their standard errors as implemented by SPSS
#
# Reference: pp 451-452 of
# http://support.spss.com/ProductsExt/SPSS/Documentation/Manuals/16.0/SPSS 16.0 Algorithms.pdf
#
# See also: Suggestion for Using Powerful and Informative Tests of Normality,
# Ralph B. D'Agostino, Albert Belanger, Ralph B. D'Agostino, Jr.,
# The American Statistician, Vol. 44, No. 4 (Nov., 1990), pp. 316-321
def spssSkewKurtosis(x):
    import pandas as pd
    import math
    w = len(x)
    m1 = x.mean()
    m2 = sum((x-m1)**2)
    m3 = sum((x-m1)**3)
    m4 = sum((x-m1)**4)
    s1 = x.std()
    skew = w*m3/(w-1)/(w-2)/s1**3
    sdskew = math.sqrt( 6*w*(w-1) / ((w-2)*(w+1)*(w+3)) )
    kurtosis = (w*(w+1)*m4 - 3*m2**2*(w-1)) / ((w-1)*(w-2)*(w-3)*s1**4)
    sdkurtosis = math.sqrt( 4*(w**2-1) * sdskew**2 / ((w-3)*(w+5)) )

    ## z-scores added by reading-psych
    zskew = skew/sdskew
    zkurtosis = kurtosis/sdkurtosis

    mat = pd.DataFrame([[skew, sdskew, zskew], [kurtosis, sdkurtosis, zkurtosis]],
                       columns=['estimate', 'se', 'zScore'], index=['skew', 'kurtosis'])
    return mat

spssSkewKurtosis(gapminder_2007["lifeExp"])
The above function generates three columns: the \(estimate\) of skewness (and of kurtosis, but we’ll deal with kurtosis separately), its \(standard\) \(error\) (SE) and the \(zScore\). Z-scores capture how unlikely the skewness estimate is compared to what you would normally expect: when you divide the skewness estimate by its standard error you get a z-score, and if the absolute value of that z-score (i.e. ignoring whether it is positive or negative) is greater than 1.96 then the distribution is significantly skewed. If you want to understand why 1.96 is the key threshold, check out the subsection on normal distributions.
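This decision rule can be sketched as a small helper. The function below is illustrative, not part of the original tutorial: it uses scipy's bias-corrected skewness (`skew(..., bias=False)`, which matches the SPSS-style estimate computed by `spssSkewKurtosis`) together with the same standard-error formula:

```python
import math
from scipy.stats import skew

def skew_z_test(x, threshold=1.96):
    """Return the skewness z-score and whether |z| exceeds the threshold."""
    n = len(x)
    estimate = skew(x, bias=False)  # bias-corrected (SPSS-style) estimate
    se = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    z = estimate / se
    return z, abs(z) > threshold

# A clearly right-skewed toy sample (one extreme high score):
z, significant = skew_z_test([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])
print(round(z, 2), significant)
```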
One way to deal with skewed data is to transform the data.
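As a minimal sketch of what a transformation can achieve (the log-normal data here are simulated purely for illustration), a log transform often makes positively skewed data far more symmetric:

```python
import numpy as np
from scipy.stats import skew

# Simulated positively skewed data: log-normal values are, by
# construction, normally distributed once log-transformed.
rng = np.random.default_rng(1)
raw = rng.lognormal(mean=0, sigma=1, size=10_000)
logged = np.log(raw)
print(f"skewness before log transform: {skew(raw):.2f}")
print(f"skewness after log transform:  {skew(logged):.2f}")
```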
Consolidation questions
Question 1
Is a skewness z-score of indicative of a significant problem of skewness?