Kurtosis: Four Momentous Uses for the Fourth Moment of Statistical Distributions
Features
- Author: Kirk Borne
- Date: 04 Apr 2014
- Copyright: Image appears courtesy of iStock Photo. Figure is copyright of Kirk Borne.
We frequently see much mulling over means,
medians, and modes of statistical distributions, and lengthy
discussions of variance and skew (including the now famous "long tail"
[1], [2], [3]). What about fat tails? Is that a taboo subject? Maybe it
is! For example, in the widely respected book Numerical Recipes: The Art of Scientific Computing,
the authors had the audacity to say "the skewness (or third moment) and
the kurtosis (or fourth moment) should be used with caution or, better
yet, not at all." [4] Those warnings notwithstanding, kurtosis is making
a comeback. Not that it ever went away, but a recent search on Google
Scholar found over 3000 articles mentioning kurtosis in the context of
statistics within the first three months of this year, and over 12,000
articles in 2013, though only about 4000 such articles appeared in the
preceding three years combined. Many of those contributions focus on
real-world uses of that particular characteristic of data distributions
[5], [6].
So, what is kurtosis? It is a
statistical measure of the peakedness of the data distribution,
effectively measuring how peaked (positive kurtosis) or flattened
(negative kurtosis) the data distribution is compared to the normal
distribution. See the attached figure for an illustration of these 3
types of data distributions. For a statistical distribution f(x):
the mean m (first moment) of the distribution is the average value of x
over the full range of data values (i.e., the weighted mean of x,
weighted by the frequency of occurrence of each value x, which is the
distribution function f(x)); the variance s (second moment) is
the average value of (x-m)²; the skew (third moment) is the average
value of (x-m)³/s^(3/2); and the kurtosis is the fourth moment of the data
distribution [the average value of (x-m)⁴/s²] minus 3. In the latter case,
the “minus 3” is applied in order to set kurtosis=0 (Mesokurtic) for a
normal distribution, kurtosis>0 (Leptokurtic) for a peaked
distribution, and kurtosis<0 (Platykurtic) for a flattened
distribution [7]. Traditionally it is assumed that very large data sets
are required in order to estimate kurtosis. This is true if the primary
goal of the study is to determine with strong statistical confidence
whether kurtosis=0 or not (is the data normally distributed or not?).
This constraint is not essential in some data science applications where
the kurtosis is used to monitor shifts in the data distribution
(changes in the stationarity of the system being studied [8]), which can
be detected with moderate-sized data sets.
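To make these definitions concrete, here is a minimal sketch in Python (using only NumPy; the sample sizes and the three example distributions are illustrative assumptions, not part of the original discussion) that computes the mean, variance, skew, and excess kurtosis directly from the formulas above, and checks that a normal sample is mesokurtic, a Laplace sample is leptokurtic, and a uniform sample is platykurtic:

```python
import numpy as np

def moments(x):
    """Mean, variance, skew, and excess kurtosis, following the definitions above."""
    x = np.asarray(x, dtype=float)
    m = x.mean()                                   # first moment (mean)
    s = ((x - m) ** 2).mean()                      # second central moment (variance)
    skew = ((x - m) ** 3).mean() / s ** 1.5        # third standardized moment
    kurt = ((x - m) ** 4).mean() / s ** 2 - 3      # fourth standardized moment minus 3
    return m, s, skew, kurt

rng = np.random.default_rng(42)
samples = {
    "normal (mesokurtic)":   rng.normal(size=100_000),          # excess kurtosis ~ 0
    "laplace (leptokurtic)": rng.laplace(size=100_000),          # excess kurtosis ~ +3
    "uniform (platykurtic)": rng.uniform(-1, 1, size=100_000),   # excess kurtosis ~ -1.2
}

for name, sample in samples.items():
    print("%-24s excess kurtosis = %.2f" % (name, moments(sample)[-1]))
```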
I describe here four practical
applications that demonstrate significant uses of the fourth moment of a
statistical distribution. Showing some love to kurtosis is consistent
with one of the fundamental principles of data science: "Data are never
perfect, but love your data anyway" [9]. The value of exploring the
features, characteristics, and moments of your data distribution was
further highlighted in this article: "Data Profiling – Four Steps to
Knowing Your Big Data" [10].
1) Independent Component Analysis:
ICA is a relative of PCA, used in cases where the data distribution contains
subcomponents that are statistically independent of each other, though
generally not orthogonal. ICA is an example of blind source separation,
sometimes called the “cocktail party problem”, in which you try to
isolate a specific speech signal out of a superposition of many
independent voices. In large data collections, these independent
components are unlikely to have the same means, medians, and modes.
Consequently, the broad (fat tail) distribution of the data that is
identified through high kurtosis is an indicator of the presence of
multiple components in a complex signal (as illustrated in the figure,
we see multiple components in the data distribution that has negative
kurtosis). Finding the slice through the data (i.e., the “x”
dimension for the f(x) calculation) that yields the highest
kurtosis will begin to identify those separate sources – subsequent
application of SVM will assist in source separation [11].
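As a rough illustration of how kurtosis drives blind source separation, the following toy sketch (Python/NumPy; the two sources, the mixing matrix, and the grid of projection angles are hypothetical choices for illustration, not taken from [11]) mixes two independent non-Gaussian signals, whitens the mixture, and then scans projection directions for the one whose projected data has the most extreme kurtosis, which is the basic projection-pursuit idea behind kurtosis-based ICA:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical independent, non-Gaussian sources (the "voices" to separate)
n = 50_000
s1 = rng.laplace(size=n)            # super-Gaussian source (positive kurtosis)
s2 = rng.uniform(-1, 1, size=n)     # sub-Gaussian source (negative kurtosis)
S = np.vstack([s1, s2])

# Mix them with a hypothetical, non-orthogonal mixing matrix
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S                           # observed mixed signals, shape (2, n)

# Whiten (zero mean, unit covariance) so only a rotation remains to be found
X = X - X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(X))
Xw = np.diag(vals ** -0.5) @ vecs.T @ X

def excess_kurtosis(y):
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3

# Projection pursuit: scan directions and keep the one with the most
# extreme (largest absolute) kurtosis -- that direction isolates one source.
angles = np.linspace(0, np.pi, 180, endpoint=False)
kurts = [excess_kurtosis(np.cos(a) * Xw[0] + np.sin(a) * Xw[1]) for a in angles]
best = angles[np.argmax(np.abs(kurts))]
print("direction with most extreme kurtosis: %.1f degrees" % np.degrees(best))
```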
2) Hidden variable discovery:
There are often unmeasured explanatory variables that would help to
identify different categories of objects or events in data collections.
Whether the source is scientific data, or social data, or financial
data, or machine data, the ability to recognize the existence of such
hidden variables can help to explain unusual correlations or
inexplicable inaccuracies in classification models. We encountered an
example of this when analyzing galaxy classifications from the Galaxy
Zoo citizen science project [12]. For each of approximately 900,000
galaxies, about 200 citizen scientist volunteers each provided a
classification label: spiral galaxy, elliptical galaxy, or merging
galaxy. We attempted to build a
predictive model for these volunteer-provided classifications using the
measured features of the galaxies that were recorded in the scientific
database. The predictive model worked very well (95% accuracy) for
galaxies that had nearly uniform concurrence among the volunteers’
classifications (i.e., the distribution of class labels had a single
peak with low kurtosis). However, our predictive model was an abysmal
failure (5% accuracy) for galaxies that had a largely split vote (50-50
spiral vs. elliptical), for which the distribution of class labels had
high kurtosis. We concluded that there must be some “hidden” feature
(not contained in our scientific database of measurements for those
galaxies) that the human eye sees that makes it difficult to classify
the galaxy unequivocally as one type or the other. Now that we realize
that there is a hidden explanatory variable that probably accounts for
this, the hunt is on! We are continuing our search for an explanation of
what the high kurtosis is signaling to us.
3) Change-point detection in dynamic streaming data:
When capturing massive streams of data (from social media, scientific
experiments, or other sources), it is often beneficial (and efficient) to
track a few key parameters that characterize the behavior of the data
(i.e., effective descriptors of the system or population that is being
monitored and measured). For example, calculating running averages and
variances in stock market prices can produce alerts to traders. The more
data characteristics that can be easily measured and tracked, the
more likely it is that an early warning system will generate meaningful and
timely alerts [6]. Kurtosis is one of those characterizations of the
data stream that is particularly effective in such applications,
precisely because of its use in ICA – its ability to identify the
emergence of new behaviors (new independent components) and thereby
detect changes in the stationarity of the system (which otherwise may
have relatively invariant mean values of key parameters, thanks to the
central limit theorem).
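One simple way to implement such a monitor is to track kurtosis over a sliding window of the stream and raise an alert when it drifts away from its baseline. The sketch below (Python/NumPy; the simulated stream, window length, and alert threshold are assumptions for illustration) flags the point where a heavy-tailed component appears in the stream even though the running mean stays essentially unchanged:

```python
import numpy as np

def excess_kurtosis(x):
    m = x.mean()
    s = ((x - m) ** 2).mean()
    return ((x - m) ** 4).mean() / s ** 2 - 3

# Hypothetical stream: stationary Gaussian noise, followed by a regime in
# which occasional large spikes (a new heavy-tailed component) appear.
rng = np.random.default_rng(7)
stationary = rng.normal(size=5_000)
spiky = rng.normal(size=5_000) + rng.binomial(1, 0.02, size=5_000) * rng.normal(scale=8, size=5_000)
stream = np.concatenate([stationary, spiky])

window = 500        # sliding-window length (assumed)
threshold = 1.0     # alert when excess kurtosis exceeds this baseline (assumed)

for t in range(window, len(stream), 100):
    k = excess_kurtosis(stream[t - window:t])
    if k > threshold:
        print(f"t={t}: excess kurtosis {k:.2f} -- possible change in stationarity")
        break
```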
4) Drastically improving the estimated age of the Universe:
A remarkable example of kurtosis in action came from the study of
classical variable stars in astronomy (Cepheid variables, in
particular). Members of this class of pulsating stars follow a tightly
correlated period-luminosity relationship: the longer the period of
pulsation, the more luminous (brighter) the star. Using easily measured
periods of these stars in images of galaxies has enabled astronomers to
estimate the distances to those galaxies (which would otherwise be very
hard to estimate). Unfortunately, in the mid-20th century, there was a
serious discrepancy (of about a factor of two between different studies)
in the estimated distances of these variable stars. Consequently, a
factor-of-two uncertainty in the distance scale of distant galaxies
translated into factor-of-two uncertainties in the size and age
estimates of the Universe. This was embarrassing for astronomers. The
solution to the problem was the recognition that there was high kurtosis
in the distribution of Cepheid variable stars’ data, particularly in
their period-luminosity 2-dimensional scatter plot. The high kurtosis
was an indisputable indicator of two independent components – in this
case, two independent types of Cepheid variable stars. Once we had the
Hubble Space Telescope in orbit, with the finest scientific camera ever
used in astronomy, astronomers were able to identify uniquely which
types of Cepheid variable stars were being seen in any particular galaxy
image, and thus the factor of two uncertainty in their distances (and
in the size and age of the Universe) was reduced to a few percent
uncertainty, with kurtosis contributing to that improvement [13].
Finally, the most important result of any
data mining and statistical analysis activity is what you do with what
you have discovered. In science, one may say that the discovery is a
sufficient result, but in fact the discovery should provide decision
support for further action, such as: publishing a research paper, making a
time-critical response to the discovery, refining your hypothesis, designing a
new experiment, etc. More generally, in any data-driven environment,
monitoring and responding to changes in the characteristic features of
the data stream can lead to new discoveries and new opportunities,
especially in autonomous intelligent systems, including “Decision
Science-as-a-Service” for business analytics applications using big data
[14], or Dynamic Data-Driven Application Systems [15], or a space probe
operating in deep space with little (if any) human intervention [16].
Tapping into the power of the fourth moment of the data distribution
should not be an outlier activity, but an essential component of data
science and knowledge discovery within any data-driven decision-making
process.
References
[1] Glanzel, W., "High-end performance or outlier? Evaluating the tail of scientometric distributions," Scientometrics, 97(1), 13-23 (2013).
[2] Brynjolfsson, E., et al., "Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales," Management Science, 57(8), 1373-1386 (2011).
[3] Anderson, C., “The Long Tail: Why the Future of Business is Selling Less of More,” Hyperion Press (2006).
[4] Press, W., et al., "Numerical Recipes in C: The Art of Scientific Computing," Cambridge University Press, 2nd Edition, pg. 612 (1992).
[5] Mora, P., et al., “Impact of Heat on the Pressure Skewness and Kurtosis in Supersonic Jets,” AIAA Journal, 52(4), 777-787 (2014).
[6] Hou, W., et al., “Detection of small target using recursive higher order statistics,” Proc. SPIE 9142, International Conference on Frontiers in Optical Imaging Technology and Application, DOI:10.1117/12.2054029 (2014).
[7] “Does kurtosis require immediate hospitalization?” Downloaded from http://www.pqsystems.com/healthcare/PatientPuzzlers_KurtosisandHospitalization.php
[8] Sierra-Fernandez, J., et al., “Adaptive detection and classification system for power quality disturbances,” 2013 International Conference on Power, Energy and Control (ICPEC), DOI: 10.1109/ICPEC.2013.6527713 (2013).
[9] Borne, K., "Five Fundamental Concepts of Data Science," http://www.statisticsviews.com/details/feature/5459931/Five-Fundamental-Concepts-of-Data-Science.html (2013).
[10] Borne, K., "Data Profiling – Four Steps to Knowing Your Big Data," http://insideanalysis.com/2014/02/data-profiling-four-steps-to-knowing-your-big-data/ (2014).
[11] Lu, C.-J., et al., “Recognition of Concurrent Control Chart Patterns by Integrating ICA and SVM,” Applied Mathematics & Information Sciences, 8(2), 681-689 (2014).
[12] http://galaxyzoo.org/
[13] http://www.atnf.csiro.au/outreach/education/senior/astrophysics/variable_cepheids.html
[14] http://www.syntasa.com
[15] http://www.dddas.org/
[16] Borne, K., “Data-Driven Discovery through e-Science Technologies,” in SMC-IT 2006: Second IEEE International Conference on Space Mission Challenges for Information Technology (2006). Downloaded from http://kirkborne.net/papers/Borne2006-SMC-IT-DataDriven-eScience.pdf