Saturday, May 10, 2014

Kurtosis: Four Momentous Uses for the Fourth Moment of Statistical Distributions

Features

  • Author: Kirk Borne
  • Date: 04 Apr 2014
  • Copyright: Image appears courtesy of iStock Photo. Figure is copyright of Kirk Borne.
We frequently see much mulling over means, medians, and modes of statistical distributions, and lengthy discussions of variance and skew (including the now famous "long tail" [1], [2], [3]). What about fat tails? Is that a taboo subject? Maybe it is! For example, in the widely respected book Numerical Recipes: The Art of Scientific Computing, the authors had the audacity to say "the skewness (or third moment) and the kurtosis (or fourth moment) should be used with caution or, better yet, not at all." [4] Those warnings notwithstanding, kurtosis is making a comeback. Not that it ever went away, but a recent search on Google Scholar found over 3000 articles mentioning kurtosis in the context of statistics within the first three months of this year, and over 12,000 articles in 2013, though only about 4000 such articles were cited in the preceding three years combined. Many of those contributions focus on real-world uses of that particular characteristic of data distributions [5], [6].
So, what is kurtosis? It is a statistical measure of the peakiness of the data distribution, effectively measuring how peaked (positive kurtosis) or flattened (negative kurtosis) the data distribution is compared to the normal distribution. See the attached figure for an illustration of these three types of data distributions. For a statistical distribution f(x): the mean m (first moment) of the distribution is the average value of x over the full range of data values (i.e., the weighted mean of x, weighted by the frequency of occurrence of each value x, which is the distribution function f(x)); the variance s (second moment) is the average value of (x-m)^2; the skew (third moment) is the average value of (x-m)^3/s^(3/2); and the kurtosis is the fourth moment of the data distribution [average value of (x-m)^4/s^2] minus 3. In the latter case, the “minus 3” is applied in order to set kurtosis=0 (Mesokurtic) for a normal distribution, kurtosis>0 (Leptokurtic) for a peaked distribution, and kurtosis<0 (Platykurtic) for a flattened distribution [7]. Traditionally it is assumed that very large data sets are required in order to estimate kurtosis. This is true if the primary goal of the study is to determine with strong statistical confidence whether kurtosis=0 or not (is the data normally distributed or not?). This constraint is not essential in some data science applications where the kurtosis is used to monitor shifts in the data distribution (changes in the stationarity of the system being studied [8]), which can be detected with moderate-sized data sets.
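The moment definitions above translate almost directly into code. The following is a minimal sketch in plain Python, assuming the simple population ("average value") form of each moment rather than any bias-corrected sample estimator; the function name `moments` and the two toy data sets are illustrative and do not come from the article.

```python
def moments(values):
    """Return (mean m, variance s, skew, excess kurtosis) of a data set.

    Uses the plain "average value" definitions: every average divides
    by n, so these are population moments, not bias-corrected estimates.
    """
    n = len(values)
    m = sum(values) / n                                   # first moment
    s = sum((x - m) ** 2 for x in values) / n             # variance
    skew = sum((x - m) ** 3 for x in values) / n / s ** 1.5
    kurt = sum((x - m) ** 4 for x in values) / n / s ** 2 - 3.0
    return m, s, skew, kurt


# Hypothetical toy data: an evenly spread sample and a sharply peaked
# sample with a few far-out values.
flat = [1, 2, 3, 4, 5]
peaked = [-4, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 4]

for name, data in (("flat", flat), ("peaked", peaked)):
    m, s, skew, kurt = moments(data)
    print(f"{name}: mean={m:.3f} variance={s:.3f} skew={skew:.3f} kurtosis={kurt:.3f}")
```

The evenly spread sample comes out platykurtic (kurtosis of -1.3), while the peaked sample with outlying values comes out leptokurtic (about 1.78). Both are symmetric, so their skew is zero.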
I describe here four practical applications that demonstrate significant uses of the fourth moment of a statistical distribution. Showing some love to kurtosis is consistent with one of the fundamental principles of data science: "Data are never perfect, but love your data anyway" [9]. The value of exploring the features, characteristics, and moments of your data distribution was further highlighted in this article: "Data Profiling – Four Steps to Knowing Your Big Data" [10].
1) Independent Component Analysis: ICA is a variant of PCA in cases where the data distribution contains subcomponents that are statistically independent of each other, though generally not orthogonal. ICA is an example of blind source separation, sometimes called the “cocktail party problem”, in which you try to isolate a specific speech signal out of a superposition of many independent voices. In large data collections, these independent components are unlikely to have the same means, medians, and modes. Consequently, the broad (fat tail) distribution of the data that is identified through high kurtosis is an indicator of the presence of multiple components in a complex signal (as illustrated in the figure, we see multiple components in the data distribution that has negative kurtosis). Estimating the slice through the data (i.e., the “x” dimension for the f(x) calculation) that yields the highest kurtosis will begin to identify those separate sources – subsequent application of SVM will assist in source separation [11].
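The idea of searching for the slice with the highest kurtosis can be illustrated with a brute-force scan over projection angles in two dimensions. This is only a sketch under strong assumptions, not a production ICA algorithm: the two sources are hypothetical peaked data sets laid out on an exact product grid (so they are exactly independent and the result is deterministic), they are scaled to unit variance, and the mixing is a pure rotation by 30 degrees. The SVM step mentioned in the text is not shown.

```python
import math
from itertools import product


def excess_kurtosis(values):
    """Population excess kurtosis: average of (x-m)^4 / variance^2, minus 3."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    fourth = sum((v - m) ** 4 for v in values) / n
    return fourth / var ** 2 - 3.0


def unit_variance(values):
    """Centre a data set on zero and rescale it to unit variance."""
    n = len(values)
    m = sum(values) / n
    std = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / std for v in values]


def rotate(points, degrees):
    """Mix two signals by rotating every (a, b) pair by the given angle."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [(c * a - s * b, s * a + c * b) for a, b in points]


def best_slice(points, step_degrees=1):
    """Scan projection angles in [0, 180) and return the angle whose
    one-dimensional projection has the highest excess kurtosis."""
    best_angle, best_k = None, -math.inf
    for angle in range(0, 180, step_degrees):
        c, s = math.cos(math.radians(angle)), math.sin(math.radians(angle))
        k = excess_kurtosis([c * x + s * y for x, y in points])
        if k > best_k:
            best_angle, best_k = angle, k
    return best_angle, best_k


# Two hypothetical independent sources, both peaked (positive kurtosis).
source_a = unit_variance([-4, -1, -1, 0, 0, 0, 0, 0, 0, 1, 1, 4])
source_b = unit_variance([-3, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 3])
sources = list(product(source_a, source_b))   # every (a, b) pairing

mixed = rotate(sources, 30)                   # the "unknown" mixing
angle, k = best_slice(mixed)
print(f"highest-kurtosis slice at {angle} degrees (kurtosis {k:.3f})")
```

In this toy setup the scan lands on the 30 degree mixing angle, because any projection that blends the two sources is closer to Gaussian (lower kurtosis) than a projection aligned with the more peaked source. Real data would first need whitening, and practical ICA implementations optimize the contrast function iteratively instead of scanning angles.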
2) Hidden variable discovery: There are often explanatory variables that are not measured that help to identify different categories of objects or events in data collections. Whether the source is scientific data, or social data, or financial data, or machine data, the ability to recognize the existence of such hidden variables can help to explain unusual correlations or inexplicable inaccuracies in classification models. We encountered an example of this when analyzing galaxy classifications from the Galaxy Zoo citizen science project [12]. For each one of approximately 900,000 galaxies, there were about 200 citizen scientist volunteers who provided a classification label for the galaxy: either spiral galaxy, or elliptical galaxy, or a merging galaxy. We attempted to build a predictive model for these volunteer-provided classifications using the measured features of the galaxies that were recorded in the scientific database. The predictive model worked very well (95% accuracy) for galaxies that had nearly uniform concurrence among the volunteers’ classifications (i.e., the distribution of class labels had a single peak with low kurtosis). However, our predictive model was an abysmal failure (5% accuracy) for galaxies that had a largely split vote (50-50 spiral vs. elliptical), for which the distribution of class labels had high kurtosis. We concluded that there must be some “hidden” feature (not contained in our scientific database of measurements for those galaxies) that the human eye sees that makes it difficult to classify the galaxy unequivocally as one type or the other. Now that we realize that there is a hidden explanatory variable that probably accounts for this, the hunt is on! We are continuing our search for an explanation of what the high kurtosis is signaling to us.
3) Change-point detection in dynamic streaming data: When capturing massive streams of data (from social media or scientific experiments or whatever), it is often beneficial (and efficient) to track a few key parameters that characterize the behavior of the data (i.e., effective descriptors of the system or population that is being monitored and measured). For example, calculating running averages and variances in stock market prices can produce alerts to traders. The more data characteristics that can be easily measured and tracked, the more likely it is that an early warning system will generate meaningful and timely alerts [6]. Kurtosis is one of those characterizations of the data stream that is particularly effective in such applications, precisely because of its use in ICA – its ability to identify the emergence of new behaviors (new independent components) and thereby detect changes in the stationarity of the system (which otherwise may have relatively invariant mean values of key parameters, thanks to the central limit theorem).
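A sliding-window kurtosis monitor is one simple way to sketch this kind of tracking. The example below is an illustration only: the function `kurtosis_alerts`, the window size, the threshold, and the synthetic stream are all assumptions made for the sketch. The stream is built so that its long-run mean and variance are the same before and after the change point at index 100, and only the shape of the distribution differs.

```python
import math
from collections import deque


def excess_kurtosis(values):
    """Population excess kurtosis of the values in a window."""
    n = len(values)
    m = sum(values) / n
    var = sum((v - m) ** 2 for v in values) / n
    if var == 0:
        return 0.0  # assumption: treat a constant window as "no signal"
    fourth = sum((v - m) ** 4 for v in values) / n
    return fourth / var ** 2 - 3.0


def kurtosis_alerts(stream, window=20, threshold=1.0):
    """Yield (index, kurtosis) whenever the windowed excess kurtosis has
    moved more than `threshold` away from the first full window's value."""
    buf = deque(maxlen=window)
    baseline = None
    for i, x in enumerate(stream):
        buf.append(x)
        if len(buf) < window:
            continue                      # wait for a full window
        k = excess_kurtosis(buf)
        if baseline is None:
            baseline = k                  # first full window sets the reference
        elif abs(k - baseline) > threshold:
            yield i, k


# Synthetic stream: a calm regime (alternating -1, +1) followed by a bursty
# regime (mostly zeros with rare spikes of size sqrt(5)). Both regimes have
# mean 0 and variance 1 over a full period; only the tails change.
spike = math.sqrt(5)
calm = [-1, 1] * 50
bursty = [0, 0, 0, 0, -spike, 0, 0, 0, 0, spike] * 10
stream = calm + bursty

alerts = list(kurtosis_alerts(stream, window=20, threshold=1.0))
first_index, first_k = alerts[0]
print(f"first alert at index {first_index}, windowed kurtosis {first_k:.3f}")
```

No alert fires during the calm regime, and the first alert appears within a few samples of the change point, even though a monitor based only on running means and variances would see nothing unusual once the window sits fully inside the new regime. A real deployment would choose the window and threshold from the noise level of the actual stream.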
4) Drastically improving the estimated age of the Universe: A remarkable example of kurtosis in action was in the study of classical variable stars in astronomy (Cepheid variables, in particular). Members of this class of pulsating stars follow a tightly correlated period-luminosity relationship: the longer the period of pulsation, the more luminous (brighter) the star. Using easily measured periods of these stars in images of galaxies has enabled astronomers to estimate the distances to those galaxies (which would otherwise be very hard to estimate). Unfortunately, in the mid-20th century, there was a serious discrepancy (of about a factor of two between different studies) in the estimated distances of these variable stars. Consequently, a factor of two uncertainty in the distance scale of distant galaxies translated into factor of two uncertainties in the size and age estimates of the Universe. This was embarrassing for astronomers. The solution to the problem was the recognition that there was high kurtosis in the distribution of Cepheid variable stars’ data, particularly in their period-luminosity 2-dimensional scatter plot. The high kurtosis was an indisputable indicator of two independent components – in this case, two independent types of Cepheid variable stars. Once we had the Hubble Space Telescope in orbit, with the finest scientific camera ever used in astronomy, astronomers were able to identify uniquely which types of Cepheid variable stars were being seen in any particular galaxy image, and thus the factor of two uncertainty in their distances (and in the size and age of the Universe) was reduced to a few percent uncertainty, with kurtosis contributing to that improvement [13].
Finally, the most important result of any data mining and statistical analysis activity is what you do with what you have discovered. In science, one may say that the discovery is a sufficient result, but in fact the discovery should provide decision support for further action, such as: publish a research paper, make a time-critical response to the discovery, refine your hypothesis, design a new experiment, etc. More generally, in any data-driven environment, monitoring and responding to changes in the characteristic features of the data stream can lead to new discoveries and new opportunities, especially in autonomous intelligent systems, including “Decision Science-as-a-Service” for business analytics applications using big data [14], or Dynamic Data-Driven Application Systems [15], or a space probe operating in deep space with little (if any) human intervention [16]. Tapping into the power of the fourth moment of the data distribution should not be an outlier activity, but an essential component of data science and knowledge discovery within any data-driven decision-making process.

References
[1] Glanzel, W., "High-end performance or outlier? Evaluating the tail of scientometric distributions," Scientometrics, 97(1), 13-23 (2013).
[2] Brynjolfsson, E., et al., "Goodbye Pareto Principle, Hello Long Tail: The Effect of Search Costs on the Concentration of Product Sales," Management Science, 57(8), 1373-1386 (2011).
[3] Anderson, C., “The Long Tail: Why the Future of Business is Selling Less of More,” Hyperion Press (2006).
[4] Press, W., et al., "Numerical Recipes in C: The Art of Scientific Computing," Cambridge University Press, 2nd Edition, pg. 612 (1992).
[5] Mora, P., et al., “Impact of Heat on the Pressure Skewness and Kurtosis in Supersonic Jets,” AIAA Journal, 52(4), 777-787 (2014).
[6] Hou, W., et al., “Detection of small target using recursive higher order statistics,” Proc. SPIE 9142, International Conference on Frontiers in Optical Imaging Technology and Application, DOI:10.1117/12.2054029 (2014).
[7] “Does kurtosis require immediate hospitalization?” Downloaded from http://www.pqsystems.com/healthcare/PatientPuzzlers_KurtosisandHospitalization.php
[8] Sierra-Fernandez, J., et al., “Adaptive detection and classification system for power quality disturbances,” 2013 International Conference on Power, Energy and Control (ICPEC), DOI: 10.1109/ICPEC.2013.6527713 (2013).
[9] Borne, K. "Five Fundamental Concepts of Data Science," http://www.statisticsviews.com/details/feature/5459931/Five-Fundamental-Concepts-of-Data-Science.html (2013).
[10] Borne, K. "Data Profiling – Four Steps to Knowing Your Big Data," http://insideanalysis.com/2014/02/data-profiling-four-steps-to-knowing-your-big-data/ (2014).
[11] Lu, C.-J., et al., “Recognition of Concurrent Control Chart Patterns by Integrating ICA and SVM,” Applied Mathematics & Information Sciences, 8(2), 681-689 (2014).
[12] http://galaxyzoo.org/
[13] http://www.atnf.csiro.au/outreach/education/senior/astrophysics/variable_cepheids.html
[14] http://www.syntasa.com
[15] http://www.dddas.org/
[16] Borne, K., “Data-Driven Discovery through e-Science Technologies,” in SMC-IT 2006: Second IEEE International Conference on Space Mission Challenges for Information Technology (2006). Downloaded from http://kirkborne.net/papers/Borne2006-SMC-IT-DataDriven-eScience.pdf

“My English teacher thought he could get me into RADA but it worked out that I became a statistician”: An interview with Sir Adrian Smith

Features

  • Author: Statistics Views
  • Date: 16 Apr 2014
  • Copyright: Photograph appears courtesy of Sir Adrian Smith
Sir Adrian Smith is a world-renowned British statistician. He studied mathematics at Selwyn College, Cambridge and statistics at University College London.
He is the former Head of the Department of Mathematics at Imperial College London. He served on the Advisory Council for the Office for National Statistics from 1996–1998, was Statistical Advisor to the Nuclear Waste Inspectorate from 1991–1998 and was advisor on Operational Analysis to the Ministry of Defence from 1982–1987. He was Principal of Queen Mary, University of London, from 1998-2008; Director General of Knowledge and Innovation in the Department of Business, Innovation and Skills until 2012; and is now Vice-Chancellor of the University of London. He is a former President of the Royal Statistical Society and is currently a Deputy Chair of the UK Statistics Authority.
He is best known for his work in statistical theory, in particular Bayesian statistics and evidence-based practice. When I interviewed the late Professor Dennis Lindley last year, he called Sir Adrian “the brightest student I’ve ever had.” With Antonio Machi, Smith translated Bruno de Finetti's Theory of Probability into English. He wrote an influential paper in 1990 along with Alan E. Gelfand, which drew attention to the significance of the Gibbs sampler technique for Bayesian numerical integration problems. He was knighted in the 2011 New Year Honours.
Statistics Views talks to Sir Adrian about his career in statistics, his memories of the late Professor Lindley, teaching statistics, working on the Smith Report on secondary mathematics education in the UK, the challenges statisticians face, Big Data and how his life may have turned out very differently if he had not chosen statistics.
Video Interview Part I - where Smith talks about the origins of his career, working with the late Professor Dennis Lindley and the Smith Report


Further questions
1) You have an extremely impressive career path including serving on the Advisory Committee to the UK Government Office for National Statistics from 1996-1998, working for the UK Government Department of the Environment from 1991-1998 as a Statistical Advisor to the Nuclear Waste Inspectorate and for the Ministry of Defence from 1982 to 1987 as adviser on Operational Analysis. What are your memories when you look back on these roles in service to statistics and what do you feel were your main achievements?
In most of those cases, I was invited to join in because I was seen as a statistician and because there were issues on data policy to which I could contribute. I liked that part of my life a great deal. There was a kind of joy to be able to poke one’s fingers into worlds that are not your own, such as burying nuclear waste, which is not something I come across every day. The issues around that – the risks, quantifying the uncertainties – were bread and butter in terms of how I think. What one was doing in each of those areas was utilising the toolkit of thought about uncertainty – communicating and quantifying uncertainty, whether it was burying nuclear waste or considering investments in military equipment – at the heart of all these things were complex systems of uncertainty, so I felt that I was able to contribute.
I’m still quite lucky now that I have a one-day-a-week role as Deputy Chair of the UK Statistics Authority and within that, I chair the Board that oversees the Office for National Statistics – in particular, the controversies about how you do price indices, RPIs, CPIs, how you measure migration, etc. – I am rather privileged to have a sort of front seat in joining in on those policies and debates and ensuring that the statistical evidence base is as good as it can be.
2) Do you think over the years too much research has focussed on less important areas of statistics? Should the gap between research and applications be reduced? How so and by whom?
What is important research and how you prioritise research is an incredibly interesting area. I spent four years of my life overseeing policies and budgets for how you fund research in the UK. You have a real dilemma and it is neither one thing nor the other – why would governments on behalf of taxpayers spend a lot of money on research if it were not to solve economic, societal or health challenges that contribute to one’s wellbeing? However, the timescale on which solving problems actually leads to positive outcomes in society can be very unpredictable and can only turn out to be very important decades later. So you cannot say “we will do the research and provide you with a solution in three weeks’ time” – it just does not work that way. On the other hand, if all research was not going to yield any outcomes for hundreds of years, the next three or four generations of taxpayers may well think they did not have a very good deal. For me, as an individual researcher and considering the way in which the public approaches topics such as investigative research, there has to be some kind of balance. You have to identify the problems we would really like to solve, such as finding a cure for cancer, for which you can perfectly articulate the case for investment, but it would be completely mad to say that you will find the cure in such and such a period of time.
What you really have to try to ensure is the brightest and the best people having a kind of sense about where the interesting problems are and letting them get on with it. Not everyone can do it – you have to have a substantial portfolio to deliver what you want – and that suited me as an individual. I could be a pure academic in the morning and an applied academic in the afternoon. In retrospect, it will always be the case that huge percentages of research lead nowhere but mainly, you don’t know that until afterwards. It is not easy upfront, and indeed there will also always be grant applications for research that you can put aside straight away with a clear conscience knowing that it would be a waste of time.
Video Interview Part II - where Sir Adrian discusses Big Data, the challenges that statisticians face, the best book he has read on statistics and more


3) What do you see as the greatest challenges facing the profession of statisticians in the coming years?
It touches on this issue of Big Data and data science as a more general area. The challenge really is what is the kind of training and what is the positioning of statistics in its own right as a profession that will continue to be relevant, needed and respected? Over the years, in areas such as clinical medicine, statistics has established a key position for itself. We are now in a world of huge DNA sequencing efforts, with millions of pieces of data which, if you can mine them in the right way, could give you major improvements. Are the key methodological issues of the past still relevant in a world of Big Data? The challenge is for societies such as the RSS to consistently review how the world is changing around them and what we need to do both to be relevant but also, given the expertise and knowledge that statisticians have, how to make sure that is not lost.
4) Are there people or events that have been influential in your career? Also, given that you are one of the most well respected statisticians of your generation and many statisticians look up to you, whose work do you admire (it can be someone working now, or someone whose work you admired greatly earlier on in your career?).
When I started in the world of Bayesian statistics and wanted to become a researcher, I would have to say the big thing that influenced me, not least because it took about a year of my life (and completely wiped out my social life!), was translating the work of Bruno de Finetti. He had set himself a huge enterprise really which was an all-encompassing view from a particular standpoint of probability in all its ramifications – mathematical, philosophical, and so on. I admired the scale of that seminal contribution and of course looking back, there wasn’t a subject domain and it had to be forged. People like Jimmie Savage and Dennis Lindley who pioneered the thinking and back before them, de Finetti and Ramsey – those courageous first steps where people do seminal things is something I have always admired.
5) If you had not got involved in the field of statistics, what do you think you would have done? (Is there another field that you could have seen yourself making an impact on?)
When I was at school and deciding where to apply for university, it was at a time post-Cold War when universities were creating for the first time, Departments of Russian. Now when I was applying to Cambridge to study mathematics, I was at a very small country grammar school which was not very well versed on the application procedure, so I did not find out until very late in the day that I had to have studied what is called an inflected language and I had not studied Latin or Greek. So I had to pick something and I picked Russian, which I learnt in two to three months in order to obtain an O-level in order to get into Cambridge. So having done that, I could have gone somewhere to study Russian further. There was an English teacher at my school who’d been trained in Russian immediately after the Second World War and had worked in Germany. He suggested I should study Russian and be amongst the first in a generation to go into Russia.
I was almost certainly a very bad amateur actor and enjoyed acting in plays directed by another of the English teachers at my grammar school who had been a coach at RADA. He said that he thought that he could get me into RADA if I wanted to. So therefore I could have been a foreign office spook or a failed actor but it worked out that I became a statistician in the end.