#1




Is there an example of an *exact* normal distribution in nature?
The normal (or Gaussian) distribution is a terrific approximation in a wide variety of settings. And it's backed by the central limit theorem, which I admittedly do not understand.
But is it exactly observed in nature, for samples with an arbitrarily large number of observations? By exact, I mean that large samples consistently fail to reject the hypothesis of normality.

In finance, rates of return tend to be leptokurtic: they have long tails. There's even some (negative?) skew. That doesn't stop analysts from using the Gaussian distribution as a rough-and-ready model (sometimes to the chagrin of their investors, but that's another matter). What about other settings? Biology? Demography? Chemistry? Atmospheric science? Engineering? I'm guessing that a reliably perfect Gaussian process is empirically rare, since there's typically some extreme event that kicks in periodically. A large and important part of reality may consist of sums of very small errors, but I suspect that another aspect involves sporadic big honking errors. For every few thousand leaky faucets, we get a burst water main.

While I'm at it, why do the Shapiro-Wilk and Shapiro-Francia tests for normality cap out at sample sizes of 2,000 and 5,000?

I posed this question at the now-defunct website teemings.org in 2008. The short answer (by the estimable ultrafilter) was, "no, nothing's going to be exactly normally distributed. On the other hand, particularly in cases where the central limit theorem applies, the difference between what you see and what you'd expect to see is negligible."

A tighter version of my question follows:

1. Has anybody stumbled upon an unsimulated and naturally occurring dataset with, say, more than 100 observations that looks exactly like a textbook normal curve?

2. Does any natural process consistently spin off Gaussian distributions, with p-values consistent with normality virtually all the time? (Presumably this would be produced by something other than the central limit theorem (CLT) alone.) Ultrafilter says "no," if I understand him correctly.

3. Does any natural process consistently spin off large datasets (thousands of observations each) where normality is *not* rejected at least 95% of the time (testing at the 5% level)? If the CLT is the only thing in play, there should be natural processes like this. But I suspect that black swans are pretty much ubiquitous. Then again, it should be possible to isolate a process that hews to a conventional Gaussian.

Bonus question: Do any of the tests for normality evaluate moments higher than the 4th?

As always, wiki is helpful: http://en.wikipedia.org/wiki/Normal_...ion#Occurrence, but I'm not sure whether I should trust their claims of exactness.

Finally, if anybody has an empirical dataset whose process is plausibly Gauss, whose sample is huge, and which is in a reasonably accessible computer format, feel free to link to it if you're curious. I'll run some tests at some point using the statistical package Stata. Datasets are admittedly "all around the internet," but extracting lots of fairly large (2000+) samples typically requires some work.

Last edited by Measure for Measure; 07-03-2011 at 06:45 PM.
#2




Your question is a little goofy. If I generate 20 random samples from a standard normal distribution, I would expect the hypothesis of normality to be rejected for one of them at alpha = .05. That doesn't mean that the points it produces aren't normally distributed, just that strange things can happen on a small sample.
In fact, I just ran this experiment, and the seventh sample didn't look normal to the robust Jarque-Bera test (p = .014). 
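This experiment is easy to replay. The sketch below uses Python with NumPy/SciPy and the Shapiro-Wilk test rather than the poster's robust Jarque-Bera test (which isn't in SciPy); the sample size of 200 and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Draw 20 samples of genuinely normal data and count how many a
# normality test rejects at alpha = 0.05. Even though every sample
# really is Gaussian, roughly one rejection in twenty is expected.
alpha = 0.05
pvals = [stats.shapiro(rng.normal(size=200))[1] for _ in range(20)]
rejections = sum(p < alpha for p in pvals)
print(rejections)
```

The point stands regardless of which test you plug in: at the 5% level, one "non-normal" sample in twenty is exactly what a well-calibrated test should produce on perfectly normal data.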
#4




Quote:
iamnotbatman: Ok. It's just that Wikipedia's image of "The ground state of a quantum harmonic oscillator" doesn't look remotely Gaussian, as it lacks a pair of inflection points. Velocities of ideal gases are not empirical observations (are they?).

Last edited by Measure for Measure; 07-03-2011 at 08:27 PM. 


#5




Sorry, my question is actually a lotta goofy. "Gauss is an approximation MfM! Who cares what happens at the 12th moment?"
I reply that models based upon rough approximations, made by highly paid analysts, got us into trouble at the dawn of this Little Depression. And in a more general context, the late and great Peter Kennedy wrote: Quote:

#6




Wiki's picture for the quantum harmonic oscillator is a bit confusing. You see how it has those pixellated-looking horizontal bands across it? Each of those is one energy level. So the bottommost band represents the ground state (with intensity as a function of position instead of height as a function of position, like on a graph). The ground state of a harmonic oscillator is, in fact, a perfect Gaussian.
But that doesn't really answer the question, either. It just moves it to the question of whether there are really any perfect harmonic oscillators in nature. The harmonic oscillator, like the Gaussian itself, is something that doesn't actually show up exactly very often, but which is a very good and simple approximation for a lot of things that do show up. The best I can come up with is a charged particle in a magnetic field with no other influences on it, but that "no other influences" is a killer. Especially since the particle would also have to have no magnetic dipole moment.

But I'm a bit unclear about what the OP means by a "natural" process. For instance, would rolling a whole bunch of dice and adding them up count as "natural"? Because it's really easy to get an arbitrarily good Gaussian that way. 
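The dice suggestion is easy to check numerically; a minimal sketch in Python (the counts of 50 dice per sum and 5,000 sums are arbitrary assumptions, and SciPy's `normaltest` is the D'Agostino-Pearson skewness/kurtosis test, not the only choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Each observation is the sum of 50 fair dice. By the CLT the sums
# are very close to normal: a single die has zero skew and modest
# negative excess kurtosis, and both shrink by a factor of 50.
sums = rng.integers(1, 7, size=(5000, 50)).sum(axis=1)
_, p = stats.normaltest(sums)   # tests skewness and kurtosis jointly
print(sums.mean(), p)           # mean should sit near 50 * 3.5 = 175
```

The distribution is still discrete and truncated (sums live between 50 and 300), so it is only arbitrarily good, never exact, which is precisely the distinction the thread is circling.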
#7




Quote:
But the spirit of the OP seeks: (a) a one-million-observation empirical Gaussian dataset (give or take an order of magnitude) on the internet that I can evaluate in Stata, and/or (b) verification of the hypothesis that in practice no natural process is wholly governed by the CLT: black swans are ubiquitous, for example. And that's just the 4th moment. Hypothesis (b) is falsified depending upon whether you consider rolled dice a natural process. Are there even better examples? Quote:
Anyway, Wikipedia gives examples of exact Gaussian processes. I can't evaluate them: I can't tell which are theoretical artifacts and which correspond to actual datasets. Would anyone like to take a crack? http://en.wikipedia.org/wiki/Normal_...ion#Occurrence 
#8




Wikipedia gives three examples of exact normality. In truth, none of them are exact, although in practice I believe that the exactness in all three cases could have been (and, I'm quite sure, has been, though I'm too lazy to find a cite) confirmed over thousands of trials to many decimal places.
The first example is the velocities of the molecules in an ideal gas. One reason this isn't truly exact (in the sense of wanting a true continuous distribution) is that the number of molecules is finite. But in practice, for a macroscopic ensemble like a liter of gas, the normality of the distribution would be impossible to differentiate from non-normality with 21st-century technology.

The second example is the ground state of the quantum harmonic oscillator. As Chronos pointed out, there may not be any truly perfect quantum harmonic oscillators in nature. But any (smooth, continuous) potential with local minima will have ground states that to a very good approximation have normal distributions. Nature provides such minima in abundance (in magnetic or electric field configurations, vibrations of diatomic molecules...). The problem is that if you include the finite size of the universe, or the effect of tiny perturbations due to interactions with the quantum vacuum, or even the tiny contribution of gravity waves from the stars in the sky, you are bound to imperceptibly distort your potentials so that they are not exactly Gaussian.

The third example is the position of a particle in quantum mechanics. If you measure the position exactly, then wait some time, and measure again, the distribution of measured positions is exactly Gaussian. Of course, this assumes an idealized particle and no other potentials in the vicinity, which, as I mentioned above, is never going to be achieved perfectly in practice. Similarly, you are not going to be able to perfectly measure the position of the particle in the first place. 
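The ideal-gas example can be illustrated numerically: in kinetic theory each Cartesian velocity component is exactly Gaussian with variance kT/m, while the speed (the vector magnitude) follows the skewed Maxwell-Boltzmann distribution. The sketch below assumes nitrogen at roughly room temperature; the molecular mass and temperature are approximate stand-in values, not measured data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 293.0            # K, assumed room temperature
m = 4.65e-26         # kg, approximate mass of one N2 molecule
sigma = np.sqrt(k_B * T / m)   # thermal velocity scale, ~300 m/s

# Simulate 10,000 molecules' velocity vectors.
v = rng.normal(0.0, sigma, size=(10_000, 3))
speeds = np.linalg.norm(v, axis=1)

# One component is Gaussian by construction; the speed is not.
_, p_component = stats.normaltest(v[:, 0])
_, p_speed = stats.normaltest(speeds)
print(sigma, p_component, p_speed)
```

This also clarifies a recurring confusion in the thread: "molecular velocities are normal" is a statement about components, not about speeds.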
#9




Radioactive decay.



#11




iamnotbatman:
I have difficulty working through these science examples, as I lack a background in either college physics or chemistry. Apologies for my inaccuracy and imprecision: I'm really winging it here.

Ok. In nature there are no ideal gases, no perfect harmonic oscillators, and no perfect measures of any given particle's position. These models approximate the world quite well though. This isn't exactly what I'm getting at. Is there a dataset with ~10K observations of the velocity of the molecules of a liter of nitrogen? [1] If so, I'm not asking whether the distribution produced is perfectly Gauss. I'm asking whether it is consistently indistinguishable from Gauss at the 95% level of confidence. Sufficiently small perturbations only matter with sufficiently large sample sizes. (My call for huge datasets was so that I could have a set of n=2000 subsamples and work out the share of them that reject normality at 5%.) In practice though, I might wonder whether measurement errors make a difference. Ok, now I'm wandering into the goofy again.

[1] Seriously, what are we measuring here? Is it the average velocity of molecules in a liter of gas? I'm guessing that the data would be measuring temperature and pressure, which is something different. The Gaussian would be used to transform empirical observation into a postulated velocity: it would address some processes and set others aside. This is something different from "a Gaussian dataset". I'm calling it "a Gaussian process".

So what I'm gathering from this thread is that "There are lots of examples of the exact Gauss in physics", though I don't have a clear idea of any particular Gauss dataset in physics. Could we specify the experimental setup in greater detail? 
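The subsampling procedure described above (split a huge dataset into n=2000 chunks, record the share rejecting normality at 5%) can be sketched directly. This is an illustration on simulated data, not the actual test battery the poster would run in Stata; the Student-t draw is a hypothetical stand-in for a fat-tailed empirical series.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def rejection_share(data, n=2000, alpha=0.05):
    """Share of consecutive n-sized subsamples that reject normality."""
    k = len(data) // n
    chunks = data[:k * n].reshape(k, n)
    pvals = [stats.normaltest(chunk)[1] for chunk in chunks]
    return float(np.mean([p < alpha for p in pvals]))

gaussian = rng.normal(size=100_000)               # truly Gaussian process
fat_tailed = rng.standard_t(df=3, size=100_000)   # leptokurtic stand-in

share_gauss = rejection_share(gaussian)
share_t = rejection_share(fat_tailed)
print(share_gauss, share_t)
```

For a genuinely Gaussian process the share hovers near the nominal 5%; for a fat-tailed process it approaches 100%, which is exactly the operational distinction the OP is after.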
#13




iamnotbatman:
Thanks for spelling out the ideal gas example. I concede that a random number generator can produce Gauss random variables; I've even fooled around with that a little.

My original motivation was linked to the tendency in the social sciences to assume Gauss errors without credible or even explicit justification. Now that's not necessarily a bad thing: it depends upon the problem in question. And in physics I now understand that there are very good reasons for assuming that certain types of distributions are indeed Gauss. But it is bogus and dangerous to blithely assume Gauss for financial market returns in the face of strong evidence to the contrary, at least without applying robustness checks and general due diligence. And yet that is what was done routinely some years back: this sort of mismanagement formed one of the necessary conditions for the financial crisis and subsequent Little Depression.

Construct the Casual Fact

So much for high-level motivation. For this thread I'm working on a more general level. I'd like to say something along the lines of, "The Gauss distribution doesn't exist empirically in nature: what we have are distribution mixtures." But I don't think that's quite correct. I'm trying to work out the proper rough characterization of the prevalence of observed exact Gaussians.

Again, in many applications this doesn't matter. If you're conducting a hypothesis test and the underlying distribution is even Laplace, applying the Student's t-statistic probably won't steer you that far wrong. Type I and II errors will be less than optimal, but arguably acceptable. Or so I speculate: I haven't read the relevant Monte Carlo study. But if you are forecasting central tendency and dispersion, that's another matter entirely. 


#15




Quote:
 Note to future readers (if any): an interesting discussion of the Gaussian distribution occurs here: http://boards.academicpursuits.us/sdmb/...d.php?t=614864

Last edited by Measure for Measure; 07-04-2011 at 11:02 PM. 
#16




Quote:
If you toss a coin N times, the number of times the coin shows heads (N_h) is a random variable that is binomially distributed, but for large N is approximately normally distributed. (Btw, if you are a masochist you can use this method to produce your own dataset, but you should really just use a software method.)

Now, one source of systematic error is whether or not the coin is 'fair'. But let's ignore that, because even if it weren't fair, the distribution shouldn't have a fat tail. Now suppose you needed a coin tossed a trillion times, so you farmed out the work to some company that used a robot to flip coins and employed image-recognition software to determine which side of the coin landed up. The question is, do you trust the company to do this without error? Perhaps the robot can flip with such precision that if it produces the same 'flipping' force the coin will always land heads-up, and the company's software was never tested beyond a few million flips, and since some of their variables were 32-bit and reset after 2^32 flips, after a few billion flips the robot gets into a pattern where it keeps throwing heads over and over again. If the company was incompetent (which is extremely common in the real world), they may not notice the bug, and hand you the dataset claiming that the systematic error is zero. But in reality you would get a very fat tail: not because the underlying process was non-Gaussian, but because of external factors which were not correctly accounted for.

I think a common problem in the financial world may be arrogance regarding the evaluation of systematic errors, combined with a lot of top-down pressure and under-regulated competitive pressure (tragedy of the commons, etc.): not honestly accounting for systematic error. For example, if you have companies competing to build coin-flipping machines in the marketplace, they are going to make competitive shortcuts and unrealistic promises regarding low systematic errors. Cheaper and flimsier coin-flipping machines may be built and trusted because it is necessary to compete against others making the same mistakes. Needless to say, in such a market, I would not 'bet on' normally distributed data: humans make mistakes, and it is unrealistic to expect that the probability of making such a mistake is as small as a normal distribution says it can be.

I think your focus on normal distributions could be expanded to *any* distribution that claims probabilities that can be vanishingly small (disregarding boundary conditions). In the real world your systematic error is generally large enough that when you add it to the statistical prediction, you always expect some fatness in your tails, to some extent. Most people know this intuitively. For example, we know that in quantum mechanics it is possible for your stapler to tunnel through your desk and onto the floor. The probability is a tail very much like the tail of the normal curve, and is unimaginably small. But if you are doing a home experiment, you have to account for the possibility that someone took your stapler while you weren't looking, and someone else dropped a stapler near your desk and it ended up below yours. That is in fact analogous to one of the systematic errors that must be controlled for when doing some of these actual quantum-mechanical experiments. 
#17




Strictly speaking, an error which causes fat tails on both sides wouldn't be a systematic error, since systematic errors by definition will bias your data in one direction.


#19




iamnotbatman:
As I said, my original curiosity arose from the ubiquitous assumption of Gaussian errors. The financial crisis added some additional motivation though. Most analysts are aware of these issues, but I at least don't have a solid grasp of them.

I like the systematic-error concept. I'm inclined to abandon my "no observed Gauss anywhere" notion: there are solid reasons to believe in Gaussian processes in certain physics contexts. Let me propose another conceptual handle: "In practice, most dataset errors reflect some sort of distribution mixture. Following the central limit theorem, the sum of lots of equally weighted distributions will be Gauss. But in practice, Gauss will be an approximation, since the weights won't be equal.[1]" That should encompass the systematic-error concept to some extent.

One of the problems in the social sciences is that there typically aren't solid theoretical reasons for believing in any particular exact error distribution (even if there are plausible arguments for Gaussian approximations or whatever). Furthermore, your dependent variable typically reflects a lot of unmeasurables and even unponderables. Quote:
Quote:
I downloaded a dataset of 15,000 observations from Yahoo. It consists of daily percentage price changes of the S&P Composite, a weighted sum of large-capitalization stocks.[2] I'll compare it to Gauss in an upcoming post.

[1] (Whether the sample size is sufficient to distinguish your empirical distribution from pure Gauss is a separate matter.)
[2] It's the S&P 500, except there were fewer companies in the index during the 1950s.

Last edited by Measure for Measure; 07-06-2011 at 12:18 AM. 


#20




So here's what the distribution of the S&P Composite looks like. As a control, I specified a normally distributed random variable with the same mean and standard deviation:
Code:
    Variable |       Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
dailypret500 |     15474    .0003306      .00967    -.204669      .1158
     normrnd |     15475    .0002605    .0097017    -.036254   .0388416

Now consider kurtosis and skew:

Code:

                          DailyPret500
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -.025709      -.204669
 5%     -.014333       -.09035
10%     -.009873      -.089295       Obs            15474
25%     -.004123      -.088068       Sum of Wgt.    15474
50%      .0004635                    Mean        .0003306
                       Largest       Std. Dev.     .00967
75%      .004963       .070758
90%      .010133       .090994       Variance    .0000935
95%      .014507        .10789       Skewness   -.6616936
99%      .025685         .1158       Kurtosis    25.22477

                            normrnd
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0224537      -.036254
 5%    -.0157981     -.0340847
10%    -.0121544     -.0337052       Obs            15475
25%    -.0062004       -.03304       Sum of Wgt.    15475
50%      .0002771                    Mean        .0002605
                       Largest       Std. Dev.   .0097017
75%      .0068832      .0339581
90%      .0126438      .0363308      Variance    .0000941
95%       .016328      .0382705      Skewness    .0217871
99%      .0227815      .0388416      Kurtosis     3.00358

Code:

                 Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
dailypret500 |      0.000          0.000             .            .
     normrnd |      0.268          0.904          1.24       0.5379

http://wm55.inbox.com/thumbs/44_130b...5_oP.png.thumb

It looks a little like the Burj Dubai. Here it is with an overlaid perfect normal distribution:

http://wm55.inbox.com/thumbs/45_130b...c_oP.png.thumb

The Burj 500 is pointier, has longer tails, and is somewhat skewed. Are outliers driving this effect? Let's see what happens if we remove the 30 most negative and 30 most positive returns. That would be one day per year on average, so we are removing swans of all colors.

Code:

                          DailyPret500
-------------------------------------------------------------
      Percentiles      Smallest
 1%     -.024287       -.04356
 5%     -.014088      -.043181
10%     -.009789       -.04279       Obs            15414
25%     -.004107      -.042532       Sum of Wgt.    15414
50%      .0004635                    Mean        .0003492
                       Largest       Std. Dev.  .0087627
75%      .004948       .040826
90%      .010061       .041729       Variance    .0000768
95%      .014279       .041867       Skewness   -.0402395
99%      .024358        .04241       Kurtosis    5.482383

                            normrnd
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.0224537      -.036254
 5%     -.015802     -.0340847
10%    -.0121599     -.0337052       Obs            15414
25%    -.0062022       -.03304       Sum of Wgt.    15414
50%      .0002828                    Mean        .0002593
                       Largest       Std. Dev.   .0097034
75%      .0068737      .0339581
90%      .0126438      .0363308      Variance    .0000942
95%      .0163285      .0382705      Skewness    .0216746
99%      .0227815      .0388416      Kurtosis    3.004975

                 Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
-------------+--------------------------------------------------------
dailypret500 |      0.041          0.000             .       0.0000
     normrnd |      0.272          0.876          1.23       0.5400

In short, Gauss is a pretty rough approximation for financial returns, and while black swans are a big part of the story, they are not the only part. Note that I picked a period of relative economic stability: 1890-1950 was far more tumultuous. For that matter, 1830-1890 wasn't exactly smooth sailing either. 
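For readers without Stata, the moment comparison above can be sketched in Python. The real S&P series isn't bundled here, so a fat-tailed Student-t draw stands in for the daily returns (a hypothetical proxy; its scale of 0.006 is arbitrary) against a matched normal control, mirroring the dailypret500-vs-normrnd setup.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 15_474
returns_proxy = rng.standard_t(df=3, size=n) * 0.006   # leptokurtic stand-in
normal_ctrl = rng.normal(returns_proxy.mean(), returns_proxy.std(), size=n)

# Pearson kurtosis (normal = 3), matching Stata's convention above.
kurt_proxy = stats.kurtosis(returns_proxy, fisher=False)
kurt_ctrl = stats.kurtosis(normal_ctrl, fisher=False)
print(kurt_proxy, kurt_ctrl)
```

The stand-in reproduces the qualitative pattern of the tables: sample kurtosis far above 3 for the fat-tailed series, and very close to 3 for the matched normal.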
#21




Do you know what it means for a time series to be nonstationary?

#22




Quote:
Code:
. dfuller dailypret500, regress

Dickey-Fuller test for unit root                 Number of obs = 15473

               ---------- Interpolated Dickey-Fuller ----------
           Test        1% Critical     5% Critical    10% Critical
         Statistic        Value           Value           Value
------------------------------------------------------------------
 Z(t)    -120.229        -3.430          -2.860          -2.570
------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.0000

------------------------------------------------------------------------------
          D. |
dailypret500 |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
dailypret500 |
         L1. |  -.9660763   .0080353  -120.23   0.000    -.9818264   -.9503262
       _cons |   .0003187   .0000777     4.10   0.000     .0001663     .000471
------------------------------------------------------------------------------

Last edited by Measure for Measure; 07-07-2011 at 01:21 AM. 
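For the curious, the regression underlying Stata's dfuller output can be reproduced with plain least squares: it fits dy_t = c + b*y_{t-1} + e_t and tests b = 0 (a unit root). The sketch below is not a full Dickey-Fuller test (no MacKinnon critical values); it just recovers the slope, using white noise as an assumed stand-in for serially uncorrelated daily returns.

```python
import numpy as np

rng = np.random.default_rng(11)

y = rng.normal(size=15_474)   # stand-in for daily returns (white noise)
dy = np.diff(y)               # dy_t = y_t - y_{t-1}
ylag = y[:-1]

# OLS of dy on a constant and lagged y.
X = np.column_stack([np.ones_like(ylag), ylag])
coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
print(coef[1])
```

For serially uncorrelated data the slope sits near -1, matching the L1 coefficient of about -0.966 in the Stata table: returns themselves are emphatically not a random walk, even though prices are.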
#23




...strictly speaking, some financial analysts like to use log changes, but I doubt that presentation would make a big difference. I could be wrong though.

ETA: IIRC, stationarity implies constant variance as well, which is something that financial returns don't have. Variance tends to be autocorrelated.

Last edited by Measure for Measure; 07-07-2011 at 01:26 AM. 
#24




Quote:
Poking around the internet, it seems that EasyFit is one of a few pieces of specialty software used for fitting lots of different distributions to a dataset. This sort of procedure isn't a standard part of the usual general-purpose statistical packages. Stata 8 fits the normal, for example, but that's all AFAIK. Then again, I know little about these tests: I don't know whether measuring and matching moments would be straightforward. 


#25




Quote:
You also don't really need a test to see that the S&P 500 returns are not stationary. I did a quick plot of the returns over the period 1999-2010, and it's immediately obvious that you're not looking at a stationary series. Any longer time period would show much more variability. 
#26




Oh, but I agree financial returns can be and are modeled with GARCH: 2nd moments are autocorrelated, after all. In fact I had GARCH in mind when I spoke of "underlying structure". But returns are also commonly modeled assuming normally distributed returns. Look at the Value at Risk literature. Consider the Black-Scholes model of options pricing. Read the business press when they speak of once-every-10,000-year events that for some reason seem to recur every 5 years. Taleb wrote an entire book on this, which admittedly I haven't read.
Again, I'm not claiming that financial professionals are unaware of these issues, although I guess I am saying that they are known to be blown off now and then. Permit me to quote from John Hull's Options, Futures and Other Derivatives, 4th ed. He has a chapter on "Estimating Volatilities and Correlations" [with GARCH]: Quote:
Last edited by Measure for Measure; 07-09-2011 at 06:17 PM. 
#27




Quote:
Also, the physics department at my uni back in the day had a really neat demo lab. One of the demos was an impressively large bean machine. I used to love playing with that. And it always produced a nice-looking normal curve. Not quite nature, since it was constructed by humans, but the principles are correct. 
#28




I believe that radioactive decay is a Poisson process.
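That is consistent with the earlier "radioactive decay" suggestion: decay counts in a fixed interval are Poisson, and a Poisson with a large mean is itself well approximated by a normal, which is presumably why high-rate decay counts look Gaussian in practice. A small simulation (the mean count of 10,000 per interval is an assumed, hypothetical source rate):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

# 5,000 counting intervals from a high-rate source. A Poisson with
# mean 10,000 has skewness 0.01 and excess kurtosis 0.0001, so a
# normality test on skew/kurtosis should rarely reject.
counts = rng.poisson(lam=10_000, size=5_000)
_, p = stats.normaltest(counts)
print(counts.mean(), p)
```

With a low-rate source (mean counts near 1) the skew is large and the normal approximation visibly fails, so "Poisson, hence approximately normal" is a statement about the high-count regime only.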

#29




Quote:
There are plenty of harmonic oscillators in nature for which this is a pretty good approximation: it works extremely well for diatomic molecules. If you use the better-fitting Morse potential instead of a perfect harmonic potential, the lowest-order state is pretty close to a perfect Gaussian. And, of course, the cross section through a TEM_{00} mode in a laser is a perfect Gaussian. In the real world, of course, there will invariably be dust specks in the beam, and the mirrors will be of finite extent.

I doubt if anything in the real world can ever be a perfect Gaussian, because I doubt if anything in the real world will ever be a perfect function of any sort. 


#30




Ignorance fought, Cal. Thanks to all the participants in this discussion.
I'd like to wind up a few loose threads. Quote:
Quote:
 GARCH is a method of taking into account serially correlated variances. So in the context of the stock market, a large move today implies a large move tomorrow, though we won't know the direction of that move. According to one set of authors^{2}, while volatility clustering in normal GARCH models will increase the kurtosis of the series, it generally doesn't do so sufficiently to reflect the kurtosis (or long tails) of financial market returns. Like other researchers, they opt for a GARCH model with non-Gauss innovations.

I would think that a normal GARCH model would produce zero skew, though I haven't verified this. Eriksson and Forsberg (2004)^{3} use a GARCH model with conditional skewness. That seems to me to be an odd way of modelling momentum in returns, but I frankly don't understand this properly. Anyway, they appear to use the Wald distribution in their GARCH model rather than Gauss.

Still, the OP is about the applicability and robustness of Gauss in general, and is not confined to financial markets. As computing power is cheap, it might not be a bad idea for the researcher to examine the descriptive statistics for empirical errors of uncertain provenance. But whether such a procedure would involve a risk of inappropriate data mining is something that I would have to think harder about.

ETA: Bean machines. No data. Discrete output. If it were made continuous by measuring impact location on a plate, it would have an odd shape unless the bottom pins were moving.

^{1}See Robust Inference by Frank Hampel (2000)
^{2}See "Kurtosis of GARCH and Stochastic Volatility Models with Non-normal Innovations" by Xuezheng Bai, Jeffrey R. Russell, and George C. Tiao, July 27, 2001
^{3}See "The Mean Variance Mixing GARCH (1,1) model: a new approach to estimate conditional skewness" by Anders Eriksson and Lars Forsberg, 2004

All these working papers are available as .pdfs via Google.

Last edited by Measure for Measure; 07-13-2011 at 03:56 AM. Reason: Bean machine comment. 
#31




A bean machine meets the conditions of the Central Limit Theorem, so it'll be a good approximation to within the limitations imposed by the binning, the finite number of beans, and the truncated tails. But of course you still have those limitations.
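The bean machine is simple enough to simulate outright: each bean bounces left or right at every row of pins, so its final bin is a Binomial(rows, 0.5) count, which is the CLT setup described above, complete with the binning and truncated tails. The 12 rows and 10,000 beans below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

rows, beans = 12, 10_000
bins = rng.binomial(rows, 0.5, size=beans)        # final bin per bean
counts = np.bincount(bins, minlength=rows + 1)    # beans per bin
print(counts)
```

The resulting histogram is bell-shaped with its mode in the center bin (rows // 2), but it is still discrete, bounded at 0 and 12, and only as smooth as the bean count allows, exactly the limitations noted.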

#32




If man's current understanding of physics is correct, then I would guess that a true normal distribution in nature is flat out impossible.
One property of the normal distribution is that there is a nonzero probability of exceeding any value. Thus, if the distribution of velocities of a set of particles were truly normal, then there would be a nonzero chance that one or more of the particles exceeds the speed of light, which is impossible if man's current understanding of physics is correct. Similarly, if the position of a particle after time t is normally distributed, then there is a nonzero chance that the particle moved faster than the speed of light. 
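It is worth quantifying just how small that tail is. Assuming a thermal velocity scale of roughly 300 m/s (a hypothetical value for a room-temperature gas, as in the earlier posts), the speed of light sits about a million standard deviations out:

```python
import math

c = 2.998e8        # speed of light, m/s
sigma = 300.0      # assumed thermal velocity scale, m/s
z = c / sigma      # roughly one million standard deviations

# Upper-tail probability of a standard normal beyond z.
# Mathematically positive, but it underflows to exactly 0.0
# in double precision (erfc underflows near z ~ 38).
p = math.erfc(z / math.sqrt(2)) / 2
print(z, p)
```

So the objection is real but unobservable: the "impossible" faster-than-light tail mass is far below anything a finite dataset, or double-precision arithmetic, could ever register.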
#33




Trivial Hijack!
Quote:
Boring Historical Details

Here's the original French: Quote:
This doesn't teach us much if we don't have data on phi and psi. So we form a hypothesis for phi and call it the Law of Errors. And here's my paraphrase: as the great physicist Gabriel Lippmann once told Poincaré, "Everybody believes in the Gaussian Law of Errors, the experimenters because they imagine that it is a mathematical theorem, and the mathematicians because they believe it is an empirical fact."

^{1}Translator's note: huh? 
#34




Quote:
By the central limit theorem, he hypothesized that the resulting distribution would be Gauss. He was happy with his results: these graphs do indeed appear to be approximately normal^{1}, as he came to label that distribution. This seemed to him to justify the use of least-squares methods. Sort of.

His data were reevaluated in 1928 by Edwin B. Wilson and Margaret M. Hilferty. They concluded that the sample had many more outliers than a Gaussian, and a positive skew as well. The dataset was revisited in 2009 by Roger Koenker of the University of Illinois using modern significance tests. Gaussian skewness was rejected in 19 out of 24 days; Gaussian kurtosis was rejected on all days. The author suggested that median approaches might be superior to mean ones, and that quantile approaches might be even better.

How did Peirce, sometimes referred to as one of the two greatest American scientists of the 1800s, form his conclusion? Well, the plots actually do reveal some visual skew and kurtosis. But they also conform to Tukey's maxim: "All distributions are normal in the middle."

^{1}See Peirce, C. S. (1873): "On the Theory of Errors of Observation," Report of the Superintendent of the U.S. Coast Survey, pp. 200-224. Reprinted in The New Elements of Mathematics (1976), collected papers of C. S. Peirce, ed. by C. Eisele, Humanities Press: Atlantic Highlands, N.J., vol. 3, part 1, 639-676.

Last edited by Measure for Measure; 07-23-2011 at 09:44 PM. 