stats


21
May 13

What are the chances this headline will still be true in 10 years?

In this post I’ll be discussing the ideas presented in The Half-Life of Facts, by Samuel Arbesman. The book argues that facts, which we often take to be iron-clad, unchanging laws of the universe, are regularly discovered to be false or replaced by updated versions. He argues that while it’s impossible to predict in advance how long a particular fact will endure, in aggregate truth values decay at stable rates. In effect, Arbesman is proposing a kind of Law of Large Numbers for belief.
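The "stable rate of decay" idea can be sketched in a few lines, assuming (as the half-life metaphor implies) simple exponential decay. The function name and the 45-year half-life below are hypothetical, chosen only for illustration; they are not figures from the book.

```python
def surviving_fraction(years: float, half_life: float) -> float:
    """Fraction of a body of facts still standing after `years`,
    assuming exponential decay with the given half-life."""
    return 0.5 ** (years / half_life)


if __name__ == "__main__":
    HALF_LIFE = 45.0  # hypothetical half-life in years, for illustration only
    for years in (10, 45, 90):
        print(years, round(surviving_fraction(years, HALF_LIFE), 3))
```

Under this model the answer for any single fact is only a probability; the rate is stable only in aggregate.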

Arbesman’s thesis, I should say right up front, is highly appealing to me. It fits my belief that all facts are, to some extent, fuzzy, uncertain, contingent, and most importantly prone to revision over time as new information comes in. Of course, some facts, or categories of facts, are more likely to be revised than others. What I hoped to get from Arbesman’s book was a deep analysis of why some facts (or fictions) last longer than others, and how you might quantify different categories of facts from the viewpoint of survival analysis.

What are facts?
Arbesman defines facts as “individual states of knowledge awareness.” His main way of subdividing facts is on the basis of how quickly they change, from those constantly in flux (the current weather) to the very stable (the number of continents). In between are what Arbesman calls “mesofacts,” those which change at an intermediate timescale. Most of our scientific knowledge fits in this category.

When I mentioned the continents, you may have wondered whether I was referring to the number of huge landmasses on earth (a slow-changing fact, by any measure), or what we consider to be a continent. For example, if scientists decide that Madagascar or Baffin Island should be called a continent, the quantity of large land-masses on earth hasn’t changed.

Brawndo, the thirst mutilator!
This may seem like an obvious distinction, but it’s one that Arbesman fails to make. He conflates facts about the earth with nomenclature, confusing words with objects. The worst example of this confusion occurs in the chapter on how facts spread. Arbesman explains how we came to use the word “brontosaurus” for what, by scientific convention, should be called the apatosaurus, as this name came first. Here the “fact” that changed doesn’t really have anything to do with the nature of dinosaurs; it has to do with the name we’ve decided to give them (which is, of course, a matter of convention, and arbitrary). To Arbesman, though, this issue of nomenclature becomes an “erroneous” fact which has “sadly” persisted for way too long.

The conflation of semantics and understanding allows Arbesman to hide a normative decree in a linguistic assessment. If my explanation of the confusion between the descriptive and prescriptive is, itself, confusing, consider Mike Judge’s wonderful illustration from the film Idiocracy. The main character tries to explain to the people of the future that their plants are dying because they are being irrigated with Brawndo, a sports drink. Here’s how their conversation goes:

Arbesman’s failure to draw a line around what are facts, and what aren’t, leads to even deeper confusions. Making this distinction clear would be, no doubt, a very difficult task. But instead of attempting it, and risking falling into “an epistemological rabbit hole,” Arbesman shrugs and paraphrases the supremely weaselly Supreme Court Justice Potter Stewart, who said that no precise, legal definition of pornography was needed, because “I know it when I see it.”

Without a line (however fuzzy) drawn around his subject, Arbesman quickly wanders off from an insightful discussion of the decay rate of information in physics, medicine and scientific models in general, to a broad discussion of the things in our world that change. This transition is completed in the chapter titled “Moore’s law of everything,” in which Arbesman compares exponential growth in computing power to other technologies with accelerating levels of change, like transportation. At this point it’s no longer clear which are the facts under consideration. Is it the maximum number of transistors per chip? Is it our model of how technology changes? Or is it the rate of change of change itself?

Is change a constant?
This last question might be the most interesting one of all. More clearly stated, what is the derivative of the half-life of facts, for a given category? And even one more step beyond, are these derivatives themselves stable? I want to know what the evidence says. Are medical facts becoming obsolete faster than ever? Has our knowledge about basic physical concepts like inertia begun to solidify? Arbesman hints at these questions, but just barely. I was very disappointed by his lack of rigor and quantification. Perhaps this field of study still needs its Darwin or John Graunt, someone willing to spend years or decades compiling and analyzing the minutiae of how facts change, before coming up with a well-informed model of truth decay.

My own suspicion? The stability of a fact is proportional to how well the related field of study is established, and to how long that particular fact has been considered valid. Thus the lifespan of facts would be Weibull distributed, or have some variant of the Unreliable Friend distribution (more about that in a future post). Arbesman hints at this possibility when discussing the history of mathematical proof. He notes that the waiting time for a conjecture to be settled follows a heavy-tailed distribution, which makes it difficult to predict how much longer it will take for mathematicians to come to a conclusion about long-standing problems.
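One way to read this suspicion, offered here only as an interpretation: a Weibull lifetime with a shape parameter below 1 has a hazard rate that falls with age, so the longer a fact has survived, the less likely it is to be overturned next. The sketch below uses the standard Weibull hazard and survival formulas; the shape and scale values are hypothetical.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous 'overturn rate' at age t for a Weibull lifetime."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t: float, shape: float, scale: float) -> float:
    """Probability that a fact is still considered valid at age t."""
    return math.exp(-((t / scale) ** shape))


if __name__ == "__main__":
    SHAPE, SCALE = 0.5, 50.0  # hypothetical values; shape < 1 gives a falling hazard
    for age in (1, 10, 100):
        print(age, weibull_hazard(age, SHAPE, SCALE), weibull_survival(age, SHAPE, SCALE))
```

With shape equal to 1 the hazard is constant and the model reduces to the plain exponential half-life picture.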

But even this attempt at a more nuanced view of half-lives hints at another problem with Arbesman’s incomplete taxonomy of facts, and his unwillingness to specify which facts we are discussing. In this case of mathematics, it seems at first that he might be referring to the underlying proposition itself. This leads me to wonder if Arbesman is positing (at least implicitly) a Schrödinger’s cat view of mathematics, where Fermat’s Last Theorem (FLT) exists in a state of superposition, both true and false and indeterminate all at once, waiting for Andrew Wiles to come along to open the lid, peer into the box, and declare it “true.” Another interpretation is that the fact being discussed is the social phenomenon; mathematicians went from believing that FLT was probably true but definitely unproven, to believing that FLT was indisputably true. Based on his initial definition in terms of awareness, I assume it’s the latter. Unfortunately, no clarification is forthcoming, and Arbesman misses out on an opportunity to comment on the two most interesting twists in the FLT saga, especially from the point of view of evaluating “facts”. For one, Wiles made a crucial mistake in his first official version of the proof, and for the other, Wiles’s proof depends on a newer, and somewhat controversial, mathematical assumption (the Axiom of Choice).

Chart from The Half-Life of Facts showing the increase in transportation speeds over time.

The depths of shallowness
I suppose there’s a limit to how much depth we can expect from a general interest book. Still, I’m disappointed that the author seems to explicitly avoid discussing the basic, hard puzzles of knowledge: How close to the (real?) truth are the “facts” we are learning today? What is the probability that these will be later found out to be untrue? Does that probability go to one on a long enough timeline, and to what extent can we quantify that timeline?

Instead of rigorous analysis, Arbesman fills out his short book by rehashing famous stories from well-known research papers (if I have to read about the gorilla on the basketball court one more time, I just might go apeshit). We do get occasional bits of insight, usually in the form of quotes, like Lord Kelvin’s insistence that anything that can be measured, can be measured incorrectly, or John M. Smith’s quip that “Statistics is the science that lets you do twenty experiments a year and publish one false result in Nature.”

This last quote refers to the p-value, which Arbesman does a decent job of explaining, though I’m not sure he fully understands it. He quotes John Ioannidis saying that, “If a study is small, it can yield a positive result more easily due to random chance.” However, the use of a fixed p-value cutoff generally ensures that the exact opposite is true (see this delightfully humorous video about “The power of the test”). The structure of hypothesis testing can be tricky, but since Arbesman is described on the book jacket as an applied mathematician, I’m not willing to grade him on a curve.
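The point about fixed cutoffs can be checked with a small simulation. This is a minimal sketch under stated assumptions: a two-sided z-test of "mean = 0" on normal data with known standard deviation 1, a 0.05 cutoff, and a hypothetical true effect of 0.3. The function name and every parameter value are illustrative.

```python
import math
import random
from statistics import NormalDist


def rejection_rate(n, true_mean, alpha=0.05, trials=2000, seed=0):
    """Share of simulated studies of size n whose two-sided z-test
    rejects the null hypothesis 'mean = 0' at the fixed cutoff alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample_mean = sum(rng.gauss(true_mean, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)  # standard deviation assumed known and equal to 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


if __name__ == "__main__":
    for n in (10, 100):
        print("n =", n,
              "| false positives (no effect):", rejection_rate(n, 0.0),
              "| detections (effect 0.3):", rejection_rate(n, 0.3))
```

With the cutoff held fixed, the false positive rate should stay near the cutoff whatever the sample size, while the chance of detecting a real effect grows with sample size. Under these assumptions a small study is less likely, not more, to yield a positive result.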

There’s one other confusion in Arbesman’s book that I feel compelled to point out, since it may just be the most insidious (and common) epistemological mistake of all: the conflation of facts, predictions, and models. Arbesman mixes them all together in a short passage. In describing a computer simulation of a social network, he says:

“When [the researchers] ran this experiment, they discovered that weak ties aren’t that important to spreading knowledge. While weak ties do in fact hold the network together, much as Granovetter suspected, they aren’t integral for spreading facts.”

Did you catch that? Arbesman went from describing a model (in this case a computer simulation) that generated a prediction (about the spread of information), to asserting a fact about our world (weak ties “aren’t integral for spreading facts”).

Am I just being annoying, noxious, always lingering?
Am I being overly fussy (to use the nicer word)? Am I too focused on precise definitions and picky distinctions, at the cost of missing the bigger picture? I don’t think so. The history of scientific progress, and in particular statistics, shows a strong correlation between linguistic and taxonomic advances. We can look back and see how progress is stifled by a lack of common, well-defined terms. For example, some of the early attempts to understand probability disintegrated into confused debates that could have been avoided with a clear stating of terms. More recently, E.T. Jaynes resolved Bertrand’s paradox of the random chord by explicitly defining the characteristics a “random” chord would need to have.

If Arbesman is sloppy with the details, can he at least get credit for presenting the broader story in context? To some extent, I think so. As a general tour of how facts change, there’s no mistaking the basic message: facts do change, and we can be particularly blind (or caught off guard) when it comes to changes which happen at a medium pace. I wish, though, that Arbesman had explicitly connected this broader story with what is, to me, the central lesson: all of our beliefs should come with a measure of doubt!

To understand this doubt mathematically, we use probability theory. To understand it in practice, we use a framework for statistical inference. There are a number of these frameworks available, each with its own strengths and weaknesses. Hume said we could never infer anything from anything, giving us a kind of historical “null hypothesis” of inference, one that’s been soundly rejected by the evidence of scientific and technological progress. Fisher and von Mises maintained that probability should be restricted to long term frequencies. Keynes and Jeffreys spoke of subjective probabilities and degrees of rational belief. Jaynes viewed probability theory as an extension of logical deduction.

All modern approaches to inference share the assumption that knowledge is not static, and that empirical evidence provides partial information. Full certainty, to the extent that it exists at all, is to be found only in the very long run (mathematically speaking, at the infinite limit). As such, we need to recognize the provisional nature of all facts.


18
Apr 13

Sudden clarity about the null hypothesis

Can’t take credit for this realization (I studied at an “Orthodox” shop), and the “Clarance” wishes to be anonymous, so send all karma points to your favorite (virtual) charity.


27
Mar 13

Minding the reality gap

Officially, unemployment in the US is declining. It’s fallen from a high of 9.1% a couple years ago, to 7.8% in recent months. This would be good news, if the official unemployment rate measured unemployment, in the everyday sense of the word. It doesn’t. The technical definition of “U3” unemployment, the most commonly reported figure, excludes people who’ve given up looking for work, those who’ve retired early due to market conditions, and workers so part time they clock in just one hour per week.

Most critically, unemployment excludes the 14 million Americans on disability benefits, a number which has quadrupled over the last 30 years. If you include just this one segment of the population in the official numbers, the unemployment rate would double. On Saturday, This American Life devoted their entire hour to an exploration of this statistic. Russ Roberts, whose podcast I’ve recommended in the past, discussed the same topic last year. Despite the magnitude of the program and the scale of the change, these are the only outlets I know of to report on the disability number, and on the implications it has for how we interpret the decline in U3 unemployment.

Targeting the number, not the reality
Statistics, in the sense of numerical estimates, are measures which attempt to condense the complex world of millions of people into a single data point. Honest statistics come with margins of error (the most honest indicate, at least qualitatively, a margin of error for their margin of error). But even the best statistical measures are merely symptoms of some underlying reality; they reflect some aspect of the reality as accurately as possible. The danger with repeated presentations of any statistic (as in the quarterly, monthly, and even hourly reporting of GDP, unemployment, and Dow Jones averages), is that we start to focus on this number by itself, regardless of the reality it was created to represent. It’s as if the patient has a high fever and all anyone talks about is what the thermometer says. Eventually the focus becomes, “How do we get the thermometer reading down?” All manner of effort goes into reducing the reading, irrespective of the short, and certainly long-term, health of the patient. When politicians speak about targeting unemployment figures, this is what they mean, quite literally. Their goal is to bring down the rate that gets reported by the Bureau of Labor Statistics, the number discussed on television and in every mainstream source of media.

Politicians focus on high profile metrics, and not the underlying realities, because the bigger and more complicated the system, the easier it is to tweak the method of measurement or its numeric output, relative to the difficulty of fixing the system itself. Instead of creating conditions which allow for growth in employment (which would likely require a reduction in politicians’ legislative and financial powers), the US has quietly moved a huge segment of its population off welfare, which counts against unemployment, and into disability and prisons — the incarcerated also don’t count in U3, whether they are slaving away behind bars or not.

How metrics go bad
Over time, all social metrics diverge from the reality they were created to reflect. Sometimes this is the result of a natural drift in the underlying conditions; the metric no longer captures the same information it had in the past, or no longer represents the broad segment of society it once did. For example, the number of physical letters delivered by the postal service no longer tracks the level of communication between citizens.

Statistics and the reality they were designed to represent are also forced apart through deliberate manipulation. Official unemployment figures are just one example of an aggressively targeted/manipulated metric. Another widely abused figure is the official inflation rate, or core Consumer Price Index. This measure excludes food and energy prices, for the stated reason that they are highly volatile. Of course, these commodities represent a significant fraction of nearly everyone’s budget, and their prices can be a leading indicator of inflation. The CPI also uses a complex formula to calculate “hedonics,” which mark down reported prices based on how much better the new version of a product is compared to the old one (do a search for “let them eat iPads”).

I don’t see it as a coincidence that unemployment and inflation figures are among the most widely reported and the most actively manipulated. In fact, I take the following to be an empirical trend so strong I’m willing to call it a law: the greater the visibility of a metric, the more money and careers riding on it, the higher the likelihood it will be “targeted.” In this light, the great scandal related to manipulation of LIBOR, a number which serves as a pivot point for trillions of dollars in contracts, is that the figure was assumed to be accurate to begin with.

Often the very credibility of the metric, built up over time by its integrity and ability to reflect an essential feature of the underlying reality, is cashed in by those who manipulate it. Such was the case with the credit ratings agencies: after a long run of prudent assessments, they relaxed their standards for evaluating mortgage bundles, cashing in on the windfall profits generated by the housing bubble.

Why we don’t see the gaps
It might seem like the disconnect between a statistic and reality would cause a dissonance that, once large enough to be clearly visible, would lead to reformulation of the statistic, bringing it back in line with the underlying fundamentals. Clearly there are natural pressures in that direction. For example, people laid off at the beginning of a recession are unlikely to believe that the recovery has begun until they themselves go back to work. Their skepticism of the unemployment figure erodes its credibility. Unfortunately, two powerful forces work against the re-alignment of metric and reality: the first related to momentum and our blindness to small changes, the second having to do with the effects of reflexivity and willful ignorance.

In terms of inertia, humans have a built-in tendency to believe that what has been will continue to be. More sharply, the longer a trend has continued, the longer we presume it will continue — if it hasn’t happened yet, how could it happen now? Laplace’s rule of succession is our best tool for estimating probabilities under the assumption of a constant generating process, one that spits out a stream of conditionally independent (exchangeable) data points. But the rule of succession fails utterly, at times spectacularly, when the underlying conditions change. And underlying conditions always change!
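For reference, Laplace’s rule of succession estimates the probability of another success as (s + 1) / (n + 2) after s successes in n trials. A minimal sketch (the function name is illustrative) shows why a long unbroken run breeds confidence:

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of the probability that the next trial succeeds,
    assuming a constant generating process and exchangeable trials."""
    return Fraction(successes + 1, trials + 2)


if __name__ == "__main__":
    # An unbroken run: every trial so far has been a "success".
    for run_length in (0, 10, 100, 1000):
        print(run_length, rule_of_succession(run_length, run_length))
```

The estimate creeps toward 1 as the run lengthens, and nothing in the formula can signal that the generating process itself has changed.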

These changes, when they come slowly, pass under our radar. Humans are great at noticing large differences from one day to the next, but poor at detecting slow changes over long periods of time. Ever walked by an old store with an awning or sign that’s filthy and falling apart? You wonder how the store owner could fail to notice the problem, but there was never any one moment when it passed from shiny and new to old and decrepit. If you think you’d never be as blind as that shop keeper, look down at your keyboard right now. As with our environment, if the gap between statistic and reality changes slowly, over time, we may not see the changes. Meanwhile, historical use of the statistic lends weight to its credibility, reducing the chance that we’d notice or question the change: it has to be right, it’s what we’ve always used!

The perceived stability of slowly changing systems encourages participants to depend on or exploit it. This, in turn, can create long term instabilities as minor fluctuations trigger extreme reactions on the part of participants. Throughout the late 20th century and the first years of the 21st, a large number of investors participated in the “Carry Trade,” a scheme which depended on the long term stability of the Yen, and of the differential between borrowing rates in Japan and interest rates abroad. When conditions changed in 2008, investors “unwound” these trades at full speed, spiking volatility and encouraging even more traders to exit their positions as fast as they could.

These feedback loops are an example of reflexivity, the tendency in some complex systems for perception (everyone will panic and sell) to affect reality (everyone panics and sells). Reflexivity can turn statistical pronouncements into self-fulfilling prophecies, at least for a time. The belief that inflation is low, if widespread, can suppress inflation in and of itself! If I believe that the cash in my wallet and the deposits in my bank account will still be worth essentially the same amount tomorrow or in a year, then I’m less likely to rush out to exchange my currency for hard goods. Conversely, once it’s clear that my Bank of Zimbabwe Bearer Cheques have a steeply declining half-life of purchasing power, then I’m going to trade these paper notes for tangible goods as quickly as possible, nominal price be damned!

Don’t look down

If perception can shape reality, then does the gap between reality and statistic matter? Clearly, the people who benefit most from the status quo do their best to avoid looking down, lest they encourage others to do the same. More generally, though, can we keep going forward so long as we don’t look down, like Wile E. Coyote chasing the road runner off a cliff?

The clear empirical answer to that question is: “Yes, at least for a while.” The key is that no one knows how long this while can last, nor is it clear what happens when the reckoning comes. Despite what ignorant commentators might have said ex post facto, by 2006 there was wide understanding that housing prices were becoming unsustainably inflated. In 2008, US prices crashed back down to earth. North of the border, in Canada, the seemingly equally inflated housing market stumbled, shrugged, then continued along a more level, but still gravity-defying trajectory.

The high cost of maintaining the facade
Even as the pressures to close the gap grow along with its size, the larger the divergence between official numbers and reality, the greater the pressures to keep up the facade. If the fictional single entity we call “the economy” appears to be doing better, politicians get re-elected and consumers spend more money. When the music finally stops, so too will the gravy-train for a number of vested interests. So the day of reckoning just keeps getting worse and worse as more and more resources go into maintaining the illusion, into reassuring the public that nothing’s wrong, into extending, pretending, and even, if need be, shooting the messenger.

It’s not just politicians and corporations who become invested in hiding and ignoring the gap. We believe official statistics because we want to believe them, and we act as if we believe them because we believe that others believe them. We buy houses or stocks at inflated prices on the hope that someone else will buy them from us at an even more inflated price.

My (strong) belief is that most economic and political Black Swans are the result of mass delusion, based on our faith in the quality and meaning of prominently reported, endlessly repeated, officially sanctioned statistics. The illustration at the beginning of this post comes from a comic I authored about a character who makes his living off just this gap between official data and the reality on the ground, a gap that always closes, sooner or later, making some rich and toppling others.


3
Dec 12

The surprisingly weak case for global warming

I welcome your thoughts on this post, but please read through to the end before commenting. Also, you’ll find the related code (in R) at the end. For those new to this blog, you may be taken aback (though hopefully not bored or shocked!) by how I expose my full process and reasoning. This is intentional and, I strongly believe, much more honest than presenting results without reference to how many different approaches were taken, or how many models were fit, before everything got tidied up into one neat, definitive finding.

Fast summaries

TL;DR (scientific version): Based solely on year-over-year changes in surface temperatures, the net increase since 1881 is fully explainable as a non-independent random walk with no trend.
TL;DR (simple version): Statistician does a test, fails to find evidence of global warming.

Introduction and definitions

As so often happens to terms which have entered the political debate, “global warming” has become infused with additional meanings and implications that go well beyond the literal statement: “the earth is getting warmer.” Anytime someone begins a discussion of global warming (henceforth GW) without a precise definition of what they mean, you should assume their thinking is muddled or their goal is to bamboozle. Here’s my own breakdown of GW into nine related claims:

  1. The earth has been getting warmer.
  2. This warming is part of a long term (secular) trend.
  3. Warming will be extreme enough to radically change the earth’s environment.
  4. The changes will be, on balance, highly negative.
  5. The most significant cause of this change is carbon emissions from human beings.
  6. Human beings have the ability to significantly reverse this trend.
  7. Massive, multilateral cuts to emissions are a realistic possibility.
  8. Such massive cuts are unlikely to cause unintended consequences more severe than the warming itself.
  9. Emissions cuts are better than alternative strategies, including technological fixes (e.g. iron fertilization), or waiting until scientific advances make better technological fixes likely.

Note that not all proponents of GW believe all nine of these assertions.

The data and the test (for GW1)

The only claims I’m going to evaluate are GW1 and GW2. For data, I’m using surface temperature information from NASA. I’m only considering the yearly average temperature, computed by finding the average of four seasons as listed in the data. The first full year of (seasonal) data is 1881, the last year is 2011 (for this data, years begin in December and end in November).

According to NASA’s data, in 1881 the average yearly surface temperature was 13.76°C. Last year the same average was 14.52°C, or 0.76°C higher (standard deviation on the yearly changes is 0.11°C). None of the most recent ten years have been colder than any of the first ten years. Taking the data at face value (i.e. ignoring claims that it hasn’t been properly adjusted for urban heat islands or that it has been manipulated), the evidence for GW1 is indisputable: The earth has been getting warmer.

Usually, though, what people mean by GW is more than just GW1; they mean GW2 as well, since without GW2 none of the other claims are tenable, and the entire discussion might be reduced to a conversation like this:

“I looked up the temperature record this afternoon, and noticed that the earth is now three quarters of a degree warmer than it was in the time of my great great great grandfather.”
“Why, I do believe you are correct, and wasn’t he the one who assassinated James A. Garfield?”
“No, no, no. He’s the one who forced Sitting Bull to surrender in Saskatchewan.”

Testing GW2

Do the data compel us to view GW as part of a trend and not just background noise? To evaluate this claim, I’ll be taking a standard hypothesis testing approach, starting with the null hypothesis that year-over-year (YoY) temperature changes represent an undirected random walk. Under this hypothesis, the YoY changes are modeled as independent draws from a distribution with mean zero. The final temperature represents the sum of 130 of these YoY changes. To obtain my sampling distribution, I’ve calculated the 130 YoY changes in the data, then subtracted the mean from each one. This way, I’m left with a distribution with the same variance as in the original data. YoY jumps in temperature will be just as spread apart as before, but with the whole distribution shifted over until its expected value becomes zero. Note that I’m not assuming a theoretical distributional form (e.g. Normality); all of the data I’m working with is empirical.

My test will be to see if, by sampling 130 times (with replacement!) from this distribution of mean zero, we can nonetheless replicate a net change in global temperatures that’s just as extreme as the one in the original data. Specifically, our p-value will be the fraction of times our Monte Carlo simulation yields a temperature change of greater than 0.76°C or less than -0.76°C. Note that mathematically, this is the same test as drawing from the original data, unaltered, then checking how often the sum of changes resulted in a net temperature change of less than 0 or more than 1.52°C.
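The procedure described above can be sketched as follows. This is an illustrative Python version, not the R code the post refers to. The function names are hypothetical, and the demo series is synthetic placeholder data (not NASA's record), included only so the sketch runs.

```python
import random


def centered(changes):
    """Subtract the mean so the changes keep their spread but average zero."""
    mean = sum(changes) / len(changes)
    return [c - mean for c in changes]


def random_walk_p_value(changes, trials=10000, seed=0):
    """Fraction of trendless resampled walks whose net change is more
    extreme (in either direction) than the net change actually observed."""
    rng = random.Random(seed)
    observed = abs(sum(changes))  # net change over the whole record
    pool = centered(changes)      # empirical distribution with mean zero
    hits = 0
    for _ in range(trials):
        total = sum(rng.choices(pool, k=len(pool)))  # sampling with replacement
        if abs(total) > observed:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Synthetic placeholder series of 130 yearly changes, NOT the real data.
    demo_rng = random.Random(42)
    demo_changes = [demo_rng.gauss(0.006, 0.11) for _ in range(130)]
    print(random_walk_p_value(demo_changes))
```

To reproduce the test on real data, the list passed in would hold the 130 observed YoY changes in the order they occurred.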

I have not set a “critical” p-value in advance for rejecting the null hypothesis, as I find this approach to be severely limiting and just as damaging to science as J-Lo is to film. Instead, I’ll comment on the implied strength of the evidence in qualitative terms.

Initial results

The initial results are shown graphically at the beginning of this post (I’ll wait while you scroll back up). As you can see, a large percentage of the samples gave a more extreme temperature change than what was actually observed (shown in red). During the 1000 trials visualized, 56% of the time the results were more extreme than the original data after 130 years’ worth of changes. I ran the simulation again with millions of trials (turn off plotting if you’re going to try this!); the true p-value for this experiment is approximately 0.55.

For those unfamiliar with how p-values work, this means that, assuming temperature changes are randomly plucked out of a bundle of numbers centered at zero (i.e. no trend exists), we would still see equally dramatic changes in temperature 55% of the time. Under even the most generous interpretation of the p-value, we have no reason to reject the null hypothesis. In other words, this test finds zero evidence of a global warming trend.

Testing assumptions Part 1

But wait! We still haven’t tested our assumptions. First, are the YoY changes independent? Here’s a scatterplot showing the change in temperature one year versus the change in temperature the next year:

Looks like there’s a negative correlation. A quick linear regression gives a p-value of 0.00846; it’s highly unlikely that the correlation we see (-0.32) is mere chance. One more test worth running is the ACF, or the Autocorrelation function. Here’s the plot R gives us:

Evidence for a negative correlation between consecutive YoY changes is very strong, and there’s some evidence for a negative correlation between YoY changes which are 2 years apart as well.
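The independence check amounts to correlating each YoY change with the one that follows it. A minimal sketch, assuming a plain Pearson correlation on consecutive pairs (which matches the scatterplot, though R's acf uses a slightly different estimator); the function names are illustrative:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(changes, lag=1):
    """Correlation between each change and the change `lag` years later (lag >= 1)."""
    return pearson(changes[:-lag], changes[lag:])


if __name__ == "__main__":
    # Toy series that alternates direction every year, for illustration only.
    toy = [0.2, -0.1, 0.15, -0.2, 0.1, -0.05, 0.25, -0.15]
    print(lagged_correlation(toy, lag=1), lagged_correlation(toy, lag=2))
```

A clearly negative lag-1 value means that an unusually large rise tends to be followed by a fall, or by a smaller-than-average rise.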

Before I explain how to incorporate this information into a revised Monte Carlo simulation, what does a negative correlation mean in this context? It tells us that if the earth’s temperature rises by more than average in one year, it’s likely to fall (or rise less than average) the following year, and vice versa. The bigger the jump one way, the larger the jump the other way next year (note this is not a case of regression to the mean; these are changes in temperature, not absolute temperatures). If anything, this is evidence that the earth has some kind of built in balancing mechanism for global temperature changes, but as a non-climatologist all I can say is that the data are compatible with such a mechanism; I have no idea if this makes sense physically.

Correcting for correlation

What effect will factoring in this negative correlation have on our simulation? My initial guess is that it will cause the total temperature change after 130 years to be much smaller than under the pure random walk model, since changes one year are likely to be balanced out by changes next year in the opposite direction. This would, in turn, suggest that the observed 0.76°C change over the past 130 years is much less likely to happen without a trend.

The most straightforward way to incorporate this correlation into our simulation is to sample YoY changes in 2-year increments. Instead of 130 individual changes, we take 65 changes from our set of centered changes, then for each sample we look at that year’s changes and the year that immediately follows it. Here’s what the plot looks like for 1000 trials.

After doing 100,000 trials with 2 year increments, we get a p-value of 0.48. Not much change, and still far from being significant. Sampling 3 years at a time brings our p-value down to 0.39. Note that as we grab longer and longer consecutive chains at once, the p-value has to approach 0 (asymptotically) because we are more and more likely to end up with the original 130 year sequence of (centered) changes, or a sequence which is very similar. For example, increasing our chain from one YoY change to three reduces the number of samplings from 130^130 to approximately 43^43 – still a huge number, but many orders of magnitude less (Fun problem: calculate exactly how many fewer orders of magnitude. Hint: If it takes you more than a few minutes, you’re doing it wrong).
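The chained sampling described above can be sketched as a small helper. `sample_blocks` is a hypothetical function written for this illustration, and the data are synthetic. One difference from the listing at the end of the post: when the block length doesn't divide 130 evenly, this version truncates the last block instead of topping up with a single random change.

```r
# Synthetic stand-in for the centered YoY changes; NOT the NASA series.
set.seed(3)
changes <- diff(rnorm(131, sd = 10))
changes <- changes - mean(changes)

# Draw runs of `block_len` consecutive changes until we have `n_out` values
sample_blocks <- function(changes, block_len, n_out) {
  n_blocks <- ceiling(n_out / block_len)
  lastStart <- length(changes) - block_len + 1
  starts <- sample(1:lastStart, n_blocks, replace = TRUE)
  # Each column holds one run of consecutive changes;
  # as.vector() reads the matrix column by column, preserving order
  blocks <- sapply(starts, function(s) changes[s:(s + block_len - 1)])
  as.vector(blocks)[1:n_out]
}

jumps2 <- sample_blocks(changes, block_len = 2, n_out = 130)
jumps3 <- sample_blocks(changes, block_len = 3, n_out = 130)

sum(jumps2)   # net change for one simulated 130-year path
```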

Correcting for correlation Part 2 (A better way?)

To be more certain of the results, I ran the simulation in a second way. First I sampled 130 of the changes at random, then I threw out any samplings where the correlation coefficient was greater than -0.32. This left me with the subset of random samplings whose coefficients were less than -0.32. I then tested these samplings to see the fraction that gave results as extreme as our original data.

Compared to the chained approach above, I consider this to be a more “honest” way to sample an empirical distribution, given the constraint of a (maximum) correlation threshold. I base this on E.T. Jaynes’ demonstration that, in the face of ignorance as to how a particular statistic was generated, the best approach is to maximize the (informational) entropy. The resulting solution is the most likely result you would get if you sampled from the full space (uniformly), then limited your results to those which match your criteria. Intuitively, this approach says: Of all the ways to arrive at a correlation of -0.32 or less, which are the most likely to occur?
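In code, this amounts to rejection sampling: keep drawing until a sample satisfies the constraint. Below is a minimal sketch on synthetic data. `sample_with_max_cor` is a hypothetical helper, and the threshold of -0.2 is looser than the post's -0.32 purely so the example finishes quickly.

```r
# Keep resampling until the lag-1 correlation is at or below `max_cor`
sample_with_max_cor <- function(changes, max_cor,
                                n_out = length(changes), max_tries = 1e6) {
  for (attempt in seq_len(max_tries)) {
    jumps <- sample(changes, n_out, replace = TRUE)
    lag1 <- cor(jumps[-n_out], jumps[-1])
    if (!is.na(lag1) && lag1 <= max_cor) return(jumps)
  }
  stop("no sample met the correlation constraint")
}

# Synthetic stand-in for the centered YoY changes; NOT the NASA series.
set.seed(4)
changes <- rnorm(130, sd = 10)
changes <- changes - mean(changes)

# Assumed threshold for illustration; the post uses -0.32
jumps <- sample_with_max_cor(changes, max_cor = -0.2)
cor(jumps[-130], jumps[-1])
```

Most draws get thrown away, which is why this approach is so much slower than the chained one.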

For a more thorough discussion of maximum entropy approaches, see Chapter 11 of Jaynes’ book “Probability Theory” or his “Papers on Probability” (1979). Note that this is complicated, mind-blowing stuff (it was for me, anyway). I strongly recommend taking the time to understand it, but don’t bother unless you have at least an intermediate-level understanding of math and probability.

Here’s what the plot looks like subject to the correlation constraint:

If it looks similar to the other plots in terms of results, that’s because it is. Empirical p-value from 1000 trials? 0.55. Because generating samples with the required correlation coefficients took so long, these were the only trials I performed. However, the results after 1000 trials are very similar to those for 100,000 or a million trials, and with a p-value this high there’s no realistic chance of getting a statistically significant result with more trials (though feel free to try for yourself using the R code and your cluster of computers running Hadoop). In sum, the maximum entropy approach, just like the naive random walk simulation and the consecutive-year simulations, gives us no reason to doubt our default explanation of GW2 – that it is the result of random, undirected changes over time.

One more assumption to test

Another assumption in our model is that YoY changes have constant variance over time (homoscedasticity). Here’s the plot of the (raw, uncentered) YoY changes:

It appears that the variance might be increasing over time, but just looking at the plot isn’t conclusive. To be sure, I took the absolute value of the changes and ran a simple regression on them. The result? Variance is increasing (p-value 0.00267), though at a rate that’s barely perceptible; the estimated absolute increase in magnitude of the YoY changes is 0.046. That figure is in hundredths of degrees Celsius, so our linear model gives a rate of increase in variability of just 4.6 ten-thousandths of a degree per year. Over the course of 130 years, that equates to an increase of six hundredths of a degree Celsius (margin of error of 3.9 hundredths at two std deviations). This strikes me as a minuscule amount, though relative to the size of the YoY changes themselves it’s non-trivial.
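The arithmetic behind the six hundredths is simply the slope times the number of years: 0.046 × 130 ≈ 6. Here is a hedged sketch of the same procedure on synthetic data, where the spread is made to grow slowly by construction. The fitted numbers won't match the ones above, and all variable names other than `rawChanges` and `pts` are assumptions.

```r
# Synthetic YoY changes whose spread grows with time (an assumption made
# for illustration; these are NOT the NASA data).
set.seed(5)
pts <- 1:130
rawChanges <- rnorm(130, mean = 0, sd = 8 + 0.03 * pts)

# Regress the size of each change on its position in time
fit <- lm(abs(rawChanges) ~ pts)
slope <- unname(coef(fit)["pts"])
slopeSE <- summary(fit)$coefficients["pts", "Std. Error"]

# Implied total increase over the whole series, with a two-SE margin
totalIncrease <- slope * length(pts)
marginOfError <- 2 * slopeSE * length(pts)

c(slope = slope, total = totalIncrease, margin = marginOfError)
```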

Does this increase in volatility invalidate our simulation? I don’t think so. Any model which took into account this increase in volatility (while still being centered) would be more likely to produce extreme results under the null hypothesis of undirected change. In other words, the bigger the yearly temperature changes, the more likely a random sampling of those changes will lead us far away from our 13.8°C starting point in 1881, with most of the variation coming towards the end. If we look at the data, this is exactly what happens. During the first 63 years of data the temperature increases by 42 hundredths of a degree, then drops 40 hundredths in just 12 years, then rises 80 hundredths within 25 years of that; the temperature roller coaster is becoming more extreme over time, as variability increases.

Beyond falsifiability

Philosopher Karl Popper insisted that for a theory to be scientific, it must be falsifiable. That is, there must exist the possibility of evidence to refute the theory, if the theory is incorrect. But falsifiability, by itself, is too low a bar for a theory to gain acceptance. Popper argued that there were gradations and that “the amount of empirical information conveyed by a theory, or its empirical content, increases with its degree of falsifiability” (emphasis in original).

Put in my words, the easier it is to disprove a theory, the more valuable the theory. (Incorrect) theories are easy to disprove if they give narrow prediction bands, are testable in a reasonable amount of time using current technology and measurement tools, and if they predict something novel or unexpected (given our existing theories).

Perhaps you have already begun to evaluate the GW claims in terms of these criteria. I won’t do a full assay of how the GW theories measure up, but I will note that we’ve had several long periods (10 years or more) with no increase in global temperatures, so any theory of GW3 or GW5 will have to be broad enough to encompass decades of non-warming, which in turn makes the theory much harder to disprove. We are in one of those sideways periods right now. That may be ending, but if it doesn’t, how many more years of non-warming would we need for scientists to abandon the theory?

I should point out that a poor or a weak theory isn’t the same as an incorrect theory. It’s conceivable that the earth is in a long-term warming trend (GW2) and that this warming has a man-made component (GW5), but that this will be a slow process with plenty of backsliding, visible only over hundreds or thousands of years. The problem we face is that GW3 and beyond are extreme claims, often made to bolster support for extreme changes in how we live. Does it make sense to base extreme claims on difficult to falsify theories backed up by evidence as weak as the global temperature data?

Invoking Pascal’s Wager

Many of the arguments in favor of radical changes to how we live go like this: Even if the case for extreme man-made temperature change is weak, the consequences could be catastrophic. Therefore, it’s worth spending a huge amount of money to head off a potential disaster. In this form, the argument reminds me of Pascal’s Wager, named after Blaise Pascal, a 17th century mathematician and co-founder of modern probability theory. Pascal argued that you should “wager” in favor of the existence of God and live life accordingly: If you are right, the outcome is infinitely good, whereas if you are wrong and there is no God, the most you will have lost is a lifetime of pleasure.

Before writing this post, I Googled to see if others had made this same connection. I found many discussions of the similarities, including this excellent article by Jim Manzi at The American Scene. Manzi points out problems with applying Pascal’s Wager, including the difficulty in defining a stopping point for spending resources to prevent the event. If a 20°C increase in temperature is possible, and given that such an increase would be devastating to billions of people, then we should be willing to spend a nearly unlimited amount to avert even a tiny chance of such an increase. The math works like this: Amount we should be willing to spend = probability of 20°C increase (say 0.00001) * harm such an increase would do (a godzilla dollars). The end result is bigger than the GDP of the planet.

Of course, catastrophic GW isn’t the only potential threat that can have Pascal’s Wager applied to it. We also face annihilation from asteroids, nuclear war, and new diseases. Which of these holds the trump card to claim all of our resources? Obviously we need some other approach besides throwing all our money at the problem with the scariest Black Swan potential.

There’s another problem with using Pascal’s Wager style arguments, one I rarely see discussed: proponents fail to consider the possibility that, in radically altering how we live, we might invite some other Black Swan to the table. In his original argument, Pascal the Jansenist (sub-sect of Christianity) doesn’t take into account the possibility that God is a Muslim and would be more upset by Pascal’s professed Christianity than He would be with someone who led a secular lifestyle. Note that these two probabilities – that God is a Muslim who hates Christians more than atheists, or that God is a Christian and hates atheists – are incommensurable! There’s no rational way to weigh them and pick the safer bet.

What possible Black Swans do we invite by forcing people to live at the same per-capita energy-consumption level as our forefathers in the time of James A. Garfield?

Before moving on, I should make clear that humans should, in general, be very wary of inviting Black Swans to visit. This goes for all experimentation we do at the sub-atomic level, including work done at the LHC (sorry!), and for our attempts to contact aliens (as Stephen Hawking has pointed out, there’s no certainty that the creatures we attract will have our best interests in mind). So, unless we can point to strong, clear, tangible benefits from these activities, they should be stopped immediately.

Beware the anthropic principle

Strictly speaking, the anthropic principle states that no matter how low the odds are that any given planet will house complex organisms, one can’t conclude that the existence of life on our planet is a miracle. Essentially, if we didn’t exist, we wouldn’t be around to “notice” the lack of life. The chance that we should happen to live on a planet with complex organisms is 1, because it has to be.

More broadly, the anthropic principle is related to our tendency to notice extreme results, then assume these extremes must indicate something more than the noise inherent in random variation. For example, if we gathered together 1000 monkeys to predict coin tosses, it’s likely that one of them will predict the first 10 flips correctly. Is this one a genius, a psychic, an uber-monkey? No. We just noticed that one monkey because its record stood out.
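The monkey claim is easy to check with a couple of lines. This sketch assumes each monkey guesses every flip independently with a 50% chance of being right; the variable names are just for illustration.

```r
# Chance that one monkey calls 10 fair coin flips correctly
pOneMonkey <- 0.5^10

# Chance that at least one of 1000 independent monkeys does so
pAtLeastOne <- 1 - (1 - pOneMonkey)^1000
pAtLeastOne   # roughly 0.62
```

So "likely" is fair: it happens a bit more often than not.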

Here’s another, potentially lucrative, most likely illegal, definitely immoral use of the anthropic principle. Send out a million email messages. In half of them, predict that a particular stock will go up the next day, in the other half predict it will go down. The next day, send another round of predictions to just those emails that got the correct prediction the first time. Continue sending predictions to only those recipients who receive the correct guesses. After a dozen days, you’ll have a list of people who’ve seen you make 12 straight correct predictions. Tell these people to buy a stock you want to pump and dump. Chances are good they’ll bite, since from their perspective you look like a stock-picking genius.
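The size of that final list follows from halving the pool each day. A quick sketch of the arithmetic, assuming exactly half the remaining recipients see a correct prediction each round:

```r
recipients <- 1e6
rounds <- 12

# Recipients who have seen only correct predictions after 0, 1, ..., 12 rounds
remaining <- recipients / 2^(0:rounds)

floor(remaining[rounds + 1])   # about 244 people left after 12 rounds
```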

What does this have to do with GW? It means that we have to disentangle our natural tendency to latch on to apparent patterns from the possibility that this particular pattern is real, and not just an artifact of our bias towards noticing unlikely events under null hypotheses.

Biases, ignorance, and the brief life, death, and afterlife of a pet theory

While the increase in volatility seen in the temperature data complicates our analysis of the data, it gives me hope for a pet theory about climate change which I’d buried last year (where does one bury a pet theory?). The theory (for which I share credit with my wife and several glasses of  wine) is that the true change in our climate should best be described as Distributed Season Shifting, or DSS. In short, DSS states that we are now more likely to have unseasonably warm days during the colder months, and unseasonably cold days during the warmer months. Our seasons are shifting, but in a chaotic, distributed way. We built this theory after noticing a “weirdening” of our weather here in Toronto. Unfortunately (for the theory), no matter how badly I tortured the local temperature data, I couldn’t get it to confess to DSS.

However, maybe I was looking at too small a sample of data. The observed increase in volatility of global YoY changes might also be reflected in higher volatility within the year, but the effects may be so small that no single town’s data is enough to overcome the high level of “normal” volatility within seasonal weather patterns.

My tendency to look for confirmation of DSS in weather data is a bias. Do I have any other biases when it comes to GW? If anything, as the owner of a recreational property located north of our northern city, I have a vested interest in a warmer earth. Both personally (hotter weather = more swimming) and financially, GW2 and 3 would be beneficial. In a Machiavellian sense, this might give me an incentive to downplay GW2 and beyond, with the hope that our failure to act now will make GW3 inevitable. On the other hand, I also have an incentive to increase the perception of GW2, since I will someday be selling my place to a buyer who will base her bid on how many months of summer fun she expects to have in years to come.

Whatever impact my property ownership and failed theory have on this data analysis, I am blissfully free of one biasing factor shared by all working climatologists: the pressures to conform to peer consensus. Don’t underestimate the power of this force! It affects everything from what gets published to who gets tenure. While in the long run scientific evidence wins out, the short run isn’t always so short: For several decades the medical establishment pushed the health benefits of a low fat, high carb diet. Alternative views are only now getting attention, despite hundreds of millions of dollars spent on research which failed to back up the consensus claims.

Is the overall evidence for GW2 – 9 as weak as the evidence used to promote high carb diets? I have no idea. Beyond the global data I’m examining here, and my failed attempt to “discover” DSS in Toronto’s temperature data, I’m coming from a position of nearly complete ignorance: I haven’t read the journal articles, I don’t understand the chemistry, and I’ve never seen Al Gore’s movie.

Final analysis and caveats

Chances are, if you already had strong opinions about the nine faces of GW before reading this article, you won’t have changed your opinion much. In particular, if a deep understanding of the science has convinced you that GW is a long term, man-made trend, you can point out that I haven’t disproven your view. You could also argue the limitations of testing the data using the data, though I find this more defensible than testing the data with a model created to fit the data.

Regardless of your prior thinking, I hope you recognize that my analysis shows that YoY temperature data, by itself, provides no evidence for GW2 and beyond. Also, because of the relatively long periods of non-warming within the context of an overall rise in global temperature, any correct theory of GW must include backsliding within its confidence intervals for predictions, making it a weaker theory.

What did my analysis show for sure? Clearly, temperatures have risen since the 1880s. Also, volatility in temperature changes has increased. That, of itself, has huge implications for our lives, and tempts me to do more research on DSS (what do you call a pet theory that’s risen from the dead?). I’ve also become intrigued with the idea that our climate (at large) has mechanisms to balance out changes in temperature. In terms of GW2 itself, my analysis has not convinced me that it’s all a myth. If we label random variation “noise” and call trend a “signal,” I’ve shown that yearly temperature changes are compatible with an explanation of pure noise. I haven’t shown that no signal exists.

Thanks for reading all the way through! Here’s the code:

Code in R

theData = read.table("/path/to/theData/FromNASA/cleanedForR.txt", header=T) 
 
# There has to be a more elegant way to do this
theData$means = rowMeans(aggregate(theData[,c("DJF","MAM","JJA","SON")], by=list(theData$Year), FUN="mean")[,2:5])
 
# Get a single vector of Year over Year changes
rawChanges = diff(theData$means, 1)
 
# SD on yearly changes
sd(rawChanges)
 
# Subtract off the mean, so that the distribution now has an expectation of zero
changes = rawChanges - mean(rawChanges)
 
# Find the total range, 1881 to 2011
(theData$means[131] - theData$means[1])/100
 
# Year 1 average, year 131 average, difference between them in hundredths
y1a = theData$means[1]/100 + 14
y131a = theData$means[131]/100 + 14
netChange = (y131a - y1a)*100 
 
# First simulation, with plotting
plot.ts(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3, xlab="Year", ylab="Temperature anomaly in hundredths of a degree Celsius")
 
trials = 1000
finalResults = rep(0,trials)
 
for(i in 1:trials) {
	jumps = sample(changes, 130, replace=T)
 
	# Add lines to plot for this, note the "alpha" term for transparency
	lines(cumsum(c(0,jumps)), col=rgb(0, 0, 1, alpha = .1))
 
	finalResults[i] = sum(jumps)
 
}
 
# Re-plot red line again on top, so it's visible again
lines(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3) 
 
# Find the fraction of trials that were more extreme than the original data
( length(finalResults[finalResults>netChange]) + length(finalResults[finalResults<(-netChange)]) ) / trials
 
# Many more simulations, minus plotting
trials = 10^6
finalResults = rep(0,trials)
 
for(i in 1:trials) {
	jumps = sample(changes, 130, replace=T)
 
	finalResults[i] = sum(jumps)
}
 
# Find the fraction of trials that were more extreme than the original data
( length(finalResults[finalResults>netChange]) + length(finalResults[finalResults<(-netChange)]) ) / trials
 
# Looking at the correlation between YoY changes
x = changes[seq(1,129,2)]
y = changes[seq(2,130,2)]
plot(x,y,col="blue", pch=20, xlab="YoY change in year i (hundredths of a degree)", ylab="YoY change in year i+1 (hundredths of a degree)")
summary(lm(x~y))
cor(x,y)
acf(changes)
 
# Try sampling in 2-year increments
plot.ts(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3, xlab="Year", ylab="Temperature anomaly in hundredths of a degree Celsius")
 
trials = 1000
finalResults = rep(0,trials)
 
for(i in 1:trials) {
	indexes = sample(1:129,65,replace=T)
 
	# Interlace consecutive years, to maintain the order of the jumps
	jumps = as.vector(rbind(changes[indexes],changes[(indexes+1)]))
 
	lines(cumsum(c(0,jumps)), col=rgb(0, 0, 1, alpha = .1))
 
	finalResults[i] = sum(jumps)
}
 
# Re-plot red line again on top, so it's visible again
lines(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3) 
 
# Find the fraction of trials that were more extreme than the original data
( length(finalResults[finalResults>netChange]) + length(finalResults[finalResults<(-netChange)]) ) / trials
 
# Try sampling in 3-year increments
trials = 100000
finalResults = rep(0,trials)
 
for(i in 1:trials) {
	indexes = sample(1:128,43,replace=T)
 
	# Interlace consecutive years, to maintain the order of the jumps
	jumps = as.vector(rbind(changes[indexes],changes[(indexes+1)],changes[(indexes+2)]))
 
	# Grab one final YoY change to fill out the 130
	jumps = c(jumps, sample(changes, 1))
 
	finalResults[i] = sum(jumps)
}
 
# Find the fraction of trials that were more extreme than the original data
( length(finalResults[finalResults>netChange]) + length(finalResults[finalResults<(-netChange)]) ) / trials
 
# The maxEnt method for conditional sampling
lines(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3) 
 
trials = 1000
finalResults = rep(0,trials)
 
for(i in 1:trials) {
	theCor = 0
	while(theCor > -.32) {
		jumps = sample(changes, 130, replace=T)
		theCor = cor(jumps[1:129],jumps[2:130])
	}
 
	# Add lines to plot for this
	lines(cumsum(c(0,jumps)), col=rgb(0, 0, 1, alpha = .1))
 
	finalResults[i] = sum(jumps)
 
}
 
# Re-plot red line again on top, so it's visible again
lines(cumsum(c(0,rawChanges)), col="red", ylim=c(-300,300), lwd=3) 
 
( length(finalResults[finalResults>74]) + length(finalResults[finalResults<(-74)]) ) / trials
 
# Plot of YoY changes over time
plot(rawChanges,pch=20,col="blue", xlab="Year", ylab="YoY change (in hundredths of a degree)")
 
# Is there a trend?
absRawChanges = abs(rawChanges)
pts = 1:130
summary(lm(absRawChanges~pts))

31
Oct 12

Recommendation of the week

“[I]f you have performed any statistical analysis that is more complex than calculating the mean and the standard deviation, you should perform the same analysis on noise to make sure that whatever effect you observe is indeed a unique feature of your data and not an artefact of the analysis.”

Found this one over at Stefan’s sieste blog. I couldn’t agree more, especially now that computers and big data sets entice us to make ever more complex models. Oh, and that’s not a bad thing! As I’ve argued, we’ll need to give up on simple, easy to interpret models in order to get more predictive power.

I’d go even more meta than Stefan and argue that you should re-test your entire model-creating process on noise (perhaps he meant this with his quote). If you started with a data set, then ran a stepwise variable selection algorithm, then added in a new non-linear term to get a better fit, do the same on noise, trying to get the best fit. Are you able to get a statistically significant result? Better still, run the same procedure on different types of noise, not just Gaussian White (I know, sounds like something you’d load into a syringe. Normality, the gateway drug?).
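As a minimal sketch of what re-running a model-building process on noise might look like, the snippet below feeds pure Gaussian noise through forward stepwise selection with R's `step()`. The sample size, the number of predictors, and all variable names are assumptions chosen for illustration; a real check would mirror your own pipeline and, as suggested above, try other kinds of noise too.

```r
# Pure noise: a response and 20 candidate predictors, none related to y
set.seed(6)
n <- 100
k <- 20
noise <- as.data.frame(matrix(rnorm(n * (k + 1)), nrow = n))
names(noise) <- c("y", paste0("x", 1:k))

# Start from an intercept-only model and let forward selection
# add whichever noise variables improve AIC
nullFit <- lm(y ~ 1, data = noise)
upper <- reformulate(paste0("x", 1:k), response = "y")
selected <- step(nullFit, scope = list(lower = ~1, upper = upper),
                 direction = "forward", trace = 0)

# Inspect the "significant" terms the procedure found in noise
coefTable <- summary(selected)$coefficients
coefTable
```

Any predictors that survive selection here are artifacts of the procedure, which is exactly what the quote warns about.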