Manifesto

Having no useful door to nail these to, I present them here for general public digestion:

  1. Probability is math. Statistics is (applied) epistemology. The biggest questions in statistics revolve around the limits of our knowledge. What conclusions are justified based on which processes? How should we interpret the results of an experiment? When is data bad or good? These are philosophical questions. Math can help you answer them, but only in the sense that knowledge of mechanical engineering helps you drive a taxi.
  2. In Monte Carlo we trust. Studying equations which converge as “n” goes to infinity can provide great theoretical insight. But if you want to know how things work, or might work, in the real world of complicated models and limited trials, you need to run an experiment. Monte Carlo simulations, being the purest of all possible experiments, are often the best map we have between theory and reality.
  3. Check your assumptions. Repeat.
  4. There are no “outliers”, only extreme results. If you remove data from your analysis for reasons other than known error in data collection or transcription, you are no longer doing science. You are ignoring evidence, very often strong evidence. Whenever you hear someone (usually from the world of finance) speak of a “12 sigma event” or some other occurrence that should happen once in a stega-godzillion years, that’s not an outlier. It’s a sign they are using a dangerously inappropriate model.
  5. “All models are wrong, but some are useful.” Attributed to statistician George Box. Models are maps, imperfect simplifications of a more complicated underlying reality. In addition to being useful, they can also be powerful, insightful, and lucrative. But if you start saying things like “the data prove that my model is right”, then you’ve failed to understand statistics (see item #1).
  6. Data is information. Useful models reduce entropy. Data isn’t just numbers, or categories. It’s not a series of zeros and ones. A stream of data is a stream of information. It has an information rate and a level of entropy. Better data lowers the effective entropy. Better models or procedures lower the effective entropy.
  7. Look closely enough, and everything has a distribution. View any data point with a strong enough microscope and it starts to look fuzzy. In math, 2 and 2 sum up to exactly 4. In the real world, there is always a margin of error. The true sample space can never be fully known or bounded or perfectly modeled as a sigma algebra. Real coins land on their edges every once in a blue moon. Sometimes nothing happens. Sometimes, nothing happens.
  8. It’s all about the evidence. Studies and experiments neither prove nor disprove assertions. Instead, they provide evidence for or against them. Sometimes the evidence is strong, other times it’s weak or mixed. How we evaluate new evidence, judging its absolute strength and integrating it with prior evidence and belief, is therefore the very foundation of all statistical work. Unfortunately, axiomatic attempts to establish the correct way to integrate new evidence have fallen out of favor or languished in obscurity (see Bruno de Finetti or Richard Royall). We are left only with the shadow of a debate, in the form of locked antlers between frequentists and Bayesians.
  9. Revolution is in the air. Or, at the least, it should be. The statistical analyses and processes in current use by scientists were created for mathematical elegance and ease of computation. They are sub-optimal tools which encourage problematic claims (X has no relationship with Y because the data in our model failed to meet an arbitrary p-value cutoff) and questionable assumptions (Normality, independence, outliers can be removed). Meanwhile, ever larger datasets combined with massive increases in computational power require new ways to understand and model data. We can scan through gigabytes of data and test millions of model and parameter combinations in a single afternoon. We have dozens of exotic new data mining concepts (Lasso, Bayesian Classifiers, Simulated neural networks). We have tools of unfathomable power, complexity, and diversity, yet the foundations of our discipline have scarcely evolved since they were laid down many decades ago by one single biologist. Statistics has yet to have its Cantor, its Gödel or Turing. Our world is quantum, our mindset still classical.
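
Items 2 and 4 lend themselves to a quick experiment. What follows is only a sketch under assumptions of my own choosing, not a claim about any real dataset: the "true" process is taken to be a Student-t with 3 degrees of freedom, the analyst wrongly fits a Normal model, and the function names, sample size, and seed are all made up for illustration. It counts how many "5 sigma events" turn up and compares that with how many the Normal model says to expect.

```python
import math
import random


def draw_student_t(rng, df):
    """Draw one Student-t variate: standard normal over sqrt(chi-square / df)."""
    # A chi-square with df degrees of freedom is a Gamma(df / 2, scale=2).
    chi_square = rng.gammavariate(df / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi_square / df)


def count_sigma_events(n_draws=100_000, df=3, k_sigma=5.0, seed=2011):
    """Return (observed, expected) counts of k-sigma events.

    observed: draws farther than k_sigma sample standard deviations
              from the sample mean.
    expected: the count a Normal model predicts for the same n_draws.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = [draw_student_t(rng, df) for _ in range(n_draws)]

    mean = sum(data) / n_draws
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n_draws)

    observed = sum(1 for x in data if abs(x - mean) > k_sigma * sd)
    # Two-sided Normal tail probability: P(|Z| > k) = erfc(k / sqrt(2)).
    expected = n_draws * math.erfc(k_sigma / math.sqrt(2.0))
    return observed, expected


observed, expected = count_sigma_events()
print(f"5-sigma events observed:                 {observed}")
print(f"5-sigma events the Normal model expects: {expected:.3f}")
```

Under these assumptions the Normal model expects well under one such event in the whole sample, while the simulated heavy-tailed data should produce them by the hundreds. Nothing is wrong with the data. The model is wrong.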

Last updated November 9, 2011.