- Arthur Charpentier of Freakonometrics discusses GLM, non-linearity and heteroscedasticity.
- A statistical analysis of the popular TV show “How I Met Your Mother” based on IMDB user ratings. As you may recall, diffuseprior did a similar analysis for The Simpsons earlier this year.
- Christian Robert of Université Paris-Dauphine, aka Xi’an, has a two-part review of *Machine Learning: A Probabilistic Perspective* by Kevin P. Murphy.
- A very short tutorial on how to estimate the number of visitors to a website accurately when some of them have “cookies” disabled.
- For those interested in quantitative finance, here’s a list of blogs to bookmark for future reference.
- New to R? Bright North Lab shares a beginner’s experience of learning R – from basic graphs to performance tuning.
- Here at StatisticsBlog, Matt Asher wrote about The disgrace of the mandatory census and judicial cowardice in the trial of Audrey Tobias.

## October, 2013

28

Oct 13

## The week in stats (Oct. 28th edition)

22

Oct 13

## The disgrace of the mandatory census

In 2011, Audrey Tobias refused to provide Statistics Canada with a filled-out copy of her census form, as mandated by law. Her refusal, and her decision to stand by it, led to a trial in which the 89-year-old faced jail time. Although Tobias stated that her act was a protest against the use of US military contractor Lockheed Martin to process the forms, and not against the mandatory nature of the census itself, this was really a trial of the government’s power to compel citizens to provide it with private information. As Tobias’ lawyer, Peter Rosenthal, argued, compelling Tobias to fill out the form on threat of jail was a violation of the Canadian Charter of Rights, and its provisions for freedom of conscience and expression.

The judge in the case, Ramez Khawly, rejected Rosenthal’s argument, but found a way to find Tobias not guilty anyway on the basis of his doubt about her intent in not filling out the form. Perhaps sensing the outrage that might ensue over punishing an octogenarian for a non-violent act of civil disobedience, Khawly was nevertheless too fearful, or obtuse, to uphold an argument that would set a highly inconvenient precedent from the standpoint of the state. The judge both justified and exposed his particular mix of cowardice and compassion by asking, “Could they [the Crown] not have found a more palatable profile to prosecute as a test case?”

I suppose I shouldn’t be surprised by the judge’s politically expedient decision. What shocks me is the reaction of many regular citizens, and in particular of some fellow statisticians. Let me be as clear as possible about this: support for the mandatory census is a moral abomination and a professional disgrace. It *should* go without saying that informed consent is a baseline, a *bare minimum for morality* when conducting experiments with human subjects. Forcing citizens to divulge information they would otherwise wish to keep private, on pain of throwing them in a locked cage, does not qualify as informed consent!

There is no point here in arguing that what’s being requested is a minor inconvenience, or an inconsequential imposition. Informed consent doesn’t mean “what we think you should consent to.” More than anything else, statistics is about understanding the inherent uncertainties in measurement, prediction, and extrapolation. That you might not object to answering certain questions is no reason to assume your preferences are universal. Finally, note that to at least a small group of revolutionaries, the right *not* to divulge certain information to authorities was so important that it was written right into the Bill of Rights.

Besides the argument that the census is minimally invasive, I’ve also heard it argued that the value of obtaining complete data outweighs concerns of privacy and choice. To this I say that our desire, as statisticians, for complete and reliable data isn’t some ethical trump card, nor is it the scientific version of a religious indulgence that purifies our transgressions.

Dealing with incomplete and imprecise data isn’t some unique problem that can be overcome at the point of a gun; it’s the very heart and soul of statistics! In the real world, there is no such thing as indisputably complete or infinitely precise data. That’s why we have confidence intervals, likelihood estimates, rules for data cleaning, and a wide variety of sampling procedures. In fact, these sampling procedures, if properly chosen and well executed, can be more accurate than a census.

I call on all those who work for StatsCan or other organizations to *refuse to participate in any non-consensual surveys*, to stand up for their own good name and the good name of the profession, and to focus their energies on finding creative, scientifically sound, non-coercive ways to obtain high quality data.

21

Oct 13

## The week in stats (Oct. 21st edition)

- Spreadsheets are user friendly, but they can also be dangerous. Patrick Burns explains why you should avoid spreadsheets and work with R instead.
- How’s your fantasy team doing? Revolution Analytics compiles a series of Fantasy Football modelling articles by Boris Chen of the New York Times.
- Rexer Analytics has been conducting regular polls of data miners and analytics professionals on their software choices since 2007. They presented the results of the 2013 Rexer Analytics Data Miner Survey at last month’s Predictive Analytics World conference in Boston.
- Everyone understands the p-value, except for those who don’t. Here is an example that once again shows the p-value – that workhorse of modern science – continues to be misinterpreted in even the top tiers of the scientific literature.
- Despite all the hype surrounding big data and analytics, Louis Columbus of Forbes argues that the majority of business analysts lack access to the data and tools they need. Columbus explains why and how this should be changed.
- Six Decades of the Most Popular Names for Girls, State-by-State, represented all in one interactive map.

14

Oct 13

## The week in stats (Oct. 14th edition)

- The *R is my friend* blog publishes a series of four articles on neural networks. This is probably one of the most comprehensive introductions to neural networks in R. If you are in love with neural nets and want to learn even more, here is another tutorial by Saptarsi Goswami.
- State-by-state media preferences as revealed by bit.ly.
- Andrew Gelman, Professor of Statistics and Political Science at Columbia University, discusses why Bing is preferred to Google by people who aren’t like him.
- Have you heard of Simpson’s Paradox? Here is an interactive visual (using the 1973 Berkeley sex discrimination lawsuit as an example) that explains the paradox in 60 seconds.
- Dan Delany does a visual breakdown of employees furloughed due to the U.S. government shutdown. The main view shows furloughed proportions by department, and there are real-time tickers for duration, estimated unpaid salary, and estimated food vouchers unpaid.
- If there is an 82% chance that an event will occur within your lifetime (and assuming that you live for 70 years), what is the probability that this event will occur on any given day?
- Tableau, the popular interactive data visualization tool, is coming out with a new 8.1 update, and it will include integration with the R language. Learn how to integrate the two in just 30 seconds.
- A short (but not trivial) lesson on data smoothing using R.
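
For the lifetime-probability question in the list above, here is one possible back-of-the-envelope treatment. It is only a sketch, not necessarily the approach of the linked post. It assumes the event has the same probability p on every day, that days are independent, and that a 70-year life has 70 × 365 = 25,550 days (leap days ignored). Under those assumptions 1 − (1 − p)^n = 0.82, so p = 1 − 0.18^(1/n). The function name `daily_probability` is made up for this illustration.

```python
import math

def daily_probability(lifetime_prob: float, years: float, days_per_year: int = 365) -> float:
    """Per-day probability p such that the event happens at least once
    in `years` years with probability `lifetime_prob`.

    Assumes independent days with a constant probability (an assumption
    made for this sketch, not something stated in the question)."""
    n_days = years * days_per_year
    # Solve 1 - (1 - p)**n = lifetime_prob for p.
    # log1p/expm1 keep precision when p is tiny.
    return -math.expm1(math.log1p(-lifetime_prob) / n_days)

p = daily_probability(0.82, 70)
print(f"daily probability: {p:.3e}")
print(f"roughly 1 in {1 / p:,.0f}")
```

Under these assumptions the daily probability comes out near 6.7e-05, or about 1 in 15,000. If days are not independent, or the risk changes with age, the answer changes too.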

7

Oct 13

## The week in stats (Oct. 7th edition)

- The picture above is a very well-known mathematical construction called the fractal cat. Brian Lee Yung Rowe shows how to construct fractal artworks using R.
- Arthur Charpentier of Freakonometrics explains how to construct ROC (~~rate of change~~ Receiver Operating Characteristic) curves in R, as well as how to interpret and plot them. This is useful for those in fields that frequently encounter longitudinal data, such as finance, engineering or biostatistics.
- There are many kinds of intervals in statistics. To name a few of the common ones: confidence intervals, prediction intervals, credible intervals, and tolerance intervals. Each is useful and serves its own purpose. You should not only know their names, but also when to use them and why.
- A map of the most visited website for every country in the world (source: Alexa.com), as well as the internet population of each country.
- Suppose that you drop 5 blue marbles and 5 red marbles randomly (and uniformly) on the interval [0,1]. What is the probability that the marbles will interleave each other?
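
The marble question above can be checked with a short counting argument and a simulation. This sketch takes "interleave" to mean that the colors strictly alternate when read left to right, which is an assumption about the intended meaning. By symmetry, every left-to-right color ordering of the 10 marbles is equally likely (ties have probability zero), there are C(10, 5) = 252 such orderings, and exactly 2 of them alternate, giving 2/252 = 1/126. The function names below are made up for this illustration.

```python
import random
from fractions import Fraction
from math import comb

def exact_interleave_probability(k: int) -> Fraction:
    """Exact probability that k blue and k red marbles alternate in color.

    All comb(2k, k) color orderings are equally likely; 2 of them alternate."""
    return Fraction(2, comb(2 * k, k))

def simulate_interleave(k: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate: drop marbles uniformly on [0, 1] and check alternation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        marbles = [(rng.random(), "B") for _ in range(k)]
        marbles += [(rng.random(), "R") for _ in range(k)]
        colors = [color for _, color in sorted(marbles)]
        # With equal counts, "no two neighbors share a color" means strict alternation.
        if all(a != b for a, b in zip(colors, colors[1:])):
            hits += 1
    return hits / trials

exact = exact_interleave_probability(5)
estimate = simulate_interleave(5, trials=100_000)
print(f"exact: {exact} (about {float(exact):.4f}), simulated: {estimate:.4f}")
```

The simulated frequency should land close to 1/126 (about 0.0079), with the usual Monte Carlo noise.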