r


4
Jan 12

Iowa: Was the fix in? (a statistical analysis of the results)

Summary/TL;DR
Either the first precincts to report were widely unrepresentative of Iowa as a whole, or something screwy happened.

Background
Yesterday was the first primary for the 2012 U.S. presidential elections. When I logged off the internet last night, the results in Iowa showed a dead heat between Ron Paul, Mitt Romney, and Rick Santorum. When I woke up this morning and checked the results from my phone, they were very different. So before connecting to the internet, I took a screen shot of what I saw before going to bed. Here it is:

Then I connected to the internet and refreshed that page:

It seemed strange to me that the results should change so dramatically after 25% of the votes had already been recorded. As a statistician, my next question was: how unusual is this? That’s a question that can be tested. In particular, I can test how often you might have a split of voters like the one shown in the first screen shot, if the final split is like the one shown in the other screen shot, given that the first precincts to report were similar to later ones in voter composition.

That’s a lot to digest all at once, so I’m going to repeat and clarify exactly what I’m assuming, and what I’m going to test.

The assumptions
First, I assume the following:
1. That CNN was showing the correct partial results as they became available. Similarly, I am assuming that the amount shown with 99% of votes reported (second screen shot) is the true final tally, give or take some insignificant amount.

2. That the precincts to report their vote totals first were a random sampling of the precincts overall. Given how spread out these appear to be in the first screen shot, this seems like a good assumption. But that might not be the case. See the end of this post for more about that possibility.

3. No fraud, manipulation, or other shenanigans occurred in terms of counting up the votes and reporting them.

The test
Given these three assumptions, I’m going to come up with a numeric value for the following:
1. What is the probability that the split, at 25% of the vote tallied, would show Ron Paul, Mitt Romney, and Rick Santorum all above 6,200 votes.

It’s possible to come up with a theoretical value for this probability using a formal statistical test. If you decide to work this out, make sure to take into account the fact that your initial sample size (25%) is large compared to the total population. You’ll also need to factor in all of the candidates. Could get messy.

For my analysis, I used the tool I trust: Monte Carlo simulation. I created a simulated population of 121,972 votes, with 26,219 who favor Ron Paul, 30,015 who favor Mitt Romney, and so on. Then I sampled 27,009 from them (the total votes tallied as of the first screen shot). Then I looked at the simulated split as of that moment, and saw if the three top candidates at the end are all above 6,200 votes. What about just Ron Paul?

I’ve coded my simulation using the programming language R, you can see my code at the end of this post.

The results
Out of 100,000 simulations, this result came up not even once! In all those trials, Ron Paul never broke 6,067 votes at the time of the split.

I ran this test a couple times, and each time the result was the same.

Conclusion
If my three assumptions are correct, the probability of observing partial results like we saw is extremely small. It’s much more likely that one of the assumptions is wrong. It could be that the early reports were wrong, though that seems unlikely. The other websites showed the same information or very similar, so it seems doubtful that an error occurred in passing along the information.

Was there something odd about the precincts that reported early? This is not something you could tell just by looking at split vs final data. The data clearly show that the later precincts disfavored Ron Paul, but that’s just what we want to know: did they really disfavor him, or was the data manipulated in some way. The question is, were any of the results faked, tweaked, massaged, Diebold-ed?

To answer that question, we’d need to know if these later precincts to report were expected, beforehand, to disfavor Ron Paul relative to the others. It would also help to look at entrance polling from all of the precincts, and compare the ones that were part of the early reporting versus those that were part of the later reports. At this point, I have to ask for help from you, citizen of the internet. Is this something we can figure out?

UPDATE
In case folks are interested, here’s a histogram of the 100,000 simulations. This shows the distribution of votes for Ron Paul as of the split, given the assumptions. As you can see it’s a nice bell curve, which it should be. Also note how far out on the curve 6,240 would be.

The code
Oh, one final possibility is that I messed up my code. You can check it below and see:

# Code for StatisticsBlog.com by Matt Asher
 
# Vote amounts
splits = list()
splits["MR"] = 6297
splits["RS"] = 6256
splits["RP"] = 6240
splits["NG"] = 3596
splits["JRP"] = 2833
splits["MB"] = 1608
splits["JH"] = 169
splits["HC"] = 10
 
finals = list()
finals["MR"] = 30015
finals["RS"] = 30007
finals["RP"] = 26219
finals["NG"] = 16251
finals["JRP"] = 12604
finals["MB"] = 6073
finals["JH"] = 745
finals["HC"] = 58
 
# Get an array with all voters:
population = c()
for (name in names(finals)) {
    population = c(population, rep(name, finals[[name]]))
}
 
# This was the initial split
initialSplit = c()
for (name in names(splits)) {
    initialSplit = c(initialSplit, rep(name, splits[[name]]))
}
 
 
# How many times to pull a sample
iters = 100000
 
# Sample size equal to the size at split
sampleSize = length(initialSplit)
 
successes = 0
justRPsuccesses = 0
 
# Track how many votes RP gets at the split
rpResults = rep(0, iters)
 
for(i in 1:iters) {
	ourSample = sample(population, sampleSize, replace=F)
	results = table(ourSample)
 
	rpResults[i] = results[["RP"]];
 
	if(results[["RP"]]>6200) {
		justRPsuccesses = justRPsuccesses + 1
 
		if(results[["MR"]]>6200 & results[["RS"]]>6200) {
			successes = successes + 1
		}
	}
}
 
cat(paste("Had a total of", successes, "out of", iters, "trials, for a proportion of", successes/iters, "\n"))
cat(paste("RP had a total of", justRPsuccesses, "out of", iters, "trials, for a proportion of", justRPsuccesses/iters, "\n"))

6
Dec 11

My oh my

Noted without comment, visit Biostatistics Ryan Gosling !!! for more gems like the one above.


1
Dec 11

Wasting away again in Martingaleville

Alright, I better start with an apology for the title of this post. I know, it’s really bad. But let’s get on to the good stuff, or, perhaps more accurately, the really frightening stuff. The plot shown at the top of this post is a simulation of the martingale betting strategy. You’ll find code for it here. What is the martingale betting strategy? Imagine you go into a a mythical casino that gives you perfectly fair odds on the toss of a mythically perfect coin. You can bet one dollar or a million. Heads you lose the amount you bet, tails you win tjat same amount. For your first bet, you wager $1. If you win, great! Bet again with a dollar. If you lose, double your wager to $2. Then if you win the next time, you’ve still won $1 overall (lost $1 then won $2). In general, continue to double your bet size until you get a win, then drop your bet size back down to a dollar. Because the probably of an infinite loosing streak is infinitely small, sooner or later you’ll make $1 off of the sequence of bets. Sound good?

The catch (you knew there had to be a catch, right?) is that the longer you use the martingale strategy, the more likely you are to go broke, unless you have an infinitely large bankroll. Sooner or later, a run of heads will wipe out your entire fortune. That’s what the plot at the beginning of this post shows. Our simulated gambler starts out with $1000, grows her pot up to over $12,000 (with a few bumps along the way), then goes bankrupt during a single sequence of bad luck. In short, the martingale stagy worked spectacularly well for her (12-fold pot increase!) right up until the point where it went spectacularly wrong (bankruptcy!).

Pretty scary, no? But I haven’t even gotten to the really scary part. In an interview with financial analyst Karl Denninger, Max Keiser explains the martingale betting strategy then comments:

“This seems to be what every Wall Street firm is doing. They are continuously loosing, but they are doubling down on every subsequent throw, because they know that they’ve got unlimited cash at their disposal from The Fed… Is this a correct way to describe what’s going on?

Karl Denninger replies. “I think it probably is. I’m familiar with that strategy. It bankrupts everyone who tries it, eventually…. and that’s the problem. Everyone says that this is an infinite sum of funds from the Federal Reserve, but in fact there is no such thing as an infinite amount of anything.”

Look at the plot at the beginning of this post again. Imagine the top banking executives in your country were paid huge bonuses based on their firm’s profits, and in the case of poor performance they got to walk away with a generous severance package. Now imagine that these companies could borrow unlimited funds at 0% interest, and if things really blew up they expected the taxpayers to cover the tab through bailouts or inflation. Do you think this might be a recipe for disaster?


20
Oct 11

Queueing up in R, continued

Shown above is a queueing simulation. Each diamond represents a person. The vertical line up is the queue; at the bottom are 5 slots where the people are attended to. The size of each diamond is proportional to the log of the time it will take them to be attended. Color is used to tell one person from another and doesn’t have any other meaning. Code for this simulation, written in R, is here. This is my second post about queueing simulation, you can find the first one, including an earlier version of the code, here. Thanks as always to commenters for their suggestions.

A few notes about the simulation:

  • Creating an animation to go along with your simulation can take a while to program (unless, perhaps, you are coding in Flash), and it may seem like an extra, unnecessary step. But you can often learn a lot just by “watching”, and animations can help you spot bugs in the code. I noticed that sometimes smaller diamonds hung around for much longer then I expected, which led me to track down a tricky little error in the code.
  • As usual, I’ve put all of the configuration options at the beginning of the code. Try experimenting with different numbers of intervals and tellers/slots, or change the mean service time.
  • If you want to run the code, you’ll need to have ImageMagick installed. If you are on a PC, make sure to include the full path to “convert”, since Windows has a built-in convert tool might take precedence. Also, note how the files that represent the individual animation cells are named. That’s so that they are added in the animation in the right order, naming them sequentially without zeros at the beginning failed.
  • I used Photoshop to interlace the animated GIF and resave. This reduced the file size by over 90%
  • The code is still a work in progress, it needs cleanup and I still have some questions I want to “ask” of the simulation.

13
Oct 11

Waiting in line, waiting on R

I should state right away that I know almost nothing about queuing theory. That’s one of the reasons I wanted to do some queuing simulations. Another reason: when I’m waiting in line at the bank, I tend to do mental calculations for how long it should take me to get served. I look at the number of tellers attending, pick an average teller session length (say one or two minutes), then come up with an average wait per person in line. For example, if there are 4 tellers and the average person takes 2 minutes to do her transaction, then new tellers should become available every 30 seconds. If I’m the 6th person in line, I should expect to wait 3 minutes before I’m attended.

However, based on my experience (the much maligned anecdotal), it often takes much longer than expected. My suspicion is that over time the teller’s get “clogged up” with the slowest people, so that even though an average person might take only 2 minutes, the people you actually see being attended right now are much more likely to be those who take a long time.

To explore this possibility, I set up a simulation in R (as usual, full source code provided at end of post). The first graph, at the beginning of this post, shows the length of queues over time for 4 runs of the simulator, all with the same configuration parameters. Often this graph was completely flat. Note though that when things get out of hand in terms of queue length, they can get way out of hand. To get a feel for how long you would have to wait, given that there is a line in front of you, I tracked how long the first person in line had to wait to be served. Given my configuration settings, this wait would be expected to last 5 intervals. It does seem to take longer than 5 intervals, though I want to tweak the model some and do more testing before I’m ready to quantify that wait.

There are, as with any models, things that make this one unrealistic. The biggest may be that people get in line with the same probability no matter how long the line is. What I need is some kind of tendency to abandon the line if it’s too long. That shouldn’t shorten the wait times for those already in line. I could make those worse. If you assume that slower people are prepared to wait longer in line, then the line is more likely to have slow people. Grandpa Jones is willing to spend an hour in line so he can chat for a while with the pretty young teller; but if the line is too long, that 50-year-old business guy will come back later to deposit his check. I would imagine that, from the bank’s perspective, this presents a tricky dilemma. The people whose time is worth the least are probably the most likely to be clogging up your tellers, upsetting the customers you care the most about (I know, I know, Bank of America cares about all of us equally, right?).

Code so far, note that run times can be very long for high intervals if the queue length gets long:

#### Code by Matt Asher. Published at StatisticsBlog.com ####
#### CONFIG ####
# Number of slots to fill
numbSlots = 5
 
# Total time to track
intervals = 1000
 
# Probability that a new person will show up during an interval
# Note, a maximum of one new person can show up during an interval
p = 0.1
 
# Average time each person takes at the teller, discretized exponential distribution assumed
# Times will be augmented by one, so that everyone takes at least 1 interval to serve
meanServiceTime = 25
 
#### INITIALIZATION ####
queueLengths = rep(0, intervals)
slots = rep(0, numbSlots)
waitTimes = c()
leavingTimes = c()
queue = list()
arrivalTimes = c()
frontOfLineWaits = c()
 
 
#### Libraries ####
# Use the proto library to treat people like objects in traditional oop
library("proto")
 
#### Functions ####
# R is missing a nice way to do ++, so we use this
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
 
# Main object, really a "proto" function
person <- proto(
	intervalArrived = 0,
	intervalAttended = NULL,
 
	# How much teller time will this person demand?
	intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1,
	intervalsWaited = 0,
	intervalsWaitedAtHeadOfQueue = 0,
)
 
#### Main loop ####
for(i in 1:intervals) {
	# Check if anyone is leaving the slots
	for(j in 1:numbSlots) {
		if(slots[j] == i) {
			# They are leaving the queue, slot to 0
			slots[j] = 0
			leavingTimes = c(leavingTimes, i)
		}
	}
 
	# See if a new person is to be added to the queue
	if(runif(1) < p) {
		newPerson = as.proto(person$as.list())
		newPerson$intervalArrived = i
		queue = c(queue, newPerson)
		arrivalTimes  = c(arrivalTimes, i)
	}
 
	# Can we place someone into a slot?
	for(j in 1:numbSlots) {
		# Is this slot free
		if(!slots[j]) {
			if(length(queue) > 0) {
				placedPerson = queue[[1]]
				slots[j] = i + placedPerson$intervalsNeeded
				waitTimes = c(waitTimes, placedPerson$intervalsWaited)
				# Only interested in these if person waited 1 or more intevals at front of line
				if(placedPerson$intervalsWaitedAtHeadOfQueue) {
					frontOfLineWaits = c(frontOfLineWaits, placedPerson$intervalsWaitedAtHeadOfQueue)
				}
 
				# Remove placed person from queue
				queue[[1]] = NULL
			}
		}
	}
 
	# Everyone left in the queue has now waited one more interval to be served
	if(length(queue)) {
		for(j in 1:length(queue)) {
			inc(queue[[j]]$intervalsWaited) # = queue[[j]]$intervalsWaited + 1
		}
 
		# The (possibley new) person at the front of the queue has had to wait there one more interval.
		inc(queue[[1]]$intervalsWaitedAtHeadOfQueue) # = queue[[1]]$intervalsWaitedAtHeadOfQueue + 1
	}
 
	# End of the interval, what is the state of things
	queueLengths[i] = length(queue);
}
 
#### Output ####
plot(queueLengths, type="o", col="blue", pch=20, main="Queue lengths over time", xlab="Interval", ylab="Queue length")
# plot(waitTimes, type="o", col="blue", pch=20, main="Wait times", xlab="Person", ylab="Wait time")