13
Oct 12

The unicorn problem

Let’s say your goal is to observe all known species in a particular biological category. Once a week you go out and collect specimens to identify, or maybe you just bring your binoculars to do some spotting. How long will it take you to cross off every species on your list?

I’ve been wondering this lately since I’ve begun to hang out on the Mushrooms of Québec flickr group. So far they have over 2200 species included in the photos. At least one of the contributors has written a book on the subject, which got me wondering how long it might take him to gather his own photos and field observations on every single known species.

My crude model (see R code at the end of this post) assumes that every time Yves goes out, he has a fixed chance of encountering every given species. In other words, if there were 1000 species to find, and he averages 50 species per hunt, then every species is assigned a 1/20 chance of being found per trip. Thus the total found on any trip would have a Poisson distribution with parameter 50.

This model is unrealistic for lots of reasons (can you spot all the bad assumptions?), but it does show one of the daunting problems with this task: the closer you get to the end, the harder it becomes to find the last few species. In particular, you can be stuck at 1 for a depressingly long time. Run the simulation with different options and look at the graphs you get. I’m calling this “The unicorn problem,” after Nic Cage’s impossible-to-rob car in the movie Gone in 60 Seconds.

Do you have a real-life unicorn of one sort or another?

```  species = 1000 findP = 1/20 trials = 100 triesToFindAll = rep(0, trials)       for(i in 1:trials) { triesNeeded = 0   leftToFind = rep(1, species) leftNow = species numberLeft = c()   while (leftNow > 0) {   found = sample( c(0,1), 1000, replace=TRUE, prob = c((1-findP),findP))   leftToFind = leftToFind - found   leftNow = length(leftToFind[leftToFind > 0])   numberLeft = c(numberLeft, leftNow)   triesNeeded = triesNeeded + 1   }   if(i == 1) { plot.ts(numberLeft, xlim=c(0, 2*triesNeeded), col="blue", ylab="Species left to find", xlab="Attempts") } else { lines(numberLeft, col="blue") }   triesToFindAll[i] = triesNeeded }```

6
Dec 11

My oh my

Noted without comment, visit Biostatistics Ryan Gosling !!! for more gems like the one above.

20
Oct 11

Queueing up in R, continued

Shown above is a queueing simulation. Each diamond represents a person. The vertical line up is the queue; at the bottom are 5 slots where the people are attended to. The size of each diamond is proportional to the log of the time it will take them to be attended. Color is used to tell one person from another and doesn’t have any other meaning. Code for this simulation, written in R, is here. This is my second post about queueing simulation, you can find the first one, including an earlier version of the code, here. Thanks as always to commenters for their suggestions.

A few notes about the simulation:

• Creating an animation to go along with your simulation can take a while to program (unless, perhaps, you are coding in Flash), and it may seem like an extra, unnecessary step. But you can often learn a lot just by “watching”, and animations can help you spot bugs in the code. I noticed that sometimes smaller diamonds hung around for much longer then I expected, which led me to track down a tricky little error in the code.
• As usual, I’ve put all of the configuration options at the beginning of the code. Try experimenting with different numbers of intervals and tellers/slots, or change the mean service time.
• If you want to run the code, you’ll need to have ImageMagick installed. If you are on a PC, make sure to include the full path to “convert”, since Windows has a built-in convert tool might take precedence. Also, note how the files that represent the individual animation cells are named. That’s so that they are added in the animation in the right order, naming them sequentially without zeros at the beginning failed.
• I used Photoshop to interlace the animated GIF and resave. This reduced the file size by over 90%
• The code is still a work in progress, it needs cleanup and I still have some questions I want to “ask” of the simulation.

13
Oct 11

Waiting in line, waiting on R

I should state right away that I know almost nothing about queuing theory. That’s one of the reasons I wanted to do some queuing simulations. Another reason: when I’m waiting in line at the bank, I tend to do mental calculations for how long it should take me to get served. I look at the number of tellers attending, pick an average teller session length (say one or two minutes), then come up with an average wait per person in line. For example, if there are 4 tellers and the average person takes 2 minutes to do her transaction, then new tellers should become available every 30 seconds. If I’m the 6th person in line, I should expect to wait 3 minutes before I’m attended.

However, based on my experience (the much maligned anecdotal), it often takes much longer than expected. My suspicion is that over time the teller’s get “clogged up” with the slowest people, so that even though an average person might take only 2 minutes, the people you actually see being attended right now are much more likely to be those who take a long time.

To explore this possibility, I set up a simulation in R (as usual, full source code provided at end of post). The first graph, at the beginning of this post, shows the length of queues over time for 4 runs of the simulator, all with the same configuration parameters. Often this graph was completely flat. Note though that when things get out of hand in terms of queue length, they can get way out of hand. To get a feel for how long you would have to wait, given that there is a line in front of you, I tracked how long the first person in line had to wait to be served. Given my configuration settings, this wait would be expected to last 5 intervals. It does seem to take longer than 5 intervals, though I want to tweak the model some and do more testing before I’m ready to quantify that wait.

There are, as with any models, things that make this one unrealistic. The biggest may be that people get in line with the same probability no matter how long the line is. What I need is some kind of tendency to abandon the line if it’s too long. That shouldn’t shorten the wait times for those already in line. I could make those worse. If you assume that slower people are prepared to wait longer in line, then the line is more likely to have slow people. Grandpa Jones is willing to spend an hour in line so he can chat for a while with the pretty young teller; but if the line is too long, that 50-year-old business guy will come back later to deposit his check. I would imagine that, from the bank’s perspective, this presents a tricky dilemma. The people whose time is worth the least are probably the most likely to be clogging up your tellers, upsetting the customers you care the most about (I know, I know, Bank of America cares about all of us equally, right?).

Code so far, note that run times can be very long for high intervals if the queue length gets long:

```#### Code by Matt Asher. Published at StatisticsBlog.com #### #### CONFIG #### # Number of slots to fill numbSlots = 5   # Total time to track intervals = 1000   # Probability that a new person will show up during an interval # Note, a maximum of one new person can show up during an interval p = 0.1   # Average time each person takes at the teller, discretized exponential distribution assumed # Times will be augmented by one, so that everyone takes at least 1 interval to serve meanServiceTime = 25   #### INITIALIZATION #### queueLengths = rep(0, intervals) slots = rep(0, numbSlots) waitTimes = c() leavingTimes = c() queue = list() arrivalTimes = c() frontOfLineWaits = c()     #### Libraries #### # Use the proto library to treat people like objects in traditional oop library("proto")   #### Functions #### # R is missing a nice way to do ++, so we use this inc <- function(x) { eval.parent(substitute(x <- x + 1)) }   # Main object, really a "proto" function person <- proto( intervalArrived = 0, intervalAttended = NULL,   # How much teller time will this person demand? intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1, intervalsWaited = 0, intervalsWaitedAtHeadOfQueue = 0, )   #### Main loop #### for(i in 1:intervals) { # Check if anyone is leaving the slots for(j in 1:numbSlots) { if(slots[j] == i) { # They are leaving the queue, slot to 0 slots[j] = 0 leavingTimes = c(leavingTimes, i) } }   # See if a new person is to be added to the queue if(runif(1) < p) { newPerson = as.proto(person\$as.list()) newPerson\$intervalArrived = i queue = c(queue, newPerson) arrivalTimes = c(arrivalTimes, i) }   # Can we place someone into a slot? for(j in 1:numbSlots) { # Is this slot free if(!slots[j]) { if(length(queue) > 0) { placedPerson = queue[[1]] slots[j] = i + placedPerson\$intervalsNeeded waitTimes = c(waitTimes, placedPerson\$intervalsWaited) # Only interested in these if person waited 1 or more intevals at front of line if(placedPerson\$intervalsWaitedAtHeadOfQueue) { frontOfLineWaits = c(frontOfLineWaits, placedPerson\$intervalsWaitedAtHeadOfQueue) }   # Remove placed person from queue queue[[1]] = NULL } } }   # Everyone left in the queue has now waited one more interval to be served if(length(queue)) { for(j in 1:length(queue)) { inc(queue[[j]]\$intervalsWaited) # = queue[[j]]\$intervalsWaited + 1 }   # The (possibley new) person at the front of the queue has had to wait there one more interval. inc(queue[[1]]\$intervalsWaitedAtHeadOfQueue) # = queue[[1]]\$intervalsWaitedAtHeadOfQueue + 1 }   # End of the interval, what is the state of things queueLengths[i] = length(queue); }   #### Output #### plot(queueLengths, type="o", col="blue", pch=20, main="Queue lengths over time", xlab="Interval", ylab="Queue length") # plot(waitTimes, type="o", col="blue", pch=20, main="Wait times", xlab="Person", ylab="Wait time")```

12
Jan 11

R: Attack of the hair-trigger bees?

In their book “Complex Adaptive Systems”, authors Miller and Page create a theoretic model for bee attacks, based on the real, flying, honey-making, photogenic stingers. Suppose the hive is threatened by some external creature. Some initial group of guard bees sense the danger and fly off to attack. As they go, they lay down a scent. Other bees join in the attack if their scent sensitivity threshold (SST) is reached. When they join the attack, they send out their own warning scent, perhaps triggering an even bigger attack. The authors make the point that if the colony of bees were homogeneous, and every single one had the same attack threshold, then if that threshold was above the initial attack number, then no one else would join in. If it were below, then everyone goes all at once. Fortunately for the bees, they are a motley lot, which is to say a lot more diverse than you would imagine just from looking at the things. As a result, they exhibit much more complicated behavior. The authors describe a model with 100 bees and call their attack threshold “R”. By assuming a heterogeneous population of 100 with thresholds all equal spaced, they note:

“In the hetrogeneous case, a full-scall attack ensues for $R \geq 1$. This latter result is easy to see, because once at least one bee attacks, then the bee with threshold equal to one will join the fray, and this will trigger the bee with the next highest threshold to join in, and so on…. It is relatively difficult to get the homogeneous hive to react, while the hetrogeneous one is on a hair trigger. Without explicity incorporating the diversity of thresholds, it is difficult to make any kind of accurate prediction of how a given hive will behave.”

I think this last sentence is their way of noting that the exact distribution of sensitivities makes a huge difference in how the bees behave, which indeed it does. I decided to put the bees to the test, so I coded a simulation in the language R (code at the end of this post). I gave 100 virtual apis mellifera a random sensitivity level, chosen from a Uniform(1,100) distribution, then assumed 10 guards decided to attack. How would the others respond? Would a hair-trigger chain reaction occur? The chart at the top shows the results from 1000 trials. Looks pretty chaotic, so here’s a histogram of the results:

As you can see, most of the time the chain reaction dies out quickly, with no more than 20 new bees joining in the attack. However, occasionally the bees do go nuts, sending the full on attack. This happened about 1 percent of the time. Perhaps most interestingly, all of the other attack levels were clearly possible as well, in the flat zone from about 30 to 95. You might want to try playing with the distribution of the sensitivities and see if you get any other interesting distributions. Meanwhile, if you decide to raid a hive, make sure to dip yourself in mud first, that way the bees will think you are nothing but an innocent black rain cloud.

Code in R:

```trials = 1000   go = rep(0,trials) initial = 10   for(i in 1:trials) { bees = sort(runif(100,1,100))   # Everyone who's threshold is less than the inital amount goes. going = length(bees[bees<initial])   # See if this number now goes up as new bees join in if(going > initial) { more = length(bees[bees<going]) while (more > going) { # Keep doing this until it stops going = more more = length(bees[bees<going]) } going = more } go[i] = going }   par(mar=c(4,4,0,0)) plot.ts(go,lwd=1.5,col="blue", bty="n", xlab="Trial", ylab="Number of bees who join attack")   windows() hist(go, breaks=50, col="blue", xlab="Number of bees to join attack", main="Frequency of different attack sizes")```