Posts Tagged: r


14
Dec 12

Let it snow!

A couple days ago I noticed a fun piece of R code by Allan Roberts, which lets you create a digital snowflake by cutting out virtual triangles. Go give it a try. Roberts inspired me to create a whole night sky of snowflakes. I tried to make the snowfall look as organic as possible. There are lots of options to adjust. Here’s the code, have fun and Happy Holidays!

# Code by Matt Asher for statisticsblog.com
# Feel free to modify and redistribute 
 
# How many flakes do you want to fall?
flakes = 100
 
# Width and height of your space
width = 800
height = 600
 
# Initial wind
wind = 0
 
# Setup the background of the plot and margins
par(bg = "black")
par(oma=c(0,0,0,0))
par(mar=c(0,0,0,0))
plot(0, 0, col="black", pch=".", xlim=c(0,width), ylim=c(0,height), axes=FALSE)
 
for(i in 1:flakes) {
    startY = height
    startX = runif(1,1,width)
 
    xPos = startX
    yPos = startY
 
    for(j in 1:height) {
 
        # Optional drift in wind (tracked here, but not applied to xPos below)
        wind = wind + rcauchy(1,0,.05)
 
        # Update snowflake position
        xPos = xPos + rnorm(1,.1,1.5)
        yPos = yPos - runif(1,4,20)
 
        # Are we in the space, if so display it
        if(xPos > 0 && xPos <= width && yPos > 0 && yPos <= height) {
            points(round(xPos), round(yPos), col="white", pch=8)
 
            # System delay, slows down the flakes
            Sys.sleep(0.1)
        }
    }
}

13
Oct 12

The unicorn problem

Let’s say your goal is to observe all known species in a particular biological category. Once a week you go out and collect specimens to identify, or maybe you just bring your binoculars to do some spotting. How long will it take you to cross off every species on your list?

I’ve been wondering this lately since I’ve begun to hang out on the Mushrooms of Québec flickr group. So far they have over 2200 species included in the photos. At least one of the contributors has written a book on the subject, which got me wondering how long it might take him to gather his own photos and field observations on every single known species.

My crude model (see R code at the end of this post) assumes that every time Yves goes out, he has the same fixed chance of encountering each species. In other words, if there were 1000 species to find, and he averages 50 species per hunt, then every species is assigned a 1/20 chance of being found per trip. Thus the total found on any trip would have a Binomial(1000, 1/20) distribution, which is approximately Poisson with parameter 50.
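
To make the per-trip arithmetic concrete, here is a minimal sketch that is separate from the simulation at the end of the post. The names `nTrips` and `tripCounts` and the seed are mine. It draws the number of species found on many independent trips and compares the mean and variance with what the model predicts.

```r
set.seed(1)            # arbitrary seed, only for reproducibility
species = 1000
findP = 1/20
nTrips = 10000         # hypothetical number of simulated trips

# Number of species encountered on each simulated trip
tripCounts = rbinom(nTrips, species, findP)

mean(tripCounts)       # should be close to species * findP = 50
var(tripCounts)        # binomial variance is 50 * (1 - 1/20) = 47.5, a bit under the Poisson value of 50
```

The small gap between 47.5 and 50 is the only visible difference between the binomial count and its Poisson approximation at these settings.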

This model is unrealistic for lots of reasons (can you spot all the bad assumptions?), but it does show one of the daunting problems with this task: the closer you get to the end, the harder it becomes to find the last few species. In particular, you can be stuck at 1 for a depressingly long time. Run the simulation with different options and look at the graphs you get. I’m calling this “The unicorn problem,” after Nic Cage’s impossible-to-rob car in the movie Gone in 60 Seconds.
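
Under the same crude model you can also compute the expected number of trips without simulating. This is a sketch under the model's assumptions, with variable names of my own choosing. Each species is first found after a geometric number of trips, so the chance that all of them have been found within t trips is (1 - (1 - findP)^t)^species, and the expected total is the sum of the complementary probabilities.

```r
species = 1000
findP = 1/20

# P(all species found within t trips) = (1 - (1 - findP)^t)^species
t = 0:2000   # truncation point; the tail beyond this is negligible at these settings
expectedTrips = sum(1 - (1 - (1 - findP)^t)^species)

# Once a single species remains, the wait is geometric with success chance findP
expectedLastWait = 1 / findP

expectedTrips
expectedLastWait
```

Comparing the two numbers shows how much of the total effort goes into the very last species on the list.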

Do you have a real-life unicorn of one sort or another?

 
species = 1000
findP = 1/20
trials = 100
triesToFindAll = rep(0, trials)
 
 
 
for(i in 1:trials) {
	triesNeeded = 0
 
	leftToFind = rep(1, species)
	leftNow = species
	numberLeft = c()
 
	while (leftNow > 0) {
 
		found = sample( c(0,1), species, replace=TRUE, prob = c((1-findP),findP))
 
		leftToFind = leftToFind - found
 
		leftNow = length(leftToFind[leftToFind > 0])
 
		numberLeft = c(numberLeft, leftNow)
 
		triesNeeded = triesNeeded + 1
 
	}
 
	if(i == 1) {
		plot.ts(numberLeft, xlim=c(0, 2*triesNeeded), col="blue", ylab="Species left to find", xlab="Attempts")
	} else {
		lines(numberLeft, col="blue")
	}
 
	triesToFindAll[i] = triesNeeded
}

6
Dec 11

My oh my

Noted without comment. Visit Biostatistics Ryan Gosling !!! for more gems like the one above.


20
Oct 11

Queueing up in R, continued

Shown above is a queueing simulation. Each diamond represents a person. The vertical line up is the queue; at the bottom are 5 slots where the people are attended to. The size of each diamond is proportional to the log of the time it will take them to be attended. Color is used to tell one person from another and doesn’t have any other meaning. Code for this simulation, written in R, is here. This is my second post about queueing simulation; you can find the first one, including an earlier version of the code, here. Thanks as always to commenters for their suggestions.

A few notes about the simulation:

  • Creating an animation to go along with your simulation can take a while to program (unless, perhaps, you are coding in Flash), and it may seem like an extra, unnecessary step. But you can often learn a lot just by “watching”, and animations can help you spot bugs in the code. I noticed that sometimes smaller diamonds hung around for much longer than I expected, which led me to track down a tricky little error in the code.
  • As usual, I’ve put all of the configuration options at the beginning of the code. Try experimenting with different numbers of intervals and tellers/slots, or change the mean service time.
  • If you want to run the code, you’ll need to have ImageMagick installed. If you are on a PC, make sure to include the full path to “convert”, since Windows has a built-in convert tool that might take precedence. Also, note how the files that represent the individual animation cells are named. That’s so that they are added to the animation in the right order; naming them sequentially without zeros at the beginning failed.
  • I used Photoshop to interlace the animated GIF and resave. This reduced the file size by over 90%.
  • The code is still a work in progress. It needs cleanup, and I still have some questions I want to “ask” of the simulation.
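
The point about file names is easy to reproduce. The sketch below uses hypothetical frame names, not the naming scheme from the actual script, and sorts them the way a plain character-by-character comparison would.

```r
# Hypothetical frame names, for illustration only
frames = 1:12
unpadded = paste0("frame", frames, ".png")
padded = sprintf("frame%03d.png", frames)   # zero-padded to three digits

# method = "radix" sorts in the C locale, i.e. plain character order
sort(unpadded, method = "radix")[1:4]   # frame1, frame10, frame11, frame12
sort(padded, method = "radix")[1:4]     # frame001, frame002, frame003, frame004
```

With the padding, name order and numeric order agree, so any tool that picks up the files by name sees the frames in sequence.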

13
Oct 11

Waiting in line, waiting on R

I should state right away that I know almost nothing about queuing theory. That’s one of the reasons I wanted to do some queuing simulations. Another reason: when I’m waiting in line at the bank, I tend to do mental calculations for how long it should take me to get served. I look at the number of tellers attending, pick an average teller session length (say one or two minutes), then come up with an average wait per person in line. For example, if there are 4 tellers and the average person takes 2 minutes to do her transaction, then new tellers should become available every 30 seconds. If I’m the 6th person in line, I should expect to wait 3 minutes before I’m attended.
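
That mental calculation fits in a one-line function. `naiveWait` is a hypothetical helper written for this illustration; it assumes every teller is busy and that one frees up, on average, every (mean service time) / (number of tellers) minutes.

```r
# Hypothetical helper: back-of-the-envelope wait estimate
naiveWait = function(positionInLine, tellers, meanServiceMinutes) {
    # With all tellers busy, one frees up on average every meanServiceMinutes / tellers
    positionInLine * meanServiceMinutes / tellers
}

naiveWait(6, tellers = 4, meanServiceMinutes = 2)   # 3 minutes, as in the example above
```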

However, based on my experience (the much maligned anecdotal), it often takes much longer than expected. My suspicion is that over time the tellers get “clogged up” with the slowest people, so that even though an average person might take only 2 minutes, the people you actually see being attended right now are much more likely to be those who take a long time.
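
One rough way to see this effect, separate from the full simulation, is length-biased sampling: if you glance at a teller at a random moment, the customer you see is picked with probability proportional to how long they spend there. The sketch below assumes exponential service times with a 2-minute mean and tellers that are always busy; the variable names are mine.

```r
set.seed(2)                                  # arbitrary seed
meanService = 2                              # minutes, as in the bank example
serviceTimes = rexp(100000, 1/meanService)

# Customers observed at a random moment, weighted by time spent at the teller
observed = sample(serviceTimes, 10000, replace = TRUE, prob = serviceTimes)

mean(serviceTimes)   # close to 2
mean(observed)       # noticeably larger; for an exponential, about twice the mean
```

So even with a 2-minute average, the people currently at the counter would, under these assumptions, average closer to 4 minutes of total service time.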

To explore this possibility, I set up a simulation in R (as usual, full source code provided at end of post). The first graph, at the beginning of this post, shows the length of queues over time for 4 runs of the simulator, all with the same configuration parameters. Often this graph was completely flat. Note though that when things get out of hand in terms of queue length, they can get way out of hand. To get a feel for how long you would have to wait, given that there is a line in front of you, I tracked how long the first person in line had to wait to be served. Given my configuration settings, this wait would be expected to last 5 intervals. It does seem to take longer than 5 intervals, though I want to tweak the model some and do more testing before I’m ready to quantify that wait.
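
For reference, here is where the figure of 5 intervals comes from, as a sketch using the configuration values from the code in this post. It also checks the mean of the discretized service times, which comes out slightly above 25 because of the floor-plus-one step. The names `serviceTimes` and `theoretical` are mine.

```r
set.seed(3)                          # arbitrary seed
numbSlots = 5
meanServiceTime = 25

# Naive expectation: with every slot busy, one frees up about every
# meanServiceTime / numbSlots intervals
meanServiceTime / numbSlots          # 5

# Service times drawn the same way as in the simulation
serviceTimes = floor(rexp(100000, 1/meanServiceTime)) + 1
mean(serviceTimes)                   # a little above 25

# Exact mean of floor(Exponential) + 1 for comparison
theoretical = 1 / (exp(1/meanServiceTime) - 1) + 1
theoretical
```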

There are, as with any model, things that make this one unrealistic. The biggest may be that people get in line with the same probability no matter how long the line is. What I need is some kind of tendency to abandon the line if it’s too long. That shouldn’t shorten the wait times for those already in line. It could make those worse. If you assume that slower people are prepared to wait longer in line, then the line is more likely to have slow people. Grandpa Jones is willing to spend an hour in line so he can chat for a while with the pretty young teller; but if the line is too long, that 50-year-old business guy will come back later to deposit his check. I would imagine that, from the bank’s perspective, this presents a tricky dilemma. The people whose time is worth the least are probably the most likely to be clogging up your tellers, upsetting the customers you care the most about (I know, I know, Bank of America cares about all of us equally, right?).
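
One possible way to add that tendency, offered only as a sketch and not something the current code does, is to let the chance of joining shrink as the line grows. The function `joinProb`, the `patience` parameter, and the exponential form are all assumptions of mine.

```r
p = 0.1            # base arrival probability, same value as in the config
patience = 10      # hypothetical: queue length at which joining drops by a factor of e

# Probability that a new arrival actually joins, given the current queue length
joinProb = function(queueLength) {
    p * exp(-queueLength / patience)
}

joinProb(0)    # equals p when the line is empty
joinProb(20)   # much smaller when the line is long
```

In the main loop, the arrival test `runif(1) < p` would then become `runif(1) < joinProb(length(queue))`. Making `patience` depend on the person's own service time would be one way to model the Grandpa Jones effect.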

Code so far. Note that run times can be very long for a large number of intervals if the queue gets long:

#### Code by Matt Asher. Published at StatisticsBlog.com ####
#### CONFIG ####
# Number of slots to fill
numbSlots = 5
 
# Total time to track
intervals = 1000
 
# Probability that a new person will show up during an interval
# Note, a maximum of one new person can show up during an interval
p = 0.1
 
# Average time each person takes at the teller, discretized exponential distribution assumed
# Times will be augmented by one, so that everyone takes at least 1 interval to serve
meanServiceTime = 25
 
#### INITIALIZATION ####
queueLengths = rep(0, intervals)
slots = rep(0, numbSlots)
waitTimes = c()
leavingTimes = c()
queue = list()
arrivalTimes = c()
frontOfLineWaits = c()
 
 
#### Libraries ####
# Use the proto library to treat people like objects in traditional oop
library("proto")
 
#### Functions ####
# R is missing a nice way to do ++, so we use this
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
 
# Main object, really a "proto" function
person <- proto(
	intervalArrived = 0,
	intervalAttended = NULL,
 
	# How much teller time will this person demand?
	# Note: this expression is evaluated once, when the prototype is created
	intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1,
	intervalsWaited = 0,
	intervalsWaitedAtHeadOfQueue = 0
)
 
#### Main loop ####
for(i in 1:intervals) {
	# Check if anyone is leaving the slots
	for(j in 1:numbSlots) {
		if(slots[j] == i) {
			# They are leaving the queue, slot to 0
			slots[j] = 0
			leavingTimes = c(leavingTimes, i)
		}
	}
 
	# See if a new person is to be added to the queue
	if(runif(1) < p) {
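		newPerson = as.proto(person$as.list())
		newPerson$intervalArrived = i
		# Draw a fresh service time; the prototype's value was sampled only once
		newPerson$intervalsNeeded = floor(rexp(1, 1/meanServiceTime)) + 1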
		queue = c(queue, newPerson)
		arrivalTimes  = c(arrivalTimes, i)
	}
 
	# Can we place someone into a slot?
	for(j in 1:numbSlots) {
		# Is this slot free
		if(!slots[j]) {
			if(length(queue) > 0) {
				placedPerson = queue[[1]]
				slots[j] = i + placedPerson$intervalsNeeded
				waitTimes = c(waitTimes, placedPerson$intervalsWaited)
				# Only interested in these if person waited 1 or more intervals at front of line
				if(placedPerson$intervalsWaitedAtHeadOfQueue) {
					frontOfLineWaits = c(frontOfLineWaits, placedPerson$intervalsWaitedAtHeadOfQueue)
				}
 
				# Remove placed person from queue
				queue[[1]] = NULL
			}
		}
	}
 
	# Everyone left in the queue has now waited one more interval to be served
	if(length(queue)) {
		for(j in 1:length(queue)) {
			inc(queue[[j]]$intervalsWaited) # = queue[[j]]$intervalsWaited + 1
		}
 
		# The (possibly new) person at the front of the queue has had to wait there one more interval.
		inc(queue[[1]]$intervalsWaitedAtHeadOfQueue) # = queue[[1]]$intervalsWaitedAtHeadOfQueue + 1
	}
 
	# End of the interval, what is the state of things
	queueLengths[i] = length(queue)
}
 
#### Output ####
plot(queueLengths, type="o", col="blue", pch=20, main="Queue lengths over time", xlab="Interval", ylab="Queue length")
# plot(waitTimes, type="o", col="blue", pch=20, main="Wait times", xlab="Person", ylab="Wait time")