Posts Tagged: r


28
Feb 13

Statistical computation in JavaScript — am I nuts?

Over the past couple weeks, I’ve been considering alternatives to R. I’d heard Python was much faster, so I translated a piece of R code with several nested loops into Python (it ran an order of magnitude faster). To find out more about Mathematica 9, I had an extended conversation with some representatives from Wolfram Research (Mathematica can run R code, I’ll post a detailed review soon). And I’ve been experimenting with JavaScript and HTML5′s “canvas” feature.

JavaScript may seem like an unlikely competitor for R, and in may ways it is. It has no repository of statistical analysis packages, doesn’t support vectorization, and requires the additional layer of a web browser to run. This last drawback, though, could be it’s killer feature. Once a piece of code is written in JavaScript, it can be instantly shared with anyone in the world directly on a web page. No additional software needed to install, no images to upload separately. And unlike Adobe’s (very slowly dying) Flash, the output renders perfectly on your smartphone. R has dozens of packages and hundreds of options for charts, but the interactivity of these is highly limited. JavaScript has fewer charting libraries, but it does have some which produce nice output.

Nice output? What matters is the content; the rest is just window dressing, right? Not so fast. Visually pleasing, interactive information display is more than window dressing, and it’s more in demand than ever. As statisticians have stepped up their game, consumers of data analysis have come to expect more from their graphics. In my experience, users spend more time looking at graphs that are pleasing, and get more out of charts with (useful) interactive elements. Beyond that, there’s a whole world of simulations which only provide insight if they are visual and interactive.

Pretty legs, but can she type?
Alright, so there are some advantages to using JavaScript when it comes to creating and sharing output, but what about speed? The last time I used JavaScript for a computationally intensive project, I was frustrated by its slow speed and browser (usually IE!) lockups. I’d heard, though, that improvements had been made, that a new “V8″ engine made quick work of even the nastiest js code. Could it be true?

If there’s one thing I rely on R for, it’s creating random variables. To see if JavaScript could keep up on R’s home court, I ran the following code in R:

start = proc.time()[3]
x = rnorm(10^7,0,1)
end = proc.time()[3]
cat(start-end)

Time needed to create 10 million standard Normal variates in R? About half-a-second on my desktop computer. JavaScript has no native function to generate Normals, and while I know very little about how these are created in R, it seemed like cheating to use a simple inverse CDF method (I’ve heard bad things about these, especially when it comes to tails, can anyone confirm or deny?). After some googling, I found this function by Yu-Jie Lin for generating JS Normals via a “polar” method:

function normal_random(mean, variance) {
  if (mean == undefined)
    mean = 0.0;
  if (variance == undefined)
    variance = 1.0;
  var V1, V2, S;
  do {
    var U1 = Math.random();
    var U2 = Math.random();
    V1 = 2 * U1 - 1;
    V2 = 2 * U2 - 1;
    S = V1 * V1 + V2 * V2;
  } while (S > 1);
 
  X = Math.sqrt(-2 * Math.log(S) / S) * V1;
//  Y = Math.sqrt(-2 * Math.log(S) / S) * V2;
  X = mean + Math.sqrt(variance) * X;
//  Y = mean + Math.sqrt(variance) * Y ;
  return X;
  }

So how long did it take Yu-Jie’s function to run 10 million times and store the results into an array? In Chrome, it took about half-a-second, same as in R (in Firefox it took about 3 times as long). Got that? No speed difference between R and JS running in Chrome. For loops, JS seems blazing fast (compared to R). Take another look at the demo simulation I created. Each iteration of the code requires on the order of N-squared operations, and the entire display area is re-rendered from scratch. Try adding new balls using the “+” button and see if your browser keeps up.

It’s only a flesh wound!
So have I found the Holy Grail of computer languages for statistical computation? That’s much too strong a statement, especially given the crude state of JS libraries for even basic scientific needs like matrix operations. For now, R is safe. In the long term, though, I suspect the pressures to create easily shared, interactive interfaces, combined with improvements in speed, will push more people to JS/HTML5. Bridges like The Omega Project (has anyone used this?) might speed up the outflow, until people pour out of R and into JavaScript like blood from a butchered knight.


14
Dec 12

Let it snow!

A couple days ago I noticed a fun piece of R code by Allan Roberts, which lets you create a digital snowflake by cutting out virtual triangles. Go give it a try. Roberts inspired me to create a whole night sky of snowflakes. I tried to make the snowfall look as organic as possible. There are lots of options to adjust. Here’s the code, have fun and Happy Holidays!

# Code by Matt Asher for statisticsblog.com
# Feel free to modify and redistribute 
 
# How many flakes do you want to fall?
flakes = 100
 
# Width and height of your space
width = 800
height = 600
 
# Initial wind
wind = 0
 
# Setup the background of the plot and margins
par(bg = "black")
par(oma=c(0,0,0,0))
par(mar=c(0,0,0,0))
plot(0, 0, col="black", pch=".", xlim=c(0,width), ylim=c(0,height), axes=F)
 
for(i in 1:flakes) {
    startY = height
    startX = runif(1,1,width)
 
    xPos = startX
    yPos = startY
 
    for(j in 1:height) {
 
		# Optional drift in wind
		wind = wind + rcauchy(1,0,.05)
 
		# Update snowflake position
        xPos = xPos + rnorm(1,.1,1.5)
        yPos = yPos - runif(1,4,20)
 
        # Are we in the space, if so display it
        if(xPos > 0 && xPos <= width && yPos > 0 && yPos <= height) {
            points(round(xPos), round(yPos), col="white", pch=8)
 
            # System dely, slows down the flakes
            Sys.sleep(0.1)
        }
    }
}

13
Oct 12

The unicorn problem

Let’s say your goal is to observe all known species in a particular biological category. Once a week you go out and collect specimens to identify, or maybe you just bring your binoculars to do some spotting. How long will it take you to cross off every species on your list?

I’ve been wondering this lately since I’ve begun to hang out on the Mushrooms of Québec flickr group. So far they have over 2200 species included in the photos. At least one of the contributors has written a book on the subject, which got me wondering how long it might take him to gather his own photos and field observations on every single known species.

My crude model (see R code at the end of this post) assumes that every time Yves goes out, he has a fixed chance of encountering every given species. In other words, if there were 1000 species to find, and he averages 50 species per hunt, then every species is assigned a 1/20 chance of being found per trip. Thus the total found on any trip would have a Poisson distribution with parameter 50.

This model is unrealistic for lots of reasons (can you spot all the bad assumptions?), but it does show one of the daunting problems with this task: the closer you get to the end, the harder it becomes to find the last few species. In particular, you can be stuck at 1 for a depressingly long time. Run the simulation with different options and look at the graphs you get. I’m calling this “The unicorn problem,” after Nic Cage’s impossible-to-rob car in the movie Gone in 60 Seconds.

Do you have a real-life unicorn of one sort or another?

 
species = 1000
findP = 1/20
trials = 100
triesToFindAll = rep(0, trials)
 
 
 
for(i in 1:trials) {
	triesNeeded = 0
 
	leftToFind = rep(1, species)
	leftNow = species
	numberLeft = c()
 
	while (leftNow > 0) {
 
		found = sample( c(0,1), 1000, replace=TRUE, prob = c((1-findP),findP))
 
		leftToFind = leftToFind - found
 
		leftNow = length(leftToFind[leftToFind > 0])
 
		numberLeft = c(numberLeft, leftNow)
 
		triesNeeded = triesNeeded + 1
 
	}
 
	if(i == 1) {
		plot.ts(numberLeft, xlim=c(0, 2*triesNeeded), col="blue", ylab="Species left to find", xlab="Attempts")
	} else {
		lines(numberLeft, col="blue")
	}
 
	triesToFindAll[i] = triesNeeded
}

6
Dec 11

My oh my

Noted without comment, visit Biostatistics Ryan Gosling !!! for more gems like the one above.


20
Oct 11

Queueing up in R, continued

Shown above is a queueing simulation. Each diamond represents a person. The vertical line up is the queue; at the bottom are 5 slots where the people are attended to. The size of each diamond is proportional to the log of the time it will take them to be attended. Color is used to tell one person from another and doesn’t have any other meaning. Code for this simulation, written in R, is here. This is my second post about queueing simulation, you can find the first one, including an earlier version of the code, here. Thanks as always to commenters for their suggestions.

A few notes about the simulation:

  • Creating an animation to go along with your simulation can take a while to program (unless, perhaps, you are coding in Flash), and it may seem like an extra, unnecessary step. But you can often learn a lot just by “watching”, and animations can help you spot bugs in the code. I noticed that sometimes smaller diamonds hung around for much longer then I expected, which led me to track down a tricky little error in the code.
  • As usual, I’ve put all of the configuration options at the beginning of the code. Try experimenting with different numbers of intervals and tellers/slots, or change the mean service time.
  • If you want to run the code, you’ll need to have ImageMagick installed. If you are on a PC, make sure to include the full path to “convert”, since Windows has a built-in convert tool might take precedence. Also, note how the files that represent the individual animation cells are named. That’s so that they are added in the animation in the right order, naming them sequentially without zeros at the beginning failed.
  • I used Photoshop to interlace the animated GIF and resave. This reduced the file size by over 90%
  • The code is still a work in progress, it needs cleanup and I still have some questions I want to “ask” of the simulation.