stats


15
Oct 15

Random samples in JS using R functions

For a JavaScript-based project I’m working on, I need to be able to sample from a variety of probability distributions. There are ways to call R from JavaScript, but they depend on the server running R. I can’t depend on that. I need a pure JS solution.

I found a handful of JS libraries that support sampling from distributions, but nothing that lets me use the R syntax I know and (mostly) love. Even more importantly, I would have to trust the quality of the sampling functions, or carefully read through each one and tweak as needed. So I decided to create my own JS library that:

  • Conforms to R function names and parameters – e.g. rnorm(50, 0, 1)
  • Uses the best entropy available to simulate randomness
  • Includes some non-standard distributions that I’ve been using (more on this below)

I’ve made this library public at Github and npm.

Not a JS developer? Just want to play with the library? I’ve setup a test page here.

Please keep in mind that this library is still in its infancy. I’d highly recommend you do your own testing on the output of any distribution you use. And of course let me know if you notice any issues.

In terms of additional distributions, these are marked “experimental” in the source code. They include the unreliable friend and its discrete cousin the FML, a frighteningly thick-tailed distribution I’ve been using to model processes that may never terminate.


28
Sep 15

Arbitrage anyone?

I’m looking to put together a small crew to take on a large arbitrage project. The (rough) model for this would be “Hong Kong Syndicate” which took on the horse betting market. To be involved you have to be willing to make a large commitment in terms of time or money (I plan to contribute both). I have a set of proposed guidelines for identifying potential arbitrage targets, send me an email for more info.

UPDATE (Oct 5): Lots of interest in this, hope to finalize a core team this week. Let me know right away if you’re interested. Also, this would not necessarily be targeted at horse racing.


11
Dec 14

Can pregnant women intuit the sex of their children?

“So let’s start with the fact that the study had only 100 people, which isn’t nearly enough to be able to make any determinations like this. That’s very small power. Secondly, it was already split into two groups, and the two groups by the way have absolutely zero scientific basis. There is no theory that says that if I want a girl or if I want a boy I’m going to be better able at determining whether my baby is in fact a girl or a boy.”

– Maria Konnikova, speaking on Mike Pesca’s podcast, The Gist.

Shown at top, above the quote by Konnikova, is a simulation of the study in question, under the assumption that the results were completely random (the null hypothesis). As usual, you’ll find my code in R at the bottom. The actual group of interest had just 48 women. Of those, 34 correctly guessed the sex of their gestating babies. The probability that you’d get such an extreme result by chance alone is represented by the light green tails. To be conservative, I’m making this a two-tailed test, and considering the areas of interest to be either that the women were very right, or very wrong.

The “power” Konnikova is referring to is the “power of the test.” Detecting small effects requires a large sample, detecting larger effects can be done with a much smaller sample. In general, the larger your sample size, the more power you have. If you want to understand the relationship between power and effect size, I’d recommend this lovely video on the power of the test.

As it turns out, Konnikova’s claims notwithstanding, study authors Victor Shamas and Amanda Dawson had plenty of power to detect what turns out to be a very large effect. Adding together the two green areas in the tails, their study has a p-value of about 0.005. This a full order of magnitude beyond the generally used threshold for statistical significance. Their study found strong evidence that women can guess the sex of their babies-to-be.

Is this finding really as strong as it seems? Perhaps the authors made some mistake in how they setup the experiment, or in how they analyzed the results.

Since apparently Konnikova failed not only to do statistical analysis, but also basic journalism, I decided to clean up on that front as well. I emailed Dr. Victor Shamas to ask how the study was performed. Taking his description at face value, it appears that the particular split of women into categories was based into the study design; this wasn’t a case of “p-value hacking”, as Konnikova claimed later on in the podcast.

Konnikova misses the entire point of this spit, which she says has “absolutely zero scientific basis.” The lack of an existing scientific framework to assimilate the results of the study is meaningless, since the point of the study was to provide evidence (or not) that that our scientific understanding lags behind what woman seem to intuitively know.

More broadly, the existence of causal relationships does not depend in any way on our ability to understand or describe (model) them, or on whether we happen to have an existing scientific framework to fit them in. I used to see this kind of insistence on having a known mechanism as a dumb argument made by smart people,  but I’m coming to see it in a much darker light. The more I learn about the history of science, the more clear it becomes that the primary impediment to the advancement of science isn’t the existence of rubes, it’s the supposedly smart, putatively scientific people who are unwilling to consider evidence that contradicts their worldview, their authority, or their self-image. We see this pattern over and over, perhaps most tragically in the unwillingness of doctors to wash their hands until germ theory was developed, despite evidence that hand washing led to a massive reduction in patient mortality when assisting with births or performing operations.

Despite the strength of Shamas and Dawson’s findings, I wouldn’t view their study as conclusive evidence of the ability to “intuit” the sex of your baby. Perhaps their findings were a fluke, perhaps some hidden factor corrupted the results (did the women get secret ultrasounds on the sly?). Like any reasonable scientist, Shamas wants to do another study to replicate the findings, and told me that has a specific follow-up in mind.

Code in R:

trials = 100000
results = rep(0,trials)
for(i in 1:trials) {
	results[i] = sum(sample(c(0,1),48,replace=T))
}

extremes = length(results[results<=14]) + length(results[results>=34]) 
extremes/trials

dat <- data.frame( x=results, above=((results <= 14) | (results >= 34)))
library(ggplot2)
qplot(x,data=dat,geom="histogram",fill=above,breaks=seq(1,48))

1
Sep 14

Labor day distribution fun

Pinned, entropy augmented, digitally normal distribution, of no particular work-related use and thus perfectly suitable for today. Code in R:

iters = 1000
sd = 2
precision = 20

results = rep(0,iters)

for(i in 1:iters) {
	x = floor(rnorm(20,5,sd) %% 10)
	results[i] = paste(c('.',x),sep="",collapse="")
}

results = as.numeric(results)

plot(density(results,bw=.01),col="blue",lwd=3,bty="n")

12
Jun 14

A new way to visualize content

Right now I’m working on a project that involves new ways to view units of content and the relationships between them. I’ve posted the comic I worked on, it has a number of stats references throughout. This is early alpha stages for the software, you may run into issues. To see the relationships, go to the puffball menu and make sure that “Show relationships” is clicked.


30
Jan 14

Probability Podcast

I’ve produced a pilot episode of a “Probability Podcast”. Please have a listen and let me know if you’d be interested in hearing more episodes. Thanks!

The different approaches of Fermat and Pascal
Pascal’s solution, which may have come first (we don’t have all of the letters between Pascal and Fermat, and the order of the letters we do have is the matter of some debate), is to start at a point where the score is even and the next point wins, then work backwards solving a series of recursive equations. To find the split at any score, you would first note that if, at a score of (x,x), the next point for either player results in a win, then the pot at (x,x) would be split evenly. The pot split for player A at (x-1,x) would be the chance of his winning the next game, times the pot amount due him at (x,x). Once you know the split in the case where player A (or B) lacks a point, you can then solve for the case where a player is down by two and so on.

Fermat took a combinatorial approach. Suppose that the winner is the first person to score N points, and that Player A has a points and Player B has b points when the game is stopped. Fermat first noted that the maximum number of games left to be played was 2N-a-b-1 (supposing both players brought their score up to N-1, and then a final game was played to determine the winner). Then Fermat calculated the number of distinct ways these 2N-a-b-1 might play out, and which ones resulted in a victory for player A or player B. Each of these combinations being equally likely, the pot should be split in proportion to the number of combinations favoring a player, divided by the total number of combinations.

To understand the two approaches to solving the problem of points I have created the diagram shown at right.

Suppose each number in parenthesis represents the score of players A and B, respectively. The current score, 3 to 2, is circled. The first person to score 4 points wins. All of the paths that could have led to the current score are shown above the point (3,2). If player A wins the next point then the game is over. If player B wins, either player can win the game by winning the next point. Squares represent games won by player A, the star means that player B would win. The dashed lines are paths that make up combinations in Fermat’s solution, even though these points would not be played out.

Pascal’s solution for the pot distribution at (3,2) would be to note that if the score were tied (3,3), then we would split the pot evenly. However, since we are at point (3,2), there is only a one-in-two chance that we will reach point (3,3), at which point there is a one-in-two chance that player A will win the game. Therefore the proportion of the pot that goes to player A is 1/2+1/2 (1/2)=3/4 whereas player B is due 1/2 (1/2)=1/4.

Fermat’s approach would be to note that there are a total of 4 paths that lead from point (3,2) to the level where a total of 7 points have been played:

(3,2)→(4,2)→(5,2)
(3,2)→(4,2)→(4,3)
(3,2)→(3,3)→(4,3)
(3,2)→(3,3)→(3,4)

Of these, 3 represent victories for player A and 1 is a victory for player B. Therefore player A should get 3/4 of the pot and player B gets 1/4 of the pot.

As you can see, both Pascal and Fermat’s solutions yield the same split. This is true for any starting point. Fermat’s approach is generally agreed to be superior, as the recursive equations of Pascal can become very complicated. By contrast, Fermat’s combinatorial method can be solved quickly using what we now call Pascal’s Triangle or its related equations. However, both approaches are important for the development of probability theory.


2
Dec 13

The week in stats (Dec. 2nd edition)


11
Nov 13

The week in stats (Nov. 11th edition)


21
Oct 13

The week in stats (Oct. 21st edition)

  • Spreadsheets are user friendly, but they can also be dangerous. Patrick Burns explains why you should avoid spreadsheets and work with R instead.
  • How’s your fantasy team doing? Revolution Analytics compiles a series of Fantasy Football modelling articles by Boris Chen of New York Times.
  • Rexer Analytics has been conducting regular polls of data miners and analytics professionals on their software choices since 2007. They presented their results at the 2013 Rexer Analytics Data Miner Survey at last month’s Predictive Analytics World conference in Boston.
  • Everyone understands the p-value, except for those who don’t. Here is an example that once again shows the p-value – that workhorse of modern science – continues to be misinterpreted in even the top tiers of the scientific literature.
  • Despite all the hype surrounding big data and analytics, Louis Columbus of Forbes argues that the majority of business analysts lack access to the data and tools they need. Columbus explains why and how this should be changed.
  • Six Decades of the Most Popular Names for Girls, State-by-State, represented all in one interactive map.

14
Oct 13

The week in stats (Oct. 14th edition)