June, 2010

Jun 10

Entropy augmentation the modulo way

Long before I had heard about the connection between entropy and probability theory, I knew about it from the physical sciences. This is most likely how you met it, too. You heard that entropy in the universe is always increasing, and, if you’re like me, that made very little sense. Then you may have heard that entropy is a measure of disorder: over times things fell apart. This makes a little more sense, especially to those teenagers tasked with cleaning their own rooms. Later on, perhaps you got a more precise, mathematical definition of entropy that still didn’t fully mesh with the world as we observe it. Here on earth, we see structures getting built up over time: plants convert raw energy to sunflowers, bees build honeycombs, humans build roads. Things do sometimes fall apart. More precisely, levels of complexity tend to grow incrementally over long periods of time, then collapse very quickly. This particular asymmetry seems to be an ironclad rule for our word, which I assume everyone understands, at least implicitly, though I can’t remember anywhere this rule is written down as such.

In the world of probability, entropy is a measure of unpredictability. Claude Shannon, who created the field of Information Theory, gave us an equation to measure how much, or little, is known about an incoming message a prori. If we know for sure exactly what the message will be, our entropy is 0. There is no uncertainty. If we don’t know anything about the outcome except that it will be one of a finite number of possibilities, we should assume uniform probability for any one of the outcomes. Uncertainty, and entropy, is maximized. The more you look into the intersection of entropy and statistics, the more you find surprising, yet somehow obvious in retrospect, connections. For example, among continuous distributions with fixed mean and standard deviation, the Normal distribution has maximal entropy. Surprised? Think about how quickly a sum of uniformly distributed random variables converges to the Normal distribution. Better yet, check it out for yourself:

n = 4
tally = rep(0,10000)
for(i in 1:n) {
	tally = tally + runif(10000)
hist(tally, breaks=50, col="blue")

Try increasing and decreasing “n” and see how quickly the bell curve begins to appear.

Lately I’ve been thinking about how to take any general distribution and increase the entropy. The method I like best involves chopping off the tails and “wrapping” these extreme values back around to the middle. Here’s the function I created:

smartMod <- function(x, mod) {
	sgn = sign(x)
	x = abs(x)
	x = x %% mod
	return(sgn * x)

Now is a perfect time to use a version of our “perfect sample” function:

perfect.sample <- function(dist, n, ...) {
	match.fun(paste('q', dist, sep=''))((1:n) / (n+1), ...)

The image at the top of this post shows the Chi Square distribution on 2 degrees of freedom, with Modulo 3 Entropy Enhancement (see how nice that sounds?). Here’s the code to replicate the image:

hist(smartMod(perfect.sample("chisq",10000,2),3),breaks=70,col="blue",main="Entropy enhanced Chi-Square distribution")

Here’s another plot, using the Normal distribution and Modulo 1.5:

One nice property of this method of increasing entropy is that you get a smooth transition with logical extremes: As your choice of Mod goes to infinity, the distribution remains unchanged. As your Mod number converges to 0, entropy (for that given width) is maximized. Here are three views of the Laplace, with Mods 5, 1.5, and 0.25, respectively. See how nicely it flattens out? (Note you will need the library “VGAM” to sample from the Laplace).

It’s not clear to me yet how entropy enhancement could be of practical use. But everyone loves enhancements, right? And who among us doesn’t long for a  little extra entropy for time to time, no?

Jun 10

Weekend art in R (Part 2)

I put together four of the best looking images generated by the code shown here:

# More aRt
iters = 500
for(i in 1:iters) {
	center = runif(2)
	size = 1/rbeta(2,1,3)
	# Let's create random HTML-style colors
	color = sample(c(0:9,"A","B","C","D","E","F"),12,replace=T)
	fill = paste("#", paste(color[1:6],collapse=""),sep="")
	brdr = paste("#", paste(color[7:12],collapse=""),sep="")
	points(center[1], center[2], col=fill, pch=20, cex=size)
	points(center[1], center[2], col=fill, pch=21, cex=size,lwd=runif(1,1,4))

Weekend art Part 1 is here.

Jun 10

A rare event in tennis


That was the final score in the final set, won by John Isner over Nicolas Mahut at Wimbledon. In the final set of a match the winner has to beat the other by at least 2 games, so as in baseball matches can (theoretically) go on indefinitely. In practice even final sets that go into “extra innings” tend to end quickly. The previous record for number of games in a final set was 1/3 as many. For some commentary on the odds related to this extraordinary match check out:

A mathematician watches tennis II
Terence Tao on Buzz

Wikipedia has a list of Longest tennis match records.

Jun 10

Data is dumb, try to be smart

I blurred out my personal information, but otherwise the check shown above is exactly as I received it. Yahoo! really did send me a check for ZERO DOLLARS 00 CENTS. Obviously, mailing out checks for nothing is wasteful and pointless (shall I try cashing it? Will the bank “credit” me for nothing?). It could be that this check isn’t even “legal”, if I’m to believe the Best Answer by Yahoo!’s crack team of expert volunteers. But instead of making this a “Ha! Ha! Big Company Does Something Stupid” post, I thought I’d dig a little deeper into the story behind the check.

Until recently, Yahoo! had an advertising program similar to Google’s Adsense. Webmasters could show ads from Yahoo!’s network and share in the profits when people clicked on the links. Although basically similar, Yahoo!’s version was much more restrictive: fewer websites were accepted into their network and webmasters could only display ads to visitors from the United States. I can understand why Yahoo! would want to limit payment to clicks coming from a single (huge and wealthy) country, but the way they implemented this requirement struck me as odd. It was the webmaster’s responsibility to make sure no one from outside the US saw their ads. This meant that I had to install Geo-IP software on my server, which maps the unique numerical identifiers of visitors to their country of origin. Installing this software is a pain, and you have to update it regularly with the latest free or paid versions of the database, or else the information becomes inaccurate. Yahoo!, with all its infrastructure and CPU power and economies of scale, was making thousands of individual webmasters perform a task best handled by a giant IT company, like, say, Yahoo!

The other strange, and occasionally frustrating, thing about Yahoo!’s program was the high volatility in payment amounts. On average, the earnings from their ad network were good, but I would see days when a website made $50 and others when it made $2.11, sometimes from the same number of visitors or even the same number of clicks. Variability is built in to programs like this, but I found Yahoo!’s results to be way more inconsistent that Google’s or even some of the much smaller companies with their own advertising networks. You might think that only the average matters, but high volatility leads to poorer decision making. Imagine the following two scenarios: in one, you see the revenue from a advertising program slowly and steadily decline over the course of months. Time to find a different program. In the other scenario, payouts vary widely from month to month. Some are fantastic, some very poor. How can you tell when to leave the program, and when to expand your use of it? There are ways to measure non-obvious trends with time series analysis. But no matter which tool you use, the more “noise” in your data stream, the worse your predictions will be.

High volatility in Yahoo!’s payouts was so prolonged that it couldn’t have been just a matter of growing pains. Either no-one at Yahoo! was looking into payment variability at the level of individual users (it’s possible that the total revenue generated by their program, summed across all users, was much more stable), or they didn’t see why this was a problem.

Finally, we have the issue of the worthless check. When Yahoo! closed their program, they had to send out final checks, even to those users with balances below the regular minimum threshold (most programs require $50 or so in earnings before they send you a check). One possibility is that Yahoo!’s algorithm never looked to see if the amount owed was positive before cutting a check. That would be really dumb. Another, slightly-more-favorable explanation is that they really owed me 0.5 cents, so their algorithm saw this as a positive balance. It looked like I should be getting a check, but then Yahoo!’s check-printing software rounded down to the nearest penny and I end up with $0.00. Either way, I see this as a final indicator that Yahoo! just isn’t thinking very hard about their data. By contrast, Google and other companies seem to understand that their business depends on managing data with a huge amount of human intelligence and understanding. Data is dumb, Yahoo! needs to be smarter.

Jun 10

Reaching escape velocity

Sample once from the Uniform(0,1) distribution. Call the resulting value x. Multiply this result by some constant c. Repeat the process, this time sampling from Uniform(0, x*c). What happens when the multiplier is 2? How big does the multiplier have to be to force divergence. Try it and see:

iters = 200
locations = rep(0,iters)
top = 1
multiplier = 2
for(i in 1:iters) {
	locations[i] = runif(1,0,top)
	top = locations[i] * multiplier
# Optional save as movie, not a good idea for more than a few hundred iterations. I warned you!
# library("animation")
# saveMovie(for (i in 1:iters) plot(locations[1:i],1:i,pch=20,col="blue",xlim=c(0,max(locations)),ylim=c(0,iters),xlab="Location",ylab="Iteration"),loop=1,interval=.1)