“So let’s start with the fact that the study had only 100 people, which isn’t nearly enough to be able to make any determinations like this. That’s very small power. Secondly, it was already split into two groups, and the two groups by the way have absolutely zero scientific basis. There is no theory that says that if I want a girl or if I want a boy I’m going to be better able at determining whether my baby is in fact a girl or a boy.”

– Maria Konnikova, speaking on Mike Pesca’s podcast, The Gist.

Shown at top, above the quote by Konnikova, is a simulation of the study in question, under the assumption that the results were completely random (the null hypothesis). As usual, you’ll find my code in R at the bottom. The actual group of interest had just 48 women. Of those, 34 correctly guessed the sex of their gestating babies. The probability that you’d get such an extreme result by chance alone is represented by the light green tails. To be conservative, I’m making this a two-tailed test, and considering the areas of interest to be either that the women were very right, or very wrong.

The “power” Konnikova is referring to is the “power of the test.” Detecting small effects requires a large sample; larger effects can be detected with a much smaller one. In general, the larger your sample size, the more power you have. If you want to understand the relationship between power and effect size, I’d recommend this lovely video on the power of the test.
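To make that concrete, here is a minimal power-simulation sketch in R. The 70% guess rate below is my assumption for illustration (chosen to be near the study’s observed rate, not a figure from the study itself); it estimates how often a 48-woman study would reach significance at the usual 0.05 level if that were the true rate.

```r
# Hedged power sketch: assume (hypothetically) a true correct-guess rate of 0.70.
# How often would a study of n = 48 reject the null of pure chance at alpha = 0.05?
set.seed(1)
n <- 48
true_p <- 0.70   # assumed effect size, for illustration only
trials <- 10000
hits <- rbinom(trials, size = n, prob = true_p)
pvals <- sapply(hits, function(k) binom.test(k, n, p = 0.5)$p.value)
mean(pvals < 0.05)  # estimated power, roughly 0.75 under these assumptions
```

Rerunning this with a smaller n shows the power dropping off quickly, which is Konnikova’s general point; it just doesn’t apply to an effect this large.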

As it turns out, Konnikova’s claims notwithstanding, study authors Victor Shamas and Amanda Dawson had plenty of power to detect what turns out to be a very large effect. Adding together the two green areas in the tails, their study has a p-value of about 0.005. This is a full order of magnitude smaller than the generally used threshold for statistical significance. Their study found *strong evidence* that women can guess the sex of their babies-to-be.
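The simulated tail area can be cross-checked against an exact binomial test; a one-line sketch in base R:

```r
# Exact two-sided test: 34 correct guesses out of 48, against pure chance (p = 0.5)
binom.test(34, 48, p = 0.5)$p.value  # on the order of 0.005, matching the simulation
```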

Is this finding really as strong as it seems? Perhaps the authors made some mistake in how they set up the experiment, or in how they analyzed the results.

Since Konnikova apparently failed to do not only the statistical analysis, but also basic journalism, I decided to clean up on that front as well. I emailed Dr. Victor Shamas to ask how the study was performed. Taking his description at face value, it appears that this particular split of women into categories was baked into the study design; this wasn’t a case of “p-value hacking”, as Konnikova claimed later on in the podcast.

Konnikova misses the entire point of this split, which she says has “absolutely zero scientific basis.” The lack of an existing scientific framework to assimilate the results of the study is meaningless, since the point of the study was to provide evidence (or not) that our scientific understanding lags behind what women seem to intuitively know.

More broadly, *the existence of causal relationships does not depend in any way on our ability to understand or describe (model) them*, or on whether we happen to have an existing scientific framework to fit them in. I used to see this kind of insistence on having a known mechanism as a dumb argument made by smart people, but I’m coming to see it in a much darker light. The more I learn about the history of science, the more clear it becomes that the primary impediment to the advancement of science isn’t the existence of rubes, it’s the supposedly smart, putatively scientific people who are unwilling to consider evidence that contradicts their worldview, their authority, or their self-image. We see this pattern over and over, perhaps most tragically in the unwillingness of doctors to wash their hands until germ theory was developed, despite evidence that hand washing led to a massive reduction in patient mortality when assisting with births or performing operations.

Despite the strength of Shamas and Dawson’s findings, I wouldn’t view their study as conclusive evidence of the ability to “intuit” the sex of your baby. Perhaps their findings were a fluke, perhaps some hidden factor corrupted the results (did the women get ultrasounds on the sly?). Like any reasonable scientist, Shamas wants to do another study to replicate the findings, and told me that he has a specific follow-up in mind.

*Code in R:*

trials = 100000
results = rep(0, trials)

# Simulate the null hypothesis: 48 women each guessing at random
for(i in 1:trials) {
    results[i] = sum(sample(c(0,1), 48, replace=T))
}

# Count simulated studies at least as extreme as the real one
# (34 or more correct, or 34 or more wrong, i.e. 14 or fewer correct)
extremes = length(results[results <= 14]) + length(results[results >= 34])
extremes/trials

# Plot the distribution, highlighting the two tails
dat <- data.frame(x=results, above=((results <= 14) | (results >= 34)))
library(ggplot2)
qplot(x, data=dat, geom="histogram", fill=above, breaks=seq(1,48))

Tags: pregnancy, r, simulations

Hi Matt,

Thanks for the post. It made me think.

I’m genuinely curious about your comment re causal relationships and the need or otherwise for a plausible scientific explanation of them. No doubt it’s true that causal relationships exist independently of our ability to explain them, but I’m not sure how we can, practically, safely separate “causal and inexplicable” results from “artefact of chance” results if we don’t overlay the current state of our knowledge.

If, to take another example, we found that some drug appeared to be efficacious in some randomly controlled trial despite the existence of any plausible causal mechanism for its effect, should we be comfortable prescribing it to people? I’m not sure I’d be happy to take such a drug, but I acknowledge that others might feel differently.

Keen to hear (well, read) your thoughts.

Thanks again,

Tony

Hi Tony! I think at this point in the history of science we’ve already plucked almost all the low hanging fruit of causality. The basics of physics (like Newton’s laws) and the broad rules of economics (higher cost reduces demand) have long since been replaced by much more nuanced theories that yield more accurate results, but are much more probabilistic than deterministic. Quantum mechanics says we can only know the probability of particle decay, not the exact moment (or why it decayed at that exact moment). Economics is becoming econometrics, which in turn is becoming something like a financial version of weather forecasting.

The systems we now study are highly complex (the human body & nutrition), often self-reflexive. Our vector of attack is data and correlation. We learn that people who eat certain foods watch certain movies. Does one thing cause the other? To some extent it doesn’t matter. Usually, the most important thing is *prediction*, and often that’s the only thing we can hope to achieve in these systems.

To get back to your question about the drug we don’t understand, I think we need to realize that our understanding of any drug, especially the long term effects, happens because we accumulate lots of data over long periods of time. We can guess at a molecule’s impact based on its shape, but our knowledge is highly limited until we see many different dose sizes bouncing around in many test subjects. To the extent that we are someday able to examine drugs without live trials, I suspect that will be because we will have developed highly complex simulators that do a virtual job of bouncing around the new molecule along with representations of our nervous systems, bacteria, blood types, etc. Our takeaway from these simulations will be related to the likelihood of various outcomes, not any kind of succinct description of mechanism.

This reply gets me going on a much longer essay (or even book) I hope to write about our need to abandon our fixation on causality and succinct models, in favor of correlation and complex simulation.

Really solid article. The counter-arguments by Konnikova are far from rigorous and sound more emotional than scientific.

The one thing I would argue is this statement:

“Adding together the two green areas in the tails, their study has a p-value of about 0.005. This a full order of magnitude beyond the generally used threshold for statistical significance. Their study found strong evidence that women can guess the sex of their babies-to-be.”

In general, I’m a strict p-value purist, in the sense that it’s not wise to judge the influence of a predictor based upon the magnitude of the p-value. You need to set a significance level for your hypothesis test and then evaluate against it. If your significance level is 0.05 and you get p = 10^-5, you can’t say you reject the null with any more confidence than with p = 0.043.

In conclusion, p = 0.005 is no stronger than p = 0.049. Using the word “stronger” is slightly misleading.

But again, I agree with your assessment. There is existence of a relationship. Maybe it’s just this cohort and they caught a fluke. But you can’t argue that the science is wrong.

With such a small sample size, why not go the route of cross-validation or bootstrapping?
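For the curious, a bootstrap along these lines might look like the following R sketch. The data vector simply encodes the study’s 34-of-48 correct guesses; this is an illustration, not code or data from the study itself.

```r
# Bootstrap a confidence interval for the proportion of correct guesses
set.seed(1)
guesses <- c(rep(1, 34), rep(0, 14))  # 1 = correct guess, 0 = incorrect
boot_props <- replicate(10000, mean(sample(guesses, replace = TRUE)))
quantile(boot_props, c(0.025, 0.975))  # interval sits well above 0.5
```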

Hi Adam! Perhaps because I’m not a “p-value purist”, I find the idea that a smaller p-value doesn’t increase confidence intriguing, but hard to understand. Imagine you were hired by the NFL to test whether their Super Bowl coin is fair. You have time for 20 tosses, then you have to make a recommendation about whether they should replace the coin. Wouldn’t you feel more confident telling the league to replace their coin if the results landed well into the (pardon me) tails of the distribution? If replacing the coin is expensive and complicated, is your pre-chosen significance level still the only thing that matters, with the exact p-value irrelevant?
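To put rough numbers on that thought experiment (my own illustration, assuming 20 tosses and an exact binomial test):

```r
# Two-sided p-values for a 20-toss coin experiment, at different head counts
binom.test(14, 20, p = 0.5)$p.value  # ~0.115: not enough to condemn the coin at 0.05
binom.test(17, 20, p = 0.5)$p.value  # ~0.0026: well into the tails
```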

One of my arguments is that you need to specify your significance level before any analysis, and then give a binary accept/reject result. One problem with using the magnitude of the p-value to gauge significance is that people will tend to make comparisons using them. If the NFL conducted two experiments (one with the AFC coin and one with the NFC coin) and found a p-value of 0.03 for the AFC and 0.0001 for the NFC, would you say we should choose the NFC coin? Is the AFC coin, if the experiment were repeated, capable of producing a p-value of 0.0001?

The biggest danger is in comparing p-values. If you had a regression model with 10 variables and examined p-values, why not just turn it into a simple regression on the single variable with the lowest p-value?

Thanks, Matt, for a wonderful blog. We often see in science that people hold findings to a much higher standard when those findings do not fit their conceptual framework.

When Amanda and I presented our initial findings in a departmental colloquium here at the University of Arizona, a visiting neuropsychologist objected, arguing that our findings could have been due to chance. I found it pretty amusing given that he had made his career on the study of a phenomenon called blindsight, which is based on the observation of an occurrence 59% of the time (compared to chance, which is 50%). Our effect size was over 70%, and there was no question about statistical significance.

As you point out, there are ALWAYS methodological issues that can be raised with any study. We researchers try to do the best we can in designing our studies, but as far as I know, the perfect research study has yet to be designed. There is always more to know and questions to be answered. The most beautiful thing in science is the power of replicability. If you doubt what we did, you can run your own experiment. In this case, I have the original survey and would more than welcome someone to replicate our now 16-year-old findings.

Then there is the question of mechanism to be addressed. If in fact these women did have accurate intuitions concerning the sex of their unborn children, what was the basis of those intuitions? I don’t have a definitive answer to that question but am more than happy to offer my speculation, which is perfectly legitimate with respect to the scientific process. After all, this type of speculation is how new hypotheses are generated.

I love the use of the blogosphere for keeping journalists honest. Thank you!

The biological mechanism is only one item in the Bradford Hill Criteria (http://en.wikipedia.org/wiki/Bradford_Hill_criteria). I know there are plenty of treatments we don’t entirely understand the mechanics of but are strongly recommended. And I’m not even a doctor! Take tuberculosis treatment for example. Why does WHO recommend 6 months? Why don’t the drugs kill off everything in 6 weeks? Why not 10 months? We don’t know how those little bugs hide and elude the drugs, but we do know relapse rates climb quickly for less than 6 months treatment. We’ll figure out why someday, but we’re confident to act accordingly in the meantime.

@Victor Shamas, you can have 4 more datapoints for your hypothesis generation. My wife and I predicted the sex on all 4 of our children (2 boys, 2 girls) before even the first ultrasound. It’s actually quite easy. She vomits about 100x as frequently with a girl as with a boy.

Where can I find a fuller report on this study (eg the journal article or such)?

Great read!

Do you have a link to the actual study? The link seems to just be an old USA today article. And Google Scholar doesn’t turn much up:

http://scholar.google.com/scholar?q=author%3AAmanda-Dawson+author%3Avictor-shamas&btnG=&hl=en&as_sdt=0%2C33

Makes me a bit more suspicious that there’s something else to the study design.

Skepticism is always warranted. Here’s what Shamas told me about publishing:

“Our study was not published in a peer-review journal because I had intended to do a follow-up study that would serve to replicate and extend these findings. In the meantime, the birth center was sold and the director who had been amenable to our research left her position; the graduate student transferred to another program; and my own career interests took me in other directions.”

I have to be cynical and think that a few of the women had access to ultrasound data beforehand and used that as a guess. That said, I would love to see a follow up study.

It’s wonderful to observe that the distribution comes out approximately normal. Awaiting the follow-up study.