The incredible difficulty of making sense of medical data – Sackler Colloquium on Reproducibility Field Report 2

Here’s my second update from the Sackler Colloquium on Reproducibility in Research.

For me, the highlight of the first day was David Madigan, who is a statistician at Columbia.

David discussed the foibles of observational medical research.  Health care systems generate enormous sets of medical records which can be mined to try to identify patterns (e.g. are there more complications when patients are treated with drug X vs. drug Y?).  Although observational data like this cannot prove causation, it can certainly help generate ideas, and sometimes ethical or financial constraints mean that observational studies are the only ones feasible.  On the plus side, the databases can be huge, offering millions of records…a really rich data source.  Because of this, observational medical research has become very popular and very influential–lots of studies are published, and they often have a big effect on how medicine is practiced.

All this sounds great, right?  It’s complicated.  David showed in compelling fashion that making sense of this data is much trickier than it seems.  First, he showed that there are some huge problems in the field:

  • There is overwhelming publication bias, with ~85% of published studies showing significant effects.  This likely represents cherry picking of results.
  • Studies routinely find exactly opposite treatment effects, sometimes even from the exact same medical database!
  • It is nearly impossible to check the work of other researchers based solely on published papers.  Access to data is complicated, and the medical database continues to grow from the time of publication.  Most importantly, there are many, many difficult decisions to be made (how to deal with outliers, how to deal with missing values, whether adjustments for covariates should be linear, etc.).  Even what seems like a thorough methods section can’t explain all the decisions that have been made.  Thus, trying to recreate the same analyses ends in frustration–David claimed his group tried on 50 different papers and couldn’t get a precise re-calculation on any of them!

How could there be such enormous problems?  David identified two key issues:

  1.  Results are far more sensitive to design and analysis decisions than one might expect.  His team ran the same data through analyses using different decisions (dichotomize a predictor or keep it continuous, adjust for age or not, etc.).  They found that there is often enormous variation in the treatment effect observed based on what one would hope would be relatively inconsequential design and analysis decisions.  This is a big problem–there is no ground truth to benchmark against, and there is no algorithm for deciding which approach might be ‘right’.

    Madigan, D., Ryan, P. B., & Schuemie, M. (2013). Does design matter? Systematic evaluation of the impact of analytical choices on effect estimates in observational studies.
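To make this concrete, here’s a toy sketch (not from the talk–the counts are invented for illustration) of how a single analysis decision, adjusting for age or not, can completely change the estimated treatment effect.  The numbers are rigged so that drug X looks protective when everyone is pooled, but has no effect within either age stratum–classic confounding:

```python
# Invented counts: (events, non-events) for drugs X and Y, by age stratum.
# Within each stratum the odds ratio is exactly 1 (no effect), but drug X
# is given mostly to young (low-risk) patients, so pooling misleads.
strata = {
    "young": {"X": (10, 90), "Y": (2, 18)},
    "old":   {"X": (8, 12),  "Y": (40, 60)},
}

# Unadjusted ("crude") odds ratio: pool all patients together.
ax = sum(s["X"][0] for s in strata.values())   # X events
bx = sum(s["X"][1] for s in strata.values())   # X non-events
ay = sum(s["Y"][0] for s in strata.values())   # Y events
by = sum(s["Y"][1] for s in strata.values())   # Y non-events
crude_or = (ax * by) / (bx * ay)

# Age-adjusted odds ratio (Mantel-Haenszel pooled across strata).
num = den = 0.0
for s in strata.values():
    a, b = s["X"]
    c, d = s["Y"]
    n = a + b + c + d
    num += a * d / n
    den += b * c / n
mh_or = num / den

print(round(crude_or, 2), mh_or)   # 0.33 vs. 1.0
```

Same data, two defensible analyses: the crude analysis says drug X cuts the odds of the complication by two-thirds; the adjusted one says it does nothing.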

  2. Commonly used designs are much less certain than they might appear.  David and his group identified negative control treatments for a number of different outcomes.  That is, they selected an outcome (say diabetes) and then identified 50-60 different types of treatments normally given in hospitals that they could be nearly certain don’t themselves change the risk of the outcome (they did this in a number of clever ways).  Then they ran the negative controls through the analysis pipelines for observational research.  What they found is that the negative controls often do not follow the mathematical distribution predicted by a standard null hypothesis.  Instead, the distribution of true negatives is often much, much wider.  This means that a p value or a CI against a mathematical null hypothesis dramatically overstates the evidence for an effect.  David thus recommends calibrated CIs–CIs made by estimating sampling error from the distribution of known negative effects rather than a mathematical null (one can obtain calibrated p values as well).  These are often *much* wider than traditional CIs, but that’s the point–the techniques can’t perfectly control for all the entanglements in the data, so using an empirical null distribution gives you much better context for assessing putative patterns identified in the data.  REALLY cool approach.
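Here’s a minimal sketch of the empirical-null idea in Python (my own simplified rendering, not David’s actual procedure; the negative-control estimates and the effect of interest are made up).  The key move: instead of testing the observed estimate against N(0, SE), fit a distribution to the negative-control estimates and test against that:

```python
from statistics import NormalDist
import random

random.seed(1)

# Hypothetical negative-control estimates (log odds ratios).  In real use
# these would come from running ~50 known-null treatments through the
# same analysis pipeline as the drug of interest.
neg_controls = [random.gauss(0.1, 0.25) for _ in range(50)]

obs, se = 0.30, 0.10   # made-up estimate and standard error for the drug of interest

# Traditional two-sided p value: compare against a N(0, se) mathematical null.
p_traditional = 2 * (1 - NormalDist().cdf(abs(obs) / se))

# Calibrated p value: fit an empirical null to the negative controls, then
# widen it by the estimate's own sampling error before testing.
null = NormalDist.from_samples(neg_controls)
sigma = (null.stdev ** 2 + se ** 2) ** 0.5
p_calibrated = 2 * (1 - NormalDist().cdf(abs(obs - null.mean) / sigma))

print(p_traditional, p_calibrated)
```

With these made-up numbers the traditional p value is well under 0.01, while the calibrated one is unremarkable–exactly the overstatement-of-evidence problem David described.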

As if that wasn’t enough, David ended by showing how one could really put observational research through its paces–by taking a large set of potential research questions, running them through the analysis pipeline in lots of different ways, and then comparing the results to the distributions of both negative controls and spike-in positive controls, with CIs adjusted to obtain correct 95% capture of these controls.  *Then* you have a comprehensive context from which you can assess the strength of any given effect of research interest.  As he aptly put it: “Do all the studies.”
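The CI-adjustment step can be sketched in a few lines (again my own toy illustration with simulated controls, not David’s pipeline): check how often the nominal 95% intervals actually capture the known truth of the controls, and widen them until they do:

```python
import random

random.seed(2)

# Hypothetical controls run through the same pipeline: 60 negative
# controls (true effect 0) and 20 spike-in positives (known effect 0.5).
controls = []
for truth in [0.0] * 60 + [0.5] * 20:
    se = 0.10
    bias = random.gauss(0.05, 0.15)          # unmodelled systematic error
    est = truth + bias + random.gauss(0, se)  # what the pipeline reports
    controls.append((est, se, truth))

def coverage(k):
    """Fraction of controls whose interval est +/- k*se contains the truth."""
    hits = sum(abs(est - truth) <= k * se for est, se, truth in controls)
    return hits / len(controls)

# The nominal 95% interval (k = 1.96) under-covers because of the
# systematic error; widen k until 95% of the controls are captured.
k = 1.96
nominal = coverage(k)
while coverage(k) < 0.95:
    k += 0.01

print(nominal, k)
```

The calibrated multiplier ends up far above 1.96–the intervals that actually capture 95% of known-truth controls are much wider than the textbook ones, which is the whole point.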

Cool! I especially appreciated David’s emphasis on effect sizes and uncertainty, rather than dichotomous yes/no decisions about different interventions.  He’s clearly leading the way.  It’s incredible to think that what seems like a ‘gift’ of a massive medical records database is actually a minefield of potential false inferences.

I (Bob) hadn’t heard of David before attending the colloquium…but a quick trip to Google Scholar told me that I’m just uninformed–he has almost 15,000 citations!



I'm a teacher, researcher, and gadfly of neuroscience. My research interests are in the neural basis of learning and memory, the history of neuroscience, computational neuroscience, bibliometrics, and the philosophy of science. I teach courses in neuroscience, statistics, research methods, learning and memory, and happiness. In my spare time I'm usually tinkering with computers, writing programs, or playing ice hockey.
