Tuesday, April 10, 2012

Replication in functional MRI

Daniel Bor started a long discussion by blogging about flawed neuroimaging papers.

Nancy Kanwisher sent a post to the discussion, in which she said this:
I have occasionally asked respected colleagues what percent of published neuroimaging findings they think would replicate, and the answer is generally very depressing. My own guess is *way* less than 50%.
I realized that I had never asked myself that question in a direct way.  In the same discussion I phrased the question like this:
Let us say you took a random sample of papers using functional MRI over the last five years. For each study in the sample, you repeated the same experiment. What proportion of your repeat experiments would substantially replicate the main findings of the original paper?
I guessed that the answer to this question was 30%.  Nancy Kanwisher gave her guess first, so I can't claim that my guess was entirely independent.  I also started asking my colleagues, and got guesses between 20% and 50%.

One obvious criticism I got was that I had just made that figure up.  Of course that is true, and I didn't try to justify my guess.  Then someone emailed to ask why I guessed 30%.

I like public discussion rather than private, so here is my answer, in all its sad ignorance.

The evidence that I have comes from my own experience.  I have a moderate amount of experience of functional MRI (FMRI) analysis for my own studies and studies that I have collaborated on.  I have advised other people on FMRI analysis almost since I started functional imaging myself in 1996.  I have been teaching functional imaging analysis since about 1998.  I have reviewed a moderate number of papers but these have nearly all been methods papers which are not typical of FMRI papers. I have attended many lab meetings and presentations of experiments that have typically been to do with higher-order motor tasks or attention and memory.  To my shame I do not read many FMRI applications papers.  For the last 4 years or so I have worked nearly full time on code for imaging analysis.  That's the "nipy" in the title of this blog.

My own experience of analysis is that we make a huge number of mistakes.  There are many stupid and not so stupid mistakes to make, from getting the data off the scanner to making the final activation maps.  I  would claim to be more careful than average with my analysis (if only because I am interested in analysis) but I have made many mistakes, sometimes the same mistake several times in the same analysis.  I gave a talk on mistakes in functional MRI analysis at the Human Brain Mapping conference.  I gave examples of real mistakes, and in order to be fair, I only used my own mistakes as examples.  I didn't have much difficulty filling the time.

One of the problems is that there are many steps in the analysis.  Some of these steps are automated and some are half-automated.  For example, when I am doing an analysis I am likely to pick up some of my old scripts from a previous analysis.  I wrote these scripts, but I forget the hacks I put into them to make them work with the previous data.  The problem is much worse for new lab members and researchers who do not write code.  They often find that someone has given them a script that they don't understand, perhaps with some help in modifying it.  It's very easy to get lost inside these scripts and it's also easy to find yourself applying parameters that are not right for your own data without knowing it.
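To make this concrete, here is a minimal sketch of the kind of inherited script I mean. The values and the helper are hypothetical, not taken from any real script, but the pattern is typical:

```python
import nibabel as nib

# Hacks that were right for the *previous* study, but nothing
# warns the next user who inherits this script.
TR = 2.0       # seconds -- hard-coded; the new study might use TR = 2.5
N_DUMMIES = 4  # discarded volumes -- specific to the old scanner protocol

def load_run(fname):
    """Load a 4D FMRI run, silently dropping 'dummy' volumes."""
    data = nib.load(fname).get_fdata()
    # Correct for the old protocol only; wrong for data from another site
    return data[..., N_DUMMIES:]
```

Nothing here fails or even warns if the new data don't match these assumptions; the script just quietly gives you slightly wrong answers.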

A recent example was slice-timing correction on Siemens scanners.  It's probably more common than not to use interleaved slice acquisition for FMRI.  Typically this means that the bottom slice in space (slice 1) is collected first, then all the other odd-numbered slices from 3 up to the top of the brain.  After that you acquire the even-numbered slices, starting at the second slice from the bottom.  This is the classic 1, 3 ... 2, 4 ... interleaved acquisition order.  However, that's not right for some Siemens scanners: if there is an even number of slices, it turns out that Siemens acquires 2, 4 ... 1, 3 ...  Who knew?  It turned out some people did, but many people, including me, did not, and I'd been analyzing Siemens data for a long time.  If the TR is the time to acquire one whole brain volume, then getting the slice timing wrong means that every slice in your volume is half a TR wrong in time (plus or minus).  Now imagine you'd been using someone else's analysis script, and that the person who wrote the script didn't know either.  The problem here is that we often use scripts we don't fully understand or haven't reviewed recently, and that makes it less likely that we will go and check.
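For illustration, here is a small function implementing the rule as I've just described it. This is a sketch of that rule, not vendor-verified code; always check against your own scanner's documentation:

```python
def siemens_interleaved_order(n_slices):
    """Spatial slice numbers (1-based) in temporal acquisition order.

    Implements the rule described above: interleaved acquisition
    starts at slice 1 when n_slices is odd, but starts at slice 2
    on the Siemens scanners in question when n_slices is even.
    """
    start = 1 if n_slices % 2 else 2
    first_pass = list(range(start, n_slices + 1, 2))
    second_pass = list(range(3 - start, n_slices + 1, 2))
    return first_pass + second_pass

print(siemens_interleaved_order(5))  # [1, 3, 5, 2, 4] -- the classic order
print(siemens_interleaved_order(6))  # [2, 4, 6, 1, 3, 5] -- the surprise
```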

Then there are simple logic errors and typos.  Working on code has helped me understand just how prone we are to error.  Good practice in coding means testing every line of your code to convince yourself and others that it does what you say it does. When I do this, I find lots of errors.   I never did this when I was writing analysis scripts, and I don't think many researchers do.  I conclude that there must have been a considerable number of errors.  I would have found the errors that made the data look funny, but I would likely have missed errors that left the data looking plausible.
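Here is the kind of tiny test I mean, for a hypothetical helper of the sort that lives in analysis scripts (percent_signal_change is my made-up example, not a real library function):

```python
import numpy as np

def percent_signal_change(timecourse):
    """Convert a voxel time course to percent change from its mean."""
    baseline = np.mean(timecourse)
    return 100 * (timecourse - baseline) / baseline

def test_percent_signal_change():
    tc = np.array([100.0, 110.0, 90.0])
    # Mean is 100, so we expect 0%, +10%, -10%
    assert np.allclose(percent_signal_change(tc), [0.0, 10.0, -10.0])

test_percent_signal_change()
```

Writing the test forces you to state what the right answer is, and that is exactly the step we skip when we eyeball the output and move on.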

To deal with this we would need to teach ourselves good practices for using and writing software, because that is what we are doing.  But, this is almost never taught.  I was certainly not taught that, and stumbled across the basics after I had been doing a lot of coding for a long time.  Knowing what I know now, I would not let a student loose on FMRI data without a good basic knowledge of software engineering.   I hope very much that that becomes routine over the next 10 years.

The last aspect of analysis in FMRI is just how many different times we tend to analyze the data.  There are so many different things to try and we find ourselves taking many paths through the analysis. This is particularly marked when we get to the statistics.  There are many different statistical models to apply to the data, and we often end up trying a large number.  The great risk is that we will stop analyzing when we see a result we like.   This must occur often in practice.   That makes it very difficult to know whether the result is a real one or the result of trying many different analyses on data that has no real signal.
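A toy simulation makes the point; this is my illustration, not a model of any particular pipeline.  Analyze the same pure-noise data several slightly different ways, keep the best p value, and "significant" results appear well above the nominal 5% rate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_variants, n_subjects = 1000, 5, 20

false_positives = 0
for _ in range(n_experiments):
    data = rng.standard_normal(n_subjects)  # no real signal at all
    best_p = 1.0
    for k in range(n_variants):
        # Each k is a stand-in for "trying another analysis path",
        # here crudely modeled as dropping the first k subjects.
        _, p = stats.ttest_1samp(data[k:], 0.0)
        best_p = min(best_p, p)
    false_positives += best_p < 0.05

print(f"Nominal rate: 5%; observed: {100 * false_positives / n_experiments:.1f}%")
```

The variants here are highly correlated, so the inflation is milder than for independent tests (where it would approach 1 - 0.95**5, about 23%), but the observed rate still comes out above the nominal 5%.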

The difficulties of the analysis are compounded by the fact that we don't teach analysis very well.  The people doing analysis are often people like me, with very little background in engineering mathematics.  To understand the theory of the analysis you have to know some Fourier analysis, numerical optimization, image processing, brain anatomy, filter theory, linear algebra and statistical inference.  That is hard to teach in a short period, and we haven't had long to get this right: FMRI only started in 1992.  The result is that many people doing imaging feel overwhelmed by their lack of background knowledge.  I certainly suffered from that and still do.

Last there is the culture surrounding FMRI.  FMRI is new.  It is fashionable.  It is expensive.  It gets high-profile publications and gets you into the news.

John P. A. Ioannidis wrote a famous paper called "Why Most Published Research Findings Are False" (PLoS Medicine, 2005). Here are his 6 corollaries:

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true.
Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true.
Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true.
Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true.
Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true.
Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true.
I believe that anyone working in FMRI would recognize all 6 of these.

That is why I guessed 30%.