NeuRA Ahead of the Open Science Curve

I had great fun yesterday visiting NeuRA (Neuroscience Research Australia), a large research institute in Sydney. I was hosted by Simon Gandevia, Deputy Director, who has been a long-time proponent of Open Science and The New Statistics.

NeuRA’s Research Quality page describes the quality goals they have adopted, at the initiative of Simon and the Reproducibility & Quality Sub-Committee, which he leads. Not only goals, but strategies to bring their research colleagues on board–and to improve the reproducibility of NeuRA’s output. My day started with a discussion with this group. They described a whole range of projects they are working on to strengthen research at NeuRA–and to assess how quality is (they hope!) increasing.

For example, Martin Heroux described the Quality Output Checklist and Content Assessment (QuOCCA) tool that they have developed, and are now applying to recent past research publications from NeuRA. In coming years they plan to assess future publications similarly–so they can document the rapid improvement!

I should mention that Martin and Joanna Diong run a wonderful blog, titled Scientifically Sound–Reproducible Research in the Digital Age.

It was clear that the R&Q folks, at least, were very familiar with Open Science issues. Would my talk be of sufficient interest for them? Its title was Improving the trustworthiness of neuroscience research (it should have been given by Bob!), and the slides are here. The quality of the questions and discussion reassured me that at least many of the folks in the audience were (a) on board, but also (b) very interested in the challenges of Open Science.

After lunch my ‘workshop’ was actually a lively roundtable discussion, in which I sometimes managed to explain a bit more about Significance Roulette (videos are here and here), demo a bit of Bob’s new esci in R, or join in brainstorming strategies for researchers determined to do better. My slides are here.

Yes, great fun for me, and NeuRA impresses as working hard to achieve reproducible research. Exactly what the world needs.

Geoff

I Join an RCT: A View From the Other Side

In ITNS we discuss randomized controlled trials (RCTs) and I’ve taught about them since whenever. If done well, they should provide gold standard evidence about the benefits and harms of a therapy. So I was particularly interested to be invited to join a large RCT. My wife, Lindy, and I both elected to join and we are now into the daily ritual of taking a little white tablet, each of us not knowing whether we have the dud or the active version. Weird!

The RCT is StaREE, A Clinical Trial of STAtin Therapy for Reducing Events in the Elderly. Yep, I’m officially ‘elderly’ and have been for a couple of years! It’s an enormous multimillion dollar project aiming for 10,000 participants over something like 8 years. It’s publicly funded, no drug company money involved.

StaREE project description and justification (taken from the registration site)

Statin therapy has been shown to reduce the risk of vascular events in younger individuals with manifest atherosclerotic disease or at high risk of vascular events. However, data derived from meta-analyses of existing trials suggests that the efficacy of statins may decline sharply amongst those over 70-75 years of age. Insufficient patients of this age group have been included in major trials to be certain of the benefit. Within this age group part of the benefit of statin therapy may be offset by adverse effects including myopathy, development of diabetes, cancer and cognitive impairment, all of which are more prevalent in the elderly in any event.

The use of statins in the over 70 age group raises fundamental questions about the purpose of preventive drug therapy in this age group. When a preventive agent is used in the context of competing mortality, polypharmacy and a higher incidence of adverse effects its use should be justified by an improvement in quality of life or some other composite measure that demonstrates that the benefit outweighs other factors.

STAREE will determine whether taking daily statin therapy (40 mg atorvastatin) will extend the length of a disability-free life, determined from survival outside permanent residential care, in healthy participants aged 70 years and above.

Background Reading

There’s a big 2016 review in The Lancet on open access here. A more recent review and meta-analysis is here. These seem to me to support the need for StaREE. Atorvastatin, the cholesterol-lowering drug under study, seems safe and effective, while being cheap and widely known. But most research to date has focussed on folks who have already had some cardiac event, or have high risk factors. There is a need for evidence specifically about its possible value for healthy older folks.

My Experience So Far

Of course I first read all that the StaREE website had to offer, and what’s public in the registration of the trial at ClinicalTrials.gov and (more or less the same information) the Australian New Zealand Clinical Trials Registry.

I have asked the researchers for any further information they can give me, including information on:

  • how the sample size of 10,000 was chosen
  • the extent to which the full data and analysis scripts will be open, i.e. on public access
  • further details of the data analysis planned, beyond the long list of measures included in the registration (these first 3 dot points are about Open Science good practice)
  • details of how the safety committee is to operate and, in particular, what criteria it would use in deciding whether the trial should be stopped. (Such a committee is independent of the researchers and sees progressive results, without blinding, so it can monitor for any emerging trend that the therapy is clearly far better, or worse, than placebo. What evidence would lead it to stop the trial?)
  • budget

No reply yet–I may report further if I find out more.

At my first appointment a nurse trained in the StaREE protocols explained everything in very simple terms. I signed, including agreement that the researchers could have full access to my medical records, past, present and future. I did some cognitive tests, mainly of memory. (One of the aims is to assess the extent to which dementia risk might be reduced by the medication.) Then I started the one-month lead-in period, during which I had to take a tablet every day. This was stated to be placebo, despite one of the reviews (links above) arguing that it’s more informative to use the active tablets for all participants during the lead-in. There was a long questionnaire about my medical history, and a blood test, including cholesterol, was ordered.

Then I needed to see my regular doctor, who shared with me the blood test results. My LDL (‘bad’ cholesterol) and HDL (‘good’ cholesterol) levels were well within the normal range, as I expected. She signed to say that none of the StaREE exclusion criteria applied to me and that I could join the trial.

It was made clear that my own health care is first priority, so my doctor can, if she judges it necessary, be unblinded and perhaps advise me to leave the trial. Of course it should work that way, but that’s just one more complication for the researchers.

At my second appointment there were more cognitive tests. Quite amusing, given that I was familiar with some, having taught about them way back. Even so, it’s not so easy to remember the long list of words at first go, and to minimise Stroop interference while quickly reading out the ink colour of incongruent colour words. (Say ‘RED’ on seeing BLUE in red ink.)

Then the big pack of my tablets arrived in the post and I confirmed online that I was starting. One more thing to remember when packing for any travels, even overnight.

Blinding–but perhaps not of participants

According to the registration, the blinding (actually referred to as ‘masking’) is: “Quadruple (Participant, Care Provider, Investigator, Outcomes Assessor)”. Excellent. At present I’m blinded, but at any stage my doctor might order a blood test, including for cholesterol. A distinct drop in my LDL (and perhaps a boost to my HDL) would strongly suggest that I’m on the statin medication. I would, in effect, be unblinded.

I’ve also read about, and had anecdotal reports of, small changes that can sometimes be felt when starting statins–call them minor, short-lived side effects. These might also unblind a participant, although of course anyone aware of such possibilities might judge them to occur when actually starting placebo!

When I return for my next StaREE appointment after 12 months, there will be more cognitive tests and the researchers will order a blood test, but that will specifically not include cholesterol–so the researchers will remain blinded. However, many older folks have blood tests, including for cholesterol, regularly, even annually. And the documentation about atorvastatin provided by StaREE states that “your cholesterol… levels need to be checked regularly…”. So many participants could come to know their cholesterol levels, say a year or so into the trial, and thus be unblinded. I don’t see any way around this. Just one more complication for researchers, and perhaps for the interpretation of results.

Efficacy or Effectiveness?

Efficacy… under ideal circumstances, e.g. in an ideal RCT. Effectiveness… in the real world. See here for A Primer on Effectiveness and Efficacy Trials.

RCTs are easy to criticise for imposing so many restrictions on who can participate–in the interests of minimising nuisance variability–that treatment benefits may be overestimated compared with what’s realistic to expect in everyday clinical practice, no doubt with a more diverse set of patients. StaREE does have a list of exclusion criteria (see the 10 dot points at the registration site) but this case is a little different: The research question asks about possible treatment benefits for a wide range of generally healthy folks; the exclusion criteria are actually not very restrictive. Maybe the efficacy estimated by StaREE won’t be all that different from the effectiveness to be expected if statin use becomes widespread among the healthy elderly. (That word again!)

Compliance is often notoriously low, especially, I would suspect, when a drug is to be taken forever, and for a not very dramatic–even though worthwhile–reduction in risk of some nasty outcome. In StaREE, volunteers who choose to take part in research may be quite compliant–although perhaps the knowledge that there is only a 50-50 chance that the tablets contain the active ingredient might reduce compliance. Conversely, if StaREE does find evidence that statins are worth taking by healthy older folks, then folks won’t have the special context of a research project, but they will be sure that they are getting the good stuff. I’ll be interested to know StaREE’s compliance rate, but I’m unsure how well that will predict real-life compliance.

How Do I Feel?

It’s great that such a study has been funded. My experience so far emphasises what an enormous task it is to plan, set up, and run such a usefully large study. Far more difficult and complex than to write a couple of pages in a textbook about how an RCT should be designed! (A year ago I wrote a post about two enormous and expensive RCTs–the SYNTAX and EXCEL studies–that compared stents and coronary grafts as treatments for heart disease. The two very different approaches turn out, overall, to be about equally good.)

I’m keen to know how closely this StaREE study aligns with Open Science practices.

Of course I’ll be 100% compliant! I’ll take my little white tablets every single day, even if it’s a coin toss whether they are all duds or not. Yes, for sure! Well, a couple of years on, will I still feel this way? I hope so. But consider: 10,000 participants, taking tablets daily for an average of 5 years or so, is around 18 million tablet-taking moments. And all those moments are needed to find out some simple information: by how much is the risk of stroke, heart attack, etc., reduced by taking the little white tablets that do contain (what we hope is) the good stuff?

Most of us probably owe our lives to modern medicine, partly thanks to participants who have consented to join past studies. I’m glad to have the chance to join what looks like a well-done and worthwhile study now.

Geoff


Replications: How Should We Analyze the Results?

Does This Effect Replicate?

It seems almost irresistible to think in terms of such a dichotomous question! We seem to crave an ‘it-did’ or ‘it-didn’t’ answer! However, rarely if ever is a bald yes-no decision the most informative way to think about replication.

One of the first large studies in psychology to grapple with the analysis of replications was the classic RP:P (Reproducibility Project: Psychology) reported by Nosek and many colleagues in Open Science Collaboration (2015). The project identified 100 published studies in social and cognitive psychology then conducted a high-powered preregistered replication of each, trying hard to make each replication as close as practical to the original.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716-1 to aac4716-8.

The authors discussed the analysis challenges they faced. They reported 5 assessments of the 100 replications:

  • in terms of statistical significance (p<.05) of the replication
  • in terms of whether the 95% CI in the replication included the point estimate in the original study
  • comparison of the original and replication effect sizes
  • meta-analytic combination of the original and replication effect sizes
  • subjective assessment by the researchers of each replication study: “Did the effect replicate?”

Several further approaches to analyzing the 100 pairs of studies have since been published.

Even so, the one-liner that appeared in the media and swept the consciousness of psychologists was that RP:P found that fewer than half of the effects replicated: Dichotomous classification of replications rules! For me, the telling overall finding was that the replication effect sizes were, on average, just half the original effect sizes, with large spreads over the 100 effects. This strongly suggests that reporting bias, p-hacking, or some other selection bias influenced some unknown proportion of the 100 original articles.

OK, how should replications be analyzed? Happily, there has been progress.

Meta-Analytic Approaches to Assessing Replications

Larry Hedges, one of the giants of meta-analysis since the 1980s, and Jacob Schauer recently published a discussion of meta-analytic approaches to analyzing replications:

Hedges, L. V., & Schauer, J. M. (2019). Statistical analyses for studying replication: Meta-analytic perspectives. Psychological Methods, 24, 557-570. http://dx.doi.org/10.1037/met0000189

Abstract
Formal empirical assessments of replication have recently become more prominent in several areas of science, including psychology. These assessments have used different statistical approaches to determine if a finding has been replicated. The purpose of this article is to provide several alternative conceptual frameworks that lead to different statistical analyses to test hypotheses about replication. All of these analyses are based on statistical methods used in meta-analysis. The differences among the methods described involve whether the burden of proof is placed on replication or nonreplication, whether replication is exact or allows for a small amount of “negligible heterogeneity,” and whether the studies observed are assumed to be fixed (constituting the entire body of relevant evidence) or are a sample from a universe of possibly relevant studies. The statistical power of each of these tests is computed and shown to be low in many cases, raising issues of the interpretability of tests for replication.

The discussion is a bit complex, but here are some issues that struck me:

  • We usually wouldn’t expect the underlying effect to be identical in any two studies. What small difference would we regard as not of practical importance? There are conventions, differing across disciplines, but it’s a matter for informed judgment. In other words, how different could underlying effects be, while still justifying a conclusion of successful replication?
  • Should we choose fixed-effect or random-effects models? Random-effects is usually more realistic, and what ITNS and many other books recommend for routine use. However, H&S use fixed-effect models throughout, to limit the complexity. They report that, for modest amounts of heterogeneity, their results do not differ greatly from what random-effects would give.
  • Meta-analysis and effect size estimation are the focus throughout, but even so the main aim is to carry out a hypothesis test. The researcher needs to choose whether to place the burden of proof on nonreplication or replication. In other words, is the null hypothesis that the effect replicates, or that it doesn’t?
  • One main conclusion from the H&S discussion and the application of their methods to psychology examples is that replication projects typically need many studies (often 40+) to achieve adequate power for those hypothesis tests, and that even large psychology examples are underpowered.

A further sign that the issues are complex is the comment published immediately following H&S, which suggests an alternative way to think about heterogeneity and replication:

Mathur, M. B., & VanderWeele, T. J. (2019). Challenges and suggestions for defining replication “success” when effects may be heterogeneous: Comment on Hedges and Schauer (2019). Psychological Methods, 24, 571-575. http://dx.doi.org/10.1037/met0000223

H&S gave a brief reply:

Hedges, L. V., & Schauer, J. M. (2019). Consistency of effects is important in replication: Rejoinder to Mathur and VanderWeele (2019). Psychological Methods, 24, 576-577. http://dx.doi.org/10.1037/met0000237

Our Simpler Approach

I welcome the above three articles and would study them in detail before setting out to design or analyze a large replication project.

In the meantime I’m happy to stick with the simpler estimation and meta-analysis approach of ITNS.

Meta-Analysis to Increase Precision

Given two or more studies that you judge to be sufficiently comparable, in particular because they address more-or-less the same research question, use random-effects meta-analysis to combine the estimates given by the studies. Almost certainly, you’ll obtain a more precise estimate of the effect most relevant for answering your research question.

Estimating a Difference

If you have an original study and a set of replication studies, you could consider (1) meta-analysis to combine evidence from the replication studies, then (2) finding the difference (with CI of course) between the point estimate found by the original study and that given by the meta-analysis. Interpret that difference and CI as you assess the extent to which the replication studies may or may not agree with the original study.

If the original study was possibly subject to publication or other bias, and the replication studies were all preregistered and conducted in accord with Open Science principles, then a substantial difference would provide evidence for such biases in the original study–although other causes couldn’t be ruled out.
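
For readers who like to see these two steps as code, here is a minimal sketch using the metafor package in R (rather than esci); every effect size, variance, and original-study value below is made up purely for illustration.

    # A minimal sketch, assuming the metafor package; yi and vi are hypothetical.
    library(metafor)

    reps <- data.frame(
      yi = c(0.21, 0.15, 0.30, 0.12),    # replication effect sizes (e.g., Cohen's d)
      vi = c(0.020, 0.015, 0.025, 0.018) # their sampling variances
    )

    # Step 1: random-effects meta-analysis of the replication studies
    ma <- rma(yi, vi, data = reps, method = "REML")

    # Hypothetical original-study result
    orig_est <- 0.55
    orig_se  <- 0.14

    # Step 2: difference between the original estimate and the meta-analytic
    # estimate, with an approximate 95% CI (assuming the estimates are independent)
    diff    <- orig_est - as.numeric(coef(ma))
    se_diff <- sqrt(orig_se^2 + ma$se^2)
    ci      <- diff + c(-1, 1) * qnorm(0.975) * se_diff
    round(c(difference = diff, lower = ci[1], upper = ci[2]), 2)

A substantial difference, interpreted with its CI, would be the quantity discussed above when assessing possible bias in the original study.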

Moderation Analysis

Following meta-analysis, consider moderation analysis, especially if DR, the diamond ratio, is more than around 1.3 and if you can identify a likely moderating variable. Below is our example from ITNS in which we assess 6 original studies (in red) and Bob’s two preregistered replications (blue). Lab (red vs. blue) is a possible dichotomous moderator. The difference and its CI suggest that publication or other bias may have influenced the red results, although other moderators (perhaps that the red studies were conducted in Germany and the blue in the U.S.) may help account for the red-blue difference.

Figure 9.8. A subsets analysis of 6 Damisch and 2 Calin studies. The difference between the two subset means is shown on the difference axis at the bottom, with its CI. From d subsets.
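
For those who prefer to script this kind of subsets comparison rather than use esci, here is a hedged sketch with the metafor package in R; the effect sizes, variances, and group labels are invented for illustration and are not the Damisch/Calin data.

    # A sketch of a moderator (subsets) analysis, assuming metafor;
    # all numbers below are hypothetical, not the actual ITNS example.
    library(metafor)

    dat <- data.frame(
      yi  = c(0.83, 0.71, 0.66, 0.79, 0.72, 0.68, 0.05, 0.10),  # effect sizes
      vi  = c(0.06, 0.05, 0.06, 0.05, 0.06, 0.05, 0.02, 0.02),  # sampling variances
      lab = c(rep("original", 6), rep("replication", 2))        # dichotomous moderator
    )

    # Random-effects model with lab as moderator; the 'labreplication'
    # coefficient estimates the subset difference, with its CI
    mod <- rma(yi, vi, mods = ~ lab, data = dat, method = "REML")
    summary(mod)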

My overall conclusions are (1) it’s great to see that meta-analytic techniques continue to develop, and (2) our new-statistics approach in ITNS continues to look attractive.

Geoff

‘Preregistration’ or ‘Registration’?

For years, medicine has urged the ‘registration’ of clinical trials before data collection starts. More recently, psychology has come to use the term ‘preregistration’ for this vital component of Open Science. The ‘pre’ puts it in your face that it happens at the start, but should we fall into line and use the long-established term ‘registration’ for consistency and to avoid possible confusion between disciplines? There is already a move in that direction: some psychology journals are accepting Registered Reports, rather than Preregistered Reports.

This recent article makes a spirited argument for consistency:

Rice, D. B., & Moher, D. (2019). Curtailing the use of Preregistration: A misused term. Perspectives on Psychological Science, 14, 1105-1108. doi: 10.1177/1745691619858427

Here’s the abstract:

Abstract
Improving the usability of psychological research has been encouraged through practices such as prospectively registering research plans. Registering research aligns with the open-science movement, as the registration of research protocols in publicly accessible domains can result in reduced research waste and increased study transparency. In medicine and psychology, two different terms, registration and preregistration, have been used to refer to study registration, but applying inconsistent terminology to represent one concept can complicate both educational outreach and epidemiological investigation. Consistently using one term across disciplines to refer to the concept of study registration may improve the understanding and uptake of this practice, thereby supporting the movement toward improving the reliability and reproducibility of research through study registration. We recommend encouraging use of the original term, registration, given its widespread and long-standing use, including in national registries.

Which should we use in ITNS2?

There was a bit of discussion about the issue at the AIMOS Conference, with a range of views presented.

In ITNS we use ‘preregistration’ all through–I’ve been happy with that because the ‘pre-‘ avoids any ambiguity. But what should we use in ITNS2, the second edition we’re just starting to prepare?

No doubt we’ll explain both terms, but my current inclination is to switch and use registration all through. Tomorrow I might think differently. Psychology’s preference may become clearer in coming months.

What do you think? Please comment below or send an email. Thanks!

Geoff

g.cumming@latrobe.edu.au

Congratulations Professor Fiona Fidler!

Just as the fabulous AIMOS Conference — one of Fiona’s most recent triumphs — was wrapping, it was announced officially that Fiona Fidler has been appointed as full PROFESSOR at the University of Melbourne. Wonderful news!

Wow, when Simine Vazire arrives at the University of Melbourne next year, also as professor, the world of non-open science had better watch out!

Hearty congratulations to Fiona!

Geoff


AIMOS — The New Interdisciplinary Meta-Research and Open Science Association

Association for Interdisciplinary Meta-Research & Open Science (AIMOS)

I had a fascinating two days down at the University of Melbourne last week for the first AIMOS conference. The program is here and you can click through to see details of the sessions.

Congratulations to Fiona Fidler and her team for pulling off such a terrific event! At least 250 folks attended, and a huge range of disciplines and talk topics was represented.

The Association was formally launched at a meeting with real buzz. The organisers were taken aback (in a good way) to have so many nominations for some office-holder and committee positions that elections were needed. The incoming President is Hannah Fraser (see here and scroll down).

See more about AIMOS and the launch here.

We were told to pay attention to the title of the Association. The ‘A’ does NOT stand for Australia! The ‘I’ stands for interdisciplinary, and we really mean that! Also, meta-research and Open Science are not the same! Phew. But all those points were amply exemplified by the fabulous diversity of speakers and topics. Philosophy to ecology, politics to medicine, economics to statistics, and tons more besides.

A Few Highlights

Haphazardly chosen:

  • Simine Vazire gave a rousing opening keynote, asking whether we want to be credible or incredible. (Breaking news: Simine is joining the University of Melbourne from July next year. Wonderful!)
  • Federal politician Andrew Leigh, author of Randomistas–a great book about using RCTs to develop and guide public policy and about which I blogged last year–spoke about evidence-based policy in the public interest, and how research can shape that. Best one-liner, reflecting on replication: “If at first you do succeed, try, try and try again.” It’s a terrible shame that this year’s election didn’t see him and his colleagues running the country.
  • James Heathers loves naming things. That’s just part of his enthusiastic and highly effective way of communicating. He develops ways to identify errors in published articles, and gives his methods names including GRIM, SPRITE, DEBIT, and RIVETS.
  • Franca Agnoli, from Padua, reported that Bob’s talk (see that link for links to lots of new-statistics goodies) a month or so ago in Cesena, Italy, was terrific.

Estimation: Why and How, now with R

That was the title of my 90-minute workshop. About 22 folks participated, and my slides, which are here, have been accessed by 32 unique visitors. I loved demonstrating Bob’s part-prototype R module, esci.jmo, which you can download here. It can be side-loaded into jamovi. The full version of esci.jmo will be the key upgrade of ITNS for the second edition. That’s our task for 2020!

Please be warmly encouraged to sign up to join AIMOS, which is intended to be a global association. Next year’s conference will be in Sydney. You can join a mailing list here–ignore the outdated title; I’m sure the AIMOS sites will be updated shortly.

Geoff

Good Science Requires Unceasing Explanation and Advocacy

Recently in Australia a proposal was made for an “independent science quality assurance agency”. Justification for the proposal made specific reference to “the replication crisis” in science.

Surely we can all support a call for quality assurance in science? Not so fast! First, some context.

Australia’s Great Barrier Reef, one of the wonders of the natural world, is under extreme threat. Warming oceans, increasing acidity, sea level rise, and more pose grave dangers. Indeed, the GBR has suffered two devastating coral bleaching events in recent years, with maybe half the Reef severely damaged.

A recent analysis identified 45 threats to the Reef. Very high on the list is coastal runoff containing high levels of nutrients (primarily from farming) and sediment.

The Queensland State Government is introducing new laws to curb such dangerous runoff.

Now, back to the proposal. It was made by a range of conservative politicians and groups who are unhappy with the new laws. They claim that the laws are based on flawed science–results that haven’t been sufficiently validated and could therefore be wrong.

Fiona Fidler and colleagues wrote a recent article in The Conversation taking up the story and arguing that the proposed agency is not the way to improve science, and that the proposal is best seen as a political move to discredit science and try to reduce what little action is being taken to protect the Reef. Their title summarises their message: Real problem, wrong solution: why the Nationals shouldn’t politicise the science replication crisis. (The Nationals are a conservative party, which is part of the coalition federal government. This government includes many climate deniers and continues to support development of vast coal and gas projects.)

Fiona and colleagues reiterate the case for a properly-constituted national independent office of research integrity, but that’s a quite different animal. You can hear Fiona being interviewed on a North Queensland radio station here. (Starting at about the 1.05 mark.)

Yes, unending explanation and advocacy, as Fiona and colleagues are doing, is essential if good Open Science practices are to flourish and achieve widespread understanding and support. And if sound evidence-based policy is to be supported.

The proposal by the Nationals is an example of agnotology–the deliberate promotion of ignorance and doubt. The tobacco industry may have written the playbook for agnotology, but climate deniers are now using and extending that playbook, with devastating risk to our children’s and grandchildren’s prospects for a decent life. Shame.

A salute to Fiona and colleagues, and to everyone else who is keeping up the good work, explaining, advocating, and adopting excellent science practices.

Geoff

Open Statistics Conference – Talk and Resources

I had the great pleasure today of discussing the estimation approach (New Statistics) at the Open Statistics / Open Eyes conference in Cesena, University of Bologna.

Here I’m posting some resources for those looking to get started with the New Statistics:

  • Here’s our “Getting Started” page with links to textbooks, videos, and more.
  • Here’s the esci module for R and jamovi (JASP soon, I hope?). Still in beta. Download esci.jmo and then sideload it into jamovi. https://github.com/rcalinjageman/esci
  • Here’s esci for Excel, a free set of tools for estimation in the frequentist approach: https://thenewstatistics.com/itns/esci/
  • Here’s our OSF page on teaching the New Statistics, with lots of resources for instructors: https://osf.io/muy6u/

And here’s my talk: https://www.dropbox.com/s/tpn3sph0mq8olel/Open%20Statistics%20Talk%20-%20RCJ.pptx?dl=0

Registered Reports: Conjuring Up a Dangerous Experiment

Last week I (Bob) had my first Registered Report proposal accepted at eNeuro. It’s another collaboration with my wife, Irina, in which we will test two popular models of forgetting. The proposal, pre-registration, analysis script, and preliminary data are all on the OSF: https://osf.io/z2uck/. Contrary to popular practice, we developed our proposal for original research, not for a replication. We opted for a registered report because we wanted to set up a fair and transparent test between two models–it seemed to us both that this required setting the terms of the test publicly in advance and gaining initial peer review that our approach is sensible and valid.

Although having the proposal accepted feels like a triumph, I am sooooo anxious about this. I’m anxious because our proposal represents what Irina calls a “Dangerous Experiment”. She came up with this phrase in grad school when she was running an experiment that had the potential to expose much of her previous work as wrong. It was stomach-churning to collect the data. In fact, someone on her faculty even suggested ways she could present her work that would let her avoid doing the experiment. Irina decided that avoidance was not the right strategy for a scientist (yes, she’s amazing), and that she had to white-knuckle through it. In that first experience with a dangerous experiment she was vindicated.

Since then, we often discuss Dangerous Experiments and we push each other to find them and confront them head on. Sometimes they’ve ended in tears (which is why we no longer study habituation [1] or use certain “well-established” protocols for Aplysia [2]). Other times we’ve been vindicated, to our great relief and satisfaction [3]. Lately the philosopher of science Deborah Mayo has popularized the idea of a severe test as important in moving science forward. I haven’t finished her book, but I suspect Irina and Mayo would get along.

Our experience has convinced me that Registered Reports will typically yield Dangerous Experiments–that this is their strength and also what makes them so terrifying. For registered reports, though, the danger is not in shattering the research hypothesis–the danger comes from the stress put upon the strength and mastery of your method. A registered report requires you to plan your study very carefully in advance–defining your sample, your exclusions, your research questions, your analyses, and your benchmarks for interpretation. You have to be pretty damn sure you know what you’re doing, because if you fail to anticipate an eventuality then the whole enterprise could collapse. So it’s like building a tightrope and then seeing if you can really walk it. Dangerous, indeed. But the payoff is walking across a chasm towards some firm epistemic ground–that mystical place where legend has it you can move the world.

Putting together our registered report required doing a lot of “pre data” work to assure ourselves that we had a design and protocol we could feel confident would be worth executing with fidelity. We simulated outcomes under different models to ensure the analyses we were planning would be sensitive enough to discriminate between them. We developed positive controls that could give us independent assessments of protocol validity. We also expanded our design to include an internal replication to provide an empirical benchmark for data reliability. By mentally stepping through the project and conferring with the reviewers we built a tightrope we *think* is actually a sure bet to cross safely. Time will tell.
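
To give a flavour of that simulation step, here is a toy sketch in R–not our actual simulation code, and every mean, SD, and sample size in it is a hypothetical placeholder. The basic idea is to generate data under each model and check how often the planned comparison separates them.

    # Toy sketch: simulate a savings-vs-encoding comparison under two scenarios.
    # All means, SDs, and sample sizes are hypothetical placeholders.
    set.seed(1)

    simulate_once <- function(n = 12, savings_mean, encoding_mean = 1) {
      # savings_mean near encoding_mean mimics a decay-style model (savings looks
      # like re-encoding); savings_mean near 0 mimics a retrieval-failure-style
      # model (savings has a distinct, weaker signature).
      savings  <- rnorm(n, mean = savings_mean, sd = 1)
      encoding <- rnorm(n, mean = encoding_mean, sd = 1)
      t.test(savings, encoding)$p.value
    }

    # How often would the planned comparison distinguish the scenarios?
    mean(replicate(2000, simulate_once(savings_mean = 0)) < .05)  # retrieval-failure world
    mean(replicate(2000, simulate_once(savings_mean = 1)) < .05)  # decay world (~ alpha)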

The whole process reminds me of something I used to do as a kid when playing Hearts: I used to lay down my first 5 plays on the table (face down) and then turn them up one-by-one as the tricks played out. It drove my siblings crazy. Usually I guessed wrong about how play would go and would have to delay the game to pick up my cards and re-think. Every once in a while, though, I would get to smugly turn the cards over in series like the Thomas Crown of playing cards. Registered reports ask for something like this: Are your protocols and standards well-developed enough that you can sequence them and execute them according to plan and still end up exactly where you want to be?

Does the dangerous nature of a registered report support the frequent criticism that pre-registration is a glass prison? Perhaps. If this whole endeavor crashes and burns I’ll probably move closer to that point of view. But I can’t help but feel that this is how strong science must be done–that if you can’t point at the target and then hit it you don’t really know what you’re doing. That’s ok, of course–we’re lost in lots of fields and need exploration, theory building, and the like. Not every study needs to be a registered report. But it does seem to me that Registered Reports are the ideal to aspire to–that we can’t really say an effect is “established” or “textbook” or “predicted by theory” until we can repeatedly call our shots and make them. Or so it seems to me at the moment…. check back in 2 months to see what happened.

Oh yeah… if you’re here on this stats blog but curious about the science of forgetting, here’s the study Irina and I are conducting. We have come up with what we think is a very clever test between two long-standing theories of forgetting. Neurobiologists have tended to think of forgetting as a decay process, where entropy (mostly) dissolves the physical traces of the memory. Psychologists, however, argue that forgetting is a retrieval problem, not a storage problem. They contend that memory traces endure (perhaps forever), but become inaccessible due to the addition of new memories.

Irina and I are going to test these theories by tracking what happens when a seemingly forgotten memory is re-kindled, a phenomenon called savings memory. If forgetting is a decay process, then savings should involve re-building the memory trace, and it should thus be mechanistically very similar to encoding. If forgetting is retrieval failure, then savings should just re-activate a dormant memory, and this should be a distinct process relative to encoding. Irina and I will track the transcriptional changes activated as Aplysia experience savings and compare this to Aplysia that have just encoded a new memory. We *should* be able to get some pretty clear insight into the neurobiology of both savings and forgetting.

I genuinely have no idea which model will be better supported by the data we collect… depending on the day I can convince myself either way. As I mentioned above, my anxiety is not over which model is right but over if our protocol will actually yield a quality test…. fingers crossed.

  1. Holmes G, Herdegen S, Schuon J, et al. Transcriptional analysis of a whole-body form of long-term habituation in Aplysia californica. Learn Mem. 2014;22(1):11-23. https://www.ncbi.nlm.nih.gov/pubmed/25512573
  2. Bonnick K, Bayas K, Belchenko D, et al. Transcriptional changes following long-term sensitization training and in vivo serotonin exposure in Aplysia californica. PLoS One. 2012;7(10):e47378. https://www.ncbi.nlm.nih.gov/pubmed/23056638
  3. Cyriac A, Holmes G, Lass J, Belchenko D, Calin-Jageman R, Calin-Jageman I. An Aplysia Egr homolog is rapidly and persistently regulated by long-term sensitization training. Neurobiol Learn Mem. 2013;102:43-51. https://www.ncbi.nlm.nih.gov/pubmed/23567107

Transparency of reporting sort of saves the day…

I’m in the midst of an unhappy experience serving as a peer reviewer. The situation is still evolving but I thought I’d put up a short post describing (in general terms) what’s happened because I’d be happy to have some advice/input/reactions. Oh yeah, this is a post by Bob (not Geoff).

I am reviewing a paper that initially seemed quite solid. In the first round of review my main suggestion was to add more detail and transparency: to report the exact items used to measure the main construct, the exact filler items used to obscure the purpose of the experiment, any exclusions, etc.

The authors complied, but on reading the more detailed manuscript I found something really bizarre: the items used to measure the main construct changed from study to study, and often items that would seem to be related to the main construct were deemed filler from one study to the next.

Let’s say the main construct was self-esteem (it was not). In the first experiment there were several items used to measure self-esteem, all quite reasonable. But in a footnote giving the filler items I found not only genuine filler (“I like puppies”) but also items that seem clearly related to self-esteem… things as egregious as “I have high self esteem”. WTF? Then, in the next experiment the authors write that they measured their construct similarly but list different items, including one that had been deemed filler in Experiment 1. Double-WTF! And, looking at the filler items listed in a footnote I again find items that would seem to be related to their construct. I also find a scale that seems clearly intended as a manipulation check but which has not been mentioned or analyzed in either version of the manuscript (under-reporting!). The remaining experiments repeat the same story: measurement described as the same, but always with different items and some head-scratching filler choices.

There were other problems now detectable with the more complete manuscript. For example, it was revealed (in a footnote) that the statistical significance of a key experiment was contingent on removal of a single outlier–something that had not been mentioned before! But the main one that has me upset is what seems to be highly questionable measurement.

One easy lesson I’ve learned from this is how important it is as a reviewer to push for full and transparent reporting. Without key details on how constructs were measured, what else was measured, what participants were excluded, etc. it would have been impossible to detect deficiencies in the evidence presented.

What has me agitated is what happens now. I sent back my concerns to the editor. If the problems are as severe as I thought (I could be wrong), I expect the paper will be rejected. But what happens next? These authors were clearly willing to submit a less-complete manuscript before. What if they submit the original version elsewhere, the one that makes it impossible to detect the absurdity of their measurement approach? The original manuscript seemed amazing; I have no doubt it could be published somewhere quite good. So has my push for transparency really saved the day, or will it just end up helping the authors better know what they should and shouldn’t include in the manuscript to get it published?

At this point, I don’t know. I’m still in the middle of this. But here are some possible outcomes:

  • It’s all just a misunderstanding: The authors could reply to my review and clarify that their measurement strategy was consistent and sensible but not correctly represented in the manuscript. That’d be fine; I’d feel much less agitated.
  • The authors re-analyze the data with consistent measurement and resubmit to the journal, letting the significance chips fall where they may. That’d also be fine. Rooting for this one.
  • The authors shelve the project. Perhaps the authors will just give up on the manuscript. To my mind this is a terrible outcome–they have 3 experiments involving hundreds of participants testing an important theory. I’d really like to know what the data says when properly analyzed. The suppression of negative evidence from the literature is the most critical failure of our current scientific norms. I feel like, in some ways, once you submit a paper for review it almost *has* to be published in some way, especially with the warts revealed… wouldn’t that be useful for emptying the file drawer and also deepening how we evaluate each other’s work?
  • The authors submit elsewhere, reverting to the previous manuscript that elided all the embarrassing details and which gave an impression of presenting very solid evidence for the theory. I suspect this is the most likely outcome. Nothing new to this. I remember one advisor in grad school who said (jokingly) that the first submission is to reveal all the mistakes you need to cover up. I guess the frustrating thing here is how uneven transparent reporting still is. I was one of 4 reviewers for this paper and I was the only one who asked for these additional details. If the authors want to go this route, I think they’ll have an easy time finding a journal that doesn’t push them for the details. How long until we plug those gaps? Why are we still reviewing for or publishing journals that don’t take transparent reporting seriously?

I suppose I should organize a betting pool. Any predictions out there? What odds would you give for these different outcomes? Also, I’d be happy to hear your comments and/or similar stories.

Last but not least, here are some questions on my mind from this experience:

  • How much longer before we can consider sketchy practices like this outright research misconduct? I mean, if you are working in this field, can you any longer plead ignorance? At this point shouldn’t you clearly know that flexible measurement and exclusions are corrupt research practices? If this situation is as bad as I think it is, does it cross the threshold to actual misconduct? I could forgive those who engaged in this type of work in the past (and I know I did, myself), but at this point I don’t want any colleagues who would be willing to pass off this type of noise-mining as science.
  • Would under-reporting elsewhere transform a marginal case of research misconduct into a clear case? Even if initially submitting a p-hacked manuscript doesn’t yet qualify as a clear-cut case of research misconduct, would re-submitting it elsewhere after the problems have been pointed out to you count as research misconduct?
  • Does treating the review process like a confessional exacerbate these problems? My understanding (which some have challenged on Twitter) is that the review process is confidential and that I cannot reveal/publicize knowledge I gained through the review process alone. Based on that, I don’t think I would have any public recourse if the authors were to publish a less-complete manuscript elsewhere. My basis for criticizing it would be my knowledge of which items were and were not considered filler, knowledge I would only have from the review process. So I think my hands would be tied–I wouldn’t be able to notify the editor, write a letter to the editor, post to PubPeer, etc. without breaking the confidentiality of the review. I’m not sure if journal editors are as bound by this… so perhaps the editor at the original journal could say something? I don’t know. This is all still so hypothetical at this point that I don’t plan on worrying about it yet. But if I do eventually see a non-transparent manuscript from these authors in print I’ll have to seriously consider what my obligations and responsibilities are. It would be a terrible shame to have a prominent theory supported and cited in the literature by what I suspect to be nonsense; but I’d have to figure out how to balance that harm against my responsibilities for confidential review.

Ok – had to get that all out of my head. Now back to the sh-tstorm that is the fall 2019 semester. Peace, y’all.
