Recently in Australia a proposal was made for an “independent science quality assurance agency”. Justification for the proposal made specific reference to “the replication crisis” in science.
Surely we can all support a call for quality assurance in science? Not so fast! First, some context.
Australia’s Great Barrier Reef, one of the wonders of the natural world, is under extreme threat. Warming oceans, increasing acidity, sea level rise, and more pose grave threats. Indeed the GBR has suffered two devastating coral bleaching events in recent years, with maybe half the Reef severely damaged. Runoff of sediment, nutrients, and pesticides from coastal farms adds yet more damage.
The Queensland State Government is introducing new laws to curb such dangerous runoff.
Now, back to the proposal. It was made by a range of conservative politicians and groups who are unhappy with the new laws. They claim that the laws are based on flawed science–results that haven’t been sufficiently validated and could therefore be wrong.
Fiona Fidler and colleagues wrote a recent article in The Conversation taking up the story and arguing that the proposed agency is not the way to improve science, and that the proposal is best seen as a political move to discredit science and to wind back what little action is being taken to protect the Reef. Their title summarises their message: Real problem, wrong solution: why the Nationals shouldn’t politicise the science replication crisis. (The Nationals are a conservative party, part of the coalition federal government. This government includes many climate deniers and continues to support development of vast coal and gas projects.)
Fiona and colleagues reiterate the case for a properly-constituted national independent office of research integrity, but that’s quite a different animal. You can hear Fiona being interviewed on a North Queensland radio station here (starting at about the 1.05 mark).
Yes, unending explanation and advocacy, as Fiona and colleagues are doing, is essential if good Open Science practices are to flourish and achieve widespread understanding and support. And if sound evidence-based policy is to be supported.
The proposal by the Nationals is an example of agnotology–the deliberate promotion of ignorance and doubt. The tobacco industry may have written the playbook for agnotology, but climate deniers are now using and extending that playbook, with devastating risk to our children’s and grandchildren’s prospects for a decent life. Shame.
A salute to Fiona and colleagues, and to everyone else who is keeping up the good work, explaining, advocating, and adopting excellent science practices.
Last week I (Bob) had my first Registered Report proposal accepted at eNeuro. It’s another collaboration with my wife, Irina, where we will test two popular models of forgetting. The proposal, pre-registration, analysis script, and preliminary data are all on the OSF: https://osf.io/z2uck/. Contrary to popular practice, we developed our proposal for original research, not for a replication. We opted for a registered report because we wanted to set up a fair and transparent test between two models–it seemed to us both that this required setting the terms of the test publicly in advance and gaining initial peer review that our approach is sensible and valid.
Although having the proposal accepted feels like a triumph, I am sooooo anxious about this. I’m anxious because our proposal represents what Irina calls a “Dangerous Experiment”. She came up with this phrase in grad school when she was running an experiment that had the potential to expose much of her previous work as wrong. It was stomach-churning to collect the data. In fact, someone on her faculty even suggested ways she could present her work that would let her avoid doing the experiment. Irina decided that avoidance was not the right strategy for a scientist (yes, she’s amazing), and that she had to white-knuckle through it. In that first experience with a dangerous experiment she was vindicated.
Since then, we often discuss Dangerous Experiments and we push each other to find them and confront them head on. Sometimes they’ve ended in tears (which is why we no longer study habituation1 or use certain “well-established” protocols for Aplysia2). Other times we’ve been vindicated, to our great relief and satisfaction3. Lately the philosopher of science Deborah Mayo has popularized the idea of a severe test as important in moving science forward. I haven’t finished her book, but I suspect Irina and Mayo would get along.
Our experience has convinced me that Registered Reports will typically yield Dangerous Experiments–that this is their strength and also what makes them so terrifying. For Registered Reports, though, the danger is not in shattering the research hypothesis–the danger comes from the stress put upon the strength and mastery of your method. A Registered Report requires you to plan your study very carefully in advance–defining your sample, your exclusions, your research questions, your analyses, and your benchmarks for interpretation. You have to be pretty damn sure you know what you’re doing, because if you fail to anticipate an eventuality then the whole enterprise could collapse. So it’s like stringing up a tightrope and then seeing if you can really walk it. Dangerous, indeed. But the payoff is walking across a chasm towards some epistemic firm ground–that mystical place where legend has it you can move the world.
Putting together our registered report required doing a lot of “pre-data” work to assure ourselves that we had a design and protocol worth executing with fidelity. We simulated outcomes under different models to ensure the analyses we were planning would be sensitive enough to discriminate between them. We developed positive controls that could give us independent assessments of protocol validity. We also expanded our design to include an internal replication to provide an empirical benchmark for data reliability. By mentally stepping through the project and conferring with the reviewers we built a tightrope we *think* we can cross safely. Time will tell.
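For readers curious what that simulation step looks like, here is a generic sketch in Python–not our actual analysis script (that’s on the OSF). The effect sizes, group sizes, and simple two-group comparison are all hypothetical stand-ins; the point is only the workflow of simulating data under rival predictions and checking that the planned analysis can tell them apart.

```python
import random
import statistics

def detects_effect(effect, n, rng):
    """Simulate one experiment: two groups of n observations, true
    standardized difference `effect`, and check whether the 95% CI on
    the mean difference excludes zero (normal approximation)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    b = [rng.gauss(effect, 1.0) for _ in range(n)]
    diff = statistics.mean(b) - statistics.mean(a)
    se = ((statistics.variance(a) + statistics.variance(b)) / n) ** 0.5
    return abs(diff) > 1.96 * se

def detection_rate(effect, n, sims=2000, seed=1):
    """Proportion of simulated experiments that detect the effect."""
    rng = random.Random(seed)
    return sum(detects_effect(effect, n, rng) for _ in range(sims)) / sims

# Hypothetical rival predictions: one model predicts a large difference,
# the other predicts essentially none.
print(detection_rate(1.0, n=12))  # should be reasonably high
print(detection_rate(0.0, n=12))  # should stay near the 5% false-alarm rate
```

If the first rate came out low, the planned design would be too noisy to discriminate the models, and we’d know to fix that before collecting a single data point.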
The whole process reminds me of something I used to do as a kid when playing Hearts: I used to lay down my first 5 plays on the table (face down) and then turn them up one-by-one as the tricks played out. It drove my siblings crazy. Usually I guessed wrong about how play would go and would have to delay the game to pick up my cards and re-think. Every once in a while, though, I would get to smugly turn the cards over in series like the Thomas Crown of playing cards. Registered reports ask for something like this: Are your protocols and standards well-developed enough that you can sequence them and execute them according to plan and still end up exactly where you want to be?
Does the dangerous nature of a registered report support the frequent criticism that pre-registration is a glass prison? Perhaps. If this whole endeavor crashes and burns I’ll probably move closer to that point of view. But I can’t help but feel that this is how strong science must be done–that if you can’t point at the target and then hit it you don’t really know what you’re doing. That’s ok, of course–we’re lost in lots of fields and need exploration, theory building, and the like. Not every study needs to be a registered report. But it does seem to me that Registered Reports are the ideal to aspire to–that we can’t really say an effect is “established” or “textbook” or “predicted by theory” until we can repeatedly call our shots and make them. Or so it seems to me at the moment… check back in 2 months to see what happened.
Oh yeah… if you’re here on this stats blog but curious about the science of forgetting, here’s the study Irina and I are conducting. We have come up with what we think is a very clever test between two long-standing theories of forgetting. Neurobiologists have tended to think of forgetting as a decay process, where entropy (mostly) dissolves the physical traces of the memory. Psychologists, however, argue that forgetting is a retrieval problem, not a storage problem. They contend that memory traces endure (perhaps forever), but become inaccessible due to the addition of new memories.
Irina and I are going to test these theories by tracking what happens when a seemingly forgotten memory is re-kindled, a phenomenon called savings memory. If forgetting is a decay process, then savings should involve re-building the memory trace, and it should thus be mechanistically very similar to encoding. If forgetting is retrieval failure, then savings should just re-activate a dormant memory, and this should be a distinct process relative to encoding. Irina and I will track the transcriptional changes activated as Aplysia experience savings and compare this to Aplysia that have just encoded a new memory. We *should* be able to get some pretty clear insight into the neurobiology of both savings and forgetting.
I genuinely have no idea which model will be better supported by the data we collect… depending on the day I can convince myself either way. As I mentioned above, my anxiety is not over which model is right but over whether our protocol will actually yield a quality test… fingers crossed.
Bonnick K, Bayas K, Belchenko D, et al. Transcriptional changes following long-term sensitization training and in vivo serotonin exposure in Aplysia californica. PLoS One. 2012;7(10):e47378. https://www.ncbi.nlm.nih.gov/pubmed/23056638.
Cyriac A, Holmes G, Lass J, Belchenko D, Calin-Jageman R, Calin-Jageman I. An Aplysia Egr homolog is rapidly and persistently regulated by long-term sensitization training. Neurobiol Learn Mem. 2013;102:43-51. https://www.ncbi.nlm.nih.gov/pubmed/23567107.
I’m in the midst of an unhappy experience serving as a peer reviewer. The situation is still evolving but I thought I’d put up a short post describing (in general terms) what’s happened because I’d be happy to have some advice/input/reactions. Oh yeah, this is a post by Bob (not Geoff).
I am reviewing a paper that initially seemed quite solid. In the first round of review my main suggestion was to add more detail and transparency: to report the exact items used to measure the main construct, the exact filler items used to obscure the purpose of the experiment, any exclusions, etc.
The authors complied, but on reading the more detailed manuscript I found something really bizarre: the items used to measure the main construct changed from study to study, and often items that would seem to be related to the main construct were deemed filler from one study to the next.
Let’s say the main construct was self-esteem (it was not). In the first experiment there were several items used to measure self-esteem, all quite reasonable. But in a footnote giving the filler items I found not only genuine filler (“I like puppies”) but also items clearly related to self-esteem… things as egregious as “I have high self-esteem”. WTF? Then, in the next experiment the authors write that they measured their construct similarly but list different items, including one that had been deemed filler in experiment 1. Double-WTF! And, looking at the filler items listed in a footnote I again find items that would seem to be related to their construct. I also find a scale that seems clearly intended as a manipulation check but which has not been mentioned or analyzed in either version of the manuscript (under-reporting!). The remaining experiments repeat the same story–measurement described as identical, but always with different items and some head-scratching filler items.
There were other problems now detectable with the more complete manuscript. For example, it was revealed (in a footnote) that statistical significance of a key experiment was contingent on removal of a single outlier; something that had not been mentioned before! But the main one that has me upset is what seems to be highly questionable measurement.
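To see how a single outlier can make or break “significance” (with invented numbers, not the authors’ data): one extreme value both drags the mean toward zero and inflates the standard error, so the verdict flips depending entirely on whether that one point is kept.

```python
import statistics

def one_sample_t(data, mu=0.0):
    """t statistic for the mean of `data` against mu."""
    se = statistics.stdev(data) / len(data) ** 0.5
    return (statistics.mean(data) - mu) / se

# Hypothetical scores: eight consistent positive values plus one extreme case.
scores = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.4, -6.0]

t_all = one_sample_t(scores)           # ≈ 0.41: the outlier swamps the effect
t_trimmed = one_sample_t(scores[:-1])  # ≈ 10.8: wildly "significant"

# Two-tailed .05 critical values: 2.306 (df = 8) and 2.365 (df = 7)
print(f"with outlier: t = {t_all:.2f}; without: t = {t_trimmed:.2f}")
```

That’s exactly why an exclusion like this needs to be disclosed up front and, ideally, justified by a pre-specified rule rather than by the result it produces.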
One easy lesson I’ve learned from this is how important it is as a reviewer to push for full and transparent reporting. Without key details on how constructs were measured, what else was measured, what participants were excluded, etc. it would have been impossible to detect deficiencies in the evidence presented.
What has me agitated is what happens now. I sent back my concerns to the editor. If the problems are as severe as I thought (I could be wrong), I expect the paper will be rejected. But what happens next? These authors were clearly willing to submit a less-complete manuscript before. What if they submit the original version elsewhere, the one that makes it impossible to detect the absurdity of their measurement approach? The original manuscript seemed amazing; I have no doubt it could be published somewhere quite good. So has my push for transparency really saved the day, or will it just end up helping the authors better know what they should and shouldn’t include in the manuscript to get it published?
At this point, I don’t know. I’m still in the middle of this. But here are some possible outcomes:
It’s all just a misunderstanding: The authors could reply to my review and clarify that their measurement strategy was consistent and sensible but not correctly represented in the manuscript. That’d be fine; I’d feel much less agitated.
The authors re-analyze the data with consistent measurement and resubmit to the journal, letting the significance chips fall where they may. That’d also be fine. Rooting for this one.
The authors shelve the project. Perhaps the authors will just give up on the manuscript. To my mind this is a terrible outcome–they have three experiments involving hundreds of participants testing an important theory. I’d really like to know what the data says when properly analyzed. The suppression of negative evidence from the literature is the most critical failure of our current scientific norms. I feel like, in some ways, once you submit a paper to review it almost *has* to be published in some way, especially with the warts revealed… wouldn’t that be useful for emptying the file drawer and also deepening how we evaluate each other’s work?
The authors submit elsewhere, reverting to the previous manuscript that elided all the embarrassing details and which gave an impression of presenting very solid evidence for the theory. I suspect this is the most likely outcome. Nothing new to this. I remember one advisor in grad school who said (jokingly) that the first submission is to reveal all the mistakes you need to cover up. I guess the frustrating thing here is how uneven transparent reporting still is. I was one of 4 reviewers for this paper and I was the only one who asked for these additional details. If the authors want to go this route, I think they’ll have an easy time finding a journal that doesn’t push them for the details. How long until we plug those gaps? Why are we still reviewing for or publishing in journals that don’t take transparent reporting seriously?
I suppose I should organize a betting pool. Any predictions out there? What odds would you give for these different outcomes? Also, I’d be happy to hear your comments and/or similar stories.
Last but not least, here are some questions on my mind from this experience:
How much longer before we can consider sketchy practices like this full-out research misconduct? I mean if you are working in this field can you any longer plead ignorance? At this point shouldn’t you clearly know that flexible measurement and exclusions are corrupt research practices? If this situation is as bad as I think it is, does it cross the threshold to actual misconduct? I could forgive those who engaged in this type of work in the past (and I know I did, myself), but at this point I don’t want any colleagues who would be willing to pass off this type of noise-mining as science.
Would under-reporting elsewhere transform a marginal case of research misconduct into a clear case? Even if initially submitting a p-hacked manuscript doesn’t yet qualify as a clear-cut case of research misconduct, would re-submitting it elsewhere after the problems have been pointed out to you count as research misconduct?
Does treating the review process like a confessional exacerbate these problems? My understanding (which some have challenged on Twitter) is that the review process is confidential and that I cannot reveal/publicize knowledge I gained through the review process alone. Based on that, I don’t think I would have any public recourse if the authors were to publish a less-complete manuscript elsewhere. My basis for criticizing it would be my knowledge of which items were and were not considered filler, knowledge I would only have from the review process. So I think my hands would be tied–I wouldn’t be able to notify the editor, write a letter to the editor, post to PubPeer, etc. without breaking the confidentiality of the review. I’m not sure if journal editors are as bound by this, so perhaps the editor at the original journal could say something? I don’t know. This is all still so hypothetical at this point that I don’t plan on worrying about it yet. But if I do eventually see a non-transparent manuscript from these authors in print I’ll have to seriously consider what my obligations and responsibilities are. It would be a terrible shame to have a prominent theory supported and cited in the literature by what I suspect to be nonsense; but I’d have to figure out how to balance that harm against my responsibilities for confidential review.
Ok – had to get that all out of my head. Now back to the sh-tstorm that is the fall 2019 semester. Peace, y’all.
Are you interested in meta-science? In Open Science? If so, check out the inaugural conference of AIMOS, the Association for Interdisciplinary Meta-Research & Open Science. It’s a two-day meeting, on 7 & 8 November, at the University of Melbourne.
There’s an impressive list of confirmed speakers: Click here and scroll down. All the folks I know on that list will definitely be worth hearing.
Here’s how Fiona Fidler and her organising team describe the conference:
What to expect
AIMOS2019 will be a partially unstructured conference. Each of the two days will have a theme, and will start with a series of keynotes or shorter talks we are calling “mini-notes”, followed by a more unstructured part of the day. Check out the draft program here!
We aim for AIMOS2019 to appeal to students and researchers from a range of disciplines with a shared interest in understanding and addressing challenges to replicability, reproducibility and open science. AIMOS2019 will cover a broad range of open science and scientific reform topics, including: pre-registration and Registered Reports; peer review and scientific publishing; using R for analysis; open source experimental programming; meta-research; replicability; improving statistical and scientific inference; diversity in scientific community and practice; and methodological and scientific culture change.
Got questions? You can email the team at email@example.com
Originally, submissions were due by 1 Sep–this Sunday (and beware that Sunday arrives earlier down here, perhaps nearly 24 hours before your Sunday!). But the submission guidelines now state that the organisers will consider submissions made up to conference time, and even during the conference–though please get submissions in early to secure a place in the program. I know they have already received an impressive list of strong proposals, so don’t delay!
There are 55 travel grants of USD400 available for people living far from Melbourne who are willing to take part in a one-day repliCATS workshop on 6 November. For details, go here and scroll down. Participants have found these workshops absorbing, and they contribute mightily to the repliCATS project.
Registration is open now and is not expensive. The link is here.
Launch of AIMOS
There will be a networking event on Thurs 7 November that will include the formal launch of the Association for Interdisciplinary Meta-Research & Open Science.
Estimation workshop proposal
Bob and I have proposed a workshop on estimation. If accepted, it will include a first glimpse of the R goodies that Bob is developing with the aim of moving ITNS into the R age. Exciting!
November is a great time to visit Melbourne. I hope to see you then!
Exciting news: eNeuro is adopting a new policy promoting estimation. The rollout includes:
An accompanying commentary from us that explains estimation using neuroscience examples and provides concrete examples of how it can help nudge us toward better inference.
A research paper from the editor which he revised to focus on estimation, a kind of ‘eat your own dog food’ test case where he found it was easy and useful enough to go forward with the new policy.
An invitation to discuss the new policy on the eNeuro blog
I’ve got to admit that I am beyond excited… when I prepped a talk about estimation for neuroscientists last fall I never dreamed it could spur change of this magnitude. I’m excited and hopeful that this test case at eNeuro will really help turn the tide in how we report and think about inference in neuroscience. Word is that JNeurosci will be watching and considering if they want to move in the same direction. Fingers crossed.
I knew good things were happening at the University of Bologna this (northern) summer. Now I know the details. The brochure is here, and this is part of the title page:
What do they mean by ‘Open Statistics’? As I understand it, they will be discussing statistical methods needed in this age of Open Science. Here’s the line-up of speakers for the first day, Sep 30, 2019:
I know more than half of the speakers, and all of these will be well worth hearing. Bob will, no doubt, be waving the flag for estimation, replication, Open Science and other good things.
The Symposium will be held at the Cesena campus of the University of Bologna. Cesena may not be the most exciting place to visit in Italy, but I’m told the nearby beaches and coastal towns are wonderful.
Would you believe, attendance and participation is free. You are warmly invited to sign up here. Below is a bit more info. Enjoy!
This week’s news email from the APS includes this interesting item:
Aha, I thought, they are giving publicity to Tamara and Bob’s wonderful workshop at the APS Convention last month. Great! But the link didn’t go to the materials for that workshop (which you can find here–well worth a look). It went to the six videos of the workshop I gave at the APS Convention in San Francisco, back in 2014.
Five years ago! That was two years before ITNS was published, and only months after the Center for Open Science was established. There have been so many wonderful advances since that 2014 workshop, including, for example
Massive development of the OSF and expansion of its use.
The launch and development of the SIPS, and similar organisations around the world.
The launch and massive take-up of the TOP guidelines, currently boasting 5,000+ signatories, and being used by 1,000+ journals across numerous disciplines. (This may be one of the strongest indications that Open Science is becoming closer to mainstream.)
So, five years on, how do my videos hold up? I’m not the one to attempt an objective assessment, but I feel that they remain relevant, and still provide a reasonable introduction to the new statistics and Open Science.
It’s interesting to see that the title of the videos is The New Statistics: Estimation and Research Integrity. An early slide in the first video (shown in the pic above) notes that ‘Open Science’ means pretty much the same as ‘Research Integrity’. The term ‘Open Science’ was gaining currency back in 2014, but had not yet achieved full dominance.
My tutorial article published in Psychological Science just before, in Jan 2014, uses the term ‘research integrity’ rather than Open Science. (If reading it today, do a mental ‘search and replace’, so you read it as discussing Open Science.)
Ah, so many wonderful Open Science developments in five years. Bring on the next five!
Update 8 June. Some minor tweaks. Addition of the full reference for two papers mentioned.
Of course I would say that, wouldn’t I?! It’s the basis of ITNS and a new-statistics approach. But the latest issue of SERJ adds a little evidence that, maybe, supports my statement above. The article is titled Conceptual Knowledge of Confidence Intervals in Psychology Undergraduate and Graduate Students (Crooks et al., 2019) and is on open access here.
Reading the paper prompted mixed feelings:
I’m totally delighted to see this review and some new empirical work on the vital topic of how students, at least, understand CIs, and how teaching might be improved.
Most of the previous research that is discussed is quite old, much of it 10 years or more, so dating from well before Open Science and the latest moves to ditch statistical significance. We need more research on statistical cognition and we need it now! Especially on estimation and CIs!
Many of the papers from my research group are mentioned, but one fairly recent one was missed. Read about that one here (Kalinowski et al., 2018).
The reported study is welcome, but the authors acknowledge that it is small (N=21 undergraduates, 19 graduate students). The undergrads had all experienced the same statistics course, and the grad students ditto.
Therefore any conclusion can be only tentative. The authors concluded that, alas, general understanding of CIs was mediocre. In most respects, grad students did a bit better than undergrads (Phew!). Mentioning estimation tended to go with better appreciation of CIs; there was a hint that, perhaps, mentioning NHST went with less good understanding of CIs. But, as I mentioned, the samples were small and hardly representative.
The authors suggest, therefore, that teaching CIs along with an estimation approach is likely to be more successful than via NHST. I totally agree, even if only a little extra evidence is reported in this paper.
I’m delighted there are researchers studying CIs, estimation, teaching, and understanding. We do so need more and better evidence in this space. I look forward to some future version of ITNS being more strongly and completely evidence-based than we have been able to make the first ITNS. I’m confident that our approach will be strongly supported, no doubt with refinements and improvements. Bring that on! Meanwhile, enjoy using ITNS.
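To make the estimation approach concrete for readers meeting it here for the first time, here’s a minimal sketch in Python of the kind of report it favors–a point estimate with a 95% CI rather than a bare significance verdict. The groups, scores, and critical t value are all invented for illustration.

```python
import statistics

# Invented quiz scores from two hypothetical teaching conditions.
estimation_group = [72, 68, 75, 70, 74, 69, 73, 71]
nhst_group = [65, 70, 66, 68, 64, 69, 67, 66]

n = len(estimation_group)  # equal group sizes
diff = statistics.mean(estimation_group) - statistics.mean(nhst_group)
se = ((statistics.variance(estimation_group)
       + statistics.variance(nhst_group)) / n) ** 0.5
t_crit = 2.145  # two-tailed 95% critical t, df = 14
low, high = diff - t_crit * se, diff + t_crit * se

# Report the size of the effect and its uncertainty, not just a verdict.
print(f"Mean difference = {diff:.1f}, 95% CI [{low:.1f}, {high:.1f}]")
# → Mean difference = 4.6, 95% CI [2.2, 7.0]
```

The interval answers “how big is the effect, and how precisely have we pinned it down?”–exactly the “to what extent” framing that an NHST-only report leaves out.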
Crooks, N. M., Bartel, A. N., & Alibali, M. W. (2019). Conceptual knowledge of confidence intervals in psychology undergraduate and graduate students. Statistics Education Research Journal, 18, 46-62.
Kalinowski, P., Lai, J., & Cumming, G. (2018). A cross-sectional analysis of students’ intuitions when interpreting CIs. Frontiers in Psychology, 16 February. doi: 10.3389/fpsyg.2018.00112
For years I’ve been working on changing my thinking–even when just musing about nothing in particular–from “I wonder whether…” to “I wonder to what extent…”. It has taken a while, but now I usually do find myself thinking in terms of “How big is the effect of…?” rather than “Is there an effect of…?”
I worked on making that change, despite decades of immersion in NHST, because I’ve long felt that overcoming dichotomous thinking has to be at the heart of improving how we do statistics. No more mere black-white, sig-nonsig categorisation of findings!
As we put it in ITNS:
“Why does dichotomous thinking persist? One reason may be an inherent preference for certainty. Evolutionary biologist Richard Dawkins (2004) argues that humans often seek the reassurance of an either-or classification. He calls this ‘the tyranny of the discontinuous mind’. Computer scientist and philosopher Kees van Deemter (2010) refers to the ‘false clarity’ of a definite decision or classification that humans clutch at, even when the situation is uncertain. To adopt the new statistics we may need to overcome an inbuilt preference for certainty, but our reward could be a better appreciation of the uncertainty inherent in our data.” (pp. 8-9)
Now there’s an article (Fisher & Keil, 2018) reporting evidence that such a binary bias may indeed be widespread:
It’s behind a paywall, unfortch, but here’s the abstract:
One of the mind’s most fundamental tasks is interpreting incoming data and weighing the value of new evidence. Across a wide variety of contexts, we show that when summarizing evidence, people exhibit a binary bias: a tendency to impose categorical distinctions on continuous data. Evidence is compressed into discrete bins, and the difference between categories forms the summary judgment. The binary bias distorts belief formation—such that when people aggregate conflicting scientific reports, they attend to valence and inaccurately weight the extremity of the evidence. The same effect occurs when people interpret popular forms of data visualization, and it cannot be explained by other statistical features of the stimuli. This effect is not confined to explicit statistical estimates; it also influences how people use data to make health, financial, and public-policy decisions. These studies (N = 1,851) support a new framework for understanding information integration across a wide variety of contexts.
Fisher and Keil reported multiple studies using a variety of tasks. All participants were from the U.S. and recruited via Mechanical Turk. Therefore, an unknown proportion, but perhaps a low proportion, would have had some familiarity with NHST, so the evidence for a binary bias probably does not reflect an influence of NHST–in accord with the authors’ claim that their results support a binary bias as a general human characteristic.
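The binary bias the abstract describes is easy to sketch with toy numbers: compress each study’s continuous effect estimate down to a mere sign, and the “count the signs” summary can flip the verdict that proper averaging gives. (These values are invented purely for illustration.)

```python
import statistics

# Hypothetical effect estimates from seven studies, all on the same scale:
# three strongly positive, four weakly negative.
effects = [0.9, 0.8, 0.7, -0.1, -0.05, -0.2, -0.15]

# Aggregating the continuous evidence: the average is clearly positive.
mean_effect = statistics.mean(effects)  # ≈ 0.27

# Binary-bias aggregation: bin each study as positive or negative and count.
positives = sum(e > 0 for e in effects)  # 3
negatives = sum(e < 0 for e in effects)  # 4

# Attending only to valence, ignoring extremity, reverses the conclusion.
print(f"mean = {mean_effect:.2f}; tally = {positives} for, {negatives} against")
```

The tally says the evidence leans “against”, while the averaged evidence says the opposite–which is exactly the distortion Fisher and Keil report when people aggregate conflicting findings.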
My new MOOC "Improving Your Statistical Questions" has launched! https://t.co/SmraqFyQnk. There are 15 videos and 13 assignments, all freely available. I hope you'll like it! An overview of the contents and a thank you to all who helped in this blog post: https://t.co/nuzPdKzZyS
I have a vague memory of a study that looked at brain correlates of PTSD, Autism, or ADHD, testing findings from small studies against a large dataset. Found few held up. Thought I had it bookmarked, now can't find it.
We're excited to conduct the 'Reproducibility for Everyone' PDW at #SfN19 & are curious about your thoughts on the topic & what you'd like to learn at the event. Please take this pre-survey AND plan to attend the workshop (Sat | 9 - 11 AM)! Please RT!
Linking brain function to genetics is difficult: a failed replication of a prominent linkage and a wise post-mortem suggesting better ways forward. Kudos to @SfNJournals for publishing the strong contribution.