Replication is something of a litmus test for scientific truth, and cancer biologists at the Center for Open Science wanted to see just how many of cancer’s most influential experiments stood up to it. So, for nearly a decade, they worked their way, step by step, through 50 experiments from 23 studies toward an answer. But like cancer research writ large, what they found is complicated.
In two new studies published Tuesday in eLife, the center found signs of trouble: 59% of the experiments couldn’t be replicated. Experiments that were replicable had effect sizes 85% smaller on average than those in the original studies, suggesting the studies’ conclusions may be far weaker than first thought.
But drawing firm judgments from these findings is tricky.
“Sometimes, it’s just really hard. We do stuff in animals, not humans. Sometimes we’re going to be wrong, and that’s OK,” said Tim Errington, a cancer biologist at the Center for Open Science, a nonprofit dedicated to improving scientific research. “But maybe we’re also tricking ourselves.”
The trouble, Errington said, is that science steams ahead, and doesn’t always pause to parse what’s a tantalizing result worth pursuing and what’s a lucky fluke. Redoing experiments and validating conclusions might reveal which studies are onto something real. But replication is hard, imperfect work, and the questions about reproducibility in cancer research extend to the project itself.
“How reproducible were their experiments? That would be a question,” said Atul Butte, a computational health scientist at the University of California, San Francisco, who was not involved in the effort, but whose research was replicated by the project. “Their heart is in the right place. I’m a big fan of reproducibility. I’m just not a huge fan of how they did it.”
Butte pointed to alterations in experimental protocols during the replications, which could influence the results. The only definitive conclusion that scientists seem to agree on from the project is that making sure biology research findings are ironclad is hard, “even very hard,” Errington said.
The project began in 2013, with the researchers selecting 53 papers published from 2010 to 2012 that had garnered a high number of citations in cancer biology. There were 193 experiments from those papers that the team hoped to replicate, and they started to reconstruct each step of each experiment from the methods sections of the papers. That was where the first problem arose.
Lab work is a bit like baking. Without a clear recipe, it’s hard to know exactly what to do, and Errington found science is rife with incomplete experimental protocols. For example, did “biweekly” mean a drug was to be administered every two weeks or twice a week? “There were tons of experiments with next to no details,” Errington said.
These details can make or break an experiment, said Kornelia Polyak, a cancer biologist at Harvard Medical School and Dana-Farber Cancer Institute who was not involved with the project. She once tried to replicate a procedure to purify breast cancer cells with a collaborator, Mina Bissell at UC Berkeley, but the two simply could not get the experiment to work.
“We thought we were doing the same thing, and we could not get the same result. It was a very frustrating experience,” Polyak said. “So, I sent my postdoc to her lab and I said, ‘go there and do it together.’ It turns out it came down to minor details like how fast you’re stirring a flask.”
The Center for Open Science reached out to the original investigators for every study they tried to replicate, hoping to fill in any gaps and to get raw data and input on how to redo the experiments. Sometimes that worked, Errington said, but often labs just couldn’t remember how they did the work. “They couldn’t find their own stuff,” he said. “They would spend time hunting down people who did the experiments but had since left the lab.” This sometimes forced the team to abandon a replication, whittling those 193 experiments down to just 50.
About a third of the time, Errington said, scientists either weren’t helpful in providing additional details or data or just never responded.
Looking back at the project, Errington said it often felt like a series of miscommunications, missed emails, and long, wild goose chases for data. “It’s been exhausting. We never anticipated it would take this long. It took a lot more effort than we thought it would.”
That was the case for an experiment conducted by Butte and his colleagues at UCSF, which the project tried to replicate. Fraser Tan, a scientist working on the replications, emailed Butte six times for help replicating it. Butte forwarded one of those emails to a co-author, but it ultimately got lost in the shuffle of other work.
“To be honest with you, those looked like spam emails. I get hundreds of these a day. I never knew how important that protocol email actually was,” said Butte, who missed an email from eLife asking him to review the replication during his move from Stanford University to UCSF. “I never saw the protocol they proposed to reproduce our work until after all the work was done.”
It’s a classic piled-under-emails problem that can happen to anyone. It’s not that people don’t want to help, but life is messy. With so many other pressing problems that need attention, things can just fall off the radar. When it comes to scientific research, though, that might mean a complete picture of how experiments were done doesn’t get fully communicated, making it harder for research to proceed.
Ultimately, Errington’s team was able to reproduce Butte’s experiment, and, as was the case with most of the replications, found a smaller effect size. But as with many of the replication experiments, the team had to change some of the methods, including a statistical method used to analyze the data. When the replication paper came out, Butte felt blindsided by the changes.
“They chose an additional statistical test that we did not do,” Butte said. “An independent statistician, Robert Tibshirani, one of the best in the world, commented, saying their process was incorrect. I chased down every credential of every author [on the reproduction] and there was not a single biostatistician on their team,” he added. “Is this reproducibility?”
Independent reviewers approved any modifications to the protocols before they were carried out, Errington said. The team also consulted with independent quantitative scientists on any statistical methods through the journal eLife’s peer review process. Still, he acknowledged it’s possible that the modifications may have altered the replications’ results.
“Human biology is very hard, and we’re humans doing it. We’re not perfect, and it’s really tricky,” he said. “None of these replications invalidate or validate the original science. Maybe the original study is wrong — a false positive or false signal. The reverse may be true, too, and the replication is wrong. More than likely, they’re both true, and there’s something mundane about how we did the experiment that’s causing the difference.”
Butte agreed, adding that procedural replication, like the kind attempted by the Center for Open Science, is important. And partly thanks to the Center for Open Science’s efforts, academic journals have made strides to prevent issues in replication from occurring again, Butte said. Because scientific articles are published in online databases, publishers like Science and Nature now allow investigators to include more detailed methods and data in long supplementary files, addressing a longtime limitation in reproducibility research. Recently, the American Association for Cancer Research announced that methods sections will no longer count towards article word lengths, so researchers can describe their protocols in depth.
Publications are also trying to create more opportunities for scientists who are interested in reproducing experiments, since replication studies are typically harder to publish in journals. AACR recently launched a new open-access journal that will consider replication study submissions. “[Replication] won’t make a career,” Errington said. “It’s not the flashy science that people want, not a positive result, because they’re redoing something. So we need to figure out how to balance that as a culture.”
“There are a lot of changes since five years ago. I think you’d have to give [the Center for Open Science] credit for that,” Butte said. “There are a lot of positives here.”
But, he added, it’s not everything. Rote, perfectly identical step-by-step replication can only tell you if one experiment can be done again, not whether the original conclusions are truly robust, Butte said. Only investigating the same idea using several very different experiments can tell you that. “You do the exact same experiment to get the same answer,” he said. “But these are all models anyway. We use rats and mice but, to be honest with you, we don’t care about diabetes in rats or mice. So what if you get the same answer twice?”
Instead, Butte said it’d be better to have 100 different scientists testing the same idea with 100 different models — from primates to cells in Petri dishes — and see what they can agree on. “I want to see the 60% that’s in common from all our experiments, right?” he said. “That’s the real reproducibility we should be chasing.”