However, as it is now often practiced, one can make a good case
that computing is the last refuge of the scientific scoundrel.
Some backstory first
A very interesting editorial appeared recently in Nature magazine. Strikingly, it picks up the same strands of argument that were considered in this blog: data availability in climate science and genomics. Arising from this post at Bishop Hill, cryptic climate blogger Eli Rabett and encyclopedia activist WM Connolley claimed that the Nature magazine of yore (c. 1990) required only crystallography and nucleic acid sequence data to be submitted as a condition for publication (which implied that all other kinds of data were exempt).
We showed this to be wrong (here and here). In those days, Nature placed no conditions on publication, but instead expected scientists to adhere to a gentleman’s code of scientific conduct. Post-1996, it decided, like most other scientific journals, to make full data availability a formal requirement for publication.
The present data policy at Nature reads:
… a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications in material transfer agreements.
Did the above mean that everything was to be painfully worked out just to be gifted away, to be audited and dissected? Eli Rabett pursued his own inquiries at Nature. Writing to editor Philip Campbell, the blogger wondered: when Nature says ‘make protocols promptly available’, does it mean ‘hand over everything’, as with the case of software code used?
I am also interested in whether Nature considers algorithmic descriptions of protocols sufficient, or, as in the case of software, a complete delivery.
Interestingly, Campbell’s answer addressed something else:
As for software, the principle we adopt is that the code need only be supplied when a new program is the kernel of the paper’s advance, and otherwise we require the algorithm to be made available.
This distracted Eli Rabett, who forgot his original question altogether. “A-ha! See, you don’t have to give code” (something he had assumed was to be given).
At least something doesn’t have to be given.
A question of code
The Nature editorial carried the same idea about authors of scientific papers making their code available:
Nature does not require authors to make code available, but we do expect a description detailed enough to allow others to write their own code to do a similar analysis.
The above example serves to illustrate how partisan advocacy positions can cause long-term damage in science. In some quarters, work proceeds tirelessly to obscure and befuddle simple issues. The editorial raises a number of unsettling questions that such efforts seek to bury. Journals try to frame policy to accommodate requirements and developments in science, but apologists and obscurantists seek to hide behind journal policy to avoid providing data.
So, is the Nature position sustainable as its journal policy?
A popularly held notion mistakes publication for science; in other words – it is science in alchemy mode. ‘I am a scientist and I synthesized A from B. I don’t need to describe how, in detail. If you can see that A could have been synthesized from B without needing explanations, that would prove you are a scientist. If you are not a scientist, why would you need to see my methods anyway?’
It is easy to see why such parochialism and close-mindedness were jettisoned. Good science does not waste time describing every known step or in pedantry. Poor science tries to hide its flaws in stunted description, masquerading as the terseness of scholarly parlance. Curiously, it is often the more spectacular results that are accompanied by this technique. As a result, rationalizations for not providing data or method take on the same form – ‘my descriptions may be sketchy but you cannot replicate my experiment, because you are just not good enough to understand the science, or follow the same trail’.
If we revisit the case of Duke University genomics researcher Anil Potti, this effect was clearly visible (a brief introduction is here). Biostatisticians Baggerly and Coombes could not replicate Potti et al’s findings from microarray experiments reported in their Nature Medicine paper. Potti et al’s response, predictably, contained the defense: ‘You did not do what we did’.
Unfortunately, they have not followed our methods in several crucial contexts and have made unjustified conclusions in others, and as a result their interpretation of our process is flawed.
Because Coombes et al did not follow these methods precisely and excluded cell lines and experiments with truncated -log concentrations, they have made assumptions inconsistent with our procedures.
Behind the scenes, web pages changed, data files changed versions and errors were acknowledged. Eventually, the Nature Medicine paper was retracted.
The same thing repeated itself in greater vehemence with another paper. Dressman et al published results on microarray research on cancer, in the Journal of Clinical Oncology. Anil Potti and Joseph Nevins were co-authors. The paper claimed to have developed a method of finding out which patients with cancer would not respond to certain drugs. Baggerly et al reported that Dressman et al’s results arose from ‘run batch effects’ – i.e., results that varied solely due to parts of the experiment being done on different occasions.
To “reproduce” means to repeat, following the methods outlined in an original report. In their correspondence, Baggerly et al conclude that they are unable to reproduce the results reported in our study […]. This is an erroneous claim since in fact they did not repeat our methods.
Beyond the specific issues addressed above, we believe it is incumbent on those who question the accuracy and reproducibility of scientific studies, and thus the value of these studies, to base their accusations with the same level of rigor that they claim to address.
To reproduce means to repeat, using the same methods of analysis as reported. It does not mean to attempt to achieve the same goal of the study but with different methods. …
Despite the source code for our method of analysis being made publicly available, Baggerly et al did not repeat our methods and thus cannot comment on the reproducibility of our work.
Is this a correct understanding of scientific experiment? If a method claims to have uncovered a fundamental facet of reality, should it not be robust enough to be revealed by other methods as well, which follow the principle but differ slightly? Obviously, Potti and colleagues are wandering off into the deep end here. The points raised here are unprecedented and go well beyond the specifics of their particular case – not only do the authors say: ‘you did not do what we did and therefore you are wrong’, they go on to say: ‘you have to do exactly what we did, to be right’. In addition they attempt to shift the burden of proof from a paper’s authors to those who critique it.
The Dressman et al authors are roundly criticized for their approach by statisticians Vincent Carey and Victoria Stodden. They note that a significant portion of the Dressman et al results were nonreconstructible – i.e., could not be replicated even with the original data and methods, because of flaws in the data. This was only exposed when attempts were made to repeat their experiments. This defeats the authors’ comments about the rigor of their critics’ accusations. Carey and Stodden take issue with the claim that only the precise original methods can produce true results:
The rhetoric – that an investigation of reproducibility just employ “the precise methods used in the study being criticized” – is strong and introduces important obligations for primary authors. Specifically, if checks on reproducibility are to be scientifically feasible, authors must make it possible for independent scientists to somehow execute “the precise methods used” to generate the primary conclusions.
Arising from their own analysis, they agree firmly with Baggerly et al’s observations of ‘batch effects’ confounding the results. They conclude, making crucial distinctions between experiment reconstruction and reproduction:
The distinction between nonreconstructible and nonreproducible findings is worth making. Reconstructibility of an analysis is a condition that can be checked computationally, concerning data resources and availability of algorithms, tuning parameter settings, random number generator states, and suitable computing environments. Reproducibility of an analysis is a more complex and scientifically more compelling condition that is only met when scientific assertions derived from the analysis are found to be at least approximately correct when checked under independently established conditions.
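The computational side of that distinction can be made concrete. What follows is only a minimal sketch, not anything taken from Carey and Stodden: the toy bootstrap analysis, the function names (run_analysis, make_record, reconstructs) and the record layout are all assumptions made for illustration. It records the data fingerprint, tuning parameters, random seed and computing environment of a run, then checks whether the run can be re-executed to the same answer.

```python
import hashlib
import json
import platform
import random
import sys


def data_fingerprint(data):
    """Hash the input data so a silently changed file can be detected."""
    return hashlib.sha256(json.dumps(data).encode()).hexdigest()


def run_analysis(data, n_resamples, seed):
    """Toy bootstrap estimate of the mean; a stand-in for a real analysis."""
    rng = random.Random(seed)  # private generator: its state is fixed by the seed
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(data) for _ in data]
        means.append(sum(sample) / len(sample))
    return sum(means) / len(means)


def make_record(data, n_resamples, seed):
    """Bundle what a third party would need in order to re-execute the run."""
    return {
        "data_sha256": data_fingerprint(data),
        "parameters": {"n_resamples": n_resamples, "seed": seed},
        "environment": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "result": run_analysis(data, n_resamples, seed),
    }


def reconstructs(data, record):
    """True if the recorded settings, applied to this data, give the recorded result."""
    if data_fingerprint(data) != record["data_sha256"]:
        return False  # not the data the authors actually used
    params = record["parameters"]
    rerun = run_analysis(data, params["n_resamples"], params["seed"])
    return rerun == record["result"]


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
record = make_record(data, n_resamples=200, seed=12345)

changed_data = data[:-1] + [4.3]  # one value altered after the fact
print(reconstructs(data, record))
print(reconstructs(changed_data, record))
```

A passing check of this kind establishes reconstructibility only. It says nothing about whether the scientific claim survives under independently established conditions, which is the harder question of reproducibility.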
Seen in this light, it is clear that an issue of ‘we cannot do what you say you did’ will morph rapidly into a ‘do your own methods do what you say they do?’ Intractable disputes arise even with both author and critic being expert, and with much of the data openly available. Full availability of data, algorithm and computer code is perhaps the only way to address both questions.
Nature magazine’s approach of not asking for software code as a matter of routine, while requiring everything else, therefore becomes difficult to justify.
Results of experiments can hinge on software alone, just as they can on the other components of scientific research. The editorial recounts an interesting example: yet another instance of bioinformatics findings that depended on the version number of the commercially available software employed by the authors.
The most bizarre example of software-dependence of results, however, comes from Hothorn and Leisch’s recent paper ‘Case studies in reproducibility’ in the journal Briefings in Bioinformatics. The authors recount the example of Pollet and Nettle (2009) reaching the mind-boggling conclusion that wealthy men give women more orgasms. Their results remained fully reproducible – in the usual sense:
Pollet and Nettle very carefully describe the data and the methods applied and their analysis meets the state-of-the-art for statistical analyzes of such a survey. Since the data are publicly[sic] available, it should be easy to fit the model and derive the same conclusions on your own computer. It is, in fact, possible to do so using the same software that was used by the authors. So, in this sense, this article is fully reproducible.
What then was the problem? It turned out that the results were software-specific.
However, one fails performing the same analysis in R Core Development Team. It turns out that Pollet and Nettle were tricked by a rather unfortunate and subtle default option when computing AICs for their proportional odds model in SPSS.
Certainly this type of problem is not confined to one branch of science. Many a time, the description of a method conveys one thing while the underlying code does something else (of which even the authors are unaware). The results in turn seem to substantiate emerging, untested hypotheses and, as a result, the blind spot goes unchecked. Veering to climate science and the touchstone of code-related issues in scientific reproducibility, the McIntyre and McKitrick papers, Hothorn and Leisch draw obvious conclusions:
While a scientific debate on the relationship of men’s wealth and women’s orgasm frequency might be interesting only for a smaller group of specialists there is no doubt that the scientific evidence of global warming has enormous political, social and economic implications. In both cases, there would have been no hope for other, independent, researchers of detecting (potential) problems in the statistical analyzes and, therefore, conclusions, without access to the data.
Acknowledging the many subtle choices that have to be made and that never appear in a ‘Methods’ section in papers, McIntyre and McKitrick go as far as printing the main steps of their analysis in the paper (as R code).
Certainly when science becomes data- and computing-intensive, the issue of how to reproduce an experiment’s results is inextricably linked with its own repeatability or reconstructibility. Papers may fall into any combination of repeatability and reproducibility, with varying degrees of both, and yet be wrong. As Hothorn and Leisch write:
So, in principle, the same issues as discussed above arise here: (i) Data need to be publically[sic] available for reinspection and (ii) the complete source code of the analysis is the only valid reference when it comes to replication of a specific analysis
Why the reluctance?
What reasons can there be for scientists not being willing to share their software code? As always, the answers turn out to be far less exotic than one might expect. In 2009, Nature magazine devoted an entire issue to the question of data sharing. Post-Climategate, it briefly addressed issues of code. Computer engineer Nick Barnes opined in a Nature column on the software angle and why scientists are generally reluctant. He sympathized with scientists – they feel that their code is very “raw”, “awkward” and therefore hold “misplaced concerns about quality”. Other more routine excuses for not releasing code, we are informed, are that it is ‘not common practice’, will ‘result in requests for technical support’, is ‘intellectual property’ and that ‘it is too much work’.
In another piece, journalist Zeeya Merali took a less patronizing look at the problem. Professional computer programmers were less sanguine about what was revealed in the Climategate code.
As a general rule, researchers do not test or document their programs rigorously, and they rarely release their codes, making it almost impossible to reproduce and verify published results generated by scientific software, say computer scientists. At best, poorly written programs cause researchers such as Harry to waste valuable time and energy. But the coding problems can sometimes cause substantial harm, and have forced some scientists to retract papers.
While Climategate and HARRY_READ_ME focused attention on the problem, this was by no means unknown before. Merali reported results from an online survey by computer scientist Greg Wilson conducted in 2008. Wilson noted that most scientists taught themselves to code and had no idea ‘how bad’ their own work was.
As a result, codes may be riddled with tiny errors that do not cause the program to break down, but may drastically change the scientific results that it spits out. One such error tripped up a structural-biology group led by Geoffrey Chang of the Scripps Research Institute in La Jolla, California. In 2006, the team realized that a computer program supplied by another lab had flipped a minus sign, which in turn reversed two columns of input data, causing protein crystal structures that the group had derived to be inverted.
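Why would such an error not announce itself? The following is a toy illustration only, not the program involved in the Chang case: the four “atoms” are made-up coordinates, the function names are hypothetical, and “inverted” is read here as mirror-imaged, which is an assumption. Negating one coordinate leaves every interatomic distance untouched, so any check based on distances still passes, while the handedness of the structure (the sign of a tetrahedron’s volume) is reversed.

```python
import itertools
import math


def pairwise_distances(points):
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]


def signed_volume(a, b, c, d):
    """Scalar triple product of the edges from a: six times the signed
    volume of tetrahedron abcd. Its sign encodes handedness."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))


def flip_sign(points, axis=2):
    """Negate one coordinate of every point: the hypothetical sign bug."""
    return [tuple(-x if i == axis else x for i, x in enumerate(p))
            for p in points]


# Made-up coordinates for four "atoms"; not real structural data.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
mirrored = flip_sign(atoms)

distances_agree = all(
    math.isclose(d1, d2)
    for d1, d2 in zip(pairwise_distances(atoms), pairwise_distances(mirrored))
)
print("distances agree:", distances_agree)
print("handedness before:", signed_volume(*atoms))
print("handedness after: ", signed_volume(*mirrored))
```

Nothing crashes and every distance-based sanity check agrees; only a test that looks specifically at handedness would notice. That is the kind of tiny error that does not break a program but changes its scientific output.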
Geoffrey Chang’s story was widely reported in 2006. By the time the code error was detected, his paper in Science on a protein structure had accumulated 300+ citations, influenced grant applications, caused contrary papers to be rejected, and resulted in drug development work. Science magazine reported scientist Douglas Rees as saying that Chang was a hard-working scientist with good data, but the “faulty software threw everything off”. Chang’s group retracted five papers in prominent science journals.
Interestingly enough, Victoria Stodden reports in her blog that she and Mark Gerstein wrote a letter to Nature, responding to the Nick Barnes and Zeeya Merali articles, voicing some disagreements and suggestions. They felt that journals could help tighten the slack:
However, we disagree with an implicit assertion, that the computer codes are a component separate from the actual publication of scientific findings, often neglected in preference to the manuscript text in the race to publish. More and more, the key research results in papers are not fully contained within the small amount of manuscript text allotted to them. That is, the crucial aspects of many Nature papers are often sophisticated computer codes, and these cannot be separated from the prose narrative communicating the results of computational science. If the computer code associated with a manuscript were laid out according to accepted software standards, made openly available, and looked over as thoroughly by the journal as the text in the figure legends, many of the issues alluded to in the two pieces would simply disappear overnight.
We propose that high-quality journals such as Nature not only have editors and reviewers that focus on the prose of a manuscript but also “computational editors” that look over computer codes and verify results.
Nature decided not to publish it. It is now easy to see why.
Small sparks about scientific code can set off major rows. In a more recent example, the Antarctic researcher Eric Steig wrote in a comment to Nick Barnes that he faced problems with the code of Ryan O’Donnell and colleagues’ Journal of Climate paper. Irked, O’Donnell wrote back that he was surprised Steig hadn’t taken time to run their R code as a reviewer of their paper, a fact which had remained unknown up to that point. The ensuing conflagration is now well-known.
In the end, software code is undoubtedly an area where errors, inadvertent or systemic, can lurk and impact significantly on results, as even the meager examples above show, again and again. In his paper on reproducible research in 2006, Randall LeVeque wrote in the journal Proceedings of the International Congress of Mathematicians:
Within the world of science, computation is now rightly seen as a third vertex of a triangle complementing experiment and theory. However, as it is now often practiced, one can make a good case that computing is the last refuge of the scientific scoundrel. Of course not all computational scientists are scoundrels, any more than all patriots are, but those inclined to be sloppy in their work currently find themselves too much at home in the computational sciences.
However, LeVeque was perhaps a bit naive in expecting only disciplines with significant computing to attempt getting away with poor description:
Where else in science can one get away with publishing observations that are claimed to prove a theory or illustrate the success of a technique without having to give a careful description of the methods used, in sufficient detail that others can attempt to repeat the experiment? In other branches of science it is not only expected that publications contain such details, it is also standard practice for other labs to attempt to repeat important experiments soon after they are published.
In an ideal world, authors would make their methods, including software code, available along with their data. But that doesn’t happen in the real world. ‘Sharing data and code’ for the benefit of ‘scientific progress’ may be driving data repository efforts (such as DataONE), but hypothesis-driven research generates data and code specific to the question being asked. Only the primary researchers possess such data to begin with. As the “Rome” meeting of researchers, journal editors and attorneys wrote in their Nature article laying out their recommendations (Post-publication sharing of data and tools):
A strong message from Rome was that funding organizations, journals and researchers need to develop coordinated policies and actions on sharing issues.
When it comes to compliance, journals and funding agencies have the most important role in enforcement and should clearly state their distribution and data-deposition policies, the consequences of non-compliance, and consistently enforce their policy.
Modern-day science is mired in rules and investigative committees. What’s to be done naturally in science – showing others what you did – becomes a chore under a regime. However, rightly or otherwise, scientific journals have been drafted into making authors comply. Consequently it is inevitable that journal policies become battlegrounds where such issues are fought out.
N.B. This story appears as a guest post at WUWT.