Data availability and consequences in cancer and climate science

Phil Jones

Last time, I examined the issue of data availability in climate science in the context of Phil Jones' paper on the urban heat island effect in Nature. The case of the Jones paper is simple: the data supporting the conclusions of this important paper are not available, and there are serious doubts about whether such data even existed at the time the paper was written. Jones, the first author, has nevertheless categorically stated that he does not intend to correct the situation or address it in any fashion.

It is notable that Nature magazine’s editorial policy states:

After publication, readers who encounter refusal by the authors to comply with these policies should contact the chief editor of the journal (or the chief biology/chief physical sciences editors in the case of Nature). In cases where editors are unable to resolve a complaint, the journal may refer the matter to the authors’ funding institution and/or publish a formal statement of correction, attached online to the publication, stating that readers have been unable to obtain necessary materials to replicate the findings.

In the present situation, after many years and attempts by several parties, Nature has reported a direct admission from the paper's author Phil Jones that making the data available is 'impossible'. In some ways this is quite unprecedented. A request that Nature publish a formal statement of correction therefore seems logical.

It has to be pointed out, perhaps much to the chagrin of climate science activists, that reproducibility of published studies is an acute concern in science.

Cancer research and the case of Anil Potti and Duke University

Anil Potti

The case of cancer researcher Anil Potti at Duke University illustrates the consequences that issues of data availability can have. In 2006, Potti published results of using gene microarray data to identify 'signatures' of chemosensitivity to anti-cancer drugs. His group published the work in the prestigious journal Nature Medicine, claiming to have developed a method to identify patients who would respond better to certain drugs. The paper was hailed as a breakthrough in oncology research.

The story takes a familiar turn at this point. Keith Baggerly and Kevin Coombes, biostatisticians at the MD Anderson Cancer Center, attempted to replicate the findings. Incomplete availability of data and software code made their efforts difficult – they had to 'recreate what was done, rather than just retest the model'. Eventually, Baggerly and Coombes resorted to forensic bioinformatics, working backwards to reconstruct the prediction models Potti had used. By November 2007, in a letter to Nature Medicine, Coombes, Wang and Baggerly pointed out errors in Potti's software code and in the column data that had been made available. They concluded:

We do not believe that any of the errors we found were intentional. We believe that the paper demonstrates a breakdown that results from the complexity of many bioinformatics analyses. This complexity requires extensive double-checking and documentation to ensure both data validity and analysis reproducibility. We believe that this situation may be improved by an approach that allows a complete, auditable trail of data handling and statistical analysis.

This produced no visible effect. What was more, Potti and allied researchers published further papers using the same method. Baggerly found errors in these papers as well, but the journals refused to publish his findings. Matters at Duke had meanwhile moved to the next stage: clinical trials on patients, employing Potti's genomic signature tests, got underway. The MD Anderson team now changed their approach; Baggerly had a standalone paper ready summarizing the findings. A prominent biological journal reportedly rejected the paper because "it was too negative". In November 2009, they eventually published it in a statistics journal, the Annals of Applied Statistics.

As Baggerly and Coombes published their findings, Duke University initiated an internal review of Potti's methods and data, and stopped the clinical trials. Surprisingly, the university announced that its review had "confirmed" Potti's conclusions, but chose to keep the report confidential. The trials were restarted. The hidden nature of the review caused an uproar.

Duke administrators accomplished something monumental: they triggered a public expression of outrage from biostatisticians. In a first such action in anyone’s memory, 33 top-level biostatisticians wrote a letter [to NCI Director Harold Varmus] urging a public inquiry into the Potti scandal.

Duke remained unmoved. About four months later (May 2010), an FOI request submitted to the National Cancer Institute finally made the Duke report public. Strangely enough, in contrast to the university's claims, the redacted report seemed to state that the review could not confirm Potti's results.

In the end, revelations that Potti had falsely stated or implied in grant applications that he was a Rhodes scholar resulted in a final suspension of the clinical trials. Writing in Oncology Times magazine, Rubiya Tuma said that the oncology community was "shocked" at falsification involving a scholarship as well known as the Rhodes. In November last year, Anil Potti resigned from Duke University. His papers in Nature Medicine and the Journal of Clinical Oncology have been retracted.

Some termed the whole episode ‘Pottigate‘.

Keith Baggerly

Keith Baggerly estimated that his team spent 15,000 hours trying to recreate Potti's methods from the published descriptions. What is interesting is that Baggerly and colleagues stumbled upon a recurring problem in science – inadequate description of methods and reluctance to share code. In their paper in the Annals of Applied Statistics, Baggerly and Coombes noted:

High-throughput biological assays such as microarrays let us ask very detailed questions about how diseases operate, and promise to let us personalize therapy. Data processing, however, is often not described well enough to allow for exact reproduction of the results, …

Unfortunately, poor documentation can shift from an inconvenience to an active danger when it obscures not just methods but errors.
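To see why such obscured errors are so dangerous, consider one class of mistake that Baggerly and Coombes reported finding in the Potti work: gene labels offset by one row relative to the data. The sketch below is purely illustrative, using made-up probe IDs and random scores rather than any real dataset or the Duke code, and simply shows how a silent one-row shift reassigns an entire 'signature' to the wrong genes without producing any warning.

```python
# Purely illustrative sketch with made-up probe IDs and scores (not the Duke
# data or code). It shows how a one-row offset in a label column silently
# reassigns every gene in a reported "signature" to the wrong identity.
import numpy as np

rng = np.random.default_rng(0)

probe_ids = [f"probe_{i:04d}" for i in range(100)]   # hypothetical probe-set IDs
scores = rng.normal(size=100)                        # hypothetical differential-expression scores

# Correct pairing: score i belongs to probe i.
correct = dict(zip(probe_ids, scores))

# Off-by-one pairing: the label column has slipped down by one row, so each
# score is attributed to the neighbouring probe instead.
shifted = dict(zip(probe_ids[1:], scores[:-1]))

top_correct = sorted(correct, key=correct.get, reverse=True)[:10]
top_shifted = sorted(shifted, key=shifted.get, reverse=True)[:10]

overlap = set(top_correct) & set(top_shifted)
print("probes shared by the two 'top 10' signatures:", len(overlap))
# The two lists typically share few or no probes, yet nothing crashes and no
# warning is raised; without the raw tables the error stays invisible.
```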

They were more pointed in their conclusions.

In the case of reproducibility in general, journals and funding agencies already require that raw data (e.g., CEL files) be made available. We see it as unavoidable that complete scripts will also eventually be required.

Speaking about descriptions of methods, David F. Ransohoff, professor of cancer epidemiology at the University of North Carolina, said:

“If you look at the really big picture— and this is the key point—the entire purpose of methods sections in science articles is to let someone else reproduce what you did. That is really why it is there. So I can see what you’ve done, build on that, or, if I want, see if it is right or wrong. And what has happened as studies have gotten more complex, is that is harder to do. But we, as a scientific field, may have to decide if the solution to that is to say that we are not going to try anymore, or to try to figure out how we can preserve that goal, which is a very important goal in science.”

From cancer to climate science

Such statistical errors can have real consequences. Joyce Shoffner, a cancer patient enrolled in the Potti trial at Duke, felt "betrayed", reported the Raleigh News & Observer. Shoffner had received a drug based on tests suggesting it would work very well against her tumor, but this was not borne out in her clinical course. Shoffner volunteered: "There needs to be some kind of auditor of the data".

For anyone following the climate debate, the parallels between the examination of Potti's results by the biostatisticians Keith Baggerly et al and the examination of Michael Mann's work by Steve McIntyre and Ross McKitrick are glaring. The parallels between how the University at Albany investigated Doug Keenan's fraud charge against Jones' co-author Wei-Chyung Wang and how Duke University conducted its internal investigations of Potti are just as startling. The blog Duke.Fact.Checker recalled the story thus:

For four years, some entrenched people at Duke tried to discredit these challenges in any way they could, including disparaging remarks that biostatisticians were not scientists at all, and that the MD degree yields more expertise in the emerging genome field than a Ph.D. At one point a Dean asked aloud who would believe a bunch of internet fools.

Recounting their conclusions, Baggerly and a team of 43 biostatisticians recently wrote in a joint letter to Nature:

The independent reanalysis of these signatures took so long because the information accompanying the associated publications was incomplete. Unfortunately, this is common: for example, a survey of 18 published microarray gene-expression analyses found that the results of only two were exactly reproducible (J. P. Ioannidis et al. Nature Genet. 41, 149–155; 2009). Inadequate information meant that 10 could not be reproduced.

To counter this problem, journals should demand that authors submit sufficient detail for the independent assessment of their paper’s conclusions. We recommend that all primary data are backed up with adequate documentation and sample annotation; all primary data sources, such as database accessions or URL links, are presented; and all scripts and software source codes are supplied, with instructions. Analytical (non-scriptable) protocols should be described step by step, and the research protocol, including any plans for research and analysis, should be provided. Files containing such information could be stored as supplements by the journal.
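What might such a supplement look like in practice? The sketch below is my own illustration, not anything Nature or the letter's signatories prescribe: a small Python script assembling a machine-readable manifest that records primary data files and their checksums, data-source accessions or URLs, the scripts needed to rerun the analysis, and the software environment. The helper names and every file name and accession placeholder are invented for the example.

```python
# Hypothetical sketch of an auditable analysis "manifest" along the lines the
# letter suggests. Every file name and accession below is an invented
# placeholder, not a real Duke, GEO or CRU resource.
import hashlib
import json
import platform
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Checksum a primary data file so readers can confirm they hold the same bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(data_files, accessions, scripts):
    """Collect the pieces the letter asks for: data, sources, scripts, environment."""
    return {
        "primary_data": [{"file": str(p), "sha256": sha256_of(Path(p))} for p in data_files],
        "data_sources": accessions,   # database accessions or URL links
        "scripts": scripts,           # everything needed to rerun the analysis
        "environment": {"python": platform.python_version()},
    }

if __name__ == "__main__":
    # Create a tiny placeholder data file so the sketch runs end to end.
    demo = Path("example_array.CEL")
    demo.write_bytes(b"placeholder CEL contents")
    manifest = build_manifest(
        data_files=[demo],
        accessions=["database accession or URL goes here"],
        scripts=["preprocess.py", "fit_signature.py"],
    )
    print(json.dumps(manifest, indent=2))
```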

Context in climate science

Philip Campbell

These observations are particularly notable in light of recent comments from Nature's current editor-in-chief, Philip Campbell, in an email to the consensus climate blogger Eli Rabett. In his comments, Campbell explains Nature magazine's data access policy. What is interesting to the writer of this blog is that Eli Rabett asked Campbell the good question of how Nature's 'evolving' data availability policy came about. Hadn't he already trundled down to the depths of the 'dead-tree' library long ago, to emerge with a version of this history himself?

In the face of a detailed response from Campbell, however, Eli Rabett only managed to clutch at straws. He seems focused on one small bit, forgetting about the rest: the bit which implies, to him, that authors need not submit software code to the journal at publication. The contrast between Eli Rabett's response and Baggerly's conclusions above is striking. Activist consensus climate bloggers remain far behind, even retrogressive, in their views on data access and research reproducibility.

A more open-eyed reading of Philip Campbell's letter, however, is enlightening. Campbell appears to have taken his predecessor John Maddox's beliefs about data availability and formalized them. Quoting:

During my time as Editor-in-Chief we have consistently promoted the maximal sharing of data and materials associated with papers in Nature and all other Nature journals (the journals generally have common policies).

My own first initiative was to invite Floyd Bloom, then Editor-in-Chief in Science, to undertake a common change of policy insisting that all reduced structure data be deposited for immediate access rather than with a 6-month delayed release.

Finally, in a recent editorial on data availability, replication of results and fraud, Nature magazine reiterated its approach with telling comments. Noting that paper retractions were "painful", it wrote:

The need for quality assurance and the difficulties of doing it are exacerbated when new techniques are rapidly taken up within what is often a highly competitive community. And past episodes have shown the risk that collaborating scientists — especially those who are geographically distant — may fail to check data from other labs for which, as co-authors, they are ultimately responsible.

If we at Nature are alerted to possibly false results by somebody who was not an author of the original paper, we will investigate. This is true even if the allegations are anonymous — some important retractions in the literature have arisen from anonymous whistle-blowing. However, we are well aware of the great damage that can be done to co-authors as a result of such allegations, especially when the claims turn out to be false.

Where do we go?

A well-known Achilles heel of medical molecular biology is that its advancing front is often plagued by experimental irreproducibility. Results of flawed experiments may seem to make perfect clinico-biologic sense and appear mechanistically valid. High-throughput experiments carry an added layer of intricacy in requiring complex statistical methodologies to infer or detect significant change. And complex, untested statistical methods are not the preserve of high-throughput nucleic acid research alone.

In climate science, the concept of 'unprecedented' global change appears intuitive. But as Baggerly points out: "Our intuition about what 'makes sense' is very poor in high dimensions". It is difficult to troubleshoot complex but flawed scientific and statistical methods when they produce results that agree with prior intuition. Full availability of data, including metadata, and a good description of methods therefore become crucial.
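Baggerly's point about high dimensions is easy to demonstrate. The toy simulation below uses entirely synthetic data (20 hypothetical samples and 10,000 pure-noise 'genes', both my own assumptions, not any real study) and shows that a handful of noise features will always appear to separate two arbitrary groups convincingly in-sample, which is exactly the trap that intuition fails to warn against.

```python
# Toy simulation with entirely synthetic data: with thousands of noise "genes"
# and only a few samples, some genes will always look convincingly predictive.
import numpy as np

rng = np.random.default_rng(1)

n_samples, n_genes = 20, 10_000
labels = np.array([0] * 10 + [1] * 10)                # two arbitrary groups of patients
expression = rng.normal(size=(n_samples, n_genes))    # pure noise, no real signal

# Score every "gene" by the difference in group means, in units of its SD.
group0 = expression[labels == 0]
group1 = expression[labels == 1]
effect = (group1.mean(axis=0) - group0.mean(axis=0)) / expression.std(axis=0)

top = np.argsort(np.abs(effect))[-10:]
print("largest apparent effect sizes found in pure noise:")
print(np.round(np.abs(effect[top]), 2))
# Several noise genes show group differences well above one standard deviation.
# A "signature" built from them looks plausible in-sample and fails on new
# patients; only access to the raw data and code exposes the problem.
```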

Bishop Hill and Climate Audit have reported preliminary details from the UK Commons Science and Technology Select Committee report on the University of East Anglia (UEA) Climategate emails, due out on the 25th of January. Certain conclusions can be drawn about the Jones 1990 UHI paper at this point. It is not possible to say that Phil Jones committed fraud in the publication of his 1990 Nature urban heat island paper. As Jones states, data acquired from geographically distant institutions formed an important part of the basis for the paper's conclusions. However, Douglas Keenan points out that Jones' continued reliance on and citation of this paper, even as he co-authored other publications that contradicted it, can only amount to fraud. The data that could verify Jones' position are not available either.

In view of the above, retraction of Jones’ paper could remedy this defect.

12 comments

  • Tee hee. How timely given Paul Nurse’s ‘killer’ cancer argument in the recent Horizon program which set out to stiff AGW sceptics.

    Well done, Shub.

  • Tsk. Sorry. I meant ‘programme’ not ‘program’.

  • Geoff Sherrington

    Although these early 1990s graphs are naive, knowing what we know today, they show something of the depth of analysis used by P Jones in the 1990 Letter to Nature that he will not retract. The graphs here are not the same data but are very similar and from the same time. They are the best comparison I can find. Jones used a small number of East Australian cities, many of them capital cities.

    http://i260.photobucket.com/albums/ii14/sherro_2008/tasdig91a.jpg

    He rejected a larger number of rural to semi-urban stations –

    http://i260.photobucket.com/albums/ii14/sherro_2008/tasdig91b.jpg

    If, as an exercise, we assume UHI in the top graph and no UHI in the bottom, then a UHI effect is shown. It is about one degree C per century. That is what Jones failed to report at the time and fails to retract now. He claimed that his figures showed negligible UHI.

    I emailed him for an explanation and he did not give one. I also confirmed with the Bureau of Meteorology the number of sites that were available to Jones for his 1990 Nature paper. Again, there is not an exact match, but it is close and has the same sense as I have shown with these 2 graphs.

  • 2 years ago I had never heard of the Jones Nature 1990 paper on UHI. I have followed Climate Audit, Douglas Keenan, Bishop Hill etc and now understand what an important paper this is to AGW, and yet it is flawed ab initio.

    Keep up the pressure!

  • The question is this:

    Did the editors of Nature, Science, and PNAS and the Presidents of the US National Academy of Sciences and the UK’s Royal Society and the research agencies they control perpetuate misinformation for the past three decades about:

    a.) The Sun’s origin,
    b.) The Sun’s composition,
    c.) The Sun’s source of energy, and
    d.) The Sun’s influence on Earth’s climate?

    Or did they all simply “overlook” the same experimental data shown in this video [http://www.youtube.com/watch?v=sXNyLYSiPO0] and in this paper in press in APEIRON [“Neutron Repulsion”: http://db.tt/9SrfTiZ]?

    That is the question. I’ll wager that you cannot get Dr. Campbell or any of the others to give you a direct answer.

    With kind regards,
    Oliver K. Manuel
    Former NASA Principal
    Investigator for Apollo

  • Very well written, sparkling clear, insightful article. How “scientists” can accept non reproducible results and woefully incomplete documentation of data and methods continues to be appalling. My guess is that this is an indicator that the transformation from traditional science to post normal science is much farther along than is widely known.


  • Thanks for this article – the comparison between these two cases is interesting and I would like to add my own – personal – recollections in molecular biology to this.

    In 1999/2000, as high-throughput microarray work was just becoming available to ‘ordinary researchers’, I attended a workshop in Sydney, Australia that highlighted the main issues listed here – poor reproducibility and lack of statistical rigour in the analysis of the data. In one sense, this was hardly surprising – I don’t think I am unique among molecular biologists in entering this field from a more general genetics background at least in part due to a lack of statistical knowledge. We went into molecular biology because “the bacteria grows or it doesn’t” was the kind of result we liked. It was only with the advent of the high-throughput methods that we suddenly had not only large amounts of data, but much of this was quantitative as well, and we had to get our heads around a whole new set of issues with interpretation.

    However, the fact that the need for better statistical modelling was raised such a long time ago (I presume the workshop in Sydney was hardly the first to address the topic) makes the issues with the Potti study less acceptable. In one sense, it is worth noting that although it has taken a few years, the Potti study has now been refuted and this has been accepted by the mainstream – indicating that medical research can (and regularly does) throw out refuted research. Furthermore, in the past 3-4 years I have noticed a concerted effort on the part of researchers to develop standards for archiving and access to the raw data from array-based studies. I noted in Nature Biotech last year (August) a report from MAQC – the MicroArray Quality Control consortium – on their second study of common practices.

    I realise that the MAQC is a consortium made up of a large number of commercial enterprises who have a vested interest in getting the results of such studies accepted by the regulatory agencies, but why should this be any different from the climate change researchers?
    Oh yes, I forgot, the IPCC and the agencies that directly support and contribute to it are governments and self-appointed guardians of the planet. The only groups who seem to have applied the necessary rigour in statistical analysis are those with a private sector background.

    To quote Graham Stringer (from the UK parliamentary committee reviewing the Climategate “enquiries”): “There are proposals to increase worldwide taxation by up to a trillion dollars on the basis of climate science predictions.” It is interesting that there is less critical analysis of this than there is over a cancer treatment which will probably only be offered to a few thousand people.

    Oh yes, I forgot again: a patient can sue a pharmaceutical company if they get it wrong…

  • In a World without integrity and honor justice is a dream. Twentieth Century man lost much in the course of two World Wars and a Cold one. Do not search for the Grail, it no longer exists dear Knight.

  • Rob, Thanks for your comment.
    It is interesting to note that Philip Campbell mentioned that Nature played an important role in setting up the MIAME standard for microarray data as a condition for publication. MIAME, of course, requires that raw data and intermediate processed data be submitted, without exception.

    Compare that with the r2 statistic issue with the Mann papers and the Wahl and Ammann papers, or the Phil Jones Chinese station metadata. The situation is totally absurd. Climate scientists have managed to hold off McIntyre and others for close to a decade now, without changing their basic data availability practices, giving all kinds of possible excuses – “raw data is proprietary”, “description of method is enough; intermediate results need not be shared”, etc.

    Michael Mann and Eric Steig, for example, made a big deal out of making “all data” available for their later papers. One feels like exclaiming: yes, but that is nothing special; it is standard practice for high-impact research work. It is standard practice that arises from questions of reproducibility too.

  • Pingback: The code of Nature: making authors part with their programs | Watts Up With That?

