Quis custodiet ipsos custodes?

1 Background

For lack of anything better to do this week, I spent a fair amount of time reading Leonid Schneider’s blog. It is no exaggeration to say that what I saw shook my faith in the scientific enterprise of biomedical research. Now, I’ve always been a glass-half-empty type of person so it doesn’t take much for me to become cynical, but I had always believed that the scientific process was self-correcting and the truth would find its way out. However, it is no longer clear that this is the case. The most egregious examples of obvious fraud are the most sensational, but I am beginning to suspect that these are just the tip of the iceberg with respect to data manipulation and poor research practice. Indeed, this paper provides a fascinating analysis of how bad science can naturally come to dominate a research field without requiring any explicit fraudulent intent from the participants.

A research field will - or should - collapse if it is built on irreproducible foundations. This has already happened with “social priming” in psychological research, and the fact that cancer research has not joined it is only attributable to our collective desire to be seen to do something about a disease that will eventually kill each one of us. Is this corruption systematic? How much of biomedical research can be salvaged? (I speak strictly from a scientific perspective. Financially, the gravy train keeps on rolling with billions of dollars in public and private funding, so there’s no problem there.) The poor reproducibility of published conclusions is not merely an academic problem - for example, the pharmaceutical industry relies on preclinical data to prioritize avenues of research, and false positives in the literature waste time and resources that could be better spent elsewhere to improve patient outcomes.

Salvation of a research field is beyond my pay grade, but I will return to this with some comments at the end of the article. Right now, though, I will consider the opposite challenge - of exploiting the current system to its limits. The following sections contain some of my thoughts on how an unscrupulous individual could abuse the trust-based framework of academic publishing to achieve a successful “scientific” career. Why? Because - much like finding exploits in video games - it’s fun. (As a disclaimer, I should add that, as far as I could tell, my former and current colleagues have only ever acted with the utmost integrity. In fact, I would wager that I was the most - ahem - “morally flexible” individual in any group that I worked in. That I never did anything questionable in my 40-odd publications is attributable more to apathy than strength of character.)

2 Jumping down the slippery slope

I would suspect that most individuals involved in research fraud did not start out with ill intent. At some point, they were probably fresh-faced students wanting to make a positive difference to the world by curing cancer, AIDS, male pattern baldness, etc. But real science is hard and unrewarding, while bent or cooked results can prise open the door to money and prestige via publications and grants. Well, at least for principal investigators; in the case of the academic precariat, fudging the numbers may be the only thing keeping a roof over one’s head and food on the table. I daresay that there are few people with the moral fibre required to risk poverty to uphold unenforced ethical standards.

Thus, we will begin our adventures in research misconduct by following our predecessors down the same slippery slope. The first step is to improve the conclusions with some mild \(p\)-hacking and analytical tricks; the second step is to actively modify the data to inflate effect sizes and suppress negative results; and the third step is full-fledged data fabrication. At each step, we increase our active deviation from good practice, but we can simply remind ourselves of the rewards on offer - a CNS paper! A million-euro ERC grant! - to drown out what is left of our conscience. I also note that this process is quite effectively facilitated by peer review, for one can simply adjust one’s position on the slippery slope to satisfy the reviewers. (To paraphrase Matthew: what the reviewers ask for, they shall receive.)

That said, it would be prudent to stick to the truth as much as possible. Not because we have scruples, of course; this is simply to provide a greater level of plausible deniability if anyone ever attempts to reproduce our published “results”. Indeed, true connoisseurs of the art will do their best work on publications that are (i) just important enough to have an impact on promotion or funding yet (ii) not important enough to warrant an attempt at replication by other researchers. This avoids a visceral reaction from the community in response to an obviously implausible result such as the acid-bath stem cells.

3 Massaging the analysis

The most common form of “analytical adjustment” is to drop outlier samples that do not agree with the hypothesis. In and of itself, this may be acceptable as outliers can be generated by all sorts of technical events that are irrelevant to the phenomenon of interest. The real trick here is to accept a more generous definition of “outlier”, that being any sample whose inclusion causes the \(p\)-value to exceed 0.05. (Indeed, the “outliers” do not even need to be a minority of observations in this approach.) The same approach can be taken for confidence intervals, credibility intervals or posterior probabilities, depending on the flavor of one’s statistics.
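
To appreciate how little effort this takes, here is a minimal sketch in Python (numpy and scipy assumed; the groups, sample sizes and removal rule are invented purely for illustration) of how a comparison between two identically distributed groups can be nudged across the 0.05 threshold by a sufficiently generous definition of “outlier”:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two groups drawn from the same distribution, i.e., no real effect.
control = rng.normal(loc=10.0, scale=2.0, size=8)
treated = rng.normal(loc=10.0, scale=2.0, size=8)

_, p_all = stats.ttest_ind(treated, control)
print(f"all samples included: p = {p_all:.3f}")

# "Generous" outlier removal: repeatedly drop whichever treated sample
# hurts the p-value the most, stopping once p < 0.05 (or we run out).
kept = list(treated)
while len(kept) > 2:
    pvals = [
        stats.ttest_ind(np.delete(kept, i), control).pvalue
        for i in range(len(kept))
    ]
    kept.pop(int(np.argmin(pvals)))
    if min(pvals) < 0.05:
        break

_, p_hacked = stats.ttest_ind(kept, control)
print(f"after dropping {len(treated) - len(kept)} 'outliers': p = {p_hacked:.3f}")
```

With small group sizes and a removal rule chosen after seeing the data, a perfectly null comparison will often oblige.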

Another useful tactic is to omit multiple testing correction when many tests are performed. One then simply presents the data for each significant result as if that hypothesis was the only one of a priori interest. In genomics, anyone admitting to doing this would be shamed, but it seems that this attitude has yet to catch on in the wider research community. It would be a rare paper indeed that controls the family-wise error rate across all of its comparisons, as I imagine that many borderline differences would become non-significant and thus unpublishable.
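
For a sense of scale, here is another minimal sketch (same assumptions as before; the choice of 100 comparisons and 10 samples per group is arbitrary) showing how many “discoveries” a batch of truly null comparisons yields at a nominal 0.05, and how the usual corrections spoil the fun:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n = 100, 10  # 100 hypotheses, none of them actually true.

# Each comparison involves two groups drawn from the same distribution.
pvals = np.array([
    stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
    for _ in range(n_tests)
])

print("nominally significant at 0.05:", int((pvals < 0.05).sum()))
print("after Bonferroni correction:  ", int((pvals < 0.05 / n_tests).sum()))

# Benjamini-Hochberg, for those who find Bonferroni too conservative:
# find the largest k with p_(k) <= 0.05 * k / n_tests among the sorted
# p-values, then reject the k smallest.
order = np.argsort(pvals)
passed = pvals[order] <= 0.05 * np.arange(1, n_tests + 1) / n_tests
n_bh = int(np.flatnonzero(passed).max() + 1) if passed.any() else 0
print("after Benjamini-Hochberg:     ", n_bh)
```

On average, a twentieth of the null comparisons clear the nominal threshold, which is more than enough to fill a figure panel or two.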

The beauty of these methods is two-fold:

  • They are virtually undetectable; simply present the data sans outlier as the entirety of the original dataset, or the lone significant comparison as the only one that was ever tested. No reader will be capable of proving otherwise, and even determined investigators can be countered by delaying any record keeping until after the analysis is complete (so that we know to omit records for the undesired data points or non-significant results).
  • In the rare case that someone figures out what was done, all is not lost. We could state that it is accepted practice in the field; failing that, we could plead innocence via ignorance of proper statistics; failing even that, we could make an excuse about accidentally overlooking it (typically accompanied by laying the blame on a grad student). Of course, an effective excuse depends on a believable level of obfuscating stupidity, which makes it difficult for trained statisticians to pull off.

4 Modifying data

The difficulty of data modification depends on the nature of the dataset.

  • Any numbers that are manually recorded (e.g., from readouts on a machine) can be directly modified without much fuss. However, it is worth checking whether the original data follows empirical laws like Benford’s or Zipf’s law and, if so, tailoring the modifications appropriately (see the sketch after this list).
  • Summaries of genomics data are easily modified by inverting the same processes used to analyze them, e.g., negative binomial models for RNA-seq counts, logistic models for allele frequency analyses. We can fit the model, modify the coefficients and perform quantile-quantile mapping of the observations to the modified model, thus generating new data with the desired effect size.
  • Raw sequence data is a richer data source and more difficult to modify in an undetectable manner. However, it is unlikely to matter much that any modification is unrealistic, given the rarity with which raw sequence data is subjected to close inspection. We can then redistribute reads or bases as necessary to ensure that the same (modified) summary is obtained.
  • Flow cytometry data is easy to manipulate through modification of the intensities, though some care may be required to mask it with a light sprinkling of Gaussian noise. Advanced practitioners can cover their tracks by inverting the transformation to modify the raw intensities in the FCS files.
  • Images are difficult to forge as the human eye is very good at detecting non-random patterns due to artifacts of manipulation. Indeed, the presence of manipulated Western blots (and to a lesser extent, microscopy images) often serves as the basis for detection of scientific fraud on PubPeer. Nonetheless, if these modifications must be performed, it is advisable to cut-and-paste from other figures, only use each cut-and-pasted element once, and become familiar with the “blur” tool in Photoshop to eliminate edge effects upon pasting.
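
On the Benford point above, here is a minimal sketch of the kind of first-digit check one might run (Python with numpy and scipy assumed; the two simulated datasets are purely illustrative). Measurements spanning several orders of magnitude tend to follow Benford’s law, whereas narrowly distributed or hand-typed numbers usually do not:

```python
import numpy as np
from scipy import stats

def leading_digits(values):
    """First significant digit of each non-zero value."""
    x = np.abs(np.asarray(values, dtype=float))
    x = x[x > 0]
    return np.floor(x / 10 ** np.floor(np.log10(x))).astype(int)

def benford_check(values):
    """Chi-squared goodness-of-fit of observed leading digits to Benford's law."""
    digits = leading_digits(values)
    observed = np.array([(digits == d).sum() for d in range(1, 10)])
    expected = np.log10(1 + 1 / np.arange(1, 10)) * len(digits)
    res = stats.chisquare(observed, expected)
    return res.statistic, res.pvalue

rng = np.random.default_rng(1)

# Lognormal measurements spanning several orders of magnitude.
chi2, p = benford_check(rng.lognormal(mean=3, sigma=2, size=500))
print(f"broad lognormal: chi2 = {chi2:.1f}, p = {p:.3g}")

# Suspiciously narrow, uniform values (leading digits are almost all 4s and 5s).
chi2, p = benford_check(rng.uniform(40, 60, size=500))
print(f"narrow uniform:  chi2 = {chi2:.1f}, p = {p:.3g}")
```

A glaring mismatch proves nothing on its own, but it is exactly the kind of screen a suspicious reader might apply to manually recorded values, so the careful fabricator must respect it.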

Done carefully, modification of the data can be effectively undetectable. One should always ensure that the upstream paper trail is similarly modified for consistency, e.g., modifying the FASTQ files when working with manipulated summaries like count matrices. This can be problematic if the original files are stored separately by independent entities such as core facilities or CROs; check the relevant policies for your organization. (That said, if investigators are determined enough to get that far, you may have bigger problems on your hands.)

5 Fabricating data

A general principle is to fabricate as close to the source as possible. If an effect needs to be introduced to a sample that does not exhibit it, we might replace that sample with a mixture of cells (or DNA, or extract) from a positive and negative control. This new sample is then run through the rest of the experimental protocol, generating “realistic” data with a genuine paper trail and the desired effect size based on the mixing ratio. We then omit the mixing step from any discussion or record-keeping pertaining to this dataset. This tactic is virtually undetectable but requires additional resources.

Recent advances in machine learning have enabled the generation of artificial data (i.e., deepfakes) by neural networks. Configured correctly, these artificial datasets are indistinguishable from the real thing. However, this requires considerable computational expertise to set up as well as significant computing resources. The only organisations with both the capability and the intent to do this would be for-profit paper mills, allowing them to provide their clients with novel datasets for fabricated manuscripts. Moreover, this approach does not leave a strong paper trail that can withstand a laboratory audit.

6 Solutions

Replicability is the solution. None of the strategies discussed here will hold up to an attempt to independently replicate the result. The question is exactly how a replication framework should be implemented. In some sense, this is already done in an ad hoc manner by the scientific community whenever a result of interest is published, where we can judge the validity of a paper’s conclusions based on the presence of follow-up studies by independent labs. However, this requires independent labs willing to invest effort and time to reproduce a result; furthermore, irreproducible findings cannot be easily identified or pruned from the literature. In the meantime, the fraudsters have moved on to different research fields with new grants, laughing all the way to the bank.

To this end, I read with great interest this article in Nature. (Yes, my hard copy of Nature. For some long-forgotten reason, I have a personal subscription.) The authors discuss the use of “Independent Validation and Verification” (IV&V) teams to replicate studies that are funded by DARPA. The idea is that the funded research team will work with an IV&V team throughout the lifetime of the project, passing on all required protocols and technical know-how to enable the IV&V team to reproduce the results. This strategy ensures that the research is replicable before it hits the literature, avoiding the inefficiency of the informal community-driven validation process. In fact, co-operation with the IV&V team is a condition of funding, an attitude that I found most impressive.

Now, I can already anticipate the major criticism of this approach: who will pay for this? The obvious answer is the same entities that are funding biomedical research right now. It would be fair to say that the scientific community would prefer a smaller number of reliable results over twice as many results with a who-knows-what level of reproducibility. Of course, it means that the funding rate will be halved as more resources are diverted to IV&V, but this is a feature and not a bug; staff with the requisite skills can be recruited from academic labs into IV&V teams where they might enjoy stable incomes and incentives that are not tied to publication output. Snootier academics may dismiss the ability of IV&V teams to reproduce their work, but if a dedicated team cannot reproduce a result, would anyone be able to? And if no one can reproduce a result, is it science, or is it art?

Regardless of the exact solution, it is clear that the current situation is not sustainable. Vast amounts of money with no accountability inevitably lead to corruption, conscious or otherwise. At some point, it will be more than the public will be willing to bear, and government bodies like ORI will be given teeth to match the strength of regulatory oversight in other industries. Quite simply, either we clean up our mess or the politicians will do it for us.