Integrity of randomized clinical trials: Performance of integrity tests and checklists requires assessment

The integrity of randomized clinical trials (RCT) has become a concern owing to a recent rise in the number of retractions and the repercussions this has for evidence‐based patient care. However, there is little research on the subject of RCT integrity assessment. Recent literature reviews have revealed that journals' authors' instructions concerning integrity and their investigation policies concerning allegations of misconduct are heterogeneous. The judicious use of integrity tests applied to RCT manuscripts is hampered by an absence of data concerning misconduct prevalence (pre‐test probability), a failure to evaluate test performance (validity) and a lack of consensus over a gold standard (against which test accuracy can be evaluated). These deficiencies hinder the post‐publication correction of RCT records, the integrity evaluations in systematic reviews of RCTs and the prospective application of preventive solutions in RCT peer‐review and preprint assessment. Dealing with the current controversy about trustworthiness of RCT evidence requires a strong investment in research, reform and education concerning research integrity. The purpose of this review article is to highlight the current limitations in dealing with trial integrity‐related concerns and to propose solutions to some of these issues.


| Testing for integrity in RCT manuscripts
Conceptually, this is similar to testing in clinical practice. 16 Just as clinical tests and checklists must be reliable and valid for confidently screening and diagnosing disease, research integrity tests should also be trustworthy. In the case of RCT integrity assessment, the disease to be detected by testing is scientific misconduct. 14 For the purpose of clarity in our review, misconduct is taken to represent falsification (manipulating or omitting data or results) or fabrication (making up data or results). The accuracy with which an integrity test or checklist establishes that an RCT manuscript is genuine, i.e., free of falsified or fabricated data, is the key issue covered by our commentary on the literature reviewed.
Returning to the analogy of clinical testing, erroneously ruling in or ruling out disease has consequences for clinical care. Equally, false integrity test results have repercussions. False integrity test results are serious both as false-positives (an allegation of misconduct when the RCT data are genuine) and as false-negatives (a falsified or fabricated RCT passed as a genuine study). The former risks removing genuine evidence that could inform practice. The latter permits the continued use of falsified or fabricated evidence in EBM, particularly in systematic reviews. 17 Although research integrity commentaries tend to focus on how false findings affect scientific careers and reputations 18,19 (accused researchers have been reported to have taken their own lives), 20 in health and life sciences research both false-positive and false-negative integrity test results are potentially harmful to patients.
Using the above approach to assess RCT manuscripts, the framework for integrity testing requires delineation of index tests (to predict misconduct) and gold standards (to confirm whether misconduct did in fact take place), which we cover in detail below.
A range of individual integrity tests and checklists have been published. 22-30 Editors and peer-reviewers may wish to use these integrity tests on the RCT papers submitted. 8 Systematic reviewers may also wish to use them to determine if the findings of published RCT preprints or papers are trustworthy. 17,35,36 We describe here some explanatory examples from the published literature to give context.
In 2017, Carlisle published findings from an integrity test that checked for non-random sampling, 25 which complemented his earlier studies. 37,38 The test examined baseline data for excessive similarity or dissimilarity between groups. Assuming that RCTs apply true randomization, it used simulations to compare between-group means and standard deviations and to calculate whether the distribution of trial p-values deviated from a uniform distribution 25 (Figure 1a). This test was applied to 934 RCTs published in the New Eng J Med, a top medical journal, and the resulting p-values were classified as test-positive or test-negative for non-random sampling at various p-value thresholds for interpretation of test results. According to the threshold applied by Carlisle, i.e., a one-sided p-value within 0.05 of 0 or 1, 124 New Eng J Med published, non-retracted, RCTs were test-positive. 25 The journal disagreed and used a higher threshold, i.e., a two-sided p-value of <0.001, at which only 11 RCTs warranted further investigation. 39 The journal's investigations led to none of the 11 test-positive papers being retracted except for one, which was subsequently republished. 40,41 The New Eng J Med editors clarified that the journal 'published their new report of the study, which describes the protocol deviations and reports reanalyses of the data'. 39 This retraction and republication does not imply research misconduct; thus, in this instance, the non-random sampling test produced a virtually 100% false finding rate among test-positives for detecting data falsification or fabrication. 42 When faced with questions about the underlying premise of his test, Carlisle himself admitted 'the results of the test do not identify fraud'. 43,44 In his defense, he gave various reasons for these unexpected test results, including 'errors by authors / editors / typesetters (most of which will be innocent, a few data fabrication)' and 'chance'. 43
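The core idea behind such a non-random sampling check can be sketched in a few lines. The following is only a simplified illustration of the principle, not Carlisle's actual procedure: for each baseline variable, a Monte Carlo p-value is computed for the observed between-group difference under the assumption of true randomization, and the collection of p-values is then compared against a uniform distribution.

```python
import math
import random

def mc_pvalue(mean_a, mean_b, sd, n_per_group, sims=2000, rng=random):
    """Two-sided Monte Carlo p-value for an observed difference in baseline
    means, assuming both groups arise from true randomization (i.e., they
    are samples from one common population)."""
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of a difference in means
    observed = abs(mean_a - mean_b)
    hits = sum(1 for _ in range(sims) if abs(rng.gauss(0.0, se)) >= observed)
    return hits / sims

def ks_uniform(pvalues):
    """Kolmogorov-Smirnov distance between the empirical distribution of
    baseline p-values and Uniform(0,1); a large distance flags excessive
    similarity or dissimilarity between groups."""
    ps = sorted(pvalues)
    n = len(ps)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ps))

# A genuine-looking trial: small baseline differences relative to their SE
rng = random.Random(1)
ps = [mc_pvalue(120.0, 121.0, 15.0, 100, rng=rng),
      mc_pvalue(70.2, 69.8, 8.0, 100, rng=rng),
      mc_pvalue(27.5, 27.9, 4.0, 100, rng=rng)]
distance = ks_uniform(ps)  # a modest distance is expected for genuine data
```

Carlisle's published method works from reported summary statistics across whole trials and uses far larger simulations; this sketch only conveys why the baseline p-values of genuine RCTs should look approximately uniform.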
Checklists for research integrity generally include a series of individual tests. 21,22,26 In one example, Gupta's 2020 editorial 24 put forward an RCT integrity checklist combining seven tests, with a score range of 0-11 and a threshold for checklist score positivity of ≥5 (Figure 1b). It included authorship as one of the seven tests, with a number of authors ≤3 as the threshold for test positivity. When applied to one of Gupta's own RCTs, 45 this individual test proved positive. In fact, the overall checklist-based integrity score for this 2004 RCT was potentially positive at the ≥5 threshold. Publicly available comments concerning the issues in this trial remain unanswered by the authors and the journal (at the time of writing this review; manuscript submitted on 19 August 2022). 46 Notably, the then-prevailing guideline, CONSORT 2001, 47 specifically required reporting of participant flow (strongly recommending a diagram) from randomization to completion, with transparency in reporting of follow-up and data losses. Are these positive integrity test results true-positives or false-positives? It is because of this uncertainty about how to interpret RCT integrity test results that future research is required in this area.

Systematic reviews of RCTs tend to rely on risk of bias or study quality assessments. 4 With respect to integrity assessment, they exclude retracted studies. In its policy for managing what it calls 'potentially problematic studies', the Cochrane collaboration provides some 'red flags' as tests for evaluating integrity concerns in included studies. 17 It refers to a checklist which it identifies as being unvalidated. 23 In fact, a recently published review 26 has summarized 27 individual tests and checklists applicable to RCT integrity evaluation and has concluded, 'apart from the methods to locate textual plagiarism and image manipulation, all other methods, be it theoretical or empirical, are based on examples, are not standardized, and lack formal validation'. 26 While careful evaluation of individual participant data can expose non-genuine data such as copied blocks, duplicated cases, and repeating sequences, 6,7 it is still necessary to establish a threshold for identifying such findings, as other peculiarities can arise by chance.
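Checklist scoring of the kind described above reduces to summing weighted item flags against a threshold. In this sketch, only the authorship item, the 0-11 score range, and the ≥5 positivity threshold come from the checklist discussed; the other item names and weights are hypothetical placeholders:

```python
def checklist_score(flags, weights, threshold=5):
    """Sum the weights of positive items; the checklist is 'positive'
    when the total meets the threshold (>=5 in the example discussed)."""
    total = sum(w for item, w in weights.items() if flags.get(item, False))
    return total, total >= threshold

# Hypothetical weights summing to the 0-11 range; only 'authors_le_3'
# (number of authors <= 3) is an item named in the checklist discussed.
weights = {"authors_le_3": 2, "item_b": 2, "item_c": 2,
           "item_d": 2, "item_e": 1, "item_f": 1, "item_g": 1}
flags = {"authors_le_3": True, "item_b": True, "item_c": False,
         "item_d": True, "item_e": False, "item_f": False, "item_g": False}
print(checklist_score(flags, weights))  # (6, True): checklist-positive
```

The ease of constructing such a score is exactly the concern: without validation of the items and weights, checklist positivity carries no established meaning.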

| Pre-test probabilities
Media interest in scientific fraud has brought many cases to public attention. For example, in 2011, the New York Times reported on a decade of fraudulent research by a prominent psychologist and former faculty dean at a European university. The investigation led to his resignation and a formal apology to the scientific community. 50,51 Using the clinical evidence analogy, this is merely a case report. What is required is the determination of a baseline, the underlying rate at which scientific misconduct occurs. In the Netherlands, surveys about research integrity have shown that 'more than half of Dutch scientists regularly engage in questionable research practices … and one in 12 admitted to committing a more serious form of research misconduct within the past 3 years: the fabrication or falsification of research results'. 52,53 A meta-analysis of research integrity surveys 54 revealed that a third of scientists admitted to engaging in various questionable research practices, and around 2% admitted to falsifying or fabricating data at least once. These surveys tell us about the rates of misconduct among the scientific community, with the number of scientists surveyed as the denominator.
What is instead required for the evaluation of the performance of RCT integrity tests is the pre-test probability of RCTs with confirmed misconduct in the published literature, with the total number of published RCTs as the denominator. This is sometimes also called the prior, prevalence or baseline rate. For example, a PubMed search capturing citations to RCTs about COVID-19 returned 1557 records in July 2022. On the same day, the retractionwatch.org database showed that six COVID-19 RCTs had been retracted. 55 Assuming that retraction is a marker of a deficit in RCT integrity, one can compute the pre-test probability as 6/1557 = 0.4%, or 4 in 1000. One might argue that the PubMed search overestimates the denominator, and that the retractionwatch.org database represents only the tip of the iceberg, underestimating the numerator. Whichever way you look at it, one thing is clear: the pre-test probability of flaws in RCT integrity is likely to be extremely low. It is not zero, and each published RCT that lacks integrity is a looming danger to EBM. However, given the low expected pre-test probability, the serious matter of RCT integrity assessment needs to be backed by serious research into estimating the pre-test probabilities precisely and into the validity of testing.
It is important to recognize that the purpose of testing is to raise or lower the pre-test probability of the diagnosis in question, either to confidently confirm or to refute misconduct in RCTs, not just to make allegations. Table 1 shows estimates of the pre-test probability of integrity concerns both in the general published literature and in RCTs. With a low pre-test probability, tests for RCT integrity must perform with a high level of accuracy to avoid false-positive results; otherwise, they will damage genuine research. 56,57 This conclusion reinforces the need for research to establish pre-test probabilities when evaluating the measurement properties of RCT integrity tests to gauge the level of their performance.
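The arithmetic behind this warning is simple Bayesian reasoning. As an illustration, consider a hypothetical (and optimistic) integrity test with 90% sensitivity and 95% specificity; only the 0.4% pre-test probability comes from the retraction example above:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability that a test-positive RCT truly involves misconduct."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)

# At a 0.4% pre-test probability, even a 90%-sensitive, 95%-specific
# test yields mostly false alarms among its positives.
ppv = positive_predictive_value(0.004, 0.90, 0.95)
print(round(ppv, 3))  # about 0.067: roughly 14 of 15 positives are false
```

In other words, at such low prevalence the specificity of an integrity test would have to be extraordinarily high before a positive result could justify a misconduct allegation.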

| Evaluating the performance of RCT integrity tests
From the above, it should be clear that declaring an integrity test positive without reference to its measurement properties says nothing about its performance in establishing whether the data obtained from an RCT are genuine. Likewise, combining many unvalidated tests into a checklist yields a meaningless assessment. Thus, it is odd that the loud noises raising concerns about RCT integrity 22,24,27 tend not to be accompanied by research on the performance of the integrity tests. 57 Test evaluation and measurement science are related subjects. 59 The methods needed for RCT integrity test evaluation can be developed along the lines taken by clinical measurement initiatives such as consensus-based standards for the selection of health measurement instruments (COSMIN), 60,61 outcome measures in rheumatology (OMERACT), 62,63 and core outcome measures in effectiveness trials in women's and newborn health (CROWN). 64 The non-random sampling test 25 has been widely applied, 35,36 so the assessment of its measurement properties shown in Table 2 may come as a surprise to anyone not accustomed to detailed methodological and statistical assessment of tests. The reality is that it fails on OMERACT's instrument selection algorithm, 62 being red-flagged for failing on face validity and feasibility in such a manner that it would be dropped altogether from consideration. Readers should rightly be skeptical, and they do not have to believe the version presented in Table 2.
However, there are many published sources of criticism. 44,67-69 A 2017 independent critical appraisal of the test, published in a peer-reviewed journal by Mascha et al., 70 summarized concerns about the non-random sampling test as follows: 'Our main findings are (1) independence was assumed between variables in a study, which is often false and would lead to "false positive" findings; (2) an "unusual" result from a trial cannot easily be concluded to represent fraud; (3) utilized cutoff values for determining extreme P values were arbitrary; (4) trials were analyzed as if simple randomization was used, introducing bias; (5) not all P values can be accurately generated from summary statistics in Table 1, sometimes giving incorrect conclusions; (6) small numbers of P values to assess outlier status within studies is not reliable; (7) utilized method to assess deviations from expected distributions may stack the deck; (8) P values across trials assumed to be independent; (9) P value variability not accounted for; and (10) more detailed methods needed to understand exactly what was done.' Carlisle, the author of the test himself, has stated that 'The analysis I used is fairly "dumb"… the test makes a number of assumptions that are bound to be wrong on many occasions. The results of the test do not identify fraud'. 43,44,70 Yet, the test remains in use.
Face validity of RCT integrity tests has an important aspect that deserves consideration.Common decency and wisdom dictate that awareness of shortcomings in our own work should encourage us to refrain from judging others for the same deficiencies we harbor ourselves.Thus, one would expect that if an integrity test is positive in a researcher's own RCT, they would be deterred from using the same test when making misconduct allegations against other RCTs.Those trialists raising concerns about RCT integrity would be expected to demonstrate a benchmark for ethics and professional conduct in their own RCTs with the integrity tests they deploy in their complaints.There have rightly been calls to 'Stop the blame game' being played in the name of research integrity. 19

| Gold standard for evaluating integrity tests applied to RCT manuscripts
A gold standard is required to establish beyond doubt whether an RCT, in fact, suffers from integrity deficits (i.e., it has falsified or fabricated data) or whether it does not (i.e., it has genuine data). It is the comparison against the gold standard that determines how often an RCT integrity test is truly or falsely positive or negative, and that estimates its accuracy, e.g., sensitivity and specificity. 16,77 This evaluation is also called a criterion validity assessment (as in Table 2).
Independent re-examination of the raw dataset with reference to the source records (case report forms) is an obvious gold standard for ascertaining trial authenticity, but, unfortunately, these are not always available when required. Except in the most obvious cases, 7 when dataset re-analysis does not give a clear-cut result, an assessment may be hard to reach without the original source forms. Thus, there is a need to consider other approaches to setting a gold standard.
Research concerning the reasons for retraction of published papers by journals has shown that scientific misconduct features prominently. 78,79 This raises the possibility of using retraction as a potential gold standard for estimating the accuracy of RCT integrity tests. 80 Some important provisos, among the many that apply here, include unclear reporting of reasons for retraction, 81 retraction with republication, 41,82-85 and author-initiated retractions unlinked to dishonesty. 86-89 Such evaluations of RCT integrity tests appear to be missing from the literature. This may, in part, be due to the difficulty in defining a gold standard.
To demonstrate what research could be carried out in RCT integrity assessment, we provide an example of the comparison of the non-random sampling test with a gold standard. Here, it is important to recognize that the objective is to capture the performance of both test-positive and test-negative results. In this regard, note that when the New Eng J Med investigated 11 of the 934 RCTs tested by Carlisle, 25 it only looked at test-positive RCTs. 39 A full criterion validity evaluation would use all the test results, both positive and negative, across all reported thresholds for test interpretation, calculating the area under the receiver operating characteristic (ROC) curve or the precision-recall plot. 90,91

Table 1 Estimates of pre-test probabilities or prevalence rates of scientific misconduct in the published literature in general and in randomized clinical trials (RCT). Various criteria were used to define misconduct; pre-test probability is expressed as a percentage. For example, for publication type 'Any', Gupta 2020 24 used criteria for misconduct expressed as 'fabrication fraudulent' without an explicit definition, giving a subjective estimate.
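The area under the ROC curve can be computed directly from any set of test scores once each RCT is labelled by the gold standard. A minimal sketch using its rank interpretation (the scores below are made-up illustrations, not real trial data):

```python
def auc(scores_flagged, scores_clean):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen gold-standard-positive RCT
    scores higher than a randomly chosen gold-standard-negative one
    (ties count half)."""
    wins = ties = 0
    for f in scores_flagged:
        for c in scores_clean:
            if f > c:
                wins += 1
            elif f == c:
                ties += 1
    return (wins + 0.5 * ties) / (len(scores_flagged) * len(scores_clean))

# Hypothetical integrity-test scores for retracted vs unretracted trials
retracted = [0.8, 0.6, 0.9]
unretracted = [0.2, 0.7, 0.4, 0.1]
print(auc(retracted, unretracted))  # 0.5 = chance, 1.0 = perfect
```

Because this formulation uses every score, it captures the behaviour of test-negative as well as test-positive results across all thresholds at once, which is exactly what the partial investigation described above failed to do.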

| Need for data sharing
The International Committee of Medical Journal Editors requires that RCT registrations and papers make clear their data sharing plans. 92,93

Table 2 Evaluation of the measurement properties of the non-random sampling test for integrity of randomized clinical trials (RCT).

Content and face validity
Degree to which the test appears to be an adequate reflection of the concept it purports to measure.
The target concept is non-randomness in an RCT, which is tested by examining baseline characteristics for excessive similarity or dissimilarity using continuous variables.
Experts have pointed out that the test fails to capture this concept, 67,68,70 as RCTs are almost invariably stratified or minimized for key prognostic variables, so they are not completely randomized 43,44,67,68,70 (Senn, a well-known statistician, states: 'it is a fact that very few trials are completely randomized'). 44,69 The test assumes that baseline variables are independent, but they are usually correlated. 68,71,72

Feasibility
Degree to which the test is practical, e.g., in terms of burden, accessibility, translations, length of conduct, etc.
Journals' instructions to authors 13 and statistical guidance 73 are heterogeneous, so reporting in RCTs is not harmonized across published papers. For example, baseline data are subject to the effect of rounding of decimal fractions, 29 and the test result may vary between published baseline characteristics and those obtained from raw data. 70 Frequently, baseline characteristics are reported as medians, ranges, standard errors, confidence intervals or other measures which need to be converted to means and standard deviations; however, the conversion formulae vary 74 and do not all generate exact data. The test is prone to human error in generating the baseline table, writing the manuscript, preparing the proofs, extracting the data, and entering the data into a spreadsheet. Running the simulations requires statistical expertise. Performing the test can be time consuming and, with simulations involving several hundreds of thousands of repetitions, it is computationally intense. Statistical testing needs development in the area of true random number generators. 75
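To make the conversion problem concrete, one widely cited set of rules of thumb (Hozo et al.'s approximations; the thresholds and formulae below follow that scheme and are estimates, not exact conversions) derives a mean and standard deviation from a reported median and range:

```python
import math

def mean_sd_from_median_range(low, median, high, n):
    """Approximate mean and SD from a reported median and range using
    Hozo-style rules of thumb; the result is an estimate, not exact."""
    mean = (low + 2.0 * median + high) / 4.0
    if n <= 15:
        sd = math.sqrt(((low - 2.0 * median + high) ** 2 / 4.0
                        + (high - low) ** 2) / 12.0)
    elif n <= 70:
        sd = (high - low) / 4.0   # range/4 for moderate sample sizes
    else:
        sd = (high - low) / 6.0   # range/6 for large sample sizes
    return mean, sd

# Example: median 5 with range 0-10 in a group of 50 participants
m, s = mean_sd_from_median_range(0.0, 5.0, 10.0, 50)
print(m, s)  # 5.0 2.5
```

Note that alternative conversion formulae (e.g., Wan et al.'s) give different answers for the same inputs, which is precisely why a test result can differ between published summaries and re-analysis of raw data.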

Construct validity
The extent to which the test correctly measures what it is supposed to, e.g., does it have the discriminative capacity to identify groups of fabricated and genuine RCTs.A construct is a theoretical concept or characteristic that cannot be directly observed but can be empirically measured by other indicators that are associated with it.
In an RCT dataset with known integrity concerns, a 'highly unusual' pattern has been reported compared with an 'expected' distribution, 76 but this study does not permit test result interpretation against a threshold. In other studies, the test threshold for interpretation of the result as a dichotomous positive or negative finding is disputed. 70 For example, the New Eng J Med deploys a higher threshold 39 that gives a lower test positivity rate than the one originally reported. 25 Of the 11 RCTs identified as test-positive, the New Eng J Med observed test limitations in five, 39 typographical errors in another five 39 (standard errors reported as standard deviations or vice versa), and honest errors in the one remaining study, which was republished. 40,42,44 A statistics expert has blogged that, on its original criteria, the test 'would catch 15% of papers that were retracted, but that means that 85% slipped through'. 72

Criterion validity
Test accuracy compared with a gold standard, e.g., as measured by test sensitivity, specificity, and the receiver operating characteristic (ROC) curve (where an area of 50% represents a test without discriminative capacity), etc.
Using RCT retraction as the gold standard for misconduct and taking data from the Carlisle 2017 paper, 25 the area under the ROC curve for detecting misconduct is 59.8% (95% confidence interval 55.8%-64.8%) (Figure 2, Appendix S2). In a prospective test validation study 57 comparing genuine with fabricated datasets of factorial design studies, the area under the ROC curve was 50.1% (95% confidence interval 46.8%-53.3%). This shows poor capacity to discriminate between genuine RCTs and those with integrity concerns.

Note:
The steps for performing this test, 25 shown in Figure 1a and Appendix S1, include extraction of the means and standard deviations of baseline continuous variables, input of these data into a spreadsheet, and examination, via simulation in a statistical package such as R, 66 of whether the p-values generated conform to a uniform distribution, looking for excessive similarity or dissimilarity between groups.

When a concern is raised about the integrity of an RCT, investigation frequently demands access to the deidentified raw data. 6,11,14 This is because raw data re-analysis can serve as a gold standard. Availability of datasets can expedite investigations when unusual abnormalities are observed that are unlikely to occur in a properly conducted RCT, such as 'copy-pasted data', 'duplicated cases in multiple rows or columns', 'non-sequential randomization dates', etc. 6,7 Unless submitted with the paper, requests to authors for such data to confirm the findings of their study have until now usually been met with an unsatisfactory response. 94 There remains a gap between the recommendations for data sharing and actual practice. 95,96 For this reason, an international multi-stakeholder consensus highlights the responsibility RCT authors have to keep deidentified raw data in case they are required to address any post-publication complaints. 97 We concur with the Int J Gynecol Obstet about the need for a statement by authors showing their willingness to share their raw RCT data, 11 though authors could voluntarily go beyond this and share the data at the time of submission along with their preprint. Openness has to be accompanied by protection against scientific harassment as an unintended consequence of this initiative. 98-100 These issues should, no doubt, be the subject of future research.
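Some of the raw-data red flags listed above are mechanical enough to sketch in code. The checks below are illustrative only; real forensic review of individual participant data is far more involved:

```python
import datetime

def duplicated_case_rows(rows):
    """Indices of rows that exactly repeat an earlier case record:
    'duplicated cases in multiple rows' is one reported anomaly."""
    seen = set()
    flagged = []
    for i, row in enumerate(rows):
        key = tuple(row)
        if key in seen:
            flagged.append(i)
        else:
            seen.add(key)
    return flagged

def backward_randomization_dates(dates):
    """Positions where, with cases sorted by randomization number, the
    randomization date steps backwards in time (a non-sequential date)."""
    return [i for i in range(1, len(dates)) if dates[i] < dates[i - 1]]

# Toy dataset: one exactly duplicated case and one out-of-order date
cases = [(54, "F", 120), (61, "M", 135), (54, "F", 120)]
dates = [datetime.date(2020, 1, 2), datetime.date(2020, 1, 9),
         datetime.date(2020, 1, 5)]
print(duplicated_case_rows(cases))          # [2]
print(backward_randomization_dates(dates))  # [2]
```

As the text notes, flags like these only prompt further scrutiny against the source records; identical rows or unusual date patterns can sometimes have innocent explanations.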

| Limitations
Our work is subject to some limitations.

| Funding information
None.

| Conflict of interest statement
KSK is a previous editor and former editor-in-chief of the British Journal of Obstetrics & Gynaecology (BJOG). PFWC is currently a deputy editor-in-chief of BJOG. For detailed disclosures of all authors, see the submitted ICMJE disclosure forms in Appendix S3.

Figure 1
Tests for assessing the integrity of randomized clinical trial (RCT) manuscripts: (a) some individual integrity tests based on published references, 26-28 with a flow diagram of the steps required for the simulation-based non-random sampling test applied 25 to baseline continuous variables and the baseline table 49 of a genuine RCT; (b) a published RCT paper 45 and a published integrity checklist 24 applied to that paper. All reproduced components of the figures have been included with the permission of the copyright holders.
First, our review does not evaluate all integrity tests and checklists exhaustively. Instead, it provides the test evaluation with examples, which is intended to highlight future research needs. As it stands, our work is not intended to be generalizable for practice. However, it may inspire other researchers to conduct systematic evaluations of the performance of integrity tests and checklists. Second, we have supported our arguments with expert opinions from non-peer-reviewed sources. Such sources, despite being informative and explanatory, lack the scrutiny of peer-reviewed publications and thus should be interpreted with caution. Third, our emphasis on data fabrication and falsification restricts the scope of our work, which does not include the broader concept of integrity assessment that promotes responsible research practices across the research lifecycle; e.g., we do not cover ethical aspects of clinical trials. Despite this, our work provides focused insight regarding the multidimensional scientific evaluation of the performance of integrity tests and checklists. Finally, research integrity assessment is a broad picture, and detection of whether data are genuine is a specific issue within the integrity spectrum. Our paper is motivated by the background of an increasing number of retractions, which typically are the result of misconduct; honest errors are dealt with by issuing an erratum or by retraction combined with republication. Thus, our work is focused on data-related integrity.

| Conclusion
RCTs with integrity deficits have been slipping through peer-review, and post-publication concerns are being raised using insufficiently validated tests. This dent in confidence about RCT-based EBM has the potential to develop into a crisis impacting patient care. The heterogeneity in journals' instructions to authors concerning integrity and their investigation policies for addressing allegations of misconduct needs to be settled through a consensus view among journal editors, researchers, and other stakeholders. The various RCT integrity-related issues that have surfaced require careful consideration, not knee-jerk responses, by the science publishing, research funding, and researcher communities. It can be argued that the current integrity tests can be used to screen for possible research misconduct, but first the performance of the tests has to be properly established. Any articles screened positive through well-performing screening tests will further need to be subjected to some form of diagnostic test to confirm or refute the presence of flawed data. A robust and validated confirmatory test is currently lacking. Some of the evidence used to support our analysis is only available as blogs, websites and news articles; the reference material employed for this article will invariably include such sources owing to the controversial nature of the topic. Future research into research integrity in RCT manuscripts is urgently required to re-establish confidence in the findings and conclusions of RCTs and their evidence syntheses. For protecting healthcare, research institutions need to prioritize the elimination of data falsification and fabrication by focusing on the underlying causes. Research, reform, and education are the means for the prevention of misconduct.

Figure 2
Accuracy of the non-random sampling test as measured by the area under the receiver operating characteristic (ROC) curve. The analysis uses the Carlisle 2017 25 data with retracted papers as the gold standard for scientific misconduct (Appendix S2). Across all the thresholds for interpretation of two-sided p-values reported, there were 5015 unretracted randomized clinical trials (RCTs), 72 retracted RCTs, and 221 positive and 4866 negative test results. See Figure 1a and Table 2 for details of the non-random sampling test.

| Author contributions
All the authors (Khalid S. Khan, Mohamed Fawzy, Patrick F. W. Chien) made substantial contributions to the conception or design of the work, or the acquisition, analysis, or interpretation of data for the work; drafted the work or revised it critically for important intellectual content; gave final approval of the version to be published; and agree to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

| Acknowledgments
KSK is a Distinguished Investigator at the University of Granada funded by the Beatriz Galindo (senior modality) program of the Spanish Ministry of Education. Open access funding provided by IReL.
Table 2 provides an example evaluation of the measurement properties of an RCT integrity test, conducted along the lines of the OMERACT and COSMIN methods.
The subjective estimate in Table 1 was expressed by the author as: 'I estimate that up to 20% of published papers in the world literature are likely to fall into this fabricated fraudulent risk category.'