Best Practice for MRI Diagnostic Accuracy Research With Lessons and Examples from the LI‐RADS Individual Participant Data Group

Medical imaging diagnostic test accuracy research is strengthened by adherence to best practices for study design, data collection, data documentation, and study reporting. In this review, key elements of such research are discussed, and specific recommendations are provided for optimizing diagnostic accuracy study execution to improve uniformity, minimize common sources of bias, and avoid potential pitfalls. Examples regarding study methodology and data collection practices are provided based on insights gained by the Liver Imaging Reporting and Data System (LI-RADS) individual participant data group, who have evaluated raw data from numerous MRI diagnostic accuracy studies for risk of bias and data integrity. The goal of this review is to outline strategies for investigators to improve research practices, and to help reviewers and readers better contextualize a study's findings while understanding its limitations.

The LI-RADS individual participant data (IPD) group was formed within the LI-RADS Evidence and Research Group with the goal of pooling raw data from previously published studies at the individual liver observation level, rather than the patient or study level, to better evaluate the diagnostic performance of LI-RADS.7 This international, collaborative effort has involved outreach to primary study authors; collecting, curating, and validating data; and synthesizing data for meta-analysis. To date, original study data from over 50 separate LI-RADS studies worldwide, including over 14,000 individual liver observations, have been successfully curated, with studies from Argentina, Canada, China, France, Germany, Italy, Korea, Poland, Switzerland, the UK, and the USA. The design of each study has been evaluated for compliance with LI-RADS and for sources of bias. Through the process of conducting IPD meta-analyses, the LI-RADS IPD group has gained insight on study methodology and data collection practices specific to LI-RADS but also broadly applicable to other areas of diagnostic accuracy imaging research.
In this review, aspects of diagnostic accuracy study methodology, data collection, documentation practices, and reporting are discussed with specific examples using LI-RADS. The objective is to help guide investigators, journal readers, and reviewers to improve MRI diagnostic accuracy research practices and avoid common pitfalls.

Study Design
Many clinical medical imaging diagnostic accuracy studies, and most LI-RADS studies, have retrospective cohort or case-control designs. These are susceptible to several sources of patient selection bias that can impact the relative proportions of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), thereby affecting a variety of diagnostic performance measures.

Selection Bias
Case-control studies can incorrectly estimate the diagnostic performance of a test by introducing spectrum bias, which arises when a diagnostic test is studied in a different spectrum of individuals than the intended population for the test.8,9 For example, if a cohort of patients with LR-5 observations on MRI (considered definite HCC) is compared only to a control cohort of patients with biphenotypic tumors (combined HCC-cholangiocarcinoma), the specificity of LR-5 for HCC could be underestimated due to an artificially low proportion of TN relative to FP (Fig. 2).10,11 Similarly, if the control cohort is biased toward patients with a lower risk of HCC than the intended test population, such as by including patients without chronic liver disease, then the number of FP may be reduced, impacting specificity and positive predictive value (PPV) for HCC.12
Case-control study design can also introduce selection bias, because the prevalence of disease in the study cohort impacts the PPV and negative predictive value.13 In a case-control study, investigators establish the ratio of cases to controls, and disease prevalence can differ from that of the broader clinical population.14 For example, one outcome of interest for LI-RADS is the percentage of observations within each category that correspond to HCC. The percentage of HCC, defined as TP/(TP + FP), is the same metric as PPV and is impacted by HCC prevalence in the study population (Fig. 3).15 Using measures of diagnostic accuracy that are not dependent on disease prevalence, such as sensitivity and specificity, is one way to mitigate this issue; however, the outcome of interest should ultimately be clinically informed. A consecutive or random sample of patients representative of those seen in clinical practice should be used to reduce the risk of bias related to patient selection (Table 1). Consecutive or random samples are preferred because studies enrolling patients using other approaches may not produce test results reflective of performance in the clinical population.2
Sensitivity and specificity can also be affected by selection biases, for example when one or more categories of a diagnostic system are excluded from the analysis.16,17 This impacts test performance of the remaining categories, similar to the example of spectrum bias in case-control studies described above. For example, if LR-1 and LR-2 observations are excluded from the analysis, many benign pathologies are excluded that would otherwise be labeled as TN.18 This will cause the TN value to be markedly reduced relative to FP, resulting in LR-5 appearing less specific for HCC than is actually the case. Removing select categories from a diagnostic test system can impact the performance of the remaining categories.
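The prevalence dependence of PPV discussed in this section can be illustrated with a small numerical sketch in Python. The counts below are hypothetical (not study data) and are chosen so that sensitivity (0.80) and specificity (0.90) are identical in both samples; only the proportion of diseased cases differs:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute basic diagnostic accuracy measures from a 2x2 table."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    return sensitivity, specificity, ppv

# Hypothetical case-control sample with 50% disease prevalence:
high_prev = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)

# Hypothetical clinical sample with 10% prevalence, same test characteristics:
low_prev = diagnostic_metrics(tp=16, fp=18, tn=162, fn=4)

print(high_prev)  # sensitivity 0.80, specificity 0.90, PPV ~0.89
print(low_prev)   # sensitivity 0.80, specificity 0.90, PPV ~0.47
```

With identical sensitivity and specificity, PPV falls from roughly 0.89 to roughly 0.47 when prevalence drops from 50% to 10%, which is why a case-control ratio chosen by the investigators can distort the "percentage of HCC" within a category.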

Threshold Effects
Choice of threshold for differentiating positive and negative cases using a continuous or ordinal index test can impact measures of accuracy. For a reporting system such as LI-RADS, where there are multiple categories indicating probability of HCC, the cutoff used to define a positive and negative test result for one or more categories will impact the others.2 The threshold should be defined prior to initiating the study, not post hoc, to avoid biasing measures of accuracy. For example, if determining the accuracy of MRI for diagnosing HCC using LI-RADS, comparing LR-5 vs. all other categories can be used to assess specificity for the LR-5 category.19 In contrast, comparing LR-5, LR-4, and LR-M (probably or definitely malignant but not HCC specific) vs. all other categories can be used to assess sensitivity, but does not provide an accurate assessment of specificity.20,21 Grouping lower LI-RADS categories together with higher LI-RADS categories will decrease specificity for HCC but improve sensitivity, and vice versa. It is important to recognize the limitations of collapsing the LI-RADS categories in a binary fashion, as information is inevitably lost and this does not reflect the intended use of LI-RADS in clinical practice.
The application of different thresholds in primary studies can also preclude direct meta-analytic calculation of summary sensitivity and specificity in a bivariate model, because sensitivity and specificity are impacted by threshold choice.2 Meta-analysis would instead be limited to summary receiver operating characteristic curves. The threshold used when interpreting and analyzing a continuous or ordinal test should be clinically informed, feasible, and defined in the study protocol. Reporting results in sufficient detail to enable calculation of sensitivity and specificity at multiple thresholds is helpful for meta-analysis.
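The threshold trade-off can be sketched with hypothetical category counts (not real study data). Collapsing the ordinal LI-RADS scale at different cutoffs yields different sensitivity/specificity pairs:

```python
# Hypothetical observation counts per LI-RADS category (illustrative only)
hcc =     {"LR-1": 0,  "LR-2": 1,  "LR-3": 10, "LR-4": 30, "LR-5": 50, "LR-M": 9}
non_hcc = {"LR-1": 40, "LR-2": 30, "LR-3": 15, "LR-4": 6,  "LR-5": 4,  "LR-M": 5}

def sens_spec(positive_categories):
    """Sensitivity and specificity after collapsing categories at a cutoff."""
    tp = sum(hcc[c] for c in positive_categories)
    fn = sum(hcc[c] for c in hcc if c not in positive_categories)
    fp = sum(non_hcc[c] for c in positive_categories)
    tn = sum(non_hcc[c] for c in non_hcc if c not in positive_categories)
    return tp / (tp + fn), tn / (tn + fp)

# Threshold 1: LR-5 alone treated as a positive result
print(sens_spec({"LR-5"}))                    # (0.5, 0.96)
# Threshold 2: LR-4, LR-5, and LR-M treated as positive
print(sens_spec({"LR-4", "LR-5", "LR-M"}))    # (0.89, 0.85)
```

With these illustrative counts, the stricter cutoff gives sensitivity 0.50 and specificity 0.96, while the looser cutoff gives sensitivity 0.89 and specificity 0.85, mirroring the trade-off described above.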

Reference Standard
The choice of reference standard for establishing a diagnosis can impact the measured performance of a diagnostic test.22 In many diagnostic accuracy imaging studies, histopathology is the preferred reference standard. Yet in clinical practice, tissue sampling is often not performed for pathologies that appear definitely benign or probably benign on imaging (eg, LI-RADS categories 1 and 2).23 Exclusively requiring histopathologic confirmation can therefore impact the included disease spectrum and heavily bias the sample toward positive cases.23 Using a composite reference standard that incorporates additional criteria to establish a diagnosis may capture more cases, leading to a larger sample more reflective of the population and thereby providing a more accurate representation of the truth than histopathology alone.24,25 The specific clinical question and disease type should guide construction of a composite reference standard. For example, pure ground-glass opacity pulmonary nodules harboring adenocarcinoma on CT frequently have a volume doubling time of up to 400 days, so a prolonged period of stability on imaging extending well beyond 400 days would be warranted to confirm benignity.26 Conversely, a liver observation that has been stable in size on imaging for years is unlikely to be an HCC, which typically has a tumor doubling time of 4-5 months (range 2-11 months).27 Liver observations that disappear or decrease in size by ≥30% without treatment, and not due to resorption of blood products, can be confidently considered benign, whereas a LR-5 observation on MRI that is also shown to be LR-5 on CT or contrast-enhanced ultrasound and demonstrates ≥50% size increase in <6 months can be presumed to represent HCC.7 Ideally, the composite reference standard is independent of the imaging modality being tested to avoid incorporation bias, unless repeat imaging using the same modality is used in the future to document change over time.28 A composite clinical reference standard, such as imaging findings combined with other clinical elements (eg, serum markers and follow-up), can often complement a histopathologic reference standard and enable a study sample more reflective of the population.
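As a worked example of the growth criteria above, volume doubling time can be estimated from two diameter measurements, under the simplifying assumption of roughly spherical growth (volume proportional to diameter cubed); the measurements below are hypothetical:

```python
import math

def volume_doubling_time(d1_mm, d2_mm, interval_months):
    """Volume doubling time assuming spherical growth (volume ~ diameter**3)."""
    volume_ratio = (d2_mm / d1_mm) ** 3
    return interval_months * math.log(2) / math.log(volume_ratio)

# A 50% diameter increase (20 mm -> 30 mm) over 6 months:
dt = volume_doubling_time(20, 30, 6)
print(round(dt, 1))  # 3.4 (months)
```

Under this assumption, a 50% diameter increase over 6 months corresponds to a volume doubling time of about 3.4 months, consistent with the rapid 4-5 month doubling times expected of HCC and far shorter than the up-to-400-day doubling times of the pulmonary nodules mentioned above.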

Interval between Index Test and Reference Standard
A pitfall that can arise when using tissue confirmation in diagnostic test accuracy research occurs when there is a prolonged time interval between the index test and the reference standard, during which a disease can progress.29 For example, if an MRI reveals a LR-3 observation and a percutaneous needle biopsy 15 months later reveals HCC, it is unclear whether the liver observation was in fact an HCC at the time of MRI, or whether it was benign but evolved into an HCC in the interim. Interventions occurring between the time of the index test and the reference standard are also not ideal and should generally be avoided, as these can change the underlying biology. For example, if a patient received chemotherapy between the index test and the reference standard that caused the target condition to change or even resolve, then the reference standard may incorrectly classify the target condition, resulting in the index test appearing less accurate. The time delay between an index test and reference standard should be short enough to minimize the likelihood of a diagnosis change in the interim, and intervening treatments should be accounted for.
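Flow-and-timing checks of this kind can be automated when assembling a study dataset. The sketch below uses hypothetical field names and a hypothetical 180-day maximum interval (which should be prespecified in the protocol and clinically justified), excluding cases with an excessive index-test-to-reference-standard interval or an intervening treatment:

```python
from datetime import date

MAX_INTERVAL_DAYS = 180  # hypothetical protocol-defined maximum

cases = [
    {"id": "P01", "mri": date(2021, 1, 5), "biopsy": date(2021, 2, 1),  "treated_between": False},
    {"id": "P02", "mri": date(2020, 3, 1), "biopsy": date(2021, 6, 10), "treated_between": False},
    {"id": "P03", "mri": date(2021, 4, 2), "biopsy": date(2021, 5, 9),  "treated_between": True},
]

def eligible(case):
    """Keep cases with an acceptable interval and no intervening treatment."""
    interval_days = (case["biopsy"] - case["mri"]).days
    return interval_days <= MAX_INTERVAL_DAYS and not case["treated_between"]

included = [c["id"] for c in cases if eligible(c)]
print(included)  # ['P01']
```

P02 is excluded for a >1-year interval and P03 for an intervening treatment, leaving only P01 in the analysis set.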

Minimizing Study Bias
Tools used to evaluate the risk of bias in diagnostic test accuracy systematic reviews can also help inform primary study design. For example, the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool systematically categorizes sources of bias in four domains: patient selection, index test, reference standard, and flow and timing.30 For studies comparing diagnostic tests, the QUADAS-C tool, which is specifically designed for comparative analyses, can be considered.31 A key aspect when comparing diagnostic tests is to apply all tests being compared to the same population (paired design), or otherwise to randomly allocate participants to tests (unpaired design).2

Imaging Research Data Collection
An important aspect of data collection in diagnostic imaging test accuracy research is the choice of interpreters of the diagnostic test findings, ie, the "readers," and the information that is made available to them at the time of interpretation. The number of readers, reader training, reader years of experience, and a conflict resolution mechanism should be established a priori and be consistent with the study objective.32 Similar to how a sample of participants may differ from an overall population, a sample of readers may differ from the overall reader population, and study results may not necessarily reflect the true accuracy of an index test. For a system intended to be widely used by a variety of stakeholders, such as LI-RADS, a reasonable question regarding test performance is how a diverse group of readers from a variety of practice settings and different career stages apply the diagnostic test(s) independently; however, assembling such a group can be challenging.33,34 Consensus reads should generally be avoided unless they reflect the clinical application of the diagnostic test.35 Ensuring that readers are blinded to the reference standard result is critical when assessing diagnostic accuracy and is particularly relevant in retrospective imaging research, where the reference standard and the original imaging report may be available at the time of imaging examination interpretation.36 Furthermore, some LI-RADS research studies are limited to only a few pathologies, such as HCC, cholangiocarcinoma, and combined HCC-cholangiocarcinoma (biphenotypic) tumors, which could bias readers if they are aware of this study design aspect.10,11 If a retrospective study seeks to determine a diagnostic test's performance in clinical practice, then the information made available to readers should be restricted to what is generally available in clinical practice.
Many diagnostic test accuracy multireader multicase studies fail to account for correlations between and within readers, as well as for correlations between lesions within the same patient.37 Reader data should be documented independently in a master database. If readers are interpreting the same finding more than once in a paired-reader design, for example using a comparison diagnostic test or the same diagnostic test at a different timepoint to quantify intraobserver agreement, a minimum time interval should be employed between interpretations to reduce the risk of recall bias.38,39 This minimum time interval between interpretations of imaging modalities is referred to as a washout period and should be sufficiently long that readers cannot recollect their prior interpretation. The order of cases should be randomized for each reader to avoid reading-order effects, and the order used can also be documented.40 Readers should be blinded to data that were previously collected for the same lesion by themselves or other readers, which can be facilitated by using separate datasheets for each instance of interpretation and later combining them into a single database.
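Per-reader randomization of case order can be scripted so that it is reproducible and documentable. A minimal sketch, using hypothetical case and reader identifiers and a recorded seed per reader:

```python
import random

case_ids = [f"case_{i:03d}" for i in range(1, 9)]
readers = ["reader_A", "reader_B", "reader_C"]

reading_orders = {}
for seed, reader in enumerate(readers, start=1):
    order = case_ids.copy()
    # Seeded shuffle: the seed can be archived so the order is reproducible
    random.Random(seed).shuffle(order)
    reading_orders[reader] = order

# Each reader receives a separate datasheet in their own randomized order
for reader, order in reading_orders.items():
    print(reader, order[:3])
```

Archiving the seed alongside each reader's datasheet documents the reading order without any manual bookkeeping.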
A challenge when using an evidence-based system such as LI-RADS is the release of updated versions incorporating new evidence, which can result in variability in study definitions and reduced generalizability. As an example, a liver observation with a ≥100% increase in size over >6 months would have fulfilled the threshold growth criterion, a major feature impacting the overall LI-RADS category, using LI-RADS version 2017 and earlier, but not the most recent version 2018. However, despite multiple versions of a system, data from each can contribute to the overall understanding of the system and may be translatable to a different version using a conversion strategy.41
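A conversion strategy between versions can sometimes be expressed programmatically. The sketch below encodes only the simplified threshold growth definitions stated in this review (≥50% size increase in <6 months for version 2018, with the version 2017 criterion additionally satisfied by a ≥100% increase over >6 months) and deliberately omits the other criteria of the actual systems:

```python
def threshold_growth_v2018(percent_increase, interval_months):
    """Simplified v2018 threshold growth: >=50% size increase in <6 months."""
    return percent_increase >= 50 and interval_months < 6

def threshold_growth_v2017(percent_increase, interval_months):
    """Simplified v2017-era criterion: as v2018, or >=100% increase over >6 months."""
    return threshold_growth_v2018(percent_increase, interval_months) or (
        percent_increase >= 100 and interval_months > 6
    )

# The same observation can satisfy one version but not the other:
print(threshold_growth_v2017(100, 12))  # True
print(threshold_growth_v2018(100, 12))  # False
```

Recording the raw measurements (size change and interval) rather than only the derived criterion allows data collected under one version to be recoded under another.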

Data Documentation
A variety of tools are available for data documentation, including spreadsheets stored on local drives and secure web applications such as the Research Electronic Data Capture (REDCap®) system, which provides a shared library repository that can be used to import pre-set data collection instruments to standardize data entry.42 A LI-RADS REDCap® database was built by the LI-RADS IPD group and is freely available for download for research from the REDCap® library (Data S1). Thoughtful construction of a database can simplify the steps necessary for statistical analysis and can support data sharing. Protected health information (PHI) should be stored in accordance with Institutional Review Board requirements, which often require PHI to be separated from the main database and stored in a secure directory matching PHI to an anonymized study identifier.43 Datasheets with a single header row of column headings are generally most suitable for statistical analysis. Ideally, column headings consist of few characters and no spaces, as spaces are often replaced with placeholder characters by statistical analysis software. If further description of a column heading is needed, it can be provided in a separate data dictionary document. The use of picklists is encouraged to ensure that readers document data in a consistent fashion. For continuous variables, the number of decimal places as well as the range of possible values should be specified and should be appropriate for the study question (eg, 0.1 mm precision is likely unnecessary for the vast majority of medical imaging clinical applications). Optimizing datasheets prior to data collection, including reducing the degrees of freedom for data entry, can reduce entry errors and the likelihood that data will be corrupted or require manual recoding.
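Picklist enforcement can also be built into home-grown data entry tooling rather than relying on free text. A minimal sketch, using the LI-RADS categories mentioned in this review as an illustrative picklist:

```python
# Illustrative picklist of allowed category values
PICKLIST = {"LR-1", "LR-2", "LR-3", "LR-4", "LR-5", "LR-M"}

def validate_entry(value):
    """Reject free-text variants so entries stay consistent across readers."""
    if value not in PICKLIST:
        raise ValueError(
            f"Invalid category {value!r}; choose one of {sorted(PICKLIST)}"
        )
    return value

print(validate_entry("LR-5"))  # accepted
# validate_entry("LR5")        # would raise ValueError ("LR5" vs "LR-5")
```

Catching variants such as "LR5", "lr-5", or "5" at entry time avoids manual recoding before statistical analysis.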
A related pitfall is failing to distinguish imaging features that are absent from those that are not applicable or not evaluable.45,46 This can impact the calculation of metrics such as odds ratios. For example, LI-RADS can be applied to MRI performed with extracellular or hepatobiliary contrast agents; however, several MRI LI-RADS features, such as hepatobiliary phase hypointensity, can only be assessed using hepatobiliary agents. If an extracellular agent was given, then this feature would most appropriately be documented as "not applicable" rather than "absent." Options for "not applicable" and "not evaluable" should be incorporated into datasheets in addition to "present" and "absent" when documenting such imaging features.
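Such skip logic can be encoded directly in data collection code so that "not applicable" is recorded automatically rather than left to the reader. A hypothetical sketch based on the hepatobiliary phase hypointensity example above:

```python
FEATURE_VALUES = {"present", "absent", "not_applicable", "not_evaluable"}

def code_hbp_hypointensity(contrast_agent, observed=None):
    """Hepatobiliary phase hypointensity can only be assessed with a
    hepatobiliary agent; otherwise record 'not_applicable' (sketch)."""
    if contrast_agent != "hepatobiliary":
        return "not_applicable"
    if observed is None:  # eg, phase degraded by motion artifact
        return "not_evaluable"
    return "present" if observed else "absent"

print(code_hbp_hypointensity("extracellular"))        # not_applicable
print(code_hbp_hypointensity("hepatobiliary", True))  # present
```

Distinguishing these four values in the master database prevents "not applicable" observations from being silently counted as feature-negative when computing odds ratios.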
If an analysis is at the observation level rather than the patient level, then the datasheet should clarify whether observations arise from the same patient, for example by having a column with the observation number and a separate column for the patient study ID. For patients at high risk of HCC, it is common to have more than one liver observation on MRI, with each assigned a unique LI-RADS category.49
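An observation-level datasheet with an explicit patient column might look like the following sketch (hypothetical identifiers), which makes within-patient clustering visible to the analyst:

```python
import csv
import io

# Observation-level rows: a patient column makes clustering explicit
rows = [
    {"patient_id": "P001", "observation": 1, "li_rads": "LR-5"},
    {"patient_id": "P001", "observation": 2, "li_rads": "LR-3"},
    {"patient_id": "P002", "observation": 1, "li_rads": "LR-4"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["patient_id", "observation", "li_rads"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

Here patient P001 contributes two observations; the analyst can then account for within-patient correlation rather than treating all rows as independent.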

Study Reporting
Reporting guidelines serve to promote research transparency, accuracy, and reproducibility, and are available for a large number of study designs at the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network (https://www.equator-network.org). For imaging diagnostic test accuracy research, the Standards for Reporting of Diagnostic Accuracy Studies (STARD) 2015 statement often applies.50,51 Adherence to reporting guidelines facilitates evidence synthesis of imaging diagnostic accuracy research.52 Diagnostic test accuracy manuscripts should therefore be prepared using the STARD 2015 guideline to optimize transparency and reproducibility.
Investigators often publish multiple studies on overlapping patient cohorts. Overlapping cohorts should be clearly stated in the manuscript, even when the objective of each primary study differs, as this may facilitate future evidence synthesis. Overlap must then be accounted for and resolved at either the study or participant level when constructing a meta-analysis database.53 Most LI-RADS studies do not provide direct access to raw study data; however, these data could be made available as supplemental material or in an appendix when publishing primary studies.54,55 While investigators may view their collated data as proprietary, sharing data can help advance the science and may create new opportunities for collaboration. Authors are encouraged to make their raw data available in accordance with the governing Institutional Review Board.57,58 More widespread availability of individual patient data may improve research integrity and discourage investigators from falsifying data that leads to "zombie" trials, which are studies so fatally flawed as to warrant retraction after publication.59 Furthermore, sharing of images can support artificial intelligence research.60 A pre-emptive approach to addressing future concerns of data or analysis manipulation can include uploading a study protocol to a registry at study onset. A variety of registries exist, including clinicaltrials.gov, the EU Clinical Trials Register (www.clinicaltrialsregister.eu), and other national and independent registries.61 The LI-RADS IPD Group has posted several protocols using the Open Science Framework (https://osf.io/tdv7j/).

Conclusion
In this review article, we have outlined strategies and pitfalls relating to diagnostic test accuracy research. Investigators, readers, and reviewers may benefit from incorporating these strategies into their research and review processes. An awareness of the potential impact of inclusion and exclusion criteria on measures of diagnostic performance, biases stemming from reference standard design, considerations for data collection and documentation, and study reporting completeness may improve the quality of diagnostic test accuracy research.

FIGURE 2: Illustration showing the impact of a case-control study design that excludes benign cases on specificity. The specificity decreases in the study sample due to a disproportionate removal of true negatives relative to false positives. FN = false negatives; FP = false positives; HCC = hepatocellular carcinoma; TN = true negatives; TP = true positives.

FIGURE 3: Illustration of selection bias using LR-3 observations as an example. A disproportionate reduction in the number of benign cases in the study sample leads to an upward distortion of the positive predictive value (PPV). FN = false negatives; FP = false positives; HCC = hepatocellular carcinoma; TN = true negatives; TP = true positives.

TABLE 1. Ten Tips When Conducting MRI Diagnostic Accuracy Research
A consecutive or random sample of patients representative of those seen in clinical practice should be used to reduce the risk of bias related to patient selection