Recovering Misidentified Samples Through Genetic Discordance Clustering

The many logistical and technical challenges associated with sample and data handling in large-scale genotyping studies can increase the risk of sample misidentification, which may compromise subsequent analyses. However, the standard quality assurance methods typical for large genotyping arrays can often be further utilized to identify and recover problematic samples. This article emphasizes the importance of identifying and correcting underlying sample misidentification rather than simply excluding known discrepancies, an approach that may leave related issues undetected. Lastly, we provide a screening protocol to complement standard quality assessments as a guideline for identifying mismatched samples and a tool for assessing the most common causes of sample misidentification. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.


INTRODUCTION
Genome-wide association studies are crucial for investigating the genetic contribution to differences in disease susceptibilities and developmental traits. The surge of these studies in the last decade has provided insight into the interplay between genetic and environmental factors, allowing for a more comprehensive understanding of many complex diseases (Beck et al., 2020; Visscher et al., 2017). Implementing such studies requires a significant investment of resources, often brought forth through international collaboration and motivated by the need for a higher-resolution landscape of genetic influences. However, the logistical challenges associated with sample and data handling can introduce errors that accumulate and counteract the benefits of establishing larger cohorts (Anderson et al., 2010; Laurie et al., 2010; Turner et al., 2011). Such mistakes can originate from several sources (e.g., sampling, biobank management, plating), and previous studies have indicated that >10% of samples in microarray data may be mislabeled (Broman et al., 2015; Khan et al., 2001; Zhang et al., 2009). Sample misidentification often generates noise, reducing the overall statistical power for identifying genetic associations; however, more systematic misidentifications may produce skewed data representations leading to biased findings (Broman et al., 2015). Studies of quantitative or secondary phenotypes are often more sensitive to such biases than case-control studies due to challenges with normalization. Although the quality control of samples often leads to the exclusion of discrepancies, the sources of many issues relating to sample misidentification may not be thoroughly investigated (Ekstrom & Feenstra, 2012; Laurie et al., 2010) or corrected (Hunter-Zinck et al., 2020). This article highlights the importance of identifying discrepancies and correcting underlying issues rather than filtering out flagged samples.
Standard protocols and tools typically assess the quality of genotype data by both the individuals and markers (Fig. 1; Anderson et al., 2010; Purcell et al., 2007). This is a crucial process for minimizing potential biases and errors that may affect the accuracy and reliability of genetic associations. These standard protocols have been detailed extensively elsewhere (Anderson et al., 2010; Hunter-Zinck et al., 2020; Kockum et al., 2023; Turner et al., 2011); however, we have provided a brief overview of the standard procedures for assessing genotype data and a summary of the overall process in Figure 1.
In short, per-individual quality assessments usually include four steps (Anderson et al., 2010). (1) Discordances between genetic and clinical/self-reported sex are often identified and excluded as cases of sample misidentification. (2) A high rate of missing genotypes or an outlying mean heterozygosity rate, which is the proportion of heterozygous genotypes judged relative to the sample population, likely indicates poor sample quality. Variations in DNA quality can impact the genotype call rate and accuracy. Therefore, individuals with a high genotype missingness rate (undetermined genotypes for more than 3% to 7% of all markers) and abnormal heterozygosity should be removed. Heterozygosity rates <23% and >30% indicate inbreeding and contamination, respectively. (3) Relatedness between individuals is often assessed using identity-by-descent (IBD), the proportion of shared alleles among independent SNPs between sample pairs, to identify duplicate samples and related individuals. IBD estimates >0.1875 are often used to identify samples from relatives, and IBD >0.98 indicates duplicates or monozygotic twins. In population-based studies, one individual should be removed from each flagged pair to prevent bias due to overrepresentation. (4) Divergent ancestry is usually assessed by principal component analysis (PCA) to prevent confounding due to population stratification. Population outliers may be excluded to prevent deviations from the study population (e.g., due to immigration). Furthermore, such genotypic differences due to population origins between cases and controls may generate increased variability and spurious associations if uncorrected in the analysis (Cardon & Palmer, 2003).
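The per-individual checks above can be sketched as simple threshold rules. This is a minimal illustration, not the authors' pipeline: the `SampleQC` record, the function names, and the choice of 3% as the missingness cutoff (the lower end of the 3% to 7% range) are assumptions made for the example; the heterozygosity and IBD cutoffs are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample summary; field names are illustrative."""
    sample_id: str
    reported_sex: str    # "M" or "F", from clinical or self-reported data
    genetic_sex: str     # "M" or "F", predicted from the genotypes
    missing_rate: float  # fraction of markers without a genotype call
    het_rate: float      # fraction of heterozygous genotype calls


def flag_sample(s, max_missing=0.03, het_low=0.23, het_high=0.30):
    """Return the per-individual QC flags raised for one sample."""
    flags = []
    if s.reported_sex != s.genetic_sex:
        flags.append("sex_discordance")
    if s.missing_rate > max_missing:  # assumed cutoff within the 3%-7% range
        flags.append("high_missingness")
    if s.het_rate < het_low:
        flags.append("low_heterozygosity")   # possible inbreeding
    elif s.het_rate > het_high:
        flags.append("high_heterozygosity")  # possible contamination
    return flags


def classify_pair_ibd(ibd, related=0.1875, duplicate=0.98):
    """Classify a sample pair from its IBD estimate."""
    if ibd > duplicate:
        return "duplicate_or_mz_twin"
    if ibd > related:
        return "related"
    return "unrelated"
```

Keeping the flags as labels rather than dropping samples immediately preserves the information needed for the later clustering by plate.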
Quality control of the genetic markers is also critical to ensure the overall accuracy of genetic findings. Each marker's quality depends on the reliability of its genotyping call and any potential biases that may be introduced with the assessed phenotypes. Per-marker quality assessments usually include four steps. However, the cutoffs used to identify low-quality markers are less standardized and may differ between studies. (1) Markers with excessive missing genotypes, with a typical call rate threshold of 95%, indicate low genotyping quality and reliability and are often filtered out. A call rate threshold of up to 99% is sometimes used for markers with lower minor allele frequency (MAF, <5%) (Wellcome Trust Case Control Consortium, 2007). (2) Markers with a significant deviation from Hardy-Weinberg equilibrium (HWE) may suffer from errors in genotype calling and should be excluded. However, deviation from HWE can also indicate selection pressure and could be due to the assessed phenotype; therefore, only controls should be used to assess marker quality. (3) Discrepancies in the call rate between comparison groups (e.g., case/control), which may be due to inadequate sample randomization, are assessed to prevent the introduction of bias associated with differences in genotyping quality. (4) Markers are also filtered to retain those with a MAF of at least 2%, as low-frequency alleles are more difficult to call accurately.
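The per-marker checks can be illustrated in the same way. The sketch below assumes genotype counts taken from controls only, uses a one-degree-of-freedom chi-square approximation for the HWE test (exact tests are common in practice), and uses a hypothetical significance cutoff `hwe_alpha`, since the text does not give one. The function name and arguments are illustrative.

```python
import math


def marker_qc(n_aa, n_ab, n_bb, n_missing,
              min_call_rate=0.95, min_maf=0.02, hwe_alpha=1e-6):
    """Flag one marker from its genotype counts (controls only for HWE).

    hwe_alpha is an assumed cutoff, not a value from the text.
    """
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return ["no_calls"]
    flags = []
    call_rate = n_called / (n_called + n_missing)
    if call_rate < min_call_rate:
        flags.append("low_call_rate")

    freq_a = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(freq_a, 1 - freq_a)
    if maf < min_maf:
        flags.append("low_maf")

    # Chi-square goodness of fit against Hardy-Weinberg proportions (1 df).
    expected = [freq_a ** 2 * n_called,
                2 * freq_a * (1 - freq_a) * n_called,
                (1 - freq_a) ** 2 * n_called]
    observed = [n_aa, n_ab, n_bb]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    if p_value < hwe_alpha:
        flags.append("hwe_deviation")
    return flags
```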
It is common practice to flag and exclude problematic samples before the analysis, and additional protocols to remove misidentified samples have been proposed (Hunter-Zinck et al., 2020). However, samples flagged by standard quality assessments often represent systematic mistakes affecting other samples that may have gone undetected (Table 1, Fig. 2). Recognizing patterns among misidentified samples during the quality control process may not only salvage excluded samples but can potentially identify and correct additional undetected issues.

CONSIDERATIONS FOR GENETIC QUALITY ASSESSMENT
Cluster analyses of flagged problems are a complementary method that can improve current quality assessment protocols to identify patterns and rescue problematic samples (Fig. 1).
Figure 1 Flow chart of standard quality assessment pipeline and cluster analysis of systematic misidentifications in genotype data. (A) The flow chart outlines the current standard quality assessment protocol used for genotype data included in genetic association studies (Anderson et al., 2010; https://meyer-labcshl.github.io/plinkQC/). The data is assessed for the quality of both markers (yellow) and individuals (blue) to filter out data that could affect the reliability of any genetic association. The advantages of the standard quality assurance protocol are that it removes bias and errors that could influence the genetic association to be tested, it is standardized between studies, and there are tools available that make the process computationally feasible. The disadvantage is that mislabeled samples may not be detected if they are within normal ranges on the quality assessment parameters. (B) Adding cluster analysis of quality flags before standard data analysis can fill this gap. Standard individual-based analytical approaches used for cluster analysis (green) include abnormal heterozygosity rate, discordance between genotype-predicted and reported sex, relatedness between samples, and divergent ancestry. The quality flags are clustered by plate to identify systematic patterns of errors and suggest appropriate corrections. The concordance of individual-based analyses for corrected samples is then tested to determine which misidentified samples can be corrected. (C) The advantages and disadvantages of the different analytical approaches are described.

Figure 2 (legend continued) Flipped columns and rows can originate from pipetting errors in the lab (C) or sorting/reshuffling within the data files (D), resulting in incorrect positioning of samples (H) or identities (I), respectively. The final example illustrates a 180° plate rotation (E), causing all samples to be repositioned (J). The misidentified samples detected by conventional QC methods in these examples range from 23% to 34% (F). Samples are colored according to reported sex (pink = female, blue = male). The samples that remain correct after the error occurs are labeled in green; misidentified samples are indicated with black circles; and samples that would be flagged in conventional genetic QC are labeled in red. Sections of rows/columns that are flipped are labeled in gray for original positions and black for new positions.

Clustering patterns of sex discordance
Sex discordance is an effective method for detecting significant systematic issues due to its reliability and frequent availability. Clustering sex discordance by sample handling parameters (e.g., pipetting, plating, batch, sample handler, analytical order) can identify the sources of possible mistakes along with patterns that may assist in deducing the cause of observed discrepancies. For example, we re-examined the genotype quality assurance procedures for a Swedish case-control cohort of multiple sclerosis patients and population-matched controls (International Multiple Sclerosis Genetics Consortium, 2019). By comparing genotype-predicted sex with sex derived from government-issued personal identification numbers, we could confirm sex discordances previously removed as part of the standard protocol. However, further classification by plate showed a clustering of 32 discrepancies within a single plate.
Examining the patterns revealed a flip in the sample layout between columns and rows, leading to systematic misidentification for this plate. The sample identities were then corrected and confirmed using an independent genetic dataset (Olafsson et al., 2017). It is worth noting that only a third of the incorrect samples were initially identified and removed by standard sex concordance analysis, so the remaining misidentified samples would likely have introduced noise into any subsequent association analyses (Fig. 2).
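A row/column flip of the kind described above can be tested by re-scoring sex concordance under candidate layout corrections. The following is a minimal sketch under stated assumptions: a full 96-well plate (8 rows by 12 columns) indexed in row-major order, and hypothetical function names; it is not the procedure used for the cohort in the example, and real plates would also need handling for blanks and missing calls.

```python
ROWS, COLS = 8, 12  # standard 96-well plate, wells indexed 0..95 row by row


def as_recorded(i):
    """No layout error: well i holds the sample listed for well i."""
    return i


def row_column_flip(i):
    """Samples were filled column by column instead of row by row."""
    r, c = divmod(i, COLS)
    return c * ROWS + r


def rotate_180(i):
    """Plate was loaded rotated by 180 degrees."""
    return ROWS * COLS - 1 - i


CANDIDATES = {
    "as_recorded": as_recorded,
    "row_column_flip": row_column_flip,
    "rotate_180": rotate_180,
}


def concordance(reported_sex, genetic_sex, remap):
    """Fraction of wells whose genetic sex matches the remapped reported sex."""
    n = len(genetic_sex)
    hits = sum(genetic_sex[i] == reported_sex[remap(i)] for i in range(n))
    return hits / n


def best_layout(reported_sex, genetic_sex):
    """Score every candidate correction and return the best one with all scores."""
    scores = {name: concordance(reported_sex, genetic_sex, fn)
              for name, fn in CANDIDATES.items()}
    return max(scores, key=scores.get), scores
```

A candidate that restores near-complete concordance on a plate with clustered discordances is a hypothesis to confirm with independent data, as was done in the example, rather than proof of the error.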

Systematic approaches for identifying sample/data mishandling
To facilitate similar analyses, we provide a supplementary Microsoft Excel-based tool where users can assess phenotype and genotype-derived data in a 96-well plate format (Samples Verification and Quality Control Tool, SVAQC; Huang et al., 2021). Data suspected of containing misidentified samples can be analyzed and corrected for a wide range of common sample and data handling mistakes. The concordance rate following each correction may be cross-compared by plate or sample layouts, along with the matching sample identity for each genotype. As illustrated in the example, the sex concordance rate within a plate can help determine systematic errors. However, the observed rate will depend on the type of mistake and the proportion of males and females on the plate. If we assume samples are randomly distributed and have an equal chance of mismatch, the predicted sex concordance rate ($R_{sc}$), given the number ($n_M$/$n_F$) or proportion ($p_M$/$p_F$) of males and females, respectively, on the plate would be:

$$R_{sc} = \frac{n_M^2 + n_F^2}{(n_M + n_F)^2} = p_M^2 + p_F^2$$

This relationship is illustrated in Figure 3. However, inconsistencies may occur in clinical or self-reported sex due to clerical errors. In self-reported sex, inconsistencies may also be due to miscommunication, e.g., reporting gender rather than biological sex, or related to privacy concerns.
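Under the random-shuffle assumption described above, each position on the plate receives a male sample with probability n_M/(n_M + n_F), so the expected concordance works out to (n_M^2 + n_F^2)/(n_M + n_F)^2. The sketch below compares that expectation with a permutation simulation similar in spirit to the one summarized in Figure 3 (92 samples, 1000 shuffles). The function names, the fixed seed, and the percentile-based interval are assumptions of this example, not details taken from the article or the SVAQC tool.

```python
import random


def expected_concordance(n_male, n_female):
    """Expected sex concordance when sample identities are shuffled at random."""
    n = n_male + n_female
    return (n_male ** 2 + n_female ** 2) / n ** 2


def simulate_concordance(n_male, n_female, n_perm=1000, seed=0):
    """Return (median, lower, upper) of the concordance rate over random shuffles.

    lower/upper are the empirical 2.5th and 97.5th percentiles.
    """
    rng = random.Random(seed)
    true_sex = ["M"] * n_male + ["F"] * n_female
    rates = []
    for _ in range(n_perm):
        shuffled = true_sex[:]
        rng.shuffle(shuffled)
        matches = sum(a == b for a, b in zip(true_sex, shuffled))
        rates.append(matches / len(true_sex))
    rates.sort()
    median = rates[len(rates) // 2]
    lower = rates[int(0.025 * len(rates))]
    upper = rates[int(0.975 * len(rates)) - 1]
    return median, lower, upper
```

An observed plate-level concordance that falls inside the simulated interval is consistent with a fully shuffled plate, whereas a value well above it but below 100% suggests that only part of the plate is affected.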

Cross-validating sample identity in genetic data
Other phenotypes derived from genetic data, such as genetic ancestry, may also be useful for determining systematic errors. Genotypes projected onto a known reference, such as the 1000 Genomes Project, can be used to estimate ancestry (Prive et al., 2020). Alternatively, a set of markers can be retyped to confirm the identity of the samples, so-called fingerprinting (Kofanova et al., 2014). The number of SNPs required for fingerprinting depends on the desired accuracy and the allele frequency of each SNP. Based on the simulation in Figure 4, we recommend three to seven SNPs not in linkage disequilibrium and with a minor allele frequency >30%.
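The dependence of fingerprinting on SNP count and allele frequency can be illustrated with a short calculation. Assuming Hardy-Weinberg equilibrium, independent SNPs, and no genotyping error, the probability that two unrelated samples share a genotype at one SNP is the sum of the squared genotype frequencies, and the probability of a full match across several SNPs is the product of these terms. This is a simplified sketch with a hypothetical function name; it is not necessarily the simulation behind Figure 4.

```python
def chance_match_probability(maf, n_snps):
    """Probability that two unrelated samples match at all n_snps genotypes.

    Assumes Hardy-Weinberg equilibrium, independent SNPs sharing the same
    minor allele frequency, and error-free genotype calls.
    """
    q = 1 - maf
    genotype_freqs = (maf ** 2, 2 * maf * q, q ** 2)
    per_snp_match = sum(f * f for f in genotype_freqs)
    return per_snp_match ** n_snps


if __name__ == "__main__":
    for maf in (0.1, 0.3, 0.5):
        for n_snps in (3, 5, 7):
            p = chance_match_probability(maf, n_snps)
            print(f"MAF={maf:.1f}, SNPs={n_snps}: chance match = {p:.4f}")
```

The chance of a spurious match falls as SNPs are added and as the minor allele frequency approaches 50%, which is the qualitative pattern behind preferring common, unlinked SNPs.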
Another approach for identifying major sample misidentifications by plate is to construct polygenic risk scores for known traits, such as the primary disease of study, and analyze the correlation between risk scores and the corresponding phenotype within each plate. Although the risk score is not predictive on an individual level for complex traits, a degree of correlation is expected between the risk score and phenotype, a lack of which may indicate broad sample misidentification within the plate. Similarly, the heritability of known traits can be estimated from the genotype data set and compared to expected estimates. A deviation in estimated heritability would suggest the presence of sample misidentification and the magnitude of the error.
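The per-plate comparison of risk scores and phenotypes could be sketched as follows. The record layout (plate identifier, risk score, phenotype coded numerically, e.g., 0 for control and 1 for case) and the function names are assumptions for illustration; how the risk score itself is constructed is outside the scope of this example.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation; returns None if it is undefined."""
    n = len(x)
    if n < 2:
        return None
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def per_plate_correlation(records):
    """records: iterable of (plate_id, risk_score, phenotype) tuples.

    Returns {plate_id: correlation between risk score and phenotype}.
    """
    scores = defaultdict(list)
    phenotypes = defaultdict(list)
    for plate_id, risk_score, phenotype in records:
        scores[plate_id].append(risk_score)
        phenotypes[plate_id].append(phenotype)
    return {plate: pearson(scores[plate], phenotypes[plate]) for plate in scores}
```

Plates whose correlation is near zero or negative while other plates show the expected positive value would be candidates for closer inspection; with at most 96 samples per plate the estimates are noisy, so this is a screening aid rather than a test.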

Identifying mishandling with sample contamination
An excessive heterozygosity rate may indicate contamination resulting from improper sample handling or overlapping samples during plating. One may identify contaminated plates or systematic errors by clustering samples with high heterozygosity rates by sample handling parameters. For example, dense clustering of samples with a high heterozygosity rate and inconsistencies between samples and genotypes, such as missing data or empty wells, may indicate shifts during sample plating. In such cases, the identities of affected samples may be predicted and corrected during data preprocessing.
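Clustering of quality flags by plate, whether for high heterozygosity or for sex discordance, can be screened with a simple count per plate. The sketch below compares each plate's number of flagged samples against the study-wide flag rate using a one-sided binomial tail probability. The input format, the function names, and the significance cutoff `alpha` are assumptions of this example, not part of the published protocol.

```python
from collections import Counter
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


def plates_with_excess_flags(samples, alpha=0.001):
    """samples: iterable of (plate_id, is_flagged) pairs.

    Returns {plate_id: p_value} for plates with more flagged samples than
    expected from the overall flag rate (alpha is an assumed cutoff).
    """
    totals, flagged = Counter(), Counter()
    for plate_id, is_flagged in samples:
        totals[plate_id] += 1
        flagged[plate_id] += bool(is_flagged)
    overall_rate = sum(flagged.values()) / sum(totals.values())
    excess = {}
    for plate_id, n in totals.items():
        p_value = binomial_upper_tail(flagged[plate_id], n, overall_rate)
        if p_value < alpha:
            excess[plate_id] = p_value
    return excess
```

Because the overall rate includes the affected plates, the screen is somewhat conservative; the same counting can be repeated for other handling parameters such as batch or sample handler.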

STUDY DESIGN AND ANALYTICAL CONSIDERATIONS
Data sets with corrected sample identities should be tested for concordance with independent parameters. Overlapping markers from an independent genotype array would provide the ideal check for individual consistency, although such data are not commonly available. Furthermore, matching individuals based on genotype, even among those that should not overlap, may help identify potential duplication and sample misplacement. The genotype concordance rate between nonduplicated individuals is typically <70%. In comparison, the same individuals or monozygotic twins tend to have >98% observed genotype concordance, with sparse cases in between this range. Therefore, even without a validation dataset, assessing the presence of
Defensive practices and preventative measures can limit mistakes and allow for easier detection of errors. A simple preventive measure is placing blanks in the last two wells of each plate. The asymmetric placement can be useful for identifying systematic mistakes in sample ordering or plate misorientation, including the previous example. Another strategy is to repeat samples between plates as a control measure (Hunter-Zinck et al., 2020), facilitating the correction of misidentified samples due to plate layout errors, although this may be cost-prohibitive. Automated pipetting may provide an appropriate solution to reduce the chance of human error; however, complete automation may not be feasible depending on the situation, as reviewed by Gut et al. (2001). The effort to salvage sample errors can be time-consuming and limited by the data and resources available. However, the further assessment of misidentification patterns and clusters utilizes already generated flags to expose systematic errors that would have remained in the data after standard protocols. As a guide for correcting sample misidentification in future studies, we provide a checklist (Table 2), which accompanies the supplemented tool (SVAQC), for different approaches to assessing the most common causes of misidentification.
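The genotype concordance ranges quoted earlier in this section (typically <70% between different individuals and >98% for the same individual or monozygotic twins) can be applied to pairs of samples as sketched below. Genotypes are assumed to be coded as 0, 1, or 2 copies of an allele with `None` for missing calls; this coding and the function names are assumptions for illustration.

```python
def genotype_concordance(genotypes_a, genotypes_b):
    """Fraction of identical calls over markers called in both samples.

    Genotypes are assumed coded 0/1/2, with None for a missing call.
    Returns None when no marker is called in both samples.
    """
    shared = [(a, b) for a, b in zip(genotypes_a, genotypes_b)
              if a is not None and b is not None]
    if not shared:
        return None
    return sum(a == b for a, b in shared) / len(shared)


def classify_concordance(rate, same_cutoff=0.98, different_cutoff=0.70):
    """Interpret a concordance rate using the ranges quoted in the text."""
    if rate > same_cutoff:
        return "same_individual_or_mz_twin"
    if rate < different_cutoff:
        return "different_individuals"
    return "ambiguous"
```

Pairs in the ambiguous range are expected to be rare and are worth reviewing individually, for example for close relatives or low-quality calls.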

CONCLUDING REMARKS
In conclusion, poor quality of genotyping data can increase the risk of producing spurious genetic associations in exploratory studies, which may be erroneously pursued at the cost of both time and resources. Although standard quality control protocols can identify certain discrepancies, they may often overlook many affected samples. Therefore, along with establishing preventative measures, it is important to actively identify sources of sample misidentification and contamination to ensure the overall quality and reliability of findings.

Figure 2 Sample errors clustered by plate to detect misidentified samples. Examples of plates that illustrate common errors resulting in misidentification. Compared to the original plate (A), duplicate samples (purple) can cause off-by-one shifts (B), resulting in sample misidentification for part of the plate (G, black circles). Flipped columns and rows can originate from pipetting errors (legend continues on next page)

Figure 3 Predicted sex concordance rate. Heatmap illustrating distribution density of sex concordance rate based on the percentage of male/female in the sample population. Distribution was determined by simulated sex shuffling of 92 samples (p = 1000) for each percentage possibility. Solid and dashed lines illustrate the median and 95% confidence interval, respectively.

Figure 4 Predicted genotype discordance rate by number of SNPs. Line plot illustrating the average genotype discordance rate based on the number of compared SNPs and the minor allele frequency for each SNP. Distribution was determined by simulation (1000 samples, p = 100, assuming Hardy-Weinberg equilibrium, no linkage disequilibrium between SNPs).

Table 1
Potential Problems That Can Result in Systematic Misidentification of Samples

Table 2
Checklist for Troubleshooting
