Chances and challenges of machine learning based disease classification in genetic association studies illustrated on age-related macular degeneration

Imaging technology and machine learning algorithms for disease classification set the stage for high-throughput phenotyping and promising new avenues for genome-wide association studies (GWAS). Despite emerging algorithms, there has been no successful application in GWAS so far. We established machine learning based disease classification in genetic association analysis as a misclassification problem. To evaluate chances and challenges, we performed a GWAS based on automated classification of age-related macular degeneration (AMD) in UK Biobank (images from 135,500 eyes; 68,400 persons). We quantified misclassification of automatically derived AMD in internal validation data (images from 4,001 eyes; 2,013 persons) and developed a maximum likelihood approach (MLA) to account for it when estimating genetic association. We demonstrate that our MLA guards against bias and artefacts in simulation studies. By combining a GWAS on automatically derived AMD classification and our MLA in UK Biobank data, we were able to dissect true association (ARMS2/HTRA1, CFH) from artefacts (near HERC2) and to identify eye color as relevant source of misclassification. On this example of AMD, we are able to provide a proof-of-concept that a GWAS using machine learning derived disease classification yields relevant results and that misclassification needs to be considered in the analysis. These findings generalize to other phenotypes and also emphasize the utility of genetic data for understanding misclassification structure of machine learning algorithms.


INTRODUCTION
Imaging technology allows for non-invasive access to detailed disease features in large studies, and genome-wide association studies (GWAS) on such disease phenotypes can be expected to accelerate knowledge gain. However, image-based disease classification can be challenging for large sample sizes due to time-intensive, tiresome manual inspection. This limitation can be overcome by automated disease classification via machine learning and particularly deep learning algorithms. Such emerging approaches [1] can classify diseases effortlessly also for huge sample sizes as needed for GWAS or other -omics approaches.

Deep learning algorithms require enormous input data with available gold standard

specific disease status, where the person-specific disease status is used in the association analysis, and (ii) a person-specific misclassification from a missing disease status in one of the two entities. We thus developed an MLA to account for the fact that we are using an error-prone response $Y^* := \max(Z_1^*, Z_2^*)$, $Z_1^*, Z_2^* \in \{0,1\}$, in the association analysis, while the true disease $Y := \max(Z_1, Z_2)$, $Z_1, Z_2 \in \{0,1\}$, is assumed to follow a logistic regression model.

Details are provided in Appendix A. The general idea of the MLA is to factorize the likelihood of the observed, error-prone response data into two parts, the model for the association between risk factor and true (but in general unobserved) response (true association model) and a model for the misclassification process (misclassification model). We adapted this well-established methodology for analyzing misclassified binary response data [7,8] to the scenario of bilateral disease with a "worse-entity" disease definition (i.e. the person-specific disease status is defined as the status of the worse entity).
Under the assumption of independent misclassification for the observed disease in the two entities $Z_{1i}^*, Z_{2i}^*$ of an individual i, we derive

Linking misclassification theory to machine learning disease classification
We here establish the usage of machine learning derived disease classification in genetic association analyses as a response misclassification problem in logistic regression (Methods). We present a newly developed maximum likelihood approach (MLA) for bilateral diseases like AMD (Methods). This includes two versions: (1) assuming non-differential misclassification (MLA1, i.e. no dependency of misclassification probabilities on the covariate of interest, here the genetic variant) and (2) allowing for differential misclassification (MLA2, i.e. dependency on the covariate of interest). There are existing MLAs for considering response misclassification in logistic regression using internal validation data [7,8]: these MLAs refer to classic diseases where the misclassification is on the person-specific disease status. Our developed approach provides a general framework for bilateral diseases with entity-specific misclassification that propagates to person-specific disease misclassification. Our approach also allows for missing classification in one of two entities, which is a second source of bias in association analyses for bilateral diseases as reported previously [16]. We exemplify our approach on machine learning derived AMD compared to manually graded AMD. Since machine learning algorithms for AMD are trained on images with human manual AMD grading as benchmark, we assume the manual classification to be gold standard.

We evaluated the performance of our developed MLA1 and MLA2 in a simulation study.
By this, we documented substantial bias and lack of type-I error control when the naïve analysis was applied, which was comparable to theory for classic (non-bilateral) diseases [5,7]. We also showed our MLA1 and MLA2 to effectively remove bias and keep type-I error when specified correctly (Table 1, Appendix D, Supplementary Table 1).

AMD in UK Biobank based on automated classification and validation data
We applied a published convolutional neural network ensemble [14] to automatically derive eye- and person-specific AMD classifications for 68,400 UK Biobank participants with fundus images at baseline (135,000 eyes) (Supplemental Table 2a). From this, we derived eye-specific "any AMD" status (i.e. any early AMD stage or advanced AMD versus AMD-free) and person-specific "any AMD" status based on the worse eye (Methods). Among the 68,400 participants, 10,128 were ungradable for AMD in both eyes (i.e. missing person-specific AMD status, 14.8%), 4,870 were classified as "any AMD" and 53,402 as AMD-free (Supplemental Table 2b

To quantify the performance of automated AMD classification, we manually classified AMD in a subset as internal validation data (4,001 images, ≤ 1 image per eye, 2,013 individuals). When comparing automated to manual (true) "any AMD" status, we found an eye-specific sensitivity of 73% and specificity of 90% in the full validation data and a person-specific sensitivity of 77% and specificity of 91% among the participants in the GWAS (Table 2a/b). We found no structural differences between the full validation data and the validation data restricted to the GWAS sample (1,327 individuals, Supplemental Table 3a/b). Both the manual and the automated classification included the category "ungradable". Among the 4,001 eyes, 1,101 were manually ungradable. Of these, the automated classification also yielded 74% as ungradable, but classified 9% as AMD and 17% as AMD-free, which raises concerns about these classifications. In summary, we found the automated classification to yield reasonable, but error-prone results.

GWAS on automated AMD classification in naïve analysis identifies two loci
While we have some idea about the extent of the misclassification from validation data and about its impact on genetic association estimates from simulations, it is unclear whether the automated "any AMD" classification is "good enough" for GWAS. We conducted a GWAS for person-specific automatically derived "any AMD" in UK Biobank (3,544 "any AMD" cases; 44,521 controls) applying standard logistic regression, i.e. without accounting for misclassification (naïve analysis). We found 53 variants with genome-wide significance (PGC < 5.0×10⁻⁸) spread across two

Among the reported lead variants of the 34 advanced AMD loci [10], we had ≥80% power to detect 21 of these with nominal significance (Supplemental Table 5). When comparing effect sizes of these 21 variants from this analysis on "any AMD" in UK Biobank with reported effect sizes for advanced AMD, we found 15 with directional consistency (PBin=0.078) and 7 with directionally consistent nominal significance (PBin=4.9×10⁻⁵; Figure 3a, Supplemental Table 4c). The overall smaller effect sizes for automated "any AMD" compared to reported effect sizes for advanced AMD can be explained by a bias from misclassified automated AMD and by smaller effect sizes for early AMD merged into the definition of "any AMD". For the other 13 of the 34 variants, we refrained from interpreting results due to lack of power in this analysis (Supplemental Table 4c). Results were similar when adjusting for 20 instead of 2 genetic principal components (data not shown).
While the yield of only a few known AMD signals in this UK Biobank GWAS may be disappointing, this is not fully unexpected given an effective sample size [25] of 13,130 and a power estimate of ~80% (assuming no misclassification and reported effect sizes) to detect associations with genome-wide significance for only 4 of the 34 established variants (CFH, ARMS2/HTRA1, C3, C2/CFB/SKIV2L, Supplemental Table 5).
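
The summary numbers quoted in this section can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions that the text does not state: the effective sample size is taken as 4 / (1/n_cases + 1/n_controls), the directional-consistency test as a two-sided binomial test with null probability 0.5, and the test for directionally consistent nominal significance as a one-sided binomial test with null probability 0.05. Under these assumed definitions, the results come close to the reported 13,130, 0.078 and 4.9×10⁻⁵. All function names are hypothetical.

```python
from math import comb


def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Assumed definition: 4 / (1/n_cases + 1/n_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


# Case and control counts of the UK Biobank GWAS as given in the text.
n_eff = effective_sample_size(3544, 44521)

# 15 of 21 variants directionally consistent; two-sided test, null p = 0.5 (assumed).
p_direction = min(1.0, 2 * binom_upper_tail(15, 21, 0.5))

# 7 of 21 variants directionally consistent and nominally significant;
# one-sided test with null p = 0.05 (assumed).
p_nominal = binom_upper_tail(7, 21, 0.05)

print(f"effective sample size: {n_eff:.1f}")
print(f"P (directional consistency): {p_direction:.3f}")
print(f"P (consistent and nominally significant): {p_nominal:.2e}")
```

With roughly twelve controls per case, the effective sample size is driven almost entirely by the number of cases, which is consistent with the limited power noted above.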

Applying the developed MLA to account for misclassification for selected variants
Due to our simulation results and theory [5,7], we expected our GWAS on automated (error-prone) AMD to yield biased estimates and, when the misclassification was differential towards the genetic variant, even potentially false signals. We applied our developed MLAs for 26 selected variants: (i) the 3 lead variants detected here with (near) genome-wide significance (CFH:

Interestingly, our HERC2 lead variant, rs12913832, is precisely the variant for which the G allele was considered causal for blue eyes [26]. We were able to support this in our AugUR [27,28] study (n=1026; reported "light eye color" for 14%, 36%, or 97% of participants with A/A, G/A, or G/G, respectively). Eye color is discussed as an AMD risk factor, but the debate centers on blue eyes increasing risk due to increased susceptibility to UV radiation [29], which contrasts with our observation of brown eyes increasing AMD risk and makes this finding challenging to interpret. It was interesting to see the HERC2 rs12913832 association vanish when accounting for rs12913832-associated misclassification. This was in line with the observed strong association of the specificity with this variant (ORspec=0.64 per A allele, Supplemental Table 6a) resulting in 3.0%, 1.9%, or 1.2% of false-positive AMD classifications among persons with A/A, A/G, or G/G, respectively. This notion of a larger misclassification among A/A versus G/G individuals was further supported by the larger fraction of manually ungradable images that were deemed gradable by the automatic classification among A/A versus G/G (54.5% versus 38.8%,

for A/A had a darker appearance than those for A/G or G/G (Figure 4), which we were able to quantify by means of average gray level per image of 46.4, 49.0, or 53.6, respectively.
Therefore, the HERC2 signal appeared to be an artefact due to a larger misclassification for brown eyes linked to darker fundus images. One may hypothesize that the darker eye color had reduced light exposure during fundus photography, which gave rise to darker images and more misclassified AMD-free eyes. The notion of a differential misclassification due to eye color was further supported by the fact that the full HERC2 signal disappeared by modelling a misclassification

imaging-based disease, to our knowledge. We here present a GWAS on machine learning derived AMD in UK Biobank highlighting chances and challenges. By this GWAS on AMD combined with an evaluation of emerging genetic signals via our newly developed MLA, we were able to detect known AMD loci and to distinguish true loci from artefacts.

Such artefacts, i.e. false positives, can derive from a misclassification that is associated with a genetic variant. Our data and analyses provide a compelling example for such an artefact: our MLA revealed the HERC2 signal as a false positive signal and suggested darker eye color and darker fundus images as a relevant source of misclassification for this machine learning algorithm. It is conceivable that the misclassification process of other algorithms for AMD and for other image-based diseases will depend on one or another characteristic as well, and that such a characteristic is picked up by some genetic variants due to the abundant range of genetically pinpointed characteristics (see e.g. NHGRI-EBI GWAS Catalog [31]), which can yield artefact signals when left unaccounted for.

Our MLA, developed for bilateral diseases, not only quantifies the misclassification and the dependencies, but also guards against bias and artefacts in association analyses. Similar approaches are available for classic diseases [7,8].
Thus, this concept can be generalized to other algorithms and other image-based diseases. Our work here links the theory of misclassification to machine learning derived disease classification, which can be generalized also to measurement error and quantitative phenotypes.

We recommend a GWAS combined with a post-GWAS evaluation of emerging genetic effects for non-differential and differential misclassification not only to search for GWAS signals on image-based, machine-learning derived disease phenotypes. We also recommend such a GWAS as a quality control for diseases like AMD, where strong genetic signals are known: a GWAS on AMD ascertained by any classification approach, manual or automatic, should be able to detect at least the two strong known signals around ARMS2/HTRA1 and CFH. When a GWAS does not detect these signals, this indicates issues that can range from mismatched biosamples and analytical errors to imperfect disease ascertainment, such as from machine learning algorithms as highlighted here. A GWAS can be a quick guide towards phenotype classification quality when genomic data is available.

Overall, we illustrate chances and challenges of machine learning derived disease classification in GWAS, and the applicability of our MLA to guard against bias and artefacts.

Appendices
Appendix A. MLA to adjust for response misclassification in bilateral diseases.
We developed an MLA to adjust for response misclassification from an error-prone, entity-specific disease classification in bilateral diseases. Here we illustrate it based on the example of age-related macular degeneration, where AMD can occur in each eye (eye-specific AMD) and the person-specific binary outcome is defined as worse-eye outcome, i.e. "AMD in at least one eye", and modeled using logistic regression. We assume that we have an error-prone, eye-specific AMD classification (e.g. from a machine-learning based automated classification) available for nearly all eyes and true, gold-standard classifications (e.g. manual classification) for a subset of individuals from validation data.

Let $(Z_{1i}, Z_{2i}) \in \{0,1\}^2$ be the true, binary disease stages in the two eyes of study participant $i$, i.e. $(Z_{1i} = 1, Z_{2i} = 0)$ means that participant $i$ suffers from AMD in the left eye and is unaffected by AMD in the right. When estimating the association of person-specific risk factors with AMD, one often defines a binary person-specific disease status as worse-entity AMD, $Y_i := \max(Z_{1i}, Z_{2i})$, $Z_{1i}, Z_{2i} \in \{0,1\}$, and uses logistic regression to estimate the association of some covariates $X$ with AMD: the person-specific disease status $Y_i$ equals 1 if at least one eye of individual $i$ is classified as AMD, and $Y_i$ equals 0 if both eyes are unaffected. As described previously [16], such a worse-eye disease status can be misclassified for two reasons: either because of missing disease information in one of two eyes (in this case disease can be overlooked), or because of error-prone disease status for any of the two eyes.
Here we assume that we observed an error-prone, eye-specific disease status $(Z_{1i}^*, Z_{2i}^*)$ for each of the two eyes of a "main study" participant $i$ and additionally the true disease status in each of the two eyes $(Z_{1j}, Z_{2j})$ for a subset of study participants $j$ from the "validation study". For all participants from the main study (error-prone classifications only) or the validation subset (error-prone and true classification), there is the additional issue that the disease information can be missing in one of two eyes, because of missing or ungradable fundus images. Since the automated (error-prone) and manual (gold standard, "true") classification may judge differently on whether an image is gradable or ungradable, any possible subset of $(Z_{1i}, Z_{2i}, Z_{1i}^*, Z_{2i}^*)$ might be the available information for a specific study participant. To obtain valid estimates for the association of covariates with the true AMD status, we set up a likelihood based on the conditional probabilities of the observed error-prone and/or true eye-specific disease classifications given covariates. The product of these conditional probabilities over all individuals forms the likelihood, which has to be numerically optimized with respect to the regression parameters to obtain estimates. The different likelihood contributions for the individuals depend on the available AMD classifications (true and/or error-prone for one or both eyes).

The general problem of response misclassification when AMD information is missing in one of two eyes and/or the eye-specific classification suffers from misclassification with known classification probabilities has already been evaluated in a previous publication [16]. There, we also derived the corresponding likelihood contributions for the different scenarios of available outcome data.
Here, we add the aspect that validation data is available for some study participants or, more specifically, a collection of error-free (gold-standard) classified single eyes, and that we model the eye-specific misclassification process based on information from this validation data. In the following, we describe the general idea and provide formulas for the respective likelihood contributions:

The assumed logistic regression model for the true worse-eye disease corresponds to the assumption that $\max(Z_{1i}, Z_{2i}) = Y_i \sim \mathrm{Bernoulli}(\pi_i)$, where we model the success probability based on a linear predictor via $\pi_i = 1/(1 + \exp(-x_i'\beta)) = \mathrm{Logist}(x_i'\beta)$; $x_i$ is a vector of observed person-specific covariates and $\beta$ the vector of corresponding regression coefficients. It follows that

to be affected in the left but not the right eye and vice versa), the conditional probability mass function of the two-entity disease status distribution can be written concisely as

which specifies the true data model. If we look at a single eye selected randomly from both eyes, we can derive (without loss of generality for $Z_{1i}$):

$$P(Z_{1i} = 1 \mid x_i) = P(Z_{1i} = 1, Z_{2i} = 1 \mid x_i) + P(Z_{1i} = 1, Z_{2i} = 0 \mid x_i) = \left(\tfrac{1}{2} + \tfrac{1}{2}\delta_i\right)\pi_i \qquad (2)$$

We now assume that we observed potentially misclassified single eye disease stages $(Z_{1i}^*, Z_{2i}^*)$ for each participant and describe the misclassification process based on the sensitivity and specificity of the classification,

$$P(Z_{li}^* = 1 \mid Z_{li} = 1, x_i) = \pi_{1i}$$
$$P(Z_{li}^* = 0 \mid Z_{li} = 0, x_i) = \pi_{0i}, \qquad (3)$$

with $l = 1,2$; $\pi_{1i}$ and $\pi_{0i}$ are the person-specific sensitivity and specificity from the eye-specific classification process.
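
The two ingredients defined so far, the true data model and the eye-specific classification probabilities, can be combined into the likelihood contribution of a main-study participant with two error-prone classifications. The following is a minimal sketch, not the authors' implementation. It assumes a joint distribution of the two true eye statuses that is consistent with the single-eye marginal (2), namely both eyes affected with probability $\delta_i \pi_i$ and each single eye alone with probability $(1-\delta_i)\pi_i/2$, where $\delta_i$ is the probability of AMD in both eyes given at least one affected eye. It further assumes, as stated in Methods, that the two eyes are classified independently given their true status. All function names are hypothetical.

```python
from itertools import product


def true_eye_pmf(z1: int, z2: int, pi: float, delta: float) -> float:
    """P(Z1 = z1, Z2 = z2 | x) under the assumed worse-eye data model."""
    if z1 == 1 and z2 == 1:
        return delta * pi            # both eyes affected
    if z1 == 0 and z2 == 0:
        return 1.0 - pi              # person unaffected
    return 0.5 * (1.0 - delta) * pi  # exactly one eye affected, left/right symmetric


def classification_prob(z_obs: int, z_true: int, sens: float, spec: float) -> float:
    """P(Z* = z_obs | Z = z_true) from eye-specific sensitivity and specificity."""
    if z_true == 1:
        return sens if z_obs == 1 else 1.0 - sens
    return spec if z_obs == 0 else 1.0 - spec


def observed_pair_prob(z1_obs: int, z2_obs: int, pi: float, delta: float,
                       sens: float, spec: float) -> float:
    """P(Z1* = z1_obs, Z2* = z2_obs | x): sum over the four unobserved true states."""
    return sum(
        classification_prob(z1_obs, z1, sens, spec)
        * classification_prob(z2_obs, z2, sens, spec)
        * true_eye_pmf(z1, z2, pi, delta)
        for z1, z2 in product((0, 1), repeat=2)
    )


# Illustrative (hypothetical) parameter values.
print(observed_pair_prob(1, 0, pi=0.3, delta=0.7, sens=0.8, spec=0.9))
```

In a full analysis, `pi`, `delta`, `sens` and `spec` would each be a logistic function of a linear predictor, and the log of such contributions would be summed over individuals and maximized numerically.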
We assume that the eye-specific classification process within an individual is independent in the two eyes, i.e.:

$$P(Z_{1i}^* = z_{1i}^*, Z_{2i}^* = z_{2i}^* \mid Z_{1i} = z_{1i}, Z_{2i} = z_{2i}, x_i) = P(Z_{1i}^* = z_{1i}^* \mid Z_{1i} = z_{1i}, x_i) \times P(Z_{2i}^* = z_{2i}^* \mid Z_{2i} = z_{2i}, x_i).$$

Based on the true data model and the description of the misclassification process via sensitivity and specificity, we can now express the conditional probabilities of all combinations of observed outcomes, by using Bayes' rule and the law of total probability. If all four AMD classifications were observed for an individual (individual with full validation data, true and error-prone disease status for each of the two eyes), we can derive the following (omitting a random variable notation and only using the small z's for the observed data):

$$P(z_{1i}^*, z_{2i}^*, z_{1i}, z_{2i} \mid x_i) = P(z_{1i}^*, z_{2i}^* \mid z_{1i}, z_{2i}, x_i) \times P(z_{1i}, z_{2i} \mid x_i)$$
$$= P(z_{1i}^* \mid z_{1i}, x_i) \times P(z_{2i}^* \mid z_{2i}, x_i) \times P(z_{1i}, z_{2i} \mid x_i).$$

Here, we factorize the conditional probability of the observed data into terms of the eye-specific classification process (depending on sensitivity or specificity when the observed true outcome $z_{li}$ is 1 or 0, respectively, (3)) and the true data model (1). If only the two eye-specific error-prone classifications are observed (individual in the main study, not part of the validation subset), the law of total probability can be used and the conditional probability can be expressed as

$$P(z_{1i}^*, z_{2i}^* \mid x_i) = \sum_{z_{1i}, z_{2i} \in \{0,1\}} P(z_{1i}^*, z_{2i}^* \mid z_{1i}, z_{2i}, x_i) \times P(z_{1i}, z_{2i} \mid x_i)$$
$$= \sum_{z_{1i}, z_{2i} \in \{0,1\}} P(z_{1i}^* \mid z_{1i}, x_i) \times P(z_{2i}^* \mid z_{2i}, x_i) \times P(z_{1i}, z_{2i} \mid x_i),$$

This again yields an expression that depends on the eye-specific classification probabilities (3) and the true data model (1).

If only a classification for one error-prone outcome was observed (e.g.
$Z_{1i}^* = z_{1i}^*$), the conditional probability is given by

$$P(z_{1i}^* \mid x_i) = \sum_{z_{1i} \in \{0,1\}} P(z_{1i}^* \mid z_{1i}, x_i) \times P(z_{1i} \mid x_i),$$

where the first term in each summand depends on the specificity and the sensitivity of the eye-specific observation process; an expression for the second was already given above (equation (2)).

When three classifications were observed, e.g. $(Z_{1i}, Z_{1i}^*, Z_{2i}^*)$ or $(Z_{1i}, Z_{2i}, Z_{1i}^*)$, we can derive

classification $P(Z_{li}^* = 0 \mid Z_{li} = 0, x_i) = \pi_{0i}$, can potentially vary with person-specific characteristics. We therefore decided to model them based on the logistic function of a linear predictor, where relevant covariates (characteristics) can be specified for each probability. Combining all these expressions, we can set up the whole likelihood based on the derived conditional probabilities and numerically optimize with respect to the regression coefficients of the linear predictors for $\pi_i$,

To sample data mimicking studies on AMD with internal validation data, we performed the following steps:
1) We sampled the true binary "worse-eye" AMD data Y for 5000 individuals by sampling from a Bernoulli distribution, where we modelled the success probability based on the logistic function of a linear predictor (corresponding to the assumed data generating process in logistic regression). For the linear predictor, we used an intercept of -0.25 (corresponding to an average probability of person-specific AMD of ~0.44) and a continuous standard normal covariate X. We varied the log OR of X on Y between zero (simulation under H0 of no effect) and one.
2) To create the true eye-specific disease data (two binary observations per individual, $(Z_1, Z_2)$) we specified the conditional probability of being affected in both eyes given disease in at least one eye (i.e. Y = 1 based on the "worse-eye" definition), δ, to be (on average) δ = 1/(1 + exp(−1)) = 0.73.
We assumed this probability to be either constant or varying with the continuous covariate X based on formula δ = 1/(1 + exp(−(1 + 1 × X))) = Logist(1 + 1 × X). For all individuals with sampled Y=1, we sampled a Bernoulli variable based on probability δ, to decide whether they were affected in both eyes or not. If they were affected in only one eye, we sampled randomly from the left or right.
3) To mimic the situation of missing information in one of two eyes, we sampled a Bernoulli random variable for each individual based on a fixed success probability (e.g. 0.75), to indicate whether information on both eyes was available. If not, we removed the disease information from a randomly selected eye.
4) To obtain eye-specific error-prone outcome data $(Z_1^*, Z_2^*)$, we conditioned on the true, sampled observations $(Z_1, Z_2)$, and sampled the error-prone outcomes based on specified classification probabilities, the sensitivity $P(Z^* = 1 \mid Z = 1)$ and specificity $P(Z^* = 0 \mid Z = 0)$.

for both eyes, we defined $Y_{obs}^* = \max(Z_1, Z_2)$ or $Y_{obs}^* = \max(Z_1^*, Z_2^*)$, respectively; for observations with information only on one eye $Z_1$, we used $Y_{obs}^* = Z_1$ or $Y_{obs}^* = Z_1^*$. For individuals from the validation data with information on both eyes, $Y_{obs}^* = \max(Z_1, Z_2)$ corresponds to the true Y; for all others, $Y_{obs}^*$ might be misclassified.

For each sampled dataset we estimated three models: 1) standard logistic regression based on the error-prone naïve worse-entity outcome $Y_{obs}^*$, 2) the derived MLA (see above) modelling the probability of person-specific AMD and the probability of AMD in both eyes given AMD in at least one eye, δ, based on covariate X, while assuming a constant eye-specific sensitivity and specificity and accounting for missing information in one of two eyes (MLA1), and 3) the derived MLA allowing for a dependency of sensitivity and specificity on X (MLA2).
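
The data-generating steps 1) to 4) above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the simulation code used for the reported results: δ is held constant at Logist(1), sensitivity and specificity do not depend on X (non-differential misclassification), missingness is applied to the error-prone classification only, and no validation subset is drawn. The function names, the default log OR of 0.5 and the default sensitivity and specificity of 0.9 are illustrative choices.

```python
import random
from math import exp


def logist(eta: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + exp(-eta))


def simulate_person(rng, beta_y=0.5, sens=0.9, spec=0.9, p_both_eyes=0.75):
    x = rng.gauss(0.0, 1.0)  # standard normal covariate X

    # Step 1: true worse-eye status Y from a logistic model with intercept -0.25.
    y = int(rng.random() < logist(-0.25 + beta_y * x))

    # Step 2: true eye-specific status; delta = P(both eyes affected | Y = 1).
    z = [0, 0]
    if y == 1:
        if rng.random() < logist(1.0):
            z = [1, 1]
        else:
            z[rng.randrange(2)] = 1  # one randomly chosen eye is affected

    # Step 4: error-prone classification, independent for the two eyes.
    z_star = [int(rng.random() < (sens if zt == 1 else 1.0 - spec)) for zt in z]

    # Step 3: with probability 1 - p_both_eyes, one randomly chosen eye is missing.
    if rng.random() >= p_both_eyes:
        z_star[rng.randrange(2)] = None

    # Naive worse-eye outcome from the available error-prone classifications.
    y_obs = max(v for v in z_star if v is not None)
    return {"x": x, "y": y, "z": z, "z_star": z_star, "y_obs": y_obs}


def simulate_study(n=5000, seed=1, **kwargs):
    rng = random.Random(seed)  # seeded for reproducibility
    return [simulate_person(rng, **kwargs) for _ in range(n)]


data = simulate_study()
misclassified = sum(d["y"] != d["y_obs"] for d in data) / len(data)
print(f"fraction with misclassified person-specific status: {misclassified:.3f}")
```

Fitting a standard logistic regression of `y_obs` on `x` to such data would correspond to the naïve analysis. The MLA would instead maximize a likelihood built from the eye-specific classifications in `z_star`.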

Appendix C. Power analysis for reported lead variants based on UK Biobank sample size.
We wanted to evaluate the impact of using the MLA on selected variants including the 34 reported lead variants known for their association with advanced AMD. Given reported effect sizes and effect allele frequencies (EAF), we expected the power to detect some of these 34 associations to be limited in a sample size of approximately 3,500 cases (and more controls). Therefore, we aimed to assess the power to detect reported genetic associations for AMD in the available data of UK Biobank, to focus our analyses with the MLA only on adequately powered reported associations and to avoid overinterpreting results from underpowered analyses. It is, however, not fully straightforward how to compute power for the scenario of "any AMD" from machine learning based disease classification, due to the power-diminishing effect of misclassification and some uncertainty about what effect size to use. We chose to use the reported [10] EAFs in advanced AMD cases and AMD-free controls for the established 34 lead variants and computed the power for a t-test on EAFs for differently sized groups, given the 3,544 cases and 44,521 controls derived from the automated "any AMD" classification in the UK Biobank GWAS data

se diff = √ n case ×eaf case ×(1−eaf case )+n contr ×eaf contr ×(1−eaf contr ) n case + n contr .

Based on these power calculations, we selected all lead variants with at least 80% power to yield nominally significant associations in UK Biobank. By this, we made the assumptions that EAFs in advanced AMD cases are transferable to EAFs of "any AMD" cases and that no misclassification was present in the machine learning derived "any AMD" classification. Therefore, this is probably an overestimate of available power. We performed the power analysis, however, mainly to dismiss variants with an obvious lack of power, while trying to include as many variants as reasonable in our analyses using the MLA.
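
To illustrate the kind of calculation described above, the following sketch computes the power of a two-sided test for a difference in effect allele frequency between cases and controls. It deliberately swaps in a standard two-sample normal approximation, with each person contributing two alleles, rather than the t-test and standard error expression used in the source, so its results need not match Supplemental Table 5. The function name and the example allele frequencies are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_allele_freq_test(eaf_case: float, eaf_control: float,
                           n_case: int, n_control: int,
                           alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test on the EAF difference.

    Assumes 2 * n alleles per group and a normal approximation.
    """
    norm = NormalDist()
    se = sqrt(eaf_case * (1 - eaf_case) / (2 * n_case)
              + eaf_control * (1 - eaf_control) / (2 * n_control))
    z_crit = norm.inv_cdf(1 - alpha / 2)
    shift = abs(eaf_case - eaf_control) / se
    # Probability that the test statistic exceeds the critical value in either tail.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Group sizes from the UK Biobank GWAS; allele frequencies are made-up examples.
for eaf_case in (0.31, 0.32, 0.33):
    pw = power_allele_freq_test(eaf_case, 0.30, n_case=3544, n_control=44521)
    print(f"EAF cases {eaf_case:.2f} vs controls 0.30: power = {pw:.2f}")
```

A variant would pass the selection described above if its power at the nominal level (alpha = 0.05) reached at least 0.8.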
increasingly missing AMD in one of the two eyes, and (iv) a larger bias by decreased specificity than by decreased sensitivity (Table 1, Supplemental Table 1).

In logistic regression, the larger the misclassification probabilities, the larger the bias of estimates [5], with similar influence of increased probabilities for false-positive and false-negative classifications for balanced data. In the following, we provide an explanation of the findings (iii) and (iv) for bilateral diseases from above. Finding (iii) is explained by the fact that an increased fraction of missing eyes implies a reduced sensitivity for person-specific AMD: AMD in the missing eye can be overlooked, which can lead to a false-negative person-specific AMD classification if only the missing eye of an individual is affected. Finding (iv) was that decreased specificity had larger impact on bias than decreased sensitivity, e.g. for (sens, spec)=(0.9, 0.9) and a fraction of 25% of individuals with "missing eyes" and a true log OR of X on Y of 1 the observed bias was -0.27. When the sensitivity was reduced to 0.8 (specificity=0.9), the bias increased (in absolute value) to -0.32; when the specificity was reduced to 0.8 (sensitivity=0.9), the bias increased to -0.39. This can be explained by rewriting the probability of misclassification in the worse-entity outcome, $P(Y^* \neq Y)$, as

$$P(Y^* \neq Y) = P(Y^* = 1 \mid Y = 0)P(Y = 0) + P(Y^* = 0 \mid Y = 1)P(Y = 1)$$
$$= P(\max(Z_1^*, Z_2^*) = 1 \mid Z_1 = 0, Z_2 = 0)P(Y = 0) + P(Z_1^* = 0, Z_2^* = 0 \mid \max(Z_1, Z_2) = 1)P(Y = 1)$$

towards disease, and falsely classifying only one of two healthy entities towards disease is sufficient to misclassify the person-specific disease status.
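
The asymmetry between specificity and sensitivity can be made concrete with a small calculation. The sketch below is a simplified illustration, assuming that both eyes are observed, that the two eyes are classified independently, and that sensitivity and specificity are constant. It does not reproduce the part of the derivation that is missing above. Under these assumptions, a healthy person is misclassified if at least one of two healthy eyes is falsely classified as diseased, whereas a diseased person is misclassified only if every affected eye is missed and any healthy eye is correctly classified. The function name is hypothetical.

```python
def person_level_error_rates(sens: float, spec: float) -> dict:
    """Person-specific error rates implied by eye-specific sensitivity/specificity."""
    return {
        # P(Y* = 1 | both eyes healthy): at least one eye falsely positive.
        "false_positive": 1.0 - spec**2,
        # P(Y* = 0 | exactly one eye affected): miss that eye, other eye correct.
        "false_negative_one_eye": (1.0 - sens) * spec,
        # P(Y* = 0 | both eyes affected): miss both eyes.
        "false_negative_both_eyes": (1.0 - sens) ** 2,
    }


for sens, spec in [(0.9, 0.9), (0.8, 0.9), (0.9, 0.8)]:
    rates = person_level_error_rates(sens, spec)
    summary = ", ".join(f"{k}={v:.2f}" for k, v in rates.items())
    print(f"sens={sens}, spec={spec}: {summary}")
```

Lowering the specificity from 0.9 to 0.8 raises the person-specific false-positive rate from 0.19 to 0.36, while lowering the sensitivity by the same amount raises the false-negative rate for a person with one affected eye only from 0.09 to 0.18. This is in line with the larger bias observed for decreased specificity.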
When applying the MLA1, we found it to effectively correct for bias and to yield the expected confidence interval coverage rates (~95%) when the misclassification was non-differential, but we found it to still result in biased estimates and excess type-I error when the misclassification was differential (

For estimating sensitivity and specificity, we found the following: (i) for the 3 lead variants from this GWAS (CFH, ARMS2/HTRA1, or HERC2, respectively), the MLA1-derived sensitivity and specificity (at mean age and two copies of the non-effect allele) showed only small differences between the 3 variants (sensitivity = 65%, 67%, 63%; specificity = 98%, 98%, 99%, respectively, Supplemental Table 6a). From a model without including a genetic covariate, we obtained an overall sensitivity of 64.5% (95%-CI: 60.1%, 68.7%) and a specificity of 98.6% (98.4%, 98.8%). (ii) We did not find strong evidence for associations with age using MLA1 or MLA2 based on any of the 26 selected variants, except for an association of the specificity with age based on MLA1 for the HERC2 variant that disappeared when applying MLA2 (age-P=6.71×10⁻⁹ or 0.70, respectively, Supplemental Table 6a

estimates for automatically derived "any AMD" from UK Biobank without and with accounting for misclassification. We selected the 21 reported AMD lead variants, for which we had ≥80% power to detect them in this UK Biobank sample size with nominal significance.
Shown are log OR effect estimates and 95% confidence intervals reported for advanced AMD on the x-axis versus UK Biobank estimates for automatically derived "any AMD" on the y-axis from a) the naïve analysis (logistic regression ignoring misclassification), b) MLA1, and c) MLA2.

Figure 4. Evidence for differential misclassification in automatically derived AMD with respect to the HERC2 variant rs12913832. Shown are (i) estimated odds ratios from the naïve analysis ignoring misclassification and various characteristics per genotype group: (ii) the fraction of persons with self-reported "light eye color" in the AugUR study, (iii) randomly selected fundus images in UKBB, (iv) image lightness quantified by the mean of the per-image average gray level, (v) proportion of false-positive AMD in the automated classification (1-specificity) and 95% confidence interval estimated via MLA2, and (vi) observed proportion of manually ungradable images that were deemed gradable by the algorithm and classified as "any AMD" or "AMD-free".

Table 1. Simulation results on effect estimates and empirical type-I error in naïve and MLA-analysis. We evaluated the performance of naïve and MLA analysis of a quantitative covariate X and a binary bilateral disease Y, e.g. person-specific AMD, simulating various scenarios. For each scenario, we sampled 1000 data sets of 5000 individuals each, 4000 with only error-prone eye-specific AMD classification, and 1000 with additional true AMD classification. Shown are performance measures from three models, naïve analysis, MLA1, or MLA2 assuming non-differential/differential misclassification regarding X, respectively, in various simulation scenarios. For the eight scenarios shown here, we assumed no association of X with δ, the probability of AMD in both eyes given ≥1 affected eye; results were similar when modelling an association of X with δ, see Supplemental Table 1. For each model and scenario, we report mean effect estimates $\hat{\beta}_Y$, log OR per unit increase in standard-normal X, over all simulation runs, and the associated root mean squared error (RMSE), fraction of nominally significant effect estimates (% with P<0.05), and coverage frequencies of 95%-confidence intervals.
% Sens/Spec = average sensitivity and specificity of error-prone, eye-specific AMD classification; %miss. = fraction of randomly selected individuals with missing AMD classification in one of two eyes; βY = log OR of X on true AMD, βsens = log OR of X on the sensitivity or βspec = log OR of X on the specificity of the eye-specific misclassification process, respectively.

Table 2. Confusion matrices comparing manual and automated AMD classification per eye and per person. Shown are absolute numbers and conditional classification probabilities, i.e. in row i and column j, P(automated = j | manual = i) as %, with i, j = "Ungradable", "No AMD", "Any AMD": a) for all eyes in the validation data; 4,001 eyes of 2,013 individuals. b) For all persons in the overlap between validation data and GWAS; 1,327 persons.
a) per eye (4,001 eyes, 2,013 individuals)