Suitability of Machine Learning for Atrophy and Fibrosis Development in Neovascular Age-Related Macular Degeneration

Anti-VEGF therapy has reduced the risk of legal blindness in neovascular age-related macular degeneration (nAMD), but many patients still develop fibrosis or atrophy in the long term. Although recent statistical analyses have associated genetic, clinical and imaging biomarkers with the prognosis of patients with nAMD, no studies on the suitability of machine learning (ML) techniques have been conducted. We perform an extensive analysis of the use of ML to predict fibrosis and atrophy development in nAMD patients at 36 months from the start of anti-VEGF treatment, using only data from the first 12 months. We use data collected according to real-world practice, which include clinical and genetic factors. The ML analysis consistently found ETDRS to be relevant for the prediction of atrophy and fibrosis, confirming previous statistical analyses, while genetic variables did not show statistical relevance. The analysis also reveals that predicting the development of either condition is a complex task given the available data, obtaining in the best case a balanced accuracy of 63% and an AUC of 0.72. The lessons learnt during the development of this work can guide future ML-based prediction tasks within the ophthalmology field and help design the data collection process.


| INTRODUCTION
Age-related macular degeneration (AMD) is a progressive chronic disease whose advanced forms, such as neovascular AMD (nAMD), can lead to severe and irreversible vision loss. Neovascular AMD is characterized by macular neovascularization (MNV), which can progress to subretinal fibrosis and macular atrophy (Ferris III et al., 2013; Spaide et al., 2020). Subretinal macular fibrosis is the result of an excessive wound healing response that follows MNV in nAMD and can produce local destruction of photoreceptors, retinal pigment epithelium (RPE) and choroidal vessels (Ishikawa et al., 2016). On the other hand, macular atrophy is characterized by atrophic lesions of the outer retina, RPE and underlying choriocapillaris, and it is usually found in patients with long-standing nAMD (Bhisitkul et al., 2015). Both atrophy and fibrosis can cause permanent macular dysfunction, legal blindness or inability to perform routine activities such as reading or facial recognition (Sadda et al., 2020).
Advances in diagnostic techniques and anti-vascular endothelial growth factor (anti-VEGF) therapy have helped to reduce AMD-related legal blindness in some countries, together with its increasing social and emotional impact (Mehta et al., 2018; Moreno, 2016). However, some patients do not achieve a satisfactory long-term response with current treatment, developing atrophy and fibrosis, and the need for frequent intravitreal injections and ophthalmological visits places a significant burden on patients, their families and healthcare professionals (Spooner et al., 2018). Some genetic, clinical and imaging biomarkers have been associated with the anatomical and functional prognosis of patients with nAMD and may help in the planning of individualized anti-VEGF therapies (Caire et al., 2014; García-Layana et al., 2014; Guymer et al., 2019; Lai et al., 2019; Llorente-González et al., 2022; Martínez-Barricarte et al., 2012). One of the imaging biomarkers that has been widely studied in nAMD in recent years is retinal fluid visualized on optical coherence tomography (OCT), both after the loading phase of anti-VEGF treatment and during long-term follow-up. The subretinal location of this fluid seems to be related to better visual prognosis and less atrophy and fibrosis formation, while intraretinal fluid has been associated with more macular fibrosis and worse vision in the long term (Llorente-González et al., 2022; Saenz-de-Viteri et al., 2021; Schmidt-Erfurth & Waldstein, 2016).
The increasing sophistication of imaging systems, networking and software analysis is making it possible to introduce artificial intelligence, such as machine learning (ML), into medical diagnostics, especially for retinal pathologies (Cao et al., 2021; Quellec et al., 2019). Nevertheless, in none of the aforementioned studies have ML techniques been analysed to predict the outcome of nAMD patients undergoing anti-VEGF treatment.
Hence, in this work we evaluate the suitability of ML to predict whether a patient with nAMD will develop fibrosis and/or atrophy after anti-VEGF treatment. We use data collected in a 36-month study according to real-world practice (dataset PI15/01374) to assess possible risk factors in nAMD patients (Llorente-González et al., 2022). In the previous study, only a conventional statistical analysis of clinical and environmental variables was performed, without evaluation of genetic variables. The objective of this study was therefore twofold: to perform a statistical analysis of the genetic variables that were collected but not analysed in (Llorente-González et al., 2022), and to evaluate the predictive power of ML models for atrophy and fibrosis development in nAMD patients at 36 months, using all the clinical and genetic variables collected in routine clinical practice up to 12 months from the start of treatment.

| Study design
Dataset PI15/01374 (Llorente-González et al., 2022) was used in this study to assess the influence of clinical (including environmental) and genetic factors on the progression towards macular atrophy and fibrosis (Table 1 and Table S1). Data collection was conducted from 1 September 2016 to 28 February 2020 across 17 sites in Spain, through an ambispective (retrospective and prospective) multicentre 36-month study of a cohort of 354 patients (one study eye per patient) with nAMD treated according to routine clinical practice.
All patients underwent a detailed ophthalmologic examination including automatic objective refraction, visual acuity assessment with the ETDRS (Early Treatment Diabetic Retinopathy Study) visual acuity test, slit-lamp biomicroscopy with pupillary dilation, colour fundus photography and OCT. Macular atrophy and fibrosis were evaluated as dichotomous qualitative variables through their presence or absence on imaging tests (colour fundus photography and/or OCT) at each visit. Likewise, their progression was calculated as their increase over time in imaging tests.

| SNPs statistical analysis
To evaluate the significance of the allele frequencies, we used the chi-square test within the following two groups: fibrotic vs non-fibrotic patients, and atrophic vs non-atrophic patients, all at 36 months. All SNPs analysed in this study were in Hardy-Weinberg equilibrium. The Bonferroni method was used to correct for multiple comparisons. The results of this analysis are also used to perform feature selection of the genetic variables prior to the ML modelling (see Subsection 2.4.1).
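The per-SNP test described above can be sketched as follows. This is a minimal illustration with made-up allele counts, not the real counts from dataset PI15/01374; the Bonferroni factor of 14 corresponds to the 14 SNPs analysed.

```python
# Sketch of the allele-frequency association test described above.
# Allele counts below are hypothetical placeholders.
from scipy.stats import chi2_contingency

def allele_association(minor_cases, major_cases, minor_controls, major_controls):
    """Chi-square test on a 2x2 table of allele counts (cases vs controls)."""
    table = [[minor_cases, major_cases],
             [minor_controls, major_controls]]
    chi2, p, dof, _ = chi2_contingency(table)
    return chi2, p

# Example: fibrotic vs non-fibrotic allele counts for one SNP (made up).
chi2, p = allele_association(60, 140, 80, 280)

# Bonferroni correction for testing 14 SNPs.
n_tests = 14
p_adjusted = min(p * n_tests, 1.0)
print(round(p, 4), round(p_adjusted, 4))
```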

| Machine learning analysis
The dataset PI15/01374 specifies whether a nAMD patient developed fibrosis and/or atrophy at 36 months.
Due to the different nature of these outcomes, we considered distinct machine learning models to predict, at 12 months from start of treatment, whether a patient (eye) will develop 24 months later (i.e., at 36 months): atrophy and/or fibrosis (Atrophy|Fibrosis_36m); fibrosis (Fibrosis_36m); and atrophy (Atrophy_36m).
In other words, in the Atrophy|Fibrosis_36m experiment patients (eyes) who develop atrophy, fibrosis or both correspond to the positive class; in the Fibrosis_36m experiment, patients (eyes) who develop only fibrosis or fibrosis and atrophy belong to the positive class; and in the Atrophy_36m experiment patients who develop only atrophy or fibrosis and atrophy belong to the positive class.In a given experiment, the patients (eyes) that are not considered positive are included in the negative class.
In all cases, this reduces to a supervised learning problem for binary classification, in which the positive class is referred as having the pathology and the negative class as not having it.

| Data preprocessing
The considered PI15/01374 dataset contains information on clinical and genetic (SNP) variables for 335 eyes. Before using these data as input to the ML models, we performed several preprocessing steps.
Since the goal is to make a prediction at month 12 after starting the treatment, clinical variables collected at 36 months were removed, as they would not be available in a real prediction scenario. This reduced the number of clinical variables to 20 (see Table 1). Out of the 14 genetic variables, we selected a representative SNP from each of the 4 risk pathways associated with nAMD atrophy and fibrosis: complement system (CFI), metabolic change in mitochondria (ARMS2), inflammation (SMAD7) and neovascularization (VEGFR) (DeAngelis et al., 2017). The results of the SNPs statistical analysis were used to guide this selection and filter out SNPs that did not show statistical differences (Figure 1). To ease the feature importance analysis (see Subsection 2.4.4), the retained clinical and genetic variables were further split into seven groups based on their clinical similarity (Table 1). Due to the high variables/eyes ratio, categorical variables (including SNPs) were encoded using label encoding instead of one-hot encoding (using Python's sklearn library). We dropped samples (eyes) that already presented the pathology to be predicted at 4 or 12 months, as it was observed that in these cases the pathology remained unchanged at 36 months. Moreover, retaining these samples could over-simplify the models and hinder their correct training. We also dropped samples containing missing values (N/A) in any of the retained variables. These steps reduced the number of samples to 296 for the atrophy experiment, 284 for fibrosis and 254 for atrophy and/or fibrosis. In total, 55% of the samples presented atrophy and/or fibrosis at 36 months, 37% presented fibrosis and 30% presented atrophy.

| Supervised learning models
Three different supervised learning methods known to perform well in practice were selected: random forest (RF) (Breiman, 2001), extreme gradient boosting (XGB) (Chen & Guestrin, 2016) and support vector machines (SVM) (Noble, 2006). Deep learning models were not considered due to the low number of available samples. RF and XGB fall within the field of ensemble learning, as they combine decision trees (DTs) to find patterns and classify the data. RF is based on bagging, which performs bootstrapping over the data and uses multiple DTs to average the results and reduce the variance. To decorrelate the trees and prevent overfitting, the DTs in RF can only use a random subset of the features. XGB is based on boosting, in which trees are built sequentially (i.e., previously built trees are taken into account to build the next one). SVM classifies the data by applying linear separators, making use of kernels to obtain margin classifiers that work efficiently on very high-dimensional data. Both RF and XGB are soft classifiers, as they compute the posterior probability of an input sample belonging to the positive class. SVM is a hard classifier that outputs the predicted class without explicitly computing the posterior probability; nevertheless, an estimate of this probability can be computed using cross-validation. By default, if the posterior (or predictive) probability is greater than or equal to 0.5, a positive prediction is made (negative otherwise). However, since these probabilities reflect how confident the model is when making a prediction, a different threshold (Th) can be used such that only samples with a probability greater than Th are classified as positive. As shown below, the capacity of a model to separate both classes can be evaluated by varying this threshold.
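The thresholding rule described above can be sketched in a few lines; the probabilities are illustrative placeholders, not model outputs.

```python
# Sketch of threshold-based classification from posterior probabilities,
# as described above (probabilities here are illustrative).
import numpy as np

probs = np.array([0.91, 0.62, 0.50, 0.35, 0.08])  # P(positive) per eye

# Default decision rule: positive iff probability >= 0.5.
default_preds = (probs >= 0.5).astype(int)

# A stricter threshold Th only flags high-confidence positives.
Th = 0.7
strict_preds = (probs > Th).astype(int)

print(default_preds.tolist())  # [1, 1, 1, 0, 0]
print(strict_preds.tolist())   # [1, 0, 0, 0, 0]
```

Sweeping Th from 1 down to 0 and recording the resulting true- and false-positive rates is exactly what produces the ROC curve discussed in the evaluation metrics subsection.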

| Evaluation metrics
Accuracy, defined as the percentage of samples correctly classified (i.e., for which the correct prediction is made), is generally the preferred metric to evaluate ML models for classification. However, due to the data imbalance between positive and negative samples, the balanced accuracy (BA) score was also considered. BA computes the average between the accuracy on the positive samples and the accuracy on the negative samples, giving equal weight to both classes.

F I G U R E 1 Minor allele frequency (MAF) differences between atrophy/non-atrophy and fibrosis/non-fibrosis patients. Two barplots represent the MAF for each SNP and disease of study. The minor allele type for each SNP is also indicated. Significant allele frequency differences, according to the chi-square test among groups within the same disease (see Section 2), are marked (*). SNPs used in the ML models (see Table 1) are highlighted in bold.
To evaluate the reliability and confidence of the models, we considered the area under the ROC (receiver operating characteristic) curve (AUC). The ROC curve plots the true-positive rate (TPR) vs the false-positive rate (FPR) for each possible threshold, defined as follows:

TPR = TP / (TP + FN),    FPR = FP / (FP + TN),

where TP, FN, FP and TN stand for true positives, false negatives, false positives and true negatives, respectively. The AUC is given by the area under the ROC curve, and ranges from 0 to 1, with 0.5 corresponding to a random classifier and 1 to a perfect one.
Intuitively, a reliable and confident model should output high probabilities for positive input samples, and vice versa. Additionally, if samples are sorted by their predicted probabilities, positive samples are expected to appear before negative samples, such that for high thresholds only positive samples would be predicted as positive (i.e., FPs would be close to zero). As the threshold decreases, the opposite is expected, that is, FNs should be close to zero. The ROC curve and the AUC therefore provide metrics to better understand how well the model separates both classes.
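The three metrics above are available in sklearn. A small, imbalanced toy example (labels and scores made up) shows why BA matters: with 6 negatives and 2 positives, a model that misses one positive still looks good on plain accuracy.

```python
# Sketch of the evaluation metrics discussed above (accuracy, balanced
# accuracy and ROC AUC) on a small imbalanced toy example.
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

y_true  = [1, 1, 0, 0, 0, 0, 0, 0]   # 2 positives, 6 negatives
y_pred  = [1, 0, 0, 0, 0, 0, 0, 0]   # thresholded predictions
y_score = [0.9, 0.4, 0.6, 0.3, 0.2, 0.1, 0.2, 0.1]  # posterior probabilities

acc = accuracy_score(y_true, y_pred)           # 7/8 = 0.875
ba  = balanced_accuracy_score(y_true, y_pred)  # (0.5 + 1.0) / 2 = 0.75
auc = roc_auc_score(y_true, y_score)           # computed over all thresholds

print(acc, ba, auc)
```

Accuracy here is 0.875 even though half the positive class is missed; BA drops to 0.75, reflecting that imbalance.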

| Feature importance
In order to analyse the models' feature importance in a homogeneous manner, we define a relative AUC (rAUC) score as:

rAUC = AUC_f − AUC_f̄,

which measures the increase or decrease in the model's AUC that a specific feature (or group of features) yields. AUC_f denotes the AUC of a model when using a group of features as input, including the specific feature f we want to compute the rAUC for. AUC_f̄ denotes the AUC of the same model, that is, same parameters and same features, but excluding feature f. Hence, rAUC measures the specific contribution of feature f to the model's reliability. For a set of p features, there would be 2^p − 1 possible combinations. To reduce the computational complexity, we therefore analyse the importance of each feature group (Table 1) rather than individual features, and apply this metric to models with at least two groups of features. As a result, 2^p − 1 − p combinations are evaluated (with p = 7 in our case).
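The rAUC definition above can be sketched as the difference in cross-validated AUC with and without a feature group. The data and group assignment below are synthetic stand-ins, and RF is used simply as one of the paper's three model families.

```python
# Sketch of the rAUC score defined above: the change in test AUC when a
# feature group f is included vs excluded, all else fixed (data are synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
group_f = [0, 1]         # hypothetical feature group f (column indices)
others  = [2, 3, 4, 5]   # the remaining feature groups

def mean_auc(cols):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(model, X[:, cols], y, cv=5, scoring="roc_auc").mean()

rauc = mean_auc(group_f + others) - mean_auc(others)  # AUC_f - AUC_f_bar
print(round(rauc, 3))
```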

| Experimental setup
When evaluating ML models, it is key to verify their generalization ability, that is, how they perform on data not used for training (referred to as test data). Due to the low number of available samples, cross-validation (CV) was used to generate training and test folds iteratively (Ng, 1997). However, when performing hyperparameter tuning and feature selection simultaneously, CV can yield overfitted test folds. In our case, for each ML model, we considered different values for the hyperparameters as well as all combinations of feature groups. Therefore, we used nested cross-validation (NCV) instead (Varma & Simon, 2006). Similarly to CV, in NCV the data are split into folds, and at each iteration one fold is left out for testing and the remaining ones are used for training (the outer training fold in NCV). Contrary to CV, however, in NCV the outer training fold is further split into folds, and iteratively all folds but one are used for training and the left-out fold for validation. The hyperparameters that perform best (on average) on the validation sets are then evaluated on the test fold. This allows hyperparameter tuning and feature selection while ensuring the generalization ability of the resulting models, avoiding overfitting and increasing robustness during training.
In our experiments, 6 folds were used in both the outer and inner loops. For each considered model, hyperparameter tuning was performed by applying a grid search (Table S4), and all possible subsets of the defined feature groups were tested during training. For each prediction task, we evaluated the importance of each feature group by computing the corresponding rAUCs on the test sets from NCV. Finally, the model with the best combination of features and hyperparameters in terms of average AUC (on the test folds) was selected. Unless stated otherwise, all reported metrics are on the test folds (from NCV). See Figure S2 for further explanation. It is worth noting that, as with CV, in NCV model training, feature selection and hyperparameter tuning are never performed on the left-out set, which is used exclusively for model evaluation.
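The 6×6 NCV setup described above can be sketched with sklearn by nesting a GridSearchCV (inner loop) inside cross_val_score (outer loop). The data, the SVM choice and the parameter grid below are illustrative, not the paper's actual grid from Table S4.

```python
# Sketch of nested cross-validation: GridSearchCV as the inner loop and
# cross_val_score as the outer loop (6 folds each, as in the paper).
# Data and hyperparameter grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=6, shuffle=True, random_state=0)

# Inner loop: hyperparameter tuning on the outer training fold only.
grid = GridSearchCV(SVC(probability=True),
                    param_grid={"C": [0.1, 1, 10]},
                    scoring="roc_auc", cv=inner)

# Outer loop: each left-out fold is used exclusively for evaluation.
scores = cross_val_score(grid, X, y, cv=outer, scoring="roc_auc")
print(round(scores.mean(), 3))
```

Because the grid search is refit inside every outer training fold, the outer test folds never influence tuning, which is the property that makes NCV estimates unbiased.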

| SNPs statistical analysis results
The results of the allelic analysis of the 14 considered SNPs regarding their association with the development of atrophy or fibrosis are shown in Figure 1. Allelic frequencies exhibited a significant association between patients with fibrosis and non-fibrotic patients for the CFI gene. All the SNPs of this gene showed frequencies with some differences (Table S3), but the SNP rs4698775 (CFI) indicated a significantly higher minor allele frequency (MAF) in patients with fibrosis versus non-fibrotic patients (p < 0.05, OR 1.4 with 95% CI 1.0-1.9). This significance is, however, lost after Bonferroni adjustment (p > 0.05). Regarding the development of atrophy, no significant differences were found in the allelic frequencies of these 14 SNPs.

| Feature importance
We first evaluated the importance of each of the seven considered feature groups by computing the corresponding rAUC scores. Due to the flexibility of the RF, XGB and SVM models and the limited availability of samples, we considered different hyperparameters for each model and prediction task, as well as all possible combinations of feature groups. Figure 2 shows the rAUC distributions across feature groups and ML models, for each prediction task.

Atrophy or fibrosis at 36 m
When analysing the Atrophy|Fibrosis_36m experiment, we observed that ETDRS and foveal thickness were the two most important feature groups based on rAUC, especially for the XGB model (Figure 2). The importance of ETDRS is in line with the previous statistical analysis, where it was shown that atrophy and fibrosis at 36 months were associated with lower ETDRS at any visit, explained by the visual impairment generated at the macular level (Llorente-González et al., 2022; Saenz-de-Viteri et al., 2021). However, foveal thickness at baseline (V1) and after the loading phase (V4) did not show statistically significant differences for the development of atrophy and fibrosis in the previous statistical analysis (p > 0.05). This is not surprising, as ML models can learn complex patterns in the data, and features that are not statistically significant when analysed in isolation may add relevant information to the models when combined with other features. The fact that all features in the ETDRS and foveal thickness groups are numerical may also help, as numerical variables can ease the exploitation of patterns by ML models with high predictive power such as the ones being evaluated. Interestingly, even though the variance of rAUC for the ETDRS variables is larger than that for the foveal thickness variables, the ETDRS rAUC score distribution is significantly higher than that of foveal thickness (p < 0.001), corroborating the importance of ETDRS in the evolution of nAMD. It is worth noting that, contrary to what previous studies have shown about the statistical power of the retinal fluid variable group to predict atrophy and fibrosis in nAMD patients, the rAUC distribution of these variables shows that they do not add much value to the ML models. The fact that the retinal fluid variables are qualitative has probably made them less relevant for the predictive models, while foveal thickness (quantitative), being directly correlated with retinal fluid (the greater the fluid, the greater the foveal thickness and vice versa), would be an indirect reflection of the importance of retinal fluid. Future studies should consider collecting the retinal fluid variables as quantitative rather than qualitative. Finally, the demographic, cataract, SNPs and neovascular membrane groups do not seem to add value to the ML models in terms of rAUC (distribution centred around 0).

F I G U R E 2 Relative AUC (rAUC) scores for each group of features and prediction task. Left: distribution of rAUC values as a function of each group of features, for the three prediction tasks, shown as a violin plot. A swarm plot is added within each violin plot to distinguish among the three machine learning models (RF, XGB and SVM). Right: barplot showing the mean rAUC as a function of each group of features, for every prediction task.
Fibrosis at 36 m

In the Fibrosis_36m task, the ETDRS rAUC distribution shows a trend similar to, but much less pronounced than, that of the Atrophy|Fibrosis_36m experiment. ETDRS and retinal fluid are the only two groups with a non-negative mean rAUC (see Figure 2). The fact that the rAUC distribution is more skewed towards smaller values compared to the Atrophy|Fibrosis_36m rAUC distribution can be associated with the complexity of the predicted variable. Specifically, in the Fibrosis_36m experiment, both healthy patients and those that develop only atrophy at 36 months belong to the negative class, even though the latter also have a bad prognosis. This can add noise and blur the decision-making of the ML models. On the contrary, Atrophy|Fibrosis_36m includes patients that develop either atrophy or fibrosis in the positive class, avoiding this problem.

Atrophy at 36 m
Finally, the rAUC distributions in the Atrophy_36m experiment show the foveal thickness and retinal fluid groups to increase the robustness of the model the most (see Figure 2). This can be explained by the relation between these groups of variables, as mentioned above, and the fact that retinal fluid has previously been identified as having clinical importance in the development of atrophy in nAMD (Llorente-González et al., 2022; Saenz-de-Viteri et al., 2021).

Model selection
After the conducted analysis, which considered all combinations (>250 000) of ML models (RF, XGB and SVM), hyperparameters and feature groups, the combination with the highest validation AUC score (during NCV) was selected for each prediction task. Table 2 contains a summary of the final models. In all cases, XGB obtained the highest AUC, albeit with a different set of hyperparameters. Regarding the feature groups, all three experiments employ the ETDRS group plus one additional feature group. Specifically, the Atrophy|Fibrosis_36m experiment includes the foveal thickness group, the Fibrosis_36m experiment the SNPs group, and the Atrophy_36m experiment the retinal fluid group.

Performance metrics
Next, we report the evaluation metrics of the final models for each prediction task, computed as the average across the NCV test folds (see Figure 3a). For Atrophy|Fibrosis_36m, the average BA is 0.63, the accuracy is 0.65, and the AUC is 0.72. For Fibrosis_36m, the average BA is 0.54, the accuracy is 0.72, and the AUC is 0.6. Finally, for Atrophy_36m, the average BA is 0.54, the accuracy is 0.7, and the AUC is 0.57. Due to the imbalance between positive and negative samples in all experiments, BA is always lower than accuracy, showcasing the importance of considering BA in addition to accuracy. The highest AUC is obtained for Atrophy|Fibrosis_36m, since there is a clearer distinction between negative and positive samples.
To further assess the proposed models, Figure 3a shows the evaluation metrics obtained for each of the splits within NCV. It is clear from the results that there is an intrinsic complexity in the prediction of atrophy and fibrosis given the available data. This is more pronounced for the Fibrosis_36m and Atrophy_36m experiments, in which lower metrics are obtained compared to the Atrophy|Fibrosis_36m experiment. As stated above, this is expected, as the prediction task in the first two experiments is more complex. Moreover, results for Atrophy|Fibrosis_36m and Atrophy_36m show signs of overfitting on the training fold, suggesting the models were complex enough to learn complex patterns, but the variance among patients did not allow these patterns to generalize. The same trend is found in the Fibrosis_36m experiment: even though the accuracy does not show signs of overfitting, patients from the validation and test folds yield lower accuracies and AUCs, highlighting the underlying difficulty of the prediction tasks. This is reasonable, since heterogeneity has been pointed out as a common denominator in patients with nAMD, and more so in this real-life clinical practice study, with less exhaustive inclusion and exclusion criteria than in a clinical trial, and with various anti-VEGF therapies and multiple treatment and follow-up regimens applied.
T A B L E 2 Detailed information about the final models used for each prediction task, including hyperparameters and input features.

We also analysed the importance of each of the included features, computed using the internal "feature_importances_" attribute of the XGBClassifier model from the XGBoost Python package. Feature importance is calculated for every decision tree and depends on the amount by which each attribute's split improves the performance measure and on the number of observations the node is responsible for.

Atrophy or Fibrosis at 36 m. Regarding the Atrophy|Fibrosis_36m experiment, even though ETDRS at V1 (ETDRS_b) and at 4 months (ETDRS_V4) do not show signs of relevance within the best combination, ETDRS at 12 months is statistically significant (from the feature importance perspective) for the classification power of the model (p < 0.001, Figure 3b). The rationale behind this is that the ETDRS variables are correlated with each other (Figure S1) and the model uses only one of them (the most correlated with the predicted variable) for most of the splits, hence obtaining the highest feature importance among the three ETDRS features, and overall as well. These results align with the rAUC metrics obtained when evaluating all models, features and hyperparameters (Figure 2).

Fibrosis at 36 m. The Fibrosis_36m prediction task exhibits a feature importance distribution similar to that of the Atrophy|Fibrosis_36m experiment (Figure 3c). The importance of ETDRS at 12 months is also significantly above the rest of the variables (p < 0.05), followed by the SNPs VEGFR, CFI and SMAD7. Recall that the SNP CFI showed some statistical differences between fibrotic and non-fibrotic patients. Finally, as expected, the ETDRS variables are ordered in importance by time (12 months, V4 and V1).
Atrophy at 36 m. Finally, for the Atrophy_36m experiment, the importance of basal subretinal fluid appears to be significantly above the other features (p < 0.001, Figure 3d). Clinically, subretinal fluid has been shown to be associated with better visual acuity and a lower risk of developing macular atrophy or fibrosis, with fewer injections (Guymer et al., 2019; Lai et al., 2019; Llorente-González et al., 2022). The remaining features do not seem to add to the predictive power of the model. Nevertheless, the performance of the model (BA 0.54 and AUC 0.57) indicates that it cannot learn to distinguish between atrophic and non-atrophic patients. This is expected, as the development and evolution of atrophy involve the interaction of several metabolic, functional, genetic and environmental factors, making its course unpredictable (Nowak, 2006). Likewise, at a functional level, atrophy can appear in an advanced form with little visual impairment and vice versa, making its prediction very complex.

| DISCUSSION AND CONCLUSION
This work presents, to the best of our knowledge, the first exhaustive analysis of the suitability of machine learning for predicting the development of fibrosis and atrophy in neovascular age-related macular degeneration patients undergoing anti-VEGF treatment. The ML models are trained to predict the development of fibrosis and atrophy at 36 months after starting anti-VEGF treatment, using data collected during the first 12 months. For the analysis, we used demographic, clinical and genetic variables.
We consistently found ETDRS to be relevant for the prediction of atrophy and fibrosis, confirming previous statistical analyses (Llorente-González et al., 2022). On the other hand, the analysed SNPs, in some cases widely associated with AMD development (with high risk or protective frequencies compared to healthy controls), did not show any specific association with macular degeneration in the considered cohort and did not contribute significantly to the ML models. The best performing model is able to predict the development of at least one macular degeneration outcome with an accuracy of 65%, a balanced accuracy of 63% and an AUC of 0.72. As highlighted below, access to more samples as well as more features (or features of better quality) could boost the predictive power of the ML models. Similarly, the availability of prospective samples could also benefit the validation of the developed models.
In particular, even though the presented results confirmed the known relationship between macular degeneration and retinal fluid on OCT (Ashraf et al., 2018; Guymer et al., 2019; Lai et al., 2019; Llorente-González et al., 2022; Ying et al., 2018), we believe that the categorical nature of these features may have narrowed down the pattern-exploitation ability of the applied ML predictors. Hence, storing the numerical value (OCT fluid volume) for these features may help in future ML studies. Alternatively, deep learning models could leverage OCT images as features, potentially uncovering strong anatomical patterns. Ideally, these images should be generated with the same technology and protocol to make them as homogeneous as possible, which may be challenging in studies involving several centres. Moreover, more samples would be necessary in this case due to the complexity of deep learning models. This has also been a limitation in this work, as a lack of generalization has been observed across the three considered experiments, possibly due to nAMD heterogeneity and the underlying complexity of atrophy and fibrosis.
Regarding the evaluated SNPs, even though they were not shown to be sufficient to predict the progression of nAMD, additional analyses with larger cohorts of patients should be carried out before they are ruled out, as they could have a regulatory role in these processes. A different set of SNPs could also be evaluated to analyse their potential effect on disease progression.
The fact that nAMD is a complex disease involving many factors means that ML models need access to high-quality data in order to make accurate predictions. Hence, when collecting data from real clinical practice, it would be desirable to use the same (or similar) image acquisition and analysis systems, so that the data are as homogeneous as possible, and to have long follow-up periods with regular visits, so that more information per patient is available. Raw values should also be collected for each variable when possible, for example, without converting numerical variables to categorical ones by applying thresholds. Furthermore, future work could take advantage of the evolving nature of the pathology under study, incorporating new samples to validate the machine learning models.
In summary, in this work we have established guidelines for future nAMD atrophy and fibrosis prediction. Several ML approaches have been analysed and, despite the complexity of the prediction task, multiple already-known biological relationships have been recovered along the way. Moreover, the lessons learnt during the development of this work may guide future ML-based prediction tasks within the ophthalmological field and help design the data collection process.

AUTHOR CONTRIBUTIONS
T A B L E 1 Features (variables) contained in the considered dataset PI15/01374. Features are organized in groups based on their clinical similarity. (a) V1, V4 and 12 m variables have not been included in the ML models; see Subsection 2.4.1 for further details. Atrophy and fibrosis at 36 m are the predicted variables. (b) These variables have not been included in the ML models due to lack of importance and improvement within the model or due to data leakage concerns.

F I G U R E 3 Evaluation metrics and importance scores of the final models for each prediction task. (a) Evaluated metrics (BA, accuracy and AUC) along the folds (train, validation and test) for each experiment for the final ML models. Confidence intervals have been computed by running the selected models with five different seeds. (b-d) Boxplots showing the distribution of importance scores for variables within each group, from the test split across folds of the NCV setup, for the (b) Atrophy|Fibrosis_36m, (c) Fibrosis_36m, and (d) Atrophy_36m experiments. Variables within each experiment are sorted by their corresponding mean feature importance.
J.F.: ML model and analysis, writing, editing and data uploading. S.L.G.: Data uploading, experiment, database, writing and editing. P.F.R.: Study design, database and editing. M.H.S.: Database and editing. A.G.L.: Study design, data uploading, reviewing and supervision. I.O.: Writing, reviewing and supervision. S.R.: Study design, database, writing, reviewing and supervision. Spanish