Unsupervised machine learning algorithms identify expected haemorrhage relationships but define unexplained coagulation profiles mapping to thrombotic phenotypes in hereditary haemorrhagic telangiectasia

Abstract Hereditary haemorrhagic telangiectasia (HHT) can result in challenging anaemia and thrombosis phenotypes. Clinical presentations of HHT vary for relatives with identical causal mutations, suggesting other factors may modify severity. To examine this objectively, we developed unsupervised machine learning algorithms to test whether haematological data at presentation could be categorised into sub‐groupings and fitted to known biological factors. With ethical approval, we examined 10 complete blood count (CBC) variables, four iron index variables, four coagulation variables and eight iron/coagulation indices combined from 336 genotyped HHT patients (40% male, 60% female, 86.5% not using iron supplementation) at a single centre. T‐SNE unsupervised, dimension reduction, machine learning algorithms assigned each high‐dimensional datapoint to a location in a two‐dimensional plane. k‐Means clustering algorithms grouped the datapoints into profiles, enabling visualisation and inter‐profile comparisons of patients’ clinical and genetic features. The unsupervised machine learning algorithms using t‐SNE and k‐Means identified two distinct CBC profiles, two iron profiles, four clotting profiles and three combined profiles. Validating the methodology, profiles for CBC or iron indices fitted expected patterns for haemorrhage. Distinct coagulation profiles displayed no association with age, sex, C‐reactive protein, pulmonary arteriovenous malformations (AVMs), ENG/ACVRL1 genotype or epistaxis severity. The most distinct profiles were from t‐SNE/k‐Means analyses of combined iron‐coagulation indices and mapped to three risk states – for venous thromboembolism in HHT; for ischaemic stroke attributed to paradoxical emboli through pulmonary AVMs in HHT; and for cerebral abscess attributed to odontogenic bacteremias in immunocompetent HHT patients with right‐to‐left shunting through pulmonary AVMs.
In conclusion, unsupervised machine learning algorithms categorise HHT haematological indices into distinct, clinically relevant profiles which are independent of age, sex or HHT genotype. Further evaluation may inform prophylaxis and management for HHT patients’ haemorrhagic and thrombotic phenotypes.

anaemia, genetic disorders, haemorrhage, high-dimensional data, t-SNE

INTRODUCTION
Hereditary haemorrhagic telangiectasia (HHT) is a multisystemic disorder that affects approximately one in 6000 individuals and displays significant clinical variability [1][2][3][4]. HHT is caused by a single pathogenic DNA variant, usually in the ENG, ACVRL1 or SMAD4 genes which are expressed in endothelial cells [5]. The result of these variants is the development of fragile telangiectatic vessels prone to haemorrhage and arteriovenous malformations (AVMs) at characteristic sites [6,7]. While HHT can be defined clinically by three of four Curaçao Criteria (nosebleeds, telangiectasia, visceral involvement and family history) [8], the converse is not true and possession of a causative HHT pathogenic variant may be associated only with the criterion that precipitated the genetic test [9].
For an individual diagnosed with HHT, it is not possible to predict the exact consequences. Recurrent haemorrhage and iron deficiency anaemia are the most recognised features and are discussed further below. Additionally, extensive evidence accrued in large HHT populations over the last 30 years indicates that approximately one in two patients will have pulmonary AVMs [10] (higher in ENG [11]), approximately one in two patients will have hepatic AVMs [12] (higher in ACVRL1 [11]) and approximately one in 16 patients will have cerebral AVMs [13] (higher if other cerebral vascular malformations are included [13]), with smaller proportions developing AVMs at other sites. For pulmonary AVMs, most patients will have a significant thrombotic complication, such as ischaemic stroke/cerebral infarction [14][15][16] or cerebral abscess that is associated with venous thromboemboli (VTE) [17,18]. Both result from paradoxical embolism through the right-to-left shunts provided by pulmonary AVMs, and management advice is available from respiratory [19], radiological [20] and neurological [15] specialist groupings. For hepatic AVMs, a longitudinal cohort analysis demonstrated mortality and morbidity rates of 1.1 and 3.6 per 100 person-years, respectively, with development of iron deficiency anaemia a risk factor for high output cardiac failure [21]. For other AVMs in HHT, current data suggest the majority of affected patients will not have a complication, but when complications do occur, consequences for the individual can be life changing.
HHT groupings have tended to focus on AVMs, but the most common problems in HHT result from nosebleeds and anaemia which are usually treated in haematological or ENT Services [1,4,22,23].
The majority of HHT patients experience nosebleeds at such frequency and intensity that their iron losses cannot be replaced by dietary iron intake and iron deficiency results [1,22,23,24]. This may be augmented by gastrointestinal blood loss [1,22,23,24]. Anaemia is the best recognised of the many consequences of iron deficiency [25], although the haemorrhage-adjusted iron requirement (HAIR) [24] indicated anaemia out of proportion to the intake/haemorrhage paradigm, leading to the identification of low-grade haemolysis as contributing, alongside haemorrhage, to severe anaemia in HHT [26]. Also of relevance to haematologists, in HHT, iron deficiency is associated with increased risk of ischaemic stroke due to pulmonary AVMs [15,16], increased risk of high output cardiac failure due to hepatic AVMs [21] and increased risk of VTE [27]. Mechanisms supported by evidence in the HHT population include augmented platelet aggregation to 5HT (serotonin) [16,28] and elevated Factor VIII [27], in addition to wider consequences of iron deficiency [25]. Separately, high transferrin saturation index (TfSI) and use of intravenous iron were independent predictors of cerebral abscess for 403 British HHT patients with pulmonary AVMs [17].
These pathophysiological factors do not explain why the clinical presentations of HHT vary so extensively between relatives with identical causal mutations [29,30]. For bleeding and iron handling, hypothesis-driven questions provided evidence of subtle differences between HHT ACVRL1/ENG and the rare SMAD4 cases [31], and between groups of HHT patients categorised by the presence of independent non-HHT DNA variants in genes causing coagulation and platelet disorders [32]. To test objectively whether there were further patterns relevant to clinical management, we developed unsupervised machine learning algorithms and applied them in a novel approach in HHT.
Here, we report that the algorithms identified unexpected categories of patients' haematological indices, particularly in relation to iron and coagulation. Known biological factors could not explain the main profile distinctions, providing new areas for mechanistic, clinically relevant studies.

[5], or met a clinical diagnosis of HHT but were negative on genetic testing at the time of analysis. A definite clinical diagnosis required three of four Curaçao Criteria, namely spontaneous recurrent epistaxis, mucocutaneous telangiectasia at characteristic sites, visceral involvement such as gastrointestinal telangiectasia or AVMs in the lungs, liver or brain, and a first degree relative affected by these criteria [8,9,23]. Although there were multiple results for each patient due to repeated clinic visits, only first visit results were examined, providing a single dataset per patient. Coagulation data were excluded for patients using anticoagulants due to significant increases in clotting times. Demographic patient details included age, sex, severe epistaxis (severe daily nosebleeds or bleeds resulting in a HAIR exceeding replacement feasibility using oral iron resulting in intravenous iron or blood transfusion dependency) [32] and use of iron supplements (oral or intravenous) [26]. The causal HHT gene was included, but small sample size did not permit inclusion of the SMAD4 patient group within the statistical testing.

TABLE 1 Key definitions.
Term: Definition
High-dimensional data: Describes data with a large number of features. In the case of clinical profiling, each dimension is represented by a certain clinical parameter.
Unsupervised machine learning: Algorithms that analyse patterns in uncategorised data.
t-Distributed stochastic neighbour embedding (t-SNE): Unsupervised machine learning algorithm which reduces the dimensions of high-dimensional data down to two. The two dimensions, known as t-SNE embeddings, can be plotted along x and y-axes to help visualise the data and make sense of patterns. Each high-dimensional datapoint is therefore assigned to a location in a two-dimensional plane, with similar datapoints placed closer together and dissimilar ones further apart.
k-Means clustering: Clustering algorithm that groups similar datapoints together into a cluster, with distances between datapoints inversely related to their similarity.

T-SNE and k-Means clustering
For t-SNE and k-Means analysis (Table 1), haematological indices were grouped into complete blood count (CBC), iron and clotting indices. T-SNE and k-Means clustering were implemented using Python 3.8.13 (Python Software Foundation, Wilmington, USA) and the packages pandas, numpy, matplotlib and sklearn. Data were normalised on a Standard scaler, such that the maximum value for each index was represented using a 1 and the minimum value using a 0 [33]. Euclidean distances were used to calculate the similarity between two high-dimensional datapoints [33,34]. Following this, k-Means clustering was performed on the t-SNE embeddings [35], using the silhouette score [36] method to determine the optimal number of clinical profiles. A scatter plot of the t-SNE embeddings and k-Means profiles was then plotted for data visualisation.
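The pipeline described in the methods (scaling, t-SNE embedding, k-Means on the embeddings, silhouette-based choice of the number of profiles) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that the described 0-to-1 scaling corresponds to sklearn's MinMaxScaler (the text names a Standard scaler but describes min-max behaviour), and the perplexity, random seed, candidate range of k and the function name profile_patients are all hypothetical choices made for illustration. The input is synthetic data, not patient data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def profile_patients(indices, k_range=range(2, 7), seed=0):
    """Scale, embed with t-SNE, then choose k for k-Means by silhouette score."""
    # Assumption: rescale each index so its minimum maps to 0 and maximum to 1.
    scaled = MinMaxScaler().fit_transform(indices)
    # Two-dimensional t-SNE embedding based on Euclidean distances.
    embedding = TSNE(n_components=2, metric="euclidean",
                     perplexity=30, random_state=seed).fit_transform(scaled)
    # Fit k-Means on the embedding for each candidate k and keep the
    # clustering with the highest silhouette score.
    best = None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(embedding)
        score = silhouette_score(embedding, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return embedding, best


# Synthetic stand-in data (NOT patient data): 3 groups x 60 rows x 4 indices.
rng = np.random.default_rng(0)
demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 4))
                  for c in (0.0, 5.0, 10.0)])
embedding, (k, score, labels) = profile_patients(demo)
print(f"chosen k = {k}, silhouette = {score:.2f}")
```

The returned embedding and labels could then be passed to a matplotlib scatter plot, coloured by label, to give a visualisation of the kind described for the figures.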

Cluster analyses
Profile characteristics were determined using GraphPad Prism 9.

Cohort demographics
The demographic details of all analysis cohorts are displayed in Table 2. There were no significant differences in the age, sex or gene distributions between the cohorts calculated by Chi-squared (all p values > 0.05).
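A chi-squared comparison of a categorical variable across cohorts, as mentioned above, can be sketched as follows. The counts are placeholders chosen for illustration and are not the study data (Table 2 values are not reproduced here). The use of scipy is also an assumption, since the source does not state which software produced this comparison.

```python
from scipy.stats import chi2_contingency

# Placeholder counts (NOT study data): rows = cohorts, columns = male, female.
sex_by_cohort = [
    [40, 60],  # hypothetical CBC cohort
    [42, 58],  # hypothetical iron cohort
    [38, 62],  # hypothetical clotting cohort
    [41, 59],  # hypothetical combined cohort
]

# Test whether the sex distribution differs between the cohorts.
chi2, p_value, dof, expected = chi2_contingency(sex_by_cohort)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A p value above 0.05 here would be read, as in the text, as no significant difference in distribution between cohorts.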

Homogeneous distribution of CBC indices
Note (Table 2): the remaining patients were 'gene negative' at time of analysis (none of this cohort had GDF2 variants).

FIGURE 1 Complete blood count indices. Two-dimensional scatter plot of the t-SNE embeddings and k-Means clusters for CBC indices (n = 336). Each datapoint represents a patient. k-Means clusters are displayed by colour. Distances between datapoints are representative of similarity.

As shown in Figure 1, where distances between datapoints represent similarity, CBC profiles appeared homogeneous with no distinct separation. All median values were within reference ranges (Table 3). Profile 2 was characterised by significantly lower values for most CBC indices (p < 0.0001), except RDW and platelet counts which were significantly higher (p < 0.0001).
Comparing demographics (Table 3), Profile 2 patients were older, with an excess of females, patients with severe epistaxis and patients receiving iron (p < 0.05; Figure 1B). Profile 2 patients also had higher RDW and platelet counts consistent with known anaemia associations [37,38]. There was no difference in HHT genotypes between the profiles.

Two distinct iron indices clusters
Next, iron indices were examined (Figure 2). These separated into distinct regions. Profile 2 patients had lower serum iron, TfSI and ferritin and higher CRP ( of 60 (16.2%, p < 0.05)), but at presentation, only 18% of the cohort were using iron supplementation and the trend for greater iron use in Profile 2 did not reach significance (Table 4). There was no difference in male/female proportions or genotype, but Profile 2 had a significantly lower median age (Table 4). Profile 2 data were therefore consistent with younger patients with more severe bleeding and lower iron indices.

Four distinct clotting indices clusters
t-SNE and k-Means visualisations for clotting indices yielded four profiles (Figure 3), of which three (Profiles 1-3) were more distinct than the fourth. As shown in Table 5, platelet counts differed between all profiles (p < 0.0001): Profile 1 was characterised by the lowest platelet count, with moderately long PT and APTT. Profile 2 was characterised by the highest median platelet

TABLE 3 Details of the CBC indices cohort and two clinical profiles identified: Normal ranges and cohort/profile data for 336 patients with full CBC data. Median (IQR) or % provided. As both sexes were represented, where reference ranges differ by sex, the upper limit provided is the top of the male reference range and the lower limit is the bottom of the female reference range. MCH, mean corpuscular haemoglobin; MCHC, mean corpuscular haemoglobin concentration. Highlighting indicates values higher (green) or lower (red) than the other profile, and is stronger where significant (***p < 0.0001). Demographic indices for each profile are in non-highlighted rows below. Patients included in the profiling had incomplete demographic data, therefore rows may not add up to 100%. Comparisons of CBC and other cohorts are provided in Table 2. * indicates p < 0.05.

TABLE 4 Details of the iron indices cohort and two clinical profiles identified: Normal ranges and cohort/profile data for 218 patients with full iron indices/CRP data. Median (interquartile range) or % for demographic variables provided. For ferritin, as both sexes were represented within each profile, the lower female and upper male reference ranges have been combined. Highlighting indicates values significantly higher (green) or lower (red) than the other profile (***p < 0.0001, *p < 0.05). For demographics (non-highlighted rows below), patients included in the profiling had incomplete demographic data, therefore rows may not add up to 100%. * indicates a difference of p < 0.05 between Profiles 1 and 2.

TABLE 5 Details of the clotting indices cohort and four clinical profiles identified: Normal ranges and mean indices in 309 patients with full clotting indices data. APTT, activated partial thromboplastin time. Highlighting indicates lowest (red), second lowest (yellow), second highest (orange) and highest (green) values for each clustered parameter (***p < 0.0001). For demographics (non-highlighted rows below), patients included in the profiling had incomplete demographic data, therefore rows may not add up to 100%. * indicates a difference of p < 0.05 in demographics between profiles.

Three distinct clusters defined by iron and coagulation indices
Finally, coagulation indices were examined in conjunction with the iron indices that had discriminated the population. This identified three distinct clusters (Figure 4). As detailed further in Table 6, Profile 1 patients (blue symbols in Figure 4) had the lowest iron, TfSI and ferritin and highest platelet counts (p values < 0.0001). Profile 2 patients (green symbols) were characterised by longest PT and APTT with moderate serum iron and TfSI (p < 0.0001). Profile 3 patients (orange symbols)

TABLE 6 Details of the combined iron/clotting indices cohort, and three clinical profiles identified: Normal ranges and mean iron/clotting indices in 193 patients with full data. APTT, activated partial thromboplastin time. Highlighting indicates lowest (red), intermediate (yellow) and highest (green) values for each clustered parameter (***p < 0.0001, *p < 0.05). For demographics (non-highlighted rows below), patients included in the profiling had incomplete demographic data, therefore rows may not add up to 100%. * indicates a difference of p < 0.05 in demographics between profiles.

Unlike the earlier profiles, there was no significant difference in the patients' ages, sex or clinical features for the combined iron and clotting profiles. The Profile 1 combination of lowest iron, TfSI and ferritin with highest platelet count is reminiscent of data showing that thrombocytosis accompanies acute haemorrhage, and that iron deficiency anaemia is associated with higher platelet counts [37,38]. Of note, within the combined iron and clotting profiles, all SMAD4 patients were within the iron-deficient profile. Profile 3 is also of interest as it was characterised by highest iron and TfSI (p values < 0.0001) and lowest fibrinogen concentration. There was no significant difference in the distribution of HHT genes; however, the small datasets precluded statistical examination of SMAD4.

DISCUSSION
We have shown that unsupervised machine learning algorithms, performing objective categorisation of continuous data in HHT patients without prior assumptions, categorise presentation CBC, iron indices and clotting indices into separate clinical profiles. CBC and iron profiles fitted with expected pathophysiological models, but four distinct coagulation profiles and three distinct combined iron-coagulation profiles could not be explained by current, clinically recognised features, suggesting other drivers of these differences. While bleeding and iron deficiency considerations are centre stage in HHT, modified coagulation is not, despite high rates of VTE [27,39] and the challenging risk-benefit considerations when using anticoagulation therapy [40][41][42][43].
The main strength of the profiling was the use of unsupervised machine learning algorithms. Previous studies have identified trends by initially categorising the data based on an outcome variable prehypothesised to contribute to clinical variability [5,14,16,17,21,24].
This study identified categories without prior assumption of possible determinants. It is widely accepted that reference ranges for several CBC indices are higher in males [38] and t-SNE and k-Means' ability to discriminate sex in CBC profiles validates the methodology. Similarly, more iron-deficient pictures are expected for patients with more severe haemorrhage [5,37,38] and the methodology demarcated an iron indices profile in younger patients with more severe bleeding (Table 4), a group already shown to have poorer long-term outcomes [44]. Further, by purposefully not excluding specific subgroups other than patients using anticoagulants, this study identified clotting and combined iron-clotting categorisations that were not associated with any factors routinely used to categorise patients in clinical practice.
The study cohorts were proportionately large for a rare genetic disorder, and differences between iron profiles were detected at α = 0.05 with a power of 0.99 for iron and TfSI, and 0.84 for ferritin, using the standard deviations of the iron cohort. CBC and iron indices examined in isolation discriminated patients with more severe bleeding. There was no significant difference in sex distribution for iron indices, which was surprising since male reference ranges for iron indices are typically higher than female ranges [37,38].
When profiles incorporated coagulation indices, overlaps between the distinct profiles and recognised pathophysiology were harder to ascertain, although for combined iron and clotting indices, Profile 1 could fit with a 'simple' haemorrhage/iron depletion model: as seen in Table 6, it had the lowest iron indices, shortest APTT, highest platelet count and more than twice as many patients reporting severe epistaxis as using iron supplements. analyses in the larger HHT cohort from which these genotyped populations were derived: venous thromboembolism was associated with short APTT and low serum iron [27,39], as seen in the Table 6 Profile 1 [blue] 'haemorrhage/iron depletion' profile; cerebral abscess (due to paradoxical microorganism emboli through pulmonary AVMs and blood-brain barrier breach) was associated with VTE and high TfSI (compare Table 6 Profile 3 [orange]), while ischaemic stroke risk (considered to be mediated by paradoxical platelet emboli through pulmonary AVMs) was higher with lower serum iron and higher serum fibrinogen [16] (compare Table 6 Profile 2 [green]). Thus, while mechanisms underlying distinctions between Profiles 2 and 3 require further evaluation, the machine learning appears to have detected patterns where better appreciation may modify risk assessments for primary and secondary prophylaxis.
Taken together, the data support uncharacterised cellular, endocrine, pharmacological or genetic characteristics that may be operating as modifiers in HHT. Neither the possession of ENG nor ACVRL1 variants was significantly associated with any clinical profile.
It is not known whether the profiles will only be relevant to iron and coagulation indices in the setting of an HHT vasculopathy, and replication in non-HHT cohorts would be valuable.
In conclusion, in this study unsupervised machine learning t-SNE