Large‐scale plasma proteomics can reveal distinct endotypes in chronic obstructive pulmonary disease and severe asthma

Abstract

Background: Chronic airway diseases including chronic obstructive pulmonary disease (COPD) and asthma are heterogeneous in nature, and the endotypes within them are underpinned by complex biology. This study aimed to investigate the utility of proteomic profiling of plasma combined with bioinformatic mining, and to define molecular endotypes and expand our knowledge of the underlying biology in chronic respiratory diseases.

Methods: The plasma proteome was evaluated using an aptamer‐based affinity proteomics platform (SOMAscan®), representing 1238 proteins, in 34 subjects with stable COPD and 51 subjects with stable but severe asthma. For each disease, we evaluated a range of clinical/demographic characteristics including bronchodilator reversibility, blood eosinophilia levels, and smoking history. We applied modified bioinformatic approaches used in the evaluation of RNA transcriptomics.

Results: Subjects with COPD and severe asthma were distinguished from each other by 365 different protein abundances, with differential pathway networks and upstream modulators. Furthermore, molecular endotypes within each disease could be defined. The protein groups that defined these endotypes had both known and novel biology, including groups significantly enriched in exosomal markers derived from immune/inflammatory cells. Finally, we observed associations with clinical characteristics that have previously been under‐explored.

Conclusion: This investigational study evaluating the plasma proteome in clinically‐phenotyped subjects with chronic airway diseases provides support that such a method can be used to define molecular endotypes and the pathobiological mechanisms that underpin these endotypes. It provided new concepts about the complexity of molecular pathways that define these diseases. In the longer term, such information will help to refine treatment options for defined groups.


| INTRODUCTION
Chronic airway diseases such as chronic obstructive pulmonary disease (COPD) and asthma are common and significant causes of morbidity and mortality. COPD is characterized by persistent respiratory symptoms and airflow limitation due to airway and alveolar abnormalities. 1 Asthma is characterized by variable respiratory symptoms and expiratory airflow limitation. 2 The differential diagnosis of these diseases is problematic, especially in older adults, because of overlapping clinical features. For instance, fixed airflow limitation is observed in patients with severe asthma, 3,4 with distinct phenotypes observed in smokers. 4 Conversely, 50% of COPD patients had at least one asthma-like feature (bronchodilator reversibility, blood eosinophilia, or atopy) even if they were not clinically diagnosed with asthma. 5 Hidden in these clinical groups may be a diverse range of molecular endotypes, where lack of knowledge of the underlying pathobiology hampers the choice of the best treatment regimen.
Study of the biological networks that govern chronic airway diseases may help to identify the unique underlying biology. The concept of molecular endotypes in asthma was initially discussed in terms of type 2 and non-type 2 asthma. 6,7 Extensions to these endotypes have been proposed based on transcriptome analysis of peripheral blood, 8 epithelial brushings and bronchial biopsies, 9 as well as metabolomics. 10 For COPD, a meta-analysis of endotypes was achieved from peripheral blood gene expression analysis. 11 However, as proteins are central to almost all cellular processes, and dysregulation of their expression and function is associated with a range of disorders, it makes sense to assess proteome-derived endotypes. In respiratory diseases, the de novo detection of such proteins has been limited to low-throughput analysis, usually of inflammatory mediators. The application of proteomics in clinical and research settings in respiratory disease has recently been reviewed, 12 including the developments in protein detection technologies that enable the simultaneous quantitation of large numbers of circulating proteins, including low-abundance analytes, with high sensitivity and precision across cohorts. 13,14 The use of these technologies has led to the identification of biomarker signatures and new concepts about disease pathology in allergic skin disease, 15 in respiratory diseases such as idiopathic pulmonary fibrosis 16 and bronchiectasis in cystic fibrosis, 17 and in chronic diseases such as cardiovascular disease 18 and inflammatory bowel disease. 19 We hypothesized that molecular endotypes of COPD and severe asthma could be defined through evaluation of the plasma proteome.
Furthermore, we addressed whether bioinformatics approaches adopted from the study of RNA sequencing data could help to elucidate the underlying biology. We evaluated the abundance of 1238 proteins in a subset of individuals from the Hokkaido COPD cohort 5,20,21 and the Hokkaido-based Investigative Cohort Analysis for Refractory Asthma (Hi-CARAT) 4,22,23 studies. Our results indicate that a large-scale plasma proteome approach offers the potential to define novel molecular endotypes and unique underlying biology.

| METHODS
Details of the methods are shown in the Supporting Information.

| Patient cohorts
The protocols for the Hokkaido COPD cohort, Hi-CARAT, and this study were approved by the ethics committee of Hokkaido University School of Medicine (med02-001) and Hokkaido University Hospital (009-0025, 015-0336), respectively. They were performed in accordance with the Declaration of Helsinki. All subjects provided written, informed consent with an additional opt-out consent for this study.

| COPD cohort
A subset of Hokkaido COPD cohort subjects 5,20,21 was selected (Figure S1). Subjects with a physician diagnosis of asthma were excluded. To ensure we evaluated the broadest range of clinical features, we included subjects with asthma-like features (n = 17), namely high blood eosinophil levels (>300/μl) and bronchodilator reversibility (ΔFEV1 ≥ 200 ml and ≥12% after inhalation of 400 μg of salbutamol; the average value for three visits taken during the first year), as well as subjects without them (n = 17). Subjects' baseline clinical measures were found to be stable, as assessed by yearly evaluation of blood eosinophil levels and bronchodilator reversibility over 5 years. Sex, age, pack-years, and BMI for this cohort can be found in the supplementary information datasets (Dataset-1).
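The two thresholds stated above can be written as simple checks. This is a minimal sketch rather than the study's code: the function names are hypothetical, and taking the percentage change relative to the pre-bronchodilator FEV1 is an assumption, since the text does not state the denominator. The study averaged values over three visits; the sketch evaluates a single pair of measurements.

```python
def has_bronchodilator_reversibility(pre_fev1_ml: float, post_fev1_ml: float) -> bool:
    """True if FEV1 rose by >= 200 ml AND >= 12% after bronchodilator.

    Assumption: the percentage is taken relative to the pre-bronchodilator value.
    """
    delta_ml = post_fev1_ml - pre_fev1_ml
    percent_change = 100.0 * delta_ml / pre_fev1_ml
    return delta_ml >= 200.0 and percent_change >= 12.0


def has_high_blood_eosinophils(eosinophils_per_ul: float) -> bool:
    """True if the blood eosinophil count exceeds 300 cells/ul."""
    return eosinophils_per_ul > 300.0


# Hypothetical subject (not study data): FEV1 1500 ml -> 1720 ml, 350 eosinophils/ul
print(has_bronchodilator_reversibility(1500, 1720))
print(has_high_blood_eosinophils(350))
```

Both conditions of the reversibility criterion must hold: a 250 ml gain on a 3000 ml baseline (about 8%) would not qualify, and neither would a 15% gain of only 150 ml.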

| Statistical analyses of clinical data
Differences among the groups were analyzed using Student's t-test, one-way analysis of variance, the Mann-Whitney U-test, the Kruskal-Wallis test, or Fisher's exact test. Annual changes in clinical parameters were estimated using linear mixed-effects models. Exacerbation-free survival was analyzed using the Kaplan-Meier method with the log-rank test. Statistical significance was defined as p < 0.05.
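For readers unfamiliar with the Kaplan-Meier method, the product-limit estimate of the survival curve can be sketched in a few lines. This is an illustration only: the function name and the follow-up data are hypothetical, the log-rank comparison between groups is not implemented, and the statistical software used in the study is not specified here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  : follow-up time per subject
    events : 1 if the event (e.g. first exacerbation) was observed, 0 if censored
    """
    event_counts = Counter(t for t, e in zip(times, events) if e)
    survival = 1.0
    curve = []
    for t in sorted(event_counts):
        # Subjects still under observation just before time t
        at_risk = sum(1 for u in times if u >= t)
        survival *= 1.0 - event_counts[t] / at_risk
        curve.append((t, survival))
    return curve


# Made-up follow-up data in months (not study data)
times = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 0, 1, 0]
for t, s in kaplan_meier(times, events):
    print(f"t={t}: S(t)={s:.3f}")
```

Censored subjects contribute to the at-risk count until their last follow-up but never reduce the survival estimate themselves.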

| Proteomic analysis overview
The proteome was assessed by SomaLogic (Boulder, Colorado, USA) using SOMAscan® assay v3.2. 13,14 Samples of 100 μl were provided to SomaLogic for the analysis, although each analysis required only a few μl. The SomaLogic data analysis workflow included hybridization normalization, median signal normalization, and signal calibration to control for inter-plate differences. Here, 77 SOMAmer probe-sets failed quality control, leaving 1233 that represented 1238 proteins, as 35 probe-sets could not distinguish between protein isoforms, and a further 11 probe-sets recognized a complex of two different proteins. These are shown in the supplementary information datasets (Dataset-9). They cover a diverse range of protein and biological functions and as such do not impact the overall pathway and functional analysis. Data were analyzed based on probe-set abundance as expressed by SomaLogic in relative fluorescent units.
Specific analyses are detailed in the supplementary information.
Protein set enrichment analyses were based on gene set enrichment methodology using bespoke Python scripts for calculating normalized directional enrichment scores 24 and non-directional scores (earth mover's distance). 25 Exosomal marker proteins in the SOMAscan® array were identified from the ExoCarta database. 26 The putative cell source of proteins was assessed from mRNA expression patterns in 79 human tissues using GeneAtlas U133A, gcrma data from Bio-GPS. 27 All proteomic data can be found in the supplementary information datasets (Dataset-2 to -8).
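The earth mover's distance used for the non-directional score can be illustrated for the one-dimensional case, where it equals the area between two empirical cumulative distribution functions. The sketch below is not the authors' script. The function name is hypothetical, and comparing a per-protein statistic inside a protein set with the same statistic for the remaining proteins is only one plausible way to apply it.

```python
def earth_movers_distance_1d(a, b):
    """1-D earth mover's (Wasserstein-1) distance between two empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    distance = 0.0
    i = j = 0
    for left, right in zip(grid, grid[1:]):
        # Advance the pointers so that i/len(a) and j/len(b) are the CDF values at `left`
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        distance += abs(i / len(a) - j / len(b)) * (right - left)
    return distance


# Made-up per-protein statistics (not study data)
in_set = [0.8, 1.1, 1.4, 2.0]
background = [-0.3, 0.0, 0.1, 0.4, 0.6, 0.9]
print(earth_movers_distance_1d(in_set, background))
```

Because the distance ignores the sign of the shift, it captures sets whose members move in either direction, which is why it suits a non-directional score.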

| RESULTS

| Patient cohorts
Patient characteristics of the COPD and asthma cohorts are summarized in Table 1. In general, COPD patients were slightly older, had a lower body mass index (BMI) and a higher smoking index, whereas the prevalence of current smokers was comparable between the COPD and asthma cohorts. The differences in clinical features reflect the definition of the two diseases. However, there was overlap in both the demographic and clinical characteristics due to our subgroup selection strategy. This was by design, to include a range of blood eosinophil levels, degrees of airflow limitation, and smoking index. Furthermore, we matched for age and BMI among predefined clinical subgroups within each disease (Tables S1 and S2). By taking this approach, we challenged the methodology to identify systemic differences between the disease subgroups while minimizing the effects of some potential covariates.

| The plasma proteome differs between COPD and severe asthma
To determine whether the plasma proteome could differentiate be-

Loss of integrity of the samples over time in storage, which was several years longer for the COPD than the asthma samples, did not appear to contribute to the differences between the two diseases.
Comparison of the SOMAscan® protein abundances to single protein assessments (ELISAs and radioimmunoassay), performed shortly after sample acquisition, showed a significant Spearman's correlation. Additionally, while demographic characteristics, such as BMI and sex, varied between these diseases, no pattern with protein abundance of the 365 differentially expressed proteins was observed when they were aligned to the heatmap (Figure 1B). Indeed, only 3, 5, and 7 proteins differed between the two cohorts for age, BMI and sex, respectively. This did not represent an enrichment over background (hypergeometric test, significance threshold p < 0.05). Finally, to confirm that the asthma versus COPD differences were not driven by potential confounders, we plotted the five most highly significant asthma versus COPD proteins against age and gender (Figure S6A and B). These plots showed that these confounders did not correlate with protein abundance within disease; for example, the youngest asthma patient had comparable protein abundance to the oldest, and this was also true for COPD.
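The enrichment-over-background question above is the classic hypergeometric setting: given a fixed number of differentially abundant proteins, how likely is an overlap at least as large as the one observed with a covariate-associated protein list? The following sketch shows the upper-tail calculation. The function name is hypothetical, and the covariate list size in the example is made up, because the text does not report it.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) when drawing N items without replacement from a population
    of M items that contains n 'successes'."""
    upper = min(n, N)
    total = comb(M, N)
    # math.comb returns 0 when more items are requested than exist,
    # so impossible combinations drop out of the sum automatically.
    return sum(comb(n, x) * comb(M - n, N - x) for x in range(k, upper + 1)) / total


# 1233 probe-sets and 365 differential proteins come from the text;
# the 20 covariate-associated proteins and the overlap of 7 are hypothetical.
p_value = hypergeom_upper_tail(k=7, M=1233, n=20, N=365)
print(f"P(overlap >= 7) = {p_value:.3f}")
```

A p-value above the chosen threshold would mean the overlap is no larger than expected by chance, which is the sense in which the covariate-associated proteins were "not enriched".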
Finally, simultaneous correction for these three confounders did not substantially alter the results: there were 365 significant proteins

TABLE 1 Characteristics of the COPD and asthma groups

We also evaluated these covariates further using the whole proteome, also including assessment of inhaled corticosteroid (ICS) dose or oral corticosteroid (OCS) dose (asthma only). Only a few proteins correlated with these covariates, including blood eosinophils (Figure S4). Furthermore, we found that in all cases they did not drive any clustering of the subjects, as visualized by overlaying values of the covariates onto a PCA of the proteome (Figure S5). Smoking history (pack-years), a known risk factor for COPD, was also explored as a potential confounder. No protein was found to correlate with pack-years in either asthma or COPD (Figure S4D), nor did pack-years drive the PCA of the proteome (Figure S5I and J). A plot of the five most highly significant asthma versus COPD proteins against pack-years clearly showed the separation between the diseases but no pattern with pack-years (Figure S6C). Comorbidities were unlikely to drive the differences, as very few individuals had co-existing cardiovascular or ischemic heart disease or diabetes. These analyses suggest that any difference observed between or within the diseases should be driven mainly by the disease pathology.
We identified 143 networks of enriched/altered protein-sets between COPD and asthma (p ≤ 0.001). These were grouped and labeled based on the main pathway observed in the network (Figure 1D). Networks that had higher protein abundances in asthma

| The plasma proteome defines four distinct endotypes within severe asthma
The heatmap of the COPD-asthma differentially expressed proteins showed some clustering with smoking index and blood eosinophil levels. This suggested that there may be within-disease endotypes.
This was investigated using k-means clustering of the whole proteome. Four distinct groups were observed by PCA in asthma (Asthma-1 to -4) (Figure S8A). These were defined by 230 proteins, which split into six blocks (Proteins-A to -F) that differed between any combinations within the asthma groups (p ≤ 0.05 BH-

The smallest subpopulation, Asthma-4 (n = 4), was the youngest in age and had the earliest age of asthma onset. This group was defined by high levels of blood neutrophils, blood and sputum eosinophils, FeNO, and serum total IgE. A higher percentage of these subjects were on OCS compared to the other groups. This group had consistently high levels of Proteins-D, whose functions analysis described allergic inflammation or allergic pulmonary eosinophilia, and of Proteins-C, which describe an inflammatory response. Like Asthma-3, they had a high AQLQ score and low levels of ISG15/IFNγ

FIGURE 1 (caption, partial) (D) Results from protein-set enrichment analysis for COPD versus asthma. The results are summarized as a network, where each enriched protein-set (p ≤ 0.001) is given as a node (circles) and protein-sets with >50% of genes in common are connected by edges (lines). Representative names and arbitrary colors are given for each cluster. Node size represents the size of the difference between COPD and asthma by observed/random earth mover's distance score. Those pathways underlined are elevated in asthma as compared to COPD.
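The edge rule described in the Figure 1D caption (protein-sets with >50% of genes in common are connected) can be sketched as follows. The function name and the example sets are hypothetical. The caption does not say which denominator defines the shared fraction, so the sketch assumes the size of the smaller set.

```python
from itertools import combinations


def overlap_edges(protein_sets, threshold=0.5):
    """Connect two sets when their shared fraction exceeds `threshold`.

    Assumption: 'fraction in common' is |A & B| / min(|A|, |B|); the caption
    does not state which denominator was used.
    """
    edges = []
    for (name_a, a), (name_b, b) in combinations(protein_sets.items(), 2):
        shared = len(a & b) / min(len(a), len(b))
        if shared > threshold:
            edges.append((name_a, name_b))
    return edges


# Hypothetical protein-sets, for illustration only
sets = {
    "complement": {"C3", "C5", "CFH", "CFB"},
    "coagulation": {"F2", "F10", "C3", "C5", "CFH"},
    "cell_cycle": {"CDK2", "CCNB1", "C3"},
}
print(overlap_edges(sets))
```

Connected components of the resulting graph would then correspond to the named clusters of related protein-sets shown in the network.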

| The plasma proteome defines three COPD endotypes reflective of annualized decline in lung diffusion capacity (Kco)
In contrast to the asthma picture, COPD appeared less complex. K-means clustering using the whole proteome identified three groups (COPD-1 to -3) with limited overlap in the PCA (Figure S9A). A total of 121 proteins drove this clustering, and these defined two protein groups, Proteins-G (n = 118) and -H (n = 3). The distributions of the probe-set abundances in the various groups are shown in the heatmap (Figure 3A) and for individual proteins in the violin plots (Figure 3B). Neither smoking index, Kco, BMI, nor sex drove the clustering. This was confirmed as baseline demographic and clinical characteristics of these COPD groups were similar (Table 3).
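K-means clustering, as used above for both diseases, alternates between assigning each subject to its nearest centroid and recomputing each centroid as the mean of its members. The sketch below is a plain implementation of Lloyd's algorithm on toy two-dimensional data. It is not the study's pipeline: the function name, the random initialization, and the data are assumptions, and a real analysis would run on the full proteome matrix, typically with a library implementation.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        new_labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Toy data: two well-separated groups of "subjects" (not study data)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The number of clusters must be chosen in advance; how the study arrived at four asthma groups and three COPD groups is described in its supporting information rather than here.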

Proteins-H mainly increased. IPA functions analysis indicated Proteins-G involvement in necrosis, cell movement, organ inflammation, and growth and proliferation of connective tissue (Figure 3C), which was reflected in the upstream modulator analysis (Figure 3D). In addition, there was indication of innate/adaptive immunity drivers (IL-15, TCR, IL-2). Forty-four proteins were found to be common between Proteins-G (36.9%, 44/119 total proteins) and Proteins-A (59.5%, 44/74 total proteins) (Dataset-8). COPD-1 had the greatest annualized decline in Kco, whereas COPD-3 had the least (Figure 3E, Table 3). Distributions of linear regression values of Proteins-G showed a stronger association with annualized decline in Kco than those of either Proteins-H or all other SOMAmer proteins (Figure 3F). Similar results were observed for baseline Kco, % predicted (Figure S9B).

[Figure caption fragment: ... in Kco, % predicted and PAK6 protein abundance, for Asthma-1 to -4 (red, green, blue, and pink squares, respectively) and COPD-1 to -3 (olive green, teal, and purple circles, respectively). The R value for the correlation and its associated p-value is given.]

| Role of immune and bronchial epithelial cells involved in the plasma proteome and their association with exosomal marker proteins
We next asked whether we could learn more about the cellular origin of the proteins in each Proteins group. Evaluation of the cellular localization, defined by IPA, revealed unique patterns between the groups (Figure 4A). Proteins-C had a higher proportion of secreted proteins than all SOMAscan® proteins (67% vs. 36%). While

| DISCUSSION
This study indicated that there are molecular pathways defined by systemic proteomics that differ between COPD and severe asthma, even when they share clinical and demographic features such as blood eosinophilia, bronchodilator reversibility, and smoking history. This supports the concept that these diseases are fundamentally different. 28 More of the differentially regulated pathways were up-regulated in COPD than in asthma subjects. These included pathways involved in metabolic and biosynthetic processes, mitochondria organization, regulation of the cell cycle, and growth factor signaling. Together with the other upregulated pathways observed, the data suggest that subjects with COPD were mounting an immune-driven, reparative response to stress, compared to the proinflammatory, complement/coagulation response seen in asthmatics. We did not assess whether there were shared pathways between these diseases, as we did not study the plasma proteome in a matched group of healthy controls. However, our results are consistent with a recent study that showed that plasma protein

Contrary to what we observed for asthma, the three COPD endotypes were defined by decreasing abundances of one large group of proteins, consistent with COPD being characterized by continuous disease traits co-existing in varying degrees, rather than by mutually exclusive subtypes. 39 The pathway and upstream modulator analyses showed that the predominant feature was cell death/apoptosis. The considerable overlap between Proteins-G and Proteins-A and the high levels of signaling proteins (e.g. MAPK1 and SUMO3) suggest similar importance of proteins including HSP90A. These observations align with the increase in apoptotic alveolar epithelial and endothelial cells observed in the lungs of COPD patients. 40 Furthermore, the high abundance of proteins involved in oxidative stress (e.g. STIP1) aligns with the proposed involvement of oxidative stress in the development of COPD. 41,42 While these events could be initiated by cigarette smoking, we did not see any clear association with smoking history.
We found that a subset of COPD Proteins-G positively correlated with annualized decline in Kco %-predicted and appeared to originate from B lymphoblasts and, to a lesser extent, mature CD19+ B cells, which suggests that their origin is non-lymphoid tissue. 43 These results reflect the increased number of B cells previously observed in bronchial biopsies and lung tissues. 44-46 Our results support the previous suggestion of an immune response role in COPD 47 and that B cells are strongly linked to the emphysema phenotype. 48 Key proteins related to this are the tyrosine kinase BTK, which has a key role in B cell development; TSLP, originally identified as key to supporting B cell lymphopoiesis; and its receptor CRLF2, which is expressed on B cells. 49 The data suggest a role of innate/adaptive immunity in lung function decline, possibly related to infection, as one correlate, PAK6, is a protein that associates with susceptibility to childhood pneumonia 50 and is reported to be an important factor in the early origins of COPD. 51 It is therefore possible that the B cell response in this group may be related to the host response to the lung microbiome. 52

Proteins-H were mainly elevated in the small COPD-3 cluster, which also had the lowest levels of Proteins-G. SIRT2, reported to be a candidate gene for COPD, associates with FEV1. 54 While a preliminary finding in a small number of COPD subjects, this latter observation supports the utility of applying large-scale proteomic data to genome-wide association studies. 55 Overall, more subjects in this cluster are needed to understand the clinical impact of this observation, although they had the lowest annualized Kco, % predicted of the three groups.
We observed that many plasma proteome proteins were cytoplasmic or nuclear in nature. Some of these proteins could have been released due to apoptotic or necroptotic death as these events were identified in our functions, pathways, and upstream modulator analyses. However, we observed only a few of the protein types reported to be released from myeloid cells during cell death in vitro. 56 Further work is required to define proteins that may be released from nonmyeloid cells undergoing apoptotic or necroptotic death and to understand if any of the plasma proteome is derived from them.
We did find an association of some of the plasma Proteins groups, and indeed, unexpectedly, of nuclear proteins, with exosomal marker proteins, suggesting the importance of extracellular vesicles (apoptotic bodies, microvesicles, or exosomes) in COPD or asthma. A mechanism by which nuclear proteins are loaded into exosomes has recently been proposed, 57 although it remains to be determined whether such a mechanism occurs in non-cancerous cells. There is growing support for a role of these vesicles in asthma. 58,59 Our results suggest that the exosomal proteins are representative of allergic inflammation and higher sputum eosinophils, and support the role of exosome secretion by eosinophils in asthma pathogenesis. 60 Exosome levels have been reported to be elevated in stable COPD or COPD exacerbation and to correlate with plasma biomarkers of systemic inflammation. 61 Overall, this study suggests the importance of extracellular vesicles in COPD and asthma endotypes, especially those derived from innate/adaptive immune cells.
Our results indicate that the SOMAscan abundance data compare well to other protein analysis platforms. However, use of single protein assessment measures to validate the SOMAscan data has some limitations: a) not all proteins in the SOMAscan array have suitable low-throughput options to validate a result; b) the SOMAscan platform has a large dynamic range, which may not always be the case for other platforms; c) in general, single assessment measures have a larger coefficient of variability (CoV) than the SOMAscan platform, whose 3%-4% CoV 62 drives greater precision in any analysis; and finally d) subtly different epitopes may be assessed by the two methods, which is common even between ELISAs to the same protein.
Although the average storage time of the samples before the SOMAscan assay differed between asthma and COPD cohorts, the data show that this does not contribute to the differences we detect. Analysis showed that within either cohort less than 2.0% of the total proteins assessed showed an apparent weak correlation with time in storage before assay. Furthermore, because the samples from the

SUZUKI ET AL.

indicating that these few correlations could be spurious.
In summary, this analysis shows the utility of large-scale plasma proteome analysis combined with the integration of clinical, disease and bioinformatic sciences. While a pilot study in nature, this noninvasive method, which simultaneously evaluates levels of numerous proteins, has potential for: a) serving as a repository of plasma biomarkers for discovery; b) defining molecular endotypes; c) providing new insights into the complex biology of multiple molecular pathways and identifying potential therapeutic protein targets; d) identifying the role of different cells and/or the cell processes that characterize the molecular endotypes; and finally e) linking genetic traits and protein expression. Molecular understanding derived from proteins, rather than from mRNA, has the potential to provide a basis for new ways to target the right pathobiology in the right patient cohort.