Identification of 6 dermatomyositis subgroups using principal component analysis‐based cluster analysis

Abstract
Objective: Dermatomyositis (DM) is a heterogeneous disease with a wide range of clinical manifestations. The aim of the present study was to identify the clinical subtypes of DM by applying cluster analysis.
Methods: We retrospectively reviewed the medical records of 720 DM patients and selected 21 variables for analysis, including clinical characteristics, laboratory findings, and comorbidities. Principal component analysis (PCA) was first conducted to transform the 21 variables into independent principal components. Patient classification was then performed using cluster analysis based on the PCA‐transformed data. The relationships among the clinical variables were also assessed.
Results: We transformed the 21 clinical variables into nine independent principal components by PCA and identified six distinct subgroups. Cluster A was composed of two sub‐clusters of patients with classical DM and classical DM with minimal organ involvement. Cluster B patients were older and had malignancies. Cluster C was characterized by interstitial lung disease (ILD), skin ulcers, and minimal muscle involvement. Cluster D included patients with prominent lung, muscle, and skin involvement. Cluster E contained DM patients with other connective tissue diseases. Cluster F included all patients with myocarditis and prominent myositis and ILD. We found significant differences in treatment across the six clusters, with clusters E, C and D being more likely to receive aggressive immunosuppressive therapy.
Conclusion: We applied cluster analysis to a large group of DM patients and identified 6 clinical subgroups, underscoring the need for better phenotypic characterization to help develop individualized treatments and improve prognosis.


| INTRODUCTION
Dermatomyositis (DM) is an idiopathic inflammatory myopathy (IIM) characterized by inflammatory disorders primarily affecting the skeletal muscle and skin with typical cutaneous lesions. 1 The outcomes of IIM are poor, with a 5-year survival rate of less than 50%. 2 The diagnosis of DM is still based on the Bohan and Peter criteria, 3,4 which were proposed in 1975. Four of these five criteria are related to muscle involvement, and the fifth is the presence of typical cutaneous lesions. In recent years, DM has been shown to be a heterogeneous disease entity with a wide range of clinical features.
In addition to muscle and skin involvement, other organs are often involved, leading to arthritis, esophageal disease, interstitial lung disease (ILD), and cardiac damage. 5 Patients with DM also have a higher risk of malignancy than the general population, and 10%-40% of DM patients go on to develop a malignancy. 1 is much more responsive to systemic corticosteroids than the skin component. 8 Patients with ILD may respond well to cyclophosphamide or mycophenolate mofetil. 9 Survival has been reported to be the worst (25% at 5 years) in cancer-associated myositis, followed by CADM (61% at 5 years). 10 Moreover, ILD with mildly increased serum creatine kinase (CK) levels and skin ulcers are independent risk factors for death in DM patients. 10 Tailoring the therapeutic strategy according to the DM subtype may improve the survival of DM patients. Precise phenotyping is critical for the development of individualized treatments and also for understanding the underlying pathological mechanisms.
In the present study, we aimed to objectively identify the subtypes of DM by using a new exploratory statistical method. We applied principal component analysis (PCA)-based cluster analysis to identify DM subtypes based on characteristic clinical manifestations and to determine the relationships between these variables. This methodology identified six distinct DM subtypes with different clinical characteristics. The validity of the clustering was confirmed by the significant differences in immunosuppressive therapies across the six subgroups.

| Data extraction
From the patients' medical charts, we extracted the data collected at the time of the first hospitalization or the first clinic visit in our hospital after the confirmation of the diagnosis. For all patients, we retrospectively reviewed data on the following parameters: demographics, IIM-related clinical manifestations and laboratory findings, cumulative major organ involvement, and immunosuppressive therapy. Malignancy was documented if it occurred within 3 years before or after the diagnosis of DM. ILD was determined using high-resolution computed tomography. 12 Cardiac involvement, including systolic or diastolic dysfunction, pericarditis, and pericardial effusion, was evaluated using echocardiography and electrocardiography.
We also documented the administration of aggressive immunosuppressive therapy, which was defined as a daily glucocorticoid dose equivalent to or more than 0.5 mg/kg prednisone, and treatment with cyclophosphamide, mycophenolate mofetil, cyclosporine, or tacrolimus.

| Data analysis protocol
Cluster analysis, the most popular method of unsupervised learning, is a multivariate technique used for identifying subgroups sharing similar characteristics in a data set. 13 In this study, cluster analysis was performed to identify subgroups among DM patients. We followed four critical steps in performing the statistical analysis: selection of clinical variables for analysis, cluster analysis of these variables to explore the relationships between them, PCA to reduce interactions between the variables, and cluster analysis of patients based on the PCA-transformed data.
Categorical variables are presented as numbers (percentages), and continuous variables are presented as mean (standard deviation) or median (interquartile range) depending on whether their distribution was normal or skewed. All analyses were conducted using SPSS version 24.0 for Mac (IBM).
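As a rough illustration of the reporting convention described above, the following Python sketch formats a continuous variable as mean (standard deviation) or median (interquartile range). It is not the SPSS procedure used in the study: the function name `summarize`, the `skewed` flag (which stands in for a formal normality check), and the toy values are all assumptions made for illustration.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values, skewed):
    """Format a continuous variable as 'mean (SD)' or 'median (Q1-Q3)'."""
    if skewed:
        q1, _, q3 = quantiles(values, n=4)  # quartile cut points
        return f"{median(values):.1f} ({q1:.1f}-{q3:.1f})"
    return f"{mean(values):.1f} ({stdev(values):.1f})"

# Hypothetical toy values, not study data.
ages = [35.0, 41.0, 52.0, 63.0]
ck_levels = [49.0, 120.0, 161.5, 300.0, 746.5]

print("Age at onset:", summarize(ages, skewed=False))
print("CK (U/L):", summarize(ck_levels, skewed=True))
```

Note that quartile definitions differ between software packages, so the interquartile range printed here may not match SPSS output on the same data.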

| Variable selection
In our study, variables with the same clinical significance, such as the V-sign and shawl sign, and myalgia and muscle tenderness, were combined into new variables for analysis. Variables with a large amount of missing data, such as elevated gamma glutamyl transpeptidase (GGT), alkaline phosphatase (ALP), and lactate dehydrogenase (LDH), were excluded from further analysis. In total, 21 variables were included in the analysis (Table 1). Continuous variables, such as age at onset and CK level, were standardized. Because PCA and cluster analysis require complete data, 74 patients with missing data for any of these 21 variables were excluded. This resulted in an analytic population of 720 patients (91% of the initial study population). We compared the characteristics of the patients who were included in our study with those of the patients who were excluded from our study (Table S1), and found that most of the clinical features studied did not differ between these two groups.
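The two preprocessing steps described in this subsection, complete-case exclusion and standardization of continuous variables, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the field names and toy records are hypothetical, missing values are represented as `None`, and the sample standard deviation is used for the z-scores (the text does not state which denominator was used).

```python
from statistics import mean, stdev

def complete_cases(records, variables):
    """Keep only records with no missing value (None) in any analysed variable."""
    return [r for r in records if all(r.get(v) is not None for v in variables)]

def standardize(records, variable):
    """Return z-scores for one continuous variable (sample SD is an assumption)."""
    values = [r[variable] for r in records]
    mu, sd = mean(values), stdev(values)
    return [(x - mu) / sd for x in values]

# Hypothetical toy records; field names are illustrative only.
patients = [
    {"age_at_onset": 41.0, "ck": 160.0, "ild": 1},
    {"age_at_onset": 63.0, "ck": None, "ild": 0},   # missing CK, so excluded
    {"age_at_onset": 52.0, "ck": 750.0, "ild": 1},
    {"age_at_onset": 35.0, "ck": 50.0, "ild": 0},
]
variables = ["age_at_onset", "ck", "ild"]

kept = complete_cases(patients, variables)
z_age = standardize(kept, "age_at_onset")

# Retained fraction implied by the text: 720 analysed of 720 + 74 patients.
retained = 720 / (720 + 74)
print(len(kept), [round(z, 2) for z in z_age], f"{retained:.1%}")
```

The last lines reproduce the arithmetic behind the reported 91%: 720 of 794 patients is about 90.7%.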

| Relationships between variables
Clinical experience indicates that the 21 identified variables are not independent. Hence, cluster analysis was performed to confirm the relationships between these variables. In our study, an agglomerative algorithm, which is a hierarchical clustering method, was used to cluster the variables. In this method, each variable is initially considered to be its own cluster, and then the clusters are hierarchically combined, with the clusters separated by the smallest distances being combined first. 13 A crucial step in hierarchical clustering is to define a dissimilarity or proximity measure that appropriately quantifies how similar individuals or variables are. A linkage function is then needed to calculate the distance between two clusters. Here, we chose the correlation between vectors of values, which is a similarity measure used for clustering variables. The complete-linkage (or furthest-neighbor) function, which uses the greatest distance between members of two clusters, was then selected to perform the cluster analysis. The results are shown in a dendrogram illustrating the relationships between the tested variables.
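The variable-clustering procedure described above can be sketched in plain Python. This is an illustrative sketch, not the SPSS implementation: the conversion of the Pearson correlation r into a dissimilarity as 1 - r is an assumption, and the variable names and binary toy data are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long value vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def complete_linkage(columns, n_clusters):
    """Agglomerative clustering of variables with complete linkage.

    Dissimilarity between two variables is 1 - r (assumed conversion);
    the distance between two clusters is the largest pairwise dissimilarity.
    """
    names = list(columns)
    dist = {(a, b): 1 - pearson(columns[a], columns[b])
            for a in names for b in names if a != b}
    clusters = [[n] for n in names]  # every variable starts as its own cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(dist[a, b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

# Hypothetical presence/absence data for eight patients (not study data).
columns = {
    "muscle_weakness": [1, 1, 1, 0, 0, 0, 1, 0],
    "myalgia":         [1, 1, 0, 0, 0, 0, 1, 0],
    "ild":             [0, 0, 1, 1, 1, 0, 0, 1],
    "mechanics_hand":  [0, 0, 0, 1, 1, 0, 0, 1],
}
groups = complete_linkage(columns, n_clusters=2)
print(groups)
```

In this toy data the two muscle variables merge with each other, as do the two lung-related variables, because their within-pair correlations are the highest.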

| Identification of DM clusters
Because the dendrogram confirmed the redundancy between the identified variables, PCA was first performed to achieve feature extraction, which can accomplish dimensionality reduction without losing important information about the variables. 13 Here, we used categorical PCA (CATPCA), which is used for mixed data that include continuous variables and binary variables. CATPCA of the original variables yielded 21 independent components ordered by decreasing eigenvalues or variances. Components with an eigenvalue >1 explained most of the variance, and were retained for further cluster analysis. Based on the PCA-transformed data, another cluster analysis was conducted to identify DM subgroups. We chose the squared Euclidean distance, which is the most commonly used distance measure. We implemented the Ward method, which minimizes the total within-cluster variance.
Table footnotes: Values are expressed as mean (standard deviation). c Values are expressed as median (interquartile range). d The quantifiable limit was 45 U/L for GGT, 100 U/L for ALP, and 250 U/L for LDH.
Differences in characteristics between the clusters were assessed using analysis of variance for continuous normally distributed variables, the non-parametric Kruskal-Wallis test for non-normally distributed variables, and the χ 2 test or Fisher exact test for categorical variables.
A P value < 0.05 was considered statistically significant.
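The patient-clustering step described in this subsection (squared Euclidean distance combined with the Ward method) can be illustrated with a short Python sketch. It is a simplified stand-in, not the SPSS routine used in the study: the function name `ward_clusters` is hypothetical, and the two-dimensional toy points take the place of the principal-component scores.

```python
def ward_clusters(points, n_clusters):
    """Agglomerative clustering with Ward's criterion.

    At each step, merge the two clusters whose union gives the smallest
    increase in total within-cluster sum of squares:
        delta = n_i * n_j / (n_i + n_j) * ||centroid_i - centroid_j||^2
    """
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (mi, ci), (mj, cj) = clusters[i], clusters[j]
                sq = sum((a - b) ** 2 for a, b in zip(ci, cj))  # squared Euclidean
                delta = len(mi) * len(mj) / (len(mi) + len(mj)) * sq
                if best is None or delta < best[0]:
                    best = (delta, i, j)
        _, i, j = best
        (mi, ci), (mj, cj) = clusters[i], clusters[j]
        n = len(mi) + len(mj)
        merged = [(len(mi) * a + len(mj) * b) / n for a, b in zip(ci, cj)]
        clusters[i] = (mi + mj, merged)  # size-weighted centroid of the union
        del clusters[j]
    return [sorted(members) for members, _ in clusters]

# Hypothetical component scores for six patients (not study data).
scores = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.9, 0.0), (10.1, 0.2)]
labels = ward_clusters(scores, n_clusters=3)
print(labels)
```

This brute-force version recomputes every pairwise merge cost at each step, which is adequate for a toy example but slow for hundreds of patients; statistical packages use more efficient update formulas.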
The median CK level was 161.5 U/L (49.0-746.5 U/L), and an elevated CK level was present in 46.5% of patients. A total of 238 (33.1%) patients received aggressive immunosuppressive therapy with cyclophosphamide, mycophenolate mofetil, cyclosporine, or tacrolimus. Figure 1 shows the process and results of the hierarchical cluster analysis of the 21 clinical variables. These variables could be optimally divided into six groups, confirming that the variables were not independent, and that the information obtained from these variables was redundant. Hence, these variables could not be directly subjected to cluster analysis.

| CATPCA
CATPCA of the original variables yielded 21 independent principal components ordered by decreasing variances. The first nine components, each with an eigenvalue >1, explained 54.7% of the variance and were retained for further cluster analysis. The correlations of the 21 variables with these nine components are presented in Table S2, and the last 12 components in
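The retention rule used here (keep components with an eigenvalue greater than 1) can be sketched in Python. The sketch assumes that the eigenvalues sum to the number of variables, as in a PCA on a correlation matrix; the function name `kaiser_retain` and the toy eigenvalues are hypothetical and are not the study's values.

```python
def kaiser_retain(eigenvalues, threshold=1.0):
    """Return (number of retained components, fraction of variance explained)."""
    ordered = sorted(eigenvalues, reverse=True)
    kept = [ev for ev in ordered if ev > threshold]
    explained = sum(kept) / sum(ordered)
    return len(kept), explained

# Hypothetical eigenvalues for six standardized variables (they sum to 6).
toy_eigenvalues = [2.1, 1.4, 1.1, 0.7, 0.4, 0.3]
n_kept, fraction = kaiser_retain(toy_eigenvalues)
print(n_kept, round(fraction, 3))

# Arithmetic check on the reported figures, under the same assumption:
# nine components explaining 54.7% of the variance of 21 variables
# would have eigenvalues summing to about 0.547 * 21.
implied_sum = 0.547 * 21
print(round(implied_sum, 2), round(implied_sum / 9, 2))
```

Under that assumption, the nine retained eigenvalues would sum to roughly 11.5, or about 1.28 on average, which is consistent with each of them exceeding 1.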

| Cluster analysis of DM patients
Hierarchical cluster analysis was performed among the 720 patients based on the nine principal components derived from the CATPCA. Figure 2 shows the grouping of the patients as the number of clusters decreased from 9 to 1. The clustering that resulted in six groups was chosen for further analysis, in part because of the principle of equipartition, which states that the number of patients in each cluster should be approximately equal. Cluster A5 in the five-cluster grouping was produced by the combination of the cluster with the largest number of patients (A6) and another cluster (B6) in the six-cluster grouping rather than by the combination of the two clusters with smaller numbers of patients, which made the six-cluster grouping the best choice. Furthermore, the clinical characteristics going from six clusters to five, four, or three clusters resulted in patient features that were more homogeneous rather than more distinct. Table 2 shows the clinical characteristics of the six groups. Most of the tested characteristics significantly differed across the six clusters. A summary of characteristics of these six DM clusters is presented in Table 3. Table S4.
FIGURE 1 Dendrogram showing the process and results of hierarchical cluster analysis of 21 variables. The horizontal axis represents the rescaled distance cluster combine in which the biggest distance between clusters was marked as 25. The horizontal lines on the left represent the clustering observations, which in our case are clinical variables. The dendrogram shows the process of hierarchical cluster analysis in which variables or clusters join together to form a bigger cluster. Variables or clusters that possess similar distribution patterns join together on the left, while clusters that possess more dissimilar distribution patterns join together on the right. The 21 variables can be optimally divided into 6 groups. ILD, interstitial lung disease; CK, serum creatine kinase
Significant differences were found between these two subgroups. Sub-cluster A9 included patients with the second highest rate of muscle weakness (83.1%), the highest rate of myalgia/muscle tenderness (68.9%), and similar frequencies of heliotrope rash, Gottron sign, and V-sign/shawl sign (59.6%, 53.3%, and 67.2%, respectively).
Sub-cluster B9 contained younger patients (mean age at onset, 41.5 years) with less frequent fever, muscle involvement, arthritis/arthralgia, and ILD, more frequent esophageal involvement, and the highest rate of heliotrope rash (98.3%). Based on these characteristics, we labeled cluster A as "classical DM and classical DM with minimal organ involvement".

| Relationship between immunosuppressive therapy and clusters
To validate the classification, we examined the relationship between the clusters and immunosuppressive therapy, which is a parameter that was not used for the creation of the clusters and reflects the physicians' clinical judgements of the outcomes. The results are presented in Table 2. As expected, there were significant differences in immunosuppressive therapy across the six clusters (P < 0.0001).
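A comparison of this kind, a categorical outcome across six clusters, is typically made with a χ2 test of independence, as listed in the statistical methods. The Python sketch below computes the test statistic and degrees of freedom for a 6 × 2 table. The counts are entirely hypothetical and are not the values in Table 2; the comparison with the tabulated critical value of 11.07 (5 degrees of freedom, α = 0.05) stands in for an exact P value.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts per cluster: [aggressive therapy, no aggressive therapy].
# These numbers are made up for illustration and are NOT the study's data.
therapy_by_cluster = [
    [20, 80],
    [10, 40],
    [30, 30],
    [25, 35],
    [15, 15],
    [5, 15],
]
stat, dof = chi_square_statistic(therapy_by_cluster)
CRITICAL_5DF_ALPHA_05 = 11.07  # tabulated chi-square critical value
print(round(stat, 2), dof, stat > CRITICAL_5DF_ALPHA_05)
```

When expected cell counts are small, the Fisher exact test mentioned in the methods would be the more appropriate choice; it is not implemented in this sketch.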

| DISCUSSION
In this study, we applied PCA-based cluster analysis to analyze the clinical data of a large group of DM patients, which eventually re-
Table footnote: Taking into account muscle weakness, myalgia/muscle tenderness and creatine kinase level.
We found that some of the six distinct subgroups identified in our study were highly consistent with classes or specific subtypes of DM defined by previous classification criteria and studies. The Generally, these features fit well with those of previously reported CADM with positive anti-MDA5 antibody. 14,15 The prognosis of anti-MDA5-positive CADM patients is unfavorable, with a 40% mortality rate, attributed mostly to the rapid progression of ILD. 15 In our study, we found that the patients in cluster C were more likely to receive aggressive immunosuppressive therapy (46.4%), which reflected the physicians' clinical judgements of a poorer outcome in this cluster.
Cluster D (DM with dominant lung, muscle, and skin involvement) was characterized by myositis, ILD, mechanic's hand, and Raynaud phenomenon, which was consistent with the clinical manifestations of antisynthetase syndrome. The first case series of patients with antisynthetase syndrome was published in 1990, which defined the disease as a constellation of the following signs: polymyositis, interstitial pneumonia, Raynaud phenomenon, mechanic's hand, and arthritis. 16  was no correlation between overall disease severity and cardiac involvement. 18 Given the phenotypic uniqueness of this subgroup and the discordance between cardiac involvement and disease severity, we propose that DM with myocarditis be regarded as a new distinct subtype of DM. Anti-Ro antibody is reported to be a biomarker specifically associated with cardiac involvement in DM, 19 which provides mechanistic evidence in favor of our findings.
We also performed cluster analysis of variables, resulting in a dendrogram. Variables categorized into the same groups were more closely associated with each other than with other variables, and had similar distribution patterns among patients. Our results were mostly consistent with those of previous studies. Muscle involvement and myocarditis shared similar patterns of distribution; the association of these two conditions was also observed in cluster F, as has been discussed above. ILD was associated with Gottron sign and mechanic's hand. This association was also demonstrated in cluster D, and is consistent with the results of previous studies. 20 The presence of comorbidities such as other connective tissue diseases was associated with Raynaud phenomenon, periungual telangiectasia, and pericardial effusion, which is consistent with a previous review stating that Raynaud phenomenon is one of the most frequently re-
This study represents the first-ever attempt to apply cluster analysis to a cohort of DM patients. Due to the intrinsic property of this method, a large sample size is required. Hence, the method has been applied to patient populations of certain common diseases, such as chronic obstructive pulmonary disease, heart failure, encephalitis, and Parkinson disease. 22-26 However, it has scarcely been used in the field of rheumatology. 21,27,28 Only one study has used cluster analysis to group 233 patients with antisynthetase syndrome. That study resulted in three clusters and revealed that the tropism of the disease depends more on muscle involvement in the case of patients with anti-Jo-1 antibodies and more on ILD in the case of patients with anti-PL7 or anti-PL12 antibodies. Consequently, the mortality (due to ILD) is higher in the anti-PL7/12 group than in the anti-Jo-1 group. 27 We conducted PCA to transform the original variables included in the cluster analysis for two reasons.
First, PCA is especially useful to reduce dimensionality, which can eliminate noisy variables that may corrupt the cluster structure. 13 Independence of the variables is a prerequisite for cluster analysis, and clinically, we were aware that the original variables lacked independence. Accordingly, in a preliminary analysis, the direct application of cluster analysis to the original variables did not yield satisfactory results. Second, PCA not only reduced dimensionality but also detected key features of the data. 13 Studies utilizing cluster analysis have explored various methods of pre-processing the original variables, including factor analysis, 22 PCA, 23 and the subjective deletion of variables with a prevalence of <20% or >80%. 24 PCA stands out from all these pre-processing methods, as it maintains the integrity of the data, and consequently, would not leave out information on symptoms with a lower prevalence. DM is characterized by its heterogeneity of symptoms. Some symptoms are less prevalent but are nevertheless clinically significant; if these symptoms had been missed due to methodological flaws, the reliability of clustering would have been compromised.
Furthermore, our results supported our expectation that PCA-based cluster analysis would be suitable for the subtyping of DM, a disease with several rare but important symptoms.
Our study has several strengths. The large study sample of over 700 DM patients made it possible to demonstrate diverse phenotypes and to conduct the first cluster analysis in the field of DM. The clustering resulted in six subgroups, most of which showed good concordance with previous reports. Furthermore, new subgroups and features emerged, providing a basis for further studies. There are also several limitations of our study. First, missing data and recall bias existed due to the retrospective nature of the study. For example, cardiomyopathy was only detected when patients were referred for echocardiography due to relevant clinical manifestations or abnormal electrocardiographic findings, which precluded the detection of subclinical cardiac involvement. Second, our study did not include myositis-specific antibody profiles because the detection kits were not commercially available until October 2015. Hence, most of the patients lacked these data. We believe that myositis-specific antibodies will greatly facilitate DM subtyping in future studies. Third, the diversity of the six DM subgroups obtained using cluster analysis needs to be validated by long-term follow-up studies, and the universality of the classification also needs to be validated in an independent cohort. However, our results did identify some important prognostic factors that have been reported in previous studies, and we analyzed the clinicians' therapeutic choices as a surrogate endpoint measure, which provided some support for the validity of the subgrouping.
In conclusion, we, for the first time, applied a new exploratory statistical methodology to a large cohort of DM patients, which led to the identification of six clinical subgroups of DM. These subgroups may help to develop individualized treatments and improve patient prognosis. Longitudinal studies are needed to evaluate the prognostic value of the classification.

CONFLICT OF INTEREST
The authors declare they have no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

AUTHOR CONTRIBUTION
Huiyi Zhu, Qian Wang and Nan Jiang were the authors who con-