Prediction of stone disease by discriminant analysis and artificial neural networks in genetic polymorphisms: a new method

Authors


W.-C. Chen, Department of Urology, China Medical College Hospital, 2 Yu-Der Road, Taichung, 404, Taiwan.
e-mail: drtom@www.cmch.org.tw

Abstract

OBJECTIVE

To use information from genetic polymorphisms and from patients (drinking/exercise habits) to identify their association with stone disease, the main analytical and predictive tools being discriminant analysis (DA) and artificial neural networks (ANNs).

PATIENTS, SUBJECTS AND METHODS

Urinary stone disease is common in Taiwan; the formation of calcium oxalate stone is reportedly associated with genetic polymorphisms but there are many of these. Genotyping requires many individuals and markers because of the complexity of gene-gene and gene-environmental factor interactions. With the development of artificial intelligence, data-mining tools like ANNs can be used to derive more from patient data in predicting disease. Thus we compared 151 patients with calcium oxalate stones and 105 healthy controls for the presence of four genetic polymorphisms: cytochrome p450c17, E-cadherin, urokinase and vascular endothelial growth factor (VEGF). Information about environmental factors, e.g. water, milk and coffee consumption, and outdoor activities, was also collected. Stepwise DA and ANNs were used as classification methods to obtain an effective discriminant model.

RESULTS

With only the genetic variables, DA successfully classified 64% of the participants, but when all related factors (gene and environmental factors) were considered simultaneously, stepwise DA was successful in classifying 75%. The results for DA were best when six variables (sex, VEGF, stone number, coffee, milk, outdoor activities), found by iterative selection, were used. The ANN successfully classified 89% of participants and was better than DA when considering all factors in the model. A sensitivity analysis of the input parameters for the ANN was conducted after the ANN program was trained; the most important input affecting stone disease was genetic (VEGF), while the second and third were water and milk consumption.

CONCLUSIONS

While data-mining tools such as DA and ANN both provide accurate results for assessing genetic markers of calcium stone disease, the ANN provides a better prediction than the DA, especially when considering all (genetic and environmental) related factors simultaneously. This model provides a new way to study stone disease in combination with genetic polymorphisms and environmental factors.

INTRODUCTION

Urolithiasis has an overall prevalence of 9.6% in Taiwan and a positive family history is a risk factor for stone disease [1]. Although the cause of most calcium oxalate stone disease is still unclear, previous genetic studies have shown that urolithiasis is associated with a polygenic defect and partial penetrance [2–4]. Several disorders that cause renal stones are hereditary. Ethnicity might also be important, as the incidence of urolithiasis in whites and Asians is higher than in Africans. Although genetic causes have been studied extensively, there are no studies of chromosomal mapping in patients with stones and idiopathic hypercalciuria [4]. It seems improbable that work on stone disease will confirm that only one gene defect is responsible; on the contrary, urolithiasis is probably caused by several genetic factors simultaneously.

Computer science has been used as an analytical tool in several medical fields; the binary classification problem has wide application in biology and medicine. This problem has been studied extensively by statisticians, and recently several machine-based learning approaches have been proposed. The latter group of techniques can be categorized as ‘soft’ computing methods; machine-learning techniques used for discrimination can be put into two categories: (i) the ‘connectionist’ model which uses some form of neural network algorithm; and (ii) the inductive learning model, which is expressed in symbolic form using rules, decision trees, etc. Data-mining tools, e.g. artificial neural networks (ANNs) and discriminant analysis (DA), have been well documented and are also used to predict diseases, treatment outcomes and prognosis for a variety of diseases. Various architectures of ANN have been used in different medical diagnoses and their results compared with existing classification methods and physicians' diagnoses [5].

In urology, ANNs have recently been used for several diagnostic and prognostic problems. Foresee et al. [6] used serum PSA, age, DRE findings and TRUS results from 1578 men with a serum PSA of > 4 ng/mL as input variables in a multilayer perceptron (MLP) network. Tewari et al. [7] used clinical data from 1200 patients with localized prostate cancer to train and test an MLP with a genetic algorithm to predict the presence of positive margins, seminal vesicle involvement and lymph node disease. Snow et al. [8] used age, clinical stage, tumour grade, preoperative PSA level and race as input variables in an MLP network from 938 men to predict the prostate cancer recurrence rate; 67% had no recurrence. Sonke et al. [9] used several noninvasive variables including the IPSS and produced an MLP network to predict the outcome of urinary pressure-flow studies. Their model was trained and tested on 1903 patients and yielded 69% specificity at 71% sensitivity. Cummings et al. [10] developed an ANN for calculating the probability of spontaneous ureteric stone passage; the model was trained on 125 patients and was accurate in 69% of the participants. Therefore, ANNs can be useful for assessing several urological diseases.

Single nucleotide polymorphisms (SNPs) have been used as a tool for mapping complex disease genes such as those thought to be responsible for urolithiasis [11,12], but there are many important genetic polymorphisms that remain to be assessed when predicting disease. Genotyping requires many individuals and markers because of the complexity of gene-gene and gene-environmental factor interactions. Given this variety of available approaches, a difficult task for decision makers is the selection of a particular technique that best matches a given problem. Most comparisons of machine-learning techniques have been based on real data from diverse domains; statistical methods differ in that they rest on assumptions made before the study. Therefore, the power of prediction using statistics is limited, and a new approach for determining the cause and predicting disease is needed. Few reports have used patient gene and environmental factors together with machine-learning algorithms to predict stone disease, and few have compared the performance of machine learning with traditional statistical approaches to identify patients with stone disease. In the present study, we investigated the distribution of cytochrome p450c17 (CYP17), E-cadherin (CDH-1), urokinase, and vascular endothelial growth factor (VEGF) genes between a control group and patients with stones using data-mining technology to develop a new method for assessing stone disease.

PATIENTS, SUBJECTS AND METHODS

In all, 151 patients (118 men and 33 women, mean age 44.1 years, sd 11.9, range 26–78) who had calcium oxalate stones (at least two episodes) were enrolled in the study, regardless of their family history. Serial blood and urine biochemistry tests were undertaken to exclude possible hypercalcaemia, hyperuricaemia, or hyperuricosuria. Patients who had symptoms of UTI during the period of stone treatment were excluded. Stone composition was verified by infrared spectroscopy and revealed either calcium oxalate monohydrate, dihydrate, or a combination of these. The control group comprised 105 healthy volunteers aged > 40 years (60 men and 45 women, mean age 53.5 years, sd 11.1, range 40–73) who had no history of familial stone disease or cancer. Renal ultrasonography and routine tests for urinary microscopic haematuria were undertaken to exclude any subject who may have had renal calcification. Informed consent was obtained from both groups of participants.

Basic information about the patients and controls included family history of stones, age when the stone was first diagnosed, and recurrence of stone on ultrasonography or X-ray. Additional information on the consumption of water, milk and coffee was recorded; water intake was defined as low (< 1 L/day), moderate (1.0–2.0 L/day) and high (> 2.0 L/day) [13]. Milk consumption was similarly categorized as low (< 240 mL/day), moderate (240–720 mL/day) and high (> 750 mL/day), and coffee consumption as low (< 1 cup/day), moderate (2–3 cups/day) and high (> 4 cups/day). Outdoor activity was recorded as infrequent (< 2 h/day), frequent (2–4 h/day) and high (> 4 h/day).
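The water-intake categories above can be turned into an ordinal input variable. The following is a minimal sketch only; the function name `encode_water` and the integer codes 0, 1 and 2 are assumptions made for illustration, as the coding actually used for the models is not stated.

```python
def encode_water(litres_per_day: float) -> int:
    """Map daily water intake to an ordinal code.

    The thresholds follow the text; the codes 0/1/2 are a hypothetical choice.
    """
    if litres_per_day < 1.0:
        return 0  # low: < 1 L/day
    if litres_per_day <= 2.0:
        return 1  # moderate: 1.0-2.0 L/day
    return 2      # high: > 2.0 L/day


# Encode a few example intakes, including both boundary values.
codes = [encode_water(v) for v in (0.5, 1.0, 2.0, 2.5)]
print(codes)
```

Only water is encoded here because its three categories cover every intake value; the milk and coffee categories as reported leave gaps between bands, so any encoding of them would need a further assumption.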

Genomic DNA was prepared from peripheral blood using a commercial kit (DNA Extractor WB, Wako, Japan). PCR was carried out in a total volume of 50 µL, containing genomic DNA, 2–6 pmol of each primer, 1 × Taq polymerase buffer (1.5 mmol/L MgCl2) and 0.25 units of AmpliTaq DNA polymerase (Perkin Elmer, Foster City, CA, USA). Restriction analysis was used on each genetic marker (data not shown). The genes studied (with the site of polymorphism) included VEGF (nucleotide −460 upstream), CDH-1 (3′ untranslated region +15 nucleotides), urokinase (+4065 nucleotide 3′ untranslated region) and CYP17 (5′ untranslated region, i.e. promoter), with polymorphisms designated CC, CT and TT. The input variables were sex, the genes CYP17, IGF2, VEGF and CDH-1, the stone number in family history (Snum), outdoor activity, and water, milk and coffee consumption.

COMPUTING TECHNIQUES

DA is a linear statistical classification method which determines a linear combination of dependent variables and provides the maximum degree of distinction among the compared classes by using specific class characteristics [14]. An ANN with two hidden layers and back propagation-momentum was used as a machine-learning method (Fig. 1) [15]. About 90% of the ANNs presented in clinical medicine are MLPs [5]; Fig. 2 is an MLP model with two input variables, three hidden neurones and one output neurone. Each arrow in the figure represents a weight parameter, i.e. an optimisable value. The short arrows in the figure are bias weights which are not multiplied by any incoming values. The MLP model can be considered a combination of several logistic regression models. For the present study we chose the MLP model as our main ANN analysis tool.
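To make the MLP description concrete, the forward pass of a network with two inputs, three hidden neurones and one output neurone can be sketched as below. This is an illustration under stated assumptions, not the network used in the study: the logistic activation is chosen because the text likens the MLP to combined logistic regression models, and all weight values and the name `mlp_forward` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, as used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, hidden_w, hidden_b, out_w, out_b):
    """Forward pass of a one-hidden-layer MLP with a single output neurone."""
    # Each hidden neurone is a logistic regression on the inputs plus a bias weight.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
        for ws, b in zip(hidden_w, hidden_b)
    ]
    # The output neurone is a logistic regression on the hidden activations.
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


# Hypothetical weights for a 2-3-1 network (not fitted to any data).
hidden_w = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [1.2, -0.7, 0.5]
out_b = 0.05

p = mlp_forward([1.0, 0.0], hidden_w, hidden_b, out_w, out_b)
print(p)
```

The output lies between 0 and 1 and can be read as a class score; training adjusts every weight and bias, which is what the text calls the optimisable values.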

Figure 1. Network layout of the ANN used for classifying stone disease.

Figure 2. A multilayer perceptron (MLP) model.

The classification results of both methods are based on the same datasets that were used for verification and calibration. This provides objectivity by comparing only the performance of each classification method. The success rate of classification is determined by the ratio of correctly classified recordings to the total number of recordings in that set.
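The success rate defined above is a simple ratio, and can be sketched as follows. The function name `classification_rate` is an assumption; the counts in the example are the totals reported in Table 1 for the ANN with all factors and for DA with gene factors only.

```python
def classification_rate(correct: int, total: int) -> float:
    """Percentage of recordings in a set that were classified correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# Totals from Table 1: correct and false classifications over all participants.
ann_all = classification_rate(229, 229 + 27)   # ANN, all factors
da_gene = classification_rate(164, 164 + 92)   # DA, gene factors only
print(round(ann_all), round(da_gene))
```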

All calculations for the DA used appropriate statistical software and all ANN values were calculated using the SPSS Clementine™ data-mining software (http://www.spss.com/datamine/whitpap.htm, 1998).

RESULTS

DA using the input variables selected by gene type only correctly classified 64% of the participants (Table 1). The coefficients for the linear discriminant function Ds (which describes the posterior probability of contracting stone disease, Fisher's discriminant function) were:

Table 1. Classification results from DA and the ANN

Model/input                  Stone disease   Controls   Total
DA, using gene factors
  Correct                    107             57         164
  False                      45              47         92
  Classification rate, %     70              55         64
DA, using all factors
  Correct                    98              93         191
  False                      54              11         65
  Classification rate, %     65              89         75
ANN, using gene factors
  Correct                    129             36         165
  False                      23              68         91
  Classification rate, %     85              35         65
ANN, using all factors
  Correct                    127             102        229
  False                      25              2          27
  Classification rate, %     84              98         89

Ds = 31.269 + 1.873·sex + 4.264·IGF2 + 12.403·VEGF + 7·CDH-1 + 4.507·CYP17

After interactively modifying the variables selected by stepwise DA, 75% of the participants were correctly classified (Table 1). The results were best when six variables (sex, VEGF, Snum, coffee, milk, outdoor activities; found by iterative selection) were used. The coefficients for the linear discriminant function were:

Ds = −20.714 + 2.065·sex + 11.455·VEGF + 0.207·Snum + 0.8·coffee + 9.273·milk + 1.989·outdoor
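The six-variable discriminant function can be evaluated for a single participant as sketched below. The coefficients are those reported above; the numeric coding of each variable and the example participant are hypothetical, as the coding scheme and the cut-off separating the two classes are not given here.

```python
# Coefficients of the six-variable linear discriminant function reported above.
INTERCEPT = -20.714
COEFFS = {
    "sex": 2.065,
    "VEGF": 11.455,
    "Snum": 0.207,
    "coffee": 0.8,
    "milk": 9.273,
    "outdoor": 1.989,
}


def discriminant_score(x: dict) -> float:
    """Intercept plus the weighted sum of the six input variables."""
    return INTERCEPT + sum(COEFFS[name] * x[name] for name in COEFFS)


# Hypothetical participant; the 0/1 codes are assumptions for illustration.
example = {"sex": 1, "VEGF": 1, "Snum": 0, "coffee": 0, "milk": 1, "outdoor": 0}
score = discriminant_score(example)
print(round(score, 3))
```

Because the function is linear, each coefficient gives the change in score per unit change in its variable, which is why the size of a coefficient depends on how that variable is coded.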

Various calculations with different compositions of ANN as a second classification method were undertaken, and the suitability and performance of an appropriate network or learning method, and the corresponding combination of variables, were analysed iteratively. When input variables related by gene type only were selected, the ANN correctly classified 65% of the participants (Table 1). When all variables were used as input vectors for the ANN then 89% of the participants were correctly classified (Table 1).

The ANN configuration and additional information about the learning algorithm are shown in Table 2. The learning mode type selected was the ‘prune’ method, which starts with a large network and removes (prunes) the weakest units in the hidden and input layers as training proceeds. The proportion of training data was set to half of the whole dataset. As we were interested in identifying which input fields are most important in predicting stone disease, the sensitivity analysis function (which detects the most important factors) was set. Setting the ‘Stop On’ function as default allows the network to stop training when it reaches the optimally trained state. No random seed was set, which means that the sequence of random values used to initialize the network weights was different every time the node was executed. Different combinations of hidden layers were tried and two hidden layers yielded the best results for the data. All the other related configuration parameters that yielded the best results for the ANN model are listed in Table 2.

Table 2. Parameters and configuration of the ANN

Parameter                               Value
Learning mode type                      Prune method
Prevent over-training                   Yes
Training %                              50
Sensitivity analysis                    Yes
Stop On                                 Default
Set random seed                         No
Generate model from                     Best network
Hidden layers                           2
Hidden units (#1, #2 hidden layer)      20, 15
Hidden rate                             0.15
Input rate                              0.15
Hidden persistence                      6
Input persistence                       4
Persistence                             100
Overall persistence                     3
Learning rate, initial ɛ                0.3
ɛ range                                 [0.01, 0.1]
Momentum term α                         0.9
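The roles of the learning rate ɛ and momentum term α in Table 2 can be illustrated with a generic gradient-descent update with momentum. This is a sketch of the textbook rule on a one-parameter toy error function, not the internal algorithm of the software used; the function name `momentum_step` is an assumption, and the learning rate is held fixed at its initial value although Table 2 also lists a range for ɛ.

```python
def momentum_step(w, velocity, grad, lr=0.3, momentum=0.9):
    """One weight update: keep a fraction of the previous step, then descend."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity


# Toy error function E(w) = 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)

print(abs(w) < 1e-3)
```

With a momentum term as high as 0.9 the weight oscillates around the minimum before settling, which is one reason a decaying learning rate is often paired with it.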

The topology of the present ANN was similar to that shown in Fig. 2. Sensitivity analysis of the input variables after the ANN was trained (to determine which input variables are most important) is shown in Table 3; the most important was VEGF, with water and milk consumption second and third.

Table 3. Sensitivity analysis of the input variables for the most successful calculation with the ANN

Importance priority   Input variable   Relative importance
1                     VEGF             0.21387
2                     Water            0.19146
3                     Milk             0.18548
4                     CYP17            0.17353
5                     CDH-1            0.17014
6                     Outdoor          0.13504
7                     IGF2             0.11999
8                     Coffee           0.11931
9                     Sex              0.11427
10                    Snum             0.01390
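How a sensitivity ranking of this kind can be obtained is sketched below with a simple one-at-a-time perturbation scheme. The exact procedure used by the data-mining software is not described here, so this is an assumed, generic approach; `sensitivity` and `toy_model` are hypothetical names, and the toy model stands in for a trained network.

```python
def sensitivity(model, baseline, lo=0.0, hi=1.0):
    """Output swing when each input alone moves from lo to hi.

    All other inputs are held at their baseline values.
    """
    scores = []
    for i in range(len(baseline)):
        low = list(baseline)
        high = list(baseline)
        low[i], high[i] = lo, hi
        scores.append(abs(model(high) - model(low)))
    return scores


def toy_model(x):
    """Stand-in for a trained network: input 0 matters most, input 2 not at all."""
    return 2.0 * x[0] + 0.5 * x[1] + 0.0 * x[2]


scores = sensitivity(toy_model, baseline=[0.5, 0.5, 0.5])
# Rank input indices from most to least influential.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A one-at-a-time scheme like this ignores interactions between inputs, so it gives only a first-order view of importance.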

DISCUSSION

There seemed to be no obvious differences in success between the models when only genetic factors were considered, with DA successful in 64% and the ANN in 65%. However, when both environmental and genetic factors were considered together the ANN model was more successful than DA, at 89% and 75%, respectively.

Although the ANN was not completely successful in classifying the participants, the rate improved when the environmental and genetic factors involved in stone disease were considered together. The adjusted model could be further applied to predict the age of onset, recurrence and outcome. Each factor's importance in stone disease was also determined by calculating the fractional percentages (Table 3). Using these methods, further information can be obtained from patient data, providing a more accurate prediction of stone disease.

The most important genetic polymorphism was VEGF, comprising 28% of the ‘importance’. Other genetic polymorphisms were relatively less important and, on chi-square testing, were not associated with urolithiasis. However, when the polymorphisms were analysed individually in previous studies [11,12], environmental factors were not considered. Although SNPs have recently been used as a tool for mapping stone disease genes, it remains difficult to identify the gene(s) that cause stone disease, i.e. there are important genetic polymorphisms yet to be identified. Furthermore, because gene-gene and gene-environmental interactions were not considered before, our previous studies of this complex disease were ineffective [11,12]. The new ANN model may be a method for genetic research in stone disease.

VEGF, a homodimeric glycoprotein of 45 kDa, is the only mitogen that specifically acts on endothelial cells [16]. VEGF is a potent inducer of endothelial cell growth and is important in neovascularization [16]. VEGF-mediated neo-angiogenesis is integrated with a signalling system and other factors such as fibroblast growth factor [17]. Hypoxia, androgen, interleukin-1 and TNF are reported to regulate or at least correlate with VEGF in enhancing neovascularization [18,19], indicating that VEGF needs to be activated by upstream signalling [20]. Many reports indicate that VEGF might either combine with upstream signals or other growth factors to enhance epithelial cell growth and to enhance vascular permeability. Therefore, VEGF may act through several different pathways to initiate the pathogenesis of stone disease.

According to the theories of fixed particles and cellular injury in stone formation [21–25], hyperoxaluria and chronic oxalosis injure tubular epithelial cells, which results in the production of cytokines, osteopontin and other inflammatory proteins [26]. Epithelial cells damaged by the formation of calcium oxalate crystals are excreted in urine, phagocytosed, endocytosed, undergo cell division, and finally undergo apoptosis. All of these events contribute to stone formation. Calcium oxalate crystals must be retained in renal tubules for stone formation; the larger crystals and crystal aggregates become covered by neighbouring epithelial cells. A new basement membrane is formed at the basal site and the original basement membrane disappears, leading to the incorporation of the crystals into the renal interstitium. Renal crystal deposition begins when the crystals and crystal aggregates are retained. The interstitial matrix becomes enlarged and oedematous through cell proliferation stimulated by chemokines and cell adhesion molecules. This indicates that crystal deposition in the kidney causes an inflammatory reaction. If the crystals are removed, inflammation may resolve both in the tubules and interstitium after some time; otherwise, inflammation may cause the retention of crystals and stone formation if there is long-term exposure to oxalosis. VEGF is one of the cytokines involved in stone formation; it has been implicated as a contributor to the formation of urolithiasis, and may act as a ‘signpost’ or physical landmark on the chromosome which can be used as a possible genetic marker [27].

Besides the VEGF gene, the second most important factor in predicting urolithiasis was the consumption of water [13]. Milk was the third most important, more so than the IGF2, CYP17 and CDH-1 genes. Outdoor activities and coffee were relatively unimportant. The model detected an interaction between environmental factors and genes.

Both DA and ANN gave a better classification when further environmental factors were considered but the ANN was more accurate than DA when all related factors (gene and environmental) were considered. Although the parameter configurations and modifications require further trials, the ANN and data mining provide new approaches to the study of genetic markers for calcium stone disease, complementing the study of genetic polymorphisms and environmental factors.

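Abbreviations

ANN, artificial neural network; DA, discriminant analysis; MLP, multilayer perceptron; VEGF, vascular endothelial growth factor.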
