Integrative data mining to identify novel candidate serum biomarkers for pre-eclampsia screening


  • Jeroen L. A. Pennings,

    Corresponding author
    • Laboratory for Health Protection Research (GBO), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands
  • Sylwia Kuc,

    1. Laboratory for Infectious Diseases and Perinatal Screening (LIS), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands
    2. Department of Obstetrics, Wilhelmina Children's Hospital, University Medical Center Utrecht (UMCU), Utrecht, The Netherlands
  • Wendy Rodenburg,

    1. Laboratory for Health Protection Research (GBO), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands
  • Maria P. H. Koster,

    1. Laboratory for Infectious Diseases and Perinatal Screening (LIS), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands
    2. Department of Obstetrics, Wilhelmina Children's Hospital, University Medical Center Utrecht (UMCU), Utrecht, The Netherlands
  • Peter C. J. I. Schielen,

    1. Laboratory for Infectious Diseases and Perinatal Screening (LIS), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands
  • Annemieke de Vries

    1. Laboratory for Health Protection Research (GBO), National Institute for Public Health and the Environment (RIVM), Bilthoven, The Netherlands

Jeroen L. A. Pennings, Laboratory for Health Protection Research (GBO), National Institute for Public Health and the Environment (RIVM), P.O. Box 1, NL-3720BA Bilthoven, The Netherlands.




Pre-eclampsia (PE) is a serious complication that affects approximately 2% of pregnant women worldwide. At present, there is no sufficiently reliable test for early detection of PE in a screening setting that would allow timely intervention. To help future experimental identification of serum biomarkers for early onset PE, we applied a data mining approach to create a set of candidate biomarkers.


We started from the disease etiology, which involves impaired trophoblast invasion into the spiral arteries. On the basis of this, we used a three-stage filtering strategy consisting of selection of tissue-specific genes, textmining for further gene prioritization, and identifying blood-detectable markers.


This approach resulted in 38 candidate biomarkers. These include the three best first-trimester serum biomarkers for PE found to date, LGALS13 (placental protein 13, PP13), PAPPA (pregnancy-associated plasma protein-A, PAPP-A), and PGF (placental growth factor, PlGF), as well as five proteins previously identified as biomarkers after the first trimester or after disease onset. This substantiates the effectiveness of our approach and provides an important indication that the list will contain several new biomarkers for PE.


We anticipate this list can serve in prioritization of future experimental studies on serum biomarkers for early onset PE. Copyright © 2011 John Wiley & Sons, Ltd.


Pre-eclampsia (PE) is a severe disorder that occurs only during pregnancy and the postpartum period. It affects approximately 2% of pregnant women worldwide and is the leading cause of maternal and perinatal morbidity and mortality, particularly when it occurs before the 34th week of gestation (early onset PE) (WHO, 2005). Renal failure, HELLP syndrome (hemolysis, elevated liver enzymes, and thrombocytopenia), cerebral or liver hemorrhage, eclampsia, and maternal death are the most common complications of PE. Furthermore, PE is strongly associated with intrauterine growth restriction, iatrogenic prematurity, placental abruption, and stillbirth of the child.

Despite great efforts, the exact pathophysiology of PE remains unknown. Several different hypotheses have come to light and numerous theories have been postulated (Redman and Sargent, 2005; Sibai et al., 2005; Roberts, 2007). As yet, PE is considered to be the result of an interplay between placental factors, maternal constitution, and inadequate adaptive changes to pregnancy, predominantly involving the cardiovascular and inflammatory systems (Roberts and Lain, 2002; Redman and Sargent, 2005; Visser et al., 2007). Likewise, the clinical syndrome of PE does not always develop along the same pathophysiological pathway, and the causative complex of interacting factors may differ from patient to patient (Sibai et al., 2005; Vatten and Skjaerven, 2004). A number of specialists in the field distinguish between two types of PE: placental and maternal. The first type is considered to be a result of impaired trophoblast invasion into spiral arteries and their failure to remodel (Redman and Sargent, 2005). Narrow spiral arteries lead to placental ischemia and generate oxidative stress conditions (Roberts and Hubel, 2009; Burton and Jauniaux, 2004). Early-onset PE, developing early in the pregnancy and necessitating delivery prior to 34 weeks' gestation, is more frequently associated with this defective placentation than late-onset disease. Late-onset PE, on the other hand, is considered ‘maternal’, where an abnormal maternal response rather than an abnormal pregnancy occurs (Ness and Roberts, 1996; Redman and Sargent, 2005).

Because of the serious health consequences of PE, risk assessment and identification of women at risk early in the pregnancy remain a major challenge in prenatal care (Conde-Agudelo et al., 2004). Because widespread serum alterations are expected to precede the clinical onset of PE, there is great interest in the identification of early predictive biomarkers. A number of candidate biomarkers have been proposed for prediction of disease, including placental hormones and angiogenic factors (Kaaja, 2008; Cuckle, 2011; Kuc et al., 2011). Additionally, maternal characteristics combined with uterine artery Doppler measurements appear useful in PE risk stratification (Duckitt and Harrington, 2005; Magnussen et al., 2007; Cnossen et al., 2008; Poon et al., 2010). However, to date no serum or maternal marker (either single or in combination) has emerged with the necessary specificity and sensitivity to be of clinical use. In consequence, clinicians are so far unable to offer targeted intensified prenatal surveillance and potential preventive therapies to women at high risk. Therefore, development of a clinically relevant prenatal screening test, preferably as early as the first trimester of pregnancy, is of great importance to enable timely intervention where needed, thereby improving the outcome for mother and child as well as general prenatal care efficiency.

In a previous study at our laboratory, we used a bioinformatics approach to identify novel screening markers for Down syndrome screening by a three-step strategy, namely (1) selection of genes highly expressed in relevant tissues, (2) limiting this set to genes more specifically associated with the disorder, and (3) identifying blood-detectable markers (Pennings et al., 2009). In this study, we will apply a similarly structured strategy for identification of potential serum biomarkers for predominantly early onset PE, taking the etiological mechanism as a starting point.


Analysis of tissue-specific gene expression data

To compare gene expression data across various human tissues we used data from the human U133A/GNF1H dataset (Su et al., 2004) as available on the BioGPS website (formerly known as SymAtlas) (Wu et al., 2009). This dataset includes gcRMA normalized microarray gene expression data from 84 human tissues. For our analysis, we excluded data from cancer (cell lines) as well as testis and prostate tissues, as these were considered less relevant for screening in pregnant women. Gene expression data for the remaining 72 tissues were imported into R and used to compare the expression in placenta or endothelial cells to the expression in all other tissues. Here we used essentially the same approach as described before (Pennings et al., 2009). For each gene, the expression in placenta or endothelial cells was compared to the median expression among all other tissues, thereby providing a ratio for each gene. Next, we determined, for different ratio thresholds, how many genes had a ratio at least as large as each threshold. The resulting data distribution was assessed to determine the trend over the lower ratio stringency levels. As we described before, this trend could be approximated with a power law distribution, where a twofold increase in the threshold led to a fourfold decrease in the number of genes expressed above that threshold (Pennings et al., 2009). We refer to this trend as the nonspecific underlying trend. For higher stringencies, the number of genes began to decrease at a slower rate, indicating enrichment for tissue-specific genes over the nonspecific trend. On the basis of this finding, a threshold was determined that yielded approximately 10 times more tissue-specific genes than could be estimated based on the nonspecific underlying trend. This threshold was determined at 30 times the median gene expression level. Genes exceeding this threshold (Supplementary Table 1) were used in subsequent analysis steps.
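The thresholding procedure described above can be sketched in a few lines. This is a minimal illustration, not the R code used in the study: the function names, the input layout (a dict of per-tissue expression values per gene), and the gene counts in the demo are all hypothetical. It computes the tissue-to-median ratio, fits a power law to the counts at low thresholds, and extrapolates the nonspecific trend to a higher threshold.

```python
import math
from statistics import median


def specificity_ratios(expression, target):
    """Ratio of expression in `target` to the median over all other tissues.

    `expression` maps gene -> {tissue: value}; this layout is an assumption.
    Values are assumed to be positive, so the median is never zero.
    """
    ratios = {}
    for gene, by_tissue in expression.items():
        others = [v for t, v in by_tissue.items() if t != target]
        ratios[gene] = by_tissue[target] / median(others)
    return ratios


def count_at_least(ratios, threshold):
    """Number of genes whose ratio is at least `threshold`."""
    return sum(1 for r in ratios.values() if r >= threshold)


def fit_power_law(thresholds, counts):
    """Least-squares fit of log(count) = log(c) + k * log(threshold)."""
    xs = [math.log(t) for t in thresholds]
    ys = [math.log(n) for n in counts]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    c = math.exp(my - k * mx)
    return c, k


def expected_nonspecific(c, k, threshold):
    """Gene count predicted by the nonspecific trend at `threshold`."""
    return c * threshold ** k


# Hypothetical counts at low stringency: doubling the threshold
# divides the count by four, i.e. an exponent of about -2.
c, k = fit_power_law([2, 4, 8], [1600, 400, 100])
baseline = expected_nonspecific(c, k, 32)
```

Comparing the observed count at a candidate threshold with `baseline` gives the enrichment factor; the threshold is then chosen where this factor reaches roughly ten.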


To determine textmining associations between genes and pathogenesis-associated search terms, we used two different textmining tools, which are complementary in nature, to restrict the chance of false negatives. The first of these tools is Anni 2.1 (Jelier et al., 2008), which provides an ontology-based and thesaurus-based interface to Medline and retrieves associations for several classes of biomedical concepts (e.g. genes, drugs, and diseases). These concepts are given a concept weight, which indicates their relevance to the applied search term. The second application is PolySearch (Cheng et al., 2008), which supports information retrieval queries against several different types of text, scientific abstract, or bioinformatic databases such as PubMed, OMIM, DrugBank, SwissProt, the Human Metabolome Database, the Human Protein Reference Database, and the Genetic Association Database. The relevancy scores of the obtained genes or proteins are expressed as Z scores, that is, as standard deviations above the mean.

In our study on Down syndrome biomarkers (Pennings et al., 2009) we found that the combined use of Anni and PolySearch offers a better search performance, as these tools use different approaches to search partially different databases.

The two textmining applications were searched for genes associated with the terms ‘placentation’, ‘placental villus’, ‘chorionic villus’, ‘uterine artery’, ‘spiral artery’, ‘Doppler ultrasound/ultrasonography’, ‘pulsatility index’, and ‘blood flow velocity’. Significance criteria were based on a minimal tenfold enrichment over the statistically determined distribution of the concept weight (Anni) or over the Gaussian distribution suggested in the software documentation (PolySearch). Gene lists obtained for the search terms were combined and subsequently manually curated to resolve ambiguous or redundant gene symbols (Supplementary Table 1).
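How a tenfold enrichment over a Gaussian background might be checked can be illustrated as follows. The exact computation is not spelled out in the text, so this is one possible reading: relevancy scores are converted to Z scores, and the number of genes at or above a cutoff is compared with the number a standard normal distribution would predict. The function names and the toy scores are hypothetical, and the sketch does not reproduce the scoring used by either tool.

```python
from statistics import NormalDist, mean, pstdev


def z_scores(scores):
    """Express raw relevancy scores as standard deviations from the mean."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {gene: (s - mu) / sigma for gene, s in scores.items()}


def enrichment_over_gaussian(z, cutoff):
    """Observed genes with Z >= cutoff, divided by the Gaussian expectation."""
    observed = sum(1 for v in z.values() if v >= cutoff)
    expected = len(z) * (1.0 - NormalDist().cdf(cutoff))
    return observed / expected


# Hypothetical Z scores: 95 background genes and 5 strongly associated genes.
toy = {f"bg{i}": 0.0 for i in range(95)}
toy.update({f"hit{i}": 3.5 for i in range(5)})
ratio = enrichment_over_gaussian(toy, cutoff=3.0)
```

Under this reading, a cutoff would be accepted once `ratio` is at least 10.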

Assessing applicability for blood-based detection

To determine if putative biomarkers identified by gene expression and textmining analysis are potentially blood-detectable, they were cross-checked against two different data resources. Proteins were considered blood-detectable if they had at least one of the gene ontology (GO) annotation terms ‘extracellular region’, ‘extracellular region part’, or ‘extracellular space’; or if they were included in the human plasma proteome list. GO annotations (Ashburner et al., 2000) are partially based on computational predictions, whereas the human plasma proteome list (Anderson et al., 2004) is based on a combination of experimental methodologies. Because it has been found (Anderson et al., 2004; Pennings et al., 2009) that these resources are complementary, their results were combined (Supplementary Table 1).
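The detectability rule above amounts to a logical OR over two resources. A minimal sketch, assuming the GO annotations are available as a gene-to-terms mapping and the plasma proteome as a set of gene symbols (both data structures and the placeholder gene names are hypothetical):

```python
# The three GO terms named in the text.
EXTRACELLULAR_TERMS = {
    "extracellular region",
    "extracellular region part",
    "extracellular space",
}


def is_blood_detectable(gene, go_terms, plasma_proteome):
    """True if the gene has an extracellular GO term or is in the plasma list."""
    annotated = bool(EXTRACELLULAR_TERMS & go_terms.get(gene, set()))
    return annotated or gene in plasma_proteome


# Placeholder inputs, not real annotations.
go_terms = {
    "GENE_A": {"extracellular space", "cytoplasm"},
    "GENE_B": {"nucleus"},
    "GENE_C": {"nucleus"},
}
plasma_proteome = {"GENE_B"}
detectable = {g for g in go_terms if is_blood_detectable(g, go_terms, plasma_proteome)}
```

Combining the two resources with OR rather than AND reflects the observation that they are complementary: a gene missed by the predicted annotations can still be retained through the experimental list, and vice versa.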


Identification of tissue-specific candidate genes

Current insights into early onset PE indicate that its pathogenesis primarily occurs in the transitional area between the placenta and the endometrial spiral arteries (Burton and Jauniaux, 2004; Vatten and Skjaerven, 2004). Therefore, as a first step in our data mining approach, we identified genes with expression specific to either of these tissues. Using the BioGPS gene expression data set for 72 human tissues, we compared placenta and endothelial cells (the latter as a substitute for spiral artery tissue, which was not included as a tissue in the data set) to other tissues in the dataset. For each gene, the ratio was calculated between the expression in either of these tissues and the median expression in all other tissues. By using various stringencies, we first determined the number of false positive genes exceeding the ratio at several lower stringency levels. Next, we extrapolated the trend in these values to higher stringency levels. At a threshold of 30 times the median tissue expression, we found the number of positive genes to be ten times higher than the number of expected false positives, and therefore 90% of the positive genes for either placenta or endothelial cells can be considered to be specifically derived from that tissue rather than a statistical artifact. Using these criteria, we found 268 genes specifically expressed in placenta and 170 in endothelial cells, which when combined add up to 433 nonredundant genes (Figure 1) (the respective lists can be found in Supplementary Table 1).
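The 90% figure and the size of the overlap between the two tissue lists follow directly from the numbers reported above. As a worked check, with N_obs the observed gene count at the threshold, N_exp the count extrapolated from the nonspecific trend, and P and E the placental and endothelial gene sets (symbols introduced here for illustration only):

```latex
\frac{N_{\mathrm{obs}} - N_{\mathrm{exp}}}{N_{\mathrm{obs}}}
  = 1 - \frac{1}{10} = 0.90,
\qquad
|P \cap E| = |P| + |E| - |P \cup E| = 268 + 170 - 433 = 5
```

In other words, five genes exceed the threshold in both placenta and endothelial cells.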

Figure 1.

Graphic overview of our data mining approach and the number of markers identified per step

Applying additional relevance criteria

In the next step, we applied textmining to determine which genes are functionally associated with the mechanism of PE pathogenesis. This was done using two different textmining tools (Anni and PolySearch) to select genes associated with several PE pathogenesis terms (placentation, villi, uterine spiral arteries) or diagnostic characteristics (Doppler, pulsatility index, blood flow). This resulted in a total of 247 nonredundant genes (134 using Anni, also 134 using PolySearch) (Supplementary Table 1). As can be expected for a textmining search, genes found using both methods are mostly those that are frequently mentioned in the literature on pregnancy or placental development. Some of these genes are also found in step 1 (e.g. PAPPA, LGALS13, PGF, INHBA), whereas others are not found in step 1 because they are not sufficiently tissue specific (e.g. AFP, ERVWE1, FLT1, PPARG) (Supplementary Table 1).

We determined the overlap between the 433 tissue-specific genes identified in step 1 and the 247 functionally PE-associated genes in step 2, which resulted in a list of 52 unique genes (Figure 1, Supplementary Table 1) that met our relevance criteria.
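The first two selection steps reduce to set unions followed by an intersection. The sketch below illustrates this with placeholder gene symbols; the function name and the toy sets are hypothetical and only the set logic mirrors the text.

```python
def combine_and_overlap(placenta, endothelial, anni, polysearch):
    """Union the lists within each step, then intersect the two steps."""
    tissue_specific = placenta | endothelial   # step 1: either tissue
    pe_associated = anni | polysearch          # step 2: either textmining tool
    return {
        "tissue_specific": tissue_specific,
        "pe_associated": pe_associated,
        "overlap": tissue_specific & pe_associated,
    }


# Placeholder symbols, not real results.
result = combine_and_overlap(
    placenta={"G1", "G2", "G3"},
    endothelial={"G3", "G4"},
    anni={"G1", "G5"},
    polysearch={"G4", "G5", "G6"},
)
```

Applied to the reported counts, the same bookkeeping gives 134 + 134 - 247 = 21 genes retrieved by both textmining tools.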

Selection of blood-detectable markers

For the 52 genes selected based on the previous steps, we identified the ones potentially detectable as proteins in plasma or serum. Such detectability is a necessary prerequisite for biomarkers to be measured by immunoassays that are currently used for routine screening programs. We checked which of the 52 genes had either a GO annotation as being extracellular or were part of the experimentally derived human plasma proteome list. This led to a final list of 38 potential biomarkers detectable in human plasma or serum (Figure 1, Table 1), and as such relevant for biomarker analysis follow-up studies.

Functional analysis

To evaluate the functionality of the list, we compared our list with PE markers assessed by others (Table 1). We found that 21 of the final 38 proteins had been tested by others as potential biomarkers for PE, of which eight were specifically tested as first-trimester markers for PE (Table 1). Over-representation analysis of biological processes within the list of 38 proteins showed that the most notable GO terms were those related to hormone activity (ADM, CGA, CGB5, CSH1, CSH2, INHA, INHBA, INSL4, IGF2, PRL) and other forms of growth regulation (DLK1, DKK1, GPC3, HMOX1, HTRA1, IGFBP1, IGFBP3, PAPPA, PAPPA2, PGF, PLA2G2A, PLAU, SPP1). The enrichment is mainly found among markers for which evidence as a potential biomarker has been found. Other represented functions known to be involved at this pregnancy stage are cell adhesion molecules (COL15A1, FN1, PECAM1, VCAM1), proteases (HTRA1, PAPPA, PAPPA2, PLAU, PRSS8) and their inhibitors (GPC3, SPINT1, TFPI2, TIMP2), and the IGF pathway (IGF2, IGFBP1, IGFBP3, PAPPA, PAPPA2). Five of the seven proteins not annotated in one of the above categories (ABP1, ALPP, LGALS13, PLAC1, PSG5) are associated with the placenta and its early development, although their functions cannot be assigned to a common denominator. The remaining two proteins (CD55, HBB) are primarily known as blood-associated proteins.
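The text does not state which statistical test or tool was used for the over-representation analysis. A common choice is the hypergeometric test, sketched below as an assumption rather than as the method actually applied. The list size (38) and the number of hormone-activity genes (10) come from the text; the background size and the number of annotated background genes are hypothetical placeholders.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


# 10 of 38 listed genes relate to hormone activity (from the text).
# N and K below are hypothetical background values, for illustration only.
p_value = hypergeom_tail(k=10, n=38, K=100, N=20000)
```

A small `p_value` would indicate that the term occurs in the list more often than expected by chance; with real data, N and K would have to be taken from the annotation resource used as background.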


Pre-eclampsia is a serious disease burden to pregnant women, and therefore there is much need for a reliable early screening program that can be offered on a routine basis. Although serum biomarkers for PE have been identified, their overall prediction accuracy is not yet sufficient for implementation in a screening setting (Poon et al., 2010; Cuckle, 2011; Kuc et al., 2011). By identifying novel PE serum biomarkers and combining these – with current serum markers, uterine artery Doppler measurements, and additional maternal history – into a single prediction method, it may be possible to create a screening program with clinically relevant performance. This would allow for improved surveillance of women with high-risk pregnancies and preventive measures such as antihypertensive medication or induced delivery before clinical signs of serious complications occur.

This study was set up to identify novel potential PE markers, to subsequently be tested in biomarker discovery approaches. By combining data from different publicly available data resources into a three-step approach (Figure 1, Table 1), we identified 38 potential PE serum biomarkers (Supplementary Table 1).

Among the markers given in Table 1 are PAPPA (PAPP-A), LGALS13 (PP13), and PGF (PlGF), which to date represent the most promising biomarkers for PE (Kuc et al., 2011). The finding that these markers are also identified by our approach demonstrates that our approach has the potential to identify other promising biomarkers as well. For another 18 markers we found literature studies that examine their potential as a PE biomarker. For five of these (FN1, HMOX1, PAPPA2, PRL, VCAM1) there is moderate evidence for their use as a PE biomarker (Krauss et al., 1997; Zeisler et al., 2001; Chaiworapongsa et al., 2002; Aydin et al., 2004; Aydin et al., 2006; Leeflang et al., 2007; Eide et al., 2008; Leanos-Miranda et al., 2008; Nishizawa et al., 2008; Dane et al., 2009). These markers were informative beyond the first trimester of pregnancy or only after the clinical onset of the disease. For the other 13, the evidence was either lacking, inconsistent, or contradictory between studies.

Table 1. Identified candidate biomarkers for early onset PE

| Marker | Description | Overall potential (a) | First trimester potential (b) | Source tissue |
| --- | --- | --- | --- | --- |
| ABP1 | amiloride binding protein 1 (amine oxidase, copper-containing) | | | Placenta |
| ALPP | alkaline phosphatase, placental (Regan isozyme) | Tested (c) | | Placenta |
| CD55 | CD55 molecule, decay accelerating factor for complement | | | Placenta |
| CGA | glycoprotein hormones, alpha polypeptide | | | Placenta |
| CGB5 | chorionic gonadotropin, beta polypeptide 5 | Tested | | Placenta |
| COL15A1 | collagen, type XV, alpha 1 | | | Placenta |
| CSH1 | chorionic somatomammotropin hormone 1 (placental lactogen) | Tested | | Placenta |
| CSH2 | chorionic somatomammotropin hormone 2 (placental lactogen) | Tested | | Placenta |
| DKK1 | dickkopf homolog 1 (Xenopus laevis) | | | Placenta |
| DLK1 | delta-like 1 homolog (Drosophila) | | | Placenta |
| FN1 | fibronectin 1 | Evidence | | Placenta |
| GPC3 | glypican 3 | | | Placenta |
| HBB | hemoglobin, beta | | | Endothelial |
| HMOX1 | heme oxygenase 1 | Evidence | | Endothelial |
| HTRA1 | HtrA serine peptidase 1 | | | Placenta |
| IGF2 | insulin-like growth factor 2 (somatomedin A) | Tested | | Placenta |
| IGFBP1 | insulin-like growth factor binding protein 1 | Tested | Tested | Placenta |
| IGFBP3 | insulin-like growth factor binding protein 3 | Tested | | Placenta |
| INHA | inhibin, alpha | Tested | Tested | Placenta |
| INHBA | inhibin, beta A | Tested | Tested | Placenta |
| INSL4 | insulin-like 4 (placenta) | Tested | Tested | Placenta |
| LGALS13 | lectin, galactoside-binding, soluble, 13 (PP13) | Strong evidence | Strong evidence | Placenta |
| PAPPA | pregnancy-associated plasma protein A, pappalysin 1 | Strong evidence | Strong evidence | Placenta |
| PAPPA2 | pappalysin 2 | Evidence | | Placenta |
| PECAM1 | platelet/endothelial cell adhesion molecule | Tested | | Placenta |
| PGF | placental growth factor (PlGF) | Strong evidence | Strong evidence | Placenta |
| PLA2G2A | phospholipase A2, group IIA (platelets, synovial fluid) | | | Placenta |
| PLAC1 | placenta-specific 1 | | | Placenta |
| PLAU | plasminogen activator, urokinase | | | Placenta |
| PRL | prolactin | Evidence | | Placenta |
| PRSS8 | protease, serine, 8 | | | Placenta |
| PSG5 | pregnancy specific beta-1-glycoprotein 5 | | | Placenta |
| SPINT1 | serine peptidase inhibitor, Kunitz type 1 | | | Placenta |
| SPP1 | secreted phosphoprotein 1 (osteopontin) | Tested | | Placenta |
| TFPI2 | tissue factor pathway inhibitor 2 | | | Placenta |
| TIMP2 | TIMP metallopeptidase inhibitor 2 | | | Placenta |
| VCAM1 | vascular cell adhesion molecule 1 | Evidence | | Endothelial |

a. Potential PE markers described as examined in the literature, with their indicated level of evidence.
b. Potential PE markers also examined in the first trimester, with their indicated level of evidence.
c. The word ‘Tested’ is used here to denote proteins which have been tested as PE biomarker but for which (to date) insufficient evidence has been found.

In the context of first-trimester screening, it is important whether the selected markers can already be used this early in pregnancy to distinguish between women at risk of developing PE and those with healthy pregnancies. So far, most of the markers have been shown to be distinctive in the third trimester, after the onset of the disease, and as such are of little use in preventing PE development by intervention strategies. Only eight of the markers described in the literature were tested in the first trimester (Table 1). It may be assumed that the marker profiles of women with clinically confirmed PE are different from the profiles of healthy women. Therefore, it is of major importance that the selected markers also have the potential to differentiate between healthy individuals and women at risk even before the onset of the disease. Larger prospective or case-control studies will be needed to provide an answer to this matter.

Among the 38 proteins in Table 1, several overrepresentations of biological processes can be observed. The most notable of these are GO terms related to hormone activity and other forms of growth regulation. The enrichment for these functional terms can be attributed to early development of the placenta, which involves the production of growth factors, hormones, and metalloproteases (Koster et al., 2010). The enrichment is mainly found among markers for which evidence for usage as a potential biomarker already exists. This suggests that markers that have not been tested yet but have a similar function might deserve priority in the following experimental stages of biomarker identification. More specifically, this would lead to prioritization of HTRA1 and DKK1. Indeed, it has recently been found that down-regulation of total HTRA1 can be correlated to placental (i.e. early onset) as opposed to maternal (i.e. late onset) PE (Lorenzi et al., 2009), making it a very interesting candidate marker. Also, aberrant expression of DKK1 has been associated with impaired embryonic attachment and implantation (Liu et al., 2010).

Among the 17 genes that have not yet been tested as candidate biomarkers for PE, functional enrichment was strongest for proteases (HTRA1, PLAU, PRSS8) or protease inhibitors (GPC3, SPINT1, TFPI2, TIMP2). Such proteins are involved in early placental development through their function in tissue remodeling, but also by regulating concentrations of IGF binding proteins and thereby IGF levels (Koster et al., 2010). This latter aspect might especially be relevant for prioritizing these markers, as PAPPA is a well-known example of a protease acting upon IGFBP4 (Lawrence et al., 1999). Given this, the further interest in the IGF pathway among previously tested markers (Table 1), and the knowledge that HTRA1 also acts upon IGF binding proteins (Zumbrunn and Trueb, 1996), HTRA1 might also deserve future priority from this functional point of view.

The importance of the placenta in the etiology of PE is further underlined by the finding that 35 out of the 38 markers in Table 1 are of placental origin, with only HMOX1, VCAM1 and HBB being of endothelial origin. The predominance of placental genes among the identified markers suggests that this tissue might outperform endothelial cells as a relevant source for novel PE biomarkers. To some extent, this can be explained by the fact that when compared to placenta, endothelial tissue has less specific (literature) association with pregnancy. This is reflected in the finding that the textmining step keeps more placental genes (47 out of 268) than endothelial (5 out of 170) genes. However, both HMOX1 and VCAM1 have been described as markers for PE, which suggests that genes of endothelial origin are nevertheless interesting enough to warrant further study.

One marker with endothelial origin, hemoglobin beta (HBB), meets the criteria used in our approach. Interestingly, fetal hemoglobin (containing alpha and gamma chains) has recently been described as a promising PE biomarker (Olsson et al., 2010; Dolberg Anderson et al., 2011). However, this has not been described for the (maternal) HBB chain. Additionally, it can be assumed that HBB concentrations in plasma or serum will be affected by variations in erythrocyte lysis during sample handling. Therefore, we expect that this particular marker might not be very valuable in clinical practice. This might serve as an illustration of how data mining can identify potential markers, yet additional clinical background experience can help in giving lower (HBB) or higher (HTRA1) priority for further clinical testing.

To summarize, we have used integrative data mining to identify a set of 38 candidate early PE screening biomarkers. Among the list are three markers that have been validated as first-trimester clinical PE biomarkers (strong evidence). Five markers have literature evidence; however, they have not yet been tested in the first trimester. Thirteen other markers have been examined in other studies, of which five in the first trimester, leaving approximately half of the markers in Table 1 still interesting and open for further examination as potential biomarkers. Given the number of confirmed biomarkers among those that have been examined, and taking into account that the first two gene selection steps in our approach are both based on a minimally tenfold enrichment over the background, we expect there will be several novel, useful PE biomarkers among those that have not yet been examined. Additional case-control serum analysis experiments will be necessary, and will be initiated by us, to determine which of these candidate biomarkers have differential serum levels in PE versus normal pregnancies as early as the first trimester. Moreover, before a set of serum biomarkers can be combined with parameters such as maternal history or uterine Doppler measurements in a risk stratification algorithm, larger cohort studies need to be performed. These are necessary to further determine how these markers interrelate and whether sufficiently reliable prediction accuracy can be obtained before a large-scale PE screening program can be introduced. Such validation experiments with the PE screening biomarkers reported in this manuscript will be the subject of forthcoming research.