Transcriptional Hallmarks of Noonan Syndrome and Noonan-Like Syndrome with Loose Anagen Hair

Noonan syndrome (NS) is among the most common nonchromosomal disorders affecting development and growth. NS is genetically heterogeneous, being caused by germline mutations affecting various genes implicated in the RAS signaling network. This network transduces extracellular signals into intracellular biochemical and transcriptional responses controlling cell proliferation, differentiation, metabolism, and senescence. To explore the transcriptional consequences of NS-causing mutations, we performed global mRNA expression profiling on peripheral blood mononuclear cells obtained from 23 NS patients carrying heterozygous mutations in PTPN11 or SOS1. Gene expression profiling was also resolved in five subjects with Noonan-like syndrome with loose anagen hair (NS/LAH), a condition clinically related to NS and caused by an invariant mutation in SHOC2. Robust transcriptional signatures were found to specifically discriminate each of the three mutation groups from 21 age- and sex-matched controls. Despite the only partial overlap in terms of gene composition, the three signatures showed a notable concordance in terms of biological processes and regulatory circuits affected. These data establish expression profiling of peripheral blood mononuclear cells as a powerful tool to appreciate differential perturbations driven by germline mutations of transducers involved in RAS signaling and to dissect molecular mechanisms underlying NS and other RASopathies. Hum Mutat 33:703–709, 2012. © 2012 Wiley Periodicals, Inc.


I. Microarray data processing and filtering
a. Scaling, Log2 transformation and detection filtering.
Cubic spline-normalized probe signals, together with detection p-values, were obtained using the Illumina BeadStudio 3.1 software with background subtraction. To avoid negative values, the fixed amount of 50 was added back to each data point prior to log 2 transformation. Subsequent data filtering and analysis was carried out with Excel (Microsoft). To the 20584 probes analyzed, we applied a statistical filter to select genes with reliable signal detection, selecting genes with a detection value (as provided by BeadStudio) greater than 0.999 in at least 5 samples.
b. Removal of genes correlated with age, sex, or differential leukocyte count.
To filter out genes correlated to basic clinical features, we collected information on gender, age (years), white blood cells and lymphocyte counts for the NS samples presented in the Table below.  For each detected gene, we calculated the Pearson correlation with each of the four clinical variables. Probes displaying a Pearson correlation higher than 0.5 with any of the clinical variables were removed from subsequent analyses, which left a total of 5605 probes employed for further analysis.

c. Log 2 ratio transformation
For each probe, the log 2 signal in each sample was converted into log 2 ratio against global average expression of that probe in all samples. To avoid distortions due to the different sample sizes (from n = 5 for SHOC2 samples to n = 21 for controls), the global average was calculated by first averaging the log 2 signal within each group (CTRL, PTPN11, SOS1 and SHOC2) and then by further averaging these four means. The log 2 ratio was then calculated by subtracting log 2 signal of the single sample from the global average.

II. Microarray data analysis a. Selection of genes differentially expressed between controls and mutated groups.
To be defined as differential between controls and mutated groups, a probe Log 2 ratio signal had to display significance in three basic tests: (i) log 2 ratio between the means in control and mutated group greater than 0.5 (positive or negative); (ii) independent samples, two-way T-test p-value lower than 0.01; (iii) signal-to-noise ratio greater than 0.5 (positive or negative), according to the formula described by Golub and colleagues (Science 286:531-537, 1999).

b. Montecarlo simulation for FDR estimation.
To estimate the False Discovery Rate, i.e. the fraction of false positives expected for the four signatures described in the main text, 2000 orthogonal random permutations of the samples were performed, whereby for each cycle of permutation the three statistical filters employed for genes selection we reapplied, recording the number of significant probes for each of the four signatures. For each signature, median and mean false positives were then compared with true positives as illustrated in Supp. Table  S3.
c. Weighted average score for full leave-one-out classification analysis.
After removing a sample from the dataset and redefining the four signatures, the left-out sample was classified by a weighted average score, obtained by subtracting average log2 ratios of probes with higher signal in control samples from average log2 ratio of probes with higher signal in mutated samples.

III. Functional data mining a. Functional annotation analysis
To define functional annotation keyword enrichment in the mutation-specific signatures, we uploaded the lists of Illumina probes belonging to the PTPN11, SOS1 and SHOC2 signatures on the David Ease web-based annotation tool (http://david.abcc.ncifcrf.gov/). Three types of lists were uploaded for each signature: (i) all probes, both up-and down-regulated in mutated samples; (ii) only probes up-regulated in mutated samples; (iii) only probes down-regulated in mutated samples. As a background list, we employed the 5605 probes passing filtering criteria mentioned in the Results section. Subsequently the various lists underwent "Functional Annotation Clustering" analysis, returning Benjamini corrected enrichment p-values for functional clusters. The output of this analysis was filtered to include only results with corrected p-values of less than 0.01.

b. Kinase Enrichment analysis
To identify substrates of kinases in the signatures, lists of gene symbols corresponding to Illumina probes of the PTPN11, SOS1 and SHOC2 signatures were uploaded on the web-based Kinase Enrichment Analysis tool (http://amp.pharm.mssm.edu/lib/kea.jsp). The output of this analysis was filtered to include only results with kinase enrichment p-values of less than 0.01.

c. Protein-protein interactions analysis
To check the signatures for enrichment in protein-protein interactions, we employed two web-based tools, Gather (http://gather.genome.duke.edu/) and Gene2Network (http://actin.pharm.mssm.edu/genes2networks/). The first employs data from two large-scale proteinprotein interaction studies in humans, and the second integrates multiple small-scale, high-quality protein-protein interaction datasets. Lists of gene symbols corresponding to Illumina probes of the PTPN11, SOS1 and SHOC2 signatures were uploaded on the two websites and "in silico" proteinprotein interaction analysis was performed. To achieve complete independence between the two analyses, Gene2Network analysis was conducted excluding the "Vidal and Stelzl" dataset, which is by default employed by Gather. Addictional parameters of Gene2Network analysis were the following: (i) max path length threshold =2; (ii) significance cut-off =3; (iii) max interaction for reference=1; (iv) min reference per interaction=no filter.

d. Binding site analysis and transcriptional circuits
To analyze transcription factor activity, a curated, non-redundant list of transcription factors (TFs) was extracted from the JASPAR CORE database (http://jaspar.genereg.net/). This list of TFs was then filtered to maintain only those TFs with available binding site data in the Opossum tool (http://www.cisreg.ca/cgi-bin/oPOSSUM/opossum). Eleven of these TFs were found in one or more of the three signatures: GFI1 (PTPN11 and SOS1 signatures), GABPA (SOS1 and SHOC2 signatures), and nine TFs only in the SHOC2 signature (CEBPA,CREB1,ELK1,ELK4,MAX,NR3C1,SP1,STAT1 and ZNF395). For each of these eleven TFs, we verified, if the respective PTPN11, SOS1 or SHOC2 signature was enriched in genes containing one or more specific TF binding sites. To this aim, we employed the "Custom binding site analysis" of Opossum, uploading lists of gene symbols corresponding to Illumina probes of the PTPN11, SOS1 or SHOC2 signatures, subdivided in upregulated and down-regulated in mutated samples. As a background list, we employed the 5605 probes passing filtering criteria mentioned in the Results section. The amount of upstream / downstream sequence was set to (2000/0) and Matrix match threshold to 80%. We considered significant Opossum Z-Scores greater than 5 and p-values lower than 0.05. Putative transcriptional circuits highlighted by Opossum and including a TF and its enriched targets in the same signature were consolidated by a complementary procedure, that involved for each given TF: (i) retrieval, using Opossum, of all genes containing binding sites for that TF (putative targets), in the 5605-gene background list and in the signature(s) containing the TF; (ii) hypergeometric distribution analysis of enrichment of putative target genes in the signature vs. background, with subsequent p-value calculation. Results of both analyses are shown in Supp. Table S5.

Supp. Figures
Supp. Figure S1. Expression levels in human PBMCs of RAS/MAPK pathway genes mutated in RASopathies. Histograms represent average expression signal in 127 human PBMC samples (Burczynski et all, J Mol Diagn. 2006 Supp. Figure S3. Experimental workflow for PBMC purification, expression profiling, and data filtering. The workflow is subdivided in three main parts: blood processing (light blue), RNA extraction and gene expression profiling (pink), and data processing for gene filtering (light green).