Robust data‐driven identification of risk factors and their interactions: A simulation and a study of parental and demographic risk factors for schizophrenia

Abstract Objectives Few interactions between risk factors for schizophrenia have been replicated, but fitting all such interactions is difficult due to high‐dimensionality. Our aims are to examine significant main and interaction effects for schizophrenia and the performance of our approach using simulated data. Methods We apply the machine learning technique elastic net to a high‐dimensional logistic regression model to produce a sparse set of predictors, and then assess the significance of odds ratios (OR) with Bonferroni‐corrected p‐values and confidence intervals (CI). We introduce a simulation model that resembles a Finnish nested case–control study of schizophrenia which uses national registers to identify cases (n = 1,468) and controls (n = 2,975). The predictors include nine sociodemographic factors and all interactions (31 predictors). Results In the simulation, interactions with OR = 3 and prevalence = 4% were identified with <5% false positive rate and ≥80% power. None of the studied interactions were significantly associated with schizophrenia, but main effects of parental psychosis (OR = 5.2, CI 2.9–9.7; p < .001), urbanicity (1.3, 1.1–1.7; p = .001), and paternal age ≥35 (1.3, 1.004–1.6; p = .04) were significant. Conclusions We have provided an analytic pipeline for data‐driven identification of main and interaction effects in case–control data. We identified highly replicated main effects for schizophrenia, but no interactions.

The search resulted in 209 studies. All the titles and abstracts were screened for relevance and full texts were acquired for 64 studies. Among these, 22 studies were prospective, reported on interactions other than gene by environment interactions (reviewed by (Misiak et al., 2018), reported on risk factors available on the individual level and studied diagnosed schizophrenia or non-affective psychoses as outcomes. No restrictions on the methodology to assess interactions were applied. After assessing the full-texts of the 22 studies, two additional studies were identified that fulfilled the objective of the search and that were published between 1998 and 2017 (Harrison et al., 2003;van Os, Hanssen, Bak, Bijl, & Vollebergh, 2003). In total, 24 studies on 35 interactions were summarized in Table 1.

Predictor definitions
Detailed information about the definition of main effects is shown in Supplemental Parental low education 0 = either parent has post-secondary school education, 1 = both parents no post-secondary school education Parental psychosis 0 = no, 1 = yes (ICD-10 codes F20-25, F28-29; ICD-9 codes 295,297,298.9X,301.2C;297,298.20,298.30,298.99,299) Mother 19 or younger 0 = maternal age 20 or older at time of birth of study subject, 1 = maternal age 19 or younger at time of birth of study subject

Detailed rationale for defining the simulation study
When conducting the simulation study, we aimed at resemblance of epidemiologic data by including groups of variables that were correlated, variables with different prevalence, and both active main and interaction effects. Further, a general assumption is that machine learning techniques can capture complex relationships between variables-such as interactions-with a minimum of a priori restrictions. Therefore, we included an active interaction effect constituted from variables that did not include active main effects, i.e., we applied no hierarchy restriction stipulating that an interaction effect of two variables can only be included in a model if the main effects of the variables are important. Finally, we aimed to optimize the interpretability of the simulation study by setting the active effects from variables with the same prevalence (e.g. all active main effects had the same prevalence) and by not setting multiple active interaction effects taken from the same variables (e.g. if A2 ✕ B2 is set active, B2 ✕ C2 should not be set active, because the variable B2 is already included in the other active interaction).

Simulation study for defining data preprocessing and analyses
We performed a simulation study as described in 'Analytic pipeline' in the manuscript, but with varying criteria that defined the maximum absolute log(odds ratio [OR]) between main effects for an interaction to be included in the analyses. Based on 10 000 simulations for each criterion, we calculated the empirical true positive (TP) and false positive (FP) rate for identifying main and interaction effects when using criterion of log(OR) at 0.1, 0.3 and 0.5 (30 000 simulations in total). As shown in Supplemental Table 2, the FP rate remained <5% regardless of the criterion of log(OR). However, the power, i.e. the TP rate, to detect interaction effects was 37.2% when using a log(OR)-criterion of 0.1 compared to 46.7% for a criterion of log(OR) at 0.3 and 45.7% for a criterion of log(OR) at 0.5. Using a strict approach, we opted to use a criterion of log(OR) at 0.3 for the analyses reported in the main text.
As further shown in Supplemental Table 2, we performed a simulation study with varying tuning parameter alpha in the elastic net analyses while keeping the criterion of log(OR) constant at 0.3. This analysis was also based on 10 000 simulations for each value of alpha (0.5, 0.75 and 1). Both the FP and TP rates were similar regardless of which value for the tuning parameter alpha was used. We opted to use an alpha value at 0.75, because the simulations with an alpha value at 0.75 showed the highest TP for detecting interaction effects. A further reason to use an alpha value of 0.75 instead of 1 (i.e. LASSO regression) was based on theory and the literature showing that elastic net analyses with alpha values < 1 perform better in correlated data (Zou & Hastie, 2005).

Simulation study with three active interactions
While the rationale of the primary simulation study reported in the manuscript included several aspects (see 'Detailed rationale for defining the simulation study' above) that were not compatible with setting multiple active interactions of same prevalence in datasets of nine variables, we also conducted an additional simulation study with multiple active interactions but with altered definitions for the simulated datasets. To optimise interpretability and to not set interaction effects taken from variables that were active in other main or interaction effects, we defined simulated datasets as follows: among 4500 subjects, we set the prevalence at 15% for all nine variables (A1, A2 … C3); we set a within-group correlation of 0.3 in the three groups of variables (A, B, C); we set no active main effects; and we set three active interaction effects with an exchangeable OR at 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5 and 5.0 (A1 ✕ B1; A2 ✕ C2; B3 ✕ C3); these three active interactions were set to have the same OR in each simulation and were taken between variables in uncorrelated groups having prevalence of 15%, so the interactions had a prevalence of 2.3%.

Marginal screening
We carried out additional simulations to test whether variable selection using elastic net followed by multivariate logistic regression was superior to marginal screening (without prior variable selection). We performed preprocessing of the data as described in 'Analytic pipeline' in the manuscript, but we tested each predictor in a separate model with casecontrol status as the outcome; models of main effects consisted of only the main effect variable as dependent variable (e.g. formula 'y ~ x1' for testing main effect 'x1') while models of interaction effects consisted the interaction and the two main effect variables included in the interaction (e.g. formula 'y ~ x1 + x2 + x1*x2' for testing interaction 'x1*x2'). We report both results that were and were not Bonferroni-corrected.

Simulation study with three active interactions
The FP rates were <5% (Supplemental Table 3) as in the simulation reported in the manuscript. However, the power to detect interactions in these simulated datasets without active main effects were higher than in the simulation in the manuscript which did include three active main effects. For example, to detect interactions with 2.3% prevalence and with >80% power, the OR of the active interaction had to be 5.0 in the simulation reported in the manuscript ( Table 2 in the manuscript), while at least one, two or all three active interactions with the same prevalence could be detected at OR=2.0, OR=2.5 and OR=3.5, respectively, in the datasets with no main effects and three active interactions (Supplemental Table 3).

Simulation study using marginal screening
When we carried out marginal screening without Bonferroni-correction, the range of the FP rates were between 47.1% and 91.1% (Supplemental Table 4). When we applied Bonferroni-correction, the TP and FP rates for identifying interactions using marginal screening (Supplemental Table 5) was similar to the corresponding rates using elastic net analyses followed by multivariate logistic regression ( Table 2 in the manuscript). However, the FP rate of identifying main effects ranged between 3.6% and 71.1% for all definitions of simulated datasets (Supplemental Table 5).

Descriptive results of FiPS-S
The frequencies and expected counts of all predictors are shown in Supplemental Table 6.
To examine whether the interactions that have been reported in previous studies were roughly consistent with our data, we made additional exploratory analyses of the interactions between parental psychosis and urbanicity, parental psychosis and father 35 or older, and born in winter months and urbanicity. These analyses were performed with marginal screening without Bonferroni-correction and are reported in Supplemental Table 7.
Supplemental Table 2. The true positive (TP) and false positive (FP) rates of identifying main and interactions effects in 10 000 simulated datasets for different data preprocessing criteria and elastic net tuning parameters when using elastic net variable selection and Bonferronicorrected multivariate logistic regression. Abbreviations: TP, true positive; FP, false positive. Note: The simulation datasets were in all analyses defined with 4500 subjects, 2.3% prevalence of the active interaction, OR=3 of the active interaction and a 0.3 within-group correlation. a The TP rate of identifying main effects was defined as the proportion of simulations in which at least one of the three active main effects were correctly identified. b The FP rate of identifying main effects was defined as the proportion of simulations in which at least one of the non-active main effects were incorrectly identified. c The TP rate of identifying interaction effects was defined as the proportion of simulations in which the one active interaction effect was correctly identified. d The FP rate of identifying interaction effects was defined as the proportion of simulations in which at least one of the non-active interactions effects were incorrectly identified.
Supplemental Table 3. The true positive (TP) and false positive (FP) rates of identifying interaction effects in 10 000 simulated datasets using elastic net variable selection and Bonferroni-corrected multivariate logistic regression. The simulated datasets were set to have 4500 subjects, a prevalence of 15% among all variables, no active main effects, a within-group correlation of 0.3 and an exchangeable OR of the three active interactions that were taken between variables in uncorrelated groups. All three active interactions had the same OR in each simulation. a The proportion of simulations in which at least one (non-active) main effect were incorrectly identified. b The proportion of simulations in which at least one of the non-active interactions effects were incorrectly identified. c The proportion of simulations in which at least one of the three active interaction effects were correctly identified. d The proportion of simulations in which at least two of the three active interaction effects were correctly identified. e The proportion of simulations in which all three active interaction effects were correctly identified.
7 Supplemental Table 4. The true positive (TP) and false positive (FP) rates of identifying main and interactions effects in 10 000 simulated datasets using marginal screening without Bonferroni-correction. The TP rate of identifying main effects was defined as the proportion of simulations in which at least one of the three active main effects were correctly identified; b The FP rate of identifying main effects was defined as the proportion of simulations in which at least one of the non-active main effects were incorrectly identified; c The TP rate of identifying interaction effects was defined as the proportion of simulations in which the one active interaction effect was correctly identified; d The FP rate of identifying interaction effects was defined as the proportion of simulations in which at least one of the non-active interactions effects were incorrectly identified. Table 5. The true positive (TP) and false positive (FP) rates of identifying main and interactions effects in 10 000 simulated datasets using Bonferroni-corrected marginal screening. 9 Abbreviations: TP, true positive; FP, false positive. Note: Simulated datasets with both the TP rate ≥ 80 % and the FP rate < 5 % are shown in bold. a The TP rate of identifying main effects was defined as the proportion of simulations in which at least one of the three active main effects were correctly identified; b The FP rate of identifying main effects was defined as the proportion of simulations in which at least one of the non-active main effects were incorrectly identified; c The TP rate of identifying interaction effects was defined as the proportion of simulations in which the one active interaction effect was correctly identified; d The FP rate of identifying interaction effects was defined as the proportion of simulations in which at least one of the non-active interactions effects were incorrectly identified.

Simulate data
Use the dd_sim-function to produce simulation data.