Improving nonhomogeneous dynamic Bayesian networks with sequentially coupled parameters

In systems biology, nonhomogeneous dynamic Bayesian networks (NH‐DBNs) have become a popular modeling tool for reconstructing cellular regulatory networks from postgenomic data. In this paper, we focus our attention on NH‐DBNs that are based on Bayesian piecewise linear regression models. The new NH‐DBN model, proposed here, is a generalization of an earlier proposed model with sequentially coupled network interaction parameters. Unlike the original model, our novel model possesses segment‐specific coupling parameters, so that the coupling strengths between parameters can vary over time. Thereby, to avoid model overflexibility and to allow for some information exchange among time segments, we globally couple the segment‐specific coupling (strength) parameters by a hyperprior. Our empirical results on synthetic and on real biological network data show that the new model yields better network reconstruction accuracies than the original model.


INTRODUCTION
Dynamic Bayesian networks (DBNs) are a popular class of models for learning the dependencies between random variables from temporal data. In DBNs, a dependency between two variables X and Y is typically interpreted in terms of a regulatory interaction with a time delay. A directed edge from variable X to variable Y, symbolically X → Y, indicates that the value of variable Y at any time point t depends on the realization of X at the previous time point t − 1. Typically, various variables X 1 , … , X k have a regulatory effect on Y, and the relationship between X 1 , … , X k and Y can be represented by a regression model that takes the time delay into account. For example, if the time delay is one time point, t = 0 + 1 x 1,t−1 + · · · + k x k,t−1 + u t (t = 2, … , T), where T is the number of time points, y t is the value of Y at time point t, x i,t−1 is the value of covariate X i at time point t − 1, 0 , … , k are regression coefficients, and u t is the "unexplained" noise at time point t.
In DBN applications, there are N domain variables Y 1 , … , Y N , and the goal is to infer the covariates of each variable Y i . As the covariates can be learned for each Y i separately, DBN learning can be thought of as learning the covariates for a set of target variables {Y 1 , … , Y N }. There are N regression tasks, and in the ith regression model, Y i is the target variable, and the remaining N − 1 variables take the role of the potential covariates. The goal is to infer a covariate set i ⊂ From the covariate sets 1 , … , N , a network can be extracted. The network shows all regulatory interactions among the variables Y 1 , … , Y N . An edge Y j → Y i indicates that Y j is a covariate of Y i , that is, Y j ∈ i . In the terminology of DBNs, Y j is then called a regulator of Y i . All variables in i are regulators of Y i (i = 1, … , N).
The traditional (homogeneous) DBN is based on the assumption that the regression coefficients 0 , … , K stay constant across all time points (t = 2, … , T). Especially in systems biology, where one important application is to learn gene regulatory networks from gene expression time series, this assumption is often not appropriate. Many gene regulatory processes are subject to temporal changes, for example, implied by cellular or experimental conditions. It has therefore been proposed by Lébre, Becq, Devaux, Lelandais, and Stumpf (2010) to replace the linear model by a piecewise linear regression model, where the data points are divided into H temporal segments (by changepoints). The data points within each segment h are modeled by a linear model with segment-specific regression coefficients h,0 , … , h,K (h = 1, … , H). If the segmentation is not known, it can be inferred from the data (e.g., by employing a changepoint process). Replacing the linear model by a piecewise linear model yields a nonhomogeneous DBN (NH-DBN) model.
The problem of the piecewise linear regression approach is that the regression coefficients have to be learned separately for each segment. As the available time series are usually rather short, this approach comes with inflated inference uncertainties. A short time series is divided into shorter segments, and for each of the segments, the regression coefficients have to be learned. To avoid model overflexibility, two parameter coupling schemes were proposed: The segment-specific regression coefficients can be coupled globally (Grzegorczyk & Husmeier, 2013) or sequentially (Grzegorczyk & Husmeier, 2012). In this paper, we put our focus on the sequential coupling scheme and show how it can be improved. In the model of Grzegorczyk and Husmeier (2012), the coefficients of any segment are encouraged to be similar to those of the previous segment by setting the prior expectations of the coefficients in segment h+1 to the posterior expectations of the coefficients from segment h. A coupling parameter c regulates the coupling strength (i.e., the similarity of the regression coefficients). A drawback of the model is that all pairs of neighboring segments (h − 1, h) are coupled with the same coupling parameter c and thus with the same strength.
We present a new and improved sequentially coupled model that addresses this bottleneck. Our generalized sequentially coupled model possesses segment-specific coupling parameters. For each pair of neighboring segments (h − 1, h), there is then a segment-specific continuous coupling strength h (h = 2, … , H). The segment-specific coupling parameter h has to be inferred from the data points within segment h (h = 1, … , H). Because some segments might be rather short and uninformative, we impose a hyperprior onto the second hyperparameter of the coupling parameter prior. This allows for information exchange among the segment-specific coupling strengths 2 , … , H .

MATHEMATICAL DETAILS
We consider piecewise linear regression models where the random variable Y is the target and the random variables X 1 , … , X k are the covariates. We assume that T data points  1 , … ,  T are available and that the subscript index t ∈ {1, … , T} refers to T equidistant time points. Each data point  t contains a value of the target Y and the corresponding values of the k covariates. We assume further that the T data points are allocated to H disjunct segments, h ∈ {1, … , H}. Segment h contains T h consecutive data points with ∑ H h=1 T h = T. Within each individual segment h, we apply a Bayesian linear regression model with a segment-specific regression coefficient vector Let y h be the target vector of length T h and let X h be the T h -by-(k + 1) design matrix for segment h, which includes a first column of ones for the intercept. For each segment h = 1, … , H, we assume a Gaussian likelihood, where I denotes the T h -by-T h identity matrix, and 2 is the noise variance parameter, which is shared among segments. We impose an inverse gamma prior on 2 , −2 ∼GAM( , ). In the forthcoming subsections, we will present different model instantiations with different prior distributions for the segment-specific regression coefficient vectors h (h = 1, … , H).

The original sequential coupling scheme
In the sequentially coupled piecewise linear regression model, proposed by Grzegorczyk and Husmeier (2012), it is assumed that the regression coefficient vectors h have the following Gaussian prior distributions: where 0 is the zero vector of length k+1, I denotes the (k+1)-by-(k+1) identity matrix, and̃h −1 is the posterior expectation of h−1 . That is, only the first segment h = 1 gets an uninformative prior expectation, namely the zero vector, whereas the subsequent segments h > 1 obtain informative prior expectations, stemming from the preceding segment h − 1. We follow Grzegorczyk and Husmeier (2012) and refer to u ∈ R + as the signal-to-noise ratio parameter and to c ∈ R + as the coupling parameter. A low (high) signal-to-noise ratio parameter u yields a peaked (vague) prior for h = 1 in Equation (2), and thus, the distribution of 1 is peaked (diffuse) around the zero vector. A low (high) coupling parameter c yields a peaked (vague) prior for h > 1 in Equation (2) and thus a strong (weak) coupling of h to the posterior expectatioñh −1 from the preceding segment. We note that reemploying the variance parameter 2 in Equation (2) (10).
The posterior distribution of h (h = 1, … , H) can be computed in closed form (Grzegorczyk & Husmeier, 2012), as follows: and the posterior expectations in Equation (3) are the prior expectations used in Equation (2), as follows:̃h Grzegorczyk and Husmeier (2012) assigned inverse gamma priors to the parameters u and c , as follows: The fully sequentially coupled model is then fully specified, and we will refer to it as the  0,0 model. A graphical model representation for the relationships in the first segment h = 1 is provided in Figure 1, whereas Figure 2 shows a graphical model representation for the segments h > 1. The posterior distribution of the  0,0 model fulfills FIGURE 1 Graphical model of the probabilistic relationships in the first segment, h = 1. Parameters that have to be inferred are represented by white circles. The observed data (y 1 and X 1 ) and the fixed hyperparameters are represented by gray circles. All nodes in the plate are specific for the first segment. The posterior expectatioñ1 is computed and then treated like a fixed hyperparameter vector when used as input for segment h = 2 FIGURE 2 Graphical model of the probabilistic relationships within and between segments h > 1 for the  0,0 model from Grzegorczyk and Husmeier (2012). Parameters that have to be inferred are represented by white circles. The observed data and the fixed hyperparameters are represented by gray circles. All nodes in the plate are specific for segment h. The posterior expectatioñh −1 of the regression coefficient vector from the previous segment h − 1 is treated like a fixed hyperparameter vector. The posterior expectatioñh is computed and forwarded as a fixed hyperparameter vector to the subsequent segment h + 1 Like the regression coefficient vectors h , whose full conditional distributions have been provided in Equation (3), the parameters u and c can also be sampled from their full conditional distributions.
For the parameter 2 , a collapsed Gibbs sampling step, with h (h = 1, … , H) integrated out, can be used, as follows: wherẽ0 ∶= 0 and C h ∶= For the marginal likelihood, with h (h = 1, … , H) and 2 integrated out, the marginalization rule from section 2.3.7 of Bishop (2006) can be applied: where the matrices C h were defined below Equation (9). For the derivations of the full conditional distributions in Equations 8 and 9 and the marginal likelihood in Equation (10), we refer to Grzegorczyk and Husmeier (2012).

The improved sequential coupling scheme proposed here
We propose to generalize the sequentially coupled model from Section 2.1 by introducing segment-specific coupling parameters h (h = 2, … , H). This yields the new prior distributions, wherẽh −1 is again the posterior expectation of h−1 . For notational convenience, we now introduce two new definitions, namely 1 ∶= u and̃0 ∶= 0. We can then compactly write: For We show in the next subsection that̃h −1 , defined in Equation (12), is the posterior expectation of h−1 ; compare Equation (14). For the parameter u , we reuse the inverse gamma prior with hyperparameters u and u . For the first segment h = 1, we thus have the same probabilistic relationships for the original model; compare the graphical model representation in Figure 1. For the parameters h , we assume that they are inverse gamma distributed, −1 h ∼ GAM( c , c ), where c is fixed and c is a free hyperparameter. A free c allows for information exchange among the segments h = 2, … , H. We impose a gamma hyperprior onto c , symbolically c ∼GAM(a,b).
We refer to the improved model as the  1,1 model. A graphical model representation of the relationships within and between segments h > 1 is provided in Figure 3. The posterior of the  1,1 model is

Full conditional distributions of the improved sequentially coupled model
In this subsection, we derive the full conditional distributions for the  1,1 model, proposed in Section 2.2. For the derivations, we exploit that the full conditional densities are proportional to the posterior density and thus proportional to the factorized joint density in Equation (13  Unlike the original  0,0 model, whose graphical model is shown in Figure 2, the  1,1 model has a specific coupling parameter h for each segment h > 1. Furthermore, there is a new gamma hyperprior onto the second parameter of the inverse gamma prior on h . The hyperprior allows for information exchange among the segment-specific coupling parameters 2 , … , H The full conditional distribution of h (h = 1, … , H) can be derived as follows. With 1 ∶= u and̃0 ∶= 0, we have and from the shape of the latter density, it follows for the following full conditional distribution: We note that the posterior expectation in Equation (14) is identical to the one we used in Equation (12).
For the full conditional distributions of the segment-specific coupling parameters h (h = 2, … , H), we get and it follows from the shape of the following full conditional density: ) . (15) For the full conditional distribution of u , we get in a similar way The full conditional density has the shape of the inverse gamma distribution in Equation (8), that is, the following full conditional distribution of u stays unchanged: The new hyperparameter c can also be sampled from its full conditional distribution. The shape of For the noise variance parameter 2 , we follow Grzegorczyk and Husmeier (2012) and implement a collapsed Gibbs sampling step (with the h integrated out). We have A standard rule for Gaussian integrals (see, e.g., section 2.3.2 in Bishop, 2006) implies: With 1 ∶= u ,̃0 ∶= 0, and using the marginal likelihood from Equation (18), we have The shape of the latter density implies the following collapsed Gibbs sampling step (with the h integrated out): where 1 ∶= u and̃0 ∶= 0.
For the marginal likelihood, with h (h = 1, … , H) and 2 integrated out, again the marginalization rule from section 2.3.7 of Bishop (2006) can be applied. For the improved model, the following marginalization rule implies:

Models "in between" the original and the improved sequentially coupled model
In the last subsection, we have proposed an improved sequentially coupled model,  1,1 . Compared with the original model  0,0 from Section 2.1, we have proposed two modifications: (a) to replace the shared coupling parameter c by segment-specific coupling parameters { h } h≥2 , and (b) to impose a hyperprior onto the hyperparameter c of the inverse gamma prior on the coupling parameters { h } h≥2 . To shed more light onto the relative merits of the two individual modifications, we also define the two "in-between" models where only one of the two modifications is implemented.
The first "in-between" model  1,0 does not introduce segment-specific coupling parameters but places a hyperprior onto the hyperparameter c of the inverse gamma prior on the shared coupling parameter c . A graphical model representation for  1,0 is shown in Figure 4. The posterior distribution of the  1,0 model is an extension of Equation (7), as follows: The modification does neither change the earlier defined full conditional distributions from Section 2.1 nor the marginal likelihood in Equation (10). The only difference is that c has become a free parameter that, thus, must now be sampled too. For the full conditional distribution of c in the  1,0 model, we have This implies for the full conditional distribution:

FIGURE 4
Graphical model for the first "in-between" model  1,0 . See caption of Figure 2 for the terminology.
The model is an extension of the original sequentially coupled model, whose graphical model is shown in Figure 2. Unlike the original  0,0 model, the  1,0 has a free hyperparameter, c , with a gamma hyperprior The second "in-between" model  0,1 does make use of segment-specific coupling parameters { h } h≥2 but keeps the hyperparameter c of the inverse gamma priors on the parameters { h } h≥2 fixed. This yields that the segment-specific coupling parameters 2 , … , H are independent a priori. A graphical model representation is shown in Figure 5. The posterior distribution of the second "in-between" model  0,1 is a simplified version of Equation (13), as follows: and the modification (i.e., fixing c ) does not change either the full conditional distributions in Equations (14)- (16) and (19) or the marginal likelihood in Equation (20). The only difference is that c is kept fixed and will not be inferred from the data. The corresponding Gibbs sampling step [see Equation (21)] is never performed.

Learning the covariate set
In typical applications, the covariates have to be inferred from the data. That is, there is a set of potential covariates, and the subset, relevant for the target Y, has to be found. Let X 1 , … , X n be a set of potential covariates for the target variable Y, and let  1 , … ,  T be equidistant and temporally ordered data points. Each  t contains a target value y t and the values x 1,t−1 , … , x n,t−1 of the n potential covariates. A priori, we assume that all covariate sets FIGURE 5 Graphical model for the second "in-between" model  0,1 . See caption of Figure 3 for the terminology. The model is similar to the proposed improved sequentially coupled model, whose graphical model is shown in Figure 3. Unlike the proposed  1,1 model, the  0,1 model has a fixed hyperparameter c ⊂ {X 1 , … , X n } with up to three covariates are equally likely, whereas all parent sets with more than three elements have zero prior probability. 1 We make the assumption that there cannot be more than three covariates (| | ≤ 3) with regard to our applications in the field of gene network inference, and we note that this assumption might be inappropriate for other applications. However, in the context of gene regulatory networks, this assumption is very common. The known topologies of gene regulatory networks suggest that there are rarely genes that are regulated by more than three regulators. The assumption is thus biologically reasonable and has the advantage that it reduces the complexity of the problem and the computational costs of the Markov chain Monte Carlo (MCMC)-based model inference.
Given a fixed segmentation into the segments h = 1, … , H, for each possible covariate set , the piecewise linear regression models can be applied. We focus our attention on the improved sequentially coupled model  1,1 from Section 2.2, but we note that the MCMC algorithm can also be used for generating samples for the competing models ( 0,0 ,  1,0 , and  0,1 ). Only the marginal likelihood expressions have to be replaced in the acceptance probabilities.
Using the marginal likelihood from Equation (20), we obtain the following for the posterior of the  1,1 model: Given u , { h } h≥2 , and c , the Metropolis-Hastings algorithm can be used to sample the covariate set . We implement three moves: In the deletion move (D), we randomly select one X i ∈ and remove it from . In the addition move (A), we randomly select one X i ∉ and add it to . In the exchange move (E), we randomly select one X i ∈ and replace it by a randomly selected X j ∉ . Each move yields a new covariate set ⋆ , and we propose to replace the current by ⋆ . When randomly selecting the move type, the acceptance probability for the proposed move is , n is the number of potential covariates X 1 , … , X n , and |.| denotes the cardinality.

Learning the segmentation
If the segmentation of the data is unknown, we can also infer it from the data. We assume that a changepoint set A priori, we assume that the distances between changepoints are geometrically distributed with hyperparameter p ∈ (0, 1) and that there cannot be more H = 10 segments. 2 This implies the prior density: Let y ∶= {y 1 , … , y H } denote the segmentation, implied by the changepoint set . Again, we focus on the improved sequentially coupled model  1,1 , and just note that the MCMC algorithm requires only minor adaptions when used for the three competing models, namely the original sequentially coupled model  0,0 (see Section 2.1) and the two "in-between" models  1,0 and  0,1 (see Section 2.4).
Using the marginal likelihood from Equation (20), the posterior of the improved model takes the following form: In the birth move (B), we propose to set a new changepoint at a randomly selected location. The new changepoint set ⋆ then contains H ⋆ = H + 1 changepoints. The new changepoint is located in a segment h and divides it into two consecutive subsegments h and h + 1. For both, we resample the segment-specific coupling parameters ⋆ h , ⋆ h+1 ∼ INV-GAM( c , c ). In the death move (D), we randomly select one changepoint ∈ and delete it. The new changepoint set ⋆ then contains H ⋆ = H − 1 changepoints. Removing a changepoint yields that two adjacent segments h and h + 1 are merged into one single segment h, and we sample ⋆ h ∼ INV-GAM( c , c ). In the reallocation move (R), we randomly select one changepoint ∈ and propose to reallocate it to a randomly selected position in between the two surrounding changepoints. The reallocated changepoint yields new bounds for two consecutive segments h (whose upper bound changes) and h + 1 (whose lower bound changes), and for both segments, we resample the coupling T is the number of data points, and |.| denotes the cardinality. We note that the prior ratio has canceled with the inverse proposal ratio (HR) for resampling the coupling parameters for the new segments.
To adapt the MCMC algorithm to the competing models, the marginal likelihood expressions in the acceptance probability have to be replaced. Moreover, for the two models ( 0,0 and  1,0 ) with a shared coupling parameter c , we follow Grzegorczyk and Husmeier (2012) and implement the three changepoint moves such that they do not propose to resample c .

MCMC inference
For model inference, we use an MCMC algorithm. For the posterior distribution of the improved sequentially coupled model  1,1 , described in Section 2.2, we have We initialize all entities, for example, = {}, = {}, u = 1, h = 1 for h > 1, and c = 1, before we iterate between Gibbs and Metropolis-Hastings sampling steps: Gibbs sampling part: We keep the covariate set and the changepoint set fixed, and we successively resample the parameters u , h (h = 2, … , H), and c from their full conditional distributions. Although the parameters 2 and h (h = 1, … , H) do not appear in the posterior above, the full conditionals of u and 2 , … , H depend on instantiations of 2 and h . The latter parameters thus have to be sampled first but can then be withdrawn at the end of the Gibbs sampling part. The full conditional distributions have been derived in Section 2.3. With 1 ∶= u and 0 = 0, we have We note that each Gibbs step yields parameter updates and that the subsequent full conditional distributions are always based on the newest parameter instantiations (sampled in the previous steps).
Metropolis-Hastings sampling part: We keep u , h (h = 2, … , H), and c fixed, and we perform one Metropolis-Hastings move on the covariate set and one Metropolis-Hastings step on the changepoint set .
(M.1) We propose to change the covariate set → ⋆ by adding, deleting, or exchanging one covariate. The new covariate set is accepted with probability A( , ⋆ ); see Section 2.5 for details. If accepted, we replace ← ⋆ . If rejected, we leave unchanged. (M.2) We propose to change the changepoint set → ⋆ by adding, deleting, or reallocating one changepoint. Along with the changepoint set, we propose to update coupling parameters, ]); see Section 2.6 for details. If accepted, we replace ← ⋆ and { h } h≥2 ← { ⋆ h } h≥2 . If rejected, we leave and { h } h≥2 unchanged. The MCMC algorithm, consisting of seven sampling steps (G.1-5) and (M.1-2), yields a posterior sample, as follows: We adapt the MCMC inference algorithm for the three alternative models ( 0,0 ,  1,0 , and  0,1 ) from Sections 2.1 and 2.4. To this end, we modify the Gibbs and Metropolis-Hastings steps as outlined in Sections 2.4-2.6.

Learning dynamic networks
Dynamic network models are used for learning the regulatory interactions among variables from time series data. The standard assumption is that all regulatory interactions are subject to a time lag ∈ N. Here, we assume that the time lag has the standard value = 1. We further assume that the values of n random variables Y 1 , … , Y n have been measured at T equidistant time points t = 1, … , T. Let D denote the n-by-T data matrix with D i,t being the observed value of Y i at time point t. The piecewise linear regression models, described in the previous subsections, can then be applied to each variable Y i (i = 1, … , n) separately.
In the ith regression model Y i is the target, and the potential covariates are thẽ∶= n − 1

Model Coupling parameter(s) for h≥ Hyperparameter Graphical model see Section
 0,0 shared: c ∼GAM( c , c ) c fixed see Figure 2 2.1  1,0 shared: c ∼GAM( c , c ) c ∼GAM(a,b) see Figure 4 2.4  0,1 segment specific: h ∼GAM( c , c ) c fixed see Figure 5 2.4  1,1 segment specific: h ∼GAM( c , c ) c ∼GAM(a,b) see Figure 3 2.2 Note.  0,0 is the sequentially coupled model from Grzegorczyk and Husmeier (2012). In this paper, we propose the improved  1,1 model, featuring two modifications. We also include the "in-between" models ( 0,1 and  1,0 ) with only one modification incorporated. The first subscript of  indicates whether there is a hyperprior on c (0 = no,1 = yes). The second subscript of  indicates whether the model has segment-specific coupling parameters h (0 = no,1 = yes).
For the models without hyperprior ( 0,0 and  0,1 ), we further set 2) while we use the following for the models with hyperprior ( 1,0 and  1,1 ): For the four models, we run the MCMC algorithms with 100,000 iterations. Withdrawing the first 50% of the samples ("burn-in phase") and thinning out the remaining 50,000 samples (from the "sampling phase") by the factor 100 yield R = 500 samples from each posterior. To check for convergence, we applied diagnostics based on trace plot and potential scale reduction factor diagnostics (see, e.g., Gelman & Rubin, 1992). All diagnostics indicated perfect convergence for the above setting.

Synthetic RAF pathway data
For our cross-method comparison, we generate synthetic network data from the RAF pathway, as reported by Sachs, Perez, Pe'er, Lauffenburger, and Nolan (2005). The RAF pathway shows the regulatory interactions among the following n = 11 proteins: Y 1 : PIP3, Y 2 : PLCG, Y 3 : PIP2, Y 4 : PKC, Y 5 : PKA, Y 6 : JNK, Y 7 : P38, Y 8 : RAF, Y 9 : MEK, Y 10 : ERK, and Y 11 : AKT. There are 20 regulatory interactions (directed edges) in the RAF pathway. We extract the true 11-by-11 adjacency matrix  where  ,i = 1 if there is an edge from the jth to the ith protein. We get The covariate set of variable Y i is then i = {Y ∶  ,i = 1}. We follow Grzegorczyk and Husmeier (2012) and generate synthetic data sets with H = 4 segments having m data points each. For each Y i , we thus require four segment-specific regression coefficient vectors i,1 , … , i,4 , with each being of length | i | + 1. Given those vectors, we generate 11-by-(4m + 1) data matrices D, where D i,t is the value of Y i at t. Let D .,t denote the tth column of D (i.e., the values of the variables at t), and let D i ,t denote the subvector of D .,t containing only the values of the | i | covariates of Y i . We randomly sample the values of D .,1 from independent Gaussian distributions with mean 0 and variance 2 = 0.025, before we successively generate data for the next time points. D i,t (i = 1, … , 11; t = 2, … , 4m + 1) is generated as follows: where the i,t are independently  (0, 2 ) distributed noise variables, and H(t) is a step function, indicating the segment to which time point t belongs, as follows: We sample the regression coefficient vectors for the first segment h = 1, i,1 (i = 1, … , 11), from independent standard Gaussian distributions, and normalize each vector to Euclidean norm 1: i,1 ← i,1 | i,1 | . For the segments h > 1, we change the vector from the previous segment, i,h−1 , as follows: (U) either we change the regression coefficients drastically by flipping their signs, i,h = (−1) · i,h−1 ; we then say that segment h is uncoupled ("U") from segment h − 1, that is, i,h and i,h−1 are very dissimilar; (C) or we change the regression coefficients moderately. To this end, we first sample the entries of a new vector i,⋆ , from independent standard  (0, 1) Gaussians, and normalize i,⋆ to Euclidean norm , i,⋆ ← · i,⋆ | i,⋆ | , where is a tuning parameter. Then, we add the new vector to the vector i,h−1 and renormalize the result to Euclidean norm 1, as follows: We then say that segment h is coupled ("C") to segment h − 1, that is, i,h and i,h−1 are similar.
We use the symbolic notation "C − U − C" to indicate that segment 2 is coupled to segment 1, segment 3 is uncoupled from segment 2, and segment 4 is coupled to segment 3. In our simulation study, we consider all possible scenarios "S 2 − S 3 − S 4 " with S h ∈ {C, U} (h = 2, 3, 4), where S h = U (S h = C) indicates that segment h is uncoupled from (coupled to) segment h − 1, that is, the regression coefficient vectors i,h and i,h−1 are dissimilar (similar).
For coupled segments ("C"), the parameter regulates how similar the vectors i,h and i,h−1 are. For our first study, we set = 0.25. In a follow-up study, we then investigate the effect of and vary this parameter ( ∈ {0, 0.25, 0.5, 1}).

FIGURE 6
The true yeast network, as synthetically designed by Cantone et al. (2009) Cantone et al. (2009) synthetically generated a small network of n = 5 genes in Saccharomyces cerevisiae (yeast), depicted in Figure 6. The five genes are: CBF1, GAL4, SWI5, GAL80, and ASH1. The network among those genes was obtained from synthetically designed yeast cells grown with different carbon sources: galactose ("switch on") or glucose ("switch off "). Cantone et al. (2009) obtained in vivo data with quantitative real-time polymerase chain reaction (RT-PCR) in intervals of 20 min up to 5 h for the first and of 10 min up to 3 h for the second condition. This led to the sample sizes T 1 = 16 ("switch on") and T 2 = 21 ("switch off "). We follow Grzegorczyk and Husmeier (2012) and preprocess the data, (D (1)

Results for synthetic RAF pathway data
In our first empirical evaluation study, we cross-compare the network reconstruction accuracies of the four models  i, (i, j ∈ {0, 1}), listed in Table 1, on synthetic RAF pathway data, generated as described in Section 3.1. In this study, we assume the data segmentation into H = 4 segments to be known. We can then set the three changepoints at the right locations and do not perform MCMC moves on the changepoint set . That is, we keep fixed during the MCMC simulations. The corresponding moves on , described in Section 2.6, are skipped. In our first study, we generate data for eight different coupling scenarios of the form "S 2 − S 3 − S 4 " with S i ∈ {U, C}, where X h = U indicates that segment h is uncoupled from segment h − 1 (i.e., i,h and i,h−1 are dissimilar), and S h = C indicates that segment h is coupled to segment h − 1 (i.e., i,h and i,h−1 are similar). For the technical details, we refer to Section 3.1.
First, we perform a sanity check for the proposed  1,1 model: We investigate whether it actually infers different coupling parameters for the segments and whether the segment-specific coupling parameter distributions are consistent with the underlying coupling schemes of the form "S 2 − S 3 − S 4 ". For uncoupled segments with S h = U the coupling parameters should on

FIGURE 7
Boxplots of the logarithmic segment-specific coupling parameters for the RAF pathway data. We generated data with H = 4 segments and m = 10 data points per segment and distinguished eight coupling scenarios of the type: "S 2 − S 3 − S 4 " with S h ∈ {U, C}. For each scenario, there is a panel with four boxplots. The boxplots indicate how the logarithmic sampled coupling parameters log( 1 ) ∶= log( u ), log( 2 ), log( 3 ), and log( 4 ) are distributed for the given scenario. For a compact representation, we decided to merge all samples taken for N = 11 variables from independent Markov chain Monte Carlo simulations on 25 independent data instantiations. In each panel, the hth boxplot from the left refers to log( h ). Our focus is on h with h > 1, and it can be seen that the coupling parameters for coupled segments (S h = C) are lower than for uncoupled segments (S h = U). For m = 5 data points per scenario, we observed the same trends (boxplots not shown) average be greater than for coupled segments with S h = C (h = 2, 3, 4). Figure 7 shows boxplots of the inferred segment-specific coupling parameters 1 (= u ), 2 , 3 , and 4 . Focusing on h with h = 2, 3, 4, it can be seen from the boxplots that the coupling parameters for coupled segments (where S h = C) are consistently lower than for uncoupled segments (where S h = U).
The AUC results, provided in Figure 8, show that the proposed generalized model ( 1,1 ) shows, overall, the best performance. It is always among the best models and never performs substantially worse than any other model. On the other hand, the proposed  1,1 model outperforms its competitors for some settings, especially for scenarios where two out of three segments with h > 1 are coupled to the previous segment. For the scenarios "C − C − U", "U − C − C", and "C − U − C", the proposed model performs better than the three other models. When comparing the two in-between models ( 1,0 and  0,1 ) with the original  0,0 model from Grzegorczyk and Husmeier (2012), it becomes obvious that imposing a hyperprior on c , as implemented in the  1,0 model, almost consistently improves the AUC scores, whereas making the coupling parameter segment specific, as done in the  0,1 model, can lead to deteriorations of the AUC scores. We draw the conclusion that replacing the coupling parameter c by segment-specific parameters 2 , … , 4 is counterproductive, unless this modification is combined with a hyperprior so that information can be shared among the segment-specific coupling strengths. Just imposing a hyperprior, which then only allows us to adjust the prior for the coupling strength parameter c (in light of the data), also improves the network reconstruction accuracy, but the improvement is slightly minor to the improvement that can be achieved by implementing both modifications together, as proposed in this paper.
In a follow-up study, we then had a closer look at the eight scenarios and varied the tuning parameter ∈ {0, 0.25, 0.5, 1} for each of them. With the parameter , the similarity of the regression parameters (i.e., the coupling strength between coupled segments) can be adjusted. The greater , the weaker the similarity of the regression coefficient i,h and i,h−1 of coupled (a) (b) segments; see Section 3.1 for the mathematical details. As an example, Figure 9 shows the results for the scenario "C − C − U", which belongs to the scenarios where the proposed  1,1 model was found to outperform its competitors (see Figure 8). We can see from the AUC results in Figure 9 that = 0 and = 0.5 yield the same trends, as observed earlier for = 0.25; see Figure 8. However, for the highest ( = 1),  1,0 and  1,1 perform equally well and better than the other two models ( 0,0 and  0,1 ). The explanation for this finding is most likely as follows: The similarity of the coupled regression coefficients decreases in . Hence, for = 1, even the regression parameters for the two coupled segments get very dissimilar. Thus, = 1 implies that all four segment-specific regression coefficients i,h (h = 1, … , 4) are dissimilar, and there is no more need for segment-specific coupling parameters. The reason why for = 1, the original  0,0 model and the  0,1 model are inferior to the models that possess hyperpriors is probably the following: The models  0,0 and  0,1 cannot adjust the prior of the coupling parameter (in light of the data). As a consequence, they are likely to overpenalize dissimilar regression coefficients (i.e., high coupling parameters) through the prior.

FIGURE 9
Network reconstruction accuracy for the RAF pathway data. Unlike in Figure 8, here, we focus on the scenario "C − C − U" and vary the tuning parameter ∈ {0, 0.25, 0.5, 1}. Like in Figure 8, the model-specific bars and error bars correspond to the average area under the precision recall curve (AUC) values and their standard deviations

Results for the yeast gene expression data
In this subsection, we cross-compare the network reconstruction accuracies of the four models  i, (i, j = 0, 1), listed in Table 1, on the yeast gene expression data, described in Section 3.2. For this application, we infer the data segmentation (i.e., the changepoint set ) along with the network structure from the data. To vary the segmentations and especially the number of segments (i.e., the number of changepoints in ), we implement the four models with eight different hyperparameters p ∈ {0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8} of the geometric prior on the distance between changepoints; see Equation (23) in Section 2.6 for the mathematical details. We note that the hyperparameters are of the form p = 0.1 · 2 i with i = −4, −3, … , 3. Figure 10 shows histograms of the inferred posterior distributions of the numbers of segments H for the proposed  1,1 model from Section 2.2. It can be clearly seen that the inferred FIGURE 10 Posterior distribution of the numbers of segments H for the yeast data from Section 3.2. We implemented the models with different hyperparameters p for the geometric distribution on the distance between changepoints. For the proposed  1,1 model, the histograms show how the posterior distributions of the numbers of segments H vary with the hyperparameter p. Each row refers to a gene of the yeast network, and the four columns refer to the hyperparameters p ∈ {0.00625, 0.025, 0.1, 0.4}. For each number of segments 1 ≤ H ≤ 10, the bars give the relative frequencies with which the data of the corresponding gene were segmented into H segments in the posterior samples. The relative frequencies are averaged over 25 independent Markov chain Monte Carlo simulations FIGURE 11 Network reconstruction accuracy for the yeast gene expression data from Section 3.2. For our study, we implemented the four models, listed in Table 1, with eight different hyperparameters p = 0.1 · 2 i (i ∈ {−4, −3, … , 3}) for the geometric distribution on the distance between changepoints; see Equation (23). The upper left panel shows the average area under the precision recall curve (AUC) scores, averaged across 25 Markov chain Monte Carlo simulations. The other panels show the relative area under the precision recall curve differences in favor of the  1,1 model, with error bars indicating 95% t-test confidence intervals segmentations strongly depend on the hyperparameter p. For the lowest p, the data are rarely divided into more than H = 2 segments, whereas the posterior for p = 0.4 peaks at the imposed maximum of H = 10 segments. For the other three models, we observed almost identical trends (histograms not shown in this paper). Figure 11 shows how the empirical network reconstruction accuracy, quantified in terms of the average AUC values, varies with the hyperparameter p. From the upper left panel of Figure 11, which shows the average AUC scores, the following trends can be observed: • The  0,1 model (which has segment-specific coupling parameters 2 , … , H but does not couple them by a gamma hyperprior) performs worse than the original sequentially coupled model from Grzegorczyk and Husmeier (2012). Only for very small hyperparameters p ≤ 0.05 (i.e., when few changepoints are learned) the two models  0,0 and  0,1 reach approximately the same AUC scores. For higher hyperparameters, the introduction of segment-specific coupling parameters has a counterproductive effect and is thus not recommended. • The  1,0 model, which leaves the coupling parameter c shared among segments and only imposes a hyperprior to adjust the prior on c (in light of the data), performs better than the original model from Grzegorczyk and Husmeier (2012). For very small hyperparameters p ≤ 0.0125 (i.e., when very few changepoints are learned), the two models  0,0 and  1,0 reach approximately the same AUC scores. For the six higher hyperparameters p > 0.0125, the introduction of the hyperprior consistently leads to improved network reconstruction accuracies. • The  1,1 model, which we propose in this paper, has both modifications implemented: It has segment-specific coupling parameters 2 , … , H and couples those parameters by a gamma hyperprior; see Section 2.2 for a detailed model description. The combination of both modifications yields that the  1,1 model performs best overall. Its performance even stays stable for rather large hyperparameters p, where the AUC scores of the other three models substantially diminish. The AUC difference plots in Figure 11 show that most of the improvements are statistically significant in terms of paired t tests. In particular, the  1,1 model performs better than the original  0,0 model, proposed by Grzegorczyk and Husmeier (2012), for six out of eight hyperparameters p, namely for all p > 0.0125.
Based on our network reconstruction accuracy results for the real gene expression data, we conclude that the performance of the proposed generalized model ( 1,1 ) is superior to the performance of the original sequentially coupled model ( 0,0 ). The empirical results obtained with the two "in-between" models ( 1,0 and  0,1 ) suggest further that the main contribution stems from the hyperprior. The capability to adjust the coupling parameter prior in light of the data boosts the performance. The model  0,1 , whose segment-specific coupling parameters are not coupled by a hyperprior, led to decreased AUC results. The results, obtained for the yeast gene expression data, are hence in agreement with the earlier results obtained for synthetic network data; see Section 4.1.

CONCLUSIONS
In this paper, we have proposed an improved version of the NH-DBN model, proposed by Grzegorczyk and Husmeier (2012). Unlike the original  0,0 model, our new  1,1 model possesses segment-specific coupling (strength) parameters and a hyperprior on the coupling parameter priors; see Section 2.2 for the mathematical details. Replacing the shared coupling parameter c by segment-specific coupling parameters 1 , … , H increases the model flexibility, whereas the new hyperprior is allowing information exchange among segments and adjusting the coupling parameter prior(s) in light of the data. Our empirical evaluation studies on synthetic RAF pathway data (see Section 4.1) and on yeast gene expression data (see Section 4.2) have shown that the new  1,1 model leads to improved network reconstruction accuracies. To gain more insight into the merits of the two individual modifications, we also compared it with the performances of the two "in-between" models ( 1,0 and  0,1 ), which we defined to be subject to only one of the two modifications; see Table 1 for an overview to the four models  i, (i, j ∈ {0, 1}) under comparison. Overall, the proposed  1,1 has reached the highest network reconstruction accuracies among the four models. The  0,1 model, which we defined to have segment-specific coupling parameters but no hyperprior, performed worse than the original  0,0 model. The  1,0 model, which we defined to have a shared coupling parameter with a hyperprior, performed better than the original  0,0 model and for some scenarios comparable with the proposed  1,1 model. This shows that the major part of the improvement, achieved with the proposed  1,1 , stems from imposing a hyperprior onto the coupling parameter prior(s).
To put it in a nutshell, our empirical results show that the model variant  1,1 reaches, overall, the highest network reconstruction accuracies. With regard to future applications, we therefore recommend giving precedence to this model. Moreover, our results for the yeast gene expression data (see Figure 11) also suggest that only the  1,1 model is robust with respect to the changepoint process hyperparameter. The network reconstruction accuracies of the other models deteriorate, as the number of inferred changepoints increases. Only the network reconstruction accuracy of the  1,1 model stays high, even if the data are divided into short (uninformative) segments. This is a very important property for applications where the underlying segmentation is unknown and has to be inferred from the data. The number of inferred changepoints (i.e., the data segmentation) strongly depends on the changepoint process hyperparameter (see Figure 10). In the absence of any genuine prior knowledge, the changepoint process hyperparameter can easily be misspecified. Nonrobust models will then output biased results that might lead to erroneous conclusions.
Our future work will aim to transfer the concept of segment-specific coupling (strength) parameters to the globally coupled NH-DBN model from Grzegorczyk and Husmeier (2013). Unlike the sequential coupling mechanism, which requires a temporal ordering of the segments, the global coupling mechanism treats all segments as interchangeable units. When segmenting a single time series by changepoints, the assumption of interchangeable segments is often not appropriate. However, there are other applications in systems biology where data stem from different experiments so that the segments might contain data from different experimental conditions. The segments then do not have any natural order, and the sequential coupling scheme should be replaced by the global coupling scheme.