While much attention has recently been paid to big data, in which data are written to massive repositories for later analysis, there is also a rapidly increasing amount of data available in the form of data streams or events. Data streams typically represent very recent measurements or current system states. Events represent things that happen, often in the context of computer processing. When processing data streams or events, we often need to make decisions in real time. Complex event processing (CEP) is an important area of computer science that provides powerful tools for processing events and analyzing data streams. CEP deals with events that can be composed of other events and can model complex phenomena such as a user's interactions with a website or a stock market crash. In the current literature, CEP is almost entirely deterministic, that is, it does not account for randomness or rely on statistical methods. However, statistics and machine learning have a critical role to play in the use of data streams and events. Also, understanding how CEP works is critical to analyzing data based on complex events. When processing data streams, a distinction must be made between analysis, the human activity in which we try to gain understanding of an underlying process, and decision making, in which we apply knowledge to data to decide what action to take. Useful statistical techniques for data streams include smoothing, generalized additive models, change point detection, and classification methods.
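Two of the streaming techniques mentioned above, smoothing and change point detection, can be sketched in a few lines. The sketch below is illustrative only; the parameter names (`alpha`, `threshold`, `drift`) and the simulated stream are our own choices, not taken from the article.

```python
# Minimal sketch: exponential smoothing and a one-sided CUSUM
# change-point detector, applied to a simulated data stream.

def exp_smooth(stream, alpha=0.2):
    """Exponential smoothing: each output blends the new observation
    with the previous smoothed value."""
    s = None
    for x in stream:
        s = x if s is None else alpha * x + (1 - alpha) * s
        yield s

def cusum(stream, target, threshold=5.0, drift=0.5):
    """Flags the index at which the cumulative upward deviation from
    `target` (less an allowed drift) exceeds `threshold`."""
    g = 0.0
    for i, x in enumerate(stream):
        g = max(0.0, g + (x - target - drift))
        if g > threshold:
            return i
    return None

# A stream whose mean jumps from 0 to 3 at index 50.
data = [0.0] * 50 + [3.0] * 50
smoothed = list(exp_smooth(data))
change_at = cusum(data, target=0.0)
```

Both functions process one observation at a time and keep only constant state, which is what makes them usable for real-time decision making on unbounded streams.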

For further resources related to this article, please visit the WIREs website.

Interval-valued data refers to collections of observations in the form of intervals rather than single numbers. Such data originally arose from situations of imprecision due to factors such as measurement or computation errors, where intervals represent true data points that lie inside the intervals but are not exactly known. Other circumstances include grouping and censoring. Recently, with the trend toward big data, interval-valued data resulting from data aggregation have become increasingly common. Over the past decades, a great deal of effort in the literature has been devoted to investigating linear regression with interval-valued data. Various models that provide predictive tools and statistical inferences have been proposed and studied. The framework thus established is also well suited for future theoretical and computational advancements.
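One widely studied approach to linear regression with interval-valued data fits separate least squares models to the interval centers and to the interval half-ranges (the "center-and-range" idea). The sketch below is a minimal illustration under that assumption; the toy intervals are invented for the example and are not from the article.

```python
# Minimal sketch of center-and-range style regression for
# interval-valued data: fit OLS separately to centers and half-ranges.

def ols(x, y):
    """Simple least squares fit y ≈ a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Intervals as (lower, upper) pairs for predictor X and response Y.
X = [(1, 3), (2, 6), (4, 8), (5, 11)]
Y = [(3, 7), (5, 13), (9, 17), (11, 23)]

xc = [(l + u) / 2 for l, u in X]   # centers
xr = [(u - l) / 2 for l, u in X]   # half-ranges
yc = [(l + u) / 2 for l, u in Y]
yr = [(u - l) / 2 for l, u in Y]

ac, bc = ols(xc, yc)   # model for interval centers
ar, br = ols(xr, yr)   # model for interval half-ranges

def predict(l, u):
    """Predicted response interval for a new predictor interval [l, u]."""
    c = ac + bc * (l + u) / 2
    r = abs(ar + br * (u - l) / 2)   # keep the half-range nonnegative
    return c - r, c + r
```

Treating centers and ranges as two coupled regressions is only one of the model families the literature considers; constrained and distribution-based variants address the possibility of negative predicted ranges more carefully than the `abs` used here.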


Today, algorithms such as the gradient boosting machine and the random forest are among the most competitive tools in prediction contests. We review how these algorithms came about. The basic underlying idea is to aggregate predictions from a diverse collection of models. We also explore a few very diverse directions in which the basic idea has evolved, and clarify some common misconceptions that grew as the idea steadily gained popularity. *WIREs Comput Stat* 2015, 7:357–371. doi: 10.1002/wics.1362
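The aggregation idea can be illustrated with bagging, the simplest of the ensemble schemes the review covers: train many weak learners on bootstrap resamples and average their predictions. The stump learner and toy data below are our own illustrative choices.

```python
import random

# Minimal sketch of bagging: fit one-split regression "stumps" to
# bootstrap resamples of 1-D data, then average their predictions.

def fit_stump(xs, ys):
    """Best single-threshold split of 1-D data by squared error."""
    best = None
    pairs = sorted(zip(xs, ys))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no split between equal x's
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - ml) ** 2 for y in left)
               + sum((y - mr) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:                      # degenerate sample: constant x
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

def bagged_predict(xs, ys, x_new, n_trees=200, seed=0):
    """Average the predictions of stumps fit to bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(stump(x_new))
    return sum(preds) / len(preds)

# Noisy-free step function: y jumps from 0 to 1 at x = 5.
xs = list(range(10))
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
pred_low = bagged_predict(xs, ys, 2.0)
pred_high = bagged_predict(xs, ys, 8.0)
```

Random forests extend this recipe by also randomizing the features considered at each split, and gradient boosting replaces the independent resamples with a sequential fit to residuals; the common thread is the aggregation of a diverse model collection.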


Classical statistics relies largely on parametric models. Typically, assumptions are made on the structural and the stochastic parts of the model, and optimal procedures are derived under these assumptions. Standard examples are least squares estimators in linear models and their extensions, maximum-likelihood estimators and the corresponding likelihood-based tests, and generalized method of moments (GMM) techniques in econometrics. Robust statistics deals with deviations from the stochastic assumptions and their dangers for classical estimators and tests, and develops statistical procedures that are still reliable and reasonably efficient in the presence of such deviations. It can be viewed as a statistical theory dealing with approximate parametric models by providing a reasonable compromise between the rigidity of a strict parametric approach and the potential difficulties of interpretation of a fully nonparametric analysis. Many classical procedures are well known for not being robust. These procedures are optimal when the assumed model holds exactly, but they are biased and/or inefficient when small deviations from the model are present. The statistical results obtained from standard classical procedures on real data applications can therefore be misleading. In this paper we give a brief introduction to robust statistics by reviewing some basic general concepts and tools and by showing how they can be used in data analysis to provide an alternative complementary analysis with additional useful information. We focus on robust statistical procedures based on M-estimators and tests because they provide a unified statistical framework that complements the classical theory. Robust procedures are discussed for standard models, including linear models, generalized linear models, and multivariate analysis. Some recent developments in high-dimensional statistics are also outlined. *WIREs Comput Stat* 2015, 7:372–393. doi: 10.1002/wics.1363
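The contrast between a classical estimator and an M-estimator is easy to see in the simplest setting, estimating a location parameter. The sketch below computes the Huber M-estimate by iteratively reweighted averaging; it is a bare-bones illustration (a full implementation would standardize residuals by a robust scale estimate such as the MAD), and the data are invented for the example.

```python
# Minimal sketch of an M-estimator: the Huber estimate of location.
# Observations far from the current estimate are down-weighted, so a
# single gross outlier barely moves the result, unlike the sample mean.
# k = 1.345 is the usual tuning constant giving high efficiency at the
# normal model (here applied to raw residuals for simplicity).

def huber_location(xs, k=1.345, n_iter=50):
    mu = sorted(xs)[len(xs) // 2]          # start from (roughly) the median
    for _ in range(n_iter):
        w = [1.0 if abs(x - mu) <= k else k / abs(x - mu) for x in xs]
        mu = sum(wi * xi for wi, xi in zip(w, xs)) / sum(w)
    return mu

data = [9.8, 10.1, 9.9, 10.2, 10.0, 100.0]   # one gross outlier
robust = huber_location(data)                 # stays near 10
naive = sum(data) / len(data)                 # dragged toward the outlier
```

The reweighting loop is exactly the "reasonable compromise" the abstract describes: with no outliers all weights equal 1 and the estimate coincides with the efficient classical mean, while contaminated observations are smoothly discounted rather than either kept at full weight or discarded.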


We were motivated by three novel technologies that exemplify a new design paradigm in high-throughput genomics: NanoString ^{TM}; DNA-mediated Annealing, Selection, extension, and Ligation (DASL ^{TM}); and multiplex real-time *quantitative polymerase chain reaction* (QPCR). All three are based on solution hybridization, and all three employ 10–1000 DNA sequence probes in a small volume, each probe specific for a particular sequence in a different human gene. NanoString ^{TM} uses 50-mer probes; DASL and multiplex QPCR use ∼20-mer probes. Assuming a 1-nM probe concentration in a 1-μL volume, there are 10^{− 9} mol/L × 10^{− 6} L × 6.02 × 10^{23} molecules/mol, or about 6 × 10^{8}, molecules of each probe present in the reaction, compared with 10–1000 target molecules. Excess probe drives the sensitivity of the reaction. We are interested in the limits of multiplexing, i.e., the probability that in such a design a particular probe would bind to any other, sequence-related probe rather than the intended, specific target. If this were to happen with appreciable frequency, it would result in much reduced sensitivity and potential failure of this design. We established upper and lower bounds for the probability that in a multiplex assay at least one probe would bind to another sequence-related probe rather than its cognate target. These bounds are reassuring, because for reasonable degrees of multiplexing (10^{3} probes) the probability of such an event is practically negligible. As the degree of multiplexing increases to ∼10^{6} probes, our theoretical boundaries gain practical importance and establish a principal upper limit for the use of highly multiplexed solution-based assays vis-à-vis solid-support anchored designs. *WIREs Comput Stat* 2015, 7:394–399. doi: 10.1002/wics.1364
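The probe-excess arithmetic above can be recomputed from first principles; the short calculation below uses only the stated concentration, volume, and Avogadro's number (the variable names are ours).

```python
# Recomputing the per-probe molecule count quoted above.
AVOGADRO = 6.022e23          # molecules per mole

conc_molar = 1e-9            # 1 nM probe concentration (mol/L)
volume_l = 1e-6              # 1 μL reaction volume (L)

# moles = concentration * volume; molecules = moles * Avogadro
probe_molecules = conc_molar * volume_l * AVOGADRO   # ≈ 6.0e8 per probe

targets = 1000               # upper end of the 10-1000 target molecules
excess = probe_molecules / targets                   # fold excess of probe
```

Even against the upper end of 1000 target molecules, each probe is present in roughly a 6 × 10^{5}-fold excess, which is what drives the sensitivity of the reaction.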


Using attributed graphs to model network data has become an attractive approach for various graph inference tasks. Consider a network containing a small subset of interesting entities whose identities are not fully known and whose discovery would be of some significance. Vertex nomination, a subclass of recommender systems that exploits attributed graphs, is the task of identifying unknown entities that are similarly interesting or exhibit analogous latent attributes. This task is a specific type of community detection and is increasingly becoming a subject of current research in many disciplines. Recent studies have shown that information relevant to this task is contained in both the structure of the network and its attributes, and that jointly exploiting the two can provide better vertex nomination performance than either used alone. We adopt this approach to formulate a Bayesian model for the vertex nomination problem. Specifically, the goal is to construct a ‘nomination list’ in which entities that are truly interesting are concentrated at the top. Inference with the model is conducted using a Metropolis-within-Gibbs algorithm. Performance of the model is illustrated by a Monte Carlo simulation study and on the well-known Enron email dataset. *WIREs Comput Stat* 2015, 7:400–416. doi: 10.1002/wics.1365
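The Metropolis-within-Gibbs scheme used for inference can be sketched generically: cycle through the model's parameters, updating each in turn with a one-dimensional Metropolis step while the others are held fixed. The target below (a correlated bivariate normal) is a stand-in for the vertex-nomination posterior, not the article's actual model, and the step size and density are illustrative assumptions.

```python
import math
import random

# Minimal sketch of Metropolis-within-Gibbs on a two-parameter target.

def log_target(x, y, rho=0.8):
    """Unnormalized log density of a correlated bivariate normal."""
    return -(x * x - 2 * rho * x * y + y * y) / (2 * (1 - rho * rho))

def metropolis_within_gibbs(n_samples, step=1.0, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        # Update x with y held fixed, then y with x held fixed.
        prop = x + rng.gauss(0, step)
        if math.log(rng.random()) < log_target(prop, y) - log_target(x, y):
            x = prop
        prop = y + rng.gauss(0, step)
        if math.log(rng.random()) < log_target(x, prop) - log_target(x, y):
            y = prop
        samples.append((x, y))
    return samples

draws = metropolis_within_gibbs(20000)
mean_x = sum(s[0] for s in draws) / len(draws)
var_x = sum((s[0] - mean_x) ** 2 for s in draws) / len(draws)
```

In the vertex nomination setting, each sweep would update block memberships and model parameters rather than two real coordinates, but the alternation of conditional Metropolis updates is the same.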
