## 1. Introduction

[2] Over the past couple of decades, there has been explosive growth in the development of real-time sensor networks and other means of collecting and storing data in most areas of modern science [*Szalay and Gray*, 2006]. To date, however, the analysis tools needed to mine these data adequately have not kept pace [*Gil et al*., 2007]. Recent emphasis on interdisciplinary research adds to the challenge because the data networks found, for example, in astronomy [*Lang and Hogg*, 2011], protein folding [*Khatib et al*., 2011], the Earth system [*Bickel et al*., 2007], and the hydrological sciences [*Wagner et al*., 2009] demand expertise across multiple disciplines for interpretation. These data-intensive issues create a need for advanced statistical and computational tools capable of analyzing the complex multivariate associations and uncertainty inherent in these large data networks [*Emmott and Rison*, 2005].

[3] Pattern recognition techniques, such as clustering and classification, are important components of intelligent data preprocessing, data mining, and decision making systems [*Schalkoff*, 1992]. These tools are of particular interest in hydrological river research [*Dollar et al*., 2007; *Helsel and Hirsch*, 1992; *Wright*, 2000], where a number of statistical classification methods, both parametric and nonparametric, have been used to classify river regimes [*Harris et al*., 2000], explore the influence of streamflow on biological communities [*Monk et al*., 2006], and optimize the selection of input data to improve ecohydrological classification [*Snelder et al*., 2005] at multiple scales, including catchment [*Pegg and Pierce*, 2002], regional [*Nathan and McMahon*, 1990], national [*Poff*, 1996], and continental [*Puckridge et al*., 1998]. In addition, hydrologists often gain insights from well-gauged regions and classify or extrapolate to sparsely gauged regions using a limited number of stream characteristics. For example, *Kondolf* [1995] used geomorphologic characteristics to classify stream channel stability; *Alberto et al*. [2001] used selected water quality parameters to identify variation at multiple temporal and spatial scales; *Rabeni et al*. [2002] used benthic invertebrates for stream habitat health classification; and *Besaw et al*. [2010] used local climate data to predict flow in small ungauged streams.

[4] Multiple correlated and cross-correlated data, missing data, binary data (i.e., presence or absence), and, most importantly, the uncertainty inherent in these data impose significant limitations on existing classification and clustering algorithms and demand the development of new or hybrid clustering techniques [*Jain*, 2010]. A recent NSF-sponsored workshop, *Opportunities and Challenges in Uncertainty Quantification for Complex Interacting Systems* [*Ghanem*, 2009], recognized the need for new computationally efficient tools capable of improving the quantification of uncertainty in inferred models [*Roache*, 1997], network structure, and model parameters [*Katz et al*., 2002].

[5] Bayesian methodology provides a fundamental approach to the problem of pattern classification [*Duda and Hart*, 1973] and offers the ability to quantify and reduce any kind of uncertainty given enough relevant new information (i.e., prior data) [*Malakoff*, 1999]. Combined with Markov chain Monte Carlo (MCMC) methods, Bayesian approaches have become popular for processing data and knowledge [*Steinschneider et al*., 2012] because of the relative computational ease with which they handle complex data sets [*Han et al*., 2012]. Bayesian methods have proven useful in hydrologic applications for parameter estimation and uncertainty assessment [*Smith and Marshall*, 2008], and have been incorporated into stochastic simulation models [*Balakrishnan et al*., 2003; *Leube et al*., 2012; *Williams and Maxwell*, 2011] and optimization techniques [*Mariethoz et al*., 2010; *Reed and Kollat*, 2012] to reduce prediction uncertainty. In this work, we develop a new classification tool that couples the concept of a Naïve Bayesian classifier with an artificial neural network often used for nonparametric clustering and classification.
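To make the Naïve Bayesian concept mentioned above concrete, the following is a minimal sketch of a textbook Bernoulli naïve Bayes classifier for presence/absence data. It is not the classifier developed in this work. The function names (`fit_bernoulli_nb`, `posterior`), the Laplace smoothing constant `alpha`, and the toy data are all assumptions introduced for illustration only. The sketch shows the core idea: the posterior class probability is proportional to the class prior multiplied by the product of per-feature likelihoods, which are treated as conditionally independent.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature presence probabilities.

    X: (n_samples, n_features) array of 0/1 values; y: integer class labels.
    alpha: Laplace smoothing constant (assumed value; avoids zero probabilities).
    """
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # Smoothed estimate of P(feature present | class), one row per class.
    theta = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2.0 * alpha)
        for c in classes
    ])
    return classes, priors, theta

def posterior(x, priors, theta):
    """Return P(class | x) under the conditional-independence assumption."""
    # Sum log-likelihoods so that many small probabilities do not underflow.
    log_like = (x * np.log(theta) + (1 - x) * np.log(1.0 - theta)).sum(axis=1)
    log_post = np.log(priors) + log_like
    log_post -= log_post.max()          # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()            # normalise so the posteriors sum to 1

# Hypothetical toy data: two binary features, two classes.
X = np.array([[1, 1], [1, 0], [0, 0], [0, 1]])
y = np.array([0, 0, 1, 1])

classes, priors, theta = fit_bernoulli_nb(X, y)
post = posterior(np.array([1, 0]), priors, theta)
print(dict(zip(classes.tolist(), post.round(3).tolist())))
```

For the query `[1, 0]`, the posterior favors class 0 (about 0.75 versus 0.25), because the first feature is present in every class 0 training sample and in no class 1 sample, while the second feature is uninformative in this toy set.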

[6] Briefly, artificial neural networks (ANNs) are nonparametric statistical tools that specialize in nonlinear mappings given large amounts of data [*Haykin*, 1999; *Mitra et al*., 2002]. They have gained popularity in applications that require mining large numbers of multiple data types with both continuous and categorical responses [*Zhang et al*., 1998]. Along with other nonparametric statistical techniques, they are better suited than physics-based models [*Govindaraju and Artific*, 2000a, 2000b] when the objective is classification or system characterization rather than an understanding of the physical system [*Kokkonen and Jakeman*, 2001]. Recently, ANNs have been shown to be more successful in many hydrology-related applications [*Maier and Dandy*, 2000; *Solomatine and Ostfeld*, 2008] than their traditional (parametric) statistical counterparts such as discriminant analysis [*Yoon et al*., 1993], regression techniques [*Paruelo and Tomasel*, 1997], principal component analysis [*Kramer*, 1991], and Bayesian analysis [*Cheng and Titterington*, 1994; *Richard and Lippmann*, 1991; *Wan*, 1990].

[7] Bayesian analysis has been incorporated into ANN algorithms used in hydrology for the purpose of improving the training procedure and overcoming the computational limitations associated with optimizing the ANN hidden weights [*Kingston et al*., 2005b]. Despite difficulties in coding these advanced Bayesian approaches into existing ANN learning algorithms (see *Titterington* [2004] for details), these predictive models provide a better means of quantifying uncertainty and validating models than the best deterministic ANN models while maintaining the high computational performance associated with traditional ANNs [*Kingston et al*., 2005a, 2008; *Zhang and Zhao*, 2012].

[8] In this work, we developed and applied a new framework that couples a Simple Bayesian analysis with a clustering ANN to advance the efficiency and statistical optimality of both techniques. Specifically, we use a Naïve Bayesian classifier in combination with an unsupervised ANN (the Kohonen Self-Organizing Map, SOM [*Kohonen*, 1982, 1990]) to leverage prior information (or evidence) embedded in multiple data for the purpose of improving classification while minimizing within-class variance. As proof of concept, we applied, evaluated, and tested the Bayesian-SOM network using two real-world data sets. The first uses genetic data and expert-assessed morphological data to predict the relative abundance of worm taxa related to the state of Whirling Disease in native fish populations in the upper Madison River, MT, USA. Specifically, we spatially estimate the relative abundance of stream sediment-dwelling worms, which are the definitive host of the parasite that causes Whirling Disease in fish that ingest them. The second application uses stream geomorphic and water quality data measured in ∼2500 Vermont stream reaches (comprising 1371 stream miles) to assess habitat conditions. We compared the new classification tool with traditional classification techniques, a Simple Bayesian analysis, a traditional Naïve Bayesian classifier, and Gaussian mixture models.
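As background for the unsupervised component named above, the following is a minimal sketch of a generic one-dimensional Kohonen SOM written with NumPy. It is a textbook illustration, not the Bayesian-SOM developed in this work. The function names (`train_som`, `assign`), the map size, the linearly decaying learning rate, the Gaussian neighborhood function, and the synthetic two-cluster data are all assumptions chosen for brevity.

```python
import numpy as np

def train_som(data, n_nodes=4, n_epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1-D self-organizing map; returns weights of shape (n_nodes, n_features)."""
    rng = np.random.default_rng(seed)
    # Initialize node weights from randomly chosen training samples.
    weights = data[rng.choice(len(data), size=n_nodes, replace=False)].astype(float)
    positions = np.arange(n_nodes)        # node coordinates along the 1-D map
    n_steps = n_epochs * len(data)
    step = 0
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighborhood width
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood centered on the BMU in map coordinates.
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            # Pull the BMU and its map neighbors toward the sample.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def assign(data, weights):
    """Map each sample to the index of its best-matching node."""
    dists = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

# Hypothetical toy data: two well-separated 2-D clusters of 20 points each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.5, (20, 2)),
                  rng.normal(10.0, 0.5, (20, 2))])

weights = train_som(data)
labels = assign(data, weights)
print(weights.round(2))
print(labels)
```

Each sample is assigned to the node with the nearest weight vector, so the trained nodes act as cluster prototypes. In a coupled scheme, class probabilities from a Bayesian classifier could serve as the inputs to such a map, but how the two components are combined in this work is not shown by this sketch.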