Using an affinity analysis to identify phytoplankton associations

Abstract Phytoplankton functional traits can represent particular environmental conditions in complex aquatic ecosystems. Categorizing phytoplankton species into functional groups is challenging and time-consuming, and requires high-level expertise in species autecology. In this study, we introduced an affinity analysis to aid the identification of candidate associations of phytoplankton from two data sets comprising phytoplankton and environmental information. In the Huaihe River Basin, China (drainage area 270,000 km²), samples were collected from 217 selected sites during the low-water period in May 2013; in Dishui Lake, a man-made pond, samples were collected monthly during 2006–2011. Our results indicated that the affinity analysis can be used to define some meaningful functional groups. The identified phytoplankton associations reflect the ecological preferences of phytoplankton in terms of light and nutrient acquisition. Advantages and disadvantages of applying the affinity analysis to identify phytoplankton associations are discussed with perspectives on their utility in ecological assessment.


1 | INTRODUCTION
Ecologists face the challenge of deciphering the rich biological and environmental information embedded in an ecological community. The classification of taxonomic units into functional groups based on morphology and species traits has been widely used in ecological research (Litchman & Klausmeier, 2008; Usseglio-Polatera et al., 2000). Pooling species with similar morphological or physiological characteristics into ecological groups can help ecologists better understand the interactions between biological communities and their environment. For example, stream macroinvertebrates have been categorized into functional feeding groups, such as scrapers, shredders, collector-gatherers, collector-filterers, and predators. Logez et al. (2013) suggested that similar fish assemblage functional structures will be found in similar environmental conditions. Phytoplankton are among the primary producers in aquatic ecosystems, and their community structure responds rapidly and sensitively to varying environmental conditions. This makes them of interest to managers of aquatic systems, especially in rapidly developing regions such as China, where water pollution is serious. Many phytoplankton ecological studies have been conducted in Europe (EC Parliament and Council, 2000), North America (Arhonditsis et al., 2004), and China (Deng et al., 2014). The implementation of water programs has generated an enormous amount of phytoplankton data at large spatial scales using standardized field protocols. Using these data sets effectively to enhance our understanding of phytoplankton assemblages in relation to their environments and to water resource management remains challenging.
To provide a better understanding of the ecological information carried by phytoplankton associations, we introduce an affinity analysis, the association rule, for identifying phytoplankton associations. The association rule is a machine-learning method for discovering co-occurrence relationships among activities performed by specific individuals or groups in a large database using simple statistical performance measures. The method has been applied successfully in many business settings, including finance, telecommunication, marketing, retailing, and web analysis (Chen et al., 2005). In ecological studies, we treat each sampling site or sampling date as a "transaction" in the business sense and each species as an "item," and then develop the associations. Many studies focus on spatial and temporal variation of phytoplankton in lakes and rivers; this study therefore identified phytoplankton associations from a spatial data set and a temporal data set, respectively.
The main objective of this study was to use affinity analysis to aid identification of the candidate associations of phytoplankton and then assess the relationships between the candidate phytoplankton associations and environmental factors using the redundancy analysis (RDA).
2 | METHODS
2.1 | Data preparation
2.1.1 | River phytoplankton: A spatial data set
River phytoplankton data were collected as a part of the Water Pollution Control Program in the Huaihe River Basin (30°55′-36°36′N, 111°55′-121°25′E) (HRB). The basin, with a drainage area of 270,000 km², is located between the Yangtze River and the Yellow River in China (Wang & Xia, 2010). It forms a geographical separation between northern and southern China.
Phytoplankton samples were collected from 217 randomly selected sites during the low-water period from May 1 to May 31 in 2013 (Figure 1). Detailed field and lab methods can be found in Zhu et al. (2015). At each site, three cross-section transects were established. We measured in situ environmental variables including water temperature (WT), pH, conductivity (Cond), turbidity, and total suspended solids (TSS) using a portable HACHCDC40105.
We also measured Secchi depth (SD), water depth, stream width, water velocity, and elevation. A spectrophotometer (DR5000) was used to measure total phosphorus (TP), total nitrogen (TN), and chemical oxygen demand (CODMn) according to standard methods (NEPAC, 2002).
For phytoplankton, a 1 L sample was collected 0.5 m below the surface at each of the three cross-section transects. After complete mixing, a 1 L sample was preserved with 1% Lugol's iodine solution immediately in the field and concentrated to 50 ml after sedimentation for 48 h. After complete mixing of the concentrate, 0.1 ml was counted directly in a 0.1 ml counting chamber under a microscope at 400× magnification. Phytoplankton was identified according to Hu and Wei (2006). At least 400 algal units were counted in each sample. Phytoplankton biomass was expressed as wet biomass and was estimated for individual species by assigning a geometric shape similar to the shape of each phytoplankton species (Hillebrand, 1999).
Detailed field and lab methods can be found in Zhu et al. (2013). Eight sampling stations were selected in Dishui Lake (Figure 2b). We measured in situ environmental variables including water temperature (WT), pH, conductivity (Cond), turbidity, total suspended solids (TSS), and Secchi depth (SD). Total phosphorus (TP), total nitrogen (TN), and chemical oxygen demand (CODMn) were measured according to standard methods.

| Data analysis
We selected 38 species in HRB and 23 species in Dishui Lake after excluding "rare" taxa from the analyses. Rare taxa were defined as those with an average relative biomass (RB) <0.5% that occurred at <10 sites/samples (Zhu et al., 2015).
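The exclusion rule above can be illustrated with a short sketch. It assumes a samples-by-taxa table of wet biomass; the function name `drop_rare_taxa` and its parameters are hypothetical and are not taken from the study, and relative biomass is assumed here to be computed per sample before averaging.

```python
import numpy as np

def drop_rare_taxa(biomass, min_mean_rb=0.5, min_occurrences=10):
    """Return column indices of taxa kept after excluding rare taxa.

    biomass: 2-D array (samples x taxa) of wet biomass.
    A taxon is treated as rare when its average relative biomass is below
    min_mean_rb (percent) AND it occurs in fewer than min_occurrences samples.
    """
    biomass = np.asarray(biomass, dtype=float)
    totals = biomass.sum(axis=1, keepdims=True)
    # Relative biomass per sample, in percent; empty samples stay at zero.
    rb = 100.0 * np.divide(
        biomass, totals, out=np.zeros_like(biomass), where=totals > 0
    )
    mean_rb = rb.mean(axis=0)
    occurrences = (biomass > 0).sum(axis=0)
    rare = (mean_rb < min_mean_rb) & (occurrences < min_occurrences)
    return np.flatnonzero(~rare)
```

Because both conditions must hold, a taxon with low biomass that is nevertheless widespread is retained under this reading of the rule.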

| Data format
The data set was formatted as an M × N matrix, where the rows represent the sampling sites S1, S2, S3, …, SM and the columns represent the phytoplankton species G1, G2, G3, …, GN. Each element [i,j] represents the occurrence of species j in sample i (Table 1).
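As a minimal sketch of this format, a biomass or count table can be turned into a presence/absence matrix as follows. The values below are invented for illustration and do not come from Table 1; `to_presence_absence` is a hypothetical helper name.

```python
import numpy as np

def to_presence_absence(biomass):
    """Convert a samples x species biomass (or count) table to 0/1 occurrences."""
    return (np.asarray(biomass, dtype=float) > 0).astype(int)

# Hypothetical biomass values for 3 samples (rows) and 4 species (columns).
example_biomass = [
    [0.12, 0.00, 1.50, 0.00],
    [0.00, 0.00, 0.30, 2.10],
    [0.45, 0.08, 0.00, 0.00],
]
presence = to_presence_absence(example_biomass)
```

Each row of `presence` then plays the role of one "transaction" and each column the role of one "item".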

| Association rule
The matrix usually contains a large amount of data; therefore, data mining techniques are used to extract useful knowledge. We followed the association rule approach proposed by Agrawal et al. (1993). An association rule is intended to capture a certain type of dependence among the species represented in the database. A rule is defined as an implication of the form G1->G2, meaning that where species 1 is observed, species 2 is also very likely to be observed, so that the two form an association {G1, G2}.
The significance of an association rule is measured via support and confidence. The support of rule G1->G2 is the percentage of samples in which G1 and G2 occur together. The confidence of rule G1->G2 is an estimate of the conditional probability of G2 given G1. If the confidence of rule G1->G2 is 1, then G2 occurs at every site in the data set where G1 occurs.
First, the binary phytoplankton data used for identifying phytoplankton associations were constructed (Table 1), where "S" represents the sampling site or time-series sample and "G" represents the algal species. Second, the support of each phytoplankton association was calculated. For instance, the association {G1, G3} has 18% support because species G1 and G3 occur together in 2 of the 11 sites (Table 2). Finally, we calculated the confidence of each phytoplankton association (Table 3). For example, the confidence of the association {G1, G3} is 0.5 because species 3 occurs in half of the samples that contain species 1.
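The two measures can be sketched in a few lines. The 11-site table below is hypothetical: it is constructed only so that it reproduces the quoted figures for {G1, G3} (2 of 11 sites shared, G1 present at 4 sites) and is not the study's Table 1. The function names `support` and `confidence` are illustrative.

```python
import numpy as np

def support(presence, items):
    """Fraction of samples in which all species in `items` occur together."""
    presence = np.asarray(presence, dtype=bool)
    return presence[:, list(items)].all(axis=1).mean()

def confidence(presence, antecedent, consequent):
    """Estimated P(consequent present | antecedent present)."""
    joint = support(presence, [antecedent, consequent])
    base = support(presence, [antecedent])
    return joint / base if base > 0 else float("nan")

# Hypothetical 11-site table with columns G1, G2, G3.
toy = np.zeros((11, 3), dtype=int)
toy[[0, 1, 2, 3], 0] = 1      # G1 occurs at 4 sites
toy[[0, 4, 5, 6, 7], 1] = 1   # G2 occurs at 5 sites
toy[[0, 1, 8], 2] = 1         # G3 occurs at 3 sites, 2 of them shared with G1

s13 = support(toy, [0, 2])    # 2 / 11, about 18%
c13 = confidence(toy, 0, 2)   # (2 / 11) / (4 / 11)
```

Note that confidence is directional: in this toy table the rule G3->G1 has a different confidence (2 of 3 sites) than G1->G3, even though both rules share the same support.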
We identified the phytoplankton associations based on both support ≥50% and confidence ≥0.8.
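A minimal sketch of applying both thresholds is given below. It is restricted to species pairs for brevity and is not a full implementation of the Agrawal et al. (1993) algorithm, which also handles larger item sets; the function name `pairwise_rules` and the demonstration data are hypothetical.

```python
from itertools import permutations

import numpy as np

def pairwise_rules(presence, names, min_support=0.5, min_confidence=0.8):
    """List rules A -> B between species pairs that pass both thresholds."""
    presence = np.asarray(presence, dtype=bool)
    n_samples = presence.shape[0]
    rules = []
    for a, b in permutations(range(presence.shape[1]), 2):
        n_a = presence[:, a].sum()
        if n_a == 0:
            continue  # confidence is undefined when A never occurs
        n_ab = (presence[:, a] & presence[:, b]).sum()
        sup = n_ab / n_samples
        conf = n_ab / n_a
        if sup >= min_support and conf >= min_confidence:
            rules.append((names[a], names[b], float(sup), float(conf)))
    return rules

# Hypothetical presence/absence data: 10 samples (rows), 3 species (columns).
demo = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
])
demo_rules = pairwise_rules(demo, ["G1", "G2", "G3"])
```

In this invented example only G1->G2 passes: the pair co-occurs in 6 of 10 samples and G2 is present in every sample containing G1, whereas the reverse rule G2->G1 has a confidence of 0.75 and is rejected.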

| Relationships between phytoplankton associations and environmental variables
We analyzed the phytoplankton assemblages characterized by 12 phytoplankton associations in HRB using detrended
TABLE 4 The summary of phytoplankton taxa frequency in HRB and Dishui Lake

| DISCUSSION
Compared to the traditional development of phytoplankton functional groups (Reynolds et al., 2002), affinity analysis encompasses a broad set of analytical techniques aimed at uncovering the connections among phytoplankton taxa. Affinity analysis is a method for rapidly finding phytoplankton associations in a large data set. It has the advantages of saving time and being easy to use, especially for new algal researchers in a region with limited ecological studies on local phytoplankton assemblages. Affinity analysis can therefore be used as a first step to identify candidate phytoplankton associations.
The identified phytoplankton associations reflect the ecological preferences of phytoplankton, including resource acquisition (e.g., light and nutrients) and competitive abilities (e.g., r/K selection or the C-S-R model) (Salmaso et al., 2015). Cryptomonas erosa and Cryptomonas ovata or Chroomonas acuta, from the same family, often co-occurred in HRB (Table 2). These species can benefit from both mixotrophy and phagotrophy, and can also tolerate high dissolved nutrients and limiting light conditions (Graham & Wilcox, 2000; Kruk & Segura, 2012). Scenedesmus and Selenastrum are more resistant to grazing than either Chlamydomonas or Ankistrodesmus, while the latter two taxa are better competitors in the absence of grazing (Drake et al., 1993).
Motile benthic diatoms such as Nitzschia palea co-occur with some planktonic algae. Benthic diatom motility not only offers a selective advantage on silty substrata but is also correlated with some ecological traits (Passy, 2007). Kawamura et al. (2004) demonstrated that the grazing pressure of gastropods had an influence on Nitzschia species.
We performed an RDA to assess the applicability of the identified phytoplankton associations in environmental assessment.
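For readers unfamiliar with the technique, RDA can be sketched as a multivariate linear regression of the response table on the environmental table, followed by a principal component analysis of the fitted values. The code below is only an illustration of that idea and is not the software or settings used in this study: the names `rda`, `Y` (samples × associations), and `X` (samples × environmental variables) are assumptions, and standardization, score scaling conventions, and permutation tests are omitted.

```python
import numpy as np

def rda(Y, X):
    """Redundancy analysis as regression of Y on X, then PCA of the fit.

    Returns the proportion of variance in Y explained by X, the constrained
    eigenvalues (unnormalized), site scores, and response (species) scores.
    """
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    Yc = Y - Y.mean(axis=0)  # center responses
    Xc = X - X.mean(axis=0)  # center predictors
    # Least-squares coefficients of the multivariate regression Yc ~ Xc.
    B, *_ = np.linalg.lstsq(Xc, Yc, rcond=None)
    Y_fit = Xc @ B
    # PCA of the fitted values via singular value decomposition.
    U, s, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    explained = (Y_fit ** 2).sum() / (Yc ** 2).sum()
    site_scores = U * s
    species_scores = Vt.T
    return explained, s ** 2, site_scores, species_scores
```

The returned `explained` value corresponds to the share of variance attributed to the environmental variables, the kind of quantity reported when comparing how much variance is explained in the river and the lake data sets.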
In HRB, light and TN were the best predictors of phytoplankton associations (Figure 3). Our results are consistent with those of Mackay et al. (2012) in that the diatom association was strongly related to TN.
In Dishui Lake, light and salinity were the best predictors of phytoplankton associations (Figure 4). Chrysophytes are restricted to cold, oligotrophic conditions. Small Chromulina groups showed a different response to pH and water clarity compared to the medium-sized Chlamydomonas groups (Figure 4). The importance of pH as a primary factor affecting chrysophytes has been reported in studies from widely separated geographic regions. Chromulina and Chlamydomonas are both r-selected taxa; their small to medium body size and the motility conferred by flagella are advantages that allow them to reduce their sinking rate (Kruk et al., 2010). Compared to Chlamydomonas, Chromulina prefer oligotrophic environments with an abundance of macrophytes. Compared to the river, more variance (33%) in phytoplankton associations can be explained by different combinations of environmental factors in the man-made shallow lake. The environment of a lake is perceived to be relatively stable, whereas that of a river is characteristically graded from the origin to the river mouth (Reynolds et al., 1994). Therefore, candidate phytoplankton associations are reasonable proxies for explaining environmental variables.

FIGURE 3 The RDA plot of 12 phytoplankton associations and environmental variables in HRB (DO: dissolved oxygen; TN: total nitrogen; NP: TN/TP ratio; Cond: conductivity; phytoplankton association codes are in Table 5)
Binary data were used to construct the phytoplankton associations in this work, which ignores the abundance of phytoplankton species. Although binary data are commonly observed and analyzed in many application fields (Yamamoto & Hayashi, 2015), some species that were not abundant potentially contributed much more to the analysis than the common taxa, although we minimized these effects. The phytoplankton associations identified by affinity analysis should be viewed as candidate associations, and each association should be carefully evaluated using ecological theories and concepts.
In essence, affinity analysis can be a useful method for find-

CONFLICT OF INTEREST
The authors declare that they have no conflicts of interest regarding this work. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available online:

FIGURE 4 The RDA plot of 15 phytoplankton associations and environmental variables in Dishui Lake (Sal: salinity; phytoplankton association codes are in Table 5)