Modelling global insect pest species assemblages to determine risk of invasion


Dr S. P. Worner, National Centre for Advanced Bio-Protection Technologies, Ecology and Entomology Group, Lincoln University, PO Box 84, Canterbury, New Zealand (fax +64 3325 3844).


  1. The many thousands of potential invasive species pose one of the greatest threats to biodiversity world-wide. In this study we propose that assemblages of well-known global invasive pest species, irrespective of whether they arise by anthropogenic means, are non-random species groupings that contain hidden predictive information. Such information can assist the identification and prioritization of species that have the potential to pose an invasive threat in regions where they are not normally found.
  2. Data comprising the presence and absence of 844 insect pest species recorded over 459 geographical regions world-wide were analysed using a self-organizing map (SOM), a well-known artificial neural network algorithm. The SOM analysis classified the high dimensional data into two-dimensional space such that geographical areas that had similar pest species assemblages were organized as neighbours on a map or grid.
  3. The SOM analysis allowed each species to be ranked in terms of its risk of invasion in each area based on the strength of its association with the assemblage that was characteristic for each geographical region. A risk map for example species was produced to illustrate how such a map can be compared with the species’ actual distribution and used with other information, such as the species’ biotic characteristics and interactions with the abiotic environment, to improve pest risk assessments further.
  4. Synthesis and applications. This study presents a new approach to the identification of potentially high-risk invasive pest species based on the hypothesis that global insect pest assemblages are non-random species groupings that can be subjected to traditional community analysis. A well-known data mining and knowledge discovery method for high dimensional data, SOM, was used to determine pest species assemblages for global regions. Species were ranked according to their potential for establishment based on their strength of association with the species assemblage that characterizes a particular region. Such an analysis can then be used to support additional risk assessment of potential invasive species, giving invasive species researchers, conservation managers, and quarantine and biosecurity scientists a means of prioritizing species as candidates for further research.

Introduction

Species assemblages are groupings of species that co-occur in the same place and at the same time. Such assemblages are not instantly created but come into being through the progressive invasion of species such that the assemblage is built up sequentially from a simple starting point (Begon, Harper & Townsend 1996). In this study we investigated the largely anthropogenic global pest assemblages of large-scale geographical regions to determine whether such species groupings can provide predictive knowledge and insight into exotic pest species invasions. We used the term assemblage in a broad sense, in line with authors such as Keddy (1992) and Diaz, Cabido & Casanoves (2001), who apply it to any regional species pool that has come about as a result of ‘a filter of any kind’.

Jones & Kitching (1981) define a pest as an organism that damages crops, destroys products, transmits or causes disease, is annoying or in other ways conflicts with human needs or interests. International concerns about preserving biodiversity further extend the definition of a pest to a species that can either cause native species decline or alter the structure and function of natural ecosystems (Worner 2002).

Exotic animals can be introduced into new areas either accidentally or on purpose. While many have little impact on the lives of humans or on local native flora and fauna, some, as a consequence of the absence of natural enemies, undergo an explosive increase in numbers and rapid spread. Such species soon come to occupy large areas, inflicting dramatic damage on crops, stock and native ecosystems that can cause considerable economic and environmental harm (Elton 1958; Sailer 1983; Pimentel 1986; Simberloff 1986, 1989; Worner 1994). More recently, pest invasions have been recognized as an important cause of loss of biodiversity. One of the best ways to reduce the likelihood of exotic species invasions is to prevent their establishment, but this is dependent on identifying potential invasive species in advance of their establishment.

Clearly, for any species given the opportunity to invade a new area, a source of food and a place to live and reproduce are fundamental to establishment success. But, to evaluate fully the invasive potential of a species, the conditions required for invasion, the characteristics of the invasive species and the ecosystems that are susceptible to invasion, need to be examined (Worner 2002).

Successful establishment of a species arriving in a new environment depends on both biotic and abiotic factors, including climate and the specific environmental conditions of the habitat in the area of invasion (Mooney & Drake 1989). The species assemblages investigated in this study were unusual because they comprise groupings of phytophagous pest species that occur largely as a result of human actions. Our hypothesis was that geographical areas with similar pest assemblages share similar biotic and abiotic conditions that allow particular pest species to invade the area. Biotic conditions can include the particular assemblage of crops and garden plants growing in a region, which in turn reflect abiotic, mainly climatic, conditions that affect species distribution and abundance. In addition, abiotic factors can include many anthropogenic factors, such as a similar history of invasion or pathways of entry, the amount and type of trade, the amount of protected cultivation, the effectiveness of quarantine procedures and the resources available to find and report pest species in any country. We hypothesized that the particular combination of pest species in a region integrates these complex factors and their interactions. In other words, pest species assemblages are non-random species groupings that contain hidden predictive information that can be analysed using ecological community analysis techniques.

Although species invasions have been studied for decades, little progress has been made, and advances towards arresting biological invasions are unlikely unless the limitations of standard methods are addressed (Hulme 2003). Furthermore, for pest insects in particular, research has usually focused on individual species. More recently, Levine & D’Antonio (2003) have used species accumulation models to estimate the percentage increase in species invasion with increasing international trade. Such curves are based on the concept that there is not an infinite pool of invasive species from which to sample. To our knowledge, there has been no investigation of potential species invasion taking into account pest assemblages. This approach is particularly relevant as a basis for further pest risk assessment studies for conservation of biodiversity and the development of national quarantine and biosecurity strategies.

Traditional analysis tools for measuring and understanding species communities and relating community patterns to changes in spatial and temporal ecosystem conditions include the simple metrics of species richness, diversity and similarity. The description of complex species assemblage structures using a single attribute has been criticized because valuable information may be lost (Begon, Harper & Townsend 1996). Recent research in the study of communities and complex systems has shown that the newer analytical methods of non-linear artificial intelligence (Park et al. 2003a,b) can characterize species assemblages as components of ecological communities extremely well. Examples of such studies are the analysis of the Trichoptera assemblage in Danish streams (Wiberg-Larsen et al. 2000), the assessment of the Luxembourg river water quality using diatom assemblages (Gevrey et al. 2004), the investigation of macroinvertebrate assemblages in Brazil (Buss et al. 2004) and the prediction and spatial mapping of New Zealand freshwater fish and decapod assemblages (Joy & Death 2004).

A self-organizing map (SOM), which is an artificial neural network (ANN) model (Kohonen 1982), was used to identify pest species assemblages and potential invasive insects based on a comprehensive database of the global presence or absence of pests (CABI 2003). The SOM is an efficient method for analysing systems ruled by complex non-linear relationships and provides an alternative to traditional statistical methods for classifying complex data (Lek et al. 1996; Lek & Guegan 2000; Park et al. 2003a). More generally, SOM are used widely for knowledge discovery, pattern recognition, clustering and visualization of large multidimensional data sets. Potentially useful but hidden information can be revealed and further examined using more traditional techniques. Successful results in aquatic community ecology using such models have been well documented (Chon et al. 1996; Cereghino, Giraudel & Compin 2001; Giraudel & Lek 2001; Park et al. 2003b). This study highlights how these novel tools could improve risk assessments of potential insect invaders.

The objectives of this study were to (i) classify global geographical areas based on assemblages of exotic insect pest species associated with each area or cluster, and (ii) identify and quantify the potential risk of invasion of these insect species, based on their strength of association with other species within the geographical clusters. New Zealand was used as a particular example.

Materials and methods


The data used in this analysis were extracted with permission from the Crop Protection Compendium (CABI 2003). This database encompasses a wide range of different types of information on all aspects of crop protection (e.g. pests, diseases, weeds, natural enemies and crops) associated with most areas in the world. The geographical areas represented in the compendium include countries, regions and states of countries such that all continents are represented (with the exception of the Arctic and Antarctic regions). The full compendium includes many species for which only partial information is available. To ensure that we only included species with adequate distributional information, we selected those that occur in more than 2% of geographical areas (Waite 2000). Of those species, only those for which information was confirmed by experts were used. This comprised 844 mainly phytophagous insect pests for 459 geographical areas (Table 1). The presence (1) and the absence (0) of each species in each geographical area resulted in a database comprising a [844 × 459] matrix.
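As an illustration only, the filtering and coding step described above can be sketched in Python. The input format (a list of species and area presence records), the function name `build_matrix` and the toy records are assumptions made for this sketch; they are not taken from the compendium, and the expert-confirmation step is not modelled.

```python
def build_matrix(records, min_fraction=0.02):
    """Build a species-by-area presence (1) / absence (0) matrix.

    records: list of (species, area) presence records (hypothetical format).
    Species found in no more than min_fraction of the areas are dropped.
    """
    areas = sorted({area for _, area in records})
    by_species = {}
    for species, area in records:
        by_species.setdefault(species, set()).add(area)
    # Keep species recorded in more than min_fraction of all areas.
    threshold = min_fraction * len(areas)
    kept = sorted(s for s, found in by_species.items() if len(found) > threshold)
    # One row per species, one column per area.
    matrix = [[1 if a in by_species[s] else 0 for a in areas] for s in kept]
    return kept, areas, matrix


# Toy records; a 50% threshold is used only so the filter is visible.
records = [
    ("Myzus persicae", "New Zealand"),
    ("Myzus persicae", "France"),
    ("Myzus persicae", "Japan"),
    ("Thrips palmi", "Japan"),
]
species, areas, matrix = build_matrix(records, min_fraction=0.5)
print(species, areas, matrix)
```

In this sketch the set of areas is inferred from the records themselves, so an area with no recorded pest would have to be added separately.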

Table 1.  Summary of the pest species represented in the database
Order           Family number   Species number
Diptera         12              83
Thysanoptera     2              39
Orthoptera       2              16
Hymenoptera      7              11
Isoptera         2               3
Psocoptera       1               3
Collembola       1               1

SOM model

A conventional cluster analysis comprising single and complete linkage clustering using the Jaccard coefficient and Euclidean distance as similarity metrics (Krebs 2001) failed to organize the data matrix. The result was a series of long drawn-out clusters with a large number of branches that we were unable to interpret. A SOM, however, is able to analyse such high dimensional data, performing a non-linear projection of the multidimensional data space onto two-dimensional space. The SOM neural network consists of two layers of elements or neurons: the input layer and the output layer. The output layer is represented by a map or a rectangular grid with l by m neurons (or cells), laid out in a hexagonal lattice.

It is possible to use two different learning algorithms in a SOM. An incremental algorithm is commonly used but learning is highly dependent on the order of input; a batch algorithm overcomes this fault and was used here. It is significantly faster and does not require specification of any learning rate factor (Kohonen & Somervuo 1998). The batch SOM algorithm can be summarized as follows. (i) Initialize the values of the virtual vectors (VVi, 1 ≤ i ≤ c) using random values. (ii) Repeat steps (iii) to (vi) until convergence. (iii) Read all the sample vectors (SV) one at a time. (iv) Compute the Euclidean distance between SV and VV. (v) Assign each SV to the nearest VV according to the distance results. (vi) Modify each VV with the mean of the SV that were assigned to it. Details of the SOM algorithm and its theoretical basis can be found in Kohonen (1995), Kohonen & Somervuo (1998) and Kohonen (2001).
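The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SOM Toolbox implementation used in the study: the grid is rectangular rather than hexagonal, a fixed number of iterations stands in for a convergence test, and step (vi) uses a Gaussian neighbourhood with a linearly shrinking radius, which is the usual form of the batch map but is not spelled out in the summary. The names `batch_som` and `nearest` are hypothetical.

```python
import math
import random


def nearest(vv, sv):
    """Index of the virtual vector closest (Euclidean distance) to sv."""
    return min(range(len(vv)), key=lambda j: math.dist(vv[j], sv))


def batch_som(samples, rows, cols, n_iter=20, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    # (i) Initialise the virtual vectors with random values in [0, 1).
    vv = [[rng.random() for _ in range(dim)] for _ in cells]
    start = max(rows, cols) / 2.0
    # (ii) A fixed number of passes replaces "until convergence" (assumption).
    for it in range(n_iter):
        # Assumed schedule: radius shrinks linearly from `start` to 0.5.
        radius = start + (0.5 - start) * it / max(1, n_iter - 1)
        # (iii)-(v) Assign every sample vector to its nearest virtual vector.
        bmu = [nearest(vv, sv) for sv in samples]
        # (vi) Replace each virtual vector by a neighbourhood-weighted mean
        # of the sample vectors, weighted by grid distance to their cells.
        new_vv = []
        for j, (rj, cj) in enumerate(cells):
            num = [0.0] * dim
            den = 0.0
            for sv, b in zip(samples, bmu):
                rb, cb = cells[b]
                d2 = (rj - rb) ** 2 + (cj - cb) ** 2
                h = math.exp(-d2 / (2.0 * radius ** 2))
                den += h
                for k in range(dim):
                    num[k] += h * sv[k]
            new_vv.append([x / den for x in num] if den > 0 else vv[j])
        vv = new_vv
    return vv


# Four toy sites described by the presence-absence of four species.
samples = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
virtual = batch_som(samples, rows=2, cols=2)
print(virtual)
```

Because each virtual vector is a weighted mean of 0/1 sample vectors, its components stay within [0, 1], which is what allows them to be read as strengths of association.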

In our study the input layer comprised 844 neurons (one for each pest species) connected to the 459 sites (geographical areas) such that the representation of the presence–absence of 844 species for each site formed 459 sample vectors. The output layer comprised 108 neurons organized in an array with 12 rows and nine columns. The number of neurons or cells in the output layer was first defined using the formula c = 5√n proposed by the Laboratory of Computer and Information Science (CIS), Helsinki University of Technology (Espoo, Finland), where c is the number of cells and n is the number of training samples (sample vectors). Each cell of the output layer is linked to the input neurons (i.e. 844) by connections that have weights associated with them, forming a virtual vector. During the learning process, a virtual pest species assemblage is then computed for each cell (Fig. 1). While initially the number of output neurons was decided using the formula cited previously, several maps were created using different sizes. Classification results were very similar. We used the topographic and quantization errors to determine final map size (Kohonen 2001). For the final map size (108 neurons or cells), the errors were, respectively, 0·022 and 0·5832. A smaller number of output neurons increased both error values; a larger number of neurons decreased the errors slightly but the limitation of computer memory was quickly reached. In addition, a larger map size made interpretation difficult.
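The map-size choice and the two quality measures can be illustrated with a short Python sketch. It assumes the heuristic c = 5√n given in the SOM Toolbox documentation and the usual definitions of the two errors: quantization error as the mean Euclidean distance between each sample vector and its nearest virtual vector, and topographic error as the proportion of sample vectors whose two nearest virtual vectors are not adjacent cells. Adjacency is taken on a rectangular grid here, whereas the study used a hexagonal lattice. All function names are hypothetical.

```python
import math


def suggested_cells(n_samples):
    """Assumed map-size heuristic: c = 5 * sqrt(n)."""
    return 5.0 * math.sqrt(n_samples)


def quantization_error(samples, vv):
    """Mean distance from each sample vector to its nearest virtual vector."""
    return sum(min(math.dist(v, sv) for v in vv) for sv in samples) / len(samples)


def topographic_error(samples, vv, cells):
    """Share of samples whose best and second-best cells are not neighbours."""
    errors = 0
    for sv in samples:
        order = sorted(range(len(vv)), key=lambda j: math.dist(vv[j], sv))
        (r1, c1), (r2, c2) = cells[order[0]], cells[order[1]]
        # Rectangular 4-neighbour adjacency (assumption for this sketch).
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            errors += 1
    return errors / len(samples)


# 459 training samples suggest about 107 cells; a 12 x 9 grid gives 108.
print(round(suggested_cells(459), 1), 12 * 9)
```

Comparing both errors across several candidate grid sizes mirrors the selection procedure described in the text.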

Figure 1.

Self-organizing map architecture. The input layer is linked to the cells of the output layer by connections called weights that define the virtual assemblages of the species.

The aim of the SOM algorithm is to assist organization and visualization of data by arranging the distribution of the sample sites onto a two-dimensional space represented by the map cells. With 459 sites a traditional approach would require 0·5 × 459(459 − 1) = 105 111 similarity indices to be sorted, which is clearly unmanageable. The virtual vectors that are neighbours on the map represent neighbouring clusters of sample sites. Various distance measures can be used to organize and project similar sample vectors onto the map. In this study we selected the Euclidean distance, advised by Kohonen & Somervuo (1998). After the SOM had finished learning, each cell of the output layer had a virtual vector that could be interpreted as a virtual pest species assemblage. These vectors were composed of values over the interval [0,1], each of which could be interpreted as a risk index that indicates a species’ potential to be present or to be associated with the sites within each cell. All the sites that are associated with the same cell have a similar pest species assemblage composition both in regard to species presence as well as absence at each site. Therefore, a species that is present in one site but not in another in the same cell can be considered to pose a high risk of invasion to that site. Using a grey colour gradient, the weights associated with each virtual site vector can be used to display the strength of association of each species with the assemblage in each cell. The darker the cell, the higher the risk the species might invade the sites included in the cell.
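A short Python sketch may make the ranking step concrete. It checks the pairwise count quoted above and ranks the species absent from a site by the weights of the virtual vector of the cell containing that site. The function name `rank_absent_species` is hypothetical; the five example weights and presence codes are the New Zealand values quoted in this paper.

```python
def rank_absent_species(site_vector, virtual_vector, names):
    """Rank species absent from a site (0) by the weight of the site's cell."""
    absent = [
        (weight, name)
        for name, present, weight in zip(names, site_vector, virtual_vector)
        if present == 0
    ]
    return [(name, weight) for weight, name in sorted(absent, reverse=True)]


# Number of pairwise similarity indices among 459 sites: n(n - 1) / 2.
n_pairs = 459 * (459 - 1) // 2

names = ["Planococcus citri", "Icerya purchasi", "Ceratitis capitata",
         "Lymantria dispar", "Thrips palmi"]
nz_presence = [0, 1, 0, 0, 0]                # 1 = present, 0 = absent
cell_weights = [0.93, 0.92, 0.73, 0.31, 0.06]  # risk indices for New Zealand

ranking = rank_absent_species(nz_presence, cell_weights, names)
print(n_pairs, ranking)
```

Species already present are simply filtered out, so the output is the list of candidate invaders in descending order of association with the cell's assemblage.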

The SOM simulator used in this study was programmed using the Matlab programming language (Mathworks 2001) and SOM toolbox (version 2.0 beta, compatible with Matlab 6.5) developed by the Laboratory of Computer and Information Science, Helsinki University of Technology (accessed 21/6/06). The geographical areas were obtained using ArcView 3·2 (ESRI Corporation, Redlands, CA).

Cluster analysis

Sites that are neighbours on the grid are expected to be more similar to each other, whereas sites distant from each other (according to their pest species assemblage) are expected to be distant in the feature space. To detect the cluster boundaries on the map, a cluster analysis is applied to the SOM model output (Park et al. 2003a,b). Some authors also use the unified-matrix (U-matrix) approach (Ultsch & Siemon 1990), where the U-matrix displays the distances between the virtual sites and provides a landscape formed by light plains separated by dark ravines. However, this method does not give crisp boundaries to the clusters. In this study a hierarchical cluster analysis with a Ward linkage method was applied to the SOM results to identify the edges of each cluster of sample vectors (Park et al. 2003a,b). The Davies–Bouldin index (DBI; Davies & Bouldin 1979) was then calculated to justify the choice of the number of clusters. To check the validity of non-intuitive groupings of geographical sites, the Jaccard similarity index and percentage similarity (Krebs 2001) calculated between selected sites were examined.
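Two of the indices named above can be written out in Python as a sketch. The Jaccard index is computed for two presence–absence vectors, and the Davies–Bouldin index follows its standard definition: the mean, over clusters, of the largest ratio of summed within-cluster scatter to between-centroid distance. The Ward linkage step itself is not reproduced here, the function names are hypothetical and the toy clusters are invented for illustration.

```python
import math


def jaccard(x, y):
    """Shared presences divided by species present at either site."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (lists of vectors)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in range(k) if j != i)
    return total / k


# Two compact, well-separated toy clusters give a low index.
clusters = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
print(jaccard([1, 1, 0, 0], [1, 0, 1, 0]), davies_bouldin(clusters))
```

Lower values of the Davies–Bouldin index indicate more compact and better separated clusters, which is why it can be used to compare candidate numbers of clusters.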

Results

The non-linear projection of presence–absence data onto two-dimensional space allowed us to classify global geographical areas according to the similarity of their pest species assemblages (see Figure S1 and Appendix S1 in the supplementary material). A hierarchical cluster analysis was applied to the SOM results (Fig. 2a). According to the DBI (0·981), the optimum number of clusters was six (Ia, Ib1, Ib2, IIa, IIb1 and IIb2), as shown in Fig. 2b.

Figure 2.

(a) Dendrogram of the cluster analysis; (b) self-organizing map with the clusters defined by the cluster analysis: Ia, Ib1, Ib2, IIa, IIb1 and IIb2 (see text).

Cluster Ia represented geographical areas in northern latitudes that have low numbers of pest species recorded in the database. Most of the locations (85%) in this cluster had only 2% of the (844) species present and were mainly small islands, desert areas and colder areas of Greenland, Alaska and parts of Russia. The USA and most of Canada comprised cluster Ib1. Cluster Ib2 included New Zealand, several regions of Australia (South Australia, Tasmania, Victoria, Western Australia), a large part of Europe, Chile and some Mediterranean areas such as Sardinia, Sicily, Cyprus, Algeria, Lebanon, Libya, Morocco and Tunisia. Cluster IIa was a specific cluster that included larger geographical areas, for example the whole of the USA and Australia. This cluster was an artefact of the original database construction and could be ignored. Cluster IIb1 included the countries of South and Central America; cluster IIb2 comprised many African and Asian countries. These clusters were plotted on a map of the world using different grey scales and effects to visualize each cluster (Fig. 3).

Figure 3.

World map showing the 459 geographical areas represented by different shades of grey and patterns according to the cluster to which they belong.

For each cell of the SOM map, and therefore each geographical area, it was possible to define a virtual species assemblage. The strength of the association of each of the 844 species with each assemblage could be visualized on the SOM map, creating one map per species. The virtual assemblage inside each cell could be considered an index of the risk of invasion of each species in the countries associated with each cell. For New Zealand the risk values of the most invasive species were calculated (Table 2) and their presence or absence in New Zealand recorded. Risk and distribution maps of two high profile species, the Mediterranean fruit fly Ceratitis capitata (Wiedemann) (Fig. 4) and the gypsy moth Lymantria dispar L. (Fig. 5), are shown. Both species are absent from New Zealand and, while this country is currently more threatened by the Asian form of the gypsy moth, the European form is also a potential threat. The distribution of risk for C. capitata indicated that this species had a high risk of invasion (risk index 0·73), whereas the gypsy moth L. dispar had a lower risk index of 0·31, indicating that the risk of this species establishing in New Zealand is in the medium range. Clearly, however, this information must be interpreted within the context of a full risk assessment for each species.

Table 2.  List of pest species that have the highest potential risk of invasion in New Zealand. For each species the risk index and its presence (P, coded 1) or absence (A, coded 0) in New Zealand are given
Name                          Risk index  P or A  Name                          Risk index  P or A
Planococcus citri             0·93        0       Toxoptera aurantii            0·49        1
Icerya purchasi               0·92        1       Taylorilygus pallidulus       0·49        0
Myzus persicae                0·87        1       Aleurothrixus floccosus       0·48        0
Cydia pomonella               0·86        1       Pseudaulacaspis pentagona     0·48        0
Nezara viridula               0·85        1       Pieris rapae                  0·47        1
Brevicoryne brassicae         0·83        1       Hadula trifolii               0·47        0
Delia platura                 0·80        1       Ephestia elutella             0·47        1
Phthorimaea operculella       0·79        1       Rhopalosiphum rufiabdominale  0·46        1
Pseudococcus longispinus      0·79        1       Liriomyza trifolii            0·46        0
Aphis spiraecola              0·77        1       Sitona discoideus             0·46        1
Saissetia oleae               0·77        1       Spodoptera exigua             0·46        0
Coccus hesperidum             0·77        1       Sitobion avenae               0·45        0
Aonidiella aurantii           0·76        1       Therioaphis trifolii          0·45        1
Eriosoma lanigerum            0·76        1       Locusta migratoria            0·45        1
Aphis gossypii                0·76        1       Prays citri                   0·43        0
Viteus vitifoliae             0·75        1       Hippotion celerio             0·43        1
Ceratitis capitata            0·73        0       Pantomorus cervinus           0·43        1
Agrotis ipsilon               0·73        1       Schizaphis graminum           0·42        0
Bemisia tabaci                0·70        1       Oulema melanopus              0·42        0
Helicoverpa armigera          0·70        1       Scolytus rugulosus            0·42        0
Acyrthosiphon pisum           0·70        1       Drosophila melanogaster       0·42        0
Thrips tabaci                 0·69        1       Sitona lineatus               0·42        1
Saissetia coffeae             0·68        1       Mythimna unipuncta            0·41        0
Rhopalosiphum maidis          0·68        1       Pectinophora gossypiella      0·41        0
Plutella xylostella           0·68        1       Hellula undalis               0·41        1
Chrysomphalus dictyospermi    0·67        0       Peridroma saucia              0·41        0
Aspidiotus nerii              0·67        1       Parlatoria ziziphi            0·41        0
Frankliniella occidentalis    0·61        1       Gonipterus gibberus           0·40        1
Rhopalosiphum padi            0·61        1       Acanthoscelides obtectus      0·40        1
Hyperomyzus lactucae          0·61        1       Ceroplastes floridensis       0·40        1
Agrius convolvuli             0·60        1       Parasaissetia nigra           0·40        1
Diaspidiotus perniciosus      0·60        1       Lixus juncii                  0·40        0
Aphis fabae                   0·60        0       Sminthurus viridis            0·40        1
Phoracantha semipunctata      0·59        1       Diaspidiotus ostreaeformis    0·39        1
Heliothrips haemorrhoidalis   0·59        1       Henosepilachna elaterii       0·39        0
Macrosiphum euphorbiae        0·59        1       Lepidosaphes ulmi             0·39        0
Phyllocnistis citrella        0·58        0       Scrobipalpa ocellatella       0·38        0
Ceroplastes rusci             0·57        0       Siphoninus phillyreae         0·38        1
Chrysomphalus aonidum         0·57        0       Antigastra catalaunalis       0·38        0
Parthenolecanium persicae     0·56        1       Unaspis citri                 0·38        0
Trichoplusia ni               0·55        0       Mythimna loreyi               0·37        0
Cadra cautella                0·54        0       Thrips simplex                0·37        1
Lepidosaphes beckii           0·54        0       Bactrocera oleae              0·37        0
Aphis craccivora              0·54        1       Lipaphis erysimi              0·37        1
Lampides boeticus             0·54        1       Spodoptera littoralis         0·36        0
Agrotis segetum               0·54        0       Orthezia insignis             0·36        0
Sitophilus zeamais            0·53        0       Prays oleae                   0·36        0
Pieris brassicae              0·53        0       Listroderes costirostris      0·36        1
Hemiberlesia lataniae         0·52        1       Liriomyza huidobrensis        0·35        0
Toxoptera citricida           0·52        1       Cryptoblabes gnidiella        0·35        1
Parthenolecanium corni        0·50        1       Sesamia cretica               0·35        0
Grapholita molesta            0·50        1       Acronicta rumicis             0·34        0
Metopolophium dirhodum        0·49        1       Dialeurodes citri             0·34        0
Hemiberlesia rapax            0·49        1       Gynaikothrips ficorum         0·34        0
Figure 4.

Risk distribution of the Mediterranean fruit fly Ceratitis capitata, which is not present in New Zealand, (a) represented on the SOM map (the white * shows the cells where New Zealand is located) and (b) represented on a world map. (c) The actual distribution (presence and absence) of the species on a world map.

Figure 5.

Risk distribution of the gypsy moth Lymantria dispar, which is not present in New Zealand, (a) represented on the SOM map (the white * shows the cells where New Zealand is) and (b) represented on the world map. (c) The actual distribution (presence and absence) of the species on a world map.

Discussion

Invasive species research usually focuses on individual species to determine appropriate management strategies in response to a significant threat to a region or country. Currently, there is no objective scientific approach to prioritize and identify species that should be subject to more detailed risk assessments. In our study we used the information available on the geographical distribution of a wide range of global insect pest species to investigate whether the analysis of entire pest species assemblages can help to rank the many potential invasive insect species that threaten New Zealand. This approach can provide similar information for many countries and geographical regions.

The analysis of complex ecological data requires more advanced tools than those currently available if valuable information is not to be lost (Begon, Harper & Townsend 1996). Simple metrics are often used because of the paucity of methods that can handle large amounts of data (Giske, Huse & Fiksen 1998). More complex modelling methods, such as canonical correspondence analysis (ter Braak 1986) and ANN, can greatly assist interpretation by retaining more information. ANN are particularly tolerant of noisy data (Hepner et al. 1990) and are better able to handle outliers than more traditional statistical analysis methods (Lippman 1987). Furthermore, they can model non-linear data and represent even more complex relationships between variables (Rumelhart, Hinton & Williams 1986). These are clearly advantages appropriate for this study.

SOM are particularly good at detecting outliers that can be confined to part of the map without affecting the other parts (Cereghino, Giraudel & Compin 2001). In this study, areas with very low numbers of species were grouped into cluster Ia and large geographical areas that essentially repeat the information contained over smaller geographical scales (a peculiarity of the compendium database design) were grouped into cluster IIa. This problem of scale, where larger geographical areas should not normally be compared with smaller geographical areas, could be interpreted as a limitation of the SOM analysis. We emphasize, however, that the results of an analysis such as that carried out here should not be interpreted in isolation and do not constitute a full risk assessment for any species. The utility of the analysis depends entirely on the questions asked of the data. For example, the risk analyst might only be interested in the combinations of species that occur in different regions.

A large number of factors can influence phytophagous insect pest species distribution and therefore pest species assemblages at any geographical location (Baker et al. 2005). The presence of a host-plant and a suitable climate are crucial, but other significant factors can include trade routes and plant or produce importations that either historically and/or currently provide invasion pathways, as well as agricultural and quarantine practices associated with the region. We hypothesize that the interplay of all these factors will result in a characteristic mix of exotic species in any given region. Much research has been focused on the climatic suitability of a location as an indicator of the potential for establishment or invasion of an exotic species. If we examine the countries that share the same cell as New Zealand, it is not immediately apparent why New Zealand's pest assemblage is also in the same cluster as a number of Mediterranean countries. However, similarity indices showed that New Zealand shares a significant percentage of its pest species with many countries in this region, for example Italy and France (59% and 58%, respectively), Turkey and Morocco (46% and 48%, respectively). Moreover, if the history of New Zealand's agriculture is examined, the vast majority of that country's cropping and pasture species originated from the Mediterranean region. In addition, the region possesses many transitional climates that are similar to parts of New Zealand. Finally, many garden plants have been imported into New Zealand from Mediterranean regions (A. Stewart, personal communication).

Our method of analysis could provide biodiversity managers and quarantine authorities of any country with a list of species that are ranked according to the risk they pose. This could help to prioritize more general risk assessment efforts. For example, in 2002 the melon thrips Thrips palmi Karny attracted attention as a possible invasive species to New Zealand and significant resources were invested in assessing the risk of establishment (Dentener, Whiting & Connolly 2002). However, in the analysis reported here this species is not strongly associated with the New Zealand assemblage (risk index 0·06) and, to date, it has not established in New Zealand. Furthermore, our analysis predicted New Zealand's most recent invasive insect pest, Chrysomphalus aonidum. This species is among the 12 most highly ranked non-established species.

By grouping countries with similar pest assemblages, the SOM can indicate countries where a new invasion should alert local biosecurity authorities. For example, because of New Zealand's proximity to south Australia, with which it shares a similar pest assemblage, any new invasion in that region should put New Zealand biosecurity/quarantine authorities on high alert.

Several species with comparatively low ranks in the risk index are established in New Zealand (Table 2); such species are not strongly associated with New Zealand's exotic pest assemblage. Some endemic species that have become pests in their native country and have not yet spread widely will be given a low rank. In such situations, it may be more appropriate to study non-native invasive species on their own. Furthermore, a species may have a global distribution but be limited to specific environmental conditions and therefore not strongly associated with any pest assemblage, or it may be a newly emergent pest. At the other extreme, Planococcus citri (Risso) has the highest risk of establishing in New Zealand according to our results but is considered absent. In fact, this species was recorded as established in the 1980s but has not been recorded since (R. Henderson, personal communication). Such examples illustrate that this type of approach to prioritizing invasive species will have weaknesses under certain circumstances and that such an analysis is best used to support expert knowledge.

A limitation of the SOM approach to data classification is that every SOM is different and will find slightly different similarities among the sample vectors each time the initial conditions are changed. Recent developments involve bootstrapping the data used in this analysis at least 1000 times to test internal cluster quality and robustness (S. P. Worner & M. Watts, unpublished data). Each time the data are resampled and a new SOM model created, the change in rank of each species is noted to determine the degree of confidence in the SOM ranks. Not only will this indicate the sensitivity of the method to the presence or absence of particular species, but it will help to indicate those species for which data are insufficient or poor. Preliminary investigations show that the initial maps are very stable and the high ranked non-established species (the species in which we are most interested) show little change in their average rank.
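The resampling idea can be sketched in Python as follows. This is only an outline of the general procedure described above, not the unpublished analysis itself: sites are resampled with replacement, a scoring function is applied to each resample, and the rank of every species is recorded. The function names are hypothetical, and `prevalence` is a deliberately simple stand-in for refitting a SOM and reading off the risk indices.

```python
import random


def bootstrap_rank_spread(matrix, score_fn, n_boot=1000, seed=0):
    """Resample sites with replacement and track each species' rank.

    matrix: one row per site, one 0/1 entry per species.
    score_fn: maps a list of site rows to one score per species
              (a stand-in for refitting the SOM; assumption).
    Returns (mean rank, best rank, worst rank) for every species.
    """
    rng = random.Random(seed)
    n_species = len(matrix[0])
    ranks = [[] for _ in range(n_species)]
    for _ in range(n_boot):
        resampled = [rng.choice(matrix) for _ in matrix]
        scores = score_fn(resampled)
        # Rank 1 goes to the species with the highest score.
        order = sorted(range(n_species), key=lambda s: -scores[s])
        for rank, s in enumerate(order, start=1):
            ranks[s].append(rank)
    return [(sum(r) / len(r), min(r), max(r)) for r in ranks]


def prevalence(sites):
    """Stand-in scorer: fraction of resampled sites where a species occurs."""
    return [sum(col) / len(sites) for col in zip(*sites)]


# Three toy sites and three species (columns).
sites = [[1, 0, 1], [1, 0, 0], [1, 1, 0]]
print(bootstrap_rank_spread(sites, prevalence, n_boot=200))
```

A narrow spread between the best and worst rank of a species across resamples would suggest that its position in the ranking does not depend heavily on any particular site.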

A further limitation of the SOM analysis is that the geographical regions examined in this study may have indeterminate boundaries. That limitation is not confined to this study and is a problem in general ecological community studies. In some respects, the regional boundaries of the data used in this study may be more clearly defined than is usual because of the existence of border controls and trade practices specific to many regions.

In this study we found the SOM was able to reduce very high dimensional data into patterns that could be usefully interpreted. In addition to the fact that the SOM analysis can perform significant data reduction, it is the only method apart from k-means clustering that can give information about individual species. While k-means clustering is a more conventional method used for the analysis of high dimensional data, the SOM is an unsupervised learning algorithm able to preserve topology of the clusters. With more than 3000 potential global insect pests, a method that can rank these species in terms of their strength of association with species assemblages, and also indicate their risk of invasion, will help focus the resources and research effort of conservation, quarantine and biosecurity scientists and managers. Particular species can be easily targeted for more detailed investigation. Furthermore, while similarity between assemblages has been emphasized as containing hidden predictive information, dissimilarity or species absence may also increase our understanding of species invasions and the invasion process. We are confident that analyses similar to the one presented here will also add value to the large amounts of invasive species distribution data that are currently collected globally.

Acknowledgements

We thank CAB International for use of the data included in the Crop Protection Compendium – Global Module, 5th edition, © CAB International, Wallingford, UK (2003). This research was funded in part by a Centre of Research Excellence-funded postdoctoral fellowship. We also thank Dr Alan Stewart, Plant Breeder, Ceres Research Farm, Christchurch, for sharing his extensive knowledge regarding the origins of New Zealand's non-native plant species, Joel Pitt for his help extracting data and formatting the database, Brad Case for his valued assistance preparing the maps and several anonymous referees for their helpful comments.