#### Subdivision of the study arena

The study arena may be subdivided spatially, with no intrinsic loss of information, into contiguous regions (e.g. Fig. 1b). These regions may be physiographic provinces (Bailey and Ropes 1998), or may be defined by some qualitative factor, such as habitat type, or by a quantitative variable, such as tree density in four classes.

If it is not possible to record data as a complete census of the study arena, then samples may be taken over smaller proportions of it. For example, an area may be sampled by several quadrats. When subareas are defined, the quadrats are usually placed so as to represent them. Alternatively, the quadrats may be placed at random, or in some pre-determined arrangement, often to form a rectangular grid (Fig. 1c). Occasionally, samples are taken at points. Sampling results in a loss of information compared to a complete census, and care should be taken to minimize the effects of this loss (see Dungan et al. 2002). Statistical methods must be used to relate the sample to the larger population and to make inferences.

A sampling plan may be characterized by a combination of three variables: the extent, the sample unit size (known in geostatistics as the “support”) and the lag. The extent describes the dimensions of the study arena, and its area, A. The sample unit size is the area of the quadrat (d × w in the example in Fig. 1c). The lag refers to the distance between adjacent quadrats in a grid (l_{d} and l_{w}, in the x and y directions, respectively, in Fig. 1c). For contiguous subareas with no sampling, the lag is zero. Changes to extent, support and lag may influence the inferences drawn from analysis (Dungan et al. 2002).
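As an illustration only, the three quantities can be sketched in Python. The class name `SamplingPlan`, its field names and the numeric values are hypothetical, and the lag is assumed here to be the gap between the edges of adjacent quadrats, so that contiguous quadrats have a lag of zero.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPlan:
    """Hypothetical container for extent, support and lag of a quadrat grid."""
    extent_x: float  # arena length in the x direction
    extent_y: float  # arena length in the y direction
    d: float         # quadrat side in the x direction
    w: float         # quadrat side in the y direction
    lag_x: float     # gap between adjacent quadrats in x (l_d), assumed edge to edge
    lag_y: float     # gap between adjacent quadrats in y (l_w), assumed edge to edge

    @property
    def extent_area(self) -> float:
        return self.extent_x * self.extent_y

    @property
    def support(self) -> float:
        # Sample unit size: the area of one quadrat.
        return self.d * self.w

    def quadrat_origins(self):
        """Lower-left corners of the quadrats that fit wholly inside the arena."""
        n_x = math.floor((self.extent_x + self.lag_x) / (self.d + self.lag_x))
        n_y = math.floor((self.extent_y + self.lag_y) / (self.w + self.lag_y))
        return [
            (i * (self.d + self.lag_x), j * (self.w + self.lag_y))
            for j in range(n_y)
            for i in range(n_x)
        ]

    def sampled_fraction(self) -> float:
        # Proportion of the arena actually covered by quadrats.
        return len(self.quadrat_origins()) * self.support / self.extent_area


# Illustrative values only: a 10 x 10 arena, 2 x 2 quadrats, gaps of 2.
plan = SamplingPlan(10, 10, 2, 2, 2, 2)
census = SamplingPlan(10, 10, 2, 2, 0, 0)  # zero lag: contiguous subareas
```

With zero lag the quadrats tile the arena and `sampled_fraction()` reaches 1; any positive lag reduces it, which is one way to see how the choice of lag and support controls the information lost through sampling.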

#### Data types

We distinguish three prevalent spatial data types, defined by the topology of the entity to which the recorded information refers. These are 1) point-referenced (Fig. 1d), 2) area-referenced (Fig. 2), and 3) non-spatially referenced (attribute-only). Point-referenced data are common in plant ecology; forms derived from them are widespread in terrestrial ecology. The simplest point-referenced data are a complete census of the individuals recorded along a transect (Fig. 1d, below). Each individual is considered identical to all others and the only information recorded is its location. This type is denoted as (**x**). An example would be the locations of a particular weed species along a field margin. For a two-dimensional arena, the point-referenced data denoted as (**x**, **y**) describe the censussed locations of all individuals within it, with reference to two coordinate dimensions (e.g. Fig. 1d, top).

For any spatial data type, further information may be available for each individual, through the recording of one or more extra attributes, z; such data are denoted (**x**, **z**) or (**x**, **y**, **z**). Attributes may have different forms, of which the simplest is a categorical quantal variable (male or female, dead or alive, Fig. 3a). Another form of z attribute might be an ordered categorical qualitative variable, such as a life-stage. Alternatively, it could be a quantitative variable, such as the magnitude of an inoculum (Fig. 3b).

Many forms of data may be derived. One transformation frequently applied to two-dimensional point-referenced ecological data divides up the study arena according to the locations of individuals, into a meaningful tessellation of polygons, often named for Dirichlet, Thiessen or Voronoi (see Dale 1999: Fig. 1.4 for an example). A closely related technique is the Delaunay triangulation. These techniques leave the (**x**, **y**) coordinates of the data unchanged. They effectively create additional, derived, area-referenced data, similar to those described under 2) below.

Point-referenced (**x**, **y**) data may be amalgamated into a single value, to represent a subarea. This results in partial loss of spatial information, since the original locations can no longer be recovered. The resulting derived data are still explicitly spatial and retain the form (**x′**, **y′**), but **x′** and **y′** must now be defined, for example as the centroids of the subareas, used to transform the (**x**, **y**) data of Fig. 1d to the (**x′**, **y′**, **z′**) data type in Fig. 3c. In this example, a value of z′ for each subarea has been derived from the count of the number of individuals within it. Comparing the two figures shows that almost all the information concerning clustering has been removed from the derived data. For transformations of (**x**, **y**, **z**) data, amalgamation of the **z** variable may also be achieved in many ways, for example by taking the mean magnitude of the z attribute, averaged over the individuals located within each subarea.
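The amalgamation of (**x**, **y**) data into counts per subarea can be sketched as follows. This is a minimal illustration, not the procedure used to produce Fig. 3c: the function name `amalgamate_counts`, the rectangular subareas of equal size and the example coordinates are all assumptions.

```python
from collections import Counter


def amalgamate_counts(points, x_min, y_min, cell_w, cell_h, n_cols, n_rows):
    """Derive (x', y', z') records from (x, y) points.

    x', y' are the centroids of rectangular subareas and z' is the count
    of individuals in each. Points are assumed to lie within the arena;
    points on the upper boundary are assigned to the last row or column.
    """
    counts = Counter()
    for x, y in points:
        col = min(int((x - x_min) // cell_w), n_cols - 1)
        row = min(int((y - y_min) // cell_h), n_rows - 1)
        counts[(col, row)] += 1

    derived = []
    for row in range(n_rows):
        for col in range(n_cols):
            cx = x_min + (col + 0.5) * cell_w  # centroid x'
            cy = y_min + (row + 0.5) * cell_h  # centroid y'
            derived.append((cx, cy, counts[(col, row)]))
    return derived


# Hypothetical coordinates in a 2 x 2 arena split into four unit subareas.
points = [(0.5, 0.5), (0.6, 0.4), (1.5, 0.5), (2.0, 2.0)]
derived = amalgamate_counts(points, 0.0, 0.0, 1.0, 1.0, 2, 2)
```

The two individuals at (0.5, 0.5) and (0.6, 0.4) collapse into a single record with z′ = 2 at the centroid (0.5, 0.5); their original separation cannot be recovered from the derived data.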

Sampled data from quadrats are always derived, by definition, since all the data recorded from within the quadrat are somehow aggregated, to yield a single representative value. Usually, the derived (**x′**, **y′**) component representing the location of each quadrat is defined as its centroid, as in Fig. 3d, where the z′ value for each of the quadrats in Fig. 1c has been derived from the count of the number of individuals contained within it. Observe how much variability is introduced into the counts of Fig. 3d by the sampling process, compared with the censussed values shown in Fig. 3c; whatever loss of information is involved in derived censussed data, the loss from sampled data will be greater. Also, note the difference between recording one or more z attributes on exhaustively censussed point-referenced individuals, and taking a point sample of an attribute that is a continuously distributed variable over the study arena, such as elevation. Both may involve irregularly-spaced data of the form (**x**, **y**, **z**), but the latter has the properties of a sample, with the concomitant features of uncertainty and the intention to represent some larger unknown population.

Area-referenced data are common in landscape ecology and geography. All of the variations of point-referenced data discussed above apply equally to the area-referenced data exemplified in Fig. 2. This type is commonly represented either by a vector form, (**A**, **z**), where location coordinates defining each (possibly irregular) area are associated with attribute(s), or by a raster form, where locations are addressed by a grid of Cartesian coordinates and attributes pertain to a cell of fixed area at that location. Note that raster data can be stored as point-referenced data (**x**, **y**, **z**) but must include the crucial information of grid cell size; this representation is substantially equivalent to point-referenced data in subareas, where the loss of information is small and offset by a very large number of subareas.
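The storage of raster data in point-referenced form can be illustrated with a short sketch. The function name `raster_to_points`, the row ordering (row 0 at the bottom of the arena) and the use of cell centres as coordinates are assumptions made for this example only.

```python
def raster_to_points(grid, x_origin, y_origin, cell_size):
    """Store a raster as point-referenced (x, y, z) records.

    grid[row][col] holds the attribute z; row 0 is assumed to be the
    bottom row. Each record is located at the centre of its cell. The
    cell size is returned alongside the records, because without it the
    area to which each z pertains would be lost.
    """
    records = []
    for row, values in enumerate(grid):
        for col, z in enumerate(values):
            x = x_origin + (col + 0.5) * cell_size
            y = y_origin + (row + 0.5) * cell_size
            records.append((x, y, z))
    return {"cell_size": cell_size, "records": records}


# Hypothetical 2 x 2 raster with 10-unit cells.
raster = raster_to_points([[1, 2], [3, 4]], 0.0, 0.0, 10.0)
```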

In geography, a dichotomy is recognized between so-called “object” and “field” data models (Peuquet 1984, Gustafson 1998, Peuquet et al. 1999). The object model considers two-dimensional arenas populated by discrete entities, whereas underlying the field model are variables assumed to vary continuously on a surface. Area-referenced data can be considered as either; within Geographic Information Systems software (Burrough and McDonnell 1998) (**A**, **z**) data are often represented as polygons.

Sometimes, explicit spatial information might not exist, or if the data for analysis are recycled from previous results, the information may have been degraded and lack the detail of the original records. Some methods of data recording or amalgamation remove all explicit spatial information. For example, a common and powerful entity in spatial pattern analysis is termed a nearest neighbour (NN) distance (Donnelly 1978, Diggle 1983, Ripley 1988), defined for each censussed individual as the distance to its nearest neighbour. Hence, in the (**x**, **y**) data shown in Fig. 1d above, the individual labelled P has the individual labelled Q as its nearest neighbour and the distance PQ would be the NN distance associated with P. As can be seen, the relationship is not necessarily symmetric, since the nearest neighbour of Q is not P. The set of NN distances for all individuals is not spatially-referenced, so is denoted by (**z**). However, it contains much implicit information concerning location, and may be used to test for spatial randomness. A further example is the frequency distribution of counts in subareas or samples (e.g. the set **z**={3, 4, 6, 7, 7, 9} from Fig. 3c), stripped of its (**x**, **y**) coordinate information. From these six counts may be derived associated statistics such as the sample mean (6), sample variance (4.8), and quantities such as the variance to mean ratio, known as the index of dispersion (Fisher et al. 1922), I, which in this case is exactly 0.8. Note how I fails to capture any of the pattern in the original Fig. 1d.
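Both non-spatially referenced examples can be sketched in a few lines of Python. The function names and the three coordinates standing in for P, Q and a third individual R are hypothetical; the counts are the six values quoted above. The nearest-neighbour search is a brute-force comparison of all pairs, adequate only for small data sets.

```python
import math
from statistics import mean, variance


def nearest_neighbours(points):
    """For each point, return (index of its nearest neighbour, NN distance)."""
    result = []
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i != j:
                dist = math.dist(p, q)
                if dist < best_d:
                    best_j, best_d = j, dist
        result.append((best_j, best_d))
    return result


def index_of_dispersion(counts):
    """Variance to mean ratio, using the sample variance (divisor n - 1)."""
    return variance(counts) / mean(counts)


# Hypothetical locations: P's nearest neighbour is Q, but Q's is R.
points = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]  # P, Q, R
nn = nearest_neighbours(points)
nn_distances = [d for _, d in nn]  # the (z) data: locations discarded

counts = [3, 4, 6, 7, 7, 9]
I = index_of_dispersion(counts)
```

The list `nn_distances` carries no coordinates, yet its values depend entirely on the spatial arrangement of the points. The index `I`, by contrast, is unchanged by any reassignment of the same counts to different subareas.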