The development of perceptual averaging: learning what to do, not just how to do it

Abstract The mature visual system condenses complex scenes into simple summary statistics (e.g., average size, location, or orientation). However, children often perform poorly on perceptual averaging tasks. Children's difficulties are typically thought to represent the suboptimal implementation of an adult-like strategy. This paper examines another possibility: that children actually make decisions in a qualitatively different way from adults (optimal implementation of a non-ideal strategy). Ninety children (6–7, 8–9, 10–11 years) and 30 adults were asked to locate the middle of randomly generated dot-clouds. Nine plausible decision strategies were formulated, and each was fitted to observers' trial-by-trial response data (reverse correlation). When the number of visual elements was low (N < 6), children used a qualitatively different decision strategy from adults: appearing to "join up the dots" and locate the gravitational center of the enclosing shape. Given denser displays, both children and adults used an ideal strategy of arithmetically averaging individual points. Accounting for this difference in decision strategy explained 29% of children's lower precision. These findings suggest that children are not simply suboptimal at performing adult-like computations, but may at times use sensible, yet qualitatively different, strategies to make perceptual judgments. Learning which strategy is best in which circumstance might be an important driving factor of perceptual development.

in the environment and internal noise within our own visual system means that any perceptual estimate of an object's properties is subject to random error. Averaging across multiple estimates causes random errors to cancel out, resulting in a single overall estimate that is more precise than any individual estimate. Second, summary statistics can reduce key computational demands. Thus, rapidly computing the "gist" of a scene may help bypass the processing limits imposed by finite memory and attentional resources (Alvarez, 2011; Alvarez & Oliva, 2008; Awh, Barton, & Vogel, 2007; Luck & Vogel, 1997).
While adults are highly adept at computing summary statistics, children appear to struggle. For example, Manning and colleagues (Manning, Dakin, Tibber, & Pellicano, 2014) asked 5-11-year-old children to perform a global motion processing task, in which observers were required to estimate the average direction of multiple visual elements, each moving in a slightly different direction. The authors analyzed their data using an Equivalent Noise model (Lu & Dosher, 1999). In short, when the amount of external noise (Gaussian direction jitter) was low, performance was assumed to be determined solely by internal noise. When the amount of external noise was high, performance was assumed to be determined solely by under-sampling.
By observing how accuracy declined as external noise increased, the authors inferred that approximately half of the children combined information from more than one element, but that children tended to use fewer elements than adults. Similarly, Sweeny and colleagues (Sweeny, Wurnitsch, Gopnik, & Whitney, 2015) asked 4-5-year-old children to perform a size discrimination task, in which the observer must determine which of two arrays was drawn from a distribution with a larger average size (sample discrimination; Jesteadt, Nizami, & Schairer, 2003). Following Bernoulli's theorem (Feller, 1968), as the number of samples increases, response accuracy should improve, at a rate of √N. Accordingly, children became more accurate as the number of samples increased, but the improvement was smaller than for adults, or than would be predicted by an ideal observer.
In short, both studies observed suboptimal integration in children, and in both cases the authors attributed this to children ignoring some of the available information, perhaps due to immaturities in selective attention (Jones, Moore, & Amitay, 2015; Ristic & Kingstone, 2009) or memory (Cowan, AuBuchon, Gilchrist, Ricker, & Saults, 2011; Simmering, 2012). Effectively, they suggested that children attempted to respond in the same way as adults (mean-averaging the information presented), but that their implementation was imperfect. In this case, development can be seen as a form of parametric learning, that is, learning the optimal values of the various "variables" involved in a computational process, such as the optimal "weight" to give each source of information (e.g., see Equation 3). In machine learning terms, children are struggling with a problem of optimization.
However, there exists a second class of explanation that also fits the existing data: it may be that children use qualitatively different strategies to summarize sensory information. For example, in the case of size judgments, instead of arithmetically averaging independent estimates of sizes, children may be responding based on the total surface area of the display, the size of the largest single element, the density of the elements, and so forth. These strategies may not be what the experimenter intended or expected when designing the task, and may be suboptimal given the demands of the current task. Nonetheless, such strategies are often quite rational, and enable the participant to operate at a better-than-chance level. In this case, development can be seen as a form of structural learning (Wolpert & Flanagan, 2010), that is, learning what is the best overall "equation" to solve the task in the first place, including what the sources of information are, and how best to map these sensory inputs to a final decision. In machine learning terms, children are struggling with a problem of model selection. Traditional ("molar") measures of accuracy indicate how well an observer is performing, but provide no insights into why performance may vary. To discriminate between parametric and structural hypotheses, the present study therefore employed a trial-by-trial ("molecular"; Berg, 2004) method of analysis, designed to reveal systematic differences in how children and adults average visual stimuli.
In the present study, children (6–11 years) and adults were asked to "find the middle" of a cloud of dots sampled from a 2D Gaussian distribution (Figure 1; Gaussian-jittered spatial information). Crucially, unlike traditional methods of analysis, we did not score responses as "correct" or "incorrect". Instead, we formulated a range of plausible algorithms that observers might employ, and determined which of these best predicted the empirical, trial-by-trial response data (irrespective of whether those responses were accurate or not). Perhaps surprisingly, a large number of strategies can be devised to perform this simple spatial averaging task (Figure 2a). For example, an observer might (i) mean-average the Cartesian coordinates of each individual dot, (ii) fit a shape to the dot cloud and locate its center of gravity, or (iii) try to determine the smallest surface that would enclose the observed dots and locate the center of that (see Table 1). Notably, each of these strategies predicts a quantitatively different response (Figure 2b, though cf. Figure 2c). The difference between observed and predicted behavior on each trial can therefore be used to determine the best-fitting model of decision-making.
Participants received points for hitting a target centered on the arithmetic mean of the underlying distribution. Given this reward structure, the statistically optimal ("maximum likelihood") strategy is to compute the arithmetic mean of the observed locations. Based on previous data (Morgan & Glennerster, 1991), we predicted that adults would respond in this way. In contrast, it was unknown how children would behave. If children use suboptimal decision strategies (structural inefficiency), then we would expect to see systematic differences in their preferred decision strategy. Alternatively, children may average spatial information in a qualitatively similar manner to adults, but may only attend to a subset of the information available (i.e., fail to "weight" every cue appropriately; parametric inefficiency). In this case, we would predict no systematic differences in preferred decision strategy, but only an increase in response variability.

FIGURE 2 (a) The nine possible decision strategies for integrating spatial information considered in the present study. See Table 1 for text descriptions. (b) The predictions of each strategy, for a single representative trial (raw data not shown). (c) The mean Euclidean difference (± 1 SD) between the predictions of the convex hull and arithmetic mean strategies, as a function of the number of data points, determined using 20,000 simulated trials. Note that when N = 3 (the fewest N Dots tested), the predicted responses for both strategies are identical, becoming increasingly distinct as N increases.

TABLE 1 (excerpt)

Fitcircle (geometric): Centroid of the best-fitting circle, minimizing residual geometric error (sum of squared distances between the observed <x,y> points and the fitted circle), fitted using nonlinear least squares (Gauss–Newton).

Fitcircle (algebraic): Centroid of the best-fitting circle, minimizing residual error (sum of distances between the observed <x,y> points and the fitted circle), fitted algebraically.

| Stimuli and procedure
Participants were asked to find the middle of a cloud of dots (see Figure 1), using a stimulus design developed and described previously by Juni and colleagues (Juni, Gureckis, & Maloney, 2015). Each dot was an anti-aliased circle, 1.1 mm in radius (~0.11 degrees visual angle), generated in Matlab (Mathworks, Natick, USA) using Psychtoolbox-3 (Brainard, 1997), and presented on an LCD optical touch-screen (Prolite T2452MTS; Iiyama Electric Co. Ltd, Iiyama, Japan). On each trial, N dots (see below) were randomly sampled from a randomly located symmetric bivariate Gaussian distribution, and participants were asked to locate the middle of the dots by pressing on the screen.
The location of each dot was therefore a noisy (Gaussian-jittered) but unbiased estimate of the center of the underlying distribution. The center of the underlying distribution varied randomly between trials, and was constrained such that 98% of the distribution fell inside the visible screen area. Participants were given feedback after each trial: a response was scored as correct if it fell within 12.8 mm of the arithmetic mean of the underlying sampling distribution. The ideal strategy was therefore to respond to the arithmetic mean of the observed dots (see Juni et al., 2015). For 60 participants (N = 45 children) the standard deviation of the bivariate Gaussian sampling distribution, σxy, was 12.5 mm. For the other 60 participants, σxy = 27.5 mm. This difference was not pertinent to the present study (NB: the data reported here formed part of a wider dataset, additional data from which are reported elsewhere; Jones et al., under review), and the results did not differ qualitatively between the two conditions (participants did tend to be less accurate in noisier conditions, but did not appear to differ in terms of their preferred response strategy). Data from both conditions were therefore analyzed together. Participants completed approximately 150 trials on average (the exact number of trials varied between participants: μ = 154, σ = 34). Trials were randomly distributed across six N Dots conditions: 〈3, 4, 5, 6, 7, 15〉, for a total of approximately 4,500 trials per age group.
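The trial structure described above can be sketched as follows. This is a minimal illustration, not the original implementation (the experiment was run in Matlab/Psychtoolbox); all function and constant names here are our own, and `random.gauss` stands in for whatever Gaussian sampler was actually used.

```python
import math
import random

SIGMA_XY = 12.5     # mm; the other half of participants saw 27.5 mm
HIT_RADIUS = 12.8   # mm; feedback threshold around the distribution mean

def generate_trial(n_dots, center, sigma=SIGMA_XY):
    """Sample n_dots from a symmetric bivariate Gaussian around `center`.

    Each dot is a noisy but unbiased estimate of the hidden center.
    """
    return [(random.gauss(center[0], sigma), random.gauss(center[1], sigma))
            for _ in range(n_dots)]

def score_response(response, center, hit_radius=HIT_RADIUS):
    """Feedback rule: correct iff the press lands within hit_radius mm."""
    return math.hypot(response[0] - center[0],
                      response[1] - center[1]) <= hit_radius
```

In use, the hidden `center` would itself be drawn at random each trial (constrained so that ~98% of the sampling distribution stays on screen), and `n_dots` drawn from 〈3, 4, 5, 6, 7, 15〉.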

| Potential observer strategies for locating the "middle" of a cloud of dots
Nine potential decision strategies were considered. These are shown graphically in Figure 2a, and are detailed in Table 1. The set of strategies was not exhaustive, but included all plausible strategies that we were able to devise, and was representative of the types of strategies observers anecdotally reported using when questioned.

| Centroid of a minimum bounding polygon
In the minimum-bound approaches (Table 1; R1-R3), the observer visualizes the smallest shape (circle, triangle, or rectangle) that encloses the set of observed points, and then computes the centroid of this shape. Here we define "smallest" shape as the shape with least area.
Smallest could also be defined in terms of perimeter. However, minimizing area or perimeter yields identical predictions in most cases, and negligible differences in the remaining cases. The centroid ("geometric center") of a two-dimensional region is the arithmetic mean position of all the points in the shape, which can be computed from the vertices of the minimal bounding polygon as follows:

C_x = \frac{1}{6A} \sum_{i=1}^{n} (x_i + x_{i+1})(x_i y_{i+1} - x_{i+1} y_i), \qquad C_y = \frac{1}{6A} \sum_{i=1}^{n} (y_i + y_{i+1})(x_i y_{i+1} - x_{i+1} y_i)

where (x_1, y_1), (x_2, y_2), …, (x_n, y_n) are the vertices of the n-sided polygon, arranged in clockwise order around the perimeter and with the first vertex repeated at the end to close the shape (i.e., x_1 = x_{n+1} and y_1 = y_{n+1}). The variable A is the signed area of the polygon:

A = \frac{1}{2} \sum_{i=1}^{n} (x_i y_{i+1} - x_{i+1} y_i)
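The shoelace centroid computation above translates directly into code. A minimal sketch (the function name is ours; vertices are (x, y) tuples in order around the perimeter, and the signed area makes the result independent of winding direction):

```python
def polygon_centroid(vertices):
    """Centroid of a simple polygon via the signed-area (shoelace) formulas.

    The first vertex need not be repeated; the loop wraps around implicitly.
    """
    n = len(vertices)
    a2 = 0.0            # accumulates twice the signed area
    cx = cy = 0.0
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    area = a2 / 2.0     # negative for clockwise vertex order
    return (cx / (6.0 * area), cy / (6.0 * area))
```

Because both the area and the centroid sums flip sign together, clockwise and counter-clockwise vertex orderings give the same centroid.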

| Centroid of a convex hull
The convex hull is the smallest polygon that encloses the set of observed points. It can be thought of as a rubber band stretched around all of the observed points. It is a generalization of the "minimum bound" approaches described above. However, with a convex hull strategy the observer is not constrained to visualize a shape of any particular geometric form.
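A standard way to compute the hull itself is Andrew's monotone chain algorithm (our choice for illustration; the paper makes no claim about how observers compute it mentally). The strategy's predicted response is then the centroid of the resulting polygon, via the shoelace formulas given above.

```python
def _cross(o, a, b):
    # z-component of (a - o) x (b - o); > 0 means a counter-clockwise turn
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull, counter-clockwise (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for chain, seq in ((lower, pts), (upper, reversed(pts))):
        for p in seq:
            # Pop while the last two chain points and p fail to turn left
            while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    # Drop the last point of each half-chain (it repeats the other's start)
    return lower[:-1] + upper[:-1]
```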

| Arithmetic mean
The arithmetic mean was defined in the usual way, and was computed independently for the x and y coordinates:

\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i (1)

As detailed in the introduction, this is the ideal strategy for computing the central tendency of independent observations of a random variable, and it was the expected strategy of adults (Morgan & Glennerster, 1991).

| Mean of subset (N)
This was the same as the arithmetic mean, but was based on a subset of N dots. When the number of dots presented was greater than N, the most outlying dots were excluded. In practice this is only one of an infinite number of ways to partition the data, but ignoring outliers seemed a reasonable approximation for how an observer might pick out a subset of points. Outliers were determined by the median distance between each point and every other point (Rousseeuw & Croux's "Sn" factor; Rousseeuw & Croux, 1993). In the reported data, N was set to four, as this was the number of samples that Sweeny and colleagues reported children use when performing size-averaging (Sweeny et al., 2015). Other values of N (3, 5, 6, 7) were also analyzed, but these results are not reported, as they were qualitatively identical to N = 4.
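A hedged sketch of this rule, ranking each dot by the median of its distances to every other dot (the idea underlying Rousseeuw & Croux's Sn statistic), keeping the N least-outlying dots, and averaging them. The function name and exact tie-breaking are our own assumptions.

```python
import math
from statistics import mean, median

def mean_of_subset(points, n_keep=4):
    """Arithmetic mean of the n_keep least-outlying (x, y) points.

    Outlyingness of a point = median Euclidean distance to all other points.
    """
    if len(points) <= n_keep:
        keep = points
    else:
        def outlyingness(p):
            return median(math.hypot(p[0] - q[0], p[1] - q[1])
                          for q in points if q is not p)
        keep = sorted(points, key=outlyingness)[:n_keep]
    return (mean(p[0] for p in keep), mean(p[1] for p in keep))
```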

| Geometric mean
The geometric mean is analogous to the arithmetic mean, but uses multiplication and roots instead of addition and division. It is equivalent to the arithmetic mean of the logarithm-transformed location values (with the result then returned to the original, unlogged scale).
The geometric mean might be appropriate, therefore, if spatial information in the brain is distributed along a logarithmic decision axis.
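The log-transform equivalence described above can be written compactly. A sketch under the assumption that coordinates are strictly positive, as they are for on-screen dot positions (the function name is illustrative):

```python
import math

def geometric_mean_point(points):
    """Per-coordinate geometric mean: exponentiated mean of log values."""
    n = len(points)
    gx = math.exp(sum(math.log(x) for x, _ in points) / n)
    gy = math.exp(sum(math.log(y) for _, y in points) / n)
    return (gx, gy)
```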

| Centroid of a fitted circle
In the fitted circle approach, the observer mentally fits a circle to the observed points, and responds to its centroid. Unlike the minimum-bounding approaches, the circle passes through the points, rather than enclosing them. One way to fit such a circle is to specify a circle algebraically in the plane,

x^2 + y^2 + a x + b y + c = 0,

and then determine analytically the coefficients a, b, and c that provide the best linear (least-squares) fit to the data (see Gander, Golub, & Strebel, 1994); the fitted centroid is then ⟨−a/2, −b/2⟩. An alternative ("geometric") approach uses an iterative algorithm to minimize the sum of the squared distances from the circle to the observed points. As discussed by previous authors (Gander et al., 1994), geometric fitting often provides different results from algebraic fitting, and is liable to produce fits that are in greater accord with our intuitions.
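The algebraic fit reduces to a 3 × 3 linear system. Below is a sketch of one such ("Kasa-style") fit, minimizing Σ(x² + y² + ax + by + c)² via its normal equations; this is a standard algebraic circle fit, not necessarily the exact variant used in the original analysis, and the function names are ours.

```python
def fit_circle_algebraic(points):
    """Center <-a/2, -b/2> of the circle x^2 + y^2 + a*x + b*y + c = 0
    fitted by linear least squares on the algebraic residual."""
    n = len(points)
    sx = sum(x for x, y in points); sy = sum(y for x, y in points)
    sxx = sum(x * x for x, y in points); syy = sum(y * y for x, y in points)
    sxy = sum(x * y for x, y in points)
    sz = sum(x * x + y * y for x, y in points)
    szx = sum((x * x + y * y) * x for x, y in points)
    szy = sum((x * x + y * y) * y for x, y in points)
    # Normal equations M @ [a, b, c] = rhs
    M = [[sxx, sxy, sx],
         [sxy, syy, sy],
         [sx,  sy,  n]]
    rhs = [-szx, -szy, -sz]
    a, b, c = solve3(M, rhs)
    return (-a / 2.0, -b / 2.0)

def solve3(M, v):
    """Tiny Gaussian elimination with partial pivoting for a 3x3 system."""
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for k in range(col, 4):
                A[r][k] -= f * A[col][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (A[r][3] - sum(A[r][k] * x[k] for k in range(r + 1, 3))) / A[r][r]
    return x
```

For points lying exactly on a circle, the fit recovers the true center; for noisy dot-clouds it yields the strategy's predicted response.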

| Analysis
All analyses were performed using the data from all participants within each age group. Concatenating data across observers was necessary to constrain the models adequately, given the relatively small number of trials per participant (μ = 25, per N Dots condition). However, it meant that we could not examine individual differences in decision strategies (Haberman, Brady, & Alvarez, 2015). Response time data were log10-transformed prior to statistical analyses to ensure normality.

| Evaluating observer strategies
To evaluate how well each strategy predicted the observer's behavior we computed mean error: the mean Euclidean distance between the predicted and observed response for each trial (i.e., the residual error).
More predictive strategies should exhibit lower mean error, and for the ideal observer mean error is zero. Non-parametric bootstrapping was used to compute 95% confidence intervals around mean error values.
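The mean-error score itself is simple to compute. A minimal sketch (`predictions` and `responses` are hypothetical parallel lists of (x, y) tuples, one pair per trial):

```python
import math

def mean_error(predictions, responses):
    """Mean Euclidean distance between predicted and observed responses."""
    dists = [math.hypot(px - rx, py - ry)
             for (px, py), (rx, ry) in zip(predictions, responses)]
    return sum(dists) / len(dists)
```

A strategy that perfectly reproduced every response would score zero; lower scores indicate a more predictive strategy.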
Subjects may not rely on one single perceptual averaging strategy, but may instead shift between two or more depending on the stimulus condition. We quantified this by computing a relative decision weight for each strategy using the reverse correlation method (Lutfi, 1995; Richards & Zhu, 1994). In brief, a multiple multivariate linear regression was performed, containing: (i) x and y error terms, (ii) two independent variables per strategy 〈x_predicted, y_predicted〉, and (iii) two dependent variables 〈x_observed, y_observed〉. The x and y slope coefficients were then averaged within each strategy, and normalized so that their magnitudes summed to one. This yielded one relative weight value, ω, per strategy, indicating the relative degree to which that strategy determined the observer's responses. More predictive strategies exhibit higher weight values, with the maximum being ω = 1.

| Characterizing performance
To investigate the effect of preferred summarizing strategy on performance, we computed traditional measures of precision, accuracy, and response latency.

| Precision
Response precision was quantified as the reciprocal of standard distance deviation (1/SDD). SDD is the two-dimensional equivalent of standard deviation, and is computed as:

SDD = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} d_i^2 }

where N is the total number of trials, and d_i is the residual error on trial i (the distance between predicted and observed response, in millimetres):

d_i = \sqrt{ (R_x - \bar{x})^2 + (R_y - \bar{y})^2 }

where ⟨R_x, R_y⟩ are the participant's response coordinates, and ⟨x̄, ȳ⟩ are the arithmetic mean of the observed dots. Note that implicit in this formula is an assumed decision strategy, since errors are computed relative to the arithmetic mean of the observed points. This is problematic, since an ideal observer who does not respond to the arithmetic mean of the data will exhibit SDD > 0 (despite, by definition, having infinitely high precision). Alternatively then, distance can be computed by replacing the terms ⟨x̄, ȳ⟩ with the target coordinates predicted by a different, more appropriate decision strategy. For example, given an observer who computes the geometric mean of the observed data, the appropriate measure of residual error can be derived by substituting the geometric mean coordinates for ⟨x̄, ȳ⟩ in the formula for d_i. In the present analysis, we therefore began by using the simple ("traditional") measure of response error given above, but went on to consider more appropriate measures, given observers' empirically estimated decision strategies.
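The precision measure can be sketched as follows. Note the 1/N normalization is our reading of "two-dimensional equivalent of standard deviation"; some published definitions of SDD use N − 2 instead.

```python
import math

def sdd(errors):
    """Standard distance deviation: RMS of per-trial residual distances d_i."""
    return math.sqrt(sum(d * d for d in errors) / len(errors))

def precision(errors):
    """Precision is the reciprocal of SDD (higher = more precise)."""
    return 1.0 / sdd(errors)
```

The `errors` list holds d_i for each trial, computed against whichever decision strategy is assumed (arithmetic mean by default, or the best-fitting strategy in the adjusted analysis).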

| Accuracy
Response bias (accuracy) was quantified as the mean signed deviation of responses from the arithmetic mean of the observed points:

\beta_{xy} = \frac{1}{N} \sum_{i=1}^{N} \left( \langle R_{x,i}, R_{y,i} \rangle - \langle \bar{x}_i, \bar{y}_i \rangle \right)

In an unbiased observer, β_xy = 0.

| Response latency
Response latency was quantified as the lag between stimulus presentation and the observer's response, in seconds. Note, however, that responses were not speeded, and participants were instructed only to be as accurate as possible. To ensure statistical normality, reaction time data were log-transformed prior to analysis (Whelan, 2008).

| Analyzing age differences using bootstrapping
To evaluate differences in SDD between different age groups (e.g., 6-7-year-olds vs. adult controls) a bootstrapping procedure was used.
Samples were randomly drawn, with replacement, from each of the two age groups, and the difference in mean SDD was computed. This procedure was repeated 20,000 times. The p-value was defined as 2P, where P was the proportion of these 20,000 differences that had the opposite sign from the observed difference in SDD. This procedure is fundamentally similar to performing a traditional hypothesis test (e.g., t test, Mann-Whitney U test), or to graphically comparing bootstrapped confidence intervals.
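The bootstrap comparison above can be sketched directly (function name and seeding are ours; each group is a list of per-participant SDD values):

```python
import random

def bootstrap_p(group_a, group_b, n_boot=20000, rng=random.Random(0)):
    """Two-tailed bootstrap p-value for a difference in group means.

    p = 2P, where P is the proportion of resampled mean differences whose
    sign is opposite to the observed difference.
    """
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    opposite = 0
    for _ in range(n_boot):
        a = [rng.choice(group_a) for _ in group_a]   # resample with replacement
        b = [rng.choice(group_b) for _ in group_b]
        diff = sum(a) / len(a) - sum(b) / len(b)
        if (diff < 0) != (observed < 0):
            opposite += 1
    return 2 * opposite / n_boot
```

With 20,000 resamples the smallest resolvable nonzero p-value is 2/20,000 = .0001, which is ample for the p < .01 comparisons reported below.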

| RESULTS
To characterize overall performance, precision (1/SDD) and bias (β_xy) were computed for each age group. As shown in Figure 3, no significant response bias was present in any group (Figure 3a), but precision was significantly lower for children than adults (Figure 3b). The difference in precision between children and adults was confirmed by using bootstrapping to perform group comparisons, as described in the Methods (p < .01 for Children vs. Adults, for all combinations of Age Group × N Dots).
To investigate which decision strategy children and adults used to perform the task, each of the nine algorithms shown in Figure 2 was fitted to the trial-by-trial response data from each age group, and mean error computed. As shown in Table 2, when the number of dots was low, the convex hull strategy best predicted children's responses, whereas the arithmetic mean best predicted adults' responses, with some age groups showing no significant preference for one strategy or the other. Note, crucially, that even if noise was added to the arithmetic mean model, this could not, by definition, improve its predictive power. These data therefore demonstrate unambiguously that children were not simply computing the mean-average position (i.e., noisily), when N ≲ 6. Other strategies (e.g., geometric mean, averaging a subset, minimum bounding square, best-fitting circle, etc.) gave poor accounts of how children or adults responded in any condition (see Table 2).

FIGURE 3 (a) Mean (± 95% CI) response bias (Eq 4) for each age group. (b) Mean (± 95% CI) response precision (see Eq 6) for each age group. Note that this is computed with respect to an ideal strategy of arithmetic mean computation. Error bars were computed using bootstrapping (N = 20,000).
The foregoing indicated that children differed from adults in how they localized small numbers of dots (N ≲ 6). To what extent can this qualitative difference in strategy explain children's failure to average information efficiently? That is, to what extent can differences in strategy explain the difference in localization precision between children and adults shown previously in Figure 3b? To address this question, response precision (1/SDD) was recomputed for each condition, using the expected ⟨x̄, ȳ⟩ values for the best-fitting strategy for that age-group/dot-condition. Once these adjustments for decision strategy were performed, observed precision within children improved substantially (Figure 5). For example, in the N = 3–5 conditions, use of a different strategy accounted on average for 29% of the apparent difference between children and adults. This indicates that some, but not all, of children's immaturities are due to their use of qualitatively different decision strategies.
To further examine age differences in decision strategy with different N dots, response latency was also analyzed (although note that responses were not speeded, and participants were instructed only to be as accurate as possible). The results are shown in Figure 6.
Adults were on average faster to respond than children (independent t test comparison of log10 data; p ≪ .001, for all three Age Group comparisons).

| DISCUSSION
Adults and children aged 6 to 11 years were able to combine visuospatial information in order to locate the center of a set of elements. Children exhibited higher response variance (lower precision) than adults. This was shown to partially reflect structural learning: children used a qualitatively different strategy from adults when the number of dots was low, opting to "join the dots" and respond to the center of the smallest geometric shape that enclosed the observed points (decision strategy 4, Table 2). Such strategies may be perfectly sufficient in some contexts, but were suboptimal for the present task. In contrast, adults responded in the statistically optimal manner, by computing the arithmetic mean of the elements; and they did so consistently, irrespective of the number of elements. Accounting for this difference in strategy explained 29% of the difference in precision between children and adults. This finding is analogous to a recent study in audition, which showed that developmental differences in the slope of psychometric functions can be explained by taking into account differences in decision strategy.

TABLE 2 Mean Euclidean error between observed responses and the predicted responses given each of eight putative decision strategies (see Supplemental Material for analogous assessments based on relative weights and percent best). Bold figures indicate the best-fitting model for each N Dots condition. The difference between the arithmetic mean and convex hull strategies is also shown graphically in Figure 4.

Mean Error, mm (6–7 years)

N Dots               3     4     5     6     7     15
minbound circ        6.0   8.2   7.9   8.1   9.2   10.1
minbound triang      6.6   7.8   6.9   6.8   7.3   8.9
minbound rect        6.7   8.9   7.9   8.0   8.7   9.1
convex hull          6.6   7.0   6.0   6.0   6.5   7.5
geometric mean       6.7   7.6   6.7   6.9   6.6   6.4
arithmetic mean      6.6   7.5   6.6   6.6   6.3   6.2
mean of subset (4)   6.6   7.5   10.9  12.3  11.7  10.0
In the past, poorer performance on perceptual-averaging tasks has been attributed solely to the inefficient implementation of an ideal strategy (parametric learning). For example, Sweeny et al. (2015) interpreted immature size-averaging as indicating that children pool information across fewer objects than adults. Similarly, Manning et al. (2014) interpreted immature motion-averaging as indicating that older children "are able to [effectively] average across more local motion estimates". The present study expands upon this prior work by showing that children are additionally limited by their use of altogether different response strategies. In this light, development may take place at a qualitatively different level of decision-making than previously proposed: namely through children learning how to identify the "form" of a task and how best to approach the problem (structural learning), rather than by learning how best to optimize their implementation of a particular solution (parametric learning).
Of course, these two types of development are not mutually exclusive, and even in the present work much (71%) of children's increased error remained unaccounted for. This may indicate that children are further limited by how efficiently they are able to implement their preferred strategy, as has been suggested previously by others (Manning et al., 2014;Sweeny et al., 2015) (i.e., they might under-or over-"weight" some of the available cues). Alternatively/additionally, it may be that children are limited by sources of random inefficiency ("internal noise"), either at the sensory level, or in terms of high-level inattentiveness. Given the nature of the present task, internal noise at the motor level may also have limited children's ability to respond precisely.
Finally, it may be that some of the remaining developmental difference can be explained by deterministic factors that we were unable to examine in the present work, such as how response strategies differ across individuals, or vary with practice (Jones, Moore, Shub, & Amitay, 2014; i.e., with children being slower to learn the ideal response strategy).
With the present data, we are unable to test these various hypotheses, and in future it would be interesting to collect larger and more nuanced datasets that could do so. For example, one could repeat certain stimulus configurations throughout the course of the experiment, and use the inconsistency of observers' responses as an index of internal noise (Green, 1964; Jones, Shub, Moore, & Amitay, 2013). Notably though, none of these explanations can account for the systematic differences in responses that were observed in the present study (e.g., no source of random error would cause one response strategy to consistently predict observers' responses more accurately than another). Accordingly, the present findings show unambiguously that at least part (29%) of the difference in performance between adults and children is explained by the use of qualitatively different algorithms (structural learning) rather than quantitative differences in response efficiency (parametric learning).

FIGURE 4 Reliance on convex hull versus arithmetic mean decision strategies, as assessed using relative weights. Other strategies were also included in the model, but are not shown here as they were generally given close to zero weight. (a–b) Weights for each individual strategy: higher ω indicates more predictive of observed behavior. (c) Direct comparison between the two strategies (Δω = ω_mean − ω_cvxhull). Values greater than zero indicate that observers' responses were better predicted by the arithmetic mean strategy; values less than zero indicate that observers' responses were better predicted by the convex hull strategy. Error bars indicate 95% confidence intervals, derived using bootstrapping (N = 20,000). Values significantly different from zero are shaded grey. These data confirm graphically, using a different method of analysis, what can also be seen in Table 2.

FIGURE 5 Response precision, when (left) residual error is computed based on the arithmetic mean of the data, and (right) when residual error is computed using the best-fitting decision strategy for that age group / dot condition (as shown in Table 2). See Methods 2.6 for further details. Bold horizontal lines show mean precision for the N = 3–6 conditions (i.e., where children's strategies differed from adults'). Dashed horizontal lines show mean precision for all dot conditions, averaged within adults and children. By comparing between panels, one can see how much of the difference in response variability (% SDD) between children and adults was explained by differences in decision strategy.
Perhaps surprisingly, children switched to using the ideal, adult-like decision strategy when the number of dots was high (N ≳ 6). This demonstrates that children are liable to vary their decision strategy depending on the stimulus parameters. In this respect, it is interesting to note that many studies that have reported immaturities in cue integration have tended to use very small numbers of cues (Gori, Del Viva, Sandini, & Burr, 2008; Nardini, Jones, Bedford, & Braddick, 2008; Petrini, Jones, Smith, & Nardini, 2015; Sweeny et al., 2015). In the present work, children actually performed relatively poorly with such sparse inputs, and became faster and more accurate when viewing more complex stimuli, presumably because with increasing complexity, non-ideal strategies became increasingly costly. One corollary of this is that at times, by simplifying psychophysical experiments for children, we may actually be underestimating their perceptual abilities.
Children's change in decision strategy was corroborated by their response time data, with children responding faster as they switched from a convex hull to an arithmetic averaging strategy (i.e., as the number of dots increased). The counter-intuitive finding that responses actually became faster as the visual scene became more complex is consistent with similar previous findings in adults (Robitaille & Harris, 2011), and supports the notion that summary statistics represent, in part, a computationally expedient mechanism for perceptual decision-making (Haberman & Whitney, 2012). Our findings suggest that it may take the developing system many years to identify the most expedient summarizing strategy, but that even young children are capable of utilizing adult-like strategies when stimulus complexity makes the use of more simple heuristics untenable.
In the present study, structural learning (i.e., children's use of alternative response strategies) was shown to occur on a visual localization task. We have no reason to suppose that such development does not further generalize across a wide range of other modalities and task domains. However, it should be noted that, from a practical perspective, some other domains may not be so easily studied. With 2D localization, it is straightforward and natural to think geometrically about different ways of solving the problem, for example, in terms of drawing a certain shape around a cluster of points, or mean-averaging their locations. Furthermore, there exists an intuitive response for us to measure (pointing), which can be used to delineate between competing hypotheses. In contrast, with more complex stimuli such as faces, the decision space is hyperdimensional, and it becomes considerably more difficult for us-as human experimenters-to formulate the various different models that observers might employ, or to visualize/interpret the data.
Meanwhile, at the other end of the spectrum, some decision spaces are so simple that explanations may "come to an end" (Wittgenstein, 2009). For example, with a one-dimensional feature space, such as size, one can imagine a number of plausible statistics observers might use (arithmetic mean, geometric mean, median, mode, robust averages, etc.). However, there is no obvious, straightforward method of collecting the response required to test the competing hypotheses (i.e., method of adjustment is notoriously problematic; Wier, Jesteadt, & Green, 1976), and with the decision space containing only a single dimension, the observable differences in response may be too small to measure accurately in children.

| SUMMARY AND CONCLUSIONS
The current study demonstrates that children's difficulties in computing summary statistics do not simply represent poor implementation of an adult-like algorithm. Instead, children are liable to be limited further by the use of qualitatively different decision strategies: strategies which may be sensible in themselves, but not ideal given the task context. This suggests that structural learning, the ability to select the most efficient problem-solving model for a task, is also a crucial factor in perceptual development.
FIGURE 6 Response time data. (a) Median (± 95% CI) response times, as a function of Age Group and condition (N Dots). Confidence intervals were computed using bootstrapping (N = 20,000). (b) Association between the change in strategy from convex hull to arithmetic mean (i.e., the data points from the rightmost panel of Figure 4) and median response time. A higher value of Δω indicates greater reliance on the arithmetic mean strategy. The line indicates the least-squares geometric mean ("reduced major axis") linear regression fit.