Abstract

  1. Top of page
  2. Abstract
  3. 1 INTRODUCTION
  4. 2 RELATED WORK
  5. 3 PROPOSED MODEL
  6. 4 EXPERIMENTAL RESULTS
  7. 5 CONCLUSIONS
  8. Acknowledgements
  9. Appendix A
  10. REFERENCES

This paper proposes a model for trail detection and tracking that builds upon the observation that trails are salient structures in the robot's visual field. Due to the complexity of natural environments, the straightforward application of bottom-up visual saliency models is not sufficiently robust to predict the location of trails. As for other detection tasks, robustness can be increased by modulating the saliency computation based on a priori knowledge about which pixel-wise visual features are most representative of the object being sought. This paper proposes the use of the object's overall layout as the primary cue instead, as it is more stable and predictable in natural trails. Bearing in mind computational parsimony and detection robustness, this knowledge is specified in terms of perception-action rules, which control the behavior of simple agents performing as a swarm to compute the saliency map of the input image. For the purpose of tracking, multiframe evidence about the trail location is obtained with a motion-compensated dynamic neural field. In addition, to reduce ambiguity between the trail and trail-like distractors, a simple appearance model is learned online and used to influence the agents' activity. Experimental results on a large data set reveal the ability of the model to produce a success rate on the order of 97% at 20 Hz. The model is shown to be robust in situations where previous models would fail, such as when the trail does not emerge from the lower part of the image or when it is considerably interrupted. © 2012 Wiley Periodicals, Inc.

1 INTRODUCTION

Trails, such as those created for hikers and bikers, are usually safe pathways that are free of obstacles. A robot following a trail should thus be able to traverse large distances in off-road environments in an effortless way. First, computation for obstacle detection and path planning is simplified. Second, there are fewer chances of getting lost or encountering ground traps. A practical application of robots that can follow trails could be patrolling natural parks. Robots would be engaged in actively maintaining and cataloguing the environment and possibly providing support to hikers.

Without disregarding the benefits of using range information for the task of trail detection, provided, for instance, by stereovision or laser scanners, this paper addresses the problem from a complementary two-dimensional (2-D) vision perspective. In particular, this paper exploits the observation that trails are typically conspicuous in the robot's visual field to propose the use of visual saliency as the primary mechanism for trail detection. In other words, this work exploits the overall scene's context to guide the localization of the trail. A significant advantage of this approach is that it does not impose rigid constraints on the appearance or shape of both the trail and the background.

In addition to confirming the hypothesis that visual saliency and trail location are indeed positively correlated, it will also be shown that the conspicuity maps of a given input image correspond to efficiently computed segmentations of the latter (e.g., see Figures 13 and 10). That is, the segmentation of the input image, which can be a computationally intensive task, can be obtained as a by-product of determining which regions of the visual field detach more from the background at various scales. Furthermore, the obtained segments are already prioritized by their conspicuity level. Given that this saliency-based image segmentation is done by “labeling” each region according to its detachment level with respect to the overall scene, it does not require hard edges separating the regions, which is known to be a problem when oversegmenting an image (Cour & Shi, 2007).

From these findings, it should follow that the segment in the saliency map with the highest priority matches the location of the trail in the input image. However, in practice, this is a fragile assumption in the presence of not so well-behaved conspicuity maps, which may occur in the presence of distractors or when the trail is considerably heterogeneous. This difficulty can be diminished by top-down boosting of the set of appearance features (e.g., color) known beforehand to better describe the object being sought (Frintrop et al., 2005; Navalpakkam & Itti, 2005). However, due to the immense diversity of natural environments, the trail's appearance alone is an insufficient, and often erroneous, cue of the trail's location. Conversely, the overall shape of trails is a much more predictable feature. For example, the projection of trails onto the input image typically converges toward a vanishing point.

This type of a priori knowledge is embedded in the model herein proposed in the form of behavior rules for the motion of multiple simple agents inhabiting the conspicuity maps associated with specific visual features, namely color and intensity. The paths executed by these agents are used as skeletons of trail hypotheses. However, a system based solely on the individual behavior of such agents is unreliable when in the presence of less structured (e.g., interrupted) trails. To overcome these limitations, the proposed model exploits (1) the metaphor of collective intelligence (Franks, 1989) exhibited by insect swarms to ensure that the agents cooperatively build up, via pheromone-like interactions, a robust approximation of the actual trail's skeleton; and (2) dynamical neural fields (Amari, 1977; Rougier & Vitay, 2006) to accumulate evidence about the most likely trail location across frames, extended with a mechanism to compensate for robot motion.

Although the appearance of trails is harder to predict than their overall layout, a rough approximation of the former, provided that it is learned online, can still be useful to reduce the ambiguity between the trail and trail-like distractors. In this paper, a simple mechanism is used to learn the trail's appearance. Each agent deposits pheromone in proportion to the probability that the pixels it visited belong to the trail. As a result of numerous pheromone-like interactions, this mechanism allows the swarm activity to be spatially biased according to the expected trail's appearance. This approach renders a cross-influence between the perception of appearance and the perception of shape, which promotes robustness without hampering parsimony.

Extensive experimental results show that the proposed model attains a success rate on the order of 97% at 20 Hz in demanding and diverse scenarios, as depicted in Figure 1. The system's fast computation is largely due to the extensive use of bottom-up mechanisms.

Figure 1. Typical detection results (red overlay) obtained with the proposed model in a wide variety of paths, including hiking trails, biking trails, walkways, and dirt roads.

This paper extends three previous conference papers (Santana et al., 2010a, b, 2011b) and is organized as follows. Section 'RELATED WORK' presents an overview of related work. Then, Section 'PROPOSED MODEL' introduces the proposed model, including the way in which appearance information is incorporated into the system. Experimental results follow in Section 'EXPERIMENTAL RESULTS'. Finally, some conclusions are drawn and future work is proposed in Section 'CONCLUSIONS'.

2 RELATED WORK

Current trail detection methods rely considerably on work that has been developed in the road detection domain. The detection and tracking of paved roads is facilitated by their rather predictable appearance and the presence of well-delimited limits. However, this is not the case for poorly structured unpaved rural roads. The typical solution for such roads is to segment the road region from its surroundings by considering the aggregate of pixels whose likelihood of belonging to the road surface is above a given threshold. The likelihood of a given pixel belonging to the road surface can either be learned offline from labeled images (Alon et al., 2006; Chaturvedi & Malcolm, 2005) or, for more robust operation, learned online given a set of reference regions in the input image automatically labeled as road/nonroad. In the latter case, the labeling process is done assuming that the robot is on the road, and so the region right in front of it can be labeled definitively as road (Fernandez & Casals, 1997; Fernandez & Price, 2005; Song et al., 2007; Thorpe et al., 1988). It can also be done by exploiting short-range volumetric information obtained from other sensors (e.g., stereo or laser) to discriminate the road plane from others (Thrun et al., 2006; Tue-Cuong et al., 2008). To attain depth invariance, locally labeled visual elements can be traced back to the moment they entered the camera's field of view, i.e., when they were still in the far field (Lookingbill et al., 2007).

In general, once the segmentation is concluded, a simplified model of the road (e.g., trapezoidal) is fit to the segmented image. Region growing is an interesting alternative to the model fitting process when dealing with hard-to-model roads (Chaturvedi & Malcolm, 2005; Fernandez & Price, 2005; Ghurchian et al., 2004). However, by enforcing a global shape constraint, the model-based approach enables the substitution of the road/nonroad pixel classification process by an unsupervised clustering mechanism (Crisman & Thorpe, 1991). This reduces the burden of maintaining road appearance models at the expense of increasing the number of possible ambiguities between regions of the input image with a similar shape. A known alternative to these region-based approaches is to estimate the road's vanishing point (Kong et al., 2010; Rasmussen, 2008). This is usually done by extracting the dominant texture orientations, which are usually aligned with ruts, tire tracks, and road borders. This approach is particularly interesting when the road and background share the same appearance. The orientation-based and region-based approaches can also be integrated into hybrid architectures (Alon et al., 2006; Song et al., 2007).

As mentioned, these models are the basis of most work on trail detection. An example is the use of a priori knowledge about the color distributions of both the trail and the surroundings for their segmentation (Bartel et al., 2007). Robustness can be increased if these a priori models are substituted by models learned online in a self-supervising manner (Grudic & Mulligan, 2006; Rasmussen & Scott, 2008b). However, in contrast to the road domain, the definition of the reference regions from which it is possible to supervise the learning process is not easy. First, because trails vary in width and orientation, it is difficult to ensure that the robot is on the trail. As a result, it is hard to determine which regions of the input image can be used as references. Second, the trail and its surroundings often exhibit the same height, which hampers a straightforward use of depth information to determine a trail reference patch. The use of a global shape constraint (e.g., triangular) to avoid the learning process has also been tested in the trail detection domain (Blas et al., 2008; Rasmussen & Scott, 2008a). This is done by first oversegmenting the image and then scoring a set of trail hypotheses, built by aggregating sets of generated segments, against the global shape constraint. Accurate image oversegmentation is a computationally demanding task, which reduces the system's ability to run in real time. It also usually requires clear edges separating the object from the background, which is seldom the case in natural trails. This situation is aggravated by the considerable level of interruptions exhibited by natural trails. Moreover, a global shape constraint limits the type of trails that can be detected. Finally, due to the fact that dominant orientations in natural trails seldom indicate the global orientation of the trail, the vanishing point concept, a powerful concept in the road domain, hardly applies.

Instead of adapting a road detection method to the trail detection problem, this paper proposes the exploitation of a distinguishing characteristic of trails present in natural environments: they are rather conspicuous structures. Toward that end, we propose the use of visual saliency models to detect them. This approach does not impose any rigid constraints on the appearance or shape of both the trail and the background, nor does it require learning. In a parallel study, Rasmussen et al. (2009) proposed the use of local appearance contrast for trail detection. However, there is only a superficial resemblance to the concept of visual saliency. Visual saliency includes contrast information between the trail and the local surroundings, as well as contrast information between the trail and the overall scene. This is important because it is not guaranteed that the appearance of trails and their immediate surroundings always exhibit sufficient contrast to be robustly exploited. In addition, and in opposition to our model, Rasmussen et al. (2009) made the assumption that trails exhibit a perfect triangular shape when seen from perspective. They also assumed that both left and right sides share the same appearance. Although these assumptions comply with a large set of situations, they lack support in more demanding ones.

Seeking robustness and parsimony, we follow the long line of research on the use of the social insects metaphor for the design of computer vision systems (Antón-Canalís et al., 2006; Broggi & Cattani, 2006; Liu et al., 1997; Mazouzi et al., 2007; Mobahi et al., 2006; Owechko & Medasani, 2005; Poli & Valli, 1993; Ramos & Almeida, 2000; Zhang et al., 2008). The work that is most related to ours is that of Broggi and Cattani (2006), which detects the edges of poorly structured desert roads using a swarm-based system. However, trails in natural environments are seldom delimited from the background by strong edges, which is why a region-based approach is preferred. Furthermore, operating on the appearance space directly, and not on the conspicuity space, the work of Broggi and Cattani (2006) does not exploit the observation that trails are conspicuous structures in the environment. By following a region-based approach and by operating on the conspicuity space, our approach is better suited for the problem of trail detection in natural environments. We can also conclude that our model is the first swarm-based computation of visual saliency.

3 PROPOSED MODEL

As with a large body of other robotics and computer vision work, the work proposed herein benefits from biological inspiration. This inspiration is summarized in Section 'Biological Inspiration: Swarm-based Models', which is followed by an overview of the proposed model in Section 'System Overview'. The subsequent sections detail its key elements.

3.1 Biological Inspiration: Swarm-based Models

The fundamental aspects considered in this work are the use of visual attention for trail detection and the use of multiple agents for its computation. Visual attention is known to be widespread in the animal kingdom and it has been studied extensively in humans (Wolfe et al., 2011). By focusing perception, computation is simplified and robustness enhanced. As a consequence, faster robot motion, lower cost, and reduced robot size are enabled. Studies on human subjects support the hypothesis that multiple covert, i.e., mental, attention processes coexist in the brain (Doran et al., 2009). This evidence is the motivation for the multiagent approach used in this work. Agents perform local covert visual attention loops, whereas the self-organizing collective behavior maintains global spatiotemporal coherence. Additionally, because these agents are sensorimotor coordinated units, they can exploit the benefits of active vision (Ballard, 1991) at the information processing level, which includes the ability to actively select and shape their sensory input (Beer, 2003).

The use of multiple agents in the task of modeling robot cognitive behavior can benefit from knowledge obtained from nature, such as that related to the mechanisms underlying the swarm cognition exhibited by social insects, whose similarities to neuronal processes are becoming apparent (Passino et al., 2008; Santana & Correia, 2010; Trianni et al., 2011). In these processes, global colony behavior emerges from numerous interactions occurring among individuals. Many of these interactions are set through the environment via pheromones, a phenomenon known as stigmergy (Grassé, 1959). The computational interest in this metaphor lies in the possibility of producing complex behavior from inexpensive agents, i.e., agents with limited cognitive and sensing capabilities. For this reason, the swarm metaphor has inspired many in the development of computational models for solving optimization and search problems (Bonabeau et al., 1999). In these models, each agent moves stochastically on the solution space according to a set of domain-specific perception-action rules and biased by signals, like pheromones, cast by other agents. In this article, the proposed perception-action rules exploit a priori knowledge about the trail's location and shape so as to modulate the motion of the agents.

In the proposed model, a dynamical 2-D neural field simulates the physical medium in which pheromone is deposited and propagated in time. The leaky nature of dynamic neurons emulates pheromone evaporation, whereas the nonlinear lateral connectivity between neurons introduces complex spatial interactions between pheromone deposits, which promotes convergence. The option of using dynamical neural fields (Amari, 1977; Rougier & Vitay, 2006) for evidence accumulation across frames and improved focus of attention is also bio-inspired in the sense that neural fields have a long history in the modeling of human cognition (Beer, 1995; Thelen & Smith, 1996).

The coexistence of both swarm and neural paradigms is justified by the different ways pixel connectivity is handled in the steps of trail hypothesis generation and evidence accumulation. In the former, a set of pixels must be actively selected so as to approximate the trail's skeleton. This active nature calls for sensorimotor coordination, which is well handled by an agent-based design. Conversely, evidence accumulation concerns static nonlinear isotropic interactions between pixels, which is parsimoniously handled by a neural-based design.

3.2 System Overview

Figure 2 illustrates the different phases involved in the proposed model's operation. At each new frame I, two conspicuity maps, CC∈[0,1] for color and CI∈[0,1] for intensity information, are computed (see Section 'Conspicuity Maps Computation'). The intensity of a pixel in a given conspicuity map signals how much the pixel detaches from the background at several scales, in the scope of a given visual feature.

Figure 2. The system's operation overview (simplified). The model starts by creating from the input image I two conspicuity maps, CC and CI, encoding color and intensity information, respectively. The brightness level of these maps identifies the regions that detach more from the background in the corresponding visual feature. Then, two swarms of virtual agents, one per conspicuity map, operate so as to create two pheromone maps, PC and PI, which are superposed to create a final saliency map S. The brightness of the latter map indicates the most likely presence of the trail for the current frame. For evidence accumulation purposes, S feeds a neural field F. The self-link represents a transformation applied to the neural field to compensate for robot motion. The gray arrows represent the delayed use of the neural field state in the pheromone maps initialization. The red overlays in the pheromone maps represent the paths executed by two agents. Contrast and brightness of the images were manipulated for improved readability.

A set of n virtual ants (hereafter called p-ants, to denote perceptual-ants) is deployed on each conspicuity map (see Section 3.4.1). These p-ants interact based on the ant-foraging metaphor for several iterations to build two pheromone maps, PC∈[0,1] and PI∈[0,1] (see Section 3.4.2). The behavior of p-ants is designed to exploit a priori knowledge about the approximate layout of typical trails. Therefore, the activation of pheromone maps is expected to match the trail's location better than the activation of conspicuity maps, which are only sensory-driven. Thus, rather than combining both conspicuity maps to generate the final saliency map, as is typically done (Frintrop et al., 2005; Itti et al., 1998), in this work S is obtained by combining both pheromone fields, PC and PI. Additionally, by allowing p-ants on a given pheromone map to also affect the other pheromone map, cross-modality influences are implicitly (i.e., through stigmergy) maintained in the system. This increases robustness by allowing p-ants to exploit multiple cues indirectly, with a residual computational overhead.

The total pheromone deposited in the current frame, which is represented by the instantaneous saliency map S, is then used to update a permanent pheromone map, F∈[0,1], which is maintained across frames. This allows self-organization to occur at a longer time-scale and, as a consequence, enables tracking. In contrast with Kalman filters, this approach allows the tracking of several hypotheses, represented by competing pheromone traces, in parallel. Contrary to particle filters, this approach does not need a parametric model of the object being tracked. Despite these important differences with respect to typical tracking processes, the proposed model follows the typical approach of allowing the detection process to be influenced also by the tracking process. Concretely, at the onset of each frame, both instantaneous pheromone maps are initialized with a small ratio λ of the permanent pheromone map, PI←λF, PC← λF. This induces stability and robustness to noise and temporarily misbehaved conspicuity maps (i.e., those that are unable to properly discern between the trail and the background in the presence of distractors), and it enables across-frames progressive improvement.

The permanent pheromone map F is implemented with a 2-D dynamical neural field (Amari, 1977; Rougier & Vitay, 2006), which introduces two interesting properties to the model (see Section 3.6). The first is due to the leaky characteristic of dynamic neurons, which simulates pheromone evaporation. The second results from the nonlinear connectivity between neurons, which introduces complex spatial interactions between pheromone deposits. That is, the neurons' short-range excitatory connections enable radial diffusion of pheromone, whereas their long-range inhibitory connections foster pheromone concentration around a coherent focus of attention. The use of neural fields for attention modeling, although not in the context of swarms, has already been demonstrated (Rougier & Vitay, 2006).

Motion compensation between consecutive frames is included in the proposed model so that the dynamics of the neural field can be decoupled from the dynamics of the robot. The output of the system is given by the current state of the neural field, in which the higher the activation of a given neuron, the higher are its chances of being associated with a trail's pixel.

As was mentioned, the neural field and cross-modality influences are useful to modulate the creation and behavior of the p-ants. However, if allowed to propagate across frames, these influences may induce an undesirable buildup of activity in the neural field. To avoid this, two auxiliary pheromone maps, PI* and PC*, are created free of influences. For instance, the auxiliary map PI* only encompasses the pheromone deposited by the p-ants associated with visual feature I. These maps are used to replace the pheromone maps, PI ← PI*, PC ← PC*, just before blending them for the purpose of creating S.

To help reduce ambiguity between the trail and trail-like distractors, a simple appearance model of the trail, h_ref, is learned online in a self-supervised way. That is, a threshold-based binarization of the neural field is used to label each pixel of the input image as trail/nontrail. These labels are then used to supervise the learning process. The p-ants use the learned appearance model to specify the amount of pheromone to be deposited. The more likely it is, according to the appearance model, that the pixels traversed by a p-ant are part of the trail, the more pheromone is deposited.

Conspicuity maps, pheromone maps, saliency map, and neural field all share the same width w and height h. These two values are selected bearing in mind real-time performance. For a better understanding of the proposed model, Algorithm 1 outlines its pseudocode. Details are given in the following sections.

3.3 Conspicuity Maps Computation

As mentioned earlier, conspicuousness computation is about determining which regions of the input image detach more from the background at several scales in a given feature channel. Although in this paper only intensity and color channels are used, additional channels (e.g., orientations, texture) could be used for improved background-trail discrimination. The following describes the biologically inspired model proposed by Itti et al. (1998) for conspicuity computation, herein properly adapted to the task at hand.

The method is based on a Gaussian-Laplacian scale-space, which starts by computing, from the intensity channel, one dyadic Gaussian pyramid (Burt & Adelson, 1983) with eight levels. Two additional pyramids, also with eight levels, are computed to account for the Red-Green and Blue-Yellow double-opponency color feature subchannels. Each level corresponds to a given scale. Various scales are then used to create a set of on-off and off-on center-surround maps per pyramid (Itti et al., 1998). These have a higher intensity on those pixels whose corresponding feature differs the most from their surroundings. On-off center-surround maps are built by across-scale point-by-point subtraction of a level with a coarser scale from a level with a finer one. Off-on maps are computed the other way around, i.e., subtracting the finer level from the coarser one. Rather than considering the absolute value of the difference, as in the original model (Itti et al., 1998), we consider both on-off and off-on center-surround maps separately, as this has been shown to yield better results (Frintrop, 2006; Frintrop et al., 2005). Then, all center-surround maps built from the intensity pyramid are resized to a common size, independently scaled in magnitude according to the method described in the next sections, and finally averaged together to produce the intensity conspicuity map. The same process is applied to create the Red-Green and Blue-Yellow conspicuity maps, which are subsequently weighted and averaged together to produce a single color conspicuity map. Note that all maps are 8-bit images.
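The center-surround step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "on-off" means center minus surround (clipped at zero) and "off-on" the reverse, it uses a 2×2 box average as a stand-in for Gaussian smoothing, it works on floating-point values rather than 8-bit images, and all function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (box filter standing in for the Gaussian)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1] +
              img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]


def build_pyramid(img, levels):
    """Dyadic pyramid: level 0 is the input, each further level is half the size."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def resize_nearest(img, h, w):
    """Nearest-neighbour resize to h rows by w columns."""
    src_h, src_w = len(img), len(img[0])
    return [[img[r * src_h // h][c * src_w // w] for c in range(w)] for r in range(h)]


def center_surround(fine, coarse):
    """Return (on_off, off_on) maps at the resolution of the fine level."""
    h, w = len(fine), len(fine[0])
    surround = resize_nearest(coarse, h, w)
    on_off = [[max(fine[r][c] - surround[r][c], 0.0) for c in range(w)] for r in range(h)]
    off_on = [[max(surround[r][c] - fine[r][c], 0.0) for c in range(w)] for r in range(h)]
    return on_off, off_on


# Toy input: a bright 2x2 blob on a dark 8x8 background.
image = [[0.0] * 8 for _ in range(8)]
for r in (3, 4):
    for c in (3, 4):
        image[r][c] = 1.0

pyramid = build_pyramid(image, levels=3)
on_off, off_on = center_surround(pyramid[0], pyramid[2])
```

In this toy case the on-off map responds strongly on the bright blob, while the off-on map responds only weakly on the darker background around it. Keeping the two maps separate preserves the sign of the contrast, which taking the absolute difference would discard.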

3.3.1 Typical Magnitude Scaling Functions

Magnitude scaling functions return a scaled version of each map, obtained by multiplying every pixel by a scalar factor. The goal is to promote maps that have fewer conspicuous locations. As pointed out by Itti et al. (1998), this avoids the situation in which, when combining maps, salient objects appearing strongly in only a few maps are masked by noise or by less salient objects present in a larger number of maps.

In the original model (Itti et al., 1998), the scaling factor to be applied to a given map X is defined by the square of the difference between the map's global maximum, $M(X)$, and the average of all its other local maxima, $\bar{m}(X)$. The corresponding scaling function, $\mathcal{N}$, multiplies the magnitude of each pixel by the computed factor:

$$\mathcal{N}(X) = \left(M(X) - \bar{m}(X)\right)^2 \cdot X \tag{1}$$

A similar approach has been proposed by Frintrop et al. (2005) and Frintrop (2006). In this case, the scaling function, $\mathcal{W}$, is defined by

$$\mathcal{W}(X) = \frac{X}{\sqrt{M(X)}} \tag{2}$$

where, in this case, M(X) is the number of the map's local maxima above a given threshold. The threshold is by default 50% of the map's global maximum (Frintrop, 2006).
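The two scaling functions can be compared with a minimal sketch. It rests on several assumptions: a local maximum is taken to be a pixel strictly greater than its eight neighbours, the Frintrop-style function is assumed to divide the map by the square root of the number of local maxima above the threshold, and all function names are hypothetical.

```python
import math


def local_maxima(X):
    """Values of the pixels that are strictly greater than all their 8-neighbours."""
    h, w = len(X), len(X[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            neighbours = [X[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, h))
                          for cc in range(max(c - 1, 0), min(c + 2, w))
                          if (rr, cc) != (r, c)]
            if all(X[r][c] > n for n in neighbours):
                peaks.append(X[r][c])
    return peaks


def scale_itti(X):
    """Multiply the map by (M - m_bar)^2, m_bar being the mean of the other local maxima."""
    M = max(max(row) for row in X)
    others = local_maxima(X)
    if M in others:
        others.remove(M)  # drop one occurrence of the global maximum
    m_bar = sum(others) / len(others) if others else 0.0
    factor = (M - m_bar) ** 2
    return [[factor * v for v in row] for row in X]


def scale_frintrop(X, ratio=0.5):
    """Divide the map by sqrt(number of local maxima above ratio * global maximum)."""
    M = max(max(row) for row in X)
    count = sum(1 for p in local_maxima(X) if p > ratio * M)
    return [[v / math.sqrt(max(count, 1)) for v in row] for row in X]


def blank(size=5):
    """Square map filled with zeros."""
    return [[0.0] * size for _ in range(size)]


single_peak = blank()
single_peak[2][2] = 1.0

two_peaks = blank()
two_peaks[1][1] = 1.0
two_peaks[3][3] = 0.8
```

Under these assumptions, both functions leave `single_peak` unchanged and attenuate `two_peaks`, which is the intended promotion of maps with few conspicuous locations.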

Common to both methods is the use of local maxima information. Although this might be appealing, it does not always embody the information intended to be captured. Large homogeneous structures, such as the sky, generally encompass only a few local maxima. In this situation, the sky would be undesirably considered highly conspicuous, despite its large footprint in the whole image. A second aspect is that the two analyzed saliency models consider that all pixels contribute equally to the saliency computation. However, except at extreme tilt/roll angles, the upper region of the image has little relevant information for trail detection. As a consequence, without a space-variant contribution to the final saliency map, feature maps that are only discriminative in the lower part of the image, and consequently interesting for trail detection, would not be adequately promoted.

3.3.2 Novel Scaling Function

Given the aforementioned limitations of previous scaling functions for the task of trail detection, a new one is proposed herein. Rather than considering only the map's local maxima when averaging, as is done in the scaling function of Eq. (1), we propose to use all pixels. Furthermore, the contribution of each pixel to the average is weighted according to its distance from the top row.

Formally, let $X(c, r)$ return the gray level of the pixel in column c and row r of a given map X. Let $w(c, r)$ be the weight of the pixel at position (c, r). The map's weighted average, $m_w$, is thus given by

$$m_w = \frac{\sum_{c,r} w(c, r)\, X(c, r)}{\sum_{c,r} w(c, r)} \tag{3}$$

and, similarly to the scaling function of Eq. (1), the proposed scaling function, $\mathcal{N}_w$, takes the form

$$\mathcal{N}_w(X) = \left(M(X) - m_w\right)^2 \cdot X \tag{4}$$

Prior to scaling, maps are normalized to the [0,1] amplitude interval, meaning that M(X)=1 in all cases. To reduce computational cost, the proposed system uses image operators over 8-bit images. An example of two conspicuity maps obtained with the proposed model is depicted in Figure 2.
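The proposed scaling can be sketched as follows. The exact weighting is not recoverable from the text, so this sketch makes three assumptions: the weight grows linearly from 0 at the top row to 1 at the bottom row, the weighted average is normalized by the sum of the weights, and normalization divides the map by its maximum. All function names are hypothetical.

```python
def normalize(X):
    """Scale a non-negative map so that its global maximum is 1 (assumed normalization)."""
    peak = max(max(row) for row in X)
    if peak == 0:
        return [row[:] for row in X]
    return [[v / peak for v in row] for row in X]


def row_weight(r, h):
    """Hypothetical weight: 0 at the top row, growing linearly to 1 at the bottom row."""
    return r / (h - 1) if h > 1 else 1.0


def weighted_mean(X):
    """Weighted average over all pixels, normalized by the sum of the weights."""
    h = len(X)
    num = sum(row_weight(r, h) * v for r, row in enumerate(X) for v in row)
    den = sum(row_weight(r, h) * len(row) for r, row in enumerate(X))
    return num / den if den else 0.0


def scale_proposed(X):
    """Multiply the normalized map by (M - m_w)^2, with M = 1 after normalization."""
    Xn = normalize(X)
    factor = (1.0 - weighted_mean(Xn)) ** 2
    return [[factor * v for v in row] for row in Xn]


# A map with one small blob near the bottom versus one lit over its whole lower half.
sparse = [[0.0] * 4 for _ in range(4)]
sparse[3][1] = 1.0
dense = [[0.0] * 4, [0.0] * 4, [1.0] * 4, [1.0] * 4]
```

With these assumed weights, `scale_proposed(sparse)` keeps most of the blob's magnitude, whereas `scale_proposed(dense)` is strongly attenuated. Because every pixel enters the average, a large homogeneous region lowers the factor even though it contains few local maxima.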

To quantitatively assess the proposed scaling function, we generated conspicuity maps for the input images present in a test data set (Santana et al., 2010b). The two conspicuity maps produced for each image are combined to produce a saliency map, which is in turn thresholded to produce a pixel-wise trail/nontrail classification. The resulting classification is subsequently compared against hand-labeled ground truth. This process is repeated for all images to obtain an average True Positive Rate (TPR) and an average False Positive Rate (FPR). Varying the threshold between the minimum and maximum pixel magnitude, we plot for each threshold a point in a TPR vs. FPR graph. The interpolation between those points produces a Receiver Operating Characteristic (ROC) curve.
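The evaluation procedure above can be sketched in a few lines. This is an illustrative reading of the text rather than the authors' evaluation code. The function names and the toy data are hypothetical, and a pixel is assumed to be classified as trail when its saliency is greater than or equal to the threshold.

```python
def rates(saliency, truth, threshold):
    """True and false positive rates of `saliency >= threshold` against a binary mask."""
    tp = fp = fn = tn = 0
    for s_row, t_row in zip(saliency, truth):
        for s, t in zip(s_row, t_row):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


def roc_points(dataset, thresholds):
    """dataset: list of (saliency_map, ground_truth_mask). Returns [(avg FPR, avg TPR), ...]."""
    points = []
    for th in thresholds:
        per_image = [rates(s, t, th) for s, t in dataset]
        avg_tpr = sum(tpr for tpr, _ in per_image) / len(per_image)
        avg_fpr = sum(fpr for _, fpr in per_image) / len(per_image)
        points.append((avg_fpr, avg_tpr))
    return points


# Toy data set with a single 2x2 saliency map and its hand-labeled mask.
saliency = [[0.9, 0.2],
            [0.6, 0.1]]
truth = [[1, 0],
         [1, 0]]
curve = roc_points([(saliency, truth)], thresholds=[0.0, 0.5, 1.0])
```

Each entry of `curve` is one point of the ROC plot. A point above the diagonal (TPR greater than FPR) indicates that saliency carries information about the trail's location at that threshold.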

Figure 3 depicts the ROC curve for the proposed scaling function [Eq. (4)], as well as for its predecessors [Eq. (1) and Eq. (2)]. These curves show that the proposed scaling function consistently produces a better trade-off between TPR and FPR. The small difference between the ROC curves could suggest that only a small quantitative improvement was obtained with the proposed model. However, the averaging procedure used to build the curves hides the fact that neither of the other two methods allocated higher levels of saliency to trail regions than to the background as often as the proposed one.

Figure 3. Scaling functions comparison. Each plot corresponds to the ROC curve obtained with a given scaling function.


Figure 3 also shows that all ROC curves are considerably above the line of no-discrimination (y=x), meaning that saliency is correlated with trail location, an important contribution by itself. However, the correlation is still lower than that required for high accuracy trail detection. That is, there is no single threshold on the saliency map that clearly segments the trail for all images in the data set. Thus, a higher level analysis of the conspicuity maps is required. As will be shown, the swarm-based system described in the following section is capable of providing it.

3.4 Appearance Model Learning

The trail's appearance model of the current frame is a simple color histogram, H, of the pixels in the region of higher neural field activity. To reduce sensitivity to illumination effects, the HSV color space is used. To further reduce this sensitivity, the H(ue) component is described by 12 bins, the S(aturation) component by only 8 bins, and the V(alue) component is discarded altogether.

This frame-wise appearance model is used to update an across-frames appearance reference model,

  • display math(5)

where Θ(F)=κ·max(F) sets the speed at which the reference model adapts to changes in the trail's appearance, in proportion to the neural field's maximum activity. This weighting allows the appearance model to be updated more strongly when the system is more confident that its output is a correct segmentation of the trail from the background. This assumption follows from the fact that the more stable the pheromone maps' activity is across frames, the higher the neural field's maximum. Hence, the presence of distractors is less likely to affect the reference appearance model.

To reduce the chances of learning erroneous appearance models due to the presence of distractors, the appearance reference model is only updated if the neural field in the current frame reports the trail as being located roughly (±10% of the map's width) at the center of the image. This is a reasonable heuristic under the assumption that the robot is actively centering itself along the trail to follow it. This learning gating process prevents the reference model from learning the appearance of transient distractors appearing on the sides of the trail. Furthermore, it allows the system to delay the learning phase when the robot does not start centered on the trail.
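The appearance learning described in this section can be sketched as below. The bin counts (12 hue, 8 saturation), the gain κ, and the ±10% gating come from the text. The blending rule in `update_reference` is only an assumed exponential-average form of Eq. (5), and all function and parameter names are hypothetical.

```python
import colorsys

H_BINS, S_BINS = 12, 8  # bin counts stated in the text; the V component is discarded


def hs_histogram(rgb_pixels):
    """Normalized 12x8 hue-saturation histogram of (r, g, b) pixels in [0, 1]."""
    hist = [[0.0] * S_BINS for _ in range(H_BINS)]
    for r, g, b in rgb_pixels:
        h, s, _v = colorsys.rgb_to_hsv(r, g, b)
        hi = min(int(h * H_BINS), H_BINS - 1)
        si = min(int(s * S_BINS), S_BINS - 1)
        hist[hi][si] += 1.0
    total = sum(map(sum, hist)) or 1.0
    return [[v / total for v in row] for row in hist]


def update_reference(h_ref, h_frame, field_max, trail_col, width, kappa=0.001):
    """Blend the frame histogram into the reference one (ASSUMED form of Eq. 5)."""
    # Gating: learn only when the trail lies within +/-10% of the width from the centre.
    if abs(trail_col - width / 2.0) > 0.1 * width:
        return h_ref
    theta = kappa * field_max  # adaptation speed, Theta(F) = kappa * max(F)
    return [[(1.0 - theta) * r + theta * f for r, f in zip(ref_row, frame_row)]
            for ref_row, frame_row in zip(h_ref, h_frame)]
```

With this form, a low neural field maximum leaves the reference model almost unchanged, which matches the stated intent of learning more strongly when the segmentation is more trustworthy.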

3.5 Pheromone Maps Computation

This section describes how the two pheromone maps, PI and PC, are built from the two conspicuity maps, CI and CC. For this purpose, a given p-ant, pm, is created and associated with a given visual feature, m∈{I, C}. The other visual feature is represented by m'. While being iterated for η times, pm will move on Cm, influenced by the pheromone present in Pm. While moving, this p-ant deploys pheromone in each position visited in Pm with a magnitude Φ(pm, href) and a small portion of it, υ, in Pm':

  • display math(6)

where β is an empirically defined weighting factor, ε is an empirically defined pheromone level baseline, and inline image is the probability of the p-ants' path, inline image, to belong to the trail (T), given the learned appearance model href. Rather than having p-ants deploying a constant level of pheromone along their paths, this approach compels p-ants to deploy higher doses of pheromone on regions of the visual field whose appearance is similar to that of the trail.

The probability inline image is approximated by the average probability of pixels visited by the p-ant of belonging to the trail. These pixels are represented by the set inline image, and their individual probabilities are obtained directly from the normalized histogram href, according to a technique known as histogram backprojection (see Figure 4 for typical results). As the experimental results will show, this simple approach suffices to help p-ants track the trail.
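A minimal sketch of the path probability follows, assuming each visited pixel has already been mapped to its (hue bin, saturation bin) pair. The additive form `epsilon + beta * probability` used for the deployed magnitude is an assumption about Eq. (6), suggested by the roles the text assigns to β and ε; the function names are hypothetical.

```python
def path_trail_probability(path_bins, h_ref):
    """Average backprojected probability over the bins of the pixels visited on a path."""
    if not path_bins:
        return 0.0
    # Backprojection: each pixel takes the value of the histogram bin its colour falls in.
    return sum(h_ref[hi][si] for hi, si in path_bins) / len(path_bins)


def pheromone_magnitude(path_bins, h_ref, beta=0.01, epsilon=0.008):
    """ASSUMED additive form: baseline plus an appearance-weighted term."""
    return epsilon + beta * path_trail_probability(path_bins, h_ref)
```

Under this assumption, a path crossing regions that do not match the learned appearance deploys only the baseline ε.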

Figure 4. Pixel-wise trail probability (brightness level) for two typical images.


After the iterations for this p-ant, a p-ant associated with the other visual feature, pm', is created and iterated following the same procedure. Afterward, the two p-ants are removed from the system and the process is repeated n times, meaning that 2n p-ants are created and iterated (see Algorithm 1).

Allowing p-ants to affect both pheromone maps enables a loosely coupled cross-modality influence, thus allowing each p-ant to exploit multiple cues indirectly while maintaining their simplicity. As will be shown, the deployed pheromone is a function of p-ants' sensations across their trajectories. Hence, it is influenced by the activity occurring in distant regions of the map. This long-range spatial connectivity allows the potentially large size of trails to be handled in a robust and parsimonious way.

Section 3.5.1 describes the p-ants' creation process, whereas the iteration process is described in Section 3.5.2.

3.5.1 p-ant's Creation

The chances of creating a p-ant pm on a given location inline image of the conspicuity map Cm depend on the level of conspicuity at that location and on the level of pheromone at the same location in the corresponding pheromone map, Pm. Hence, p-ants are progressively and probabilistically deployed where there are more chances of there being a trail, under the assumptions that (1) trails tend to be conspicuous; (2) the trail has been successfully detected in the previous frame (represented by the feedback provided by the delayed neural field state); and (3) the pheromone accumulated by p-ants deployed in the current frame builds up mostly around the actual trail's location.

By assuming that trails often start from the bottom of the image, p-ants are deployed with a small randomly selected offset z∈[0, 0.1·h] from the bottom of the conspicuity map in question, i.e., at row r∈[h−z, h], where h is the height of the map.1 This random small offset reduces sensitivity to any noise potentially present at the map's boundaries.

To determine the column where pm is deployed, a unidimensional vector vm=(vm0, …, vmw) is first computed. The element vmk of vm refers to the average conspicuity level of the pixels in a small window centered on column k and with a randomly selected offset from the bottom row of the map, r:

  • display math(7)

where Cm(l, j) returns the conspicuity level in position (l, j), and δw and δh are the width and the height of the window, respectively. The same windowing process is applied to build a vector for the pheromone field in question, um=(um0, …, umw). However, this time, element umk corresponds to the maximum pheromone level found in the window:

  • display math(8)

where Pm(l, j) returns the pheromone level in position (l, j). The max operator is employed to benefit those regions where the paths of p-ants overlap more often and consequently where there is a higher consensus on the trail's skeleton position. Using these two vectors in the following test, which is repeated until it succeeds, the chances of deploying a p-ant in a randomly selected column z2·w are as high as the conspicuity and pheromone levels at the deployment region:

  • display math(9)

where z1∈[0,1] and z2∈[0,1] are numbers sampled from a uniform distribution each time the test is performed, and ρ is a weight factor used to trade off the influence of both pheromone and conspicuity information. By starting with a small value, ρ0, and by linearly growing at each iteration by an amount Δρ, ρ operates as an adaptive process, compelling the system to move from a conspicuity-driven operation (exploration) to a pheromone-driven operation (refinement/exploitation).
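The deployment procedure can be sketched as below. The window sizes δw=9 and δh=5 and the offset range come from the text. The acceptance test `z1 < (1 - rho) * v_k + rho * u_k` is only an assumed form of Eq. (9), the window is assumed to extend upward from row r, and both maps are assumed to be normalized to [0, 1]. All names are hypothetical.

```python
import random


def window_stat(grid, k, r, dw, dh, reduce):
    """Apply `reduce` to the grid values in a dw x dh window centred on column k, ending at row r."""
    h, w = len(grid), len(grid[0])
    values = [grid[j][l]
              for j in range(max(0, r - dh + 1), min(h, r + 1))
              for l in range(max(0, k - dw // 2), min(w, k + dw // 2 + 1))]
    return reduce(values)


def mean(values):
    return sum(values) / len(values)


def deploy_column(consp, pher, rho, rng=random, dw=9, dh=5, max_tries=10000):
    """Repeat the probabilistic test until a deployment column is accepted."""
    h, w = len(consp), len(consp[0])
    r = h - 1 - int(rng.random() * 0.1 * h)  # row near the bottom, offset z in [0, 0.1*h]
    for _ in range(max_tries):
        z1, z2 = rng.random(), rng.random()
        k = min(int(z2 * w), w - 1)                  # candidate column z2 * w
        v_k = window_stat(consp, k, r, dw, dh, mean)  # average conspicuity
        u_k = window_stat(pher, k, r, dw, dh, max)    # maximum pheromone
        if z1 < (1.0 - rho) * v_k + rho * u_k:        # ASSUMED form of the test
            return k, r
    return None  # no column accepted within max_tries
```

Increasing `rho` by Δρ after each deployment, starting from ρ0, shifts the test from conspicuity-driven to pheromone-driven, as described above.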

3.5.2 p-ant's Execution

Before specifying the p-ants' behaviors, it is necessary to specify their sensory and action spaces. To reduce both sensitivity to noise and computational cost, the sensory input is defined by five coarse receptive fields disposed around the p-ant's current position, R1, …, R5 (see Figure 5). For a given visual feature m and p-ant's position inline image, inline image and inline image return the average conspicuity and pheromone levels of the pixels constituting receptive field Rk, respectively. Parameter inline image is used to transform the p-ant's centered receptive field onto the map's frame of reference. To refer directly to the pixel-wise conspicuity and pheromone levels at the p-ant's position, inline image and inline image are used, respectively. An action a∈A moves the p-ant to one of the five neighbor pixels not behind the current p-ant's position. The action space is thus defined by the set A={1, 2, 3, 4, 5} (see Figure 5).

Figure 5. p-ants' sensory and action spaces. Space is discretized in pixels and only the ones that the ant is able to perceive are represented. Regions surrounding the current p-ant's position, inline image, are segmented into a set of receptive fields, R1={1, 6, 11}, R2={2, 7}, R3={3, 8}, R4={4, 9}, R5={5, 10, 12}, whose composing pixels are numbered as in the figure. If a given action a∈A is selected, then the next p-ant's position will be the closest pixel to the p-ant, represented by the pixels in bold.


At each iteration of a maximum η, p-ant pm executes a set of behaviors B = {greedy, track, center, ahead, commit}, which independently vote on each possible action in A. Following a typical approach of behavior coordination (Rosenblatt, 1995), the most voted action is the one taken by the p-ant. Table I summarizes, for each behavior, which regions in the neighborhood of the p-ant are associated with the most preferred action.

Table I. p-ant behaviors for trail detection

Behavior | Voting Preferences
greedy | Regions of higher levels of conspicuity, under the assumption that trails are salient in the input image.
track | Regions whose average level of conspicuity is more similar to the average level of conspicuity of all the pixels visited by the p-ant, under the assumption that the trails' appearance is homogeneous.
center | Regions that maintain the p-ant equidistant to the boundaries of the trail hypothesis being pursued. That is, the p-ant will prefer regions that are closer to the centroid inline image of the horizontal segment in the conspicuity map where the current p-ant is. The segment is obtained by considering all pixels, represented by the set inline image, that are connected to the p-ant's current position, inline image, through a set of pixels sharing the same row and a similar, within a given margin ζ, conspicuity level of the former. See Figure 6 for an illustration.
ahead | Upward regions under the assumption that trails are often vertically elongated.
commit | Region targeted by the motor action at the previous iteration, under the assumption that the trails' orientation tends to be monotonous.

To allow the system to operate with unstructured trails, these behaviors are simple and make few assumptions regarding the trail's structure. Each behavior exploits a specific piece of a priori knowledge about the trail's shape or appearance to partially contribute to the goal of producing a p-ant trajectory representative of the trail skeleton. For instance, under the assumption that trails are somewhat monotonous structures, p-ants should move under the influence of some inertia. This is implemented by having the commit behavior vote more strongly for the action that is most similar to the one selected in the previous iteration.

Formally, for a given p-ant pm, behaviors are described as functions that return a vote in the interval [0,1] for each possible action a∈A:

  • display math(10)
  • display math(11)
  • display math(12)
  • display math(13)
  • display math(14)

where inline image is the Heaviside function, a'pm is the p-ant's action selected in the previous iteration (see Algorithm 1), Vpm is a list whose elements are scalars representing the conspicuity level at each p-ant's previously visited position, and dpm is the normalized deviation of inline image to centroid inline image, inline image, with col(·) returning the column coordinate of a given map position (see the third row of Table I and Figure 6).

Figure 6. An example illustrating the key aspects of the center behavior. The dotted line represents the p-ant's motion from the first iteration until the current one. The pixels composing the thicker horizontal line define the set inline image. The p-ant will try to approach this line's centroid inline image, represented by the white filled circle, which deviates from the current agent's position, inline image, by inline image pixels.


As will be shown, all these behaviors contribute to p-ant trajectories that closely represent the trail's skeleton. The absence of an explicit scoring function, which would require a model-based imposition of constraints on the trail's shape, hampers a post-ranking of all deployed p-ants to determine the “best trajectory.” Moreover, not all p-ants will be deployed on the trail and so not all are able to follow the actual trail. To overcome these challenges, two ingredients of the system are key.

The first ingredient comes in the form of positive feedback brought about by the amplification of random fluctuations. With additive random fluctuations at the p-ants' actuation level, those that are deployed off the trail will diverge, whereas p-ants deployed on the trail will converge toward its vanishing point, thanks to the center behavior. Hence, there will be higher concentrations of pheromone on trail regions. This happens because the presence of the trail tends to be a global constraint that is only felt by the p-ants deployed on it. In a sense, the trail operates as an attractor for the self-organizing system.

The second ingredient is the use of stigmergy in the form of pheromone-based interactions. By making p-ants attracted by high pheromone concentration regions, we positively reinforce the difference between diverging and converging p-ants (symmetry breaking). This is further reinforced by the influence imposed by the appearance model. Hence, this second ingredient ensures that, in time, the structure imposed by the presence of the trail on the center behavior will be stronger than the effects of random fluctuations. This effect is magnified by the fact that p-ants are deployed according to the level of pheromone already present in the pheromone maps, which are in turn influenced by the neural field. This positive feedback is counterbalanced by the negative feedback resulting from pheromone evaporation occurring at the neural field level. Moreover, the fact that robot forward motion tends to make the neural field skew toward the bottom of the image allows regions of highest activity in further regions of the visual field to be more likely to recruit p-ants. The use of pheromone-based interactions has the additional advantage of overcoming the unreliability of controlling p-ants based on myopic behaviors. The local interruption of a trail, which could inhibit the center behavior from properly leading the p-ant along the trail, is overcome by having p-ants progressively building a pheromone “bridge” over the interruption thanks to commit and ahead behaviors.

To take these considerations into account, in each iteration a p-ant pm selects its action inline image by maximizing the following utility function, which incorporates behaviors' votes, pheromone-based interactions, and random fluctuations:

  • display math(15)

where αb is a user-defined weight accounting for the contribution of behavior b∈B, and γ is the weight accounting for stochastic behavior, where q∈[0,1] is a number sampled from a uniform distribution each time the action is evaluated. To match the randomness magnitude with the scale of the image, which is typically smaller for pixels in the upper regions of the image, the weight γ starts with an initial value γ0 and exponentially decays by a constant factor γτ at each iteration.

If an immediate loop is detected, i.e., the p-ant moving recurrently from one pixel to another, then the action for the current iteration is randomly selected. Finally, the p-ant's position inline image is updated according to the selected action. Algorithm 2 outlines the overall iteration process.
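The action selection can be sketched as below. The behavior weights are the ones reported in the experimental setup. The additive utility with a unit-weighted pheromone term is only an assumed form of Eq. (15), and the multiplicative decay `gamma0 * (1 - decay) ** iteration` is one possible reading of the exponential decay of γ. Function and parameter names are hypothetical.

```python
import random

ACTIONS = (1, 2, 3, 4, 5)
ALPHA = {"greedy": 0.45, "track": 0.35, "center": 0.10, "ahead": 0.05, "commit": 0.05}


def select_action(votes, pheromone, gamma, rng=random):
    """Pick the action of highest utility; votes[b][a] and pheromone[a] lie in [0, 1]."""
    def utility(a):
        behaviour_term = sum(ALPHA[b] * votes[b][a] for b in ALPHA)
        # q is re-sampled every time an action is evaluated.
        return behaviour_term + pheromone[a] + gamma * rng.random()
    return max(ACTIONS, key=utility)


def gamma_at(iteration, gamma0=0.4, decay=0.02):
    """ASSUMED decay schedule for the randomness weight."""
    return gamma0 * (1.0 - decay) ** iteration
```

With `gamma` set to zero the choice is deterministic and reduces to weighted voting plus pheromone attraction; larger values let early iterations explore more.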

3.6 Evidence Accumulation

Once the p-ants' activity has ceased, the instantaneous saliency map, S, feeds a 2-D dynamic neural field F (Amari, 1977; Rougier & Vitay, 2006), which is a lattice of laterally connected neurons. Its goal is to integrate evidence across time, to consider competition between multiple focuses of attention, and to promote perceptual grouping.

The dynamical characteristic of the neural fields, displayed in the form of inertia, is the key element that enables information to be integrated across time. However, if not properly handled, this property causes the field to smear when the robot moves.

A way of avoiding this undesirable effect is to shift the neural field's activity according to the robot motion estimate by using asymmetrical kernels in the neurons (Zhang, 1996). However, because the neural field is a representation of the environment through a projection process, and considering the fact that the robot may incur both rotation and translation, it is more straightforward and consequently more effective to affect the neural field's activity directly, i.e., to consider the neural field's state as an image to be transformed with a warping operator. Along these lines, the following three steps explicitly compensate the neural field for the camera motion between two frames (see Algorithm 1):

  1. Estimate the homography matrix H that describes the projective transformation between the current frame, I, and the previous one, I'. This step is detailed in Section 3.6.1.

  2. Obtain a motion-compensated version of the previous neural field's state by using the estimated homography matrix, F←HF.

  3. Update F with the saliency map S. This step is detailed in Section 3.6.2.

3.6.1 Homography Matrix Estimation

To estimate the projective transformation H, a set of corner points (Tomasi & Shi, 1994) is first detected in the previous frame, I'. These points are then tracked in the current frame, I, with a pyramidal implementation of the Lucas-Kanade feature tracker (Bouguet, 1999). The resulting sparse optical flow is then used to estimate the projective transformation relating both frames, i.e., the 3×3 homography matrix H, such that

  • display math(16)

where ui is a corner point found in I and u'i is its correspondence in I'. Due to noise in the tracking process, the homography matrix is calculated as the least-squares solution that minimizes the backprojection error (Bradski & Kaehler, 2008). This process assumes that distortion introduced by the camera lens into the input images has been corrected. It also assumes that either (1) the terrain in front of the robot is planar or (2) the camera was only rotated, and not displaced, between frames. Neither of these two constraints can be strictly ensured in off-road environments. Still, in most situations the terrain is somewhat planar and the attitude of the camera changes more significantly than its position. Experiments have shown that the joint satisfaction of these two relaxed constraints is sufficient to maintain a robust operation. If a minimum of four correspondences between corner points is not found, the homography matrix is set to the identity matrix, H=diag(1, 1, 1).
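To make the role of H concrete, the sketch below applies a 3×3 homography to a point in homogeneous coordinates and implements the identity fallback for fewer than four correspondences. The `estimate` argument is a hypothetical placeholder for any least-squares homography solver; it is not the routine used in the paper.

```python
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def apply_homography(H, point):
    """Map an image point through H, normalizing the homogeneous coordinate."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)  # w is assumed non-zero for points of interest


def choose_homography(correspondences, estimate):
    """Fall back to the identity when fewer than four correspondences are available."""
    if len(correspondences) < 4:
        return IDENTITY
    return estimate(correspondences)
```

Warping the neural field then amounts to applying this mapping to every neuron position and resampling the activity at the transformed locations.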

3.6.2 Neural Field Update

The neural field F is a 2-D lattice of w×h neurons, each one corresponding to one pixel of the saliency map. The neurons have “Mexican-hat”-shaped lateral coupling, implemented by Difference of Gaussians (DoG). This interneuron coupling helps in the formation of a coherent focus of attention (Rougier & Vitay, 2006). On the one hand, activated neurons excite their neighbors, thus promoting perceptual grouping. On the other hand, activated neurons tend to inhibit distant ones, thus reducing ambiguities in the focus of attention.

Formally, the connection's weight w(x, x′) between a neuron in position x and a neuron in position x′ is given by a DoG function of the Euclidean distance between both. In addition to lateral connectivity, the neural field also has afferent interactions with the saliency map S. The weight d(x, y) of a connection between an element of S in position y and a neuron of F in position x is given by a Gaussian function of the Euclidean distance between both. This operation enlarges the neurons' receptive field to reduce sensitivity to noise.

In continuous time, the average membrane potential of a given neuron at position x can now be expressed by the nonlinear integrodifferential equation:

  • display math(17)

where f(x)=x in this paper, τ is a time constant, and ψ=0 is the neuron threshold. For numerical integration, and assuming a delay between consecutive frames, the Euler forward method is used to obtain an approximation of the neural field, which in matrix form results in the rearranged expression

  • display math(18)

where * is the convolution operator; A, B, and c are empirically defined weights specifying the contribution of each term; and inline image, with Gkσ as a Gaussian kernel of size k×k and width σ.

This matrix-based formulation of the neural field results in a synchronous update policy. That is, all neurons are updated based on the previous frame's network state. The problem with this update policy is that it has the potential to induce undesirable activity oscillations in the face of symmetries at the sensory input. However, due to robot motion, these singular configurations are unlikely to occur during a relevant amount of time.

The dynamical characteristic of the model in conjunction with the long-range lateral inhibition results in the following property: the higher the number of frames with the same spot with high activity, the more difficult it is, due to lateral connectivity, for other regions to become activated. Hence, transient distractors are actively inhibited once a large amount of evidence on the trail location is accumulated (see Figure 7). That is, the focus of attention's spatiotemporal competition is an implicit property of the system.
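The synchronous Euler update can be illustrated with a one-dimensional toy field. This is a sketch under stated assumptions: the update `F + dt/tau * (-F + a * lateral + b * afferent)` is a generic Amari-style form rather than Eq. (18) itself, f(x)=x and ψ=0 as in the text, and the kernel sizes, widths, and gains are illustrative values, not the paper's.

```python
import math


def gaussian_kernel(size, sigma):
    """Odd-sized 1-D Gaussian kernel normalized to unit sum."""
    half = size // 2
    k = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(k)
    return [v / total for v in k]


def convolve(signal, kernel):
    """Same-size 1-D convolution with zero padding (kernels here are symmetric)."""
    half = len(kernel) // 2
    n = len(signal)
    return [sum(kernel[j + half] * signal[i + j]
                for j in range(-half, half + 1) if 0 <= i + j < n)
            for i in range(n)]


def field_step(field, saliency, dt_over_tau=0.1, a=1.0, b=1.0):
    """One synchronous Euler step: every neuron uses the previous field state."""
    excite = gaussian_kernel(5, 1.0)     # short-range excitation
    inhibit = gaussian_kernel(15, 4.0)   # long-range inhibition
    padded = [0.0] * 5 + excite + [0.0] * 5
    dog = [e - i for e, i in zip(padded, inhibit)]  # "Mexican-hat" coupling
    lateral = convolve(field, dog)
    afferent = convolve(saliency, gaussian_kernel(5, 1.0))
    return [f + dt_over_tau * (-f + a * l + b * s)
            for f, l, s in zip(field, lateral, afferent)]
```

Calling `field_step` repeatedly with the same input lets activity build up around a persistent peak while the inhibitory surround suppresses more distant locations.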

Figure 7. Example of neural field competition in a situation represented by four ordered frames. Each row includes the saliency map (center), S, and the neural field (right), F, for a given input image (left). The redness of the blobs overlaid in the input images corresponds to the activity level of the neural field above 0.85, representing the model's estimate of the trail's location. The trail is visible for several frames prior to frame 220, thus eliciting a high level of activity in the neural field at frame 190. Although the transient presence of a trail-like grass segment in the bottom-left region of the input image is felt in the pheromone field between frames 220 and 250, this distractor is actively inhibited in the neural field. The outcome is that the system's output, i.e., the red overlay on the input image, steadily segments the trail from the background. To amplify the effects of the distractor's presence, the appearance model's influence was turned off, Φ(pm, href)=ε.


4 EXPERIMENTAL RESULTS


4.1 Experimental Setup

The model was evaluated on an extensive data set of 33 color videos,2 encompassing a total of 24,684 frames with a resolution of 640×480. A subset of these videos, from Video 1 to Video 28, was recorded at 10 frames per second with a hand-held camera carried at an approximate height of 1.5 m, at an approximate speed of 1 m s−1. A second subset of five videos, from Video 29 to Video 33, was acquired from a camera mounted on a mountain bike moving at varying speed. Unlike the first subset, the second one was downloaded from YouTube, meaning that the videos were acquired from cameras with different sensors and fields of view. The full data set includes both natural and engineered paths in a wide variety of backgrounds (see Figure 8). Experimental results were obtained running the model offline. The model was implemented in C++ using OpenCV 2.2 (Bradski & Kaehler, 2008) for low-level routines and it was tested on a Core2 Duo 2.53 GHz P8700 running Linux Ubuntu 64 bits 10.10. To show that the model is capable of providing sufficient information to bring a robot back to the trail after a forced exit, triggered, for instance, by the presence of an obstacle, the camera was frequently moved off trail with a considerable level of oscillation. Figure 9 depicts one of these situations.

Figure 8. Representative frames of the data set. The first three rows correspond to a subset of videos acquired with a hand-held camera moving at an approximate speed of 1 m s−1. The bottom row corresponds to a subset of videos acquired with a camera mounted on a mountain bicycle moving with varying speed. The redness of the blobs overlaid in the images corresponds to the activity level of the neural field above 85% of its maximum, representing the model's estimate of the trail's location.


Figure 9. The system's output in a sequence of images from video 19 obtained with the camera moving on and off trail. This shows the ability of the model to provide enough information about the trail's location for a robot to return to the trail after a forced exit. The redness of the blobs overlaid in the images corresponds to the activity level of the neural field above 85% of its maximum, representing the model's estimate of the trail's location.


The following describes the model's parametrization used in the experiments. The number of p-ants per map, n, has been empirically defined as 20. A smaller number may not ensure convergence, whereas a larger one did not exhibit considerable improvement in the tested data set. The same reasoning applies to the number of iterations applied to each p-ant, η, which has been set to 50. The pheromone baseline deployed by a given p-ant on its associated pheromone map (see Section 3.5), ε, has been set to 0.008. The gain controlling the speed at which the appearance model adapts to changes in the trail's appearance, κ, is set to 0.001. The gain of the appearance model's contribution to the deployed pheromone, β, is set to 0.01. The specific values of these last two parameters are a function of the system's frame rate. The small portion of ε deployed in the other pheromone map (see Section 3.5), υ, has been set to 0.3. These values should not be set too high to avoid pheromone saturation, which would inhibit the emergence of collective behavior. The ratio of the robot motion-compensated neural field used to initialize the pheromone maps at the onset of each frame (see Section 3.2) has been set to λ=0.1.

The contribution of each behavior is αgreedy=0.45, αtrack=0.35, αcenter=0.10, αahead=0.05, and αcommit=0.05. By making αgreedy larger than αcenter, αahead, and αcommit, we ensure that p-ants exploit more strongly the conspicuity cue than the a priori knowledge on the expected trail's shape. This follows from the observation that trails in natural environments are highly unstructured; still, some a priori knowledge can help to prevent p-ants from being stuck or deviated by distractors in the environment. With a relatively high αtrack, the collective influences the individual considerably to further reduce the problems associated with noise and distractors. Note, however, that bounding αtrack relative to αgreedy, αcenter, αahead, and αcommit is important to ensure that self-organization occurs under exogenous influence; otherwise, a nonsituated consensus among p-ants could be reached.

The margin ζ used to center the p-ant on the segment it is currently on has been set to 0.06. The width, δw, and the height, δh, of the window used to create p-ants in Section 3.5.1 have been set to 9 and 5, respectively. The initial value of the weight factor ρ [see Eq. (9)], ρ0, and its increment at each iteration, Δρ, have been set to 0.3 and 0.02, respectively. The initial value of the randomness weight γ [see Eq. (15)], γ0, and the rate of its exponential decay at each iteration, γτ, have been set to 0.4 and 0.02, respectively. The neural field's free parameters (see Section 3.6.2) have been empirically defined, σ1=4.25, σ2=14.15, σ3=2.15, k1=25, k2=91, k3=11, a=2, b=2.5, c=8, and τ=0.03. The system showed robustness to small variations around these values as long as the proportions are roughly maintained.

4.2 Results

This section presents quantitative results obtained with the proposed model in the presented data set. The trail is considered correctly detected if the biggest blob of neural field activity above 85% of its maximum is fully localized within the trail's boundaries and roughly aligned with the trail's orientation.

Figure 10 shows a set of input images obtained from Rasmussen et al. (2009) as well as their corresponding conspicuity maps, computed with the scaling function proposed in Section 3.3.2. It is observable that at least one of the conspicuity maps tends to segment the trail from the background. However, the trail is not always represented by the most conspicuous segment. In these cases, the typical blend of conspicuity maps used to produce the final saliency map, from which the object of interest is directly captured (Frintrop et al., 2005; Itti et al., 1998), is likely to fail. The first hypothesis being tested in this work is that the intermediate pheromone maps are able to introduce the required added value to ensure robust operation in these situations. Table II confirms this hypothesis. That is, in the tested data set, the proposed swarm-based saliency model, inline image, predicts the trail location 4.8 times more often than a classical saliency model based only on the conspicuity maps, inline image. For the sake of a fair comparison, the neural field F, which is fed by S, is used to generate the output in both cases. To assess the potential effects that the probabilistic nature of the p-ants' behaviors might have on the model's repeatability, the results are obtained by averaging five runs per video. The rather low standard deviation (see Table IV) demonstrates the model's repeatability.

Table II. Comparative detection results in the data set of 33 videos between the proposed swarm-based saliency model, inline image, and a classical saliency model based only on the conspicuity maps, inline image. Aggregate detection rate (mean ± standard deviation) computed as the average of the detection rates obtained per video. Refer to Table IV in Appendix A for details
 | Classical model | Proposed model
Detection rate (%) | 17.40±25.08 | 96.90±0.10

Figure 10. Intensity, CI (middle row), and color, CC (bottom row), conspicuity maps computed with the scaling function proposed in Section 3.3.2, for a set of images (top row) obtained from Rasmussen et al. (2009). These figures show that the trail is usually conspicuous and segmented in at least one of the conspicuity maps.


The second hypothesis being assessed is whether the proposed model exhibits sufficient accuracy and computational efficiency to ensure robust trail following in off-road environments. Computational efficiency is attained at nearly 20 Hz operation (see Table III). This processing rate is adequate for visual servoing at moderate speeds on the order of a few meters per second. Remarkably, the swarm-based pheromone maps computation, which is the only trail-specific operation, takes only 2 ms on average per frame. With an average success rate of 97% (see Table II), the model shows itself to be capable of providing a great deal of, and possibly sufficient, information for a control system to guide a robot along most trails.

Table III. Proposed model's average computation times. The timing reported for the neural field update also includes optical flow computation, homography estimation, and neural field wrapping.
| | Neural field | Conspicuity maps computation | Pheromone maps computation | Total |
| Time (ms) | 18 | 33 | 2 | 53 |
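
The relation between the per-stage timings in Table III and the reported frame rate can be checked with a few lines of arithmetic. The sketch below uses only the values from the table; the robot speed is a hypothetical figure chosen to illustrate "a few meters per second" and does not come from the experiments.

```python
# Average per-frame computation times in milliseconds, copied from Table III.
stage_times_ms = {
    "neural_field": 18,      # also covers optical flow and homography estimation
    "conspicuity_maps": 33,
    "pheromone_maps": 2,     # the only trail-specific stage
}

total_ms = sum(stage_times_ms.values())
frame_rate_hz = 1000.0 / total_ms
pheromone_share = stage_times_ms["pheromone_maps"] / total_ms

# Hypothetical robot speed, used only to illustrate the distance covered per frame.
speed_m_per_s = 2.0
distance_per_frame_m = speed_m_per_s * total_ms / 1000.0

print(f"total time per frame: {total_ms} ms")
print(f"frame rate: {frame_rate_hz:.1f} Hz")
print(f"pheromone maps share of total time: {pheromone_share:.1%}")
print(f"distance per frame at {speed_m_per_s} m/s: {distance_per_frame_m:.3f} m")
```

Under these numbers the trail-specific swarm computation accounts for under 4% of the per-frame budget, with most of the time going to the conspicuity maps and the neural field update.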

The results are even more significant when the difficulty of the tested data set is taken into account. To the best of our knowledge, no previous work has been tested against a data set with trails simultaneously as narrow, unstructured, and discontinuous as the ones considered herein. Moreover, contrary to previous work, the model succeeds in situations in which the trail does not start from the bottom of the image [e.g., Figures 1(d) and 9]. It is worth noting that in 13 of the 33 videos, the proposed model shows a 100% success rate for all five runs. Video 5 is a long run of almost 5 min, composed of more than 2,800 frames. In addition to being often interrupted and highly unstructured, the trail in this video also exhibits a variable width. Moreover, the terrain surrounding the trail is heterogeneous and highly populated with potential distractors, such as trees and bushes. The 96% success rate of the model in this video clearly shows its robustness in demanding situations.

In a second experiment, the proposed model was tested on an additional set of 15 images (see Figure 11), employed by Rasmussen et al. (2009) to assess the ability of their model to perform without the support of 3-D data. In this second data set, our model only fails to determine the trail location in the image depicted in the bottom-right of Figure 11. The reason for this failure is that without a symmetry constraint, it is impossible to define as nontrail the region signaled by the system. This could be overcome easily by explicitly adding a symmetry behavior to the p-ants. However, the more specialized the system gets, the less robust it is in the face of unforeseen trails. Figure 12 depicts the successful output of the proposed model in a situation in which the ability to operate without hard assumptions on the trail's shape or appearance is key.

Figure 11. Results of the proposed model on the 15 images data set obtained from Rasmussen et al. (2009). This data set is composed of static images, so the appearance model's influence was turned off, Φ(pm, href)=ε. The red blob corresponds to the proposed model's output after 50 frames running on the same image. This high number of frames is intended to demonstrate that the system actually converges toward a good solution. The redness of the blobs overlaid in the images corresponds to the activity level of the neural field above 0.85, representing the model's estimate of the trail location.
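
The overlay described in the caption of Figure 11 can be illustrated with a small sketch. It assumes the neural field activity is available as a 2-D grid of values in [0, 1]; the function name `trail_overlay` and the linear mapping from activity to redness are assumptions made for illustration, as the caption only states that redness corresponds to activity above 0.85.

```python
ACTIVITY_THRESHOLD = 0.85  # activity level quoted in the Figure 11 caption


def trail_overlay(field, threshold=ACTIVITY_THRESHOLD):
    """Map neural field activity to a red intensity in [0, 1].

    Cells at or below the threshold get 0.0 (no overlay). Cells above it are
    scaled linearly so that an activity of 1.0 maps to full red. The linear
    scaling is an illustrative assumption.
    """
    overlay = []
    for row in field:
        overlay.append([
            (a - threshold) / (1.0 - threshold) if a > threshold else 0.0
            for a in row
        ])
    return overlay


# Toy 3x3 activity map with a small high-activity region.
field = [
    [0.10, 0.20, 0.30],
    [0.50, 0.90, 0.86],
    [0.20, 1.00, 0.40],
]
overlay = trail_overlay(field)
highlighted = sum(1 for row in overlay for v in row if v > 0.0)
print(f"cells highlighted as trail estimate: {highlighted}")
```

Only the cells whose activity exceeds the threshold contribute to the blob, which is why low-level background activity in the neural field does not appear in the figures.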

Figure 12. Proposed model operating on an image of a trail with a nonmonotonous shape and not starting from the bottom, taken from a high vantage point. The state of the neural field after 50 frames running on the same image clearly shows the ability of the model to stably determine the location of the trail.

The paths present in these images are well structured, uninterrupted, and wide, and thus agree with the assumptions made by Rasmussen et al. (2009): they have a triangular shape and symmetric, high-contrast borders. As a result, the model-based work of Rasmussen et al. (2009) successfully determines the location of the path in all images. However, being model-free and able to exploit both local and global contrast, as our model is, is important for properly handling less structured paths, such as those considered in the first data set (see Figure 8), or situations such as the one shown in Figure 12.

4.3 Failure Cases

Despite the overall good results, the model still presents some weaknesses that should be addressed in future work. One of the weaknesses is rather profound in the sense that there is no straightforward solution. Namely, the presence of strong shadows tends to break down both the segmentation produced in the conspicuity maps and, as a consequence, the system's ability to robustly track the trail. Shadow removal strategies are hard to apply when strong camera motion occurs in unstructured environments. An alternative to reduce sensitivity to shadows is to consider a different color space. However, strong shadows remove chromatic information, thus limiting the impact of the color space. Figure 13 illustrates one of these situations, along with a segmentation produced by a classic clustering-based approach (Rasmussen et al., 2009). The figure shows that the conspicuity maps produce segmentations similar to the clustering approach. This confirms the ability of saliency maps to produce interesting segmentations of the input image. It also illustrates the difficulty classical segmentation approaches have in handling strong shadows. Hence, this failure can be credited more to the poor signal-to-noise ratio of the input image than to any limitation of the proposed model. Given that shadows are cast mostly by tall objects, the solution to this problem may lie in the inclusion of 3-D information.

Figure 13. Failure caused by strong shadows. A situation in which the proposed model fails to track the trail due to the presence of strong shadows. Note that the conspicuity maps in (b) and (c) are themselves producing a segmentation of the input image similar to the one produced by a classical clustering-based segmentation approach (Rasmussen et al., 2009) in (d). Hence, failure is just as likely in both cases.

With 81%, video 17 exhibits the lowest success rate. This is due to a temporary miscompensation of the neural field for the robot motion, which was in turn caused by a failure to recover the optic flow. Figure 14 depicts a sequence of images representative of this failure case. Without optic flow during the depicted right turn and, consequently, without the ability to perform motion compensation, the neural field remained static despite the robot motion. The result is a concentration of neural field activity off the trail, which causes the appearance model to erroneously learn the background (see the second row of Figure 14). This in turn leads to a strong competition between p-ants on and off the trail, i.e., between evidence and conspicuity information, which lasts for a few frames until symmetry is broken (see the third row of Figure 14). Once this happens, the appearance model is relearned and the tracking process fully recovers (see the fourth row of Figure 14). This demonstrates the model's ability to recover from dramatic mismatches between the representation built so far and the sensory input.

Figure 14. Failure caused by erroneous motion compensation. A situation in which the proposed model fails momentarily to track the trail due to optic-flow miscomputation. From top to bottom, rows represent the model's state for key frames along the fail sequence. The figure shows that the highly informative conspicuity maps boost system recovery.

4.4 Discussion

A key issue in the proposed model is the considerable number of parameters that must be set. This is partly the price one must pay for a bottom-up feedforward model and, therefore, a low computational cost. However, the actual parametrization is easy to attain given that most of the parameters can be tuned independently. Moreover, a single parametrization is robust enough to cope with different situations. This ability can be verified by the success rates above 97% in each of videos 26, 27, and 28 (see Table IV), which include natural and engineered paths in both natural and man-made environments.

When the trail is highly conspicuous in the environment, as most often occurs, ambiguity is rarely present in the system's output. When this assumption fails and distractors are scattered, the model is still often able to perform correctly, as demonstrated by the quantitative results. This robustness is due to the synergistic operation between neural field inertia and the p-ants' sensorimotor coordination capabilities, which allow for an opportunistic exploitation of the trail-background prioritized segmentation present in the conspicuity maps. Although the appearance model learned online also contributes to this success, Figure 15 shows that it alone is insufficient when the camera is compelled to change between trails of different appearance. Along the same lines, Figure 16 shows two additional situations in which the system successfully tracks the trail despite its sudden appearance change.

Figure 15. The system's output in a sequence of images from video 27 obtained with the camera moving from the trail depicted in (a) to the trail depicted in (c). In (d), the pixel-wise trail probability map of image (b) shows the inability of the learned appearance model to indicate the presence of the new trail. This is a result of both trails having different appearances. Nevertheless, the system is able to switch from one trail to the other thanks to the robust operation of the underlying swarm.

Figure 16. The system's output in two sequences of images (top row from video 26 and bottom row from video 27) obtained with the camera moving along two trails whose appearance suddenly changes.

5 CONCLUSIONS

This paper proposed a swarm-based visual saliency model capable of embedding a priori knowledge on the overall layout of the object being sought. The model was shown to perform well and fast during the difficult task of tracking unstructured trails in natural environments. In particular, the model exhibited a success rate on the order of 97% at 20 Hz against a highly demanding and heterogeneous set of environments. This solution is suitable for visual servoing at moderate speeds, which is often the case in narrow and sinuous trails.

These results are largely due to the multiagent design, which enables a robust self-organization of visual search, perceptual grouping, and multiple hypotheses tracking. In other words, the resulting system solves a complex problem in a bottom-up way. The first outcome is a low computational footprint, which in turn allows faster robot motion. The second outcome is that, being purely bottom-up, the system does not rely on explicit internal representations, i.e., a priori models, such as the “triangle” concept. In addition to fostering robustness in unstructured environments, this is also an important asset if trail detection is expected to emerge as a product of incremental self-supervised learning. Along these lines, an additional advantage of being purely bottom-up is that the system's design space is restricted to the simple perception-action rules, i.e., behaviors, encapsulated in the set of homogeneous agents. Given their simplicity, it should be possible to replace these rules with simple recurrent neural networks, thus confining the design space to a small set of weights.

Another key and novel aspect of the model is that by using visual saliency, both local and global cues on the trail location are naturally exploited. That is, saliency maps provide contrast information not only between the trail and its surroundings, but also between the trail and the overall scene. Typically, this is the case because the appearance of the materials composing trails (e.g., soil) differs from the appearance of the materials composing their surroundings (e.g., grass) and the remaining elements present in the overall scene (e.g., trees, sky). The complementarity of local and global contrast information results in a robust handling of less structured trails.

The high success rate across the heterogeneous data set shows that the selected parametrization is not overfit to a specific environment, highlighting its robustness. Nonetheless, automatic generation and parametrization of the p-ants' behaviors are desirable and will be explored in future work. Furthermore, online specialization and generalization of the p-ants' behaviors can also be addressed for improved performance and robustness, respectively.

Other perceptual modalities, such as texture and depth, and even alternative conspicuity computation methods can also be considered. Additionally, we plan to test the swarm-based saliency model on other visual search tasks. Related to this, the hypothesis partially analyzed in this article regarding the usefulness of visual saliency as a general purpose prioritized image segmentation process should be further validated. Along these lines, we intend to explore the possibility of using the neural field's activity directly as a utility map to support motion planning.

A limitation of the proposed model is its inability to handle bifurcations in the trail. To deal with this limitation, we expect to explicitly track multiple blobs of activity in the neural field. The selection of which blob to track can be suggested by feedback from the robot's action selection process. We have explored this approach in the context of swarm-based obstacle detection (Santana & Correia, 2010). The parallel nature of conspicuity maps, neural field, and swarm-based saliency will be exploited in future implementations on parallel hardware, such as Graphics Processing Units (GPUs).
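
As a rough illustration of what tracking multiple activity blobs could involve, the sketch below groups above-threshold cells of a 2-D activity grid into 4-connected blobs and then picks one of them. It is not the authors' method: the threshold of 0.85 is borrowed from the figure captions, and `preferred_col` is a hypothetical stand-in for the feedback that the action selection process would provide.

```python
from collections import deque


def find_activity_blobs(field, threshold=0.85):
    """Group cells with activity above the threshold into 4-connected blobs.

    Returns a list of blobs, each a list of (row, col) cells.
    """
    rows, cols = len(field), len(field[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or field[r][c] <= threshold:
                continue
            # Breadth-first flood fill starting from an unvisited active cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            blob = []
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and field[ny][nx] > threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def select_blob(blobs, preferred_col):
    """Pick the blob whose mean column is closest to a hinted column.

    preferred_col is a hypothetical hint standing in for action selection feedback.
    """
    def mean_col(blob):
        return sum(col for _, col in blob) / len(blob)

    return min(blobs, key=lambda blob: abs(mean_col(blob) - preferred_col))


# Toy activity map with two separate branches, as at a trail bifurcation.
field = [
    [0.9, 0.1, 0.1, 0.1, 0.9],
    [0.9, 0.1, 0.1, 0.1, 0.9],
    [0.1, 0.1, 0.1, 0.1, 0.1],
]
blobs = find_activity_blobs(field)
chosen = select_blob(blobs, preferred_col=4)
print(f"blobs found: {len(blobs)}, chosen blob cells: {sorted(chosen)}")
```

In this toy case the two branches form separate blobs, and the hint toward the right side of the image selects the right-hand branch.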

Finally, the obtained results add to previous evidence on the usefulness of visual attention for the control of off-road robots (Santana & Correia, 2010, 2011; Santana et al., 2011a). The proposed method also contributes to the emerging swarm cognition field, which attempts to uncover the basic principles of cognition, i.e., adaptive behavior, by resorting to self-organizing principles, mainly those exhibited by social insects (Santana & Correia, 2010, 2011; Trianni et al., 2011).

Acknowledgements

This work was partially supported by FCT/MCTES Grant No. SFRH/BD/27305/2006 and by CTS multiannual funding through the PIDDAC Program. We gratefully acknowledge the useful comments and support provided by our colleague Magno Guedes. We also want to thank Vítor Matos from the University of Minho for the valuable discussion we had regarding the neural field's activity shift aspects. Finally, we want to acknowledge the fruitful comments provided by the anonymous reviewers.

Appendix A

RESULTS DETAILED

Table IV. Detailed trail detection results obtained with a classical model and with the proposed model in the 33 videos data set. To handle the probabilistic nature of the p-ants' behaviors, the results for the proposed model (mean ± standard deviation) refer to the average of five runs performed in each video.
| Video ID | Nr. of frames | Classical model detection rate (%) | Proposed model detection rate (%) |
| 1 | 278 | 44.60 | 100.00 ± 0.00 |
| 2 | 204 | 61.76 | 100.00 ± 0.00 |
| 3 | 422 | 4.74 | 100.00 ± 0.00 |
| 4 | 135 | 0.00 | 100.00 ± 0.00 |
| 5 | 2,854 | 32.48 | 96.29 ± 0.03 |
| 6 | 186 | 27.96 | 100.00 ± 0.00 |
| 7 | 121 | 0.00 | 100.00 ± 0.00 |
| 8 | 124 | 0.00 | 100.00 ± 0.00 |
| 9 | 301 | 18.77 | 93.33 ± 0.39 |
| 10 | 147 | 49.66 | 99.18 ± 1.09 |
| 11 | 386 | 0.00 | 100.00 ± 0.00 |
| 12 | 158 | 0.00 | 95.44 ± 0.25 |
| 13 | 134 | 40.30 | 100.00 ± 0.00 |
| 14 | 676 | 44.23 | 98.02 ± 0.07 |
| 15 | 683 | 26.50 | 93.82 ± 0.06 |
| 16 | 770 | 4.55 | 90.99 ± 0.10 |
| 17 | 403 | 34.99 | 81.04 ± 0.20 |
| 18 | 335 | 97.01 | 100.00 ± 0.00 |
| 19 | 230 | 84.78 | 98.70 ± 0.27 |
| 20 | 439 | 6.38 | 95.81 ± 0.18 |
| 21 | 490 | 3.67 | 100.00 ± 0.00 |
| 22 | 230 | 10.87 | 100.00 ± 0.00 |
| 23 | 600 | 6.00 | 100.00 ± 0.00 |
| 24 | 802 | 0.00 | 99.05 ± 0.10 |
| 25 | 907 | 0.00 | 93.65 ± 0.05 |
| 26 | 1,611 | 2.48 | 98.26 ± 0.00 |
| 27 | 3,011 | 0.00 | 97.51 ± 0.02 |
| 28 | 1,196 | 0.00 | 99.46 ± 0.07 |
| 29 | 1,027 | 18.70 | 93.57 ± 0.11 |
| 30 | 1,083 | 2.95 | 91.51 ± 0.00 |
| 31 | 1,649 | 2.91 | 92.37 ± 0.14 |
| 32 | 388 | 17.27 | 95.41 ± 0.10 |
| 33 | 2,696 | 34.85 | 94.24 ± 0.03 |
| All | 24,684 | 17.40 ± 25.08 | 96.90 ± 0.10 |
  1. Rows are indexed in increasing order from the top to the bottom of the map.
  2. Videos with the proposed model's output are available from the authors' website (Santana et al., 2012).
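
The aggregate figures reported for the proposed model can be recomputed from the per-video values of Table IV. The sketch below copies the proposed-model column as (mean, standard deviation) pairs and averages each component over the 33 videos. Reading the aggregate ± value as the mean of the per-video standard deviations is an inference from the numbers, not something stated in the text.

```python
# (mean, std) detection rates of the proposed model for videos 1..33 (Table IV).
proposed = [
    (100.00, 0.00), (100.00, 0.00), (100.00, 0.00), (100.00, 0.00),
    (96.29, 0.03), (100.00, 0.00), (100.00, 0.00), (100.00, 0.00),
    (93.33, 0.39), (99.18, 1.09), (100.00, 0.00), (95.44, 0.25),
    (100.00, 0.00), (98.02, 0.07), (93.82, 0.06), (90.99, 0.10),
    (81.04, 0.20), (100.00, 0.00), (98.70, 0.27), (95.81, 0.18),
    (100.00, 0.00), (100.00, 0.00), (100.00, 0.00), (99.05, 0.10),
    (93.65, 0.05), (98.26, 0.00), (97.51, 0.02), (99.46, 0.07),
    (93.57, 0.11), (91.51, 0.00), (92.37, 0.14), (95.41, 0.10),
    (94.24, 0.03),
]

n_videos = len(proposed)
aggregate_mean = sum(mean for mean, _ in proposed) / n_videos
# Assumption: the aggregate spread is the average of the per-video deviations.
aggregate_std = sum(std for _, std in proposed) / n_videos

# Videos with a 100% success rate in all five runs (zero deviation).
perfect_videos = sum(1 for mean, std in proposed if mean == 100.00 and std == 0.00)
# Video IDs are 1-based.
lowest_video = min(range(n_videos), key=lambda i: proposed[i][0]) + 1

print(f"aggregate: {aggregate_mean:.2f} ± {aggregate_std:.2f} over {n_videos} videos")
print(f"videos at 100% in all runs: {perfect_videos}")
print(f"lowest success rate: video {lowest_video}")
```

The same per-video averaging is what the caption of Table II describes for the aggregate detection rate.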

REFERENCES

  • Alon, Y., Ferencz, A., & Shashua, A. (2006). Off-road path following using region classification and geometric projection constraints. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 689–696. IEEE.
  • Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27(2), 77–87.
  • Antón-Canalís, L., Hernández-Tejera, M., & Sánchez-Nielsen, E. (2006). Particle swarms as video sequence inhabitants for object tracking in computer vision. In Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA), pp. 604–609, IEEE Computer Society, Washington, DC.
  • Ballard, D. H. (1991). Animate vision. Artificial Intelligence, 48(1), 57–86.
  • Bartel, A., Meyer, F., Sinke, C., Wiemann, T., Nüchter, A., Lingemann, K., & Hertzberg, J. (2007). Real-time outdoor trail detection on a mobile robot. In Proceedings of the 13th IASTED International Conference on Robotics, Applications and Telematics, pp. 477–482.
  • Beer, R. D. (1995). A dynamical systems perspective on agent-environment interaction. Artificial Intelligence, 72(1-2), 173–215.
  • Beer, R. D. (2003). The dynamics of active categorical perception in an evolved model agent. Adaptive Behavior, 11(4), 209–243.
  • Blas, M., Agrawal, M., Konolige, K., & Sundaresan, A. (2008). Fast color/texture segmentation for outdoor robots. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4078–4085, IEEE Press, Piscataway, NJ.
  • Bonabeau, E., Dorigo, M., & Theraulaz, G. (1999). Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press, Oxford.
  • Bouguet, J. (1999). Pyramidal implementation of the Lucas Kanade feature tracker description of the algorithm. Intel Corporation, Microprocessor Research Labs, OpenCV Documents.
  • Bradski, G., & Kaehler, A. (2008). Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media, Inc., Sebastopol, CA.
  • Broggi, A., & Cattani, S. (2006). An agent based evolutionary approach to path detection for off-road vehicle guidance. Pattern Recognition Letters, 27(11), 1164–1173.
  • Burt, P., & Adelson, E. (1983). The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), 532–540.
  • Chaturvedi, P., & Malcolm, A. (2005). Real-time road following in natural terrain. In Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems, vol. 2, pp. 815–820, IEEE.
  • Cour, T., & Shi, J. (2007). Recognizing objects by piecing together the segmentation puzzle. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8, IEEE Computer Society, Washington, DC.
  • Crisman, J., & Thorpe, C. (1991). Unscarf-a color vision system for the detection of unstructured roads. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 2496–2501, IEEE Press, Piscataway, NJ.
  • Doran, M. M., Hoffman, J. E., & Scholl, B. J. (2009). The role of eye fixations in concentration and amplification effects during multiple object tracking. Visual Cognition, 17(4), 574–597.
  • Fernandez, D., & Price, A. (2005). Visual detection and tracking of poorly structured dirt roads. In Proceedings of the International Conference on Advanced Robotics (ICAR), pp. 553–560. IEEE.
  • Fernandez, J., & Casals, A. (1997). Autonomous navigation in ill-structured outdoor environment. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 1, pp. 395–400, IEEE Press, Piscataway, NJ.
  • Franks, N. R. (1989). Army ants: A collective intelligence. American Scientist, 77(2), 138–145.
  • Frintrop, S. (2006). VOCUS: A visual attention system for object detection and goal-directed search. Ph.D. thesis, INAI, vol. 3899, Germany.
  • Frintrop, S., Backer, G., & Rome, E. (2005). Goal-directed search with a top-down modulated computational attention system. In Proceedings of the DAGM 2005, Lecture Notes on Computer Science, 3663, pp. 117–124, Springer-Verlag, Berlin, Germany.
  • Ghurchian, R., Hashino, S., & Nakano, E. (2004). A fast forest road segmentation for real-time robot self-navigation. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 406–411, vol.1, IEEE Press, Piscataway, NJ.
  • Grassé, P.-P. (1959). La reconstruction du nid et les coordinations inter-individuelles chez bellicositermes et cubitermes sp. la théorie de la stigmergie: Essai d'interprétation du comportement des termites constructeurs. Insectes Sociaux, 6, 41–80.
  • Grudic, G., & Mulligan, J. (2006). Outdoor path labeling using polynomial mahalanobis distance. In Proceedings of Robotics: Science and Systems, pp. 16–19, MIT Press, Cambridge, MA.
  • Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259.
  • Kong, H., Audibert, J., & Ponce, J. (2010). General road detection from a single image. IEEE Transactions on Image Processing, 19(8), 2211–2220.
  • Liu, J., Tang, Y., & Cao, Y. (1997). An evolutionary autonomous agents approach to image feature extraction. IEEE Transactions on Evolutionary Computation, 1(2), 141–158.
  • Lookingbill, A., Rogers, J., Lieb, D., Curry, J., & Thrun, S. (2007). Reverse optical flow for self-supervised adaptive autonomous robot navigation. International Journal of Computer Vision, 74(3), 287–302.
  • Mazouzi, S., Guessoum, Z., Michel, F., & Batouche, M. (2007). A multi-agent approach for range image segmentation. In Proceedings of the 5th international Central and Eastern European Conference on Multi-Agent Systems and Applications (CEEMAS), LNAI 4696, vol. 4696, pp. 1–10, Springer-Verlag, Berlin, Germany.
  • Mobahi, H., Ahmadabadi, M. N., & Araabi, B. N. (2006). Swarm contours: A fast self-organization approach for snake initialization. Complexity, 12(1), 41–52.
  • Navalpakkam, V., & Itti, L. (2005). Modeling the influence of task on attention. Vision Research, 45(2), 205–231.
  • Owechko, Y., & Medasani, S. (2005). A swarm-based volition/attention framework for object recognition. In Proceedings of the IEEE Computer Vision and Pattern Recognition Workshop (CVPRW), pp. 91–98. IEEE Computer Society, Washington, DC.
  • Passino, K. M., Seeley, T. D., & Visscher, P. K. (2008). Swarm cognition in honey bees. Behavioral Ecology and Sociobiology, 62(3), 401–414.
  • Poli, R., & Valli, G. (1993). Neural inhabitants of MR and echo images segment cardiac structures. In Proceedings of the Computers in Cardiology, pp. 193–196, IEEE Computer Society, Washington, DC.
  • Ramos, V., & Almeida, F. (2000). Artificial ant colonies in digital image habitats - a mass behavior effect study on pattern recognition. In Proceedings of the 2nd International Workshop on Ant Algorithms - From Ant Colonies to Artificial Ants (ANTS), pp. 113–116, Belgium.
  • Rasmussen, C. (2008). Roadcompass: Following rural roads with vision+ ladar using vanishing point tracking. Autonomous Robots, 25(3), 205–229.
  • Rasmussen, C., & Scott, D. (2008a). Shape-guided superpixel grouping for trail detection and tracking. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4092–4097, IEEE Press, Piscataway, NJ.
  • Rasmussen, C., & Scott, D. (2008b). Terrain-based sensor selection for autonomous trail following. In Proceedings of the 2nd International Workshop on Robot Vision (Robvis 2008), pp. 341–355.
  • Rasmussen, C., Lu, Y., & Kocamaz, M. (2009). Appearance contrast for fast, robust trail-following. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS). IEEE Press, Piscataway, NJ.
  • Rosenblatt, J. K. (1995). DAMN: A distributed architecture for mobile navigation. In Proceedings of the AAAI Spring Symposium on Lessons Learned from Implemented Software Architectures for Physical Agents, Stanford, CA.
  • Rougier, N., & Vitay, J. (2006). Emergence of attention within a neural population. Neural Networks, 19(5), 573–581.
  • Santana, P., & Correia, L. (2010). A swarm cognition realization of attention, action selection and spatial memory. Adaptive Behavior, 18(5), 428–447.
  • Santana, P., & Correia, L. (2011). Swarm cognition on off-road autonomous robots. Swarm Intelligence, 5(1), 45–72.
  • Santana, P., Alves, N., Correia, L., & Barata, J. (2010a). Swarm-based visual saliency for trail detection. In Proceedings of the IEEE/RSJ 2010 International Conference on Intelligent Robots and Systems (IROS), pp. 759–765, IEEE Press, Piscataway, NJ.
  • Santana, P., Alves, N., Correia, L., & Barata, J. (2010b). A saliency-based approach to boost trail detection. In Proceedings of the International Conference on Robotics and Automation (ICRA), pp. 1426–1431, IEEE Press, Piscataway, NJ.
  • Santana, P., Guedes, M., Correia, L., & Barata, J. (2011a). Stereo-based all-terrain obstacle detection using visual saliency. Journal of Field Robotics, 28(2), 241–263.
  • Santana, P., Mendonça, R., Correia, L., & Barata, J. (2011b). Swarms for robot vision: The case of adaptive visual trail detection and tracking. In Proceedings of the European Conference on Artificial Life (ECAL), pp. 712–719, MIT Press, Cambridge, MA.
  • Santana, P., Mendonça, R., Alves, N., Correia, L., & Barata, J. (2012). Trail detection experimental results supporting videos, http://www.uninova.pt/∼pfs/index/TrailVideos.html.
  • Song, D., Lee, H., Yi, J., & Levandowski, A. (2007). Vision-based motion planning for an autonomous motorcycle on ill-structured roads. Autonomous Robots, 23(3), 197–212.
  • Thelen, E., & Smith, L. B. (1996). A dynamic systems approach to the development of cognition and action, MIT Press, Cambridge, MA.
  • Thorpe, C., Hebert, M., Kanade, T., & Shafer, S. (1988). Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(3), 362–373.
  • Thrun, S., Montemerlo, M., Dahlkamp, H., Stavens, D., Aron, A., Diebel, J., Fong, P., Gale, J., Halpenny, M., Hoffmann, G., Lau, K., Oakley, C., Palatucci, M., Pratt, V., Stang, P., Strohband, S., Dupont, C., Jendrossek, L.-E., Koelen, C., Markey, C., Rummel, C., van Niekerk, J., Jensen, E., Alessandrini, P., Bradski, G., Davies, B., Ettinger, S., Kaehler, A., Nefian, A., & Mahoney, P. (2006). Stanley: The robot that won the DARPA grand challenge. Journal of Field Robotics, 23(9), 661–692.
  • Tomasi, C., & Shi, J. (1994). Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 593–600, IEEE Computer Society, Washington, DC.
  • Trianni, V., Tuci, E., Passino, K., & Marshall, J. (2011). Swarm cognition: An interdisciplinary approach to the study of self-organising biological collectives. Swarm Intelligence, 5(1), 3–18.
  • Tue-Cuong, D.-S., Dong, G., Hwang, Y. C., & Heng, O. S. (2008). Extraction of shady roads using intrinsic colors on stereo camera. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 341–346. IEEE.
  • Wolfe, J. M., Võ, M. L.-H., Evans, K. K., & Greene, M. R. (2011). Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences, 15(2), 77–84.
  • Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16(6), 2112.
  • Zhang, X., Hu, W., Maybank, S., Li, X., & Zhu, M. (2008). Sequential particle swarm optimization for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8. IEEE Computer Society, Washington, DC.