Figure–ground segregation can rely on purely temporal information, that is, on short temporal delays between positional changes of elements in figure and ground (Kandil, F.I. & Fahle, M. (2001) Eur. J. Neurosci., 13, 2004–2008). Here, we investigate the underlying mechanisms by measuring temporal segregation thresholds for various kinds of motion cues. Segregation can rely on monocular first-order motion (based on luminance modulation) and second-order motion cues (contrast modulation) with a high temporal resolution of ≈ 20 ms. The mechanism can also use isoluminant motion with a reduced temporal resolution of 60 ms. Figure–ground segregation can be achieved even at presentation frequencies too high for human subjects to inspect successive frames individually. In contrast, when stimuli are presented dichoptically, i.e. separately to both eyes, subjects are unable to perceive any segregation, irrespective of temporal frequency. We propose that segregation in these displays is detected by a mechanism consisting of at least two stages. On the first level, standard motion or flicker detectors signal local positional changes (flips). On the second level, a segregation mechanism combines the local activities of the low-level detectors with high temporal precision. Our findings suggest that the segregation mechanism can rely on monocular detectors but not on binocular mechanisms. Moreover, the results oppose the idea that segregation in these displays is achieved by motion detectors of a higher order (motion-from-motion), but favour mechanisms sensitive to short temporal delays even without activation of higher-order motion detectors.