Distinguishing between speakers and focusing attention on one speaker in multi-speaker environments is extremely important in everyday life. Exactly how the brain accomplishes this feat and, in particular, the precise temporal dynamics of this attentional deployment are as yet unknown. A long history of behavioral research using dichotic listening paradigms has debated whether selective attention to speech operates at an early stage of processing, based on the physical characteristics of the stimulus, or at a later stage, during semantic processing. With its poor temporal resolution, fMRI has contributed little to the debate, while EEG–ERP paradigms have been hampered by the need to average the EEG in response to discrete stimuli that are superimposed onto ongoing speech. This presents a number of problems, foremost among which is that early attention effects in the form of endogenously generated potentials can be so temporally broad as to mask later attention effects based on the higher-level processing of the speech stream. Here we overcome this issue by utilizing the AESPA (auditory evoked spread spectrum analysis) method, which allows us to extract temporally detailed responses to two concurrently presented speech streams in natural cocktail-party-like attentional conditions without the need for superimposed probes. We show attentional effects on exogenous stimulus processing in the 200–220 ms range in the left hemisphere. We discuss these effects within the context of research on auditory scene analysis and in terms of a flexible locus of attention that can be deployed at a particular processing stage depending on the task.