We have previously shown that spatial attention to a visual stimulus can spread across both space and modality to a synchronously presented but task-irrelevant sound arising from a different location. This spread is reflected by a late-onset, sustained, negative-polarity event-related potential (ERP) wave over frontal–central scalp sites, probably originating in part from the auditory cortices. Here we explore the influence of cross-modal conflict on the amplitude and temporal dynamics of this multisensory spreading-of-attention activity. Subjects attended selectively to one of two concurrently presented lateral streams of visual letters to perform a sequential comparison task, while ignoring task-irrelevant, centrally presented spoken letters. The spoken letters could occur synchronously with either the attended or the unattended lateral visual letters and could be either congruent or incongruent with them. Extracted auditory ERPs revealed that, collapsed across congruency conditions, attentional spreading across modalities started at approximately 220 ms, replicating our earlier findings. The interaction between attentional spreading and conflict began at approximately 300 ms, with attentional-spreading activity being larger for incongruent trials. Thus, the increased processing of an incongruent, task-irrelevant sound in a multisensory stimulus appears to occur some time after attention has spread from the attended visual part to the ignored auditory part. This delay presumably reflects conflict detection and the associated attentional capture, which require the accrual of multisensory interaction processes at a higher-level semantic processing stage.