By combining a modified version of the cross-modal oddball paradigm (Nöstl, Marsh, & Sörqvist, 2012) with sequence learning, the current study examines how expectation processes contribute to distraction by auditory events. The visual targets in the oddball task were preceded by tones that formed a repetitive cross-trial standard sequence. In Experiment 1, the standard sequence …-660-440-660-880-… Hz was used. Occasionally, either the 440 Hz or the 880 Hz standard was replaced by one of two novel tones (220 Hz and 1100 Hz), which differed either slightly (by 220 Hz) or markedly (by 660 Hz) from the replaced standard. In Experiment 2, with a more complex standard tone sequence …-220-660-440-660-880-660-1100-… Hz, the 440 Hz and the 880 Hz standards were occasionally replaced by either the 220 Hz or the 1100 Hz standard. Both experiments demonstrate that a large difference (i.e., 660 Hz) between the expected and the replacing tone captures attention more than a small difference (i.e., 220 Hz). Collectively, the results imply that the magnitude of attentional capture elicited by novel sound events depends on the discrepancy between the novel event and the expected event rather than on the amount of local perceptual change.
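
The arithmetic behind the Experiment 1 manipulation can be made explicit with a short sketch. It is an illustration rather than the authors' analysis: the function and variable names are hypothetical, and it assumes that the tone preceding each replaced standard is the regular 660 Hz standard and that "local perceptual change" can be approximated as the absolute frequency difference from that preceding tone.

```python
# Illustrative sketch only: names are hypothetical, not taken from the study.
# Assumption: the tone on the trial before a replaced standard is the regular
# standard from the repeating cycle (660 Hz in both cases).

STANDARD_CYCLE_HZ = [660, 440, 660, 880]  # Experiment 1 repeating standard sequence
REPLACED_STANDARDS_HZ = [440, 880]        # standards that were occasionally replaced
NOVEL_TONES_HZ = [220, 1100]              # replacement tones


def deviant_metrics(replaced_hz, novel_hz):
    """Compare a novel tone with the expected tone and with the preceding tone."""
    position = STANDARD_CYCLE_HZ.index(replaced_hz)
    preceding_hz = STANDARD_CYCLE_HZ[position - 1]
    return {
        "replaced_hz": replaced_hz,
        "novel_hz": novel_hz,
        # Difference between what was expected and what was heard.
        "expectation_discrepancy_hz": abs(novel_hz - replaced_hz),
        # Assumed proxy for local change: difference from the previous tone.
        "local_change_hz": abs(novel_hz - preceding_hz),
    }


rows = [
    deviant_metrics(replaced, novel)
    for replaced in REPLACED_STANDARDS_HZ
    for novel in NOVEL_TONES_HZ
]

for row in rows:
    print(row)
```

Under these assumptions, the discrepancy from the expected tone is either 220 Hz or 660 Hz, while the change relative to the preceding 660 Hz tone is 440 Hz in every case, which is why the two accounts make different predictions.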