Orienting responses to audiovisual events have shorter reaction times and better accuracy and precision when images and sounds in the environment are aligned in space and time. How the brain constructs an integrated audiovisual percept is a computational puzzle because the auditory and visual senses are represented in different reference frames: the retina encodes visual locations with respect to the eyes, whereas sound-localisation cues are referenced to the head. In the well-known ventriloquist effect, the auditory spatial percept of the ventriloquist's voice is attracted toward the synchronous visual image of the dummy, but does this visual bias on sound localisation operate in a common reference frame, correctly taking eye and head position into account? Here we studied this question by independently varying initial eye and head orientations and the amount of audiovisual spatial mismatch. Human subjects pointed head and/or gaze to auditory targets in elevation, and were instructed to ignore co-occurring visual distracters. The results demonstrate that different initial head and eye orientations are accurately and appropriately incorporated into the audiovisual response. Effectively, sounds and images are perceptually fused according to their physical locations in space, independent of the observer's point of view. Implications for neurophysiological findings and for modelling efforts that aim to reconcile sensory and motor signals for goal-directed behaviour are discussed.
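The reference-frame bookkeeping described above can be illustrated with a minimal sketch. This is not the authors' model: it assumes one-dimensional elevation angles and a purely additive composition of rotations, and all numerical values are invented for illustration. It shows how an eye-centred visual location and a head-centred auditory location can refer to the same point in space once eye-in-head and head-in-space orientations are accounted for.

```python
# Illustrative sketch (assumed 1-D additive angles, hypothetical values):
# re-reference eye-centred and head-centred targets to a common
# space-centred frame.

def to_space(target_in_frame: float, frame_in_space: float) -> float:
    """Re-reference a target angle from a moving frame to space."""
    return target_in_frame + frame_in_space

head_in_space = 5.0    # head pitched 5 deg up relative to space
eye_in_head = -2.0     # eyes rotated 2 deg down within the head

# Auditory target: sound-localisation cues are head-centred.
sound_in_head = 10.0
sound_in_space = to_space(sound_in_head, head_in_space)    # 15.0

# Visual target: the retina encodes locations relative to the eyes.
image_on_retina = 12.0
gaze_in_space = to_space(eye_in_head, head_in_space)       # 3.0
image_in_space = to_space(image_on_retina, gaze_in_space)  # 15.0

# The sensor-frame values differ (10 vs 12 deg), yet the two stimuli
# coincide in space once eye and head positions are incorporated.
audiovisual_mismatch = image_in_space - sound_in_space     # 0.0
```

Under this toy scheme, fusing the stimuli by their space-centred coordinates, rather than their raw sensor-frame values, is exactly what the abstract's finding implies the brain does.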