The main aim was to examine the effects of the perpetrator's tone of voice and of time delay on voice recognition. In addition, two types of voice description interview intended to strengthen voice encoding were tested. Both 11- to 13-year-olds (n = 160) and adults (n = 148) heard an unfamiliar voice for 40 s. The perpetrator spoke either in a normal tone at encoding and in the lineup (congruent), or in an angry tone at encoding and a normal tone in the lineup (incongruent). Witnesses were then interviewed about the voice, either with global questions or by rating voice characteristics. Half of the witnesses were presented with a lineup shortly after the interview (immediate) and the others after 2 weeks (delayed). Children tested immediately made significantly more correct identifications than children tested after the delay; this was not the case for adults. Neither (in)congruency of tone of voice nor interview type significantly affected voice recognition. Witnesses in the congruent–immediate condition performed the best. However, only 25% of the children and 19% of the adults made correct identifications. Poor identification accuracy, together with the fact that the majority of witnesses believed they would recognise the voice later, is reason to treat voice identification evidence with great caution. Copyright © 2013 John Wiley & Sons, Ltd.