Recent ITV drama The Long Shadow has provided a reminder of one of the earliest and most notorious examples of the police having an audio recording of a crime. In 1979, hoaxer John Humble sent a tape to the police ‘identifying’ himself as the Yorkshire Ripper. The late Stanley Ellis, one of the pioneers of forensic phonetics, was able to identify the speaker as coming from the Sunderland area. Despite Ellis’s concerns about the validity of the claims made by the speaker, police resources were diverted to finding ‘Wearside Jack’. More than 25 years later, Humble, a lifelong resident of the Castletown area of Sunderland, was identified by DNA in 2006 and admitted making the tape, validating the accuracy of Ellis’s accent profile.

Today, developments in digital technology mean that speech recordings of evidential value are available in a growing range of circumstances. Voice memos, ‘Ring’ doorbells and voicemail messages are all potential means of capturing the voice of a perpetrator. In such cases, the party who wishes to establish that the defendant was (or was not) the speaker may be in a position to instruct an expert. The expert compares the unknown speaker on the incriminating recording with a recording of the suspect’s voice and gives an opinion on whether or not the two are the same person (‘forensic speaker comparison’). This article focuses on criminal cases, but voice experts have been instructed in a wide range of cases which have involved attributing a recording to a known speaker.

Methods of forensic speaker comparison

There are two main approaches to forensic speaker comparison, and in forensic contexts they are currently used in combination. The first is auditory comparison. This involves the expert phonetician listening to the disputed speech and the known recording of the speaker and transcribing the pronunciation of the words, usually using the International Phonetic Alphabet. The expert is then able to provide a comparison, identifying the areas of similarity and difference. The comparison focuses on qualities such as the pronunciation of vowels and consonants, intonation, speech rhythm and tempo, and voice quality, and it takes account of how frequently such features occur in the wider population.
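
The role of population frequency can be illustrated with a simple piece of arithmetic. The sketch below is not a description of how any expert or court weighs evidence; the function name and all of the figures are hypothetical, chosen only to show why a feature shared by both recordings carries more weight when it is rare in the wider population than when it is common.

```python
def weight_of_shared_feature(p_if_same_speaker, p_in_population):
    """Return how many times more probable a shared feature is if the two
    recordings are of the same speaker than if the unknown speaker is
    someone else drawn from the wider population.

    Both arguments are probabilities; the values used below are invented
    for illustration and are not taken from any study or case.
    """
    if not 0 <= p_if_same_speaker <= 1:
        raise ValueError("p_if_same_speaker must be between 0 and 1")
    if not 0 < p_in_population <= 1:
        raise ValueError("p_in_population must be greater than 0 and at most 1")
    return p_if_same_speaker / p_in_population


if __name__ == "__main__":
    # Hypothetical: the suspect produces the feature 90% of the time.
    common = weight_of_shared_feature(0.9, 0.60)  # feature heard in 60% of speakers
    rare = weight_of_shared_feature(0.9, 0.02)    # feature heard in 2% of speakers
    print(f"Common feature: ratio of about {common:.1f}")
    print(f"Rare feature:   ratio of about {rare:.1f}")
```

On these invented figures, the same observed similarity supports the 'same speaker' proposition far more strongly when the feature is rare, which is why an expert's comparison refers to the population and not just to the two recordings.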

The second method is acoustic analysis. This involves the analysis of the physical properties of speech, which can be measured using spectrograms and waveforms. An expert can examine these visual representations of the recordings and compare factors such as pitch, resonance and duration.
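
To give a sense of what measuring pitch and duration from a waveform involves, here is a minimal sketch in Python. It is an illustration under simplifying assumptions and not the procedure used by forensic practitioners: the signal is a synthetic pure tone standing in for voiced speech, the pitch estimate uses a basic autocorrelation search, and the function names and the 60 to 400 Hz search range are choices made for this example. Real casework recordings are noisy and far harder to measure.

```python
import math


def synth_tone(f0_hz, duration_s, sample_rate):
    """Generate a pure sine tone as a crude stand-in for voiced speech."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * f0_hz * i / sample_rate) for i in range(n)]


def duration_seconds(samples, sample_rate):
    """Duration is the number of samples divided by the sampling rate."""
    return len(samples) / sample_rate


def estimate_pitch(samples, sample_rate, fmin=60.0, fmax=400.0):
    """Estimate fundamental frequency (pitch) in Hz by autocorrelation.

    The signal is compared with delayed copies of itself; the delay (lag)
    at which the two line up best is taken as the period of the voice.
    Only lags corresponding to fmin..fmax (assumed range) are searched.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            samples[i] * samples[i + lag] for i in range(len(samples) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    rate = 8000  # samples per second (assumed for the example)
    tone = synth_tone(125.0, 0.2, rate)
    print("Estimated pitch (Hz):", estimate_pitch(tone, rate))
    print("Duration (s):", duration_seconds(tone, rate))
```

Resonance (the formant structure visible on a spectrogram) needs more elaborate signal processing and is left out of this sketch.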

Auditory analysis has been the subject of some criticism. It often requires the analyst to form a view based on an impressionistic assessment of similarities. These concerns were such that the Court of Appeal for Northern Ireland ruled in R v O’Doherty [2002] NICA 20 that auditory analysis should not be admissible as evidence against an accused unless it was being used in conjunction with an acoustic analysis. The Court of Appeal in England and Wales, in R v Flynn & St John [2008] EWCA Crim 970, did not feel it ‘possible or desirable’ to go that far.

Attempts to use speaker recognition software

As might be expected, this is an area where technology may have a role to play. Speaker recognition technologies have the potential to perform some of the roles of a human expert. Attempts have been made to use automatic speaker recognition software in the criminal courts, but these have not yet been successful. In R v Slade & Ors [2015] EWCA Crim 71, the defence wished to use a technology called Batvox to establish that the defendant was not the speaker on a covert recording. This was being introduced as ‘fresh evidence’ at appeal stage and therefore there was a high bar to pass in terms of cogency. The Court of Appeal (which allowed the appeal on other grounds) considered that the technology had not been tested rigorously enough and that there was insufficient explanation of how it worked. The court made clear that it was not making a definitive assessment of whether such evidence may be admissible in the future or what the conclusion might be in the light of further scientific developments.

Since the decision in Slade, research has been ongoing to address the concerns. The ‘Person-specific ASR’ project at the University of York is examining what makes a voice easy or difficult for AI-based technology to recognise. The aim is to better understand how such technology works and how it might reliably be utilised as a form of forensic evidence in cases where the identity of a speaker is unknown. The project is currently running experiments both with small-scale, highly controlled recordings and with very large-scale, forensically realistic data sets. It has identified how the technology can respond to different properties of speech: for example, whispered speech is extremely difficult for automatic speaker recognition systems to deal with, while regionally accented speech is relatively easy. Further work will involve directly comparing the output of such systems with the analyses of expert phoneticians.

‘Great caution and great care’

While using experts may prove challenging, it is preferable to relying on those without specialist training to undertake the comparison. In Flynn, the officers who had been involved in the arrest and interview of the defendant listened to the covert recordings and positively identified the speakers as the defendants. The recordings were of a very poor quality and there was no record of the circumstances in which the exercise had been conducted, leading to inconsistencies between the officers. The Court of Appeal stated such evidence should be treated with ‘great caution and great care’ and that it was always desirable to have an expert analysis as well. If such an identification was conducted solely by the police, its admissibility would depend very much on the circumstances of the case.

If there is a recording available in circumstances where the court is going to hear the defendant speak, does the court need an expert? Very much so! Just as a dock identification is an unsafe way of conducting an eyewitness identification, asking a jury to make their own assessment of whether a recording is of the person they have heard is hugely unreliable. Each juror will perceive sounds differently, with older jurors being less likely to hear the entire range of the speech signal. The way in which the sound is reproduced will vary depending on where the loudspeakers are placed and whether the jurors are using headphones or not. In cases where the recording is of poor quality, experimental research has shown that if the poor-quality recording is played before a clear sample of speech (equivalent to the accused speaking in court), then there is a higher chance of an incorrect identification being made. Then there is the inevitable issue of bias which will arise in any case where a jury has been told that there is other evidence against a defendant. In Khan [2023] EWCA 347, the Court of Appeal declined to prohibit this practice, but said the problems could be addressed with a clear direction to the jury. Even so, speaker identification by non-expert jurors remains an unsatisfactory process which should be avoided.

Voice identification will always be a challenging area. There is no such thing as a ‘voiceprint’ which can definitively resolve a case. Using voice evidence in a case requires a careful and methodical approach which makes clear the limitations of voice evidence. The problems of synthetically produced AI voices add an extra layer of complexity which researchers in this area are starting to tackle. But as we live more of our lives digitally and our actions are captured in different ways, this is an issue counsel are likely to encounter more frequently. 

‘Identification by voice (1)’ by Dr Jeremy Robson and Dr Harriet Smith appeared in the December 2023 issue of Counsel.

Jeremy and Kirsty will be hosting a free CPD event for criminal practitioners on Voice Identification and the Law at Leicester Castle’s Civil Courtroom on 21 March at 6pm.