SPECTROGRAPHIC VOICE RECOGNITION

    In the 1960s, a scientist named Lawrence Kersta, working with an invention called the sound spectrograph (created back in 1944 by Bell Laboratory scientists), claimed that "voiceprints" were a unique way to identify individuals. He trademarked the word "voiceprint", founded his own company (Voiceprint Laboratories), and established a professional association, the International Association of Voice Identification (IAVI), which was eventually absorbed in 1980 by the International Association for Identification (IAI), mainly a group of fingerprint experts. Kersta's brightest followers were Ernest Nash, who helped develop the Michigan State Police crime lab, and Oscar Tosi, a professor who helped develop the Michigan State University Forensic Science program. Together, Kersta, Nash, and Tosi were the leading (and quite possibly only) "voiceprint" experts in the United States. They testified all over the country, and put together training courses and certification programs which are still somewhat sought-after today.

    Today, nobody uses the word "voiceprint" anymore because of its erroneous association with "fingerprints". The former is a method of expert interpretation and opinion, while "fingerprinting" is regarded as a matter of absolute certainty and infallibility. To use the word "voiceprint" gives the method more scientific credibility than it deserves. At best, voiceprint identification is like polygraphy (lie detection) and is admissible in only 35 states, although it (like polygraphy) makes for a valuable investigative tool to screen potential suspects. Even the phrase "voiceprint identification" may be improper and should probably be abandoned in favor of the broader term, spectrographic voice recognition.

    Do NOT for a moment, however, think this is dead technology. Consider, for example, all those software companies trying to perfect speech recognition programs that allow you to talk to computers. Same underlying theory; same practical problems:

Person: It's hard to recognize speech. I want to check in.
Computer: It's hard to wreck a nice beach. I want two chicken.

    Consider also the promise and potential of individual voice recognition. If there's one thing police departments have plenty of, it's tape recordings of phone calls. Sure, there's enhanced 911, which displays the address and residency information on a caller, but what about all those anonymous tips, bomb threats, and ransom demands? If the technology were perfected, you could practically eliminate things like terrorism and kidnapping. Furthermore, let's take another criminal justice problem area -- eyewitness identification. It seems hypocritical to allow the "sound of voice" as eyewitness evidence, and then exclude scientific methods of voice comparison. Research clearly indicates that "earwitness" evidence is unreliable and can result in wrongful conviction. And (to wear the point out), what about the potential for crowd control? With scientific methods of voice identification, you could find out who those Klansmen under the hoods are, or who those shouting rioters looting the grocery store are. It's not even a constitutional issue. Speech is protected, but voice isn't. There's nothing prohibiting every single voice in the country from being recorded, indexed, and cataloged in an FBI database.

    It all depends, of course, on whether each and every individual voice is different. That is precisely the unproven hypothesis behind the THEORY of spectrographic voice recognition. In order to prove it, you need to have research done by scientists in recognized disciplines of study, like speech, audiology, phonetics, acoustics, physiology, or anatomical kinesiology (exercise science). Unfortunately, no one in any of these disciplines seems to be interested in such research. You can't even find a college course anywhere, at any level, graduate or undergraduate, dealing with the topic. There are also no journals, and very few books on the subject. It might be fair to say that the scientific community, on the whole, hasn't exactly warmed up to the idea of spectrographic voice recognition.

HOW IT WORKS

    The human voice is incapable of producing one pitch at a time. Instead, it produces a simultaneous series of fundamentals and overtones. Some overtones are random and others are whole-number multiples of the fundamental, called harmonics. Of all the characteristics of voice, two of the most important ones are frequency and intensity. Frequency is the rate at which air particles vibrate, measured in cycles per second (cps). Humans can only produce and hear frequencies in roughly the 60-16,000 cps range. Intensity is the amount of energy (loudness) in a sound wave or pulse. Variation in intensity does not affect frequency, but no two sound waves (even those produced by the same individual) will have exactly the same frequencies and intensities. That is, a sound, once made (even by the same individual), can never be exactly replicated in all its characteristics.
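
    As a rough illustration of the relationship between a fundamental and its harmonics, here is a minimal sketch in Python. The function name harmonics and the 120 cps fundamental are assumptions chosen for illustration; neither comes from the voice identification literature. The 16,000 cps ceiling is simply the upper limit quoted above.

```python
def harmonics(fundamental_cps, upper_limit_cps=16000, count=5):
    """Return up to `count` harmonics of a fundamental frequency.

    A harmonic is a whole-number multiple of the fundamental; the
    multiple n = 1 is the fundamental itself. Values above
    `upper_limit_cps` are left out.
    """
    result = []
    for n in range(1, count + 1):
        freq = n * fundamental_cps
        if freq > upper_limit_cps:
            break
        result.append(freq)
    return result


# Assumed example value: a 120 cps fundamental (illustrative only).
print(harmonics(120))
```

    Overtones that are not whole-number multiples of the fundamental would not appear in such a list, which is part of why a real voice looks much messier on a spectrogram than this tidy series suggests.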

    Uniqueness in voice is a product of both physiology and learning. With physiology, the two most important things are resonators (nasal, oral, and pharyngeal passages) and articulators (lips, teeth, tongue, soft palate, and jaw muscles). With learning, the process is mostly imitation and trial and error, but the brain's speech center is actually sending signals to various organs. There's no such thing as spontaneous speech; it's all controlled by the brain. A person may try to disguise their voice, or even learn a foreign language, but the way their brain controls speech habit and especially the way their resonators and articulators are shaped and used cannot be changed. This makes each and every individual voice unique. At the very least, intraspeaker variability is less than interspeaker variability.
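
    The claim that intraspeaker variability is less than interspeaker variability can be stated as a simple comparison of distances. The sketch below is only an illustration of that idea: the two-number "feature vectors" are made-up values, not measurements from any real speaker, and the names distance and mean_variability are hypothetical. Real analysts do not reduce a voice to two numbers.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_variability(samples_by_speaker):
    """Return (mean intraspeaker distance, mean interspeaker distance)."""
    intra, inter = [], []
    speakers = list(samples_by_speaker)
    # Distances between samples from the SAME speaker.
    for spk in speakers:
        for a, b in combinations(samples_by_speaker[spk], 2):
            intra.append(distance(a, b))
    # Distances between samples from DIFFERENT speakers.
    for s1, s2 in combinations(speakers, 2):
        for a in samples_by_speaker[s1]:
            for b in samples_by_speaker[s2]:
                inter.append(distance(a, b))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Made-up numbers for illustration only (not real voice data).
samples = {
    "speaker_A": [(500.0, 1500.0), (510.0, 1480.0)],
    "speaker_B": [(650.0, 1200.0), (640.0, 1230.0)],
}
intra, inter = mean_variability(samples)
print("intraspeaker < interspeaker:", intra < inter)
```

    The method's underlying hypothesis amounts to saying that this comparison comes out the same way for every pair of speakers, which is exactly what has not been proven.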

    It should be noted, however, that some of this has not been empirically established. That is to say, there haven't been exhaustive clinical trials done on all the different ways to disguise, alter, muffle, or mimic the voice. There's also no research on the impact of age, dentures, tooth extraction, respiratory illness, emotional state, and the like. There is, however, some scientific support for the claim that speaking a foreign language makes no difference.

    It used to be thought (according to Kersta) that there were 10 so-called "cue words" (the, to, and, me, on, is, you, I, it and a) that lent themselves well to voice comparison, then that similar sound samples had to be obtained (requiring wired undercover operatives to get suspects to say silly things), but today, it's generally agreed that an expert MUST have samples similar in the following respects:

    The sound samples are played (unknown voice first) in a continuous loop through a sound spectrograph, either one of the fairly old analog machines resembling an IBM mainframe or one of the newer digital signal analysis workstations (there are some worries about "enhancements" with the digital PCs). It "reads" the frequencies and intensities, and produces printouts that look a lot like recordings of earthquake tremors. These printouts are sliced into 2.5 second segments called spectrograms, and they portray three dimensions: time (horizontal axis), frequency (vertical axis), and intensity (degree of darkness in the ink).
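
    For readers who want to see what a digital workstation is doing in principle, the following is a minimal sketch in Python of a spectrogram built from a plain discrete Fourier transform. It is not the algorithm of any particular spectrograph. The sampling rate, frame size, 440 cps test tone, and all function names are assumptions made for illustration; only the 2.5 second segment length comes from the text above.

```python
import cmath
import math


def frame_spectrum(frame):
    """Intensity of each frequency bin for one short frame (naive DFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)))
        for k in range(n // 2)
    ]


def spectrogram(samples, frame_size):
    """Rows are time steps, columns are frequency bins, values are intensity."""
    return [
        frame_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def split_segments(samples, rate, seconds=2.5):
    """Cut a recording into consecutive segments of `seconds` length."""
    size = int(rate * seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


RATE = 8000   # assumed sampling rate, in samples per second
FRAME = 200   # assumed frame size; each bin is RATE / FRAME = 40 cps wide

# A synthetic 440 cps tone, one tenth of a second long (not a real voice).
tone = [math.sin(2 * math.pi * 440 * t / RATE) for t in range(800)]

spec = spectrogram(tone, FRAME)
# Find the darkest "ink" (highest intensity) in the first time step.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("strongest frequency in first frame:", peak_bin * RATE / FRAME, "cps")
```

    Each row of spec corresponds to a position along the horizontal (time) axis, each column to a position along the vertical (frequency) axis, and each value to how dark the ink would be at that point. A real voice would light up many bins at once rather than a single one.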

    The expert then engages in a two-step process of comparison: aural (listening) and visual (looking over the spectrograms). In the aural stage, the analyst makes note of pronunciation similarities or differences, such as whether the word "the" is said with a short "a" or long "e" sound. They will also index and splice certain start and stop points in each sample, to create a new playback loop where one sample ends and another begins (this won't usually be admissible in court). Finally, the expert scrutinizes speech habits, psycholinguistic features, dialect, inflection, syllable grouping, and breath patterns. Like profilers, they are trying to put themselves in the mind of the suspect.

    In the visual stage, the 2.5 second spectrograms representing the same sound are visually compared. The analyst studies bandwidth, mean frequencies, trajectories, striations, stops, plosives, and fricatives. Differences as well as similarities are looked at.  In fact, accounting for the differences plays a part in one of five standard conclusions the expert arrives at:

    The courts have not been all that accepting of the formulation of these five standard conclusions. The expert can usually expect on cross, redirect, or recross to be asked for a more quantitative conclusion. Particularly with "probable identification", and since a defendant's life may be at stake, the expert will be asked to express in percentage terms how confident they are in a "match". Appealing on the basis of "this is the way our profession does things" probably isn't going to help much when the judge orders them to "answer the question". Then, the whole issue of error crops up, and this is a field with an interesting history on that subject.

ERROR AND ADMISSIBILITY

    In short, there's an error rate, ranging from 6% to 29%. Back in 1962, Kersta claimed 99% accuracy, but Tosi (who was denigrated in one case for just being a "college professor") would claim error rates far in excess of 1%. Nash and Tosi even butted heads as opposing experts (unbeknownst to either one) in a California case with positive identification versus probable elimination opinions. The true error rates may never be known, and for that reason, the technique is unlikely to qualify under Daubert, although that depends upon how you interpret proof of "reliability" as spelled out in Daubert, which actually permits novel scientific evidence of known reliability and replicability.
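
    To put those percentages in perspective, here is a small arithmetic sketch in Python. The batch of 1,000 comparisons is a hypothetical figure chosen only to make the numbers easy to read, and the function name expected_errors is likewise an assumption. The rates are the ones mentioned above: the 1% implied by Kersta's 99% accuracy claim, and the 6% and 29% ends of the reported range.

```python
def expected_errors(comparisons, error_rate):
    """Expected number of wrong conclusions at a given error rate."""
    return comparisons * error_rate


# Hypothetical batch size, for illustration only.
COMPARISONS = 1000

for rate in (0.01, 0.06, 0.29):
    wrong = expected_errors(COMPARISONS, rate)
    print(f"{rate:.0%} error rate: about {wrong:.0f} wrong calls "
          f"per {COMPARISONS} comparisons")
```

    The gap between the low and high figures is what makes it so hard for an expert to answer the "how confident are you" question with a single percentage.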

    Spectrographic voice recognition hasn't fared any better under Frye rules, either, with its requirement of "general acceptance" in the scientific community. Both the listening and looking stages of the method are somewhat subjective or at least qualitative in how they're combined. The standard decision reached by one expert may be at odds with the standard decision reached by another. Moreover, the process of becoming an expert is far from being calibrated itself. By laying down guidelines for standard opinions, the profession appears to be working with a Coppolino standard, but popularity of the "voiceprint" legacy and/or seriousness of the case lead courts to require a points-of-comparison or probability-estimate approach. Judicial notice and case precedent are mixed, an example of bean-counting.

    Several of the larger states (like Indiana, Arizona, New York, California, Michigan, Massachusetts, New Jersey, and Florida) accept the methodology. Others (like Pennsylvania and Texas) want to, but are waiting for "general acceptance". Still others (like Maryland and Louisiana) have always rejected Frye standards anyway, and are perhaps typical in using general Rules of Evidence to get this type of scientific expertise on the stand as long as it's relevant and provides some material assistance to the jury.

    Intelligence agencies, like the CIA and NSA, also regularly use voice recognition in their work, which sometimes involves determining whether the voice on a tape recording really is the terrorist they think it is.

INTERNET RESOURCES
Alaska v. Coon (1999)
International Association for Identification
Speech as a Biometric
Steven Cain's Voice Expert Homepage
Voice Identification: The Aural/Spectrographic Method

PRINTED RESOURCES
Comment, C. (1972). "The Evidentiary Value of Spectrographic Voice Identification" Journal of Criminal Law, Criminology & Police Science 63: 343.
Gray, C. & G. Kopp. (1944). Voiceprint identification. Bell Telephone Report, Bell Laboratories.
Kersta, L. (1962). "Voiceprint Identification" Nature 196: 1253.
Koenig, B. (1986). "Spectrographic Voice Identification" Journal of the Acoustical Society of America 79: 2088.
Moenssens, Inbau, & Starrs. (1986). Scientific Evidence in Criminal Cases, 3rd ed. Mineola, NY: Foundation Press.
National Academy of Sciences. (1979). On the Theory and Practice of Voice Identification. Washington, D.C.
Tosi, O. (1981). "Methods of Voice Identification for Law Enforcement Agencies" Identification News, April, 6.

Last updated: 11/12/03
Lecture List for JUS 425