
Identification of speech and speakers in an audio recording

Abstract

Speech identification is important in the field of forensics, and so are recordings of speech. Because recordings tend to contain unwanted noise and signals, forensic experts use a range of techniques to extract relevant speech from them. The first is "audio enhancement", the process of removing unwanted noise while preserving the relevant speech in the recording. The second is "speech analysis", the process of proving that the speech in the enhanced recording was uttered by a specific person.

Introduction

The usefulness of forensic speech analysis cannot be overstated. It can be a turning point in a court case, since the evidence is very difficult to refute if presented under the right conditions. These conditions include definite proof that the original recording was not tampered with and proof that the speech in the recording was, without a doubt, uttered by the accused. However, such evidence is almost always recorded under non-ideal conditions, for example over a phone call or covertly in public, so the speech can barely be deciphered. This is where audio analysis and enhancement come into play. Using such techniques, it is possible to filter out interference such as white noise, barking, screaming and engine noise, leaving an audio recording of intelligible speech. This speech can then be further analyzed to prove it was spoken by the accused.

Audio Enhancement

The first step in this process involves enhancing the audio file so it can be understood by humans. Forensic audio enhancement is the process of clarifying audio recordings using non-destructive techniques that preserve speech quality. This process is also sometimes referred to as speech enhancement or voice enhancement. Put simply, it involves identifying the unwanted sounds and reducing them, while taking care that the quality of any speech in the recording does not degrade. As a result, the first step is crucial: identifying the sounds to be preserved and the sounds to be removed. This is sometimes done by a trained human ear, but in some cases the recording is too noisy to distinguish any form of speech. In that case, it is possible to apply noise reduction and speech enhancement indiscriminately across the whole recording in an attempt to find recognizable speech; however, this should only be done on a copy of the recording. Once the file has been analyzed to identify areas of interest, an equalization filter is applied. This filter raises or lowers specific frequencies depending on how the filter curve was specified. In the case of human speech, a typical adult male's fundamental speech frequency ranges from about 80 Hz to 180 Hz, while an adult female's ranges from about 165 Hz to 255 Hz. Therefore, a possible filter would look like this:

[Figure: audio.jpeg – example equalization filter curve that boosts the speech band and attenuates frequencies outside it]

This filter lowers all frequencies below approximately 80 Hz and above 300 Hz by 27 dB, while boosting frequencies between 80 Hz and 300 Hz by 12 dB. However, this filter does not always produce ideal results, as the volume levels can still be too low or imbalanced. To resolve this, compression is applied to further balance the volume levels of parts of the audio file, and in the case of excessive background noise a gate can be used to suppress that noise. Compressors are a type of amplifier whose gain depends on the signal passing through them. For example, a compressor can be set to boost quieter sounds more than louder ones, or to reduce the gain on any sounds whose amplitude is above a certain threshold. Noise gates reduce unwanted sounds in a similar fashion, but they completely remove any signal that falls below a set amplitude, meaning any quiet distractions below the threshold are removed entirely [5]. The final step in audio enhancement is de-reverberation and, once again, noise reduction. Audio recordings to be used as evidence are seldom recorded in ideal conditions, and even with the aforementioned enhancements the resulting speech is sometimes still hard to understand. Removing reverb makes any speech much easier to understand. The noise reduction is now taken further: instead of using frequency ranges and gates to reduce noise, each type of noise has its own noise profile, which is used to remove that noise in its entirety. For example, a different noise profile would be used to remove the sound of a running engine than to remove the sound of rain [4].
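
The enhancement chain described above can be prototyped with general-purpose signal-processing libraries. The following is a minimal sketch, assuming SciPy and NumPy, a mono 16-bit PCM WAV file, and illustrative file names, gains and thresholds; it applies a rough speech-band equalizer (boosting roughly 80–300 Hz and attenuating the rest) followed by a simple RMS noise gate.

<code python>
# A rough sketch of the enhancement chain described above, assuming SciPy/NumPy
# and a mono, 16-bit PCM WAV file. File names, gains and thresholds are
# illustrative, not taken from any specific case.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

def db_to_gain(db):
    return 10.0 ** (db / 20.0)

def speech_band_eq(x, sr, low=80.0, high=300.0, boost_db=12.0, cut_db=-27.0):
    """Boost the 80-300 Hz speech band, attenuate everything else."""
    sos = butter(4, [low, high], btype="bandpass", fs=sr, output="sos")
    band = sosfilt(sos, x)       # content inside the speech band
    rest = x - band              # rough complement of the band
    return db_to_gain(boost_db) * band + db_to_gain(cut_db) * rest

def noise_gate(x, sr, threshold_db=-45.0, frame_ms=20):
    """Silence short frames whose RMS level falls below the threshold."""
    frame = max(1, int(sr * frame_ms / 1000))
    out = x.copy()
    thr = db_to_gain(threshold_db)
    for start in range(0, len(x), frame):
        seg = x[start:start + frame]
        if np.sqrt(np.mean(seg ** 2)) < thr:
            out[start:start + frame] = 0.0
    return out

# Work on a copy of the evidence file, never on the original recording.
sr, data = wavfile.read("evidence_copy.wav")
x = data.astype(np.float64) / 32768.0        # normalize 16-bit samples to [-1, 1]
enhanced = noise_gate(speech_band_eq(x, sr), sr)
wavfile.write("enhanced.wav", sr, np.clip(enhanced * 32767, -32768, 32767).astype(np.int16))
</code>

Dedicated forensic tools expose the same building blocks (equalization, compression, gating, noise profiles) with far finer control; the sketch only illustrates the order of operations.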

Speech Analysis

Once the audio has been enhanced to an acceptable level, the next problem is recognizing the speaker, a task known as Forensic Speaker Recognition (FSR).

The Human Voice

FSR is based on the theory that every human's voice is unique, like fingerprints and DNA. The human voice is determined by a multitude of factors which, as a result of DNA being unique, are considered to make each voice unique. Speech comprises three main mechanisms: respiration, phonation and articulation. Respiration is performed by the lungs and trachea. Phonation is the production of sound, and the main organ used for phonation is the larynx, colloquially referred to as "the voice box" because it houses the vocal folds, also known as vocal cords. These folds vibrate to create different sounds. For example, a hissing "s" sound is created without any vibration of the vocal folds, making it a voiceless sound, while a buzzing "z" sound is created with vocal-fold vibration. The final step in creating speech is articulation, which modifies the sound created in the larynx and turns it into understandable words. The articulators are the lips, tongue and soft palate. There is, however, another set of features that make a person's voice unique through resonance: the throat, mouth cavity and nasal passages. These are considered to be the main factors that make a person's voice recognizable and unique [3][7].

In most cases, the result of audio enhancement is an audio file with amplified speech and muffled noise. Speech analysis takes that file as input and uses various techniques to classify the comparison as a positive, negative or unresolved identification. Before the existence of computers as we know them today, people tried to identify speakers using only those characteristics of the human voice that can be considered distinguishing factors. Those factors are:

  • language,
  • accent,
  • use of specific words,
  • voice tone…

From all of these factors it can be determined whether the person is male or female, their approximate age, where they are from, and in some cases even what emotions they are feeling. The problem with that approach is that the people trying to identify speakers need excellent hearing and a great deal of knowledge, experience and training, especially when the audio files are of low quality.

Automated FSR

With advances in technology, new techniques have been developed and many more will follow, including artificial intelligence, machine learning, deep learning, NLP and so on. One project recognizes Sepedi home-language speakers using four classifier models: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Multilayer Perceptrons (MLP) and Random Forests (RF) [1]. In another article [2], speaker recognition is performed with a deep learning model based on a convolutional neural network (CNN). That model is text-independent, meaning it does not take the meaning of the spoken text into account; a text-dependent model would be much more complex. The model works with spectrograms extracted from speech. Deep learning models are also capable of outperforming human analysts when it comes to recognizing speakers from short, so-called "trivial events", such as sneezes, coughs and "hmm" sounds [9]. Datasets for training such models do exist, such as one on Kaggle that features 1,500 samples from five prominent world leaders, along with background noise that can be mixed into the training data. VoxCeleb operates at a much larger scale with 7,000 speakers, but the dataset could not be downloaded from its site at the time of writing and it is unclear whether it will ever be available again. Currently there are 203 public GitHub repositories tagged with the topic "speaker-recognition".
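
To make the classical approach in [1] concrete, the following is a minimal sketch, assuming librosa and scikit-learn and a hypothetical dataset layout of data/<speaker>/<file>.wav; it extracts an averaged MFCC vector per recording and compares the four classifier families mentioned above. It illustrates the general technique rather than reproducing the cited system.

<code python>
# A minimal sketch of the classical pipeline from [1]: averaged MFCC features
# per recording, compared across four classifiers. Assumes librosa and
# scikit-learn, plus a hypothetical layout data/<speaker>/<file>.wav.
import glob
import os
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

def mfcc_vector(path, n_mfcc=20):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)      # one fixed-length vector per recording

X, labels = [], []
for path in glob.glob("data/*/*.wav"):                       # hypothetical dataset
    X.append(mfcc_vector(path))
    labels.append(os.path.basename(os.path.dirname(path)))   # speaker = folder name
X, labels = np.array(X), np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, stratify=labels)
models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=1000),
    "RF": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "accuracy:", round(model.score(X_te, y_te), 3))
</code>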

Tools for FSR

Spectrograms are great tools when it comes to speaker recognition. One application that can analyze spectrograms of audio files is Audacity, a free, open-source audio editing and recording program. It includes all of the features described above for audio enhancement, along with the spectrogram view used for visual identification, which is explained below. Professional analysts may prefer paid software such as Adobe Audition, which claims to provide the "industry's best audio cleanup, restoration,…" along with other features, or Phonexia, which claims to provide a "speaker recognition solution designed explicitly for forensic experts".
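
Alongside such tools, a spectrogram for visual comparison can also be rendered directly with free libraries. The following is a minimal sketch, assuming SciPy and matplotlib and an illustrative file name.

<code python>
# A minimal sketch that renders a spectrogram for visual comparison, assuming
# SciPy and matplotlib; the file name is illustrative.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

sr, data = wavfile.read("enhanced.wav")                  # hypothetical mono recording
f, t, Sxx = spectrogram(data, fs=sr, nperseg=1024)
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-12), shading="gouraud")  # dB scale
plt.ylabel("Frequency [Hz]")
plt.xlabel("Time [s]")
plt.title("Spectrogram of the questioned recording")
plt.show()
</code>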

Methods of FSR

In practice, aural and spectrographic methods are used in forensic speaker recognition. The analyst is typically sent multiple recordings that need to be identified, along with multiple recordings of the suspect repeating the same content. If the second voice sample has to be obtained without the suspect's knowledge, one method is to discreetly record a conversation in which the interviewer steers the discussion so that the suspect repeats as many words as possible from the recording being analyzed. Ideally, the second recording should be made with the same device as the first, or at least using the same method; for example, if the first recording came from a recorded telephone line, the second sample should also be recorded over a telephone line. Once the samples have been obtained, the analyst begins with an aural analysis of the recordings. This involves comparing similarities and differences and deciding which parts of the samples are useful and which are not. The examiner listens for key features of speech such as accent, dialect, syllable grouping and breathing patterns, along with other peculiarities such as speaking speed and apparent mental state. Once the analyst has decided which parts of the original and comparison recordings are similar enough, spectrographic analysis is used. The speech samples are recorded on a sound spectrograph and are then analyzed in small increments. The result is a spectrogram: a visual representation of the spectrum of frequencies over time. The spectrograms of the unknown and known recordings are visually compared. Factors include:

  • Bandwidth
  • Mean frequency
  • Trajectory of vowel formants
  • Vertical striations
  • Nasal resonance
  • Stops
  • Articulation
  • Acoustic patterns

These factors, among others, are closely examined to determine whether differences are due to different pronunciation or to a different speaker. Given sufficient evidence, a positive identification or elimination is reached. Given weaker evidence, a probable identification or elimination is reached. If the audio is of too poor quality or contains too little comparable information, the conclusion is described as unresolved [6][8].
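
Some of these factors can also be quantified automatically as a rough cross-check of the visual comparison. The following is a minimal sketch, assuming librosa and hypothetical file names, that prints the mean frequency (spectral centroid) and spectral bandwidth of a questioned and a known recording.

<code python>
# A minimal sketch that quantifies two of the factors above, mean frequency
# (approximated by the spectral centroid) and bandwidth, for a questioned and
# a known recording. Assumes librosa; the file names are hypothetical.
import librosa

def spectral_summary(path):
    y, sr = librosa.load(path, sr=None)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr).mean()
    return centroid, bandwidth

for label, path in [("questioned", "questioned.wav"), ("known", "known.wav")]:
    c, b = spectral_summary(path)
    print(f"{label}: mean frequency ~ {c:.0f} Hz, bandwidth ~ {b:.0f} Hz")
</code>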

Conclusion

Forensic Speaker Recognition is more important than ever in the modern digital age. Communication through audio channels is at an all-time high thanks to modern phone networks and internet voice-call applications. These communication methods are used by all kinds of people, including criminals, so being able to prove who said what is important, even with anonymous calls and VPNs. Forensic Speaker Recognition has been used to help solve cases reaching all the way back to 1923 [6], and it will surely continue to be used in the future.

Literature

[1] T. B. Mokgonyane, T. J. Sefara, T. I. Modipa, M. M. Mogale, M. J. Manamela and P. J. Manamela, "Automatic Speaker Recognition System based on Machine Learning Algorithms," 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), 2019, pp. 141-146, doi: 10.1109/RoboMech.2019.8704837.

[2] Bunrit, Supaporn, et al. "Text-independent speaker identification using deep learning model of convolution neural network." International Journal of Machine Learning and Computing 9.2 (2019): 143-148.

[3] Anderson, Catherine. “2.1 How Humans Produce Speech.” Essentials of Linguistics, McMaster University, 15 Mar. 2018.

[4] “Forensic Audio Enhancement, Voice Enhancement.” Audio Forensic Expert.

[5] “How to Use Dynamics Processing: Getting Started with Compressors, Gates, and More.” PreSonus.

[6] Owen, Jennifer. Owen Forensic Services, LLC, 29 July 2018.

[7] “Voice Anatomy & Physiology.” THE VOICE FOUNDATION, 30 July 2015.

[8] M. M. Karakoç and A. Varol, "Visual and auditory analysis methods for speaker recognition in digital forensic," 2017 International Conference on Computer Science and Engineering (UBMK), 2017, pp. 1113-1116, doi: 10.1109/UBMK.2017.8093505.

[9] M. Zhang et al., "Human and Machine Speaker Recognition Based on Short Trivial Events," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5009-5013, doi: 10.1109/ICASSP.2018.8462027.

Discussion

Juraj Petrović, 2022/06/02 10:14

Break the Speech Analysis chapter into subsections so it is easier to read. Add more recent papers from IEEE Xplore and summarize their ideas or results. Also describe the main features of practical tools that can do some or all of what you discuss. Are there GitHub projects or training datasets that can be used, or that are standard in this field?
