                | racfor_wiki:fdd:identifikacija_govornika [2022/05/26 11:24] vmuzevic [Conclusion]
 | racfor_wiki:fdd:identifikacija_govornika [2024/12/05 12:24] (current) 
 | 
        
| ====== Identification of speech and speakers in a sound recording ====== | ====== Identification of speech and speakers in an audio recording ====== | 
| ===== Abstract ===== | ===== Abstract ===== | 
|  |  | 
|  |  | 
|  |  | 
| IMAGE HERE | {{ racfor_wiki:fdd:audio.jpeg?500x200 }} | 
|  |  | 
|  |  | 
| This filter attenuates all frequencies below approximately 80 Hz and above 300 Hz by 27 dB, while boosting frequencies between 80 Hz and 300 Hz by 12 dB. | This filter attenuates all frequencies below approximately 80 Hz and above 300 Hz by 27 dB, while boosting frequencies between 80 Hz and 300 Hz by 12 dB. | 
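As a rough illustration of what these numbers mean, the sketch below converts the two level changes into linear amplitude factors and applies them as an idealized gain mask in the frequency domain. The names `db_to_gain` and `band_emphasis`, their parameters and the naive DFT are assumptions made for this example; this is not how Audacity or any other editor implements its filters.

```python
import cmath
import math


def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)


def band_emphasis(samples, sample_rate, low_hz=80.0, high_hz=300.0,
                  boost_db=12.0, cut_db=-27.0):
    """Boost the low_hz..high_hz band and attenuate everything else.

    Hypothetical helper: an idealized brick-wall gain mask applied through a
    naive O(n^2) DFT, which is only practical for short signals.
    """
    n = len(samples)
    boost, cut = db_to_gain(boost_db), db_to_gain(cut_db)
    # Forward DFT of the whole signal.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Scale each bin according to the frequency it represents.
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n  # bins above n/2 mirror the lower half
        spectrum[k] *= boost if low_hz <= freq <= high_hz else cut
    # Inverse DFT; the imaginary part is only rounding noise for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

With these defaults, the 12 dB boost multiplies amplitudes by roughly 4, while the 27 dB cut multiplies them by roughly 0.045.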
| This filter, however, does not always produce ideal results, as the volume levels can still be too low or imbalanced. To resolve this, compression is applied to further balance the volume levels across the audio file, and in the case of excessive background noise a gate can be used to suppress that noise. A compressor is a type of amplifier whose gain depends on the level of the signal passing through it. For example, a compressor can be set to boost quieter sounds more than louder ones, or to reduce the gain of any sound whose amplitude is above a certain threshold. Noise gates reduce unwanted sounds in a similar fashion; however, they completely remove any signal below the set amplitude threshold, so quiet distractions below it are removed entirely. The final step of audio enhancement is de-reverberation and, once again, noise reduction. Audio recordings to be used as evidence are seldom made in ideal conditions, and even with the aforementioned enhancements the resulting speech is sometimes still hard to understand. Removing reverb generally makes speech easier to understand. Noise reduction is now taken to the extreme: instead of using frequency ranges and gates, each type of noise has its own noise profile, which is used to remove that noise in its entirety. For example, a different noise profile would be used to remove the sound of a running engine than to remove the sound of rain. | This filter, however, does not always produce ideal results, as the volume levels can still be too low or imbalanced. To resolve this, compression is applied to further balance the volume levels across the audio file, and in the case of excessive background noise a gate can be used to suppress that noise. A compressor is a type of amplifier whose gain depends on the level of the signal passing through it. 
For example, a compressor can be set to boost quieter sounds more than louder ones, or to reduce the gain of any sound whose amplitude is above a certain threshold. Noise gates reduce unwanted sounds in a similar fashion; however, they completely remove any signal below the set amplitude threshold, so quiet distractions below it are removed entirely [5]. The final step of audio enhancement is de-reverberation and, once again, noise reduction. Audio recordings to be used as evidence are seldom made in ideal conditions, and even with the aforementioned enhancements the resulting speech is sometimes still hard to understand. Removing reverb generally makes speech easier to understand. Noise reduction is now taken to the extreme: instead of using frequency ranges and gates, each type of noise has its own noise profile, which is used to remove that noise in its entirety. For example, a different noise profile would be used to remove the sound of a running engine than to remove the sound of rain [4]. | 
|  |  | 
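The difference between a compressor and a noise gate can be sketched in a few lines. The example below is a deliberately simplified, per-sample version: real tools track a smoothed signal envelope with attack and release times, and the function names, the `threshold` and `ratio` parameters and the sample values are all assumptions chosen for illustration. Only the threshold-based variant of compression mentioned above is shown.

```python
import math


def noise_gate(samples, threshold):
    """Silence every sample whose absolute amplitude is below the threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def compress(samples, threshold, ratio):
    """Reduce the part of each amplitude that exceeds the threshold by `ratio`.

    Samples at or below the threshold pass through unchanged, so the gain
    depends on the level of the signal itself.
    """
    out = []
    for s in samples:
        magnitude = abs(s)
        if magnitude > threshold:
            magnitude = threshold + (magnitude - threshold) / ratio
        out.append(math.copysign(magnitude, s))
    return out


# Hypothetical samples in the range -1.0..1.0.
gated = noise_gate([0.5, -0.01, 0.02, -0.8], threshold=0.05)
squeezed = compress([0.2, 1.0, -1.0], threshold=0.5, ratio=4)
```

The gate removes quiet samples entirely, whereas the compressor only scales down how far loud samples exceed the threshold.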
|  |  | 
| ===== Speech Analysis ===== | ===== Speech Analysis ===== | 
|  |  | 
| Once the audio is enhanced to an acceptable level, the next problem is recognizing the speaker, which is known as Forensic Speaker Recognition (FSR). FSR is based on the theory that every human’s voice is unique, like fingerprints and DNA. The human voice is determined by a multitude of factors which, because every person’s DNA is unique, are considered to make each voice unique as well. Speech comprises three main mechanisms: respiration, phonation and articulation. Respiration is carried out by the lungs and trachea. Phonation is the production of sound, and its main organ is the larynx, colloquially referred to as “the voice box”, which houses the vocal folds, also known as vocal cords. These folds vibrate to create different sounds. For example, a hissing “s” is produced without any vibration of the vocal folds and is therefore called a voiceless sound, while a buzzing “z” is produced with vocal fold vibration. The final step in creating speech is articulation, which modifies the sound created in the larynx and turns it into understandable words. The articulators are the lips, tongue and soft palate. There is, however, another set of features that makes a human’s voice unique through resonance: the throat, mouth cavity and nasal passages. These are considered to be the main factors in making a person’s voice recognizable and unique [3][7]. | Once the audio is enhanced to an acceptable level, the next problem is recognizing the speaker, which is known as Forensic Speaker Recognition (FSR). | 
|  | ==== The Human Voice ==== | 
|  | FSR is based on the theory that every human’s voice is unique, like fingerprints and DNA. The human voice is determined by a multitude of factors which, because every person’s DNA is unique, are considered to make each voice unique as well. Speech comprises three main mechanisms: respiration, phonation and articulation. Respiration is carried out by the lungs and trachea. Phonation is the production of sound, and its main organ is the larynx, colloquially referred to as “the voice box”, which houses the vocal folds, also known as vocal cords. These folds vibrate to create different sounds. For example, a hissing “s” is produced without any vibration of the vocal folds and is therefore called a voiceless sound, while a buzzing “z” is produced with vocal fold vibration. The final step in creating speech is articulation, which modifies the sound created in the larynx and turns it into understandable words. The articulators are the lips, tongue and soft palate. There is, however, another set of features that makes a human’s voice unique through resonance: the throat, mouth cavity and nasal passages. These are considered to be the main factors in making a person’s voice recognizable and unique [3][7]. | 
|  |  | 
| In most cases, the result of audio enhancement is an audio file with amplified speech and muffled noise. Speech analysis takes that file as input and uses a number of techniques to classify the result as a positive, negative or unresolved identification. Before the existence of computers as we know them today, people also tried to identify speakers using only the parts of the human voice that can be considered distinguishing factors. Those factors are: | In most cases, the result of audio enhancement is an audio file with amplified speech and muffled noise. Speech analysis takes that file as input and uses a number of techniques to classify the result as a positive, negative or unresolved identification. Before the existence of computers as we know them today, people also tried to identify speakers using only the parts of the human voice that can be considered distinguishing factors. Those factors are: | 
| From all these factors it can be determined whether the person is male or female, their approximate age, where that person is from, and in some cases even what emotions that person is feeling. The problem with this approach is that the people trying to identify speakers need excellent hearing and a lot of knowledge, experience and training, especially when the audio files are of low quality. | From all these factors it can be determined whether the person is male or female, their approximate age, where that person is from, and in some cases even what emotions that person is feeling. The problem with this approach is that the people trying to identify speakers need excellent hearing and a lot of knowledge, experience and training, especially when the audio files are of low quality. | 
|  |  | 
| With advances in technology, new techniques have been developed and many more will follow. These techniques draw on artificial intelligence, machine learning algorithms, deep learning, natural language processing (NLP) and so on. One project recognizes Sepedi home language speakers using four classifier models: Support Vector Machines, K-Nearest Neighbors, Multilayer Perceptrons (MLP) and Random Forest (RF) [1]. In another article [2], speaker recognition is performed by a deep learning model based on a convolutional neural network (CNN). That model is text-independent, which means it does not take the content of what is said into account; a text-dependent model would be much more complex. The model works with spectrograms extracted from speech. | ==== Automated FSR ==== | 
|  | With advances in technology, new techniques have been developed and many more will follow. These techniques draw on artificial intelligence, machine learning algorithms, deep learning, natural language processing (NLP) and so on. One project recognizes Sepedi home language speakers using four classifier models: Support Vector Machines, K-Nearest Neighbors, Multilayer Perceptrons (MLP) and Random Forest (RF) [1]. In another article [2], speaker recognition is performed by a deep learning model based on a convolutional neural network (CNN). That model is text-independent, which means it does not take the content of what is said into account; a text-dependent model would be much more complex. The model works with spectrograms extracted from speech. Deep learning models are also capable of outperforming human analysts when recognizing speakers from short, so-called "trivial events" such as sneezes, coughs and "hmmm" sounds [9]. Datasets for training such models do exist, such as this one on [[https://www.kaggle.com/datasets/kongaevans/speaker-recognition-dataset|Kaggle]], which features 1500 samples from five prominent world leaders, as well as background noise that can be mixed into the training data. [[https://www.robots.ox.ac.uk/~vgg/data/voxceleb/|VoxCeleb]] is much larger in scale, with 7000 speakers, but the dataset could not be downloaded from the site at the time of writing and it is unclear whether it will become available again. At the time of writing there are 203 public [[https://github.com/topics/speaker-recognition|GitHub]] repositories tagged with the "speaker-recognition" topic. | 
|  |  | 
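To make the idea of "spectrograms extracted from speech" concrete, the following minimal sketch computes a magnitude spectrogram with a short-time Fourier transform using only the Python standard library. The frame size, hop size, Hann window and the synthetic 1000 Hz test tone are assumptions picked for the example; the cited articles do not state which parameters their models use, and a real pipeline would rely on an FFT library rather than this naive DFT.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of magnitudes for bins 0..frame_size/2."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT of one windowed frame at bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


# Synthetic stand-in for speech: a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone)
# Bin k corresponds to k * sample_rate / frame_size Hz, so 1000 Hz falls in bin 8.
peak_bins = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
```

Each row of `spec` is one time slice; stacking the rows gives the time-frequency image that a CNN would take as input.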
| Spectrograms are great tools when it comes to speaker recognition. One application for analyzing spectrograms of audio files is Audacity. A verified audio file of the speaker is needed for comparison with the audio file being identified. | ==== Tools for FSR ==== | 
|  | Spectrograms are great tools when it comes to speaker recognition. One application for analyzing spectrograms of audio files is Audacity, a free, open-source audio editing and recording program. It comes with all the features needed for the audio enhancement method described above, along with the spectrogram view used for visual identification, which is explained below. Professional analysts may prefer paid software such as [[https://www.adobe.com/products/audition.html|Adobe Audition]], which claims to provide the "industry's best audio cleanup, restoration,..." along with other features, or [[https://www.phonexia.com/#|Phonexia]], which claims to provide a "speaker recognition solution designed explicitly for forensic experts". | 
| In practice, aural and spectrographic methods are used in forensic speech recognition. The analyst is typically sent multiple recordings that need to be identified, along with multiple recordings of the suspect repeating the same words. However, if the second voice sample has to be obtained without the suspect's knowledge, one method is to discreetly record a conversation in which the interviewer steers the conversation so that the suspect repeats as many words as possible from the recording being analysed. Ideally, the second recording should be made using the same device as the first, or at least the same method; for example, if the first recording came from a recorded telephone line, the second sample should also be recorded through a telephone line. Once the samples have been obtained, the analyst begins with the aural analysis of the recordings. This involves comparing similarities and differences and deciding which parts of the samples are useful and which are not. The examiner looks out for key features of speech such as accent, dialect, syllable grouping and breathing patterns, along with any other peculiar speech habits. Once the analyst has decided which parts of the original and comparison recordings are similar enough, spectrographic analysis is used. The speech samples are recorded on the sound spectrograph and are then analyzed in small increments. The result is a spectrogram, a visual representation of the spectrum of frequencies over time. The spectrograms of the unknown and known recordings are visually compared. Factors include: | ==== Methods of FSR ==== | 
|  | In practice, aural and spectrographic methods are used in forensic speech recognition. The analyst is typically sent multiple recordings that need to be identified, along with multiple recordings of the suspect repeating the same words. However, if the second voice sample has to be obtained without the suspect's knowledge, one method is to discreetly record a conversation in which the interviewer steers the conversation so that the suspect repeats as many words as possible from the recording being analyzed. Ideally, the second recording should be made using the same device as the first, or at least the same method; for example, if the first recording came from a recorded telephone line, the second sample should also be recorded through a telephone line. Once the samples have been obtained, the analyst begins with the aural analysis of the recordings. This involves comparing similarities and differences and deciding which parts of the samples are useful and which are not. The examiner looks out for key features of speech such as accent, dialect, syllable grouping and breathing patterns, along with any other peculiar speech habits such as speaking speed and mental status. Once the analyst has decided which parts of the original and comparison recordings are similar enough, spectrographic analysis is used. The speech samples are recorded on the sound spectrograph and are then analyzed in small increments. The result is a spectrogram, a visual representation of the spectrum of frequencies over time. The spectrograms of the unknown and known recordings are visually compared. Factors include: | 
| * Bandwidth | * Bandwidth | 
| * Mean frequency | * Mean frequency | 
| * Articulation | * Articulation | 
| * Acoustic patterns | * Acoustic patterns | 
| Among others. These are closely examined to determine whether any differences are due to a different pronunciation or a different speaker. With sufficient evidence, a positive identification or elimination is reached. With insufficient evidence, a probable identification or elimination is reached. If the audio is of too poor quality or contains too little information to compare, the conclusion is described as unresolved [6]. | Among others. These are closely examined to determine whether any differences are due to a different pronunciation or a different speaker. With sufficient evidence, a positive identification or elimination is reached. With insufficient evidence, a probable identification or elimination is reached. If the audio is of too poor quality or contains too little information to compare, the conclusion is described as unresolved [6][8]. | 
|  |  | 
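Two of the factors listed above, mean frequency and bandwidth, can also be expressed numerically. The text does not say how analysts measure them, so the sketch below assumes one common pair of definitions: the spectral centroid (magnitude-weighted mean frequency) and the spectral spread (magnitude-weighted standard deviation around that mean) of a single one-sided magnitude spectrum. The function names and the toy spectra are hypothetical.

```python
import math


def _bin_width_hz(n_bins, sample_rate):
    """Frequency spacing between bins of a one-sided spectrum covering 0..sample_rate/2."""
    if n_bins < 2:
        raise ValueError("need at least two frequency bins")
    return sample_rate / (2 * (n_bins - 1))


def mean_frequency(magnitudes, sample_rate):
    """Spectral centroid in Hz: the magnitude-weighted mean of the bin frequencies."""
    bin_hz = _bin_width_hz(len(magnitudes), sample_rate)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    return sum(k * bin_hz * m for k, m in enumerate(magnitudes)) / total


def bandwidth(magnitudes, sample_rate):
    """Spectral spread in Hz: weighted standard deviation around the centroid."""
    bin_hz = _bin_width_hz(len(magnitudes), sample_rate)
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    centre = mean_frequency(magnitudes, sample_rate)
    variance = sum(((k * bin_hz - centre) ** 2) * m
                   for k, m in enumerate(magnitudes)) / total
    return math.sqrt(variance)


# Toy spectra with five bins at 0, 1000, 2000, 3000 and 4000 Hz (sample rate 8000 Hz).
narrow = [0.0, 0.0, 1.0, 0.0, 0.0]
wide = [0.0, 1.0, 0.0, 1.0, 0.0]
```

Both toy spectra have the same mean frequency of 2000 Hz, but `wide` has a bandwidth of 1000 Hz while `narrow` has a bandwidth of 0 Hz, which is the kind of difference a visual comparison of two spectrograms would pick up.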
|  |  | 
|  |  | 
| [7] [[https://voicefoundation.org/health-science/voice-disorders/anatomy-physiology-of-voice-production/#:~:text=Resonance%3A%20Voice%20sound%20is%20amplified,lips)%20modify%20the%20voiced%20sound|“Voice Anatomy & Physiology.” THE VOICE FOUNDATION, 30 July 2015.]] | [7] [[https://voicefoundation.org/health-science/voice-disorders/anatomy-physiology-of-voice-production/#:~:text=Resonance%3A%20Voice%20sound%20is%20amplified,lips)%20modify%20the%20voiced%20sound|“Voice Anatomy & Physiology.” THE VOICE FOUNDATION, 30 July 2015.]] | 
|  |  | 
|  | [8] [[https://ieeexplore.ieee.org/document/8093505|M. M. Karakoç and A. Varol, "Visual and auditory analysis methods for speaker recognition in digital forensic," 2017 International Conference on Computer Science and Engineering (UBMK), 2017, pp. 1113-1116, doi: 10.1109/UBMK.2017.8093505.]] | 
|  |  | 
|  | [9] [[https://ieeexplore.ieee.org/document/8462027|M. Zhang et al., "Human and Machine Speaker Recognition Based on Short Trivial Events," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5009-5013, doi: 10.1109/ICASSP.2018.8462027.]] | 
|  |  | 
|  | ~~DISCUSSION~~ | 
|  |  |