racfor_wiki:fdd:identifikacija_govornika (revision 2022/06/06 16:21 by vmuzevic [Speech Analysis]; current revision 2024/12/05 12:24)
==== Automated FSR ====
With advances in technology, new techniques have been developed and more will follow, including artificial intelligence, machine learning algorithms, deep learning and NLP. One project recognizes Sepedi home-language speakers using four classifier models: Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Multilayer Perceptrons (MLP) and Random Forest (RF) [1]. In another article [2], speaker recognition is performed by a deep learning model built on a convolutional neural network (CNN). That model is text-independent, meaning it does not take the meaning of the text into account; a text-dependent model would be much more complex. The model works with spectrograms extracted from speech. Deep learning models are also capable of outperforming human analysts at recognizing speakers from short, so-called "trivial events", such as sneezes, coughs, "hmm" sounds and so on [9]. Datasets for training such models exist, such as this one on [[https://www.kaggle.com/datasets/kongaevans/speaker-recognition-dataset|Kaggle]], which features 1500 samples from five prominent world leaders, as well as background noise that can be mixed into the training.
[[https://www.robots.ox.ac.uk/~vgg/data/voxceleb/|VoxCeleb]] operates at a much larger scale with 7000 speakers, but the dataset cannot be downloaded from the site at the time of writing and it is unclear whether it will ever be available again. Currently there are 203 public [[https://github.com/topics/speaker-recognition|GitHub]] repositories tagged with the "speaker-recognition" topic.
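The four classifier families named above can be compared in a few lines of scikit-learn. The sketch below is illustrative only: it uses synthetic, well-separated two-speaker feature vectors (a stand-in for real acoustic features such as MFCC means), not the Sepedi dataset or the cited models' actual configurations.

```python
# Toy comparison of SVM, KNN, MLP and Random Forest on synthetic
# two-speaker feature vectors (illustrative stand-in data only).
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend each speaker yields 13-dimensional feature vectors; the two
# clusters are deliberately well separated so every model should do well.
speaker_a = rng.normal(loc=0.0, scale=1.0, size=(100, 13))
speaker_b = rng.normal(loc=5.0, scale=1.0, size=(100, 13))
X = np.vstack([speaker_a, speaker_b])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "MLP": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)
    print(f"{name}: {scores[name]:.2f}")
```

On real speech data the four models would of course differ far more, which is exactly the comparison the cited project performs.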
==== Tools for FSR ====
Spectrograms are a powerful tool for speaker recognition. One application for analyzing spectrograms from audio files is Audacity, a free, open-source audio editing and recording program. It provides all the features described above in the audio enhancement method, along with the spectrogram view used for visual identification, which is explained below. Professional analysts may prefer paid software such as [[https://www.adobe.com/products/audition.html|Adobe Audition]], which claims to provide the "industry's best audio cleanup, restoration,..." along with other features, or [[https://www.phonexia.com/#|Phonexia]], which claims to provide a "speaker recognition solution designed explicitly for forensic experts".
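Under the hood, a spectrogram view like Audacity's is a short-time Fourier transform: the signal is cut into overlapping windowed frames and each frame's magnitude spectrum becomes one column of the image. A minimal NumPy-only sketch (the window and hop sizes here are arbitrary illustrative choices, not Audacity's defaults):

```python
# Minimal STFT spectrogram in plain NumPy, approximating what a
# spectrogram view displays (parameter choices are illustrative only).
import numpy as np

def spectrogram(signal, sample_rate, n_fft=512, hop=256):
    """Return (times, freqs, log-magnitude matrix) of a 1-D signal."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrum per frame
    log_spec = 20 * np.log10(spec + 1e-10)          # dB scale, as viewers show it
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    times = (np.arange(n_frames) * hop + n_fft / 2) / sample_rate
    return times, freqs, log_spec

# Usage: a 440 Hz tone should show a ridge near 440 Hz in every frame.
sr = 8000
t = np.arange(sr) / sr                              # one second of audio
tone = np.sin(2 * np.pi * 440 * t)
times, freqs, log_spec = spectrogram(tone, sr)
peak_bin = log_spec.mean(axis=0).argmax()
print(f"peak frequency ~ {freqs[peak_bin]:.0f} Hz")
```

For speech, the examiner would look at the positions of such ridges (formants) rather than a single tone, but the computation is the same.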
==== Methods of FSR ====
In practice, aural and spectrographic methods are used in forensic speech recognition. The analyst is typically sent multiple recordings that need to be identified, along with multiple recordings of the suspect repeating the same content.
However, when the second voice sample must be obtained without the suspect's knowledge, one method is to discreetly record a conversation in which the interviewer steers the discussion so that the suspect repeats as many words as possible from the recording under analysis. Ideally, the second recording should be made with the same device as the first, or at least by the same method; for example, if the first recording came from a recorded telephone line, the second sample should also be recorded through a telephone line. Once the samples have been obtained, the analyst begins with the aural analysis of the recordings. This involves comparing similarities and differences and deciding which parts of the samples are useful and which are not. The examiner listens for key features of speech such as accent, dialect, syllable grouping and breathing patterns, along with any other distinctive speech habits such as speaking speed and indications of mental state. Once the analyst has decided which parts of the original and comparison recordings are similar enough, spectrographic analysis is applied. The speech samples are recorded on a sound spectrograph and are then analyzed in small increments. The result is a spectrogram: a visual representation of the spectrum of frequencies over time. The spectrograms of the unknown and known recordings are visually compared. Factors include:
* Bandwidth
* Mean frequency
* Articulation
* Acoustic patterns
Among others. These are closely examined to determine whether differences are due to different pronunciation or to a different speaker. Given sufficient evidence, a positive identification or elimination is reached. With insufficient evidence, a probable identification or elimination is reached. If the audio is of too poor quality or contains too little comparable material, the conclusion is described as unresolved [6][8].
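Two of the listed factors have standard signal-processing counterparts: "mean frequency" corresponds to the spectral centroid, and "bandwidth" can be taken as the magnitude-weighted spread around it. The sketch below uses one common convention for these quantities, not a prescribed forensic standard:

```python
# Illustrative computation of "mean frequency" (spectral centroid) and
# "bandwidth" (weighted standard deviation around the centroid).
# This is one common convention, not a forensic standard.
import numpy as np

def spectral_centroid_bandwidth(signal, sample_rate):
    mag = np.abs(np.fft.rfft(signal))                   # magnitude spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    weights = mag / mag.sum()                           # normalize to weights
    centroid = (freqs * weights).sum()                  # mean frequency
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * weights).sum())
    return centroid, bandwidth

sr = 8000
t = np.arange(sr) / sr
# A pure 300 Hz tone: the centroid sits at 300 Hz with near-zero bandwidth.
tone = np.sin(2 * np.pi * 300 * t)
c, b = spectral_centroid_bandwidth(tone, sr)
print(f"centroid={c:.1f} Hz, bandwidth={b:.1f} Hz")
```

Comparing such measurements between the known and unknown recordings gives the examiner a numeric complement to the visual spectrogram comparison.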
[7] [[https://voicefoundation.org/health-science/voice-disorders/anatomy-physiology-of-voice-production/#:~:text=Resonance%3A%20Voice%20sound%20is%20amplified,lips)%20modify%20the%20voiced%20sound|"Voice Anatomy & Physiology." The Voice Foundation, 30 July 2015.]]
[8] [[https://ieeexplore.ieee.org/document/8093505|M. M. Karakoç and A. Varol, "Visual and auditory analysis methods for speaker recognition in digital forensic," 2017 International Conference on Computer Science and Engineering (UBMK), 2017, pp. 1113-1116, doi: 10.1109/UBMK.2017.8093505.]]
[9] [[https://ieeexplore.ieee.org/document/8462027|M. Zhang et al., "Human and Machine Speaker Recognition Based on Short Trivial Events," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5009-5013, doi: 10.1109/ICASSP.2018.8462027.]]
~~DISCUSSION~~
| |