Original Paper
Abstract
Background: Speech recognition is a technology that enables machines to understand human language.
Objective: In this study, speech recognition of isolated words from a small vocabulary was applied to the field of mental health counseling.
Methods: A software platform was used to establish a human-machine chat for psychological counselling. The software uses voice recognition technology to decode the user's voice information. The software system analyzes and processes the user's voice information according to many internal related databases, and then gives the user accurate feedback. For users who need psychological treatment, the system provides them with psychological education.
Results: The speech recognition system included features such as speech extraction, endpoint detection, feature value extraction, training data, and speech recognition.
Conclusions: The Hidden Markov Model was adopted, based on multithread programming under a VC2005 compilation environment, to realize the parallel operation of the algorithm and improve the efficiency of speech recognition. After the design was completed, simulation debugging was performed in the laboratory. The experimental results showed that the designed program met the basic requirements of a speech recognition system.
doi:10.2196/18677
Keywords
Introduction
Constraints on speech recognition such as small vocabularies, specific speakers, and isolated words need to be relaxed. At the same time, there are many new problems that must be solved. First, expanding the vocabulary makes it difficult to select and build templates. Second, in continuous speech, there is no obvious boundary between each phoneme, syllable, and word, and there is a phenomenon of coordinated pronunciation that is strongly influenced by the context of each pronunciation unit. Third, different people say the same words with different acoustic characteristics. Even when the same person speaks the same content multiple times, their physiological and psychological states may differ and cause notable differences in their speech. Fourth, there is often background noise or other interference accompanying speech. Therefore, the original template matching method is no longer applicable.
There have been further breakthroughs in using speech recognition technology for various applications for smartphones. This study focused on mental health issues and investigated the interaction between smartphone software and users' mental health based on speech recognition technology. This study involves basic application research on the use of intelligent software design and speech recognition technology in the context of mental health.
Methods
Programming of a Speech Recognition System Based on VC2005 Isolated Words
In this study, C language programming was used to implement data feature extraction based on the Markov model. It was then used to programmatically realize speech recognition for specific speech instances, as well as write speech recognition functions into functions that can be called by other modules. Additionally, it was used to implement a speech recognition system foundation, and to cultivate and improve the ability of the system to consult the literature and comprehensively use new knowledge [
].Speech recognition is essentially a pattern recognition process, one by which an unknown speech pattern is compared with known reference patterns of speech, and the best-matched reference pattern is the recognition result.
is a block diagram of an automatic speech recognition system based on the pattern matching principle [ ].Composition of an Isolated Word Speech Recognition System
The reference pattern is based on the template word unit shown in
. The main technical items of the isolated word speech recognition system are shown in .Technical items | Details |
Vocabulary | Vocabulary fixed or variable, content (numbers, commands, place names, etc), acoustic similarity |
Speaker | Specific speakers, nonspecific speakers |
Generative method | Isolated vocalization, continuous vocalization |
Analysis | Frequency domain analysis, cestrum domain analysis, linear prediction analysis |
Mode change | Fixed or variable length, feature extraction, speech segmentation, factor recognition |
Model approach | Multiple reference pattern matching method, statistical decision method, word formation recognition method |
Standard mode | Standard template (multiple), word dictionary, probability distribution, generation rules |
Input method | Phone-microphone (near microphone) |
Vocal environment | Signal to noise ratio >30 decibels (dB) |
Surroundings | Quiet office, spacious office, inside a moving car |
Level | 40-50 dB, 60-70 dB, 65-75 dB |
Speech Recognition Design Process
Sample Voice Collection
The standard Chinese numerals 0-9 were spoken and recorded indoors as a sample. The recording software used Microsoft Visual C++ Windows Media Player (Microsoft), with a sampling rate of 16 kHz and sampling bits of 16 bits. The voice data is stored in the .wav file format, and its audio format is Windows PCM (pulse-code modulation) [
].Speech Signal Preprocessing
There were several elements involved in speech signal preprocessing. First, to digitize voice signals, data was extracted from the speech signal by sampling and quantizing. During data extraction, it is extremely important to master the storage form of the voice file, and to effectively extract and ascertain the meaning of each part of the data to improve the analysis of the data, and lay the groundwork for the next step.
Second, the high-frequency portion of the signal spectrum was enhanced and flattened, in order to facilitate channel parameter analysis or spectral analysis. Pre-emphasis of the speech signal is done by using the mean power spectrum and muzzle glottal excitation radiation effects; the high end at about 6 dB/octave is above 800 Hz, ie, 6 dB/octave (2 octaves) or 20 dB/decade (10 octaves). When seeking a voice signal spectrum, the higher the frequency, the smaller the corresponding component. For this reason, pre-emphasis is performed as part of preprocessing. The purpose of pre-emphasis is to flatten the signal spectrum, and hold the entire band from low to high frequency. The signal to noise ratio requirements can use the same spectrum or spectral analysis to analyze channel parameters. Pre-emphasis generally uses a first-order digital filter of the formula μ: H(Z) = 1 – μz-1, where μ has a value close to 1, or formula y(n) = x(n) – αx(n-1), where x(n) is the original signal sequence, y(n) is the pre-emphasis sequence, and α is the pre-emphasis coefficient [
].Third, preprocessing included endpoint detection and framed windowing. Breakpoint detection is mainly used to extract the effective part of the data. The threshold value is 0.3 (maximum value-minimum value). The speech signal is a typical nonstationary signal. In processing, a window function is generally used to intercept one segment for analysis. Part of the extracted signal is short-term stable. Another effect of windowing is to eliminate the Gibbs effect caused by the truncation of infinite sequences. Common window functions [
] are as follows:Both the Hamming window and the Haning window belong to the generalized raised cosine function. By analyzing their frequency response amplitude characteristics, it can be found that the rectangular window has good spectral smoothing performance, but the side lobe is too high, which may cause spectrum leakage and loss of high-frequency components. The Haning window decays too quickly and the low-pass characteristics are not smooth; the Hamming window is widely used because of its smooth low-pass characteristics and because it has the lowest side lobe height [
].Mel Frequency Cepstral Coefficient Feature Representation
The training process of Mel Frequency Cepstral Coefficient (MFCC) parameters and Pearson Linear Correlation Coefficient (PLCC) parameters was extracted, that is, state transition matrix A, mixed Gaussian distribution weight matrix C, mean vector µ and covariance matrix U. A maximum likelihood estimation was performed.
MFCC Extraction
The human ear has different perception capabilities for speech at different frequencies; this is a nonlinear relationship. Combining the physiological structure of the human ear and using the logarithmic relationship to simulate the human ear's perception of speech at different frequencies, Davies and Merenstein proposed the concept of Mel frequency in 1980 [
]. The meaning of 1 Mel is 1/1000 of the tone perception degree of 1000 Hz. The conversion relationship between Hz frequency fHz and Mel frequency fMel is as follows:The MFCC is proposed based on the above Mel frequency concept, and its computer flow is shown in
.First, the original voice signal is pre-emphasized, and a frame of voice signal is obtained after frame-by-frame windowing. Second, the fast Fourier transform (FFT) is performed on a frame of speech signal to obtain the discrete power spectrum X (k) of the signal. Third, triangle filter center frequency f(m) and frequency response H (k) are calculated as follows:
In Equation 5, fl and fh are low-pass frequency filter bank coverage and high-pass frequency, respectively; F is the sampling frequency with the unit Hz; M is the number of filter bank filters; N represents the points that are FFT; B–1 is the inverse function of Equation 6.
B–1(b) = 700(eb/127 – 1) (6)
Fourth, each filter produces an output spectral energy, after taking the number of coefficients so as to obtain the following set [
]:A discrete cosine transform is used to convert S(m) to the time domain. The calculation process of the MFCC c(i) is as follows:
The curve and filter bank distribution corresponding to the MFCC's Hz-Mel scale are shown in
.HMM Pattern Matching
HMM pattern matching is a double random process evolved from Markov chains. An HMM with IV states is usually represented by λ = (A,B,π). The meaning of these parameters is explained as follows: N is the number of states of the model. An input observation sequence O = o1,o2,…oT can only be in {S} at a certain moment, which is one of the N states of {S1,S2,…,SN). A = {aij} is the state transition probability matrix defined by the following equation: aij = P(qt+1 = Sj | qi = Si), 1 ≤ i, j ≤ N. It is an implicit Markov chain. The probability of each transition from state Si to state Sj is only related to state Si, and is not related to its previous state. The matrix elements must satisfy the following equation:
π = {π1,π2,…,πN} is the initial probability distribution of each state, which represents the probability value that the observation sequence O = o1,o2,…oT may be in each state of the model at ==========t = 1, that is, pi = P(q1 = Si), i = 1,2,…,N, and it satisfies the following equation:
B is the output probability of any observation oi in the input speech feature sequence O = o1,o2,…oT in each state. It has two types: discrete and continuous. For the discrete HMM, B is a probability matrix B = {bj(k)}, j = 1,2,…,M; where bj(k) = P(ok | qt = Sj), and M is the total number of symbols in the coded symbol set and satisfies the following condition:
For the continuous HMM, B = {bj(o)}, 1 ≤ j ≤ N and cji; among these, Jo is any feature vector K in the speech feature parameters, M is the number of Gaussian elements contained in each state, L is the weight of the jth state and the lth mixed Gaussian function, N is the normal Gaussian probability density function, mji represents the mean vector of the l mixed Gaussian element in the j state, and Uji represents the covariance matrix of the l mixed Gaussian element in the j state, and it satisfies the following condition:
Results
Depending on different parameters of the HMM, it has different classification methods. One type of classification is to divide the HMM into two structures, ergodic and left to right, according to the transition probability matrix A = {aij}. The HMM experienced by each state is that any state in the model can reach all other states through a finite step; from left to right, the HMM increases with time, and the state serial number is nondecreasing. This model is divided into spanning and no spanning. The HMMs of various states are mostly used for speaker recognition, language recognition, etc. The content of speech has a strong correlation with timing. This timing can be expressed by the state relationship, so speech recognition must use the left to right HMM structure. This study is based on isolated word speech recognition, and it is not allowed to skip a certain part of the middle of a speech fragment, so the HMM structure of left to right without crossing must be used. Its state transition probability matrix A = {aij} must satisfy aij = 0, j ≠ i and j ≠ i + 1 [
].Another classification method is to divide HMMs into continuous, discrete, and semicontinuous based on different output probabilities B. The output probability B of each state of the discrete HMM is a discrete probability matrix, and the vector of the feature parameter of the speech signal must be vector quantized before use. The output probability B of the continuous HMM is a continuous output probability density function. It has three forms: single, mixed, and differentiated Gaussian probability density function. The semicontinuous HMM is a method that combines discrete HMM and continuous HMM. This paper uses a continuous HMM.
The following problems are to be solved by the isolated word speech recognition system based on the HMM: First, how to determine an optimal state transition sequence q = (q1,q2,…,qT), and calculate the output probability P(O|λ) of the observation sequence O = o1,o2,…oT to the HMM, and judge the recognition result of the voice command based on this probability. Second, how to adjust the parameters that λ = (A,B,π) to maximize the output probability P(O|λ). This is a problem of parameter training of the HMM. In the process of solving the above two problems, the output probability needs to be calculated, which is another key problem that needs to be solved by this algorithm [
].Discussion
Small Vocabulary Speech Recognition System Applied in the Field of Mental Health
Speech Recognition System and Acquisition Method
For different speech types that need to be recognized [
], the system collected data in different ways. For mobile phone software, the intelligent degree of speech recognition is completely dependent on the preset scheme. The same speaker's speech may get completely different results due to different collection methods preset by the recognition system. Therefore, for users with special voice types, the mobile phone software adopts multiple (1-3) collection methods to reduce errors.Speech Processing System
The speech processing system mainly analyzes and processes speech to achieve the purposes of transmission, automatic recognition, and machine understanding. The analysis and processing are implemented based on the filtering, sampling, and Fourier transform algorithms; the mobile phone software runs the experimental results. The speech processing system also processes voice signals such as echo, user's voice disturbance, and voice noise to manage some typical voice transmission problems.
Establishment of Related Databases
A psychological database that contains psychological cases and current user psychological data was established. It establishes a relationship between all data in the database and uses the data dictionary to expand the function of the table to make the database design simpler. The database also needs to regularly update relevant information to better enable the software platform to provide users with mental health information. The user steps are as follows: (1) After opening the mobile phone software, the system prompts the user to fill in relevant information such as gender and age (personal information). (2) The voice chat system will conduct a human-machine voice chat, with humorous and interesting content occasionally mixed with some questions. (3) After the chat is over, the user is notified that there is a waiting time. The software system analyzes the voice chat data and further analyzes the experimental results. (4) The user is notified of the analysis result, and the software performs the first operation on the user if they have identified psychological problems. (5) The software then establishes a specific personal psychological treatment plan for users with mental disorders.
Application of Analysis Software
The experimental data showed that our mobile phone mental health software meets the requirements for accuracy, practicability, and simplicity. The software was able to realize specific operations on related data by programming, to obtain the most reliable parameters and achieve an accurate probability of the user's voice information, thereby inferring any psychological changes. The program was able to make a scientific, professional, and safe analysis of users' mental health with different personality characteristics. Using this software is convenient for users.
Conclusion
In response to the special requirements of speech recognition, the design of this software system is based on digital signal processing and uses a fast Fourier transform. Overall, the design requirements were met. However, due to time and knowledge limitations, there are still existing problems with the design, such as the incomplete treatment of environmental noise effects. There is room for improvement in this software system. This article introduces this research and factual issues such as the application of mobile phone mental health software. The software platform is quantified and modularized using user needs. It analyzes and processes specific experimental data, emphasizing that mental health software in a mobile phone is convenient.
Acknowledgments
The author would like to thank Dr Yingmei Xu of the faculty from Psychological Counseling and Treatment Center of Peking University Hospital for her help.
Conflicts of Interest
None declared.
References
- Boussaid L, Hassine M. Arabic isolated word recognition system using hybrid feature extraction techniques and neural network. Int J Speech Technol 2017 Nov 23;21(1):29-37. [CrossRef]
- Bennettlevy J, Martin E, Bridgman H, Carey T, Isaacs AN, Little F. Mental health academics in rural and remote Australia. Rural and remote health 2016;16(3):3793-3793.
- Wu S. A Traffic Motion Object Extraction Algorithm. Int J Bifurcation Chaos 2016 Jan 14;25(14):1540039. [CrossRef]
- Fujii Y, Fujii K, Yoon J, Sugahara H, Kitano N, Okura T. The Effects Of Low-intensity Exercise On Depressive Symptoms In Socially-isolated Older Adults. Medicine & Science in Sports & Exercise 2016;48:1052. [CrossRef]
- Elovainio M, Hakulinen C, Pulkki-Råback L, Virtanen M, Josefsson K, Jokela M, et al. Contribution of risk factors to excess mortality in isolated and lonely individuals: an analysis of data from the UK Biobank cohort study. The Lancet Public Health 2017 Jun;2(6):e260-e266. [CrossRef]
- Wu S, Wang M, Zou Y. Research on internet information mining based on agent algorithm. Future Generation Computer Systems 2018 Sep;86:598-602. [CrossRef]
- Davis SB, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions On Acoustics, Speech and Signal Processing 1980:357-366.
- Banbury A, Nancarrow S, Dart J, Gray L, Parkinson L. Telehealth Interventions Delivering Home-based Support Group Videoconferencing: Systematic Review. J Med Internet Res 2018 Feb 02;20(2):e25. [CrossRef]
- R.P K. Cognitive deterioration following intracranial haemorrhage: Predominantly dependent on cognitive health before the event. Nederlands tijdschrift voor geneeskunde 2016;160(21):9697-9697.
- Wu S, Wang M, Zou Y. Bidirectional cognitive computing method supported by cloud technology. Cognitive Systems Research 2018 Dec;52:615-621. [CrossRef]
Abbreviations
dB: decibel |
FFT: fast Fourier transform |
HMM: Hidden Markov Model |
MFCC: Mel Frequency Cepstral Coefficient |
PLCC: Pearson Linear Correlation Coefficient |
Edited by K Kalemaki; submitted 12.03.20; peer-reviewed by J Firas, X Liu, Y Cao; comments to author 19.03.20; revised version received 21.03.20; accepted 21.03.20; published 03.06.20
Copyright©Weifeng Fu. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 03.06.2020.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.