This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
According to a World Health Organization report in 2017, almost one in every 20 people in China had depression. However, depression is usually difficult to detect clinically owing to slow observation, high cost, and patient resistance. Meanwhile, with the rapid growth of social networking sites, people frequently share their daily lives and disclose their inner feelings online, making it possible to identify mental conditions effectively from this rich textual information. Many achievements have been reported for English web-based corpora, but research in China on extracting language features from web-related depression signals is still at a relatively early stage.
The purpose of this study was to propose an effective approach for constructing a depression-domain lexicon. This lexicon will contain language features that could help identify social media users who potentially have depression. Our study also compared the performance of detection with and without our lexicon.
We autoconstructed a depression-domain lexicon using Word2Vec, a semantic relationship graph, and the label propagation algorithm. Combined, these methods performed well on a specific corpus during construction. The lexicon was obtained from 111,052 Weibo microblogs posted by 1868 users who were depressed or nondepressed. During depression detection, we considered six features and used five classification methods to test detection performance.
The experiment results showed that, in terms of F1 value, our autoconstruction method performed 1% to 6% better than the baseline approaches and was more effective and more stable. When applied to detection models such as logistic regression and support vector machine, our lexicon improved their performance by 2% to 9% and increased the final accuracy of potential depression detection.
Our depression-domain lexicon was proven to be a meaningful input for classification algorithms, providing linguistic insights on the depressive status of test subjects. We believe that this lexicon will enhance early depression detection in people on social media. Future work will need to be carried out on a larger corpus and with more complex methods.
Depression, one of the major causes of suicide in recent years, is a severe mental disorder characterized by a persistent low mood in the affected person. It is expected to be the largest contributor to disease burden worldwide by 2030, especially in China, where lifestyles are often high pressure. According to a World Health Organization (WHO) report in 2017 [
Diagnosis of potential depression at an early stage can give those affected more opportunities to receive appropriate treatment and overcome the disease. However, owing to limited mental health knowledge, a lack of regular counseling, and the fact that mental disorders, unlike physical diseases, often involve no physical pain, many patients with depression do not recognize their condition. Even those who know a little about depression are often reluctant to seek professional help because of a sense of shame [
The traditional clinical diagnosis of depression mainly relies on standardized assessments, which are highly accurate but have limitations in detection efficiency [
Things are changing with the development of social media. Nowadays, many methods combining machine learning algorithms and text mining techniques have been developed to diagnose potential depression in an early stage [
In practice, when working with textual depression data, word-based features such as frequency and embeddings are commonly used, and a domain lexicon can be valuable for understanding the author of a text [
In this paper, based on a well-labeled depression data set from Weibo, one of the largest Chinese user-generated content platforms, we constructed a depression-domain lexicon containing more than 2000 words. This lexicon can be used to assist in the early diagnosis of depression. We crawled more than 144,000 microblog tweets from nearly 2000 users over a time span of 16 months to obtain depressed and nondepressed data sets. Manual screening was performed to remove “fake” depression microblogs from the data sets, as clarified in the “Data Preprocessing” subsection. We extracted 80 words as seeds, built a semantic association graph from the similarities between the seeds and candidate words, and utilized the label propagation algorithm (LPA) to automatically label new words in the graph. The LPA is well suited to such a construction task, as explained further in the “Related Work” subsection. We then tested the effectiveness of this method and compared it with several baseline approaches. We found that this autoconstruction of a depression-domain lexicon performed best and remained the most stable when parameters changed. For further research, this lexicon was used as an input to machine learning algorithms, providing insights into the depressive status of test subjects and improving detection accuracy. In our experiments, detection models with lexicon features outperformed models without them by 2% to 9% in terms of evaluation scores.
The main contributions are as follows: (1) We extracted a set of depressive words and constructed a Chinese depression-domain lexicon, contributing to web-related depression signal detection and assisting in identifying users at risk of depression at an early stage. We applied an efficient semisupervised automatic construction method in the depression domain, and the lexicon proved meaningful in several detection classification models in our study; (2) We constructed a benchmark depression data set based on microblogs (part of the data was used to construct the lexicon [our main research objective] and the rest was used in the detection test), which could assist in further depression detection, diagnosis, and analysis. Meanwhile, we released the data set and lexicon together [
For decades, there have been many ways to detect depression. Beck [
Since the beginning of the 21st century, new scales have been continuously improved. The Diagnostic and Statistical Manual of Mental Disorders (DSM-IV) is a standard classification of mental disorders used by mental health professionals, which was improved by Hu [
Overall, traditional methods of depression detection have been highly validated and accepted in the real world for decades. However, they mainly rely on scale or questionnaire scores, face-to-face interviews, and self-reports, and they often require considerable labor and time [
In recent years, with abundant data available on social media, some researchers have attempted to detect depression by leveraging web-based data. Park et al [
Word-based features can be captured in a lexicon. Tsugawa et al [
Many previous studies have made substantial progress on depression lexicons, which can greatly help in diagnosis; however, most of these lexicons are in English. It is not appropriate to use a translated English lexicon to detect depression in a Chinese corpus because of cultural differences. In addition, PMI (based mainly on co-occurrence frequency) and W2V (word embedding) techniques alone cannot keep up with today's semantic developments. Detecting depression on social media with a lexicon is clearly feasible, and more effort is needed to construct a better Chinese depression-domain lexicon.
Many methods have been used to efficiently construct a domain lexicon. Das et al [
The LPA, which was first proposed by Zhu and Ghahramaniy [
In order to build a depression-domain lexicon for further detection via social media, we constructed two data sets of users with depression and without depression based on data from Weibo microblogs, which is very popular in China. Weibo has 462 million monthly active users according to a report in 2018 [
Given that depression is a long-lasting illness, user text should not be collected from only one microblog. Thus, our data sets contained all Weibo microblogs published by the same users within a year. In addition, personal profile information such as comments, number of follows, and number of followers was also included.
Based on Weibo microblogs from January 2017 to April 2018, we used the keywords “I’m diagnosed with depression” [
Details of the collected data sets from Weibo microblogs.
Data set | Users | Total posts | Mean | Standard deviation | Skewness | Kurtosis | Time span |
Depressed data set | 965 | 58,265 | 60.374 | 31.327 | −0.451 | 1.788 | January 2017-April 2018 (16 months) |
Nondepressed data set | 903 | 52,787 | 63.697 | 30.086 | −0.615 | 2.066 | January 2017-April 2018 (16 months) |
If a user never posted any text containing a depression-related word such as “depress,” the user was labeled as nondepressed. In this way, we constructed a nondepressed data set
Before the experiment, we found that our data sets contained some unrelated microblogs, irregular words, and emoji. Such noisy text can affect the accuracy of our model, so we adopted the following preprocessing procedures: (1) removing unrelated microblogs; (2) normalizing irregular words; and (3) removing emoji.
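As a minimal illustration of such cleaning (not the study's exact rules), the following Python sketch strips URLs, @-mentions, topic tags, and bracketed emoji codes from a microblog; every regular expression here is an assumption.

```python
import re

def clean_weibo_text(text: str) -> str:
    """Illustrative Weibo preprocessing; the patterns are assumptions, not the study's rules."""
    text = re.sub(r"https?://\S+", "", text)            # strip URLs
    text = re.sub(r"@[\w\u4e00-\u9fa5]+", "", text)     # strip @-mentions
    text = re.sub(r"#[^#]*#", "", text)                 # strip Weibo-style #topic# tags
    text = re.sub(r"\[[\w\u4e00-\u9fa5]+\]", "", text)  # strip emoji codes such as [泪]
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

print(clean_weibo_text("今天好累[泪] @小明 http://t.cn/xyz #抑郁#"))  # -> "今天好累"
```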
Domain adaptability is always a difficult problem in natural language processing, and a domain-based lexicon allows deeper and more accurate analysis. For example, “excitement,” “life,” and “forever” are common words in daily life, but they can be abnormal signals from a patient with depression. Thus, in this study, we sought to determine which words used on the internet indicate depression and which do not.
There are many ways to construct a domain-based lexicon according to a survey [
Inspired by Hamilton [
An illustration of the framework. DT: decision tree; LR: logistic regression; NB: naive Bayes; RF: random forest; SVM: support vector machine; TF-IDF: term frequency-inverse document frequency.
Seed words are those that can be representative of a specific domain. In order to extract the key seed words in the depressed and nondepressed data sets, we leveraged the TF-IDF algorithm, which is a widely used feature extraction algorithm in natural language processing. Salton and Yu [
TF and IDF can be formulated as follows:

$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}, \qquad \mathrm{IDF}_{i} = \log\frac{|D|}{|\{j : t_i \in d_j\}|}, \qquad \text{TF-IDF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IDF}_{i}$$

where $n_{i,j}$ is the number of occurrences of word $t_i$ in document $d_j$, $|D|$ is the total number of documents, and $|\{j : t_i \in d_j\}|$ is the number of documents that contain $t_i$.

Intuitively, the TF-IDF calculation shows how important and distinctive a given word is in our depression domain. Words with a higher TF-IDF value are more representative of the corpus and were better candidates for seed words.
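As a concrete sketch of this step (the sample documents, variable names, and the assumption of pre-segmented, space-separated text are ours, not the study's implementation), scikit-learn's TfidfVectorizer can rank candidate seed words:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# docs: each user's concatenated, pre-segmented (space-separated) microblogs
# from the depressed data set D1. The two toy documents are illustrative.
docs = ["自己 真的 好累 失眠 绝望", "希望 永远 消失 痛苦 难过 失眠"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\S+")  # keep each Chinese token intact
tfidf = vectorizer.fit_transform(docs)

# Rank words by mean TF-IDF weight and keep the top of the list for manual screening.
scores = np.asarray(tfidf.mean(axis=0)).ravel()
vocab = np.array(vectorizer.get_feature_names_out())
top_words = vocab[np.argsort(scores)[::-1][:2000]]      # the study screened a top-2000 list
print(top_words[:10])
```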
Now that we had the seeds
In this paper, cosine similarity was used to calculate the similarity between words. When a word's similarity with the seed words in the training corpus was greater than the given threshold, we extracted it as a new word and added it as a candidate word to the candidate word set
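A minimal gensim sketch of this extraction follows; the embedding file name, the topn cutoff, and the 0.5 threshold are illustrative assumptions rather than the study's settings.

```python
from gensim.models import KeyedVectors

# Pretrained vectors (placeholder file name; see the "Word Embedding" subsection).
wv = KeyedVectors.load_word2vec_format("weibo_word2vec.bin", binary=True)

def extend_candidates(seeds, threshold=0.5, topn=200):
    """Collect words whose cosine similarity to any seed exceeds the threshold."""
    candidates = set()
    for seed in seeds:
        if seed not in wv:
            continue
        # most_similar returns (word, cosine similarity) pairs in descending order.
        for word, sim in wv.most_similar(seed, topn=topn):
            if sim > threshold and word not in seeds:
                candidates.add(word)
    return candidates

candidates = extend_candidates(["抑郁", "失眠", "绝望"])  # sample depressive seeds
```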
The LPA is a common semisupervised approach on a graph [
The LPA builds a graph based on the similarity between nodes, which in our study are words. After the graph is initialized, its nodes can be divided into labeled nodes and unknown nodes. The basic idea of the LPA is to predict the labels of unknown nodes based on information from labeled ones, with labels propagated mainly through the weights on the edges between nodes. During label propagation, unknown nodes update their own labels using the information of adjacent known labels; the greater the similarity of an adjacent node, the greater the influence of its label.
In our algorithm, the seeds
Assuming that there are
If there are 10 nodes in the graph, in which
The label of each unknown candidate word is obtained by iteratively applying the transition matrix to the initial labels of the words. The calculation method is as follows:

$$Y^{(t+1)} = T\,Y^{(t)}$$

where $Y^{(t)}$ is the vector of label values at iteration $t$, and $T$ is the transition matrix obtained by row-normalizing the edge weights, $T_{ij} = w_{ij} / \sum_{k} w_{ik}$, so that each node absorbs label information from its neighbors in proportion to their similarity.
In each iteration, the labels of the seeds remain fixed. When the labels of all words in the graph no longer change after continuous iteration, the process terminates. At the end of the iteration process, the final candidate words are those whose absolute label probability is greater than a certain threshold. In this way, we obtained a well-labeled domain lexicon. The algorithm described above can be summarized as the steps in
A simple structure of a semantic graph. i: seed word; j: candidate word.
1) Initialize the lexicon and candidate words.
2) Preprocess the corpus and learn the word embedding with Word2Vec.
3) For every seed,
For a word
4) After obtaining all the extended candidate words
5) In the whole graph,
6) Reset the labels of the seeds in
7) Repeat steps 5) and 6) until the labels of
8) Obtain the final
9) Combine
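A compact numpy sketch of the propagation loop in steps 5) to 8) is shown below; it assumes a precomputed cosine-similarity matrix W over the seed and candidate words, and the variable names and the 0.5 cutoff are illustrative.

```python
import numpy as np

def label_propagation(W, y_seed, seed_idx, tol=1e-6, max_iter=1000):
    """Propagate seed labels (+1 depressive, -1 nondepressive) over a similarity graph.

    W        : (n, n) cosine-similarity matrix over seeds and candidates
               (assumed nonnegative, with every node connected to at least one other)
    y_seed   : (n,) initial labels; 0 for unknown candidate words
    seed_idx : indices of seed words, whose labels are clamped each iteration
    """
    T = W / W.sum(axis=1, keepdims=True)     # row-normalized transition matrix
    y = y_seed.astype(float).copy()
    for _ in range(max_iter):
        y_new = T @ y                        # step 5): propagate along weighted edges
        y_new[seed_idx] = y_seed[seed_idx]   # step 6): reset the seed labels
        if np.abs(y_new - y).max() < tol:    # step 7): stop when labels no longer change
            break
        y = y_new
    return y

# Step 8): keep candidates whose absolute label value exceeds a threshold, e.g.
# y = label_propagation(W, y_seed, seed_idx); lexicon_mask = np.abs(y) > 0.5
```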
We employed our data set to construct the depression-domain lexicon. We needed microblogs from both users with depression and users without depression to extract domain seed words and to complete the automatic construction with the LPA. Our original data crawled from Weibo contained some noise, especially in
After our lexicon was automatically built, we labeled it depressed or nondepressed for further evaluation. Three volunteers, who had carefully read the depressed microblogs and research articles, were invited to perform the lexicon labeling job [
Chinese word segmentation has a great influence on lexicon construction, especially for Weibo microblogs and the depression domain. In order to segment Chinese words in Weibo text as accurately as possible, we used the following three steps: (1) domain dictionary; (2) large word embedding; and (3) incorrect word removal.
When dealing with mental illness, especially depression, on the internet, some depression-domain words such as paroxetine (“帕罗西汀”), a common antidepressant, and self-rating scale (“自评量表”), a tool for individuals to measure depression, were difficult to recognize. Other words such as MLGB (“马勒戈壁”), which means damn it, and Yali (“鸭梨”), which means pressure, were internet slang that could confuse the computer. Domain-specific segmentation should combine a domain dictionary [
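The study does not name its segmentation tool, so as an assumption the sketch below uses jieba, a widely used Chinese segmenter, and registers the domain and slang words above so they are not split apart:

```python
import jieba

# Register depression-domain terms and internet slang as indivisible words.
for word in ["帕罗西汀", "自评量表", "马勒戈壁", "鸭梨"]:
    jieba.add_word(word)

print(list(jieba.cut("医生给我开了帕罗西汀，还让我做自评量表")))
# Without add_word, "帕罗西汀" is likely to be split into meaningless fragments.
```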
A richer corpus yields more precise word embeddings. Instead of relying only on our collected data, which were relatively sparse, we leveraged the W2V models by Shen et al [
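A short sketch of this step follows; the file name is a placeholder for the pretrained embeddings, and the coverage check is our own illustration of why a larger external corpus helps.

```python
from gensim.models import KeyedVectors

# Placeholder file name for the large pretrained Chinese word vectors.
wv = KeyedVectors.load_word2vec_format("chinese_weibo_vectors.txt", binary=False)

# Check how much of our segmented vocabulary the pretrained model covers.
corpus_vocab = {"抑郁", "失眠", "帕罗西汀", "鸭梨"}
coverage = sum(w in wv for w in corpus_vocab) / len(corpus_vocab)
print(f"embedding coverage: {coverage:.0%}")
```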
Finally, we removed incorrect words from our lexicon. After evaluation, we found that the error rate was only 2% to 3%: among the 2385 words in our depression-domain lexicon, there were 64 errors.
During our experiments, we constructed the depression-domain lexicon with an automatic method, compared our method with some baseline approaches, and analyzed key parameters like number of seeds and threshold in the model.
For the evaluation metrics, we employed precision, recall, and F1 measure (F1), shown in equations (8), (9), and (10), respectively, to evaluate the performance of our model and the baseline approaches. We used the area under the curve (AUC) to evaluate the model on unbalanced data. We also compared the number of words in the lexicon under different circumstances. The equations are as follows:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (8)$$

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (9)$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (10)$$

where $TP$, $FP$, and $FN$ denote the numbers of true-positive, false-positive, and false-negative words, respectively.
Before construction, we used TF-IDF to extract the seed words and obtained a list of the top 2000 words. Samples of the TF-IDF values of
By manually screening this list, we obtained some seed words. Moreover, we added a few general sentiment words with high intensity to our seed words and finally obtained a set of 40 depressive seeds and 40 nondepressive seeds. Parameter sensitivity analysis showed that 80 seeds in total would lead to a sufficiently large lexicon with high accuracy. Samples of the 80 seeds are shown in
TF-IDF values of depressed D1 samples.
Depressed word | TF-IDFa value |
Myself (自己) | 0.041383 |
Really (真的) | 0.032475 |
Depression (抑郁症) | 0.024328 |
Hope (希望) | 0.013336 |
Life (生活) | 0.012043 |
Forever (永远) | 0.006965 |
Pain (痛苦) | 0.006871 |
Sad (难过) | 0.006756 |
Live (活着) | 0.006583 |
Mood (心情) | 0.006386 |
Night (晚上) | 0.006347 |
Always (总是) | 0.005984 |
Hate (讨厌) | 0.005475 |
Exhausted (好累) | 0.005469 |
Fear (害怕) | 0.005030 |
Lonely (孤独) | 0.004413 |
Idiot (傻逼) | 0.004380 |
Emotion (感情) | 0.004031 |
Insomnia (失眠) | 0.003950 |
Sorry (对不起) | 0.003867 |
Despair (绝望) | 0.003410 |
Antidepressant (抗抑郁药) | 0.002305 |
aTF-IDF: term frequency-inverse document frequency.
Summary of the seeds.
Category | Seeds |
Nondepressive (40 words) | Stability, comfort, happy, happiness, successful, confidence, sunshine, struggle, positive, brave, enjoy, peace, enthusiasm, healthy, satisfied, active, grow up, pride, good, admire, strong, perfect, praise, precious, progress, congratulate, love, welcome, kindness, robust, earnest, agree, support, award, advantage, good deal, develop, warm, bright colored, and understand |
Depressive (40 words) | Depression, collapse, stress, suicide, apastia, anxious, sad, tired, death, lonely, insomnia, bad, desperate, give up, low, leave, fear, danger, close, sensitive, lost, shadow, destroy, suspect, crash, dark, helpless, guilt, negative, frustration, nervous, melancholy, rubbish, jump, forget, goodbye, cut wrist, edge, haze, and antidepressant |
In order to verify the effectiveness of the lexicon autoconstruction method applied in this paper, we selected the following methods as baseline approaches: (1)
To obtain a fair comparison, we set the same parameters for all methods where
From
Performance of lexicon construction methods.
Construction method | Precision | Recall | F1 | Size of the lexicon |
W2V-LPAa | 0.880 | 0.906 | 0.893 | 2321 |
W2Vb | 0.878 | 0.903 | 0.890 | 2321 |
SO-PMIc | 0.879 | 0.877 | 0.877 | 2024 |
SO-W2Vd | 0.854 | 0.877 | 0.862 | 2321 |
aW2V-LPA: Word2Vec-label propagation algorithm.
bW2V: Word2Vec.
cSO-PMI: semantic orientation pointwise mutual information.
dSO-W2V: semantic orientation Word2Vec.
F1 of methods when the seed size changed. LPA: label propagation algorithm; SO-PMI: semantic orientation pointwise mutual information; SO-W2V: semantic orientation from Word2Vec; W2V: Word2Vec.
Throughout our experiment, the size of seeds
First, we fixed
We then fixed the size of seeds at 80 with varying
Overall, our W2V-LPA method performed smoothly and stably even when the parameters changed, so we believe that a high-quality lexicon can be constructed with it. It is difficult to find an optimal solution, and given
Size of the lexicon when the size of seeds and threshold for candidate words changed.
Performance of the W2V-LPA method when S and the threshold for candidate words changed.
Sa | Thresholdb | Precision | Recall | F1 | Size of the lexicon |
60 | 0.5 | 0.882 | 0.911 | 0.896 | 1694 |
60 | 0.55 | 0.910 | 0.935 | 0.922 | 788 |
60 | 0.6 | 0.926 | 0.944 | 0.935 | 446 |
60 | 0.65 | 0.951 | 0.963 | 0.954 | 275 |
60 | 0.7 | 0.804 | 0.897 | 0.848 | 89 |
80 | 0.5 | 0.880 | 0.906 | 0.893 | 2321 |
80 | 0.55 | 0.916 | 0.937 | 0.926 | 1072 |
80 | 0.6 | 0.934 | 0.948 | 0.941 | 558 |
80 | 0.65 | 0.954 | 0.963 | 0.958 | 320 |
80 | 0.7 | 0.918 | 0.909 | 0.892 | 113 |
100 | 0.5 | 0.874 | 0.899 | 0.886 | 3070 |
100 | 0.55 | 0.906 | 0.924 | 0.915 | 1589 |
100 | 0.6 | 0.927 | 0.937 | 0.931 | 792 |
100 | 0.65 | 0.953 | 0.959 | 0.955 | 418 |
100 | 0.7 | 0.937 | 0.932 | 0.925 | 144 |
120 | 0.5 | 0.855 | 0.879 | 0.866 | 3696 |
120 | 0.55 | 0.889 | 0.904 | 0.896 | 1942 |
120 | 0.6 | 0.924 | 0.933 | 0.928 | 894 |
120 | 0.65 | 0.952 | 0.958 | 0.954 | 454 |
120 | 0.7 | 0.944 | 0.940 | 0.934 | 170 |
aS: size of seeds.
bThreshold for extracting candidate words.
After constructing the depression-domain lexicon, we applied it to actual depression detection on a new Weibo microblog data set to determine whether our work would help existing detection models perform better. The detection process included data collection, feature selection, and classification methods.
In addition to our data set used for lexicon construction, we collected 745 users who were depressed and 10,118 users who were not depressed with their 1-year tweets as a new data set. Data details are shown in
Details of the data set for depression detection.
Data set | Users | Total posts | Mean | Standard deviation | Skewness | Kurtosis | Time span |
Depressed data set | 745 | 179,600 | 240.44 | 486.28 | 6.21 | 56.32 | January 2018-June 2019 (18 months) |
Nondepressed data set | 10,118 | 3,150,000 | 310.93 | 327.72 | 3.50 | 48.52 | January 2018-June 2019 (18 months) |
Features like topic-level keywords, posting behaviors, number of tweets, first-person words, and linguistic style are meaningful in detecting depression on the internet [
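As an illustration of how the lexicon can enter the feature set (the function and feature names here are ours, and this is only one of the several features considered), a per-user lexicon hit rate can be computed as follows:

```python
def lexicon_features(posts, dep_lexicon, nondep_lexicon):
    """Compute normalized lexicon hit rates for one user.

    posts: list of segmented posts (each a list of tokens) by the user.
    Returns (depressive hit rate, nondepressive hit rate) as two features.
    """
    tokens = [t for post in posts for t in post]
    n = max(len(tokens), 1)                              # avoid division by zero
    dep_rate = sum(t in dep_lexicon for t in tokens) / n
    nondep_rate = sum(t in nondep_lexicon for t in tokens) / n
    return dep_rate, nondep_rate
```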
We chose naive Bayes (NB), decision tree, logistic regression (LR), random forest, and support vector machine [
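A minimal scikit-learn sketch of this comparison is shown below; the feature matrix X and labels y (1 = depressed) are assumed to have been built from features such as the lexicon hit rates above, and all hyperparameters are defaults rather than the study's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X, y: assumed per-user feature matrix and labels (1 = depressed, 0 = nondepressed).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, classification_report(y_test, model.predict(X_test), digits=2))
```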
The model was based on a data set with 50% depressed and 50% nondepressed users. When we varied the proportion of depressed users, the data set became imbalanced, and the AUC became more important for testing performance.
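Continuing the sketch above, AUC on an imbalanced split can be computed from predicted probabilities (again with the assumed X_test and y_test):

```python
from sklearn.metrics import roc_auc_score

# Score the logistic regression model by AUC using class-1 probabilities.
proba = models["LR"].predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, proba))
```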
In the real world, people with depression make up less than 10% of the population, and we will investigate how to properly detect depression with imbalanced data in a future study.
Detection model performance with the depression-domain lexicon.
Detection model | Precision | Recall | F1 | Accuracy |
NBa | 67% | 67% | 67% | 67% |
Lb-NB | 74% | 73% | 73% | 73% |
LRc | 76% | 76% | 75% | 76% |
L-LR | 77% | 77% | 77% | 77% |
RFd | 68% | 68% | 68% | 68% |
L-RF | 77% | 77% | 76% | 77% |
SVMe | 65% | 65% | 65% | 65% |
L-SVM | 74% | 72% | 72% | 72% |
DTf | 67% | 67% | 67% | 67% |
L-DT | 69% | 69% | 69% | 69% |
aNB: naive Bayes.
bL: depression-domain lexicon as a feature.
cLR: logistic regression.
dRF: random forest.
eSVM: support vector machine.
fDT: decision tree.
Scales of users who were depressed. AUC: area under the curve.
Diagnosing users with potential depression via social media has attracted increasing attention because it is more cost-effective and proactive than traditional diagnosis and can draw on massive amounts of valuable data. In previous studies, most lexicon-related achievements involved English corpora. Instead of translating an English lexicon, this paper applied an automatic construction method for a Chinese depression-domain lexicon based on the LPA. With Word2Vec and a semantic relationship graph, the LPA was used to predict the labels of candidate words in the graph, and finally, our lexicon was constructed. Experiment results showed that our method was superior to baseline construction methods, with good performance and robustness. In addition, when our lexicon was included as an input to the detection models, they became more accurate and effective compared with models without the depression-domain lexicon.
In the next step, experiments will be carried out on a larger depression corpus, and more linguistic knowledge, such as conjunctions, will be incorporated into our method to enlarge the coverage of the depression-domain lexicon. Meanwhile, more complex construction methods such as deep neural networks and hierarchical topic models will be adopted in further research. We expect that our lexicon will serve as a useful feature in depression detection and will provide more insights for advanced depression diagnosis.
decision tree
label propagation algorithm
logistic regression
naive Bayes
random forest
semantic orientation pointwise mutual information
semantic orientation from Word2Vec
support vector machine
term frequency-inverse document frequency
Word2Vec
GL led the method application, experiment conduction, and result analysis. LH and SH participated in data extraction, preprocessing, and manuscript revision. BL provided theoretical guidance and revised the paper. This work was supported by the National Social Science Fund Project, China (No. 16BTQ065) “Multi-source intelligence fusion research on emergencies in big data environment” and the Foundation for Disciplinary Development of the School of Information and Technology in the University of International Business and Economics.
None declared.