This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
The widely known terminology gap between health professionals and health consumers hinders effective information seeking for consumers.
The aim of this study was to better understand consumers’ usage of medical concepts by evaluating the coverage of concepts and semantic types of the Unified Medical Language System (UMLS) on diabetes-related postings in 2 types of social media: blogs and social question and answer (Q&A).
We collected 2 types of social media data: (1) a total of 3711 blogs tagged with “diabetes” on Tumblr posted between February and October 2015; and (2) a total of 58,422 questions and associated answers posted between 2009 and 2014 in the diabetes category of Yahoo! Answers. We analyzed the datasets using a widely adopted biomedical text processing framework Apache cTAKES and its extension YTEX. First, we applied the named entity recognition (NER) method implemented in YTEX to identify UMLS concepts in the datasets. We then analyzed the coverage and the popularity of concepts in the UMLS source vocabularies across the 2 datasets (ie, blogs and social Q&A). Further, we conducted a concept-level comparative coverage analysis between SNOMED Clinical Terms (SNOMED CT) and Open-Access Collaborative Consumer Health Vocabulary (OAC CHV)—the top 2 UMLS source vocabularies that have the most coverage on our datasets. We also analyzed the UMLS semantic types that were frequently observed in our datasets.
We identified 2415 UMLS concepts from blog postings, 6452 UMLS concepts from social Q&A questions, and 10,378 UMLS concepts from the answers. The medical concepts identified in the blogs can be covered by 56 source vocabularies in the UMLS, while those in questions and answers can be covered by 58 source vocabularies. SNOMED CT was the dominant vocabulary in terms of coverage across all the datasets, ranging from 84.9% to 95.9%. It was followed by OAC CHV (between 73.5% and 80.0%) and Metathesaurus Names (MTH) (between 55.7% and 73.5%). All of the social media datasets shared frequent semantic types such as “Amino Acid, Peptide, or Protein,” “Body Part, Organ, or Organ Component,” and “Disease or Syndrome.”
Although the 3 social media datasets vary greatly in size, they exhibited similar conceptual coverage among UMLS source vocabularies and the identified concepts showed similar semantic type distributions. As such, concepts that are both frequently used by consumers and also found in professional vocabularies such as SNOMED CT can be suggested to OAC CHV to improve its coverage.
There is a widely known language gap between health consumers and health care professionals [
To bridge the vocabulary gap between health professionals and consumers, early researchers have collected and analyzed diverse textual data generated by consumers to identify medical terms used by consumers. Brennan and Aronson [
A controlled vocabulary is “an organized arrangement of words and phrases used to index content and/or to retrieve content through browsing or searching[
Domain coverage—the extent to which a controlled vocabulary covers the intended domain—is one of the most desired properties for a controlled vocabulary [
To keep up with continuous evolution of medical knowledge, CHV needs to be updated and maintained by incorporating new, consumer-provided terms and expressions [
Consumers, however, may also learn and use professional terms [
However, this method cannot be directly applied to CHV because CHV lacks the hierarchical relationships (eg, parent-child relationships) that are necessary to construct topological patterns [
Most previous studies concerning CHV development concentrated on the identification of new terms used by consumers [
In this study, we focus on diabetes, which is recognized as one of the most important public health problems with escalating health concerns by the World Health Organization (WHO) [
In this study, we collected diabetes-related consumer-generated blog postings from Tumblr and diabetes-related questions and answers from Yahoo! Answers. We carried out text mining to identify UMLS concepts from our datasets. Thus, we formulated the 2 research questions (RQs): (1) To what degree do the concepts from UMLS source vocabularies cover the concepts used by consumers describing their diabetes-related concerns on health postings of social media, especially blogs and social Q&A? Which concepts do or do not overlap? (2) To what degree are the UMLS semantic types applicable to analyzing the concepts used by consumers when describing their diabetes-related concerns in social media, especially blogs and social Q&A? Which semantic types are observed?
In the first research question, we evaluated the coverage of all of the 178 English source vocabularies of the UMLS in our 2 datasets from Tumblr and Yahoo! Answers. In the second research question, we analyzed the semantic types of the UMLS concepts identified in our datasets.
The current study mainly investigated the overlap between consumer concepts from social media and professional concepts in the UMLS. Indeed, consumers often proactively seek and share online health information on social media [
The UMLS, maintained by the NLM of the National Institutes of Health, is the largest biomedical terminological system. Its 2-level structure consists of the Metathesaurus and the Semantic Network. The UMLS Metathesaurus is “a large, multi-purpose, and multi-lingual thesaurus that contains millions of biomedical and health related concepts, their synonymous names, and their relationships” [
The UMLS semantic types represent “a set of broad subject categories that provide a consistent categorization of all concepts represented in the UMLS Metathesaurus” [
OAC CHV has been used in various health-related applications to improve patients’ access to health information. Zeng et al developed a translator specifically to convert texts in electronic health records to consumer-friendly text in patient health records by replacing UMLS terms to their corresponding OAC CHV terms [
Two types of social media, blogs and social Q&A, were analyzed in the current study because they allow consumers to generate and freely exchange health information in text format. Health-related blogs are one of the most popular social media venues for health information distribution. Bloggers typically describe their personal experiences with diseases along with their encounters with health care professionals [
Tumblr and Yahoo! Answers were chosen for the current study due to their popularity and the convenience of using their Application Program Interfaces (APIs), which allowed us to collect data automatically from these sites. Also, both Tumblr and Yahoo! Answers do not limit the number of words in postings. As such, their users can elaborate their health concerns and information on postings with sufficient details, thereby providing us ample opportunities to extract and analyze relevant concepts from the postings.
Tumblr is one of the fastest-growing blog sites, with a nearly 20-fold increase in the number of blogs from October 2012 to October 2015 [
Yahoo! Answers is one of the most popular social Q&A sites with approximately 5.6 million visitors per month as of February 2016 [
Once we collected text data from Tumblr and Yahoo! Answers, we mined the text data for “concepts,” a unit of understanding which represents a fundamental component of terminology [
We used a widely adopted biomedical text processing framework Apache cTAKES™ [
Conceptual framework of the study. Dots refer to concepts extracted from the dataset and gray dots refer to concepts mapped to the concepts in one of the UMLS source vocabularies.
We first analyzed the basic characteristics of the overall concept coverage across our datasets collected from Tumblr and Yahoo! Answers: (1) blog postings from Tumblr, (2) questions in Yahoo! Answers, and (3) answers in Yahoo! Answers. We then analyzed the coverage of each source vocabulary in the UMLS across the datasets. SNOMED CT and CHV are the 2 vocabularies with the highest concept coverage in our datasets. Thus, we conducted a concept coverage analysis of SNOMED CT and CHV based on our datasets. We also analyzed the semantic types of the concepts identified from our datasets.
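The per-vocabulary coverage analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the YTEX-based pipeline used in the study: the input format (a mapping from each unique CUI to the set of source vocabularies that contain it), the function name `vocabulary_coverage`, and the placeholder identifiers are all assumptions made for the example.

```python
from collections import Counter


def vocabulary_coverage(concept_sources):
    """Return {vocabulary: (n_concepts, percent)}, highest coverage first.

    `concept_sources` maps each unique concept identifier found in a dataset
    to the set of source vocabularies containing it (assumed input format).
    """
    total = len(concept_sources)
    if total == 0:
        return {}
    counts = Counter()
    for sources in concept_sources.values():
        counts.update(set(sources))  # count each vocabulary once per concept
    return {
        vocab: (n, round(100 * n / total, 1))
        for vocab, n in counts.most_common()
    }


# Toy data for illustration only: placeholder concepts and vocabularies,
# not values taken from the study's datasets.
toy_concepts = {
    "cui_1": {"VOCAB_A", "VOCAB_B", "VOCAB_C"},
    "cui_2": {"VOCAB_A", "VOCAB_B"},
    "cui_3": {"VOCAB_B"},
    "cui_4": {"VOCAB_A"},
}
coverage = vocabulary_coverage(toy_concepts)
```

Because a concept can belong to several source vocabularies at once, the percentages are computed per vocabulary against the same denominator and do not sum to 100.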
We identified 2415 UMLS concepts from blog postings, 6452 UMLS concepts from questions, and 10,378 UMLS concepts from answers.
There was a noticeable variation across the datasets. Over 80% of the documents from questions and answers contained 1 or more UMLS concepts whereas less than half of the documents from blogs did. Over half of the sentences from questions and answers contained at least 1 UMLS concept, while only 27% of those from blog posts contained at least 1 UMLS concept.
Basic characteristics of UMLS concept coverage in our datasets.
Unit | Blog postings (Tumblr): total # | Blog postings (Tumblr): # with UMLS concepts | Questions (Yahoo! Answers): total # | Questions (Yahoo! Answers): # with UMLS concepts | Answers (Yahoo! Answers): total # | Answers (Yahoo! Answers): # with UMLS concepts |
Document | 3711 | 1388 (37.4%) | 58,422 | 51,850 (88.8%) | 58,422 | 51,550 (88.2%) |
Sentence | 47,413 | 12,802 (27.0%) | 249,013 | 142,802 (57.3%) | 348,793 | 216,736 (62.1%) |
Concepts | – | 2415 | – | 6452 | – | 10,378 |
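The document-level and sentence-level shares in the table above are simple ratios of units containing at least 1 UMLS concept to all units. The sketch below shows that calculation under stated assumptions: the helper names `percent` and `share_with_concepts` and the list-of-lists input format are hypothetical, and the toy sentences use placeholder CUIs rather than study data.

```python
def percent(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1) if total else 0.0


def share_with_concepts(units):
    """Count text units (documents or sentences) with at least 1 concept.

    `units` holds one entry per unit; each entry is the list of CUIs
    recognized in that unit (assumed input format).
    """
    total = len(units)
    with_concepts = sum(1 for cuis in units if cuis)
    return total, with_concepts, percent(with_concepts, total)


# Toy units for illustration only; placeholder CUIs, not study data.
toy_sentences = [["cui_1", "cui_2"], [], ["cui_1"], []]
summary = share_with_concepts(toy_sentences)  # (4, 2, 50.0)

# The same ratio applied to the blog document counts reported in the table.
blog_documents = percent(1388, 3711)  # 37.4
```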
The concepts in the blogs were covered by 56 UMLS source vocabularies, while those in questions and answers were covered by 58 source vocabularies.
Top 20 UMLS source vocabularies by concept coverage.
Rank | Blogs (Tumblr, n=2415): source vocabulary | # of concepts | % | Questions (Yahoo! Answers, n=6452): source vocabulary | # of concepts | % | Answers (Yahoo! Answers, n=10,378): source vocabulary | # of concepts | % |
1 | SNOMED CT | 2315 | 95.9 | SNOMED CT | 5476 | 84.9 | SNOMED CT | 9032 | 87.0 | |||
2 | CHV | 1931 | 80.0 | CHV | 4928 | 76.4 | CHV | 7625 | 73.5 | |||
3 | MTH | 1774 | 73.5 | MTH | 3899 | 60.4 | MTH | 5780 | 55.7 | |||
4 | NCIt | 1156 | 47.9 | MeSH | 2957 | 45.8 | MeSH | 4796 | 46.2 | |||
5 | MeSH | 1130 | 46.8 | NCIt | 2917 | 45.2 | NCIt | 4485 | 43.2 | |||
6 | CSP | 812 | 33.6 | CSP | 1840 | 28.5 | NDFRT | 2999 | 28.9 | |||
7 | AOD | 775 | 32.1 | NDFRT | 1775 | 27.5 | CSP | 2839 | 27.4 | |||
8 | LCH_NW | 771 | 31.9 | LCH_NW | 1627 | 25.2 | LCH_NW | 2436 | 23.5 | |||
9 | LOINC | 697 | 28.9 | AOD | 1585 | 24.6 | AOD | 2335 | 22.5 | |||
10 | NDFRT | 659 | 27.3 | LOINC | 1510 | 23.4 | RXNORM | 2099 | 20.2 | |||
11 | LCH | 587 | 24.3 | RXNORM | 1421 | 22.0 | LOINC | 2081 | 20.1 | |||
12 | NCI_NCI-GLOSS | 475 | 19.7 | LCH | 1187 | 18.4 | LCH | 1730 | 16.7 | |||
13 | MEDLINEPLUS | 402 | 16.6 | NCI_NCI-GLOSS | 952 | 14.8 | NCI_FDA | 1387 | 13.4 | |||
14 | CST | 365 | 15.1 | NCI_FDA | 868 | 13.5 | DXP | 1322 | 12.7 | |||
15 | COSTAR | 362 | 15.0 | COSTAR | 835 | 12.9 | NCI_NCI-GLOSS | 1321 | 12.7 | |||
16 | NCI_FDA | 345 | 14.3 | DXP | 830 | 12.9 | COSTAR | 1257 | 12.1 | |||
17 | OMIM | 342 | 14.2 | CST | 794 | 12.3 | OMIM | 1234 | 11.9 | |||
18 | RXNORM | 338 | 14.0 | OMIM | 790 | 12.2 | CST | 1206 | 11.6 | |||
19 | DXP | 326 | 13.5 | MEDLINEPLUS | 721 | 11.2 | VANDF | 1117 | 10.8 | |||
20 | ICD9CM | 241 | 10.0 | VANDF | 644 | 10.0 | MTHSPL | 1033 | 10.0 |
Top 10 frequently observed concepts covered by both SNOMED CT and CHV.
Rank | Tumblr blogs: concept (CUI) | Freq. | Yahoo! Answers questions: concept (CUI) | Freq. | Yahoo! Answers answers: concept (CUI) | Freq. |
1 | Blood (C0005767) | 816 | Blood (C0005767) | 30,654 | Blood (C0005767) | 54,689 |
2 | Pain (C0030193) | 798 | Sugars (C0242209) | 29,593 | Sugars (C0242209) | 49,207 |
3 | Insulin (C0021641) | 744 | Insulin (C0021641) | 10,816 | Insulin (C0021641) | 27,887 |
4 | Pharmaceutical preparations (C0013227) | 719 | Glucose (C0017725) | 7394 | Glucose (C0017725) | 26,420 |
5 | Sugars (C0242209) | 699 | Problem (C0033213) | 5111 | Pharmaceutical preparations (C0013227) | 11,571 |
6 | Disease (C0012634) | 617 | Water (C0043047) | 4781 | Diseases (C0012634) | 9733 |
7 | Problem (C0033213) | 568 | Pharmaceutical preparations (C0013227) | 4456 | Carbohydrates (C0007004) | 9517 |
8 | Diabetes mellitus (C0011849) | 501 | Hematologic tests (C0018941) | 3784 | Problem (C0033213) | 9248 |
9 | Teeth structure (C0040426) | 424 | Pain (C0030193) | 3625 | Water (C0043047) | 5994 |
10 | Operative surgery procedures (C0543467) | 375 | Urine (C0042036) | 2550 | Fasting (C0015663) | 5848 |
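A ranking like the one in the table above can be produced by counting mentions per concept and keeping only the concepts whose source vocabularies satisfy a filter. The following is a minimal sketch, assuming a flat list of mentioned CUIs and a CUI-to-vocabularies mapping; the function name `top_concepts`, its parameters, and the placeholder data are hypothetical and do not reproduce the study's relational-database queries.

```python
from collections import Counter


def top_concepts(mention_cuis, concept_sources, required, excluded=(), k=10):
    """Return the k most frequent concepts that pass a vocabulary filter.

    A concept passes when it is in every `required` vocabulary and in none
    of the `excluded` ones. `mention_cuis` has one entry per mention.
    """
    required, excluded = set(required), set(excluded)
    counts = Counter(mention_cuis)
    eligible = Counter()
    for cui, n in counts.items():
        sources = set(concept_sources.get(cui, ()))
        if required <= sources and not (excluded & sources):
            eligible[cui] = n
    return eligible.most_common(k)


# Toy data for illustration only; placeholder names, not study data.
toy_mentions = ["cui_1", "cui_1", "cui_2", "cui_3", "cui_3", "cui_3"]
toy_sources = {
    "cui_1": {"VOCAB_A", "VOCAB_B"},
    "cui_2": {"VOCAB_B"},
    "cui_3": {"VOCAB_A"},
}
in_both = top_concepts(toy_mentions, toy_sources, required={"VOCAB_A", "VOCAB_B"})
only_b = top_concepts(toy_mentions, toy_sources, required={"VOCAB_B"}, excluded={"VOCAB_A"})
```

Changing `required` and `excluded` yields the "covered by both", "covered by one but not the other" variants with the same function.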
Top 10 frequently observed concepts covered by CHV but not SNOMED CT.
Rank | Tumblr blogs: concept (CUI)^a | Freq. | Yahoo! Answers questions: concept (CUI) | Freq. | Yahoo! Answers answers: concept (CUI) | Freq. |
1 | Cider vinegar (C0937941) | 54 | Stomach (C0038351) | 1050 | Lantus (C0876064) | 689 | |
2 | Apple cider vinegar (C1178459) | 54 | Lantus (C0876064) | 571 | Actos (C0875954) | 659 | |
3 | Lantus (C0876064) | 15 | Humalog (C0528249) | 260 | Avandia (C0875967) | 628 | |
4 | Gentle (C0720654) | 11 | NovoLog (C0939412) | 180 | HumaLog (C0528249) | 289 | |
5 | Corrective (C0719519) | 9 | Glucophage (C0591573) | 131 | NovoLog (C0939412) | 255 | |
6 | Botox (C0700702) | 9 | Levemir (C1314782) | 122 | Levemir (C1314782) | 184 | |
7 | RID (C0073361) | 6 | Actos (C0875954) | 95 | Glucophage (C0591573) | 161 | |
8 | HumaLog (C0528249) | 5 | Seroquel (C0287163) | 78 | Novolin (C0028467) | 112 | |
9 | Bead Dosage Form (C0991566) | 3 | Synthroid (C0728762) | 62 | Viagra (C0663448) | 105 | |
10 | Actos (C0875954) | 3 | Coumadin (C0699129) | 54 | Triphosphat (C0146894) | 77 |
^a CUI: concept unique identifier.
There was substantial overlap between the concepts covered by the top 2 source vocabularies, SNOMED CT and CHV: 78.2% (1889/2415) of the concepts in blog postings, 70.0% (4518/6452) of the concepts in questions, and 68.4% (7095/10,378) of the concepts in answers were covered by both.
A small share of concepts was covered by CHV but not by SNOMED CT: 1.7% (40/2415) of the concepts in blog postings, 6.3% (409/6452) of the concepts in questions, and 5.1% (529/10,378) of the concepts in answers.
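The pairwise comparison between SNOMED CT and CHV amounts to partitioning the identified concepts by membership in the 2 vocabularies. The sketch below illustrates that partition; it is an assumption-laden example rather than the authors' implementation. The vocabulary labels `SNOMEDCT_US`, `CHV`, and `MSH` are used here as assumed source abbreviations, and the function names and placeholder CUIs are hypothetical.

```python
def split_by_coverage(concept_sources, vocab_a="SNOMEDCT_US", vocab_b="CHV"):
    """Partition concepts into both / only_a / only_b / neither."""
    groups = {"both": set(), "only_a": set(), "only_b": set(), "neither": set()}
    for cui, sources in concept_sources.items():
        in_a, in_b = vocab_a in sources, vocab_b in sources
        if in_a and in_b:
            key = "both"
        elif in_a:
            key = "only_a"
        elif in_b:
            key = "only_b"
        else:
            key = "neither"
        groups[key].add(cui)
    return groups


def group_shares(groups):
    """Percentage of all concepts that falls into each group."""
    total = sum(len(cuis) for cuis in groups.values())
    return {
        key: round(100 * len(cuis) / total, 1) if total else 0.0
        for key, cuis in groups.items()
    }


# Toy data for illustration only; placeholder CUIs, not study data.
toy_sources = {
    "cui_1": {"SNOMEDCT_US", "CHV"},
    "cui_2": {"SNOMEDCT_US", "CHV", "MSH"},
    "cui_3": {"CHV"},
    "cui_4": {"SNOMEDCT_US"},
    "cui_5": {"MSH"},
}
groups = split_by_coverage(toy_sources)
shares = group_shares(groups)
```

The 4 groups are mutually exclusive, so their shares sum to 100% of the unique concepts in a dataset, up to rounding.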
All the concepts in
Some concepts were covered by SNOMED CT but not by CHV: 17.6% (424/2415) of the concepts in blog postings, 14.8% (957/6452) of the concepts in questions, and 18.7% (1936/10,378) of the concepts in answers (See
Top 10 frequently observed concepts covered by SNOMED CT but not CHV.
Rank | Tumblr blogs: concept (CUI)^a | Freq. | Yahoo! Answers questions: concept (CUI) | Freq. | Yahoo! Answers answers: concept (CUI) | Freq. |
1 | Entire skin (C1278993) | 524 | Symptoms (C1457887) | 7690 | Symptoms (C1457887) | 12,727 |
2 | Symptoms (C1457887) | 393 | Fatty acid glycerol esters (C0015677) | 1789 | Fatty acid glycerol esters (C0015677) | 8727 |
3 | Back structure, excluding neck (C1995000) | 236 | Entire foot (C1281587) | 1647 | Entire cells (C1269647) | 6435 |
4 | Massage (C0024875) | 217 | Back structure, excluding neck (C1995000) | 1589 | Entire heart (C1281570) | 3204 |
5 | Fatty acid glycerol esters (C0015677) | 210 | Entire kidney (C1278978) | 1368 | Entire pancreas (C1278931) | 3003 |
6 | Training (C0220931) | 163 | Entire eye (C1280202) | 1210 | Entire skin (C1278993) | 2614 |
7 | Entire pancreas (C1278931) | 157 | Protective cup (C1533124) | 1159 | Protective cup (C1533124) | 2178 |
8 | Entire heart (C1281570) | 156 | Entire lower limb (C1269079) | 985 | Entire stomach (C1278920) | 1876 |
9 | Entire oral cavity (C1278910) | 138 | Entire hands (C1281583) | 969 | Injection procedure (C1533685) | 1561 |
10 | Entire spine (C1280065) | 137 | Entire skin (C1278993) | 912 | Entire bony skeleton (C1266909) | 1501 |
^a CUI: concept unique identifier.
Of the 127 UMLS semantic types (STYs), fewer than half were identified in each of our datasets: 52 STYs (40.9%) in the blog postings, 59 STYs (46.5%) in the questions, and 54 STYs (42.5%) in the answers. In general, the STYs overlapped substantially across our datasets, with 52 shared STYs. Seven STYs, however, were identified in the questions only: “Functional Concept,” “Intellectual Product,” “Laboratory Procedure,” “Organ or Tissue Function,” “Organism Attribute,” “Social Behavior,” and “Substance.” Two STYs, “Fully Formed Anatomical Structure” and “Cell or Molecular Dysfunction,” were found in both the answer and blog datasets but not in the questions.
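The comparison of semantic types across datasets reduces to set intersection and set difference. The following sketch shows one way to compute the shared types and the types exclusive to each dataset; the function name `compare_type_sets` and the input format are assumptions, and the toy sets contain only a handful of the semantic types named in the text, not the full lists.

```python
def compare_type_sets(types_by_dataset):
    """Return (shared, exclusive) semantic types across datasets.

    `shared` holds the types present in every dataset; `exclusive[name]`
    holds the types found in that dataset and in no other.
    """
    sets = {name: set(types) for name, types in types_by_dataset.items()}
    shared = set.intersection(*sets.values()) if sets else set()
    exclusive = {}
    for name, types in sets.items():
        others = [t for other, t in sets.items() if other != name]
        exclusive[name] = types - set().union(*others)
    return shared, exclusive


# Small illustrative subsets only; the study's full type lists are larger.
toy_types = {
    "blogs": {"Finding", "Disease or Syndrome", "Cell or Molecular Dysfunction"},
    "questions": {"Finding", "Disease or Syndrome", "Laboratory Procedure"},
    "answers": {"Finding", "Disease or Syndrome", "Cell or Molecular Dysfunction"},
}
shared, exclusive = compare_type_sets(toy_types)
```

A type such as “Cell or Molecular Dysfunction” in this toy input is neither shared by all datasets nor exclusive to one, which is why the 2 results are reported separately.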
When comparing the top 10 frequently observed STYs across the datasets, 9 out of 10 STYs (ie, “Finding,” “Pharmacologic Substance,” “Therapeutic or Preventive Procedure,” “Disease or Syndrome,” “Organic Chemical,” “Body Part, Organ, or Organ Component,” “Sign or Symptom,” “Medical Device,” and “Amino Acid, Peptide, or Protein”) commonly appeared across the datasets with minor differences in terms of frequency. “Laboratory Procedure” appeared frequently in questions but not in blogs and answers. “Pathologic Function” appeared frequently in answers but not in blogs and questions. Example concepts of the frequently observed STYs showed that laypeople tend to frequently use common concepts to describe their diabetes-related issues in social media. To illustrate,
Top 10 frequently observed semantic types of the identified concepts.
Rank | Tumblr blogs: semantic type | Concepts, n (%)^a | Freq. | Yahoo! Answers questions: semantic type | Concepts, n (%)^a | Freq. | Yahoo! Answers answers: semantic type | Concepts, n (%)^a | Freq. |
1 | Finding | 380 | 5277 | Pharmacologic substance | 1240 (19.2) | 53,976 | Pharmacologic substance | 1995 | 185,880 |
2 | Pharmacologic substance | 307 (12.7) | 4413 | Organic chemical | 1006 | 41,255 | Organic chemical | 1692 | 123,509 |
3 | Therapeutic or preventive procedure | 241 | 3184 | Finding | 895 | 30,458 | Disease or syndrome | 1511 | 57,379 |
4 | Disease or syndrome | 239 | 2923 | Disease or syndrome | 743 (11.5) | 28,041 | Finding | 1302 | 76,765 |
5 | Organic chemical | 225 | 2737 | Body part, organ, or organ component | 484 | 27,172 | Body part, organ, or organ component | 666 | 48,584 |
6 | Body part, organ, or organ component | 208 | 2566 | Sign or symptom | 338 | 19,601 | Therapeutic or preventive procedure | 583 | 16,555 |
7 | Sign or symptom | 145 (6.0) | 2214 | Therapeutic or preventive procedure | 331 | 16,372 | Amino acid, peptide, or protein | 495 | 40,521 |
8 | Medical device | 134 | 1319 | Amino acid, peptide, or protein | 305 | 13,178 | Sign or symptom | 436 | 38,905 |
9 | Amino acid, peptide, or protein | 70 | 1112 | Medical device | 201 | 12,862 | Medical device | 347 | 20,391 |
10 | Biologically active substance | 69 | 1093 | Laboratory procedure | 180 | 10,580 | Pathologic function | 292 | 12,551 |
^a The percentage was calculated based on the total number of unique identified UMLS concepts: blogs in Tumblr, n=2415; questions in Yahoo! Answers, n=6452; answers in Yahoo! Answers, n=10,378.
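The table above reports, for each semantic type, the number of unique concepts, their share of all unique concepts, and the total mention frequency, and its rows appear to be ordered by the number of unique concepts. A minimal sketch of that aggregation follows. It assumes one (CUI, semantic type) pair per mention, which is a simplification; the function name `rank_semantic_types`, the tie-breaking rule, and the placeholder data are hypothetical.

```python
from collections import defaultdict


def rank_semantic_types(mentions, top_k=10):
    """Rank semantic types by unique concept count.

    `mentions` yields (cui, semantic_type) pairs, one per occurrence.
    Returns rows of (semantic_type, n_concepts, percent, frequency), where
    percent is relative to all unique concepts in the dataset.
    """
    concepts = defaultdict(set)
    frequency = defaultdict(int)
    for cui, sty in mentions:
        concepts[sty].add(cui)
        frequency[sty] += 1
    total_unique = len({cui for cuis in concepts.values() for cui in cuis})
    # Sort by unique concepts, then frequency, then name for stable output.
    order = sorted(concepts, key=lambda s: (-len(concepts[s]), -frequency[s], s))
    return [
        (s, len(concepts[s]), round(100 * len(concepts[s]) / total_unique, 1), frequency[s])
        for s in order[:top_k]
    ]


# Toy mentions for illustration only; placeholder CUIs, not study data.
toy_mentions = [
    ("cui_1", "Finding"),
    ("cui_1", "Finding"),
    ("cui_2", "Finding"),
    ("cui_3", "Pharmacologic Substance"),
]
ranking = rank_semantic_types(toy_mentions)
```

Keeping the unique-concept count and the mention frequency as separate columns matters because a type with few distinct concepts can still account for many mentions.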
Previous studies [
UMLS concept usage differed between blogs and social Q&A: UMLS concepts appeared in almost 90% of the questions and answers but in only 37.4% of the blog postings. Social Q&A users mainly discuss health-related issues (in the current study, diabetes-related issues) in their postings because their participation in question asking and answering is purpose-driven. Blog users, on the other hand, often elaborate on non-health-related topics in their postings even though they tagged those postings with “diabetes.”
In spite of the differences of the overall UMLS concept coverage between blogs and social Q&A, we found that the UMLS concepts identified in different datasets can be covered by a similar number of UMLS source vocabularies. Two UMLS source vocabularies, ie, SNOMED CT and CHV, showed the best coverage. Social media users in our datasets may have advanced medical knowledge because they often use professional terms. CHV demonstrated the second largest coverage for all the datasets despite the fact that CHV has a much smaller number of concepts and terms than SNOMED CT (1:6 ratio). CHV was developed to incorporate consumer expressions presented in consumer-generated text data. Our findings showed that different social media platforms may play a similar role as consumer-generated documents for CHV enrichment, which confirmed the literature [
A comparison of the concept coverage of SNOMED CT and CHV in our datasets led us to examine how concept usage differs between blog users and social Q&A users. For example,
According to our analysis, the percentage of unique concepts covered by CHV but not by SNOMED CT varied from 1.7% to 6.3%. In the blog dataset, where 3711 blogs were analyzed, only 40 concepts (1.7%) were covered by CHV but not by SNOMED CT. In Yahoo! Answers, by contrast, 409 concepts (6.3%) in questions and 529 concepts (5.1%) in answers were covered by CHV but not by SNOMED CT. These results suggest that larger datasets yield more lay concepts. The size of the dataset also appeared to affect the diversity of semantic types. The same set of 9 semantic types was observed frequently in all our datasets. “Finding,” “Pharmacologic Substance,” and “Disease or Syndrome” were among the top 4 most observed semantic types.
Differences were observed as well. Blogs might be better platforms for consumers to discuss organic chemicals, pharmacologic substances, or therapeutic or preventive procedures for diabetes. Yet, concepts of organic chemicals and pharmacologic substances were also frequently used in social Q&A. In social Q&A data, 7 semantic types that were not identified in blogs were observed, indicating that larger datasets may yield more diverse medical concepts.
This study has a few limitations. First, the blog data in Tumblr and Yahoo! Answers data were collected in different time frames and are different in size, which might have affected the findings of this study. Smaller volumes of blog data used in this study may affect the diversity of the UMLS concepts identified. Even though blogging and question posting/answering are dynamic online activities for those living with chronic diseases, Tumblr and Yahoo! Answers may not represent all the health information users’ concept usage. The datasets could be expanded to include other types of social media such as diabetes-related discussion boards. The users of these online sources may be biased towards those with greater technological proficiency, such as those who are younger, more educated or those in a higher socioeconomic status who are more likely to seek health information on the Internet. This study may not reflect the experiences of those who are older adults, less educated or underprivileged [
The current study examined the potential of social media as user-generated documents in which consumers’ medical concepts can be observed and leveraged for controlled vocabulary development for ordinary health information users. We selected and tested 2 social media venues, namely blogs and social Q&A. Our findings showed more similarities than differences in controlled vocabulary usage. The size of a dataset may affect the number of concepts identified. However, the similarities in the source vocabularies, frequently used concepts, and semantic types of the concepts indicate that social media sites tend to reflect the common sense of laypeople. More importantly, we found that social media users employ not only consumer concepts in CHV but also concepts in professional vocabularies such as SNOMED CT. This indicates that CHV still has room for improvement through the incorporation of concepts from other UMLS source vocabularies. The focus of our study is not to identify a list of consumer medical concepts, but to test the feasibility of leveraging social media data to identify consumer concepts covered by existing UMLS source vocabularies. Ultimately, this would assist consumers’ health information searches online, narrowing the disparity between ordinary health information users and medical professionals. In future studies, we will employ automated approaches to identify and recommend new medical terms and concepts from social media to enrich CHV.
Table A1. Full names of the UMLS source vocabularies in
API: Application Program Interface
AUI: atom unique identifier
CSP: Computer Retrieval of Information on Scientific Projects Thesaurus
CUI: concept unique identifier
IC: information content
LCH_NW: Library of Congress Subject Headings, Northwestern University subset
LOINC: Logical Observation Identifier Names and Codes
MeSH: Medical Subject Headings
NCIt: National Cancer Institute Thesaurus
NDFRT: National Drug File – Reference Terminology
NER: named entity recognition
NLM: National Library of Medicine
NLP: natural language processing
OAC CHV: Open-Access Collaborative Consumer Health Vocabulary
POS: Part-Of-Speech
Q&A: questions and answers
SNOMED CT: SNOMED Clinical Terms
STY: semantic type
UIMA: Unstructured Information Management Architecture
UMLS: Unified Medical Language System
WSD: word sense disambiguation
We would like to thank Dr Warren Allen for providing the computing resources for this work. This work was partially supported by an Amazon Web Services Education and Research Grant Award (PI: He). The work was also partially supported by National Center for Advancing Translational Sciences under the Clinical and Translational Science Award UL1TR001427 (PI: Nelson & Shenkman). The content is solely the responsibility of the authors and does not represent the official view of the National Institutes of Health.
MP initiated the idea of this study. ZH led the conceptualization, design, and implementation of this study. MP collected and provided the blog data from Tumblr.com. SO collected and provided the social Q&A data from Yahoo! Answers. ZC performed the natural language processing on the datasets and structured the results in a relational database. MP performed the data analysis and drafted the initial version; ZH, SO, BJ extensively revised the draft critically and iteratively for important intellectual content. All authors contributed to the methodology development, results interpretation, edited the paper significantly, and gave final approval for the version to be published. ZH takes primary responsibility for the research reported here.
None declared.