This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
Social media platforms (SMPs) are frequently used by various pharmaceutical companies, public health agencies, and nongovernment organizations (NGOs) for communicating health concerns, new advancements, and potential outbreaks. Although the benefits of using them as a tool have been extensively discussed, the online activity of various health care organizations on SMPs during COVID-19 in terms of engagement and sentiment forecasting has not been thoroughly investigated.
The purpose of this research is to analyze the nature of information shared on Twitter, understand the public engagement generated on it, and forecast the sentiment score for various organizations.
Data were collected from the Twitter handles of 5 pharmaceutical companies, 10 US and Canadian public health agencies, and the World Health Organization (WHO) from January 1, 2017, to December 31, 2021. A total of 181,469 tweets were divided into 2 phases for the analysis, before COVID-19 and during COVID-19, based on the confirmation of the first COVID-19 community transmission case in North America on February 26, 2020. We conducted content analysis to generate health-related topics using natural language processing (NLP)-based topic-modeling techniques, analyzed public engagement on Twitter, and performed sentiment forecasting using 16 univariate moving-average and machine learning (ML) models to understand the correlation between public opinion and tweet contents.
Topics were modeled from the tweets authored by the selected health care organizations using nonnegative matrix factorization (NMF), which achieved the best coherence scores (cumass=–3.6530 and –3.7944 before and during COVID-19, respectively). The topics were chronic diseases, health research, community health care, medical trials, COVID-19, vaccination, nutrition and well-being, and mental health. In terms of user impact, WHO (user impact=4171.24) had the highest impact overall, followed by public health agencies, the Centers for Disease Control and Prevention (CDC; user impact=2895.87), and the National Institutes of Health (NIH; user impact=891.06). Among pharmaceutical companies, Pfizer’s user impact was the highest at 97.79. Furthermore, for sentiment forecasting, autoregressive integrated moving average (ARIMA) and seasonal autoregressive integrated moving average with exogenous factors (SARIMAX) models performed best on the majority of the data subsets (divided by health care organization and period), with the mean absolute error (MAE) between 0.027 and 0.084, the mean square error (MSE) between 0.001 and 0.011, and the root-mean-square error (RMSE) between 0.031 and 0.105.
Our findings indicate that people engage more with topics such as COVID-19 than with medical trials or customer experience. In addition, there are notable differences in user engagement levels across organizations. Global organizations, such as WHO, show wide variations in engagement levels over time. The sentiment forecasting method discussed here presents a way for organizations to structure their future content to ensure maximum user engagement.
Social media platforms (SMPs), such as Twitter, Facebook, and Reddit, are commonly used by people to access health information. In the United States, 8 in 10 internet users access health information online, and 74% of these use SMPs. Meanwhile, public health agencies and pharmaceutical companies often use social media to engage with the public [
The positive impacts of using SMPs by patients and HCPs have been previously discussed [
There are different topic-clustering and content analysis techniques available to identify the characteristics of stakeholders (eg, pharmaceutical companies’ tweets for drug information) on SMPs [
Prior research work has also focused on the response of G7 leaders during COVID-19 on Twitter [
COVID-19 led to a rapid change in public sentiments over a short span of time [
Although a tweet’s engagement and sentiment can only be measured once it has been posted, forecasting offers a way to estimate sentiment beforehand. Time series–based strategies, such as autoregressive integrated moving average (ARIMA) and vector autoregression (VAR), have been used for forecasting emotions from SMPs [
ML and natural language processing (NLP) algorithms have been recently used in various instances; for example, Bayesian ridge and ridge regression models were used for emotion prediction and health care analysis on large-scale data sets [
The implications of social media communication by HCPs have been extensively discussed [
The remainder of the paper is structured as follows: First, a preliminary analysis of topic modeling using the best-performing clustering algorithm is presented in the Methods section, followed by sentiment and engagement analysis using CardiffNLP’s
The data for this study (181,469 tweets) were gathered from the accounts of major US and Canadian health care organizations, pharmaceutical companies, and the World Health Organization (WHO) using the Twitter Academic API for Research v2 [
The complete timeline was divided into 2 phases for analysis,
Distribution of tweets for the selected user accounts of 3 types of organizations.
Name of organization (Twitter handle) | Before COVID-19, n (%) | During COVID-19, n (%) | Total tweets, N

Public health agencies
Centers for Disease Control and Prevention (CDCgov) | 8435 (58.6) | 5963 (41.4) | 14,398
Centers for Disease Control and Prevention (CDC_eHealth) | 1376 (86.3) | 219 (13.7) | 1594
Government of Canada for Indigenous (GCIndigenous) | 3505 (54.0) | 2989 (46.0) | 6494
Health Canada and PHAC (GovCanHealth) | 7878 (17.2) | 37,907 (82.8) | 45,785
US Department of Health & Human Services (HHSGov) | 7890 (56.9) | 5969 (43.1) | 13,859
Indian Health Service (IHSgov) | 1090 (44.7) | 1346 (55.3) | 2436
Canadian Food Inspection Agency (InspectionCan) | 4145 (62.2) | 2516 (37.8) | 6661
National Institutes of Health (NIH) | 5837 (71.6) | 2314 (28.4) | 8151
National Indian Health Board (NIHB1) | 1247 (51.1) | 1195 (48.9) | 2442
US Food and Drug Administration (US_FDA) | 5810 (59.7) | 3925 (40.3) | 9735
Total | 47,213 (42.3) | 64,343 (57.7) | 111,555

Pharmaceutical companies
AstraZeneca (AstraZeneca) | 3462 (78.2) | 963 (21.8) | 4425
Biogen (biogen) | 1819 (61.9) | 1120 (38.1) | 2939
Glaxo SmithKline (GSK) | 4200 (69.3) | 1857 (30.7) | 6057
Johnson & Johnson (JNJNews) | 4813 (71.4) | 1926 (28.6) | 6739
Pfizer (pfizer) | 3637 (64.1) | 2039 (35.9) | 5676
Total | 17,931 (69.4) | 7905 (30.6) | 25,836

NGOa
World Health Organization (WHO) | 24,775 (56.2) | 19,303 (43.8) | 44,078
aNGO: nongovernment organization.
Overall research framework. WHO: World Health Organization.
The content of each user was divided into 2 phases, before and during COVID-19. We performed topic modeling on the tweets authored by the organizations, using the topics yielded by the best-performing topic model to explore the most and least talked-about topics with the help of heatmaps. Additionally, we examined the top 10 hashtags used by these organizations.
First, all nonalphabets (numbers, punctuation, new-line characters, and extra spaces) and Uniform Resource Locators (URLs) were removed using the regular expression module (
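A minimal sketch of this cleaning step with Python's `re` module; the exact pattern set and rule order used in the study are assumptions based on the description above:

```python
import re

def clean_tweet(text: str) -> str:
    """Remove URLs, non-alphabetic characters, new lines, and extra spaces.

    The order of operations is an assumption; the study lists only what
    was removed, not the sequence.
    """
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[^A-Za-z\s]", " ", text)            # numbers, punctuation
    text = re.sub(r"\s+", " ", text)                    # new lines, extra spaces
    return text.strip().lower()

print(clean_tweet("COVID-19 update:\nVisit https://who.int now!!  Stay safe 123"))
```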
Researchers have used term frequency–inverse document frequency (TF-IDF) to create document embeddings for tweets [
We used Gensim LDA [
Heatmaps were generated using
The top 10 hashtags mentioned in the users’ tweets were evaluated using the
Sentiment analysis is an NLP approach used to categorize the sentiments appearing in Twitter messages based on the keywords used in each tweet. We tested different models that classify a user’s tweet into 1 of 3 categories: positive, negative, or neutral. Although there is no common threshold for how many tweets should be sampled, prior studies have used samples of around 2000 tweets [
For a given user, Twitter defines the engagement rate [
where “
Researchers have analyzed the impact (popularity) of Twitter handles by proposing heuristic and neural network–based models [
where
The total number of tweets produced by a user was considered inversely proportional to the user’s impact, because a user tweeting occasionally and receiving higher engagement is more impactful than a user tweeting regularly with lower engagement.
Engagement analysis was performed to quantify the popularity of a topic generated. The engagement for each user was defined as the product of average engagement per day and their impact, as described in Equation (3). The average engagement per day was calculated as the sum of the count of likes, replies, retweets, and quotes per day. These reactions were aggregated from January 1, 2017, to December 31, 2021.
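Equations (1)-(3) are not reproduced above, so the following pure-Python sketch mirrors only the verbal description: daily engagement sums the four reaction counts, user impact is assumed to scale total engagement inversely with tweet volume (the paper's exact functional form may differ), and user engagement is the product from Equation (3):

```python
def avg_engagement_per_day(daily_reactions):
    """daily_reactions: list of (likes, replies, retweets, quotes) per day."""
    totals = [sum(day) for day in daily_reactions]
    return sum(totals) / len(totals)

def user_impact(total_engagement, n_tweets):
    # Hypothetical form: the paper states only that impact grows with
    # engagement and is inversely proportional to the user's tweet count.
    return total_engagement / n_tweets

def user_engagement(avg_eng_per_day, impact):
    # Equation (3): engagement = average engagement per day x user impact
    return avg_eng_per_day * impact
```

For example, a user averaging 11 reactions per day with an impact of 20 would score an engagement of 220 under these assumed forms.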
The exponential moving average (EMA) was calculated with a window span of 151 days for every user, and outliers were removed using the z-score, followed by smoothing of the average engagement per day with an eighth-degree Savitzky-Golay filter [
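A sketch of this smoothing chain with pandas and SciPy on synthetic data; the Savitzky-Golay window length and the z-score cutoff are assumptions, as the paper specifies only the span (151 days) and the polynomial degree (8):

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
daily = pd.Series(rng.normal(100, 10, 400))   # synthetic engagement per day
daily.iloc[50] = 10_000                       # inject an outlier

# Exponential moving average with a 151-day span
ema = daily.ewm(span=151).mean()

# Remove outliers via z-score (cutoff of 3 is an assumption)
z = (daily - daily.mean()) / daily.std()
kept = daily[z.abs() < 3]

# Eighth-degree Savitzky-Golay smoothing; the window length must be odd
# and larger than the polynomial order (11 is an assumption)
smooth = savgol_filter(kept.to_numpy(), window_length=11, polyorder=8)
```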
To forecast the sentiment per day, we first needed to quantify the overall sentiment of the tweets from each user every day. We leveraged CardiffNLP’s
The daily sentiment scores were then resampled to a monthly mean sentiment score, which also helped us in handling missing values, if any. The complete timeline was divided into 2 phases (ie, before and during COVID-19), as discussed before, and the sentiment score was forecasted on 20% of the data set in each period for all user groups.
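A small pandas sketch of this aggregation; the mapping of the 3 sentiment classes to numeric scores (+1/0/–1) is an assumption, as the paper does not spell out how class labels are combined into a daily score:

```python
import pandas as pd

# Hypothetical label-to-score mapping
score = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}

tweets = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2020-01-03", "2020-01-03", "2020-01-20", "2020-02-10"]),
    "label": ["positive", "negative", "neutral", "positive"],
})
tweets["score"] = tweets["label"].map(score)

# Daily mean sentiment, then resample to a monthly mean; the monthly mean
# skips days with no tweets, which handles missing values
daily = tweets.set_index("created_at")["score"].resample("D").mean()
monthly = daily.resample("MS").mean()
```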
A grid search was used to find optimal hyperparameters, and 5-fold cross-validation was performed for every model. The
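As a minimal stand-in for the study's grid search over 16 models, the sketch below grid-searches the lag order of a NumPy-implemented AR(p) model using expanding-window, one-step-ahead validation; this is purely illustrative and not the study's tooling:

```python
import numpy as np

def fit_ar(y, p):
    """Fit an AR(p) model by ordinary least squares (bias + p lag coefficients)."""
    X = np.column_stack([y[p - k : len(y) - k] for k in range(1, p + 1)])
    X = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def one_step_ahead(y, coef, p):
    """Forecast the next value from the last p observations."""
    return coef[0] + coef[1:] @ y[::-1][:p]

def rolling_rmse(y, p, n_test):
    """Refit on an expanding window and forecast 1 step ahead n_test times."""
    errs = []
    for i in range(n_test):
        train = y[: len(y) - n_test + i]
        coef = fit_ar(train, p)
        errs.append((one_step_ahead(train, coef, p) - y[len(y) - n_test + i]) ** 2)
    return float(np.sqrt(np.mean(errs)))

# Synthetic AR(2) series as stand-in data
rng = np.random.default_rng(42)
y = np.zeros(150)
for t in range(2, 150):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + 0.1 * rng.standard_normal()

# Grid-search the lag order, scoring on the held-out tail of the series
scores = {p: rolling_rmse(y, p, n_test=30) for p in (1, 2, 3, 4)}
best_p = min(scores, key=scores.get)
```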
Three metrics, the mean absolute error (MAE), the mean square error (MSE), and the root-mean-square error (RMSE), were selected to evaluate the forecasting accuracy of the models. We considered 1-step-ahead forecasting for this study as it helped avoid problems related to cumulative errors from the preceding period.
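The three metrics have simple closed forms; a self-contained implementation:

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - f) for a, f in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - f) ** 2 for a, f in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root-mean-square error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

y_true = [0.20, 0.40, 0.10, 0.30]
y_pred = [0.25, 0.35, 0.20, 0.30]
```

Because the RMSE squares errors before averaging, it penalizes occasional large misses more than the MAE does, which is why the two can rank models differently in the results below.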
The study was performed using Compute Canada (now the Digital Research Alliance of Canada) resources, which provide access to advanced research computing (ARC), research data management (RDM), and research software (RS). The following computing resources were provided by Graham, one of the National Services clusters of the Digital Research Alliance:
Central processing unit (CPU): 2x Intel E5-2683 v4 Broadwell@2.1 GHz
Memory (RAM): 30 GB
The details of the parameters used for each model are discussed in
The scaled heatmaps showing the topic distribution for different Twitter handles are shown in
This shift in the tweets’ content was observed across the complete data set, and we further made the following inferences:
Before COVID-19: Chronic diseases were the most talked-about topic for pharmaceutical companies (AstraZeneca: 1729 tweets, 49.9%; Pfizer: 1168 tweets, 32.1%) and for WHO (4831 tweets, 19.5%), followed by tweets on health research (WHO: 1703 tweets, 6.9%; AstraZeneca: 1037 tweets, 29.9%). This is supported by
During COVID-19: Chronic diseases and health research were the most active topics for AstraZeneca (680 tweets, 70.6%) and Glaxo SmithKline (GSK; 655 tweets, 35.2%), respectively. In addition, COVID-19 and vaccination were most talked about by GSK (398 tweets, 21.4%) and Pfizer (396 tweets, 19.4%).
Mean coherence scores and CPUa time for different clustering algorithms.
Clustering algorithm | cv | cumass | Time taken (minutes:seconds)

Before COVID-19
LDAb | 0.352 | –5.526 | 17:11
Parallel LDA | 0.396 | –3.709 | 5:48
NMFc | 0.493 | –3.653 | 7:38
LSId | 0.316 | –5.921 | 0:16
HDPe | 0.696 | –18.668 | 3:24

During COVID-19
LDA | 0.456 | –5.688 | 14:01
Parallel LDA | 0.446 | –3.990 | 6:08
NMF | 0.567 | –3.794 | 7:04
LSI | 0.381 | –5.356 | 0:16
HDP | 0.650 | –17.610 | 3:01
aCPU: central processing unit.
bLDA: latent Dirichlet allocation.
cNMF: nonnegative matrix factorization.
dLSI: latent semantic indexing.
eHDP: hierarchical Dirichlet process.
Scaled heatmaps showing topic distribution for pharmaceutical companies before and during COVID-19.
Top hashtags of pharmaceutical companies before and during COVID-19.
WHO (user impact=4171.24) had the highest impact overall, followed by public health agencies (CDC user impact=2895.87; NIH user impact=891.06). Among pharmaceutical companies, Pfizer’s user impact was the highest at 97.79. The user impact was normalized to the range of 0 to 1 and is shown in
Among pharmaceutical companies, Pfizer’s user engagement was far higher than that of others (
A similar trend was observed for public health agencies, with the CDC’s account showing the highest user engagement between March and June 2020, the early months of the COVID-19 pandemic. A sharp rise in user engagement occurred in May 2021, when the CDC announced relaxed social distancing and masking rules for fully vaccinated individuals. The user engagement on WHO’s account varied significantly over time. Its engagement was highest in February-April 2020, the early months of the pandemic, similar to what was observed for public health agencies. A sharp increase was seen in October 2020, around World Mental Health Day, and in late 2020, when WHO made an announcement about COVID-19 vaccine development (refer to
User impact of all Twitter handles scaled between 0 and 1. CDC: Centers for Disease Control and Prevention; NIH: National Institutes of Health; WHO: World Health Organization.
User engagement on Twitter accounts of pharmaceutical companies from January 1, 2017, to December 31, 2021.
Before COVID-19: ARIMA and SARIMAX models generated the lowest MSE (0.005) and RMSE (0.072) for pharmaceutical companies. When measuring the model performance through the MAE, ARIMA performed better than all other models (0.063). A similar trend was observed for public health agencies, with ARIMA having the lowest MAE (0.027) and SARIMAX having the lowest RMSE (0.031) and a tie between them for the MSE (0.001). SARIMAX had the lowest MAE (0.054), MSE (0.004), and RMSE (0.080) on the WHO data set.
During COVID-19: Using the CatBoost regressor gave the lowest MAE (0.072) and RMSE (0.086), while the K-neighbors regressor yielded the lowest MSE (0.008) for pharmaceutical companies. Performing regression using AdaBoost generated the lowest MAE (0.084) and RMSE (0.105) among all models used, and SARIMAX had the lowest MSE (0.011) for public health agencies. For WHO, the elastic net, lasso regression, and light gradient boosting performed equally well, with all 3 models having the same MAE (0.046) and RMSE (0.059), and SARIMAX had the lowest MSE (0.004).
To verify the forecasting performance of these models, we checked for the nature of their residual errors (ie, whether the residuals of the models were normally distributed with mean 0 and SD 1 and were uncorrelated). From
Results of time series sentiment forecasting using different MLa models (all metrics are 5-fold cross-validated).

Models | Pharmaceutical companies, before COVID-19 | | | Pharmaceutical companies, during COVID-19 | | | Public health agencies, before COVID-19 | | | Public health agencies, during COVID-19 | | | WHOb, before COVID-19 | | | WHOb, during COVID-19 | |
 | MAEc | MSEd | RMSEe | MAE | MSE | RMSE | MAE | MSE | RMSE | MAE | MSE | RMSE | MAE | MSE | RMSE | MAE | MSE | RMSE
ARIMAf | 0.063g | 0.005g | 0.072g | 0.098 | 0.013 | 0.112 | 0.027g | 0.001g | 0.032h | 0.240 | 0.082 | 0.286 | 0.066h | 0.006h | 0.080h | 0.106 | 0.012 | 0.111
SARIMAXi | 0.065h | 0.005g | 0.072g | 0.084 | 0.011 | 0.104 | 0.028j | 0.001g | 0.031g | 0.709 | 0.011g | 0.106h | 0.054g | 0.004g | 0.061g | 0.047h | 0.004g | 0.066
Bayesian ridge | 0.083 | 0.010 | 0.100 | 0.102 | 0.018 | 0.119 | 0.031 | 0.001 | 0.037 | 0.141 | 0.037 | 0.163 | 0.075j | 0.009j | 0.087j | 0.061 | 0.008 | 0.075
Ridge regression | 0.069 | 0.008 | 0.085 | 0.079 | 0.011 | 0.094 | 0.030 | 0.002 | 0.038 | 0.124 | 0.029 | 0.147 | 0.076 | 0.009 | 0.091 | 0.056 | 0.007 | 0.068
CatBoost regressor | 0.066 | 0.007j | 0.080h | 0.072g | 0.008h | 0.086g | 0.027h | 0.001h | 0.035 | 0.104 | 0.023 | 0.127 | 0.079 | 0.009 | 0.089 | 0.052 | 0.007 | 0.065
K-neighbors regressor | 0.070 | 0.009 | 0.087 | 0.075h | 0.008g | 0.087h | 0.030 | 0.001 | 0.036 | 0.093j | 0.022 | 0.113 | 0.081 | 0.011 | 0.100 | 0.050 | 0.007 | 0.061j
Elastic net | 0.070 | 0.008 | 0.088 | 0.080 | 0.009j | 0.093j | 0.029 | 0.001h | 0.035 | 0.087h | 0.021j | 0.109j | 0.082 | 0.011 | 0.100 | 0.046g | 0.006h | 0.059g
Lasso regression | 0.070 | 0.008 | 0.088 | 0.080 | 0.009j | 0.093j | 0.029 | 0.001 | 0.035 | 0.087h | 0.021j | 0.109j | 0.082 | 0.011 | 0.100 | 0.046g | 0.006h | 0.059g
Random forest regressor | 0.065j | 0.007h | 0.081j | 0.080 | 0.010 | 0.093 | 0.028 | 0.001h | 0.034j | 0.110 | 0.024 | 0.134 | 0.082 | 0.009 | 0.090 | 0.047j | 0.006j | 0.060h
Light gradient boosting machine | 0.070 | 0.008 | 0.088 | 0.080 | 0.009j | 0.093j | 0.029 | 0.001h | 0.035 | 0.087h | 0.021j | 0.109j | 0.082 | 0.011 | 0.100 | 0.046g | 0.006h | 0.059g
Gradient boosting regressor | 0.075 | 0.008 | 0.086 | 0.079 | 0.010 | 0.094 | 0.029 | 0.001j | 0.036 | 0.141 | 0.034 | 0.168 | 0.082 | 0.010 | 0.094 | 0.051 | 0.008 | 0.064
AdaBoost regressor | 0.070 | 0.007 | 0.082 | 0.080 | 0.010 | 0.091 | 0.029 | 0.001 | 0.037 | 0.084g | 0.020h | 0.105g | 0.087 | 0.010 | 0.096 | 0.057 | 0.007 | 0.072
Extreme gradient boosting | 0.068 | 0.009 | 0.087 | 0.080 | 0.011 | 0.098 | 0.031 | 0.002 | 0.040 | 0.151 | 0.045 | 0.171 | 0.087 | 0.011 | 0.098 | 0.055 | 0.007 | 0.065
Decision tree regressor | 0.076 | 0.009 | 0.086 | 0.087 | 0.013 | 0.106 | 0.029 | 0.001 | 0.037 | 0.112 | 0.030 | 0.142 | 0.098 | 0.014 | 0.111 | 0.048 | 0.006j | 0.061
Linear regression | 0.245 | 0.312 | 0.314 | 0.094 | 0.017 | 0.114 | 0.157 | 0.164 | 0.216 | 0.124 | 0.029 | 0.148 | 2.367 | 52.719 | 3.334 | 0.062 | 0.008 | 0.076
Prophet | 0.108 | 0.016 | 0.126 | 0.089 | 0.011 | 0.104 | 0.040 | 0.002 | 0.049 | 0.120 | 0.015 | 0.124 | 0.114 | 0.020 | 0.143 | 0.086 | 0.011 | 0.106
aML: machine learning.
bWHO: World Health Organization.
cMAE: mean absolute error.
dMSE: mean squared error.
eRMSE: root-mean-square error.
fARIMA: autoregressive integrated moving average.
gThe highest-performing forecasting method.
hThe second-highest-performing forecasting method.
iSARIMAX: seasonal autoregressive integrated moving average with exogenous factors.
jThe third-highest-performing forecasting method.
One-step-ahead forecast for all pharmaceutical companies before and during COVID-19 using the best-performing models from Table S1 (
In this paper, we proposed a framework for using NLP-based text-mining techniques for performing comprehensive social media content analysis of various health care organizations. We processed reasonably large amounts of textual data for topic modeling, sentiment and engagement analysis, and sentiment forecasting. Our study revealed the following key findings:
Being the most active organization on social media does not translate to more user impact. WHO and the US public health agency CDC generated far more user impact than the Public Health Agency of Canada, even though the latter had a high number of relevant tweets when analyzed topicwise. People are more likely to engage with
Certain topics normally translate to more user engagement. Although content on chronic diseases and health research dominated the tweets posted over the study period, there was a marked shift toward discussion of COVID-19 and vaccination for public health agencies, more so than for pharmaceutical companies. Tweets on COVID-19 and chronic diseases generate more interest among the public. Perhaps surprisingly, we found that people are not very receptive to content on medical trials, often shared by pharmaceutical companies, unless it concerns a public health emergency, such as the COVID-19 pandemic. Using particular hashtags certainly helps in generating engagement, as most user engagement was highly skewed toward tweets concerning COVID-19. Moreover, in contrast to the engagement patterns observed for most health care organizations (ie, peaks around major events or announcements), WHO showed wide variations in user engagement. This could be due to WHO's global presence: it might not be the same set of followers engaging with its content every time, but rather those who are affected by or interested in a given topic.
When the content is structured, results tend to exceed expectations. We conducted sentiment forecasting on the data sets using different moving-average and univariate ML models. Because the content published on official Twitter accounts is typically well structured, the models forecast monthwise tweet sentiment with high accuracy and low errors, more so before COVID-19 than during it; this allowed an in-depth analysis without the need for multivariate ML models. The results show that the commonly used ARIMA and SARIMAX models work well and can be used for predicting tweet sentiment on live data. This could also help organizations correlate tweet sentiment with user engagement. For example, the highest engagement on Pfizer’s tweets was for the ones labeled
There are 3 limitations of this study that could be addressed in future research. First, this work focused on dividing the tweets into 2 phases,
This study examined the online activity of US and Canadian health care organizations on Twitter. The NLP-based analysis of social media presented here can be incorporated to gauge engagement on the previously published tweets and to generate tweets that create an impact on people accessing health information via SMPs. As organizations continue to leverage SMPs by providing the latest information to the community, predicting a tweet’s sentiment before publishing can boost an organization’s perception by the public. In conclusion, we found that performing content analysis and sentiment forecasting on an organization’s social media usage provides a comprehensive view of how it resonates with society.
Topics and user engagement.
ARC: advanced research computing
ARIMA: autoregressive integrated moving average
CDC: Centers for Disease Control and Prevention
CPU: central processing unit
HCP: health care professional
HDP: hierarchical Dirichlet process
LDA: latent Dirichlet allocation
LSI: latent semantic indexing
MAE: mean absolute error
ML: machine learning
MSE: mean squared error
NGO: nongovernment organization
NIH: National Institutes of Health
NLP: natural language processing
NMF: nonnegative matrix factorization
RMSE: root-mean-square error
SARIMAX: seasonal autoregressive integrated moving average with exogenous factors
SMP: social media platform
TF-IDF: term frequency–inverse document frequency
WHO: World Health Organization
The authors thank members of the DaTALab at Lakehead University for valuable discussions, along with Andy Pan, Chandreen Ravihari Liyanage, and Lakshmi Preethi Kamak for annotating the sampled tweets to evaluate the tweet sentiment. This study was conducted using Digital Research Alliance of Canada computing resources. AS and MKB were supported by Vector Scholarships in artificial intelligence (AI) from Vector Institute, Toronto, Canada, and a Natural Sciences and Engineering Research Council (NSERC) Discovery Grant (#RGPIN-2017-05377) held by VM.
None declared.