This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.
As the COVID-19 pandemic progressed, disinformation, fake news, and conspiracy theories spread through many parts of society. According to the literature, disinformation spreading through social media is one of the causes of increased COVID-19 vaccine hesitancy. In this context, the analysis of social media posts is particularly important, but the large amount of data exchanged on social media platforms requires specific methods. This is why machine learning and natural language processing models are increasingly applied to social media data.
The aim of this study is to examine the capability of the CamemBERT French-language model to accurately predict the categories we developed, given that tweets about vaccination are often ambiguous, sarcastic, or irrelevant to the studied topic.
A total of 901,908 unique French-language tweets related to vaccination published between July 12, 2021, and August 11, 2021, were extracted using Twitter’s application programming interface (version 2; Twitter Inc). Approximately 2000 randomly selected tweets were labeled with 2 types of categorizations: (1) arguments for (pros) or against (cons) vaccination (health measures included) and (2) type of content (scientific, political, social, or vaccination status). The CamemBERT model was fine-tuned and tested for the classification of French-language tweets. The model’s performance was assessed by computing the F1-score, and confusion matrices were obtained.
The accuracy of the applied machine learning model reached up to 70.6% for the first classification (pro and con tweets) and up to 90% for the second classification (scientific and political tweets). Furthermore, a tweet was 1.86 times more likely to be incorrectly classified by the model if it contained fewer than 170 characters (odds ratio 1.86; 95% CI 1.20-2.86).
The accuracy of the model is affected by the classification scheme chosen and the topic of the message examined. When the vaccine debate is disrupted by contested political decisions, tweet content becomes so heterogeneous that the accuracy of the model drops for the less differentiated classes. However, our tests showed that it is possible to improve the accuracy by selecting tweets using a new method based on tweet length.
The COVID-19 pandemic has profoundly affected our society and social activity worldwide. Part of this change is perceptible through messages exchanged on social media platforms, specifically on the topic of vaccination. Since the measles, mumps, and rubella vaccine controversy in 1998 [
In this context, social media analysis is particularly important, but the large amount of data exchanged over social networks requires specific methods. This is why machine learning and natural language processing (NLP) models are becoming increasingly popular for studying social media data. The most used and “most promising method” [
The aim of this study is to examine the capability of the CamemBERT model to accurately predict the categories we developed while considering that tweets about vaccination are often ambiguous, sarcastic, or irrelevant to the studied topic. Based on the resulting analysis, this paper aims to provide a methodological and epistemological reflection on the analysis of French-language tweets related to vaccination.
The CamemBERT model was released in 2020 and is considered one of the state-of-the-art French-language models [
Although multilingual models are plentiful, they usually lag behind their monolingual counterparts. This is why, in this study, we chose to employ a monolingual model to classify French-language tweets. CamemBERT comes in 6 different “flavors,” ranging from small models with 110 million parameters trained on 4 GB of text up to mid-size models with 335 million parameters trained on 135 GB of text. After testing them, we found that the best results were obtained with the largest model, which was pretrained on the CCNet corpus.
All these models require fine-tuning on specific data to achieve their full potential. Fine-tuning, or transfer learning, has long been a common and successful practice in computer vision, but it is only in the last 3 years or so that the same approach has become effective for solving NLP problems on specific data. This approach can be summarized in the following 3 steps:
A language model such as BERT is built in an unsupervised manner using a large database, removing the need to label data.
A specific head (such as dense neural network layers) is added to the previous model to make it task-specific.
The new model is trained in its entirety with a small learning rate on specific data.
The first step is usually performed by large companies, such as Google or Facebook, or public research centers that make their models freely available on internet platforms. The second and third steps form a process that is generally referred to as fine-tuning.
French-language tweets published between July 12, 2021, and August 11, 2021, were extracted using the Twitter application programming interface ([API] version 2; Twitter Inc), as summarized in the flow chart below.
Flow chart of methodology steps. API v2: application programming interface version 2.
A total of 1851 unique tweets were randomly selected and manually labeled by 2 people (1451 for training and validation and 400 for testing). When doubt arose about the labeling, which occurred for 87 of the 1851 tweets (4.7%), the 2 annotators discussed the tweet to determine the relevant label (see the examples of conflicting labeling in the appendix).
A total of 2 classifications were developed to examine (1) arguments for (pros) or against (cons) vaccination (health measures included) and (2) the type of tweet content (scientific, political, social, or vaccination status). The classifications and definitions used to label the tweets are provided in the table below.
Classification criteria for tweets and definitions.
| Type of tweet | Definition | Translated examples (French to English) |
| --- | --- | --- |
| Classification problem 1 | | |
| Unclassifiable | Unclassifiable or irrelevant to the topics of vaccination or health measures | The Emmanuel Macron effect |
| Noncommittal | Neutral or without explicit opinion on vaccination and/or the health pass | I have to ask my doctor for the vaccine |
| Pros | Arguments in favor of the health pass | Personally, I am vaccinated so nothing to fear, on the other hand, good luck to all the anti-vaccine, you will not have the choice now?? |
| Cons | Arguments against vaccination or doubts about the effectiveness of COVID-19 vaccines, fear of side effects, and refusal to obtain the health pass | I am against the vaccine I am not afraid of the virus but I am afraid of the vaccine |
| Classification problem 2 | | |
| Unclassifiable | Irrelevant to the topic or unclassifiable | A vaccine |
| Scientific | Scientific or pseudoscientific content that uses true beliefs or false information | The vaccine is 95% efficient, a little less in fragile people. The risk is not zero, but a vaccinated person has much less chance of transmitting the virus. |
| Political | Comments on legal or political decisions about vaccination or health measures | Basically the vaccine is mandatory, shameful LMAO |
| Social | Comments, debates, or opinions on the relationship to other members of society | “Pro vaccine” you have to also understand that there are people who do not want to be vaccinated. |
| Vaccination status | Explicit tweet about the vaccination status of the tweet author | Example 1: I am very glad to have already done my 2 doses of the vaccine, fudge |
This study followed the general methodology of machine learning to guarantee rigorous construction of the model. To ensure that the model did not overfit or underfit the data set, the following steps were taken:
The data set was divided into training (n=1306), validation (n=145), and testing (n=400) data sets (see the sketch after this list).
The training loss was represented as a function of the number of epochs to monitor the learning of the model and to select the optimal number of epochs.
The validation accuracy was represented as a function of the number of epochs to ensure that the model was not overfitting or underfitting the data.
The final model was evaluated on a testing data set that had not been previously used to build or validate the model.
A total of 2 fully connected dense neural network layers with 1024 and 4 neurons (for classification problem 1) or 5 neurons (for classification problem 2) were added to the head of the CamemBERT model, adding another 1.6 million parameters. Furthermore, to prevent overfitting, a 10% dropout was applied between these 2 layers. A small learning rate of 2 × 10^-5 was used for fine-tuning, and adaptive moment estimation with decoupled weight decay regularization [
One of the main hyperparameters to be tuned when training the model is the number of epochs. As a rule of thumb, to prevent overfitting, the number of epochs is usually chosen at the point where the slope of the training loss changes abruptly while the misclassification rate on the validation data set remains low.
This was confirmed by computing the precision, recall, and F1-score at 3 different epochs (7, 15, and 20), as shown in the corresponding table below.
A similar study for the second classification problem determined that 6 epochs were enough to prevent overfitting. The performance of the model was also measured by computing the weighted precision, recall, and F1-score, as shown in the corresponding table below.
The size of the data set is quite similar to those of Kummervold et al [
The proportion of tweets assigned to each label in the data set for classification problems 1 and 2 (n=1451).
| Classification problem | Tweets, n (%) |
| --- | --- |
| Classification problem 1 | |
| Unclassifiable | 189 (13.0) |
| Neutral | 354 (24.4) |
| Positive | 392 (27.0) |
| Negative | 516 (35.6) |
| Classification problem 2 | |
| Unclassifiable | 226 (15.6) |
| Scientific | 441 (30.4) |
| Political | 316 (21.8) |
| Social | 353 (24.3) |
| Vaccination status | 115 (7.9) |
Training loss (a) and validation accuracy (b) of the model over 20 epochs for classification problem 1.
Classification performance of the model for classification problem 1.
| Epochs, n | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| 7 | 59.0 | 55.3 | 55.3 |
| 15 | 56.6 | 53.0 | 53.2 |
| 20 | 56.9 | 54.5 | 55.2 |
Classification performance of the model for classification problem 2.
| Epochs, n | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| 6 | 67.6 | 64.5 | 62.9 |
| 15 | 62.7 | 62.8 | 61.3 |
| 20 | 60.6 | 59.5 | 56.5 |
From the results of the previous section, we see that it is significantly more difficult to build a performant classifier based on the 4 vaccine sentiment labels (unclassifiable, noncommittal, pros, and cons), with the maximum F1-score reaching 55.3% in this case. On the other hand, the classifier built from the same tweets but with 5 different labels based on content type (unclassifiable, scientific, political, social, and vaccination status) achieved a much higher F1-score (62.9%).
To analyze the strengths and weaknesses of a model more specifically, it is always instructive to represent its results using a confusion matrix [
Since the values in these matrices are percentages, their interpretation requires some care. For the first problem, summing the figures row by row in the matrix shows that out of 100 tweets from the test data set, on average, 11.25 are unclassifiable, 35.50 are noncommittal, 13.25 are pros, and 40.00 are cons. It is then possible to compute the proportion of tweets correctly classified by the model, label by label. The results are shown in the table below.
For the second problem, as expected given the higher F1-score found in the previous section, the model achieves much better classification performance. It excels at classifying scientific and political tweets and is also good at classifying social tweets. It still has some difficulties classifying unclassifiable tweets and, to an even greater extent, vaccination status tweets. Looking back at the confusion matrix, for the last 2 labels, we observe that the model tends to classify them as social tweets.
Confusion matrix for classification problems 1 and 2 (n=400).
The number of tweets correctly classified for each label in classification problems 1 and 2 (n=400).
| Classification problem | Tweets correctly classified, n (%) |
| --- | --- |
| Classification problem 1 | |
| Unclassifiable | 10 (22.2) |
| Noncommittal | 62 (43.7) |
| Pros | 36 (67.9) |
| Cons | 113 (70.6) |
| Classification problem 2 | |
| Unclassifiable | 27 (40.3) |
| Scientific | 67 (79.8) |
| Political | 93 (82.3) |
| Social | 58 (66.7) |
| Vaccination status | 13 (26.5) |
To improve the performance of the fine-tuned CamemBERT model, a hypothesis about the influence of tweet length on model accuracy was tested. A Mann-Whitney U test was used to compare the text lengths of correctly and incorrectly classified tweets.
Tweet text length as a function of the accuracy of the fine-tuned CamemBERT model for classification problems 1 and 2 (Mann-Whitney U test).
The finding of the previous section was further supported by the following experiment. Tweets with more than 170 characters were selected from the 400-tweet test data set. The model for classification problem 2 was then tested on these 168 tweets to determine whether its accuracy increased.
As shown in the table below, the weighted precision, recall, and F1-score all increased when the model was restricted to these long tweets.
The confusion matrix generated by comparing the model's classifications with the manual classifications of the 168 long tweets is shown below.
As already suggested by the Mann-Whitney U test, the model classifies long tweets more accurately overall, as detailed label by label in the table below.
Classification performance of the model for classification problem 2, limited to long tweets (170 or more characters).
| Classification problem | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| 2 | 72.6 | 73.2 | 72.4 |
Confusion matrix for classification problem 2 limited to long tweets (n=168).
The proportion of correct classifications for each label in classification problem 2, limited to long tweets (170 or more characters; n=168).
| Type of tweet | Tweets correctly classified, n (%) |
| --- | --- |
| Unclassifiable | 6 (46.3) |
| Scientific | 42 (79.2) |
| Political | 45 (90.0) |
| Social | 25 (65.8) |
| Vaccination status | 5 (35.7) |
A total of 2 types of classification were examined. The accuracy of the model was better with the second classification (67.6%; F1-score 62.9%) than the first classification (59%; F1-score 55.3%). This accuracy is slightly higher than that obtained by BERT for the same topic (vaccines) [
Therefore, as shown by Kummervold et al [
Consequently, tweet content is so varied that it remains difficult to categorize manually, and this is reflected in the model predictions. Considering classification problem 1, tweets containing terms characteristic of the antivaccine position, such as “5G,” “freedom,” “phase of testing,” “side effect,” and “#passdelahonte” (“shameful pass”), were found to be easier to label and predict. However, because antivaccine proponents spread disinformation more widely on social media [
Finally, tweets relevant to a topic may be rare in a data set. In some studies, the corpus is halved [
Several limitations can be highlighted, including the following: (1) the data came from a single social media platform (Twitter); (2) all tweets containing the term “vaccine” and its derivatives were included without preselection; (3) several categorization classes were unbalanced; (4) a larger training set could yield different results; (5) the categorization choices could affect the performance of CamemBERT, as seen in the confusion matrix; and (6) the suggestions provided (selecting tweets by length) may only apply to tweets on the topic of vaccination, so further studies are needed to confirm the relevance of our conclusions.
In this study, we tested the accuracy of a model (CamemBERT) without preselecting tweets, and we elaborated an epistemological reflection for future research. When the vaccine debate is disrupted by contested political decisions, tweet content becomes so heterogeneous that the accuracy of the model decreases for the less differentiated classes. In summary, our analysis shows that epistemological choices (types of classes) can affect the accuracy of machine learning models. However, our tests also showed that it is possible to improve the model accuracy by using an objective selection method based on tweet length. Other possible avenues for improvement remain to be tested, such as the addition of features provided by Twitter (conversation ID, follower and following counts, the user's listed and tweet counts, or user ID).
Examples of conflicting labeling.
API: application programming interface
BERT: bidirectional encoder representations from transformers
NLP: natural language processing
OR: odds ratio
None declared.