This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Health science findings are primarily disseminated through manuscript publications. Information subsidies are used to communicate newsworthy findings to journalists in an effort to earn mass media coverage and further disseminate health science research to mass audiences. Journal editors and news journalists then select which news stories receive coverage and thus public attention.
This study aims to identify attributes of published health science articles that correlate with (1) journal editor issuance of press releases and (2) mainstream media coverage.
We constructed four novel datasets to identify factors that correlate with press release issuance and media coverage. These corpora include thousands of published articles, subsets of which received press release or mainstream media coverage. We used statistical machine learning methods to identify correlations between words in the science abstracts and press release issuance and media coverage. Further, we used a topic modeling-based machine learning approach to uncover latent topics predictive of the perceived newsworthiness of science articles.
Both press release issuance for, and media coverage of, health science articles are predictable from corresponding journal article content. For the former task, we achieved average areas under the curve (AUCs) of 0.666 (SD 0.019) and 0.882 (SD 0.018) on two separate datasets, comprising 3024 and 10,760 articles, respectively. For the latter task, models realized mean AUCs of 0.591 (SD 0.044) and 0.783 (SD 0.022) on two datasets—in this case containing 422 and 28,910 pairs, respectively. We report the words and topics most predictive of press release issuance and of news coverage.
We have presented a novel data-driven characterization of content that renders health science “newsworthy.” The analysis provides new insights into the news coverage selection process. For example, it appears that epidemiological papers concerning common behaviors (eg, alcohol consumption) tend to receive media attention.
Health news is an increasingly popular topic in news media [
Gandy [
In this study, we aim to use data-driven, quantitative approaches to address the following questions: What topical content in health science articles correlates with receiving, or not receiving, a press release? Relatedly, what topical content correlates with receiving, or not receiving, news media coverage? What are the differences in the content of articles covered by the news media versus those that receive a press release?
The news media are powerful conduits by which to disseminate important information to the public [
Agenda setting has been used to explain the impact of the news media in the formation of public opinion [
In the area of health, journalists rely more heavily on sources and experts because of the technical nature of the information [
This study focuses on media coverage of developments in health science and scientific findings. Previous research has highlighted factors that might promote press release generation for, and news coverage of, health science articles. This work has relied predominantly on qualitative approaches. For instance, Woloshin and Schwartz [
In this study, we present a complementary approach using data-driven, quantitative methods to uncover the topical content that correlates with both news release generation and mainstream media coverage. Our hypothesis is that there exist specific topics—for which words and phrases are proxies—that are more likely to be considered “newsworthy.” Identifying such topics will illuminate latent biases in the journalistic process of selecting scientific articles for media coverage.
In this work, we apply natural language processing and statistical machine learning techniques to characterize features of scientific articles that receive media coverage. Specifically, we aim to build interpretable statistical models that can reliably predict whether a published health science article will (1) receive a press release from the publishing journal and (2) garner media coverage in mainstream outlets.
To explore these processes empirically, we have constructed novel datasets. Our preliminary work [
1. We use supervised latent Dirichlet allocation (sLDA) [
2. We analyze a new corpus [
Our models are able to reliably discriminate between articles that will and will not (1) motivate a press release and (2) receive media coverage. We report robust predictors for these two tasks, both in terms of words and bigrams in a discriminative bag-of-words framework and with respect to higher-level topics uncovered via sLDA.
We now describe the datasets that we have constructed to empirically investigate patterns in press release generation for, media coverage of, and social media attention to, health science articles. We made all of these datasets publicly available, along with our code, to facilitate future research [
First, we augmented the dataset recently introduced by Sumner and colleagues [
Sumner and colleagues coded each journal article, press release, and news piece using a detailed protocol that is available online [
Additionally, we constructed two datasets, Journal of the American Medical Association (JAMA) and Reuters, which we have described in our earlier work [
We decomposed our aims into distinct modeling tasks to be undertaken using the associated datasets. We treated these as predictive tasks for validation purposes, but our interest is primarily in the predictive features rather than in classifier performance per se.
Summary of the four tasks and their associated datasets.
Task | Source | Positive instances in dataset, n (%) | Negative instances in dataset, n (%) | Title length (words), mean (SD) | Abstract length (words), mean (SD)
PRa | Sumner PR (N=3024) | 422 (13.96) | 2602 (86.04) | 13 (5) | 214 (67) |
PR | JAMAb (N=10,760) | 846 (7.86) | 9914 (92.14) | 13 (5) | 335 (82) |
NCc | Sumner NC (N=422) | 214 (50.7) | 208 (49.3) | 14 (5) | 226 (79) |
NC | Reuters (N=28,910) | 1343 (4.65) | 27,567 (95.35) | 14 (6) | 267 (86) |
aPR: press release.
bJAMA: Journal of the American Medical Association.
cNC: news coverage.
Our first use of the Sumner corpus [
All 422 of these articles constitute
A pair of positive and negative instance snippets from the Sumner press release (PR) dataset.
The JAMA corpus comprised 846 positive instances, defined as articles for which journal editors created a press release—all journals in this corpus belong to the JAMA network [
For the first news coverage prediction task, we used the 422 articles contained in the Sumner dataset. In this case, we knew which articles were covered by one or more news outlets, and we could therefore derive positive and negative labels for each article. In all, 214 of these articles received news media coverage. We will refer to this dataset as Sumner NC.
The Reuters corpus [
In this section, we describe the machine learning methods we used to analyze the corpora. Broadly, these can be decomposed into our discriminative learning approach and the generative supervised topic modeling method we used to uncover latent topics that correlate with newsworthiness.
For discriminative learning, we used standard logistic regression with a squared ℓ2 norm penalty placed on the weights for regularization. Specifically, given a labeled corpus, we optimize the objective in Equation 1 in
As features in the logistic regression model, we used uni- and bigrams extracted from titles, abstracts, and MeSH terms. MeSH terms are Medical Subject Headings drawn from a controlled vocabulary maintained by the National Library of Medicine (NLM). These are manually assigned to citations by trained annotators at the NLM.
For text preprocessing, we used a standard English stop word list, and only kept features that appeared in at least two instances in a given dataset. We kept, at most, the 50,000 most frequently occurring features in the datasets, in cases where there were more than 50,000 unique features. The numbers of features for each task are summarized together with the sLDA model in the next section.
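The preprocessing and discriminative model described above can be sketched with scikit-learn (an assumption; the paper does not name its implementation). The documents and labels below are toy examples for illustration only:

```python
# Sketch of the discriminative pipeline: uni- and bigrams from titles,
# abstracts, and MeSH terms; standard English stop word list; features
# kept only if seen in at least two instances, capped at 50,000.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def build_model(C=1.0):
    vectorizer = CountVectorizer(
        ngram_range=(1, 2),      # unigrams and bigrams
        stop_words="english",    # standard English stop word list
        min_df=2,                # keep features seen in >= 2 instances
        max_features=50000,      # cap vocabulary size
    )
    # Squared l2 penalty on the weights; C is the inverse
    # regularization strength (a tunable hyperparameter).
    clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    return vectorizer, clf

# Toy usage with hypothetical documents (title + abstract + MeSH terms
# concatenated into one string per article):
docs = [
    "alcohol consumption cohort study mh-great-britain",
    "protein signaling overexpression in mice",
    "alcohol intake and mortality cohort mh-england",
    "receptor binding domain structure analysis",
]
labels = [1, 0, 1, 0]  # 1 = press release issued
vec, clf = build_model()
X = vec.fit_transform(docs)
clf.fit(X, labels)
```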
To identify robustly predictive features, we used bootstrap sampling to construct confidence intervals around coefficient point estimates. Specifically, we fit a regularized logistic regression model to each bootstrap training sample and recorded estimated coefficient values for each feature. We repeated this process 1000 times, deriving a variance from the observed estimates. We then constructed an approximate 95% confidence interval around coefficients using the normal approximation method [
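A minimal sketch of this bootstrap procedure, again assuming scikit-learn and numpy; the resample count and interface below are illustrative:

```python
# Bootstrap confidence intervals around logistic regression coefficients
# (the paper used 1000 resamples and the normal approximation method).
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_coef_cis(X, y, n_boot=200, seed=0):
    """Fit an l2-regularized logistic regression on each bootstrap
    resample; return normal-approximation 95% CIs on the coefficients."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample with replacement
        if len(set(y[idx])) < 2:           # skip single-class resamples
            continue
        clf = LogisticRegression(penalty="l2", max_iter=1000)
        clf.fit(X[idx], y[idx])
        coefs.append(clf.coef_.ravel())
    coefs = np.array(coefs)
    mean = coefs.mean(axis=0)
    sd = coefs.std(axis=0, ddof=1)
    z = 1.96                               # normal approximation, 95% CI
    return mean - z * sd, mean + z * sd

# A feature counts as robustly positively predictive when its entire
# interval lies above zero (lower bound > 0), and vice versa.
```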
Equation 1. The ℓ2-regularized logistic regression objective (reconstructed from the description above): min_w Σ_i log(1 + exp(−y_i w⊤x_i)) + λ‖w‖₂².
Statistical topic models have emerged as an important tool for discovering topics from large collections of text documents. Topic models postulate a
Supervised topic modeling is a variant of this, in which auxiliary meta-data about documents (ie, supervision) is assumed to be available [
More specifically, we assumed that there are
1. Draw topic proportions θ ~ Dirichlet(α).
2. For each word position n:
(a) Draw topic assignment z_n | θ ~ Multinomial(θ).
(b) Draw word w_n as in
3. Draw class label c as in
Here, the labels
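Steps 1-3 above can be sketched as a small numpy simulation of the generative process; α, β, and η here are illustrative values, not parameters learned from the corpora:

```python
# Minimal sketch of the sLDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 5, 8  # topics, vocabulary size, words per document

alpha = np.ones(K)                          # symmetric Dirichlet prior
beta = rng.dirichlet(np.ones(V), size=K)    # per-topic word distributions
eta = np.array([2.0, 0.0, -2.0])            # per-topic regression weights

# 1. Draw topic proportions theta ~ Dirichlet(alpha).
theta = rng.dirichlet(alpha)

# 2. For each word position n, draw a topic assignment
#    z_n ~ Multinomial(theta), then a word w_n ~ Multinomial(beta[z_n]).
z = rng.choice(K, size=N, p=theta)
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])

# 3. Draw the class label from the empirical topic frequencies z_bar;
#    for two classes, the softmax of eta^T z_bar reduces to a logistic.
z_bar = np.bincount(z, minlength=K) / N
p_pos = 1.0 / (1.0 + np.exp(-eta @ z_bar))
label = rng.binomial(1, p_pos)
```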
The words comprising our vocabulary were unique unigrams extracted from citation titles, abstracts, and MeSH terms. We again kept up to 50,000 of the most frequently occurring words in the dataset as features. Ultimately, for the discriminative task (for which we used logistic regression) we used 50,000 features for Sumner PR, 50,000 for JAMA, 10,004 for the much smaller Sumner NC dataset, and 50,000 for the Reuters corpus. For generative modeling (ie, using sLDA), we were left with 23,561 features for the Sumner PR dataset, 23,539 for the JAMA dataset, 5796 for Sumner NC, and 50,000 for the Reuters corpus.
Word distribution. w_n is the word at position n, z_n is the topic at position n, and β_k is a vector of term probabilities.
Class label. c is the class label; η is a K-dimensional vector of real values.
Empirical topic frequencies.
Equation 2: softmax function.
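The equation placeholders above presumably correspond to the standard sLDA distributions; a hedged reconstruction in LaTeX, with notation matching the variable descriptions above:

```latex
% Reconstruction of the sLDA distributions (notation as described above).
% Word distribution:
w_n \mid z_n, \beta \sim \mathrm{Multinomial}(\beta_{z_n})
% Empirical topic frequencies:
\bar{z} = \frac{1}{N} \sum_{n=1}^{N} z_n
% Class label (Equation 2, the softmax):
p(c \mid \bar{z}, \eta) = \frac{\exp(\eta_c^{\top}\bar{z})}{\sum_{c'}\exp(\eta_{c'}^{\top}\bar{z})}
```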
With respect to discriminating between articles that did and did not receive a press release in the Sumner PR dataset, we achieved a mean AUC of 0.666 (SD 0.019; range 0.636-0.720 across five folds of cross-validation), indicating relatively strong predictive performance. We report the top 25 most robustly predictive n-gram features, negative and positive, in
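The five-fold cross-validated AUC evaluation can be sketched as follows, on synthetic data (scikit-learn assumed; the folds and data here are illustrative):

```python
# Five-fold cross-validated AUC with mean and SD, as reported in the text.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)  # synthetic labels

aucs = []
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in folds.split(X, y):
    clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X[train], y[train])
    scores = clf.predict_proba(X[test])[:, 1]   # probability of positive class
    aucs.append(roc_auc_score(y[test], scores))

mean_auc, sd_auc = np.mean(aucs), np.std(aucs, ddof=1)
```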
We also present output from a 20-topic sLDA model fit to the Sumner PR dataset in
Top 25 negative features:
decreased
study
results
sexual
mice
based
research
signaling
evaluated
regarding
proposed
protein
states
discuss
program
various
lesions
review
thyroid
analyzed
performed
overexpression
medical
asd
Top 25 positive features:
uk
england
infection
death
MH-great-britain (MH prefix indicates a Medical Subject Headings [MeSH] term)
magnetic
alcohol
life
weeks
MH-england/epidemiology
magnetic resonance
setting
main outcome
pregnancy
british
functional magnetic
increase
resonance imaging
MH-great-britain MH-humans
resonance
TI-study (TI prefix indicates a title term)
adjusted
brain
outcomes
Density curves of coefficient values for four positively predictive words on the Sumner press release dataset.
Density curves of coefficient values for four negatively predictive words on the Sumner press release dataset.
The top 10 most probable words under the topics uncovered by a 20-topic supervised latent Dirichlet allocation model fit to the Sumner press release dataset. The horizontal axis corresponds to the coefficient of the topic. mh: this prefix indicates a Medical Subject Headings (MeSH) term; ti: this prefix indicates a title term.
The top 10 most probable words under the topics uncovered by a 20-topic supervised latent Dirichlet allocation model fit to the Journal of the American Medical Association dataset, using press release issuance as the supervision. The horizontal axis corresponds to the coefficient of the topic. mh: this prefix indicates a Medical Subject Headings (MeSH) term.
The analysis reporting informative features for logistic regression prediction was presented in our preliminary work [
On the Sumner NC dataset, we experimented with two different feature sets: (1) features extracted from the journal articles and (2) features extracted from the corresponding press release text. Our model using journal features achieved a mean AUC of 0.591 (SD 0.044), ranging from 0.502 to 0.701 across five folds; our model using press release features achieved a mean AUC of 0.575 (SD 0.023), ranging from 0.497 to 0.622. News coverage is thus less predictable than press release issuance, although performance remains better than chance (ie, AUC 0.5). We report the top 25 most predictive features (ie, terms) of news coverage for each feature set in
Top 25 negative article features:
binding
receptor
development
protein
resistance
identify
surface
MH-molecular-sequence-data (MH prefix indicates a Medical Subject Headings [MeSH] term)
direct
responses
disruption
rapidly
domain
regions
structure
synaptic
TI-early (TI prefix indicates a title term)
MH-models-biological
specific
using
culture
MH-amino-acid-sequence
TI-children
understanding
complexes
Top 25 negative press release features:
resistance
physical
proteins
childhood
liverpool
understanding
born
impact
design
opportunities
university
attention
american
bristol published
sentences
changes
university birmingham
discovered
options
published
cardiac
revealed
date
leave
led professor
Top 25 positive article features:
95% ci
women
use
risk
countries
variation
hazard
body
england
participants
TI-cohort (TI prefix indicates a title term)
council
TI-cohort TI-study
research council
medical research
individual
sex
main outcome
cohort study
systolic blood
relevant
cancers
research
TI-gene
Top 25 positive press release features:
face
research funded
shown
better
motor
years
gene
england
edinburgh
trial
production
flu
council
targeting
producing
roslin
research council
roslin institute
widespread
royal society
faces
providing
website
affecting
exciting
Density curves of coefficient values for four positively predictive words on the Sumner news coverage (NC) dataset.
Density curves of coefficient values for four negatively predictive words on the Sumner news coverage (NC) dataset.
Top 10 most probable words in the topics uncovered by the supervised latent Dirichlet allocation model—again assuming 20 topics—fit to the Sumner news coverage dataset. mh: this prefix indicates a Medical Subject Headings (MeSH) term; ti: this prefix indicates a title term.
In
For the discriminative learning task for this dataset, we have reported results previously [
Top 10 words from the 20 topics uncovered by the supervised latent Dirichlet allocation on the Reuters corpus, again using news coverage as the supervision. mh: this prefix indicates a Medical Subject Headings (MeSH) term; ti: this prefix indicates a title term.
As news organizations weather a fast-changing information landscape, press releases are helping journalists fill news holes more than ever before [
In our prior preliminary work [
This study examined the topics covered by press releases generated by scientific journals. Specifically, we have presented new corpora, methods, and results that aim to illuminate factors that correlate with press release generation for, and news media coverage of, health science articles. Our analysis indicates that scientific journals intentionally disseminate press releases that cover topics likely to be found “newsworthy” by lay audiences. For example, the flu was a topic frequently found in articles deemed newsworthy and in those for which journal editors wrote press releases.
Some of the press release topics were very general and applicable to broad audiences. For example,
There are several practical implications of these results. For instance, press releases from scientific journals might be considered a trustworthy source for journalists working in health news. However, journalists should be aware of the limited breadth of topics covered in press releases and should explore other research findings for news coverage.
This research is not without limitations, however. We can only surmise why press releases were written for certain health science research findings or why a press release garnered news coverage. More research is needed into why certain health science articles are chosen as newsworthy and why journalists report on the findings they do. Although news values are meant to guide journalists’ selection of news, some argue that news values are broad and vary greatly among news organizations [
Moving forward, we are encouraged by our positive results, and believe our models could be improved further in future work. For example, we could move beyond simple lexical features like n-grams and MeSH terms, including high-level concepts as features, such as the size and composition of the study cohort or the affected population, the type of study (eg, observational or controlled), and whether the research is basic or more applied. Richer linguistic features would also be interesting to incorporate, to help understand if certain writing styles are associated with more or less press coverage. When predicting media coverage, it would also be interesting to use features extracted from the press releases in addition to, or instead of, the features from the original journal articles, to understand how press releases influence the news media.
AUC: area under the curve
JAMA: Journal of the American Medical Association
LDA: latent Dirichlet allocation
MeSH: Medical Subject Headings
MH: MeSH terms
NC: news coverage prediction task
NLM: National Library of Medicine
PR: press release prediction task
sLDA: supervised latent Dirichlet allocation
Sumner NC: Sumner's dataset for news coverage prediction task
Sumner PR: Sumner's dataset for press release prediction task
TI: title terms
None declared.