This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
PubMed is the largest biomedical bibliographic information source on the Internet. PubMed has been considered one of the most important and reliable sources of up-to-date health care evidence. Previous studies examined the effects of domain expertise/knowledge on search performance using PubMed. However, very little is known about PubMed users’ knowledge of information retrieval (IR) functions and their usage in query formulation.
The purpose of this study was to shed light on how experienced/nonexperienced PubMed users perform their search queries by analyzing a full-day query log. Our hypotheses were that (1) experienced PubMed users who use system functions quickly retrieve relevant documents and (2) nonexperienced PubMed users who do not use them have longer search sessions than experienced users.
To test these hypotheses, we analyzed PubMed query log data containing nearly 3 million queries. User sessions were divided into two categories: experienced and nonexperienced. We compared the two groups by number of sessions and by session length, focusing on how quickly users completed their sessions.
To test our hypotheses, we measured the success of information retrieval (ie, retrieving relevant documents), represented as the rate at which the number of experienced and nonexperienced users decreased as session length increased from 1 to 2, 3, 4, and 5 queries. The decrease rate (from a session length of 1 to 2) for the experienced users was significantly larger than that for the nonexperienced group.
Experienced PubMed users retrieve relevant documents more quickly than nonexperienced PubMed users in terms of session length.
Methods of information seeking have become much easier, faster, and less expensive since the 1990s with the advent of information technologies (ITs), including the Internet, digital libraries (eg, electronic full-text databases), and online search software/services such as Google Scholar and PubMed. [
Recent years have seen a rising trend in biomedical information seeking from PubMed [
The goal of this study was to shed light on how PubMed users perform their search queries by analyzing a full-day query log. The hypotheses of this study were that (1) experienced PubMed users who use system functions such as Medical Subject Heading (MeSH) terms and search field tags quickly retrieve relevant documents and (2) nonexperienced PubMed users who do not use them have longer search sessions than experienced users, because they identify their information needs through subsequent queries by narrowing and/or broadening their queries. In order to test the hypotheses, we analyzed a full day of PubMed log data. We assumed that if a session was closed within a few queries, the session was successful (meaning that relevant documents were retrieved), even though a session close does not always mean successful IR.
In this study, experienced PubMed users were defined as users who used advanced PubMed IR functions for query formulation. The proper use of IR functions (described in the next section) is key for efficient and effective PubMed searches [
PubMed system functions include search field tags, MeSH terms (used for indexing PubMed articles), truncation, and combining searches using search history. In PubMed, bibliographic information is stored in a structured database with 65 fields including title, abstract, author, journal or proceeding, publication type, and publication date. PubMed provides 48 search field tags in order to facilitate searching in its various database fields; a description for each search field is available at the NLM website [
The study of information-seeking behavior is very important for the user-centric design of online IR systems including digital libraries. Individuals’ knowledge and skills related to information seeking are the primary determinants of their online IR performance. According to Marchionini (1995) [
The second major area of expertise is the knowledge of information seekers in their area of interest (known as domain knowledge). The NLM reported that almost two-thirds of PubMed users are health care professionals and scientists (ie, domain experts), whereas the remainder are the general public [
The other two determinants of search performance (ie, overall experience using online information seeking and experience or knowledge of the functions of the IR system) can be considered together as procedural knowledge for using the IR system [
In this study, our goal was to compare experienced versus nonexperienced users’ searching behavior in terms of session length (ie, the number of queries per session). We used a full-day PubMed query log for that purpose. There are a number of approaches for studying user-searching behavior such as eye tracking, surveys, and search log analysis. Search log analysis has become a viable solution for many applications including search engines [
Silverstein et al (1999) [
The focus of this study is different from that of the eight studies that used PubMed log data [
The dataset used in this study is a plain text file containing a full-day’s query log of PubMed that was obtained from the NLM FTP site (Refer to [
The data cleaning and preprocessing steps are presented in
Data cleaning and preprocessing.
The user queries in the PubMed log file are categorized as informational, navigational, or mixed according to the purpose of the search expressed in the query. Informational queries are intended to fulfill end users’ information needs (eg, "diabetes mellitus" [MeSH]) and navigational queries are intended to retrieve specific documents (eg, Yoo [author] AND Mosa [author]). Mixed queries have both intentions (eg, searching for a specific topic within a specific journal). Refer to Broder (2002) [
In order to identify the purpose of user queries for query categorization, we used PubMed’s ATM. Every PubMed user query is automatically translated by ATM to improve overall IR performance and the translated query is actually used for the PubMed search; if a query contains double quotation marks or search tags, those parts (words or terms) are not translated. The ATM translation identifies each term in a query and adds an appropriate search tag to the term. We categorized PubMed queries using ATM-added tags as well as user-added tags after ATM translations. PubMed provides 48 search tags (refer to the PubMed Help website [
The search tag extraction process was semiautomatic and consisted of two steps: the semiautomatic construction of a list of search tags and their variations, and the automatic extraction of those search tags (including variations) from the queries using the list. A total of 963 unique substrings were extracted from the queries in the first step. The partially manual first step was required for two reasons. First, each search tag has several variations that are correctly recognized by the PubMed system but not fully documented; for example, [Author Name], [Author], [AU Name], [Auth], and [AU] all represent the same search tag, but only [Author Name] and [AU] are documented in the PubMed Help web page. Second, incorrect search tags (eg, typos such as [Atuhor]) are not recognized by the PubMed system, but a domain expert can correctly recognize the intended tag. The search tags extracted from the translated queries were then analyzed to identify query types. Because navigational search tags are mainly used to retrieve specific documents rather than to fulfill information needs, we excluded navigational and mixed queries from the analysis, assuming that informational search tags are primarily used to express information needs.
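The two-step extraction and categorization process above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the variation map and the split of tags into navigational versus informational intent are hypothetical, abbreviated stand-ins for the full list of 963 substrings the authors constructed.

```python
import re

# Hypothetical, abbreviated variation map: each recognized bracketed
# substring (lowercased) maps to a canonical PubMed search tag.
# The study's actual list held 963 unique substrings built semiautomatically.
TAG_VARIATIONS = {
    "author": "au", "author name": "au", "au name": "au", "auth": "au", "au": "au",
    "mesh": "mh", "mesh terms": "mh", "mh": "mh",
    "title": "ti", "ti": "ti",
    "journal": "ta", "ta": "ta",
}

# Illustrative split of canonical tags by search intent (assumed here).
NAVIGATIONAL_TAGS = {"au", "ta"}   # point to specific documents
INFORMATIONAL_TAGS = {"mh", "ti"}  # express an information need

BRACKET = re.compile(r"\[([^\[\]]+)\]")

def extract_tags(query: str) -> list[str]:
    """Return canonical search tags found in a (translated) query.
    Unrecognized substrings (eg, the typo [Atuhor]) are ignored."""
    tags = []
    for raw in BRACKET.findall(query):
        canonical = TAG_VARIATIONS.get(raw.strip().lower())
        if canonical:
            tags.append(canonical)
    return tags

def categorize(query: str) -> str:
    """Classify a query as informational, navigational, or mixed."""
    tags = set(extract_tags(query))
    has_nav = bool(tags & NAVIGATIONAL_TAGS)
    has_info = bool(tags & INFORMATIONAL_TAGS)
    if has_nav and has_info:
        return "mixed"
    if has_nav:
        return "navigational"
    return "informational"  # untagged queries default to informational
```

For example, `categorize('"diabetes mellitus" [MeSH]')` yields `informational`, while `categorize('Yoo [Author] AND Mosa [AU]')` yields `navigational`.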
Query categorization.
Information seeking is defined as “the process of repeatedly searching over time in relation to a specific, but possibly an evolving information problem” [
In this study, we employed both the session-shift and temporal-constraint-based sliding window for session segmentation. This is because several studies reported the average duration of user sessions for query log analysis (meaning that the maximum length of session window can be chosen based on those results for session segmentation) [
Using this method, we extracted 742,602 user sessions from more than 2 million informational queries. User sessions were divided into two categories: experienced and nonexperienced. Experienced sessions were those in which queries were formed using system functions such as MeSH terms and search field tags; otherwise, a user session was considered nonexperienced. For example, a query containing “hypertension [MeSH]” was considered experienced, whereas a query with “high blood pressure” was considered nonexperienced, even though hypertension is a synonym of high blood pressure. This is because, although PubMed’s ATM internally expands the query “high blood pressure” by adding the corresponding MeSH term, the user did not explicitly use any system function.
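The segmentation and labeling described above can be sketched as a small routine. This is a simplified illustration under two assumptions: the 20-minute temporal cutoff mentioned in the Limitations section is the session boundary, and the presence of any bracketed tag is used as a stand-in for the study's full check against its list of search tags and MeSH terms.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=20)  # temporal cutoff assumed from the study

def segment_sessions(events):
    """Group one user's (timestamp, query) events, sorted by time, into
    sessions: a gap longer than SESSION_GAP starts a new session."""
    sessions, current = [], []
    last_time = None
    for ts, query in events:
        if last_time is not None and ts - last_time > SESSION_GAP:
            sessions.append(current)
            current = []
        current.append(query)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

def is_experienced(session):
    """Label a session 'experienced' if any of its queries explicitly uses
    a system function. A bracketed tag is a crude proxy here; the study
    matched against its constructed search tag list instead."""
    return any("[" in q and "]" in q for q in session)
```

For instance, three queries at 9:00, 9:05, and 9:35 would split into two sessions, and a session containing “hypertension [MeSH]” would be labeled experienced while one containing only “high blood pressure” would not.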
First, we performed some basic statistical analysis on query and session data. The number of queries per user ranged from 1 to 8544 (an extreme outlier) with an average of 4.77 queries per user (SD 15.11, median 2).
PubMed users may perform multiple IR sessions to fulfill their various information needs. In order to identify the purpose of each IR session, we categorized the queries in the log dataset as shown in
About 94% (700,547/742,602) of the sessions were performed by nonexperienced users and 6% (42,055/742,602) were performed by experienced users (see
In addition, we measured the decrease rates of the experienced and nonexperienced users from a session length of 1 to 2, 3, 4, and 5. Because the ideal session length is 1 (meaning that a user fulfills his or her information need with a single query), session length 1 served as the baseline. Decrease rates from this baseline indicate the success of the IR session (at retrieving relevant documents).
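One plausible reading of this measure, sketched below under the assumption that the decrease rate from length 1 to length k is (n1 − nk) / n1, where nk is the number of sessions of length k in a group. The function name and this exact formula are illustrative; the paper does not spell out the arithmetic.

```python
from collections import Counter

def decrease_rates(session_lengths, max_len=5):
    """Given the length (queries per session) of each session in a group,
    compute the decrease rate from the baseline length of 1 to each
    length k in 2..max_len: (n1 - nk) / n1, where nk is the number of
    sessions of length k. Larger rates mean fewer long sessions relative
    to the single-query baseline."""
    counts = Counter(session_lengths)
    n1 = counts.get(1, 0)
    if n1 == 0:
        raise ValueError("no baseline sessions of length 1")
    return {k: (n1 - counts.get(k, 0)) / n1 for k in range(2, max_len + 1)}
```

For a toy group with session lengths [1, 1, 1, 1, 2, 2, 3], the rate from 1 to 2 is (4 − 2)/4 = 0.5 and from 1 to 3 is (4 − 1)/4 = 0.75; a steeper drop, as observed for the experienced group, means users finish in fewer queries.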
Percentage of users and queries per number of queries.
Query types and session types.
Percentages of experienced and nonexperienced users per session length (# of queries per session).
Decrease rates of experienced and nonexperienced users by session length (# of queries per session).
In bibliographic searches like PubMed searches, procedural knowledge is an important factor to improve the overall performance of information retrieval. Procedural knowledge includes experience using online search systems and their search functions. Earlier studies demonstrated that PubMed users perform searches with higher recall and precision if PubMed search functions are used [
There are some limitations of this study. First, the PubMed query log data used in this study could have been biased in terms of IR function usage because the data contained search queries for one day only. Second, we used a predetermined time cutoff (20 minutes) for determining search sessions since the log data did not contain any session-related information. It is possible for a PubMed user to perform more than one session in 20 minutes. However, according to recent studies [
Fourth, we assumed that if a session was closed within a few queries, the session was successful (meaning that the user’s information needs were fulfilled), even though closing a session does not always mean successful IR. This assumption is based on the fact that nearly 77% of users had only 1 to 3 queries in a session. We believe that most searches are successful; if most searches were unsuccessful, one would expect most users not to use PubMed again. However, according to the NLM, the number of PubMed users has been increasing. In fact, there is no way to determine from the query log alone whether a session was successful; web log information would be the only solution to this problem, but it is not available. We acknowledge that some sessions closed within a few queries are unsuccessful. However, the gaps between the decrease rates of the experienced and nonexperienced users (especially at the session length of 2, see
It is unknown when the PubMed query data were collected, for confidentiality reasons; however, they are at least 9 years old. One might argue that a study based on such old log data is no longer applicable, because the NLM has added many features to improve the performance and user interface of PubMed, for example, related citations, automatic term mapping, and PubMed Clinical Queries. PubMed is significantly different from how it was 9 years ago, in terms of both the user interface and the internal processes for better information retrieval. However, it is imperative to ascertain whether the new features and user interface actually retrieve more relevant documents or lead to better PubMed searches. Studies have found that most PubMed users
There are many recent studies (published in 2010 or later) that found that physicians prefer UpToDate and/or Google to PubMed, and that UpToDate and/or Google provide more answers to clinical questions. Thiele and colleagues (2010) [
In sum, the findings of these recent studies indicate that the information retrieval features of PubMed are inferior to other electronic resources or search engines such as UpToDate and Google. In other words, most PubMed users still have considerable difficulty obtaining relevant documents/information despite its many new features. As a result, physicians spend more time finding relevant information with PubMed. This problem is critical for PubMed because recent studies still show that the main barrier to POC learning is lack of time [
The PubMed log analysis indicated that experienced PubMed users quickly retrieved relevant documents in terms of session length and nonexperienced PubMed users had longer search sessions than experienced users. We believe there are a few potential solutions to this problem. First, the NLM could design and provide a novel PubMed user interface for nonexperienced users so that they can readily utilize advanced search functions without special training in PubMed. Second, because it is imperative for health professionals (especially physicians) to learn the system functions and MeSH vocabulary for better PubMed searches, the NLM could award grant funding only to institutes that regularly train health professionals in PubMed search skills. Third, the NLM could develop a sophisticated relevance-sorting algorithm similar to Google’s, so that PubMed users can quickly find relevant documents. Currently, PubMed provides a relevance sorting option. However, it is not the default sorting option as of 17 June 2015 and we believe there should be a significant improvement to the sorting algorithm. This PubMed search problem is not just an information retrieval issue but also a health care practice matter, because health professionals, especially physicians, could significantly improve the quality of patient care and effectively educate chronic patients using clinical and medical information and knowledge obtained from PubMed searches.
ATM: Automatic Term Mapping
IR: information retrieval
MeSH: Medical Subject Heading
NLM: National Library of Medicine
The authors are thankful to the United States National Library of Medicine for their efforts in producing and making the PubMed query log publicly available.
None declared.