This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.
Big data analytics offers promise in many business sectors, and health care is looking at big data to provide answers to many age-related issues, particularly dementia and chronic disease management.
The purpose of this review was to summarize the challenges faced by big data analytics and the opportunities that big data opens in health care.
A total of 3 searches (PubMed/MEDLINE, CINAHL, and Google Scholar) were performed for publications between January 1, 2010 and January 1, 2016, and content was assessed for relevance to big data in health care. From the results of the searches in research databases and Google Scholar (N=28), the authors summarized content and identified 9 and 14 themes under the categories of challenges and opportunities, respectively.
The top challenges were issues of data structure, security, data standardization, storage and transfers, and managerial skills such as data governance. The top opportunities revealed were quality improvement, population management and health, early detection of disease, data quality, structure, and accessibility, improved decision making, and cost reduction.
Big data analytics has the potential for positive impact and global implications; however, it must overcome some legitimate obstacles.
Big data analytics offers promise in many business sectors, and health care is looking at big data to provide answers to many age-related issues, particularly dementia and chronic disease management. This systematic review explores the depth of big data analytics since 2010 and identifies both challenges and opportunities associated with big data in health care. The review follows the standard set by Preferred Reporting Items for Systematic Reviews and Meta-analysis (2009) [
Big data is commonly defined through the 4 Vs: volume (scale or quantity of data), velocity (speed and analysis of real-time or near-real-time data), variety (different forms of data, often from disparate data sources), and veracity (quality assurance of the data). The first 3 Vs are found in most literature [
As of 2012, about 2.5 exabytes of data are created each day; Walmart can collect up to 2.5 petabytes of customer-related data per hour [
There are several large sources for big data in health care: genomics, EHR, medical monitoring devices, wearable video devices, and health-related mobile phone apps. Approximately 483 studies on genomics are registered with the US Department of Health and Human Services; these studies are being conducted in 9 countries, and they all use portions of the data from the Human Genome Project [
The decrease in the cost of storage has enabled exponential growth in data collection, but the ability to analyze this quantity of data is the center of gravity for “big data” in health care. In the United States, financial incentives offered for the “meaningful use” of health information technology have spurred growth in the adoption of the EHR and other enabling health-related technology since 2009.
Health information systems show great potential for improving the efficiency of care delivery, reducing overall costs to the health care system, and markedly improving patient outcomes [
With the implementation of this legislation and the technologies associated with it, it is imperative to effectively organize and process the ever-increasing quantity of data that is digitally collected and stored within health care organizations. Other industries such as astronomy, retail, search engines, and politics have developed advanced data-handling capabilities to convert data into knowledge. Health care needs to follow their lead so that organizational objectives and goals can be met [
The purpose of this systematic review is to objectively review articles and studies published in academic journals in order to compile a list of challenges and opportunities faced by big data analytics in health care in the United States. Particular emphasis was paid to age-related applications of big data.
Articles and studies were eligible for analysis if they were published between 2010 and 2015, published in academic journals, and published in English. The researchers chose the range from 2010 to 2015 for two reasons: HITECH was passed in 2009, and research and other articles on the topic appeared to proliferate beginning in 2010. We focused on academic journals for their peer-review quality and to decrease the chance of selecting material about big data published by a noncredible source.
Key terms from Medical Subject Headings (MeSH) were combined with Boolean operators and used in 2 common research databases, CINAHL and PubMed, together with a general search in Google Scholar (see
These terms were chosen not only because they are the focus of the review, but also because they were identified in the initial research into the definition of big data.
The following search string was used in all 3 searches (CINAHL, PubMed [MEDLINE], and Google Scholar): ((“big data” AND healthcare) OR (“big data” AND “health care”)). In the 2 research databases, our team was able to restrict the search to academic journals (including other systematic reviews). MEDLINE was excluded in CINAHL because it was already captured in PubMed. Google Scholar makes searches difficult because it offers few of the filters typically associated with academic research. The initial 13,935 results were limited by restricting dates to the last 5 years, limiting results to academic journals and MEDLINE, and, in Google Scholar, restricting the keyword search to titles. After filtering, 121 articles remained for review.
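As an illustration only, the search string above could be expressed as a programmatic PubMed query. The sketch below assumes the public NCBI E-utilities ESearch endpoint and the date range stated in the abstract; the review does not say that the authors used this interface, and the function name `build_pubmed_query` is hypothetical. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# Boolean search string quoted in the text (straight quotes used for the query).
SEARCH_STRING = '(("big data" AND healthcare) OR ("big data" AND "health care"))'

# Public NCBI E-utilities ESearch endpoint (assumed interface, not the authors' stated method).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, start="2010/01/01", end="2016/01/01"):
    """Return an ESearch URL restricted by publication date; no request is sent."""
    params = {
        "db": "pubmed",        # search the PubMed database
        "term": term,          # Boolean search string
        "datetype": "pdat",    # filter on publication date
        "mindate": start,
        "maxdate": end,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_pubmed_query(SEARCH_STRING)
print(url)
```

Filters such as "academic journals only" or "title only" are database-specific and are not modeled in this sketch.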
Literature review process with inclusion and exclusion criteria.
Through group research and a series of consensus meetings, researchers were trained to identify articles germane to this review and to recommend elimination of all others. The research team used a shared spreadsheet to work through the list of articles. Researchers read all articles in their entirety. A total of 97 articles were eliminated under the exclusion criteria (not germane to big data or health care, editorial only, not an academic journal, or duplicate from another search), and 4 additional articles were identified from the references of the 24 that remained. The reviewers discussed each rejection or additional recommendation in the consensus meetings and reached agreement through discussion. A total of 28 articles remained in the final review.
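The selection counts reported above can be checked with simple arithmetic. The following minimal sketch uses only the numbers stated in the text; the variable names are illustrative.

```python
# Counts reported in the literature review process.
initial_results = 13_935        # hits from all 3 searches before filtering
after_filters = 121             # articles left after date, journal, and title filters
excluded_on_review = 97         # removed under the exclusion criteria
added_from_references = 4       # identified from reference lists

retained = after_filters - excluded_on_review
final_sample = retained + added_from_references

print(f"Removed by filters: {initial_results - after_filters}")
print(f"Retained after full-text review: {retained}")
print(f"Final sample: {final_sample}")
```

The result agrees with the 24 retained articles and the final sample of 28 described in the text.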
Each article was reviewed by at least two authors to identify the relevant points. All reviewers used a spreadsheet template to summarize their key observations from each article. One team member combined the spreadsheets into one and shared it once again. Reviewers held one more consensus meeting to discuss their findings. From this meeting, trends were identified, and from those trends, inferences were made.
From the list of observations, reviewers were able to identify some common threads that emerged as challenges and opportunities in health care that permeated multiple articles. Separate tables were created to group the threads, and from each of these tables, common themes were identified. These common themes only emerged when reviewers combined their observations. These themes were tabulated and counted for additional analysis.
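The tabulation step described above can be sketched as follows. This is a minimal illustration, assuming each reviewer observation is recorded as an (article number, theme) pair; the sample observations and the function name `tabulate_themes` are hypothetical and are not the review's data.

```python
from collections import defaultdict


def tabulate_themes(observations, total_articles):
    """Count distinct articles per theme and compute the share of all articles."""
    articles_by_theme = defaultdict(set)
    for article_id, theme in observations:
        # A set ignores repeated observations of the same theme in one article.
        articles_by_theme[theme].add(article_id)

    rows = []
    for theme, articles in articles_by_theme.items():
        share = round(100 * len(articles) / total_articles)
        rows.append((theme, len(articles), sorted(articles), share))

    # Most frequently observed themes first; ties broken alphabetically.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical observations pooled from several reviewers' spreadsheets.
observations = [
    (1, "Data structure"),
    (2, "Data structure"),
    (2, "Security"),
    (2, "Data structure"),  # duplicate observation from a second reviewer
    (3, "Security"),
    (4, "Data structure"),
]

rows = tabulate_themes(observations, total_articles=4)
for theme, count, articles, share in rows:
    print(f"{theme} | {count} | {articles} | {share}%")
```

Counting distinct articles rather than raw observations keeps a theme from being inflated when two reviewers note it in the same article.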
As depicted in
Multiple reviewers read each article in its entirety. Articles were included or excluded based on the criteria illustrated in
A study catalog number was assigned to each article to simplify the analysis. Researchers summarized the main points of each article for further analysis.
Through the combination of observations, reviewers identified common threads (challenges and opportunities) and themes from each thread. Themes were organized into affinity diagrams (
Nine themes emerged under the category of challenges: data structure, security, data standardization, data storage and transfers, managerial issues such as governance and ownership, lack of skill of data analysts, inaccuracies in data, regulatory compliance, and real-time analytics. Examples for each theme are provided in
Themes associated with challenges for big data in health care.
Themes | Examples | Number of articles | Articles themes appeared in | % of total articles |
Data structure | Fragmented data; Incompatible formats; Heterogeneous data; Raw and unstructured datasets; Large volumes; High variety and velocity; Lack of transparency | 17 | 1, 2, 7-9, 12, 14-19, 22, 25-28 | 61% |
Security | Privacy; Confidentiality; Data duplication; Integrity | 14 | 2, 4, 7-9, 12, 13, 17, 21, 22, 25-28 | 50% |
Data standardization | Limited interoperability; Data acquisition and cleansing; Global sharing; Terminology; Language barriers | 11 | 4, 5, 7-9, 11, 12, 15, 16, 22, 25 | 39% |
Storage and transfers | Expensive to store; Transfer from one place to another; Store electronic data; Securely extract, transmit, and process | 8 | 1, 4, 7, 12, 22, 26, 28 | 28% |
Managerial issues | Governance issues; Ownership issues | 4 | 2, 8, 14, 22 | 14% |
Lack of skill | Untrained workers | 3 | 5, 9, 14 | 11% |
Inaccuracies | Inconsistencies; Lack of precision; Data timeliness | 1 | 9 | 4% |
Regulatory compliance | Legal concerns | 1 | 13 | 4% |
Real-time analytics | Real-time analytics | 1 | 9 | 4% |
The 4 Vs appear in multiple places under the Challenges category. Volume and variety are seen by name under the theme of Data structure. Variety is also implied in the same theme, but listed as Incompatible formats, as well as Raw and unstructured datasets. Variety can also be inferred from the theme of Data standardization, listed as Limited interoperability. Velocity is seen in the theme Real-time analytics. Veracity is seen under the theme of Data Standardization, but listed as Data acquisition and cleansing, Terminology, and Language barriers. It is also inferred in the theme Inaccuracies listed as Inconsistencies and Lack of precision.
Issues related to data structure were addressed in the majority of the papers reviewed for this study. It is essential that the key functions of data processing are supported by the applications of big data [
Research data within the health care sector is more heterogeneous than the research data produced within other research fields [
There are considerable privacy concerns regarding the use of big data analytics, specifically in health care given the enactment of Health Insurance Portability and Accountability Act (HIPAA) legislation [
Although EHRs share data within the same organization (intra-organizationally), EHR platforms are fragmented at best. Data is stored in formats that are not compatible with all applications and technologies [
Limited interoperability poses a large challenge for big data, as data is rarely standardized [
Data generation is inexpensive compared with data storage and transfer. Once data is generated, the costs associated with securing and storing it remain high [
Data governance will need to move up on the priority list of organizations, and data should be treated as a primary asset instead of a by-product of the business [
It is important that health care workers are also kept up to date with the use of constantly changing technology, techniques, and a constantly moving standard of care [
Self-reported data is extensively used in health care, and so it is crucial that the data collected in this manner be consistent [
Health care organizations should be aware of the various legal issues that can surface in the process of managing high volume of sensitive information. Organizations implementing big data analytics as a part of their information systems will have to comply with a significant amount of standards and regulatory compliance issues specific to health care [
One of the key requirements in health care is to be able to utilize big data in real time. Real-time use is enabled by applications such as cloud computing, which allow the data to be viewed as it arrives. The use of these technologies raises security and privacy issues for patient information [
Fourteen themes emerged under the category of opportunities: improving quality of care; managing population health; early detection of diseases; data quality, structure, and accessibility; improving decision making; cost reduction; patient-centric care; enhancing personalized medicine; globalization; fraud detection; and health-threat detection. Examples of each theme are listed in
Themes that emerged from the opportunities for big data in health care.
Themes | Examples | Number of articles | Articles themes appeared in | % of total articles |
Improve quality of care | Improve efficiency; Improve outcomes; Reduce waste; Reduce readmissions; Increased productivity and performance; Risk reduction; Process optimization | 18 | 2, 4, 5, 6, 8-13, 18-20, 22-25, 27 | 64% |
Managing population health | Managing population health | 17 | 2, 5, 8-10, 12-14, 16, 18-20, 23, 25, 26, 28 | 61% |
Early detection of diseases | Predicting epidemics; Disease monitoring; Health tracking; Adopt and track healthier behaviors; Predicting patient vulnerability; Improved treatments | 17 | 2, 4, 5, 7-13, 15, 18-20, 23, 24, 28 | 61% |
Data quality, structure, and accessibility | Large volumes; Wide variety; Creating transparency; High-velocity capture; Access to primary data; Reusable data; Weed out unwanted data; Open source, free access | 16 | 2, 4, 6, 9, 11, 12, 16, 18, 20-23, 25-28 | 57% |
Improve decision making | Evidence-based medicine; New treatment guidelines; Accuracy in information | 11 | 2-4, 7, 9, 12, 16, 20, 22, 23, 24 | 39% |
Cost reduction | Inexpensive; Reducing health care spending | 10 | 1, 3, 4, 7, 9, 11, 12, 14, 16, 18 | 36% |
Patient-centric health care | Empowering patients; Patients making informed decisions; Increased communication | 8 | 2, 3, 5, 12, 14, 20, 22, 24 | 29% |
Enhancing personalized medicine | Targeted approach | 6 | 4-6, 24, 25, 28 | 21% |
Globalization | Widely accessible; Global sharing; Leveraging knowledge and practices; Knowledge dissemination | 6 | 2, 6-8, 10, 20 | 21% |
Fraud detection | Fraud detection | 3 | 8, 12, 28 | 11% |
Health-threat detection | Health-threat detection | 1 | 7 | 4% |
Despite the challenges that big data needs to overcome, the advanced analytics that are promised through big data offer tremendous opportunities for most stakeholders in the health care industry (patient, provider, and payer). More than 64% of the articles analyzed focused on quality improvement and more than 60% on managing population health and early detection of diseases through big data analytics. If even some of the opportunities of big data are realized, they can radically change patient outcomes and the way decisions are made by providers, and help solve some macro-level issues related to health care within countries such as the United States (cost, quality, and access).
Big data has the potential and ability to improve the quality and efficiency of care [
Quality of care will also be improved by reducing waste of information, which will reduce inefficiencies [
The management of population health and the early detection of diseases were topics that the authors thought would have highly similar results after the analysis. Although there was a large overlap between the 2 themes, there was also specific variation between them. So, the researchers chose to keep them separate. The theme of managing population health focused on special populations rather than public health.
Big data analytics define populations at a finer level of granularity than has ever been previously achieved [
Big data allows for the early detection of diseases, which supports clinical objectives such as improved treatments and better patient outcomes [
Literature suggests that big data enables rapid capture of data and the conversion of primary, raw and unstructured data into meaningful information [
Big data enables appropriate use of evidence-based medicine and helps health care providers make more informed decisions [
The decision-making process can be highly optimized by the availability of accurate and up-to-date information, because decision making is influenced by new practices and treatment guidelines generated by clinical research. Allowing big data to influence decision making, by either supporting or replacing human decision making, will make the process faster and simpler. About 39% of the literature mentioned this opportunity.
The literature suggests that the decrease in cost of the elements of computing, such as storage and processing, leads to a decrease in the cost of data-intensive tasks [
Increasing the use of technology is slowly changing the direction of the health care sector from disease-centric care toward patient-centric care [
With the use of big data, the objectives of personalized medicine can be translated into clinical practice [
Big data will actively help in disseminating the knowledge acquired from the data collected [
One of the most significant benefits offered by big data is that it is instrumental in detecting fraud in an efficient and effective manner [
Big data offers the opportunity to detect threats more quickly and more accurately. This can be especially beneficial for government use [
The opportunities most often mentioned or discussed were improving quality of care (18/28, 64%), managing population health (17/28, 61%), early detection of diseases (17/28, 61%), data quality, structure, and accessibility (16/28, 57%), improving decision making (11/28, 39%), cost reduction (10/28, 36%), patient-centric health care (8/28, 29%), enhancing personalized medicine (6/28, 21%), and globalization (6/28, 21%). The other two opportunities each comprised less than 15% of the observations.
Although the integration of big data is well underway in industries such as finance and advertising, it has not yet fully assimilated into health care. Challenges and opportunities were made quite clear in the articles analyzed in this review. Three of the 4 Vs (volume, velocity, and variety) were consistently adhered to. The fourth V, veracity, was found, but rarely listed by name.
A big limitation in this review is the low number of articles used in the analysis. If we were to do this over again, we would query another database to see whether additional articles were available for analysis.
Selection bias seems to exist in any study. Our control for selection bias was the initial research up front to agree on a definitive definition of the concept of big data, and our consensus meetings to discuss findings. The consensus meetings offered great value to the process because they enabled the group to hear the focus of an individual and either provide feedback to confirm the focus or agree that the unique focus was warranted for all the articles in the review.
Another bias that we discuss regularly is publication bias. Journals tend to publish results that are statistically significant, which inherently limits the publication of research that may not reach that level. Our control for publication bias was to include Google Scholar in our search. Our intent was to identify material in lesser-known journals that might not be indexed in PubMed (MEDLINE) or CINAHL.
Big data and the use of advanced analytics have the potential to advance the way in which providers leverage technology to make informed clinical decisions. However, the vast amounts of information generated annually within health care must be organized and compartmentalized to enable universal accessibility and transparency between health care organizations.
Our systematic literature review revealed both challenges and opportunities that big data offers to the health care industry. The literature mentioned the challenges of data structure and security in at least 50% of the articles reviewed. The literature also mentioned the opportunities of increased quality, better management of population health, early detection of disease, and data quality structure and accessibility in at least 50% of the articles reviewed. These findings identify foci for future research.
Summary or relevance of cited work.
ARRA: American Recovery and Reinvestment Act
EHR: electronic health record
HIE: Health Information Exchange
HIPAA: Health Insurance Portability and Accountability Act
HITECH: Health Information Technology for Economic and Clinical Health
MeSH: Medical Subject Headings
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-analysis
None declared.