<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.0 20040830//EN" "http://dtd.nlm.nih.gov/publishing/2.0/journalpublishing.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article" dtd-version="2.0">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">JMI</journal-id>
      <journal-id journal-id-type="nlm-ta">JMIR Med Inform</journal-id>
      <journal-title>JMIR Medical Informatics</journal-title>
      <issn pub-type="epub">2291-9694</issn>
      <publisher>
        <publisher-name>JMIR Publications</publisher-name>
        <publisher-loc>Toronto, Canada</publisher-loc>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="publisher-id">v9i1e23086</article-id>
      <article-id pub-id-type="pmid">33480858</article-id>
      <article-id pub-id-type="doi">10.2196/23086</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Original Paper</subject>
        </subj-group>
        <subj-group subj-group-type="article-type">
          <subject>Original Paper</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>ALBERT-Based Self-Ensemble Model With Semisupervised Learning and Data Augmentation for Clinical Semantic Textual Similarity Calculation: Algorithm Validation Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="editor">
          <name>
            <surname>Wang</surname>
            <given-names>Yanshan</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Liu</surname>
            <given-names>Sijia</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Wang</surname>
            <given-names>Liwei</given-names>
          </name>
        </contrib>
        <contrib contrib-type="reviewer">
          <name>
            <surname>Mordaunt</surname>
            <given-names>Dylan</given-names>
          </name>
        </contrib>
      </contrib-group>
      <contrib-group>
        <contrib id="contrib1" contrib-type="author">
          <name name-style="western">
            <surname>Li</surname>
            <given-names>Junyi</given-names>
          </name>
          <degrees>ME</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-7162-5396</ext-link>
        </contrib>
        <contrib id="contrib2" contrib-type="author">
          <name name-style="western">
            <surname>Zhang</surname>
            <given-names>Xuejie</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0002-5252-5162</ext-link>
        </contrib>
        <contrib id="contrib3" contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Zhou</surname>
            <given-names>Xiaobing</given-names>
          </name>
          <degrees>PhD</degrees>
          <xref rid="aff1" ref-type="aff">1</xref>
          <address>
            <institution>School of Information Science and Engineering</institution>
            <institution>Yunnan University</institution>
            <addr-line>East Outer Ring Road</addr-line>
            <addr-line>Chenggong District, Kunming</addr-line>
            <addr-line>Kunming, 650091</addr-line>
            <country>China</country>
            <phone>86 87165031748</phone>
            <email>zhouxb@ynu.edu.cn</email>
          </address>
          <ext-link ext-link-type="orcid">https://orcid.org/0000-0003-1983-0971</ext-link>
        </contrib>
      </contrib-group>
      <aff id="aff1">
        <label>1</label>
        <institution>School of Information Science and Engineering</institution>
        <institution>Yunnan University</institution>
        <addr-line>Kunming</addr-line>
        <country>China</country>
      </aff>
      <author-notes>
        <corresp>Corresponding Author: Xiaobing Zhou <email>zhouxb@ynu.edu.cn</email></corresp>
      </author-notes>
      <pub-date pub-type="collection">
        <month>1</month>
        <year>2021</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>22</day>
        <month>1</month>
        <year>2021</year>
      </pub-date>
      <volume>9</volume>
      <issue>1</issue>
      <elocation-id>e23086</elocation-id>
      <history>
        <date date-type="received">
          <day>31</day>
          <month>7</month>
          <year>2020</year>
        </date>
        <date date-type="rev-request">
          <day>22</day>
          <month>9</month>
          <year>2020</year>
        </date>
        <date date-type="rev-recd">
          <day>22</day>
          <month>11</month>
          <year>2020</year>
        </date>
        <date date-type="accepted">
          <day>15</day>
          <month>12</month>
          <year>2020</year>
        </date>
      </history>
      <copyright-statement>©Junyi Li, Xuejie Zhang, Xiaobing Zhou. Originally published in JMIR Medical Informatics (http://medinform.jmir.org), 22.01.2021.</copyright-statement>
      <copyright-year>2021</copyright-year>
      <license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
        <p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on http://medinform.jmir.org/, as well as this copyright and license information must be included.</p>
      </license>
      <self-uri xlink:href="http://medinform.jmir.org/2021/1/e23086/" xlink:type="simple"/>
      <abstract>
        <sec sec-type="background">
          <title>Background</title>
          <p>In recent years, with increases in the amount of information available and the importance of information screening, increased attention has been paid to the calculation of textual semantic similarity. In the field of medicine, electronic medical records and medical research documents have become important data resources for clinical research. Medical textual semantic similarity calculation has become an urgent problem to be solved.</p>
        </sec>
        <sec sec-type="objective">
          <title>Objective</title>
          <p>This research aims to solve 2 problems—(1) when the size of medical data sets is small, leading to insufficient learning and understanding by the models and (2) when information is lost in the process of long-distance propagation, causing the models to be unable to grasp key information.</p>
        </sec>
        <sec sec-type="methods">
          <title>Methods</title>
          <p>This paper combines a text data augmentation method and a self-ensemble ALBERT model under semisupervised learning to perform clinical textual semantic similarity calculations.</p>
        </sec>
        <sec sec-type="results">
          <title>Results</title>
          <p>Compared with the methods in the 2019 National Natural Language Processing Clinical Challenges Open Health Natural Language Processing shared task Track on Clinical Semantic Textual Similarity, our method surpasses the best result by 2 percentage points and achieves a Pearson correlation coefficient of 0.92.</p>
        </sec>
        <sec sec-type="conclusions">
          <title>Conclusions</title>
          <p>When the size of medical data set is small, data augmentation can increase the size of the data set and improved semisupervised learning can boost the learning efficiency of the model. Additionally, self-ensemble methods improve the model performance. Our method had excellent performance and has great potential to improve related medical problems.</p>
        </sec>
      </abstract>
      <kwd-group>
        <kwd>data augmentation</kwd>
        <kwd>semisupervised</kwd>
        <kwd>self-ensemble</kwd>
        <kwd>ALBERT</kwd>
        <kwd>clinical semantic textual similarity</kwd>
        <kwd>algorithm</kwd>
        <kwd>semantic</kwd>
        <kwd>model</kwd>
        <kwd>data sets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec sec-type="introduction">
      <title>Introduction</title>
      <p>With the rapid development of computers and artificial intelligence, information availability has begun to show exponential growth. We are already in an era of information explosion. When faced with a large amount of information, time is wasted screening valid information. In addition, a large amount of information is stored in the form of text. Whether involving cluster storage or referring to related information, efficient information matching and screening is crucial. The importance of text information processing research has become very obvious. With major breakthroughs in the research of related algorithms in natural language processing and artificial intelligence, increasingly, research has been devoted to text information processing.</p>
      <p>Textual similarity calculation [<xref ref-type="bibr" rid="ref1">1</xref>] is a key technology for efficient information screening and matching in the field of text processing. Previous work [<xref ref-type="bibr" rid="ref2">2</xref>-<xref ref-type="bibr" rid="ref8">8</xref>] has proposed some methods for textual similarity calculation, for example, traditional text similarity calculation methods [<xref ref-type="bibr" rid="ref2">2</xref>], word similarity calculation [<xref ref-type="bibr" rid="ref3">3</xref>], vector space model [<xref ref-type="bibr" rid="ref4">4</xref>], and latent Dirichlet allocation model [<xref ref-type="bibr" rid="ref5">5</xref>]. At present, with the development of deep learning and neural networks, methods based on neural networks have become popular, for example, word vector embedding method [<xref ref-type="bibr" rid="ref6">6</xref>,<xref ref-type="bibr" rid="ref7">7</xref>] and one-hot representation [<xref ref-type="bibr" rid="ref8">8</xref>]. At the same time, these methods can also be clinically applied.</p>
      <p>In the field of medicine, with the rapid increase in electronic medical data [<xref ref-type="bibr" rid="ref9">9</xref>], electronic medical records and medical documents have become important data resources for medical clinical research. However, most of these data resources are stored unprocessed or in heterogeneous text formats. To understand the content of text data, it is necessary to integrate structured and heterogeneous clinical data resources, medical records, and scientific research documents. Similarity calculation can improve information retrieval performance for medical resources and effectively allow the integration of heterogeneous clinical data. The concept of semantic similarity evaluation is the key to understanding text data resources, which can effectively allow the processing, classification, and structured processing of those resources. For example, a semantic similarity method can be used to semantically analyze patient medical records to identify similar cases and find the best solution.</p>
      <p>However, a large number of publicly available medical data sets are restricted because of privacy, and there are insufficient sources of medical data sets. The scarcity of data sets has led to the slow development of natural language processing (NLP) in the medical field. In recent years, more researchers have begun to pay attention to this issue. Therefore, competitions related to textual semantic similarity calculation have been produced, such as SemEval [<xref ref-type="bibr" rid="ref10">10</xref>], to develop an automated method, and the 2019 National NLP Clinical Challenges (N2C2) Open Health Natural Language Processing (OHNLP) [<xref ref-type="bibr" rid="ref11">11</xref>,<xref ref-type="bibr" rid="ref12">12</xref>] shared task Track 1 on Clinical Semantic Textual Similarity (STS) [<xref ref-type="bibr" rid="ref13">13</xref>], for systems based on semisupervised learning. An example of clinical STS is shown in <xref rid="figure1" ref-type="fig">Figure 1</xref>. The score indicates the similarity between the 2 sentences and falls within an ordinal range from 0 to 5, where 0 means that the 2 sentences are completely different (ie, their meanings do not overlap) and 5 means that the 2 sentences have complete semantic equivalence.</p>
      <fig id="figure1" position="float">
        <label>Figure 1</label>
        <caption>
          <p>An example from the Clinical STS.</p>
        </caption>
        <graphic xlink:href="medinform_v9i1e23086_fig1.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
      </fig>
      <p>Teams that participated in the 2019 N2C2 OHNLP Clinical STS challenge demonstrated good results with methods such as multitask learning, XLNet, and ClinicalBERT methods. In the challenge, we used recursive neural networks and variants of these neural networks for experiments, such as long short-term memory neural networks [<xref ref-type="bibr" rid="ref14">14</xref>], convolutional neural networks [<xref ref-type="bibr" rid="ref15">15</xref>,<xref ref-type="bibr" rid="ref16">16</xref>], capsule neural networks [<xref ref-type="bibr" rid="ref17">17</xref>], and ordered long short-term memory neural networks. In addition, we combined some popular deep learning mechanisms, such as attention [<xref ref-type="bibr" rid="ref18">18</xref>] and Siamese [<xref ref-type="bibr" rid="ref19">19</xref>,<xref ref-type="bibr" rid="ref20">20</xref>] networks. Through comparative experimental research, we obtained a Pearson correlation coefficient of 0.66 [<xref ref-type="bibr" rid="ref21">21</xref>] in the official submission, which was not a satisfying result. Compared with other teams’ methods, our model had 2 drawbacks. First, because the size of clinical data sets was small, there were not enough data to train the model, which led to insufficient learning and understanding of the model. Second, our model was based on a recurrent neural network. Due to the influence of the forget gate in the recurrent neural network, important information may be lost in the process of long-distance propagation, which prevents the model from extracting key information. As a result, the learning efficiency of the model decreased.</p>
      <p>To address the abovementioned problems, this paper proposes a self-ensemble [<xref ref-type="bibr" rid="ref22">22</xref>] ALBERT [<xref ref-type="bibr" rid="ref23">23</xref>] model under semisupervised learning [<xref ref-type="bibr" rid="ref24">24</xref>,<xref ref-type="bibr" rid="ref25">25</xref>] with easy data augmentation (EDA) [<xref ref-type="bibr" rid="ref26">26</xref>] to calculate the semantic similarity of clinical text.</p>
    </sec>
    <sec sec-type="methods">
      <title>Methods</title>
      <sec>
        <title>Overview</title>
        <p>In this section, we introduce 3 highlights of our method. Our method uses data augmentation and semisupervised learning to expand the scale of the data set from different levels. We pretrained ALBERT (based on self-ensemble methods) to strengthen the acquisition of key information and improve the performance of the model, and semisupervised learning and data augmentation methods were used to expand the number of data sets and increase the representation of data sets, which can prevent self-ensemble methods from overfitting.</p>
      </sec>
      <sec>
        <title>Data Augmentation</title>
        <p>By using external general domain data sets for semisupervised learning, we indirectly solved the problem of insufficient data. However, for medical data, semisupervised learning does not directly increase the amount of medical data. Therefore, we used an EDA method to directly increase the amount of medical data.</p>
        <p>Generally, data augmentation is used in computer vision to flip, zoom, and add noise to a picture. These operations can increase small amounts of data, which can help train a more robust model; however, for text data, data augmentation is mainly used for operations such as replacing, adding, and deleting text. Previous work [<xref ref-type="bibr" rid="ref27">27</xref>,<xref ref-type="bibr" rid="ref28">28</xref>] has proposed some methods for data augmentation in NLP. For example, a study [<xref ref-type="bibr" rid="ref27">27</xref>] translated sentences into French and then into English to generate new data. Other work has used data noising as smoothing [<xref ref-type="bibr" rid="ref28">28</xref>]. However, these methods are highly time- and resource-consuming and thus are not often used in practice.</p>
        <p>In this paper, we use the form of EDA [<xref ref-type="bibr" rid="ref26">26</xref>] shown in <xref ref-type="table" rid="table1">Table 1</xref>. Due to the irreplaceability of proper nouns in medical data, the selection range of the replacement operation has been optimized to keep proper nouns as much as possible. The size of the medical data set increased from 1642 to 16,411 after EDA. We can intuitively see a substantial increase in the amount of medical data. We verified that this method increases the size of the data set.</p>
        <table-wrap position="float" id="table1">
          <label>Table 1</label>
          <caption>
            <p>Sentences generated using EDA.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Operation</td>
                <td>Sentence 1</td>
                <td>Sentence 2</td>
                <td>Sentence 3</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>None<sup>a</sup></td>
                <td>oxycodone [ROXICODONE] 5 mg tablet 0.5-1 tablets by mouth every 4 hours as needed.</td>
                <td>A lady is running her cute dog through an agility course.</td>
                <td>A beautiful woman with a young girl pose with bear statues in front of a store.</td>
              </tr>
              <tr valign="top">
                <td>Synonym replacement</td>
                <td>oxycodone [ROXICODONE] 5 mg tablet 0.5-1 tablets by mouth every 4 hours as indeed.</td>
                <td>A lady is running her cute dog through an legerity course.</td>
                <td>A beautiful woman with a young girl pose with bear figurines in front of a store.</td>
              </tr>
              <tr valign="top">
                <td>Random insertion</td>
                <td>oxycodone [ROXICODONE] 5 mg tablet 0.5-1 tablets by every mouth every 4 hours as needed.</td>
                <td>A lady is running her cute dog through an amazing agility course.</td>
                <td>A beautiful woman with a young girl pose with lovely bear statues in front of a store.</td>
              </tr>
              <tr valign="top">
                <td>Random deletion</td>
                <td>oxycodone [ROXICODONE] 5 mg tablet 0.5-1 tablets by mouth every 4 hours.</td>
                <td>A lady is running her dog through an agility course.</td>
                <td>A woman with a young girl pose with bear statues in front of a store.</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table1fn1">
              <p><sup>a</sup>None indicates that this sentence did not undergo any operation.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
      </sec>
      <sec>
        <title>Semisupervised Learning</title>
        <p>Because there was not a sufficient amount of medical data, the training of the model was not complete. To solve this problem, we used the semisupervised learning method in transfer learning.</p>
        <p>The semisupervised [<xref ref-type="bibr" rid="ref29">29</xref>] pretraining task in NLP is a form of transfer learning that aims to establish a wide range of semantic understanding to promote the performance improvement of training and testing tasks. It has been proven that semisupervised pretraining in transfer learning is very effective in benchmark NLP tasks, and the application prospects in medical NLP tasks are particularly broad. Nonspecific pretraining tasks are used for general medical domain tasks; however, commonly used and publicly available data sets are not specific to the medical domain and may not be well summarized. Therefore, the transfer of nonspecific pretraining tasks and the promotion of language models to medical domain tasks are very important for future model development.</p>
        <p>To improve traditional semisupervised learning, we used the <italic>teacher</italic> and <italic>student</italic> idea in data distillation [<xref ref-type="bibr" rid="ref30">30</xref>,<xref ref-type="bibr" rid="ref31">31</xref>] to improve the design of semisupervised learning. Teacher–student refers to the same training process. The beginning of the student's training is the end of the teacher's training, which can deepen the learning of the model. We used the teacher–student approach to design semisupervised learning. The teacher part uses a data set from the common domain, using the STS-B data set from the General Language Understanding Evaluation standard of the general domain. The student part uses a clinical text data set. Our semisupervised learning method is shown in <xref rid="figure2" ref-type="fig">Figure 2</xref>.</p>
        <fig id="figure2" position="float">
          <label>Figure 2</label>
          <caption>
            <p>Semisupervised learning.</p>
          </caption>
          <graphic xlink:href="medinform_v9i1e23086_fig2.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Self-Ensemble ALBERT Model</title>
        <p>ALBERT has been applied to some tasks, such as natural language inference [<xref ref-type="bibr" rid="ref32">32</xref>], sentiment analysis [<xref ref-type="bibr" rid="ref33">33</xref>], causality analysis [<xref ref-type="bibr" rid="ref34">34</xref>], and medical machine reading [<xref ref-type="bibr" rid="ref35">35</xref>]. The self-attention structure is the core part of the transformer mechanism. The self-attention structure can directly calculate the similarity between words, which can intuitively solve the problem of long-distance information dependence. The combined self-attention structure transformer's semantic feature extraction ability is better than those of long short-term memory and convolutional neural networks, and it performs better under the combined action of decomposed embedding parameters and cross-layer shared parameters. Therefore, the pretrained self-attention structure, namely, the pretrained ALBERT model, was applied to our model. ALBERT is a variant of BERT that adds 2 methods of decomposing embedded parameters and sharing parameters across layers. It has 3 improvements. First, ALBERT decomposes embedding, which makes a large number of parameters sparse and reduces the number of dictionaries. Second, ALBERT adopts cross-layer parameter sharing, which reduces the parameter scale and improves the training speed. Third, ALBERT uses intersentence coherence, which makes the model unaffected by specific tasks. The architecture of the ALBERT model is shown in <xref rid="figure3" ref-type="fig">Figure 3</xref>.</p>
        <fig id="figure3" position="float">
          <label>Figure 3</label>
          <caption>
            <p>Model architecture.</p>
          </caption>
          <graphic xlink:href="medinform_v9i1e23086_fig3.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
        <p>Following ALBERT, we first embedded the input data. Our embedding representation is constructed by the sum of token embedding, segment embedding, and location embedding. The input sequence is <italic>S</italic> = [<italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>, ..., <italic>s</italic><sub>n</sub>], where <italic>n</italic> is the number of words in the input. The tokens “[CLS]” and “[SEP]” were added at the beginning and end of each instance, respectively.</p>
        <p>Then, we input the data into the ALBERT model, which is made up of <italic>n</italic> transformer stacks,</p>
        <p>
          <graphic xlink:href="medinform_v9i1e23086_fig5.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </p>
        <p>where <italic>S</italic><sub>m</sub> is the output of transformer stack <italic>m</italic>.</p>
        <p>Since the results do not need to be normalized, we did not use an activation function.</p>
        <p>To achieve the best performance, the ALBERT model was fine-tuned. ALBERT models are usually fine-tuned using stochastic gradient descent methods. In fact, fine-tuning the performance of ALBERT is usually sensitive to different random seeds and orders of the training data, especially if the last training sample is noisy. To alleviate this situation, an ensemble method was used to combine multiple fine-tuning models because it can reduce overfitting and improve model generalization. The ensemble ALBERT model usually has better performance than a single ALBERT model. However, training multiple ALBERT models simultaneously is time-consuming. It is often impossible to train multiple models with limited time and GPU resources. Therefore, we improved the model ensemble method to fine-tune the ALBERT model. Our model’s ensemble method is called self-ensemble. The self-ensemble architecture is shown in <xref rid="figure4" ref-type="fig">Figure 4</xref>. The formula for self-ensemble is</p>
        <p>
          <graphic xlink:href="medinform_v9i1e23086_fig6.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </p>
        <p>where ALBERT(<italic>S</italic><sub>k</sub>) represents the checkpoints of the model with <italic>k</italic> training steps.</p>
        <fig id="figure4" position="float">
          <label>Figure 4</label>
          <caption>
            <p>(a) Traditional ensemble vs (b) self-ensemble architecture.</p>
          </caption>
          <graphic xlink:href="medinform_v9i1e23086_fig4.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </fig>
      </sec>
      <sec>
        <title>Data Sets</title>
        <p>The Clinical STS shared task data set was collected from electronic health records in the Mayo Clinic clinical data warehouse. Since the Mayo Clinic has completed the system-wide electronic health record conversion of all care locations from General Electric to Epic, the Clinical STS shared task data set was extracted from the historical General Electric and Epic systems.</p>
        <p>STS-B is a carefully selected English data set used in the SemEval and *SEM STS shared tasks between 2012 and 2017. The data were divided into a training set, a development set, and a test set. The development set can be used to design new models and adjust hyperparameters. STS-B can be used to make comparable assessments in different research work and improve the tracking of the latest technology.</p>
        <p><xref ref-type="table" rid="table2">Table 2</xref> shows the sizes of the Clinical STS data set and the STS-B data set. The STS-B data set was used for the semisupervised learning training model. The STS-B data set comes from a data set collected by the general domain criterion General Language Understanding Evaluation. The Clinical STS data set was used to test the experimental results. The Clinical STS data set was provided by the competition organizer.</p>
        <p>The STS-B data set provides paired text summaries, which are mainly from STS tasks in SemEval obtained over the years. The Clinical STS data set provides pairs of clinical text summaries, which are sentences extracted from clinical notes. This task assigns a numerical score to each pair of sentences to indicate their semantic similarity. <xref ref-type="table" rid="table3">Table 3</xref> shows that the scores fall within an ordinal range from 0 to 5, where 0 means that the pair of sentences are completely different (ie, their meanings do not overlap) and 5 means that the pair of sentences have complete semantic equivalence.</p>
        <table-wrap position="float" id="table2">
          <label>Table 2</label>
          <caption>
            <p>The size of data set.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Data set</td>
                <td>Training</td>
                <td>Validation</td>
                <td>Test</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>STS-B</td>
                <td>5749</td>
                <td>1500</td>
                <td>1379</td>
              </tr>
              <tr valign="bottom">
                <td>Clinical STS</td>
                <td>1642</td>
                <td>N/A<sup>a</sup></td>
                <td>412</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap position="float" id="table3">
          <label>Table 3</label>
          <caption>
            <p>Similarity scores with examples.</p>
          </caption>
          <table width="1000" cellpadding="5" cellspacing="0" border="1" rules="groups" frame="hsides">
            <col width="100"/>
            <col width="400"/>
            <col width="500"/>
            <thead>
              <tr valign="top">
                <td>Score</td>
                <td>Sentence 1</td>
                <td>Sentence 2</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>0</td>
                <td>The patient has missed 0 hours of work in the past seven days for issues not related to depression.</td>
                <td>In the past year, the patient has the following number of visits: none in the hospital none in the er and one as an outpatient.</td>
              </tr>
              <tr valign="top">
                <td>1</td>
                <td>nortriptyline [PAMELOR] 50 mg capsule 1 capsule by mouth every bedtime.</td>
                <td>Tylenol Extra Strength 500 mg tablet 2 tablets by mouth every bedtime.</td>
              </tr>
              <tr valign="top">
                <td>2</td>
                <td>bupropion [WELLBUTRIN XL] 300 mg tablet sustained release 24 hour 1 tablet by mouth one time daily.</td>
                <td>Flintstones Complete chewable tablet 1 tablet by mouth two times a day.</td>
              </tr>
              <tr valign="top">
                <td>3</td>
                <td>Given current medication regimen, the following parameters should be monitored by outpatient providers: None</td>
                <td>Given current medication regimen, the following parameters should be monitored by outpatient providers: lithium level</td>
              </tr>
              <tr valign="top">
                <td>4</td>
                <td>The diagnosis and treatment plan were explained to the family/caregiver who expressed understanding of the information presented.</td>
                <td>Explained diagnosis and treatment plan; patient expressed adequate understanding of the information presented today.</td>
              </tr>
              <tr valign="top">
                <td>5</td>
                <td>Learns best by: verbal instructions as procedure is being performed, reading, seeing, listening.</td>
                <td>Learns best by: verbal instruction while procedure is performed, reading, seeing, listening.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec>
        <title>Metric</title>
        <p>We used the Pearson correlation coefficient as an evaluation criterion for the performance of the task. The Pearson correlation coefficient,</p>
        <p>
          <graphic xlink:href="medinform_v9i1e23086_fig7.png" alt-version="no" mimetype="image" position="float" xlink:type="simple"/>
        </p>
        <p>where <italic>E</italic> is the mathematical expectation (or mean), <italic>D</italic> is the variance, and Cov(<italic>X</italic>,<italic>Y</italic>)=E{ [X – E(X)] [Y – E(Y)]} is the covariance of random variables <italic>X</italic> and <italic>Y</italic>, is used to measure the degree of correlation between 2 variables.</p>
      </sec>
      <sec>
        <title>Experimental Setting</title>
        <p>In the experiments, we used an Intel Xeon 2.2 GHz processor and an Nvidia Tesla V100 32 GB GPU. Since we use semisupervised learning and self-ensemble techniques, our model will be stored by the checkpoint. The input dimensions of each of our data sets are the same. The optimal setting for the length of the input sequence was 64, and the optimal setting for the batch size was 32. The optimal setting for the checkpoint was 200. The optimal setting of the training step was 3598. In the experiments, we did not cross-train on the data set.</p>
      </sec>
    </sec>
    <sec sec-type="results">
      <title>Results</title>
      <sec>
        <title>Performance Comparison</title>
        <p><xref ref-type="table" rid="table4">Table 4</xref> shows the top 5 performance results for the 2019 N2C2 OHNLP Track 1 Clinical STS, the value that we obtained during the challenge, and the value obtained by the method presented in this paper. Our current method achieves a good result—the Pearson correlation coefficient value exceeded the best result by 2 percentage points.</p>
        <table-wrap position="float" id="table4">
          <label>Table 4</label>
          <caption>
            <p>Results on the test set for Clinical STS.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Methods</td>
                <td>Pearson correlation coefficient</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Multitask learning, ClinicalBERT</td>
                <td>0.90</td>
              </tr>
              <tr valign="top">
                <td>Multitask learning, BERT</td>
                <td>0.89</td>
              </tr>
              <tr valign="top">
                <td>BERT, XLNet</td>
                <td>0.88</td>
              </tr>
              <tr valign="top">
                <td>BERT</td>
                <td>0.87</td>
              </tr>
              <tr valign="top">
                <td>BERT, XLNet</td>
                <td>0.87</td>
              </tr>
              <tr valign="top">
                <td>Our previous method<sup>a</sup></td>
                <td>0.66</td>
              </tr>
              <tr valign="top">
                <td>Our method in this paper</td>
                <td>0.92</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table4fn1">
              <p><sup>a</sup>Ordered long short-term memory and attention.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
      </sec>
      <sec>
        <title>Data Augmentation</title>
        <p>The EDA method uses text replacement and deletion operations, optimizes the selection range of replacement and deletion, and retains the medical proper nouns in the data set. <xref ref-type="table" rid="table5">Table 5</xref> shows the effect of using EDA on the model performance. After EDA, the size of the medical data set was expanded, and the model's performance was greatly improved.</p>
        <table-wrap position="float" id="table5">
          <label>Table 5</label>
          <caption>
            <p>Comparison between the model with and without EDA.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Methods</td>
                <td>Pearson correlation coefficient</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Without EDA<sup>a</sup></td>
                <td>0.88</td>
              </tr>
              <tr valign="top">
                <td>With EDA</td>
                <td>0.92</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table5fn1">
              <p><sup>a</sup>EDA: easy data augmentation.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
      </sec>
      <sec>
        <title>Semisupervised Learning</title>
        <p>The semisupervised learning method uses the general domain data set STS-B for training to solve the problem of insufficient medical data. <xref ref-type="table" rid="table6">Table 6</xref> shows the effect of using semisupervised learning on the model performance. We can see that semisupervised learning can greatly improve the performance of the model.</p>
        <table-wrap position="float" id="table6">
          <label>Table 6</label>
          <caption>
            <p>Comparison between the model with and without semisupervised learning.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Methods</td>
                <td>Pearson correlation coefficient</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>Without semisupervised learning</td>
                <td>0.87</td>
              </tr>
              <tr valign="top">
                <td>With semisupervised learning</td>
                <td>0.92</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec>
        <title>Self-Ensemble ALBERT</title>
        <p><xref ref-type="table" rid="table7">Table 7</xref> shows the effect of using the self-ensemble method on the model performance. We can see that the efficiency of the model with self-ensemble is better than that of the ordinary ensemble model. Additionally, self-ensemble greatly shortens the training time of the model, reduces the calculation time of the algorithm, and improves the efficiency of the algorithm.</p>
        <p>BERT and ALBERT are pretrained models with the same self-attention structure. As shown in <xref ref-type="table" rid="table8">Table 8</xref>, the performance of ALBERT is better than that of BERT on the Clinical STS data set.</p>
        <table-wrap position="float" id="table7">
          <label>Table 7</label>
          <caption>
            <p>Comparison among the model without ensemble, the model with ensemble, and the model with self-ensemble.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Method</td>
                <td>Pearson correlation coefficient</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>None</td>
                <td>0.85</td>
              </tr>
              <tr valign="top">
                <td>Ensemble<sup>a</sup></td>
                <td>0.89</td>
              </tr>
              <tr valign="top">
                <td>Self-ensemble</td>
                <td>0.92</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table7fn1">
              <p><sup>a</sup>Ensemble represents an ensemble method through multiple ALBERT models.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
        <table-wrap position="float" id="table8">
          <label>Table 8</label>
          <caption>
            <p>Comparison between the ALBERT and BERT models.</p>
          </caption>
          <table border="1" rules="groups" cellpadding="5" frame="hsides" width="1000" cellspacing="0">
            <thead>
              <tr valign="top">
                <td>Methods</td>
                <td>Runtime (minutes)</td>
                <td>Convergence speed<sup>a</sup> (steps)</td>
                <td>Pearson correlation coefficient</td>
              </tr>
            </thead>
            <tbody>
              <tr valign="top">
                <td>BERT</td>
                <td>50</td>
                <td>3300</td>
                <td>0.86</td>
              </tr>
              <tr valign="top">
                <td>ALBERT</td>
                <td>32</td>
                <td>2700</td>
                <td>0.92</td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="table8fn1">
              <p><sup>a</sup>Convergence speed is measured using the training steps.</p>
            </fn>
          </table-wrap-foot>
        </table-wrap>
      </sec>
    </sec>
    <sec sec-type="discussion">
      <title>Discussion</title>
      <sec>
        <title>Overview</title>
        <p>This paper makes the following contributions. First, we used the EDA text data augmentation method. This method increased the amount of data through a series of operations and enriched the semantics of the data. Second, for the problem of insufficient medical data, we used a semisupervised learning method. This method relied on the use of external data to enrich the semantics. Third, to solve the problem of learning complex semantics and the loss of key semantic information, we used the self-ensemble ALBERT model for semantic similarity calculation of clinical text. This method not only improves the results of the semantic similarity calculation of clinical text but also, due to the improvement of the self-ensemble of our model, allows the algorithm to shorten its running time and improve its efficiency. With these techniques, our model obtained a Pearson correlation coefficient of 0.92.</p>
        <p>In order to test the influence of the method on performance, we conducted ablation experiments on EDA, semisupervised learning, and self-ensemble. At the same time, in order to verify the performance of the model, we also performed ablation experiments on ALBERT.</p>
      </sec>
      <sec>
        <title>Conclusions</title>
        <p>Compared with other models and methods, combining an EDA and self-ensemble ALBERT model under semisupervised learning to perform clinical textual semantic similarity calculations can save a large amount of training time and allows more data to be trained at the same time. This brings great convenience for practical applications and scientific research.</p>
        <p>In the future, we will study how to combine reinforcement learning to process natural language to further improve the performance of the model and handle the dilemma of bloated or erroneous content in electronic health records caused by the increasing use of copy and paste.</p>
      </sec>
    </sec>
  </body>
  <back>
    <app-group/>
    <glossary>
      <title>Abbreviations</title>
      <def-list>
        <def-item>
          <term id="abb1">EDA</term>
          <def>
            <p>easy data augmentation</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb2">GLUE</term>
          <def>
            <p>General Language Understanding Evaluation</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb3">OHNLP</term>
          <def>
            <p>Open Health Natural Language Processing</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb4">N2C2</term>
          <def>
            <p>National NLP Clinical Challenges</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb5">NLP</term>
          <def>
            <p>natural language processing</p>
          </def>
        </def-item>
        <def-item>
          <term id="abb6">STS</term>
          <def>
            <p>semantic textual similarity</p>
          </def>
        </def-item>
      </def-list>
    </glossary>
    <ack>
      <p>This work was supported by the National Natural Science Foundation of China under Grant 61463050, Grant 61762091 and Grant 12061088, and the Science Foundation of Yunnan Education Department under Grant 2020Y0011.</p>
    </ack>
    <fn-group>
      <fn fn-type="conflict">
        <p>None declared.</p>
      </fn>
    </fn-group>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Karwatowski</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Russek</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Wielgosz</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Koryciak</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Wiatr</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <article-title>Energy efficient calculations of text similarity measure on FPGA-accelerated computing platforms</article-title>
          <source>Parallel Processing and Applied Mathematics</source>
          <year>2016</year>
          <month>4</month>
          <day>2</day>
          <volume>9573</volume>
          <fpage>31</fpage>
          <lpage>40</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://link.springer.com/chapter/10.1007/978-3-319-32149-3_4"/>
          </comment>
          <pub-id pub-id-type="doi">10.1007/978-3-319-32149-3_4</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Quan</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Lu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Ni</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Wenyin</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>Short text similarity based on probabilistic topics</article-title>
          <source>Knowl Inf Syst</source>
          <year>2009</year>
          <month>9</month>
          <day>17</day>
          <volume>25</volume>
          <issue>3</issue>
          <fpage>473</fpage>
          <lpage>491</lpage>
          <pub-id pub-id-type="doi">10.1007/s10115-009-0250-y</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Song</surname>
              <given-names>W</given-names>
            </name>
            <name name-style="western">
              <surname>Feng</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Gu</surname>
              <given-names>N</given-names>
            </name>
          </person-group>
          <article-title>Question similarity calculation for FAQ answering</article-title>
          <year>2007</year>
          <conf-name>Third International Conference on Semantics Knowledge and Grid (SKG)</conf-name>
          <conf-date>October 29-31</conf-date>
          <conf-loc>Shan Xi, China</conf-loc>
          <fpage>298</fpage>
          <lpage>301</lpage>
          <pub-id pub-id-type="doi">10.1109/skg.2007.247</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Li</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Zhu</surname>
              <given-names>AH</given-names>
            </name>
            <name name-style="western">
              <surname>Su</surname>
              <given-names>T</given-names>
            </name>
          </person-group>
          <article-title>An improved text similarity calculation algorithm based on vsm</article-title>
          <source>AMR</source>
          <year>2011</year>
          <month>4</month>
          <volume>225-226</volume>
          <fpage>1105</fpage>
          <lpage>1108</lpage>
          <pub-id pub-id-type="doi">10.4028/www.scientific.net/amr.225-226.1105</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Du</surname>
              <given-names>B</given-names>
            </name>
          </person-group>
          <article-title>Deep learning for remote sensing data: a technical tutorial on the state of the art</article-title>
          <source>IEEE Geosci Remote Sens Mag</source>
          <year>2016</year>
          <month>6</month>
          <volume>4</volume>
          <issue>2</issue>
          <fpage>22</fpage>
          <lpage>40</lpage>
          <pub-id pub-id-type="doi">10.1109/mgrs.2016.2540798</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Pennington</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Socher</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Manning</surname>
              <given-names>CD</given-names>
            </name>
          </person-group>
          <article-title>GloVe: Global vectors for word representation</article-title>
          <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          <year>2014</year>
          <month>10</month>
          <conf-name>19th Conference on Empirical Methods in Natural Language Processing (EMNLP)</conf-name>
          <conf-date>October 25–29</conf-date>
          <conf-loc>Doha, Qatar</conf-loc>
          <fpage>1532</fpage>
          <lpage>1543</lpage>
          <pub-id pub-id-type="doi">10.3115/v1/d14-1162</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Kusner</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Sun</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Kolkin</surname>
              <given-names>N</given-names>
            </name>
          </person-group>
          <article-title>From word embeddings to document distances</article-title>
          <year>2015</year>
          <conf-name>International Conference on Machine Learning</conf-name>
          <conf-date>July 6-11</conf-date>
          <conf-loc>Lille, France</conf-loc>
          <fpage>957</fpage>
          <lpage>966</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Xiong</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Qin</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Cao</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>Q</given-names>
            </name>
            <name name-style="western">
              <surname>Yan</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Tang</surname>
              <given-names>B</given-names>
            </name>
          </person-group>
          <article-title>Distributed representation and one-hot representation fusion with gated network for clinical semantic textual similarity</article-title>
          <source>BMC Med Inform Decis Mak</source>
          <year>2020</year>
          <month>04</month>
          <day>30</day>
          <volume>20</volume>
          <issue>Suppl 1</issue>
          <fpage>1</fpage>
          <lpage>7</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1045-z"/>
          </comment>
          <pub-id pub-id-type="doi">10.1186/s12911-020-1045-z</pub-id>
          <pub-id pub-id-type="medline">32349764</pub-id>
          <pub-id pub-id-type="pii">10.1186/s12911-020-1045-z</pub-id>
          <pub-id pub-id-type="pmcid">PMC7191689</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref9">
        <label>9</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ritchie</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Welch</surname>
              <given-names>B</given-names>
            </name>
          </person-group>
          <article-title>Categorization of third-party apps in electronic health record app marketplaces: systematic search and analysis</article-title>
          <source>JMIR Med Inform</source>
          <year>2020</year>
          <month>05</month>
          <day>29</day>
          <volume>8</volume>
          <issue>5</issue>
          <fpage>e16980</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://medinform.jmir.org/2020/5/e16980/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/16980</pub-id>
          <pub-id pub-id-type="medline">32469324</pub-id>
          <pub-id pub-id-type="pii">v8i5e16980</pub-id>
          <pub-id pub-id-type="pmcid">PMC7293052</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref10">
        <label>10</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Cer</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Diab</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Agirre</surname>
              <given-names>E</given-names>
            </name>
            <name name-style="western">
              <surname>Lopez-Gazpio</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Specia</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation</article-title>
          <source>Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          <year>2017</year>
          <conf-name>11th International Workshop on Semantic Evaluation</conf-name>
          <conf-date>August 3-4</conf-date>
          <conf-loc>Vancouver, Canada</conf-loc>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <fpage>1</fpage>
          <lpage>14</lpage>
          <pub-id pub-id-type="doi">10.18653/v1/s17-2001</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref11">
        <label>11</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Afzal</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Rastegar-Mojarad</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>L</given-names>
            </name>
          </person-group>
          <article-title>Overview of the BioCreative/OHNLP challenge 2018 task 2: clinical semantic textual similarity</article-title>
          <source>Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics</source>
          <year>2018</year>
          <month>8</month>
          <conf-name>9th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics</conf-name>
          <conf-date>August 29-September 1</conf-date>
          <conf-loc>Washington DC, USA</conf-loc>
          <pub-id pub-id-type="doi">10.1145/3233547.3233672</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref12">
        <label>12</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Fu</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Henry</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Uzuner</surname>
              <given-names>O</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>The 2019 n2c2/OHNLP track on clinical semantic textual similarity: overview</article-title>
          <source>JMIR Med Inform</source>
          <year>2020</year>
          <month>11</month>
          <day>27</day>
          <volume>8</volume>
          <issue>11</issue>
          <fpage>e23375</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://medinform.jmir.org/2020/11/e23375/"/>
          </comment>
          <pub-id pub-id-type="doi">10.2196/23375</pub-id>
          <pub-id pub-id-type="medline">33245291</pub-id>
          <pub-id pub-id-type="pii">v8i11e23375</pub-id>
          <pub-id pub-id-type="pmcid">PMC7732706</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref13">
        <label>13</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Sijia</given-names>
            </name>
            <name name-style="western">
              <surname>Afzal</surname>
              <given-names>Naveed</given-names>
            </name>
            <name name-style="western">
              <surname>Rastegar-Mojarad</surname>
              <given-names>Majid</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Liwei</given-names>
            </name>
            <name name-style="western">
              <surname>Shen</surname>
              <given-names>Feichen</given-names>
            </name>
            <name name-style="western">
              <surname>Kingsbury</surname>
              <given-names>Paul</given-names>
            </name>
            <name name-style="western">
              <surname>Liu</surname>
              <given-names>Hongfang</given-names>
            </name>
          </person-group>
          <article-title>A comparison of word embeddings for the biomedical natural language processing</article-title>
          <source>J Biomed Inform</source>
          <year>2018</year>
          <month>11</month>
          <volume>87</volume>
          <fpage>12</fpage>
          <lpage>20</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://linkinghub.elsevier.com/retrieve/pii/S1532-0464(18)30182-5"/>
          </comment>
          <pub-id pub-id-type="doi">10.1016/j.jbi.2018.09.008</pub-id>
          <pub-id pub-id-type="medline">30217670</pub-id>
          <pub-id pub-id-type="pii">S1532-0464(18)30182-5</pub-id>
          <pub-id pub-id-type="pmcid">PMC6585427</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref14">
        <label>14</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Ma</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Tao</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
          </person-group>
          <article-title>Long short-term memory neural network for traffic speed prediction using remote microwave sensor data</article-title>
          <source>Transportation Research Part C: Emerging Technologies</source>
          <year>2015</year>
          <month>05</month>
          <volume>54</volume>
          <fpage>187</fpage>
          <lpage>197</lpage>
          <pub-id pub-id-type="doi">10.1016/j.trc.2015.03.014</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref15">
        <label>15</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Shin</surname>
              <given-names>H</given-names>
            </name>
            <name name-style="western">
              <surname>Roth</surname>
              <given-names>HR</given-names>
            </name>
            <name name-style="western">
              <surname>Gao</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Lu</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Xu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Nogues</surname>
              <given-names>I</given-names>
            </name>
            <name name-style="western">
              <surname>Yao</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Mollura</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Summers</surname>
              <given-names>RM</given-names>
            </name>
          </person-group>
          <article-title>Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning</article-title>
          <source>IEEE Trans Med Imaging</source>
          <year>2016</year>
          <month>05</month>
          <volume>35</volume>
          <issue>5</issue>
          <fpage>1285</fpage>
          <lpage>98</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="http://europepmc.org/abstract/MED/26886976"/>
          </comment>
          <pub-id pub-id-type="doi">10.1109/TMI.2016.2528162</pub-id>
          <pub-id pub-id-type="medline">26886976</pub-id>
          <pub-id pub-id-type="pmcid">PMC4890616</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref16">
        <label>16</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Zhang</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Zhou</surname>
              <given-names>X</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>A gated dilated convolution with attention model for clinical cloze-style reading comprehension</article-title>
          <source>Int J Environ Res Public Health</source>
          <year>2020</year>
          <month>02</month>
          <day>19</day>
          <volume>17</volume>
          <issue>4</issue>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://www.mdpi.com/resolver?pii=ijerph17041323"/>
          </comment>
          <pub-id pub-id-type="doi">10.3390/ijerph17041323</pub-id>
          <pub-id pub-id-type="medline">32092861</pub-id>
          <pub-id pub-id-type="pii">ijerph17041323</pub-id>
          <pub-id pub-id-type="pmcid">PMC7068278</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref17">
        <label>17</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zhu</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Peng</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Gao</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>A convolutional neural network based on a capsule network with strong generalization for bearing fault diagnosis</article-title>
          <source>Neurocomputing</source>
          <year>2019</year>
          <month>01</month>
          <volume>323</volume>
          <fpage>62</fpage>
          <lpage>75</lpage>
          <pub-id pub-id-type="doi">10.1016/j.neucom.2018.09.050</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref18">
        <label>18</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Vaswani</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Shazeer</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Parmar</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Uszkoreit</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Jones</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Gomez</surname>
              <given-names>AN</given-names>
            </name>
          </person-group>
          <article-title>Attention is all you need</article-title>
          <year>2017</year>
          <conf-name>31st Conference on Neural Information Processing Systems (NIPS 2017)</conf-name>
          <conf-date>December 4-9, 2017</conf-date>
          <conf-loc>Long Beach, CA, USA</conf-loc>
          <fpage>5998</fpage>
          <lpage>6008</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref19">
        <label>19</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Bertinetto</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Valmadre</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Henriques</surname>
              <given-names>JF</given-names>
            </name>
            <name name-style="western">
              <surname>Vedaldi</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Torr</surname>
              <given-names>PHS</given-names>
            </name>
          </person-group>
          <article-title>Fully-convolutional siamese networks for object tracking</article-title>
          <year>2016</year>
          <month>11</month>
          <conf-name>European Conference on Computer Vision</conf-name>
          <conf-date>October 8-16</conf-date>
          <conf-loc>Amsterdam, Netherlands</conf-loc>
          <fpage>850</fpage>
          <lpage>865</lpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://link.springer.com/chapter/10.1007/978-3-319-48881-3_56"/>
          </comment>
          <pub-id pub-id-type="doi">10.1007/978-3-319-48881-3_56</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref20">
        <label>20</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>L</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>Gao</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>X</given-names>
            </name>
          </person-group>
          <article-title>Where-and-when to look: deep siamese attention networks for video-based person re-identification</article-title>
          <source>IEEE Transactions on Multimedia</source>
          <year>2019</year>
          <month>06</month>
          <volume>21</volume>
          <issue>6</issue>
          <fpage>1412</fpage>
          <lpage>1424</lpage>
          <pub-id pub-id-type="doi">10.1109/tmm.2018.2877886</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref21">
        <label>21</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Eisinga</surname>
              <given-names>R</given-names>
            </name>
            <name name-style="western">
              <surname>Grotenhuis</surname>
              <given-names>MT</given-names>
            </name>
            <name name-style="western">
              <surname>Pelzer</surname>
              <given-names>B</given-names>
            </name>
          </person-group>
          <article-title>The reliability of a two-item scale: Pearson, Cronbach, or Spearman-Brown?</article-title>
          <source>Int J Public Health</source>
          <year>2013</year>
          <month>08</month>
          <volume>58</volume>
          <issue>4</issue>
          <fpage>637</fpage>
          <lpage>42</lpage>
          <pub-id pub-id-type="doi">10.1007/s00038-012-0416-3</pub-id>
          <pub-id pub-id-type="medline">23089674</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref22">
        <label>22</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Jung</surname>
              <given-names>Hwejin</given-names>
            </name>
            <name name-style="western">
              <surname>Kim</surname>
              <given-names>Bumsoo</given-names>
            </name>
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>Inyeop</given-names>
            </name>
            <name name-style="western">
              <surname>Lee</surname>
              <given-names>Junhyun</given-names>
            </name>
            <name name-style="western">
              <surname>Kang</surname>
              <given-names>Jaewoo</given-names>
            </name>
          </person-group>
          <article-title>Classification of lung nodules in CT scans using three-dimensional deep convolutional neural networks with a checkpoint ensemble method</article-title>
          <source>BMC Med Imaging</source>
          <year>2018</year>
          <month>12</month>
          <day>03</day>
          <volume>18</volume>
          <issue>1</issue>
          <fpage>48</fpage>
          <comment>
            <ext-link ext-link-type="uri" xlink:type="simple" xlink:href="https://bmcmedimaging.biomedcentral.com/articles/10.1186/s12880-018-0286-0"/>
          </comment>
          <pub-id pub-id-type="doi">10.1186/s12880-018-0286-0</pub-id>
          <pub-id pub-id-type="medline">30509191</pub-id>
          <pub-id pub-id-type="pii">10.1186/s12880-018-0286-0</pub-id>
          <pub-id pub-id-type="pmcid">PMC6276244</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref23">
        <label>23</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Lan</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Goodman</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Gimpel</surname>
              <given-names>K</given-names>
            </name>
            <name name-style="western">
              <surname>Sharma</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Soricut</surname>
              <given-names>R</given-names>
            </name>
          </person-group>
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          <year>2019</year>
          <conf-name>International Conference on Learning Representations</conf-name>
          <conf-date>April 26-30</conf-date>
          <conf-loc>Addis Ababa</conf-loc>
        </nlm-citation>
      </ref>
      <ref id="ref24">
        <label>24</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Huang</surname>
              <given-names>G</given-names>
            </name>
            <name name-style="western">
              <surname>Song</surname>
              <given-names>S</given-names>
            </name>
            <name name-style="western">
              <surname>Gupta</surname>
              <given-names>JND</given-names>
            </name>
            <name name-style="western">
              <surname>Wu</surname>
              <given-names>C</given-names>
            </name>
          </person-group>
          <article-title>Semi-supervised and unsupervised extreme learning machines</article-title>
          <source>IEEE Trans Cybern</source>
          <year>2014</year>
          <month>12</month>
          <volume>44</volume>
          <issue>12</issue>
          <fpage>2405</fpage>
          <lpage>2417</lpage>
          <pub-id pub-id-type="doi">10.1109/TCYB.2014.2307349</pub-id>
          <pub-id pub-id-type="medline">25415946</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Enguehard</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>O'Halloran</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Gholipour</surname>
              <given-names>A</given-names>
            </name>
          </person-group>
          <article-title>Semi-supervised learning with deep embedded clustering for image classification and segmentation</article-title>
          <source>IEEE Access</source>
          <year>2019</year>
          <volume>7</volume>
          <fpage>11093</fpage>
          <lpage>11104</lpage>
          <pub-id pub-id-type="doi">10.1109/access.2019.2891970</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref26">
        <label>26</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Wei</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Zou</surname>
              <given-names>K</given-names>
            </name>
          </person-group>
          <article-title>EDA: Easy data augmentation techniques for boosting performance on text classification tasks</article-title>
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          <year>2019</year>
          <month>11</month>
          <conf-name>EMNLP-IJCNLP 2019</conf-name>
          <conf-date>November 3-7</conf-date>
          <conf-loc>Hong Kong, China</conf-loc>
          <fpage>6382</fpage>
          <lpage>6388</lpage>
          <pub-id pub-id-type="doi">10.18653/v1/d19-1670</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref27">
        <label>27</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>AW</given-names>
            </name>
            <name name-style="western">
              <surname>Dohan</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Luong</surname>
              <given-names>MT</given-names>
            </name>
          </person-group>
          <article-title>QANet: Combining local convolution with global self-attention for reading comprehension</article-title>
          <year>2018</year>
          <conf-name>International Conference on Learning Representations</conf-name>
          <conf-date>April 30-May 3</conf-date>
          <conf-loc>Vancouver, Canada</conf-loc>
        </nlm-citation>
      </ref>
      <ref id="ref28">
        <label>28</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Xie</surname>
              <given-names>Z</given-names>
            </name>
            <name name-style="western">
              <surname>Wang</surname>
              <given-names>SI</given-names>
            </name>
            <name name-style="western">
              <surname>Li</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>Data noising as smoothing in neural network language models</article-title>
          <year>2017</year>
          <conf-name>International Conference on Learning Representations</conf-name>
          <conf-date>April 24-26</conf-date>
          <conf-loc>Toulon, France</conf-loc>
        </nlm-citation>
      </ref>
      <ref id="ref29">
        <label>29</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Hussain</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Cambria</surname>
              <given-names>E</given-names>
            </name>
          </person-group>
          <article-title>Semi-supervised learning for big social data analysis</article-title>
          <source>Neurocomputing</source>
          <year>2018</year>
          <month>01</month>
          <volume>275</volume>
          <fpage>1662</fpage>
          <lpage>1673</lpage>
          <pub-id pub-id-type="doi">10.1016/j.neucom.2017.10.010</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref30">
        <label>30</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Yim</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Joo</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Bae</surname>
              <given-names>J</given-names>
            </name>
            <name name-style="western">
              <surname>Kim</surname>
              <given-names>J</given-names>
            </name>
          </person-group>
          <article-title>A gift from knowledge distillation: fast optimization, network minimization and transfer learning</article-title>
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          <year>2017</year>
          <month>11</month>
          <conf-name>IEEE Conference on Computer Vision and Pattern Recognition</conf-name>
          <conf-date>July 21-26</conf-date>
          <conf-loc>Honolulu, HI, USA</conf-loc>
          <fpage>4133</fpage>
          <lpage>4141</lpage>
          <pub-id pub-id-type="doi">10.1109/cvpr.2017.754</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref31">
        <label>31</label>
        <nlm-citation citation-type="journal">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Pan</surname>
              <given-names>Y</given-names>
            </name>
            <name name-style="western">
              <surname>He</surname>
              <given-names>F</given-names>
            </name>
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>H</given-names>
            </name>
          </person-group>
          <article-title>A novel enhanced collaborative autoencoder with knowledge distillation for top-N recommender systems</article-title>
          <source>Neurocomputing</source>
          <year>2019</year>
          <month>03</month>
          <volume>332</volume>
          <fpage>137</fpage>
          <lpage>148</lpage>
          <pub-id pub-id-type="doi">10.1016/j.neucom.2018.12.025</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref32">
        <label>32</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Williams</surname>
              <given-names>A</given-names>
            </name>
            <name name-style="western">
              <surname>Nangia</surname>
              <given-names>N</given-names>
            </name>
            <name name-style="western">
              <surname>Bowman</surname>
              <given-names>SR</given-names>
            </name>
          </person-group>
          <article-title>A broad-coverage challenge corpus for sentence understanding through Inference</article-title>
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          <year>2018</year>
          <conf-name>2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</conf-name>
          <conf-date>June 1-6</conf-date>
          <conf-loc>New Orleans, Louisiana</conf-loc>
          <fpage>1112</fpage>
          <lpage>1122</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref33">
        <label>33</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Zampieri</surname>
              <given-names>M</given-names>
            </name>
            <name name-style="western">
              <surname>Nakov</surname>
              <given-names>P</given-names>
            </name>
            <name name-style="western">
              <surname>Rosenthal</surname>
              <given-names>S</given-names>
            </name>
          </person-group>
          <article-title>SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020)</article-title>
          <source>Proceedings of the Fourteenth Workshop on Semantic Evaluation</source>
          <year>2020</year>
          <conf-name>The 28th International Conference on Computational Lingustics (COLING-2020)</conf-name>
          <conf-date>September 13-14</conf-date>
          <conf-loc>Barcelona (online)</conf-loc>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <fpage>1425</fpage>
          <lpage>1447</lpage>
        </nlm-citation>
      </ref>
      <ref id="ref34">
        <label>34</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Yu</surname>
              <given-names>HQ</given-names>
            </name>
          </person-group>
          <article-title>Dynamic causality knowledge graph generation for supporting the Chatbot health care system</article-title>
          <source>Proceedings of the Future Technologies Conference (FTC) 2020</source>
          <year>2020</year>
          <conf-name>Future Technologies Conference (FTC) 2020</conf-name>
          <conf-date>October</conf-date>
          <conf-loc>Vancouver, Canada</conf-loc>
          <fpage>30</fpage>
          <lpage>45</lpage>
          <pub-id pub-id-type="doi">10.1007/978-3-030-63092-8_3</pub-id>
        </nlm-citation>
      </ref>
      <ref id="ref35">
        <label>35</label>
        <nlm-citation citation-type="confproc">
          <person-group person-group-type="author">
            <name name-style="western">
              <surname>Li</surname>
              <given-names>D</given-names>
            </name>
            <name name-style="western">
              <surname>Hu</surname>
              <given-names>B</given-names>
            </name>
            <name name-style="western">
              <surname>Chen</surname>
              <given-names>Q</given-names>
            </name>
          </person-group>
          <article-title>Towards medical machine reading comprehension with structural knowledge and plain text</article-title>
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          <year>2020</year>
          <conf-name>2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</conf-name>
          <conf-date>November</conf-date>
          <conf-loc>Online</conf-loc>
          <publisher-name>Association for Computational Linguistics</publisher-name>
          <fpage>1427</fpage>
          <lpage>1438</lpage>
          <pub-id pub-id-type="doi">10.18653/v1/2020.emnlp-main.111</pub-id>
        </nlm-citation>
      </ref>
    </ref-list>
  </back>
</article>
