Adverse Drug Event Discovery Using Biomedical Literature: A Big Data Neural Network Adventure

Background: The study of adverse drug events (ADEs) is a long-standing topic in the medical literature. In recent years, increasing numbers of scientific articles and health-related social media posts have been generated and shared daily, yet they have seen very limited use for ADE study, and little is known about their content with respect to ADEs.

Objective: The aim of this study was to develop a big data analytics strategy that mines the content of scientific articles and health-related Web-based social media to detect and identify ADEs.

Methods: We analyzed two data sources: (1) biomedical articles and (2) health-related social media blog posts. We developed an intelligent and scalable text mining solution on a big data infrastructure composed of Apache Spark, natural language processing, and machine learning. This was combined with an Elasticsearch NoSQL distributed database to explore and visualize ADEs.

Results: The accuracy, precision, recall, and area under the receiver operating characteristic curve of the system were 92.7%, 93.6%, 93.0%, and 0.905, respectively, showing better results than traditional approaches in the literature. This work not only detected and classified ADE sentences from big data biomedical literature but also scientifically visualized ADE interactions.

Conclusions: To the best of our knowledge, this work is the first to investigate a big data machine learning strategy for ADE discovery on massive datasets downloaded from PubMed Central and social media. This contribution illustrates possible capacities in big data biomedical text analysis using advanced computational methods, with real-time updates from new data published on a daily basis.


Score(article) = Impact Factor(journal) + Eigenfactor(journal) + SJR(journal)    (1)

Each individual ranking index above comes in a different range. For the articles we downloaded from PubMed, the journals' Impact Factors range from 0 to 137, their Eigenfactors from 0 to 1.813, and their SJR scores from 0 to 9.92. We therefore first normalize (Equation 2) to adjust the values measured on these different scales to a common 0-1 range.
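Equation 2 itself is not preserved in the extracted text; a standard min-max rescaling, consistent with the stated 0-1 range (an assumption rather than the paper's confirmed formula), would be

x_norm = (x - min(x)) / (max(x) - min(x))    (2)

where x is the set of values of one index (e.g., the SJR scores of all journals in the collection).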
Next, we find the average value of every scale individually and calculate the average score as:

avg(score) = avg(Impact Factor) + avg(Eigenfactor) + avg(SJR)

Once we have calculated the average score, we select only the articles whose associated journal's score is above the average. Examples of journals with above-average scores include Genome Biology [17], Journal of Clinical Oncology [18], and Nature Chemical Biology [19], to name just a few. Regarding social media, we use only sources that are sound and well-founded in health-related domains, such as MedHelp [20], Patient [21], and WebMD [22].
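The scoring and filtering step described above can be sketched in a few lines of Python; this is a minimal illustration that assumes min-max normalization for Equation 2, and the journal names and metric values below are hypothetical rather than taken from the paper.

# Minimal sketch of journal scoring and filtering (Equation 1 plus assumed
# min-max normalization); journal names and metric values are made up.
journals = {
    "Journal A": {"impact_factor": 12.1, "eigenfactor": 0.45, "sjr": 6.3},
    "Journal B": {"impact_factor": 2.4, "eigenfactor": 0.02, "sjr": 0.8},
    "Journal C": {"impact_factor": 55.0, "eigenfactor": 1.10, "sjr": 9.1},
}
metrics = ["impact_factor", "eigenfactor", "sjr"]

def min_max_normalize(values):
    # Rescale a list of values to the 0-1 range (assumed Equation 2).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

names = list(journals)
normalized = {m: min_max_normalize([journals[n][m] for n in names]) for m in metrics}
# Equation 1: sum the normalized indexes to obtain a journal score.
scores = {n: sum(normalized[m][i] for m in metrics) for i, n in enumerate(names)}

# Keep only journals (and hence articles) whose score exceeds the average score.
avg_score = sum(scores.values()) / len(scores)
selected = [n for n, s in scores.items() if s > avg_score]
print(selected)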

A.3 Word2vec neural network model
The word2vec algorithm is an important technique in neural network language modeling, and it comes in two different learning strategies: (1) continuous bag-of-words (CBOW) and (2) skip-gram. CBOW predicts a target word given its context, whereas the skip-gram strategy predicts the target context given a word [1][2]. Both word2vec learning strategies are originally shallow neural networks; however, the representations acquired from these models can be used in various applications of deep learning.
Using the skip-gram learning algorithm, the target word is at the input layer and the context words are at the output layer. In our proposed method, we developed an extended version of word2vec, namely sentence2vec, employing the skip-gram model, which is able to produce more accurate results on large-scale datasets [3]. Before delving into the details of the word2vec skip-gram and sentence2vec models, we first explain the word2vec vector representation of words across a corpus. Figure A.2 shows an example of the word2vec vector representation for five words, W1 to W5, across three sentences. Window size is one of the word2vec internal parameters that defines the context window; in this example, we used a window size of 2, meaning that the vector of word "W1" is directly affected by the words "W2" and "W5", and "W2" is directly affected by the four words "W1", "W3", "W4", and "W5". In a very similar way, the vector of "W5" is affected by the words "W1", "W2", and "W4", as shown in Figure A.2. For CBOW, we are predicting P(W(t) | W(t-1), W(t+1)), and for skip-gram, P(W(t-1), W(t+1) | W(t)).
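The context-window behavior and the CBOW versus skip-gram choice can be reproduced with an off-the-shelf word2vec implementation. The following sketch uses the Gensim library, which the paper does not name, so the library choice and the toy W1-W5 sentences are illustrative assumptions only.

# Minimal Gensim word2vec sketch; library choice and toy sentences are
# illustrative assumptions, not the paper's actual pipeline.
from gensim.models import Word2Vec

sentences = [
    ["W1", "W2", "W5"],
    ["W2", "W3", "W4"],
    ["W1", "W4", "W5"],
]

# sg=1 selects the skip-gram strategy (sg=0 would be CBOW);
# window=2 matches the context window size used in the example above.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1, epochs=50)

print(model.wv["W1"][:5])           # learned vector for W1 (first 5 dimensions)
print(model.wv.most_similar("W1"))  # words whose contexts resemble W1's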
Using skip-gram as a feedforward neural network, words are read into the vector one at a time and scanned back and forth within a certain range as N-grams. An N-gram is a contiguous sequence of N terms from a given sentence [4][5], and depending on N it is a uni-gram, bi-gram, tri-gram, or four-gram. The N-grams are then fed into a neural network to assess the significance of a given word vector. Skip-gram is able to predict the surrounding words given the current word, and its training complexity is proportional to

Q = C × (D + D × log2(V)),

where C is the maximum distance of the words, D is the dimensionality of the word representations, and V is the size of the vocabulary. Thus, for each training word, we randomly select a number R in the range <1; C> and employ R words from the history and R words from the future of the chosen word. The hidden layer output h is then defined as in equation (5),

h = W^T x = W_(k,·)^T := v_{w_I}^T,    (5)

such that it copies the k-th row of W to h. Here, v_{w_I} is the vector representation of the input word w_I, and this shows that the activation function of the hidden layer units is linear [2,6]. From the hidden layer to the output layer there is a different weight matrix W', an N × V matrix with entries w'_{ij}. Employing these weights, the score u_j for every word in the vocabulary can be estimated as

u_j = v'_{w_j}^T h,

where v'_{w_j} is the j-th column of the matrix W'.

In almost all classification problems we take advantage of the condition that the classes are mutually exclusive. Considering an activation function (e.g., hard limit, saturating linear, log-sigmoid) in the output layer, an adequate neural network architecture for such a requirement is a max-layer output, which generates a probability of 1 for the maximum output of the previous layer and a probability of 0 for the rest of the output nodes. The problem is that such an output layer is not differentiable and is therefore very challenging to train. Using the softmax function as the activation function in the output layer works nearly the same as a max layer, but it is differentiable and can be trained by gradient descent [7][8][9][10]. The softmax function increases the probability of the maximum value of the previous layer relative to the other values [1,2,6].
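To make equation (5), the scores u_j, and the softmax step concrete, the following NumPy sketch runs one skip-gram forward pass: a one-hot input copies the k-th row of W into the hidden vector h, the hidden-to-output matrix W' produces the scores u_j, and softmax converts them into probabilities. The matrix sizes and random values are arbitrary illustrations, not taken from the paper.

# Minimal sketch of one skip-gram forward pass (illustrative sizes and values).
import numpy as np

V, N = 5, 3                         # vocabulary size and embedding dimensionality
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input-to-hidden weights (rows are word vectors)
W_prime = rng.normal(size=(N, V))   # hidden-to-output weights W'

k = 2                               # index of the input word w_I
x = np.zeros(V)
x[k] = 1.0                          # one-hot input vector

h = W.T @ x                         # equation (5): copies the k-th row of W, h = v_{w_I}
u = W_prime.T @ h                   # scores u_j = v'_{w_j}^T h for every word j

def softmax(z):
    # Differentiable alternative to a hard max-layer output.
    e = np.exp(z - z.max())
    return e / e.sum()

y = softmax(u)                      # output probabilities over the vocabulary
print(y, y.sum())                   # the probabilities sum to 1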
Hence, for the discussed neural network architecture, the softmax function is used as the activation function in the output layer. In the output layer, every output is calculated using the same hidden-to-output weight matrix, as in equation (7):

p(w_{c,j} = w_{O,c} | w_I) = y_{c,j} = exp(u_{c,j}) / Σ_{j'=1}^{V} exp(u_{j'}),    (7)

where w_{c,j} is the j-th word on the c-th panel of the output layer, w_{O,c} is the actual c-th word among the output context words, w_I is the input word, and y_{c,j} is the output of the j-th unit on the c-th panel. The calculation of the weights from the hidden layer to the output layer is also shown in equation (10), and the softmax outputs for Y_11, Y_12, and Y_13 are given by equations (11), (12), and (13). As we discussed earlier, we are interested in minimizing the error. Therefore, in this example (Figure A.4), the loss function is defined by a conditional probability as in equation (14):

E = -log p(w_{O,1}, w_{O,2}, ..., w_{O,C} | w_I)    (14)
Generalizing the equation, with the total number of context windows equal to C and the vocabulary size equal to J, the loss becomes

E = -Σ_{c=1}^{C} Σ_{j=1}^{J} Z_{cj} log(Y_{cj}),

and we can shorten it to

E = -Σ_{c=1}^{C} log(Y_{c,j_c*}),

where j_c* is the index of the actual output word in the c-th context window. Taking the derivative with respect to the net input Net(O_{cj}) of each output unit gives

∂E / ∂Net(O_{cj}) = Y_{cj} - Z_{cj},

where Y_{cj} is the output score of the j-th word of the c-th context window. If the j-th word of the c-th context window is the actual output word, then Z_{cj} = 1; otherwise, Z_{cj} = 0.
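The error term Y_cj - Z_cj can be checked numerically. The sketch below reuses the softmax function from the previous snippet and computes the output-layer error for C = 2 context windows; the scores, vocabulary size, and target indices are illustrative assumptions.

# Minimal numeric check of the output-layer error Y - Z (illustrative values only).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

J, C = 5, 2                               # vocabulary size and number of context windows
u = np.array([0.2, 1.5, -0.3, 0.7, 0.1])  # shared scores u_j coming from the hidden layer
Y = np.tile(softmax(u), (C, 1))           # Y[c, j]: softmax output of word j in window c

targets = [1, 3]                          # index of the actual output word per context window
Z = np.zeros((C, J))
for c, j_star in enumerate(targets):
    Z[c, j_star] = 1.0                    # Z[c, j] = 1 only for the actual output word

error = Y - Z                             # dE/dNet(O_cj) = Y_cj - Z_cj
print(error)

# Loss E = -sum_c log(Y[c, j_c*]); minimizing it pushes Y toward Z.
E = -sum(np.log(Y[c, j]) for c, j in enumerate(targets))
print(E)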
A customized version of the word2vec skip-gram algorithm, namely "sentence2vec", was developed and used for the current study. In this model, the class of the sentence (e.g., ADE or No-ADE) is concatenated to the sentence's word list, so that the model can make accurate guesses about a sentence's meaning based on its past appearances in the corpus.
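One minimal way to realize this idea, assuming that concatenating the class to the word list means prepending a class token to every labeled training sentence, is sketched below with Gensim's skip-gram word2vec; the label tokens, example sentences, and hyperparameters are made up for illustration and are not the paper's actual configuration.

# Minimal sketch of the sentence2vec idea: prepend the class label (ADE / NO_ADE)
# as an extra token so it is trained alongside the sentence's words.
# Library choice (Gensim), sentences, and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

labeled_sentences = [
    ("ADE", "patient developed severe rash after taking drug_x".split()),
    ("NO_ADE", "drug_x was well tolerated during the trial".split()),
    ("ADE", "nausea and dizziness reported following drug_y".split()),
]

# Concatenate the class of the sentence to its word list.
corpus = [[label] + words for label, words in labeled_sentences]

model = Word2Vec(corpus, vector_size=50, window=2, sg=1, min_count=1, epochs=100)

# Words that co-occur with the ADE token end up close to its vector.
print(model.wv.most_similar("ADE"))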