Identification of High-Order Single-Nucleotide Polymorphism Barcodes in Breast Cancer Using a Hybrid Taguchi-Genetic Algorithm: Case-Control Study

Background: Breast cancer has a major disease burden in the female population, and it is a highly genome-associated human disease. However, in genetic studies of complex diseases, modern geneticists face challenges in detecting interactions among loci.
Objective: This study aimed to investigate whether variations of single-nucleotide polymorphisms (SNPs) are associated with histopathological tumor characteristics in breast cancer patients.
Methods: A hybrid Taguchi-genetic algorithm (HTGA) was proposed to identify the high-order SNP barcodes in a breast cancer case-control study. A Taguchi method was used to enhance a genetic algorithm (GA) for identifying high-order SNP barcodes. The Taguchi method was integrated into the GA after the crossover operations in order to optimize the generated offspring systematically for enhancing the GA search ability.
Results: The proposed HTGA effectively converged to a promising region within the problem space and provided excellent SNP barcode identification. Regression analysis was used to validate the association between breast cancer and the identified high-order SNP barcodes. The maximum OR was less than 1 (range 0.870-0.755) for two- to seven-order SNP barcodes.
Conclusions: We systematically evaluated the interaction effects of 26 SNPs within growth factor–related genes for breast carcinogenesis pathways. The HTGA could successfully identify relevant high-order SNP barcodes by evaluating the differences between cases and controls. The validation results showed that the HTGA can provide better fitness values as compared with other methods for the identification of high-order SNP barcodes using breast cancer case-control data sets.


Introduction
Breast cancer has a major disease burden in the female population, with a growing incidence recently [1,2]. Previously, several interpretations of associations between breast cancer and tumor characteristics [3][4][5], single-nucleotide polymorphisms (SNPs) [6][7][8], clinicopathological factors [9], and biomarkers [10] revealed relevant association effects between these factors and the risk of cancer. Previous studies also indicated that genomic variation could contribute to the tumorigenicity process in breast cancer [11][12][13][14]. Thus, effective approaches for breast cancer estimation are required.
SNPs are crucial genetic variants in genomic association analyses involving leukemia [15], cancers [16], and other diseases [17][18][19]. Numerous SNPs cannot be excluded from analyses as no relevant differences between cases and controls can be found through conventional methods. Some SNPs may have relevant associations with other SNPs, and these associations are referred to as SNP barcodes. Consequently, the detection of SNP barcodes is vital for association analyses of diseases and cancers [20][21][22][23].
An SNP barcode consists of SNPs, and each SNP includes three genotypes. The large space of suitable SNP barcode combinations complicates the statistical evaluation and identification of relevant SNP barcodes. Evolutionary algorithms have been proposed to facilitate statistical identification of SNP barcodes, and a genetic algorithm (GA) is one of the most frequently used algorithms in genomic studies [24,25]. A GA is an effective approach in the identification of relevant genetic associations for various diseases through the use of more efficient search abilities to enhance population diversity [26]. The crossover and local search operations in a GA can reduce the probability of the same vector being identified between two selected SNPs, and hence, they can improve the search ability of this algorithm.
Breast cancer is a major health issue, and machine learning algorithms are frequently employed to detect the complex genomic associations in breast cancer studies. Although previous machine learning approaches could effectively identify SNP associations in genomic studies, the detection rate of SNP barcodes remains challenging for high-order SNP barcodes. Thus, we proposed a hybrid Taguchi-genetic algorithm (HTGA) for high-order SNP barcode identification in a breast cancer case-control study.

Genetic Algorithm
A GA is a machine learning algorithm inspired by biological evolutionary processes [27]. The first GA operation is population initialization, in which solutions are produced over the solution space; these initial solutions are designated as parents. In the population, two parents are selected according to their fitness values for the crossover operators. Crossover operators generate offspring by combining the chromosomal matter from the two parents. Mutation operations increase population diversity through localized change, and the replacement step eliminates inferior chromosomes from the population and retains good offspring. Thus, the good factors within the population can be passed on to the next generation. The aforementioned operations and population replacement are repeated until the stopping criterion is satisfied.
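The generic GA cycle described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions: it uses a bit-string chromosome, binary tournament selection, one-point crossover, bit-flip mutation, and elitist replacement, and the function name run_ga and its parameters are hypothetical rather than taken from the study.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=50, p_c=0.9, p_m=0.05, rng=random):
    """Toy GA on bit strings (n_genes >= 2); returns the best chromosome and its fitness history."""
    # Population initialization: random solutions over the solution space.
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            # Selection: binary tournament, the fitter of two random parents wins.
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            c1, c2 = list(p1), list(p2)
            # Crossover: one-point exchange of chromosomal matter (assumed operator).
            if rng.random() < p_c:
                cut = rng.randint(1, n_genes - 1)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            # Mutation: localized change with probability p_m per gene.
            for child in (c1, c2):
                for i in range(n_genes):
                    if rng.random() < p_m:
                        child[i] = 1 - child[i]
                offspring.append(child)
        # Replacement: keep the best pop_size chromosomes among parents and offspring.
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(pop[0]))
    return pop[0], history
```

Because the replacement step keeps the best chromosomes among parents and offspring, the best fitness in this sketch cannot decrease from one generation to the next.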

Taguchi Method
The methods proposed by Taguchi et al [28] are based on a statistical experimental design to improve the evaluation and performance of products, process conditions, and parameter settings. Taguchi methods primarily rely on orthogonal arrays (OAs) and the signal-to-noise ratio (SNR). An OA is a fractional factorial matrix that provides a comprehensive analysis of interactions among all design factors. This matrix ensures a proportionate comparison of levels for all factors. A two-level OA can be defined as $L_n(2^{n-1})$, where $n = 2^k$ is the number of experimental runs, $k$ (>1) is a positive integer, base 2 represents two levels for each design parameter, and $n-1$ is the number of columns in the OA. "L" represents "Latin," because the OA experimental design concept is associated with the Latin square. An example of an OA is shown in Table 1.

Table 1. Example of a two-level orthogonal array, $L_8(2^7)$, with factors A to G.

Experiment  A  B  C  D  E  F  G
1           1  1  1  1  1  1  1
2           1  1  1  2  2  2  2
3           1  2  2  1  1  2  2
4           1  2  2  2  2  1  1
5           2  1  2  1  2  1  2
6           2  1  2  2  1  2  1
7           2  2  1  1  2  2  1
8           2  2  1  2  1  1  2

SNR (η) is used as the selection quality characteristic in the field of communications engineering; it can be used to optimize the parameters for a target. Taguchi methods can classify the parameter design problem into several categories according to the problem. Both smaller-the-better and larger-the-better SNR types are used. Considering the set of characteristics $y_1, y_2, \ldots, y_n$, in the smaller-the-better case, the SNR can be determined using the following equation:

$$\eta = -10 \log_{10}\left(\frac{1}{n}\sum_{i=1}^{n} y_i^2\right) \qquad (1)$$

In the larger-the-better case, the SNR can be determined using the following equation:

$$\eta = -10 \log_{10}\left(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{y_i^2}\right) \qquad (2)$$

The SNR evaluates the robustness of the levels of each design parameter. A high-quality result can be achieved for a particular target by controlling the parameters at a particular level with a high SNR value.
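As a minimal sketch, the two SNR types can be computed as follows, assuming the standard Taguchi forms η = −10 log10((1/n) Σ y_i²) for smaller-the-better and η = −10 log10((1/n) Σ 1/y_i²) for larger-the-better. The sketch also encodes the standard two-level L8(2^7) array and checks its defining balance property. All function names are hypothetical.

```python
import math

def snr_smaller_the_better(ys):
    """eta = -10 * log10(mean(y_i^2)); a higher eta means smaller responses."""
    return -10.0 * math.log10(sum(y * y for y in ys) / len(ys))

def snr_larger_the_better(ys):
    """eta = -10 * log10(mean(1 / y_i^2)); requires nonzero responses."""
    return -10.0 * math.log10(sum(1.0 / (y * y) for y in ys) / len(ys))

# Standard two-level L8(2^7) orthogonal array: 8 runs (rows), 7 factors A-G (columns).
L8 = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 2, 2, 2, 2),
    (1, 2, 2, 1, 1, 2, 2),
    (1, 2, 2, 2, 2, 1, 1),
    (2, 1, 2, 1, 2, 1, 2),
    (2, 1, 2, 2, 1, 2, 1),
    (2, 2, 1, 1, 2, 2, 1),
    (2, 2, 1, 2, 1, 1, 2),
]

def is_orthogonal(oa):
    """True if every pair of columns contains each of the 4 level pairs equally often."""
    n_cols = len(oa[0])
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            counts = {}
            for row in oa:
                key = (row[a], row[b])
                counts[key] = counts.get(key, 0) + 1
            if len(counts) != 4 or len(set(counts.values())) != 1:
                return False
    return True
```

The balance check illustrates why an OA allows a proportionate comparison of levels: for any two factors, each combination of levels is tested in the same number of runs.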

Hybrid Taguchi-Genetic Algorithm
In the HTGA, a Taguchi method is inserted between the GA crossover and mutation operations. Figure 1 depicts a flowchart of the HTGA approach, which includes the 17 steps described below. The pseudocode of the HTGA is shown in Textbox 1.

HTGA Procedure
The procedure involves the following 17 steps: (1) Population initialization, execute the algorithm and generate an initial population; (2) Fitness value evaluations, evaluate the population's fitness values; (3) Selection operation, select candidates using the tournament approach; (4) Crossover operation, the probability of crossover is determined by the crossover rate p c ; (5) Select a suitable two-level orthogonal array for the experiment;

Encoding Schemes and Population Initialization
In the proposed GA, a suitable solution to a problem is denoted as chromosome $C = \{c_1, c_2, \ldots, c_n\}$, and the encoding scheme aims to design suitable elements in a chromosome. In the SNP barcode problem, the elements in a chromosome include (1) the indexes of the selected SNPs in the data set and (2) the genotypes of these selected SNPs. Thus, a chromosome $C_i$ is expressed as shown in equation 3.

$$C_i = \{SNP_{i,1}, SNP_{i,2}, \ldots, SNP_{i,n/2}, Genotype_{i,n/2+1}, Genotype_{i,n/2+2}, \ldots, Genotype_{i,n}\} \qquad (3)$$

where $i = 1, 2, \ldots, m$, and $m$ is the population size. $SNP_{i,s}$, where $s = 1, 2, \ldots, n/2$, is a selected SNP dimension in which no SNP is repeated, and $n$ is the chromosome length, so that the SNP barcode order is $n/2$. $Genotype_{i,g}$ takes one of the three possible genotypes of the corresponding selected SNP, where $g = n/2 + 1, n/2 + 2, \ldots, n$ is the selected genotype dimension. In the population initialization, all chromosomes are stochastically generated according to the encoding schemes.
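The encoding and random initialization can be sketched as follows. This is a sketch under assumptions: SNPs are referenced by zero-based index, the three genotypes are coded as 1, 2, and 3, and the names GENOTYPES, random_chromosome, and init_population are hypothetical.

```python
import random

GENOTYPES = (1, 2, 3)  # assumed integer coding of the three genotypes of an SNP

def random_chromosome(num_snps, order, rng=random):
    """Return [snp_1, ..., snp_order, genotype_1, ..., genotype_order].

    The first half holds distinct SNP indexes; the second half holds the
    genotype chosen for the SNP at the matching position in the first half.
    """
    snps = rng.sample(range(num_snps), order)          # sampling without repetition
    genotypes = [rng.choice(GENOTYPES) for _ in snps]  # one genotype per selected SNP
    return snps + genotypes

def init_population(pop_size, num_snps, order, rng=random):
    """Stochastically generate pop_size chromosomes of length 2 * order."""
    return [random_chromosome(num_snps, order, rng) for _ in range(pop_size)]
```

For example, with 26 SNPs and a 3-SNP barcode, each chromosome has six elements: three SNP indexes followed by their three genotypes.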

Fitness Function Evaluation
The aim of SNP barcode identification is to detect relevant differences between cases and controls. To optimize the protective effect of the SNP combination, a fitness function is required for comparing cases and controls. A high difference between cases and controls indicates a high probability of detecting relevant SNP barcodes. In the proposed GA, a chromosome is measured by the fitness function shown in equation 4.
$$F(C_i) = \mathrm{number}(control \cap C_i) - \mathrm{number}(case \cap C_i) \qquad (4)$$

where number is the total number of elements in a set, control denotes the controls, case denotes the cases, and $C_i$ is the ith chromosome. Thus, the number of intersections between the ith chromosome and the controls is calculated by number(control∩$C_i$), and the number of intersections between the ith chromosome and the cases is calculated by number(case∩$C_i$). Thereafter, we calculate the difference between number(control∩$C_i$) and number(case∩$C_i$) as the fitness value at $C_i$.
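A minimal sketch of this fitness evaluation is given below. It assumes that each sample is a list of genotype codes indexed by SNP and that a chromosome stores SNP indexes in its first half and the matching genotypes in its second half. The helper names matches and fitness are hypothetical.

```python
def matches(chromosome, sample):
    """True if the sample carries every (SNP, genotype) pair of the barcode."""
    half = len(chromosome) // 2
    snps, genotypes = chromosome[:half], chromosome[half:]
    return all(sample[snp] == gt for snp, gt in zip(snps, genotypes))

def fitness(chromosome, cases, controls):
    """Number of controls carrying the barcode minus number of cases carrying it."""
    in_controls = sum(1 for s in controls if matches(chromosome, s))
    in_cases = sum(1 for s in cases if matches(chromosome, s))
    return in_controls - in_cases
```

A barcode that is more frequent among controls than among cases therefore receives a positive fitness value, which corresponds to a protective association.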

Selection Operation
In the selection operation, a random tournament selection scheme is used to pick each pair of parents from the population [29]. In tournament selection, two chromosomes are randomly selected to compare their individual fitness values. The chromosomes with better fitness values are inserted into the mating pool. According to the mechanism of tournament selection, the probability that the average fitness value of solutions in the mating pool is better than the average fitness value of the parent population is high. Chromosomes in the mating pool are selected for the crossover operation and used to produce offspring. Textbox 2 provides the pseudocode of tournament selection. The selection operation is repeatedly executed until the maximum mating pool size is achieved.
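The tournament scheme described above can be sketched as follows. The sketch assumes a binary tournament between two distinct, randomly chosen chromosomes, with ties resolved in favor of the first one; the function name and parameters are hypothetical and do not reproduce Textbox 2.

```python
import random

def tournament_selection(population, fitness_values, pool_size, rng=random):
    """Fill a mating pool by repeated binary tournaments (higher fitness wins)."""
    pool = []
    while len(pool) < pool_size:
        a, b = rng.sample(range(len(population)), 2)  # two distinct competitors
        winner = a if fitness_values[a] >= fitness_values[b] else b
        pool.append(population[winner])
    return pool
```

Because the chromosome with the lowest fitness value loses every tournament it enters, the average fitness of the mating pool tends to exceed that of the parent population.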

Crossover Operation
After the selection operation, the crossover operation is implemented to create high-performing individuals. Two chromosomes are sequentially selected from the mating pool as a pair of parents, and then, the crossover operation is executed on them. The crossover operation uses a uniform crossover: a random bit (0 or 1) is generated for each crossover point, and for 1, the point is swapped between the two parents; otherwise, it is not swapped. The encoding scheme establishes a single point as an SNP locus at position $j$ together with its corresponding genotype locus at position $n/2 + j$, where $j = 1, 2, \ldots, n/2$ is the index in the chromosome and $n$ is the chromosome length (twice the SNP barcode order). Therefore, $n/2$ bits are randomly generated, and for each bit equal to 1, both the $j$th SNP locus and its genotype locus at position $n/2 + j$ are swapped between the parents.
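The paired uniform crossover can be sketched as follows, assuming that a chromosome stores its SNP indexes in the first half and the matching genotypes in the second half. The rule that skips a swap when it would repeat an SNP within a child is an added assumption to keep SNPs unrepeated; it is not described in the text, and the function name is hypothetical.

```python
import random

def uniform_crossover(parent1, parent2, rng=random):
    """Swap (SNP, genotype) pairs between two parents according to a random bit mask."""
    half = len(parent1) // 2
    child1, child2 = list(parent1), list(parent2)
    for j in range(half):
        if rng.randint(0, 1) == 0:
            continue  # mask bit 0: this point is not swapped
        # Assumed repair rule: skip a swap that would repeat an SNP in a child.
        others1 = child1[:j] + child1[j + 1:half]
        others2 = child2[:j] + child2[j + 1:half]
        if child2[j] in others1 or child1[j] in others2:
            continue
        # Mask bit 1: swap the SNP locus and its genotype locus together.
        for pos in (j, half + j):
            child1[pos], child2[pos] = child2[pos], child1[pos]
    return child1, child2
```

Swapping the SNP and its genotype as one unit keeps every genotype attached to the SNP it describes.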

Taguchi Operation
An orthogonal array accommodates $Q$ design factors, each with two levels. An orthogonal array $L_n(2^{n-1})$ has $n-1$ columns and $n$ individual experiments corresponding to $n$ rows, where $n = 2^k$ and $Q \le n-1$; $k$ is an integer >1 that is used for adjusting the number of experimental runs.
The SNR (η) is the mean square deviation of the fitness function.
Let $y_i$ be the function evaluation value of experiment $i = 1, 2, 3, \ldots, n$, where $n$ is the number of experiments. In the case of a fitness function that is maximized (larger-the-better), let $\eta_i = -(y_i)^2$ when $y_i$ is negative and $\eta_i = (y_i)^2$ otherwise. The effect of factor $f$ at level $l$ is defined as follows:

$$E_{fl} = \sum_{i \,:\, \text{factor } f \text{ is at level } l \text{ in experiment } i} \eta_i \qquad (5)$$

where $i$ is the experiment number, $f$ is the factor name, and $l$ is the level number.
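One way to use these factor effects is sketched below. It assumes that level 1 of a factor means "take this gene from parent 1" and level 2 means "take it from parent 2", that η_i is the signed square of the fitness value, and that the better level of each factor is the one with the larger summed effect. These conventions, the L8 array, and the function names are assumptions for illustration; they are not taken from Textbox 1.

```python
# Standard two-level L8(2^7) orthogonal array: 8 experiments, up to 7 factors.
L8 = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 2, 2, 2, 2),
    (1, 2, 2, 1, 1, 2, 2),
    (1, 2, 2, 2, 2, 1, 1),
    (2, 1, 2, 1, 2, 1, 2),
    (2, 1, 2, 2, 1, 2, 1),
    (2, 2, 1, 1, 2, 2, 1),
    (2, 2, 1, 2, 1, 1, 2),
]

def signed_square(y):
    """eta_i = y^2 for nonnegative y and -(y^2) for negative y."""
    return y * y if y >= 0 else -(y * y)

def taguchi_offspring(parent1, parent2, evaluate, oa=L8):
    """Build one offspring by choosing, per factor, the parent with the larger effect."""
    q = len(parent1)
    if q > len(oa[0]):
        raise ValueError("more factors than columns in the orthogonal array")
    parents = (parent1, parent2)
    effects = [[0.0, 0.0] for _ in range(q)]  # effects[f][l-1] accumulates E_fl
    for row in oa:
        # Experiment: factor f is taken from parent 1 (level 1) or parent 2 (level 2).
        trial = [parents[row[f] - 1][f] for f in range(q)]
        eta = signed_square(evaluate(trial))
        for f in range(q):
            effects[f][row[f] - 1] += eta
    best = [0 if e[0] >= e[1] else 1 for e in effects]
    return [parents[best[f]][f] for f in range(q)]
```

With seven factors, the sketch needs only 8 fitness evaluations instead of the 128 required to test every combination of parental genes.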

Mutation Operation
The mutation operation aims to prevent the population from falling into local optima. Each element of each offspring has a chance to undergo mutation. For each element, a random number in (0, 1) is generated; if the number is less than the mutation probability $p_m$ at the $i$th element of an offspring, the $i$th element is replaced by a randomly generated valid value.
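A minimal sketch of this element-wise mutation is given below. It assumes the encoding in which the first half of a chromosome holds SNP indexes and the second half holds genotypes coded as 1, 2, and 3; drawing a replacement SNP only from SNPs not already in the chromosome is an added assumption that keeps SNPs unrepeated. The names are hypothetical.

```python
import random

GENOTYPES = (1, 2, 3)  # assumed integer coding of the three genotypes

def mutate(chromosome, num_snps, p_m, rng=random):
    """Return a mutated copy; each element mutates independently with probability p_m."""
    half = len(chromosome) // 2
    child = list(chromosome)
    for i in range(len(child)):
        if rng.random() >= p_m:
            continue  # this element is left unchanged
        if i < half:
            # SNP position: draw an SNP index not already used in the chromosome.
            unused = [s for s in range(num_snps) if s not in child[:half]]
            if unused:
                child[i] = rng.choice(unused)
        else:
            # Genotype position: draw any of the three genotypes.
            child[i] = rng.choice(GENOTYPES)
    return child
```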

Replacement Operation
The replacement operation replaces the weakest individuals in the population with fitter ones. After the completion of the aforementioned operations, the offspring are added to the population, and then, all the parents and offspring are ranked based on their fitness values. Subsequently, the top p chromosomes are selected as the new population for the next generation, where p is the population size.
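The ranking-based replacement can be sketched as follows, assuming that a higher fitness value is better; the function name and the fitness_fn callable are hypothetical.

```python
def replace_population(parents, offspring, fitness_fn, pop_size):
    """Rank parents and offspring together and keep the top pop_size chromosomes."""
    combined = parents + offspring
    combined.sort(key=fitness_fn, reverse=True)  # best fitness first
    return combined[:pop_size]
```

Since the best parent always competes with the offspring, the best fitness value in the population cannot decrease between generations.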

Termination Condition
The HTGA operation is repeated in successive iterations until the stopping criterion is met. In this study, a maximum number of iterations was used to terminate HTGA operations.

Parameter Setting
This study compared the search effectiveness of the HTGA with that of standard GA, particle swarm optimization (PSO) [30], and chaotic PSO (CPSO) [31] methods. PSO is a swarm intelligence algorithm that simulates the social behavior of organisms. In PSO, each individual represents a particle and considers a potential solution in the swarm population. In CPSO, chaotic theory is incorporated into PSO to increase the search space and enhance PSO performance. PSO and CPSO parameters include population size, iteration size, minimum and maximum inertial weights, and learning factors. In each method, the number of iterations was set to 1000, and the population size was 50 for the test data set. In PSO and CPSO, the minimum and maximum inertial weights were 0.4 and 0.9, respectively. Both weights of learning factors c 1 and c 2 were set to 2. In the tested GA and the proposed HTGA, the probability of crossover (p c ) with an exchange probability was 0.3 and the probability of mutation (p m ) with an exchange probability was 0.05.

Statistical Analysis
The OR was used to evaluate the risk of an SNP barcode [32], and it was defined as follows:

$$OR = \frac{TP \times TN}{FP \times FN} \qquad (6)$$

where TP represents the number of true positives, TN represents the number of true negatives, FN represents the number of false negatives, and FP represents the number of false positives.
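As a worked sketch, the OR can be computed from barcode counts as OR = (TP × TN) / (FP × FN). The mapping used here is an assumption: TP is the number of cases carrying the barcode, FN the cases not carrying it, FP the controls carrying it, and TN the controls not carrying it. The function names and the counts in the usage note are illustrative only.

```python
def odds_ratio(tp, fp, fn, tn):
    """OR = (TP * TN) / (FP * FN); undefined when FP or FN is zero."""
    if fp == 0 or fn == 0:
        raise ValueError("OR is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)

def barcode_odds_ratio(cases_with, cases_total, controls_with, controls_total):
    """OR of carrying a barcode, from carrier counts among cases and controls."""
    tp = cases_with                      # cases carrying the barcode
    fn = cases_total - cases_with        # cases not carrying the barcode
    fp = controls_with                   # controls carrying the barcode
    tn = controls_total - controls_with  # controls not carrying the barcode
    return odds_ratio(tp, fp, fn, tn)
```

For instance, a hypothetical barcode carried by 10 of 100 cases and 20 of 100 controls gives (10 × 80) / (20 × 90), which is below 1 and therefore indicates a protective association.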

Data Sets
A set of 26 SNPs in several growth factor-related breast cancer genes (EGF, IGF1, IGF1R, IGF2, IGFBP3, IL10, TGFB1, and VEGF) was used as simulation data to evaluate existing algorithms and the proposed HTGA. The data set only provided the genotype frequencies of each SNP without the original raw genotype data. Table 2 presents the SNPs and genotype distributions. The genotype frequencies used for the simulation were acquired from the literature [33]. The SNPs in the original data were genotyped in different numbers of individuals; therefore, the sample size of every SNP had to be normalized to the same number. The new data were randomly generated according to the frequencies of the original data. All SNP data from the data source were adjusted to 5000 samples for all genotype distributions. For example, the modified data for SNP1 were adjusted to a total of 5000 (4418 + 569 + 13 = 5000). Thus, 5000 simulation samples of SNP genotypes were randomly generated following these fixed distributions.
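The normalization and random generation step can be sketched as follows. The sketch assumes that the genotype counts of an SNP are scaled proportionally to the target sample size with largest-remainder rounding and that the resulting genotypes are then randomly assigned to samples by shuffling. The genotype coding 1, 2, 3 and the function names are hypothetical.

```python
import random

def scale_counts(counts, total):
    """Scale genotype counts proportionally so that they sum exactly to total."""
    raw = [c * total / sum(counts) for c in counts]
    scaled = [int(r) for r in raw]
    # Give the remaining units to the genotypes with the largest fractional parts.
    by_fraction = sorted(range(len(counts)), key=lambda i: raw[i] - scaled[i], reverse=True)
    for i in by_fraction[: total - sum(scaled)]:
        scaled[i] += 1
    return scaled

def simulate_snp(counts, total, rng=random):
    """Return a shuffled genotype column with the scaled genotype distribution."""
    scaled = scale_counts(counts, total)
    column = [g for g, c in zip((1, 2, 3), scaled) for _ in range(c)]
    rng.shuffle(column)  # random assignment of genotypes to samples
    return column
```

With the SNP1 distribution given above, simulate_snp([4418, 569, 13], 5000) returns 5000 genotypes in random order while preserving the three genotype counts.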

Comparison Between the Proposed HTGA and Existing Algorithms
We compared PSO [34], CPSO [35], and the GA [24] with the HTGA for 2-SNP to 7-SNP barcodes with protection associations (Table 3). ORs (<1) indicate the impact of the protection association of SNP barcodes for the occurrence of breast cancer. A high difference between cases and controls in the SNP barcodes represents informative protection associations, and P<.05 indicates a significant difference for the SNP barcode between cases and controls. The identified 3-SNP to 7-SNP barcodes showed that the HTGA provided values with a greater degree of difference as compared with PSO, CPSO, and the GA, indicating that the HTGA identified relevant SNP barcodes with protection associations more effectively (Table 3). HTGA-identified SNP barcodes showed ORs ranging from 0.755 to 0.870 (P=.003) for protection associations with breast cancer. The 2-SNP and 3-SNP barcodes in PSO, CPSO, and the GA showed significant differences between cases and controls (2-SNP: P=.003, P=.001, and P=.03, respectively; 3-SNP: P=.04, P=.04, and P=.002, respectively). The 4-SNP barcodes in CPSO and the GA showed significant differences (P=.04 and P=.004, respectively), and the 5-SNP barcode in the GA also showed a significant difference (P=.03). Although CPSO and the GA provided better ORs as compared with the HTGA in all SNP barcodes, the degrees of difference indicated that the SNP barcodes identified by the HTGA were superior to those identified by other methods, and P values >.05 indicated that these differences revealed by the models were not significant.
Optimization algorithms have been widely applied to detect relevant high-order SNP barcodes in disease and cancer studies [24,25,34]. Differences between cases and controls are often applied to evaluate the values of SNP barcodes in terms of their fitness function design. As indicated in Table 3, the HTGA effectively identified the relevant protection associations of SNP barcodes for breast cancer. Logistic regression analysis validated the associations between breast cancer and the SNP barcodes identified by the HTGA. The SNP barcodes detected by the proposed HTGA are simply associations between a barcode and disease, and this type of analysis does not support the inference of causality.
Table 3. Estimation of the best protection single-nucleotide polymorphism barcodes for the occurrence of breast cancer as determined by particle swarm optimization, chaotic particle swarm optimization, the genetic algorithm, and the hybrid Taguchi-genetic algorithm.

Principal Findings
Many breast cancer studies have identified the associations among the effects of important related genes [36][37][38][39][40][41][42], including genes related to DNA repair [43,44] and estrogen-response genes [45]. In this study, we introduced a HTGA to identify the SNP barcodes among 26 breast cancer-related SNPs. The HTGA-generated SNP barcodes were examined to determine their protective effects against breast cancer. The results suggest that nonrelevant SNPs might cumulatively reduce the risk of breast cancer, as indicated by the HTGA-generated preventive SNP barcodes. A search space consisting of SNP barcode combinations can generate numerous local optima in multiple regions. These local optima raise challenges for optimization algorithm search operations, because the heuristic and stochastic properties of such optimization algorithms can easily cause searches to become trapped in local optima. A GA population can be updated by referring to other chromosomes to determine the next position in the search space. However, GA operations can result in stagnation if the chromosomes are similar; points of stagnation in a search space are referred to as local optima. The computational processes and comparisons are shown in Figure 2. A Taguchi system is a nonlinear system with deterministic dynamic behavior owing to its ergodic and stochastic properties. Taguchi methods are used to enhance GA crossover operations, and these methods can be remarkably helpful for avoiding population entrapment in local optima because improved solutions can be found through experimentation. Because the population learns from experience, it can be said to exhibit population intelligence. The HTGA can converge quickly to excellent fitness values for SNP barcodes, whereas the GA is very slow to converge and the results are worse than those of the HTGA (Figure 2), indicating that the GA can very easily result in stagnation in regions that may not include any global optima. 
However, the population is effectively improved in the HTGA, and Figure 2 shows that the fitness values of chromosomes clearly increase over time, indicating that the proposed Taguchi method can be used to improve GA performance to identify SNP barcodes. Moreover, our results support the ability of this Taguchi-based GA to solve SNP barcode identification problems. The optimal parameters of the HTGA could be further analyzed for enhancing the detection ability of SNP barcodes. The parameters of our HTGA include the probabilities of crossover and mutation. A further investigation with more data sets is required to determine the optimal parameters. Moreover, selection, crossover, mutation, and replacement operations can be analyzed to determine the superior operation strategy for enhancing the ability of our HTGA to detect potential SNP barcodes. If the HTGA is applied to clinical data, we suggest considering permutation testing to examine the relevance of the results obtained. For each trial in permutation testing, the case/control labels would be scrambled, and the algorithm would then search for an optimal solution. After numerous trials, we would be able to determine the number of times a solution at least as good as the one from the original data is found and then determine if the algorithm is simply fitting the data or identifying underlying associations.
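The suggested permutation test can be sketched as follows. The search argument is a hypothetical callable that runs a barcode search (for example, the HTGA) on the samples and labels and returns the best fitness value found. Adding 1 to the numerator and denominator is a common convention assumed here so that the estimated P value is never exactly zero.

```python
import random

def permutation_p_value(samples, labels, search, n_perm=1000, rng=random):
    """Estimate how often scrambled labels give a result at least as good as the original."""
    observed = search(samples, labels)   # best fitness value on the original labels
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # scramble the case/control labels
        if search(samples, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small value suggests that the barcode found on the original data reflects an underlying association, whereas a large value suggests that the algorithm is simply fitting the data.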

Conclusions
An HTGA was proposed to effectively identify relevant SNP barcodes among genes related to breast cancer. The study results demonstrated that the HTGA could effectively detect SNP barcodes for problems with numerous high-order SNP barcode combinations. The proposed Taguchi method can improve the GA via the identification of high-dimensional SNP barcodes, and hence, it is integrated following GA crossover operations to systematically optimize chromosomes and thus enhance their values. Moreover, the HTGA can effectively converge to a promising region within the problem space and provide excellent SNP barcode identification. In this study, large data sets were used to evaluate and compare the performances of the GA, PSO, CPSO, and the HTGA, and the results indicated that the HTGA can effectively identify relevant high-order SNP barcodes in breast cancer.

JMIR Med Inform 2020 | vol. 8 | iss. 6 | e16886 | https://medinform.jmir.org/2020/6/e16886