
Optimize TSK Fuzzy Systems for Classification Problems: Mini-Batch Gradient Descent with Uniform Regularization and Batch Normalization

Yuqi Cui, Dongrui Wu and Jian Huang

Y. Cui, D. Wu and J. Huang are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: yqcui@hust.edu.cn, drwu@hust.edu.cn, huang jan@hust.edu.cn. D. Wu and J. Huang are the corresponding authors.

arXiv:1908.00636v3 [cs.LG] 9 Jan 2020

Abstract: Takagi-Sugeno-Kang (TSK) fuzzy systems are flexible and interpretable machine learning models; however, they may not be easily optimized when the data size is large, and/or the data dimensionality is high. This paper proposes a mini-batch gradient descent (MBGD) based algorithm to efficiently and effectively train TSK fuzzy classifiers. It integrates two novel techniques: 1) uniform regularization (UR), which forces the rules to have similar average contributions to the output, and hence increases the generalization performance of the TSK classifier; and, 2) batch normalization (BN), which extends BN from deep neural networks to TSK fuzzy classifiers to expedite the convergence and improve the generalization performance. Experiments on 12 UCI datasets from various application domains, with varying size and dimensionality, demonstrated that UR and BN are effective individually, and integrating them can further improve the classification performance.

Index Terms: Batch normalization, mini-batch gradient descent, TSK fuzzy classifier, uniform regularization

I. INTRODUCTION

Takagi-Sugeno-Kang (TSK) fuzzy systems [1] have achieved great success in numerous applications, including both classification and regression problems. Many optimization approaches have been proposed for them.

There are generally three strategies for fine-tuning the TSK fuzzy system parameters after initialization: 1) evolutionary algorithms [2], [3]; 2) gradient descent (GD) based algorithms [4]; and, 3) GD plus least squares estimation (LSE), represented by the popular adaptive-network-based fuzzy inference system (ANFIS) [5]. However, these approaches may have challenges when the size and/or the dimensionality of the data increase. Evolutionary algorithms need to keep a large population of candidate solutions, and evaluate the fitness of each, which results in high computational cost and heavy memory requirements for big data. Traditional GD needs to compute the gradients from the entire dataset to iteratively update the model parameters, which may be very slow, or even impossible, when the data size is very large. The memory requirement and computational cost of LSE also increase rapidly when the data size and/or dimensionality increase. Additionally, as shown in [6], ANFIS may result in significant overfitting in regression problems.

Much effort has been spent on tackling the difficulty in optimizing TSK fuzzy systems on big and/or high-dimensional data [7]–[9]. Dimensionality reduction and/or feature selection are usually used to reduce the number of fuzzy partitions (rules). Traditional dimensionality reduction techniques such as principal component analysis (PCA) have been used for TSK fuzzy system optimization [10], [11]. There are also methods focusing on learning a sparse subspace of the original feature space to reduce the number of antecedents in each rule [12], [13]. Once the number of antecedents is determined, different optimization approaches can be used to tune the TSK fuzzy system on large datasets. For example, Chung et al. [9] utilized the equivalence between minimum enclosing ball and the Mamdani-Larsen fuzzy inference system to train the latter using the former. Gacto et al. [14] proposed a multi-objective evolutionary algorithm to optimize TSK fuzzy systems for high-dimensional large-scale regression problems.

Mini-batch gradient descent (MBGD) [15], [16] based optimization, which is particularly popular in deep learning, can also be a solution to training TSK fuzzy systems on large and high-dimensional datasets. In each iteration, MBGD computes the gradients from a randomly selected small batch of data, instead of the entire dataset [17]. Different batch sizes can be used, according to the trade-off among the available memory, the training speed, and the expected generalization performance. The original MBGD used a constant learning rate to update the model's parameters [17]. Later, Sutskever et al. [18] found that adding a momentum to MBGD can improve the final training performance. However, the learning rate still needs to be selected manually, and the convergence may be very slow at the beginning. Kingma and Ba [19] proposed the well-known Adam algorithm to automatically rescale the gradients to achieve an adaptive and individualized learning rate for each parameter, which leads to faster convergence. However, the generalization performance of Adam may not be as good as that of momentum [20]; so, Keskar and Socher [21] tried to combine the advantages of momentum and Adam to achieve both fast convergence and good generalization. Recently, Luo et al. [22] also proposed AdaBound to improve Adam. AdaBound uses an adaptive bound for the learning rate of each parameter to force the optimizer to behave like Adam at the beginning and like stochastic GD at the end. Our very recent research [6] has found that TSK fuzzy systems can achieve better performance with AdaBound than Adam for regression problems.

Although MBGD-based optimization has many advantages, it may be easily trapped in a local minimum, and may face the gradient vanishing problem.
Many other techniques have been proposed to complement MBGD for better performance. In 2015, Ioffe and Szegedy [23] proposed the well-known batch normalization (BN) approach to accelerate the training of deep neural networks by reducing the internal covariate shift (see the footnote below). BN normalizes the input distribution of each layer, so it also alleviates the gradient vanishing problem. It has been used almost ubiquitously in deep learning, and many variants [25]–[28] have also been proposed.

Footnote: Recently some researchers had different opinions on why BN works. For example, Santurkar et al. [24] argued that BN may not reduce the internal covariate shift; instead, it helps improve the Lipschitzness of both the loss and the gradients, and also reduces the dependency on the training hyper-parameters, such as the learning rate and the regularization weights.

This paper, following our previous research [6] on MBGD-based optimization of TSK fuzzy systems for regression problems, considers classification problems. We use AdaBound, as in [6], to adjust the learning rates. Additionally, we propose two novel techniques for training TSK fuzzy systems for classification problems, namely, uniform regularization (UR) and BN. Our main contributions are:

1) We introduce a novel UR term to the cross-entropy loss function in training TSK fuzzy classifiers, which forces all rules to have similar average firing levels on the entire dataset. Experiments show that UR can improve the generalization performance of TSK fuzzy classifiers.
2) We extend BN from the training of deep neural networks to the training of TSK fuzzy classifiers, and show that it can speed up the convergence in training and improve the generalization performance in testing.
3) We further integrate UR and BN, and show that the combined approach outperforms each individual one.

The remainder of this paper is organized as follows: Section II introduces the proposed UR and BN approaches. Section III presents the experimental results to validate the performances of UR and BN. Section IV draws conclusions and points out some future research directions.

II. UR AND BN

This section introduces the details of the TSK fuzzy classifier under consideration, our proposed UR for regularizing the loss function, and BN for more efficient and effective training of the TSK fuzzy classifier. Python implementation of our algorithm can be downloaded at https://github.com/YuqiCui/TSK BN UR.

A. The TSK Fuzzy Classifier

Let the training dataset be $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, in which $\mathbf{x}_n = [x_{n,1}, ..., x_{n,D}]^T \in \mathbb{R}^{D \times 1}$ is a $D$-dimensional feature vector, and $y_n \in \{1, 2, ..., C\}$ the corresponding class label for a $C$-class classification problem.

Suppose the TSK fuzzy classifier has $R$ rules, in the following form:

$$\text{Rule}_r:\ \text{IF } x_1 \text{ is } X_{r,1} \text{ and } \cdots \text{ and } x_D \text{ is } X_{r,D},\ \text{THEN } y_r^1(\mathbf{x}) = b_{r,0}^1 + \sum_{d=1}^{D} b_{r,d}^1 \cdot x_d \text{ and } \cdots \text{ and } y_r^C(\mathbf{x}) = b_{r,0}^C + \sum_{d=1}^{D} b_{r,d}^C \cdot x_d \tag{1}$$

where $X_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) is the membership function (MF) for the $d$-th antecedent in the $r$-th rule, and $b_{r,0}^c$ and $b_{r,d}^c$ ($c = 1, ..., C$) are the consequent parameters for the $c$-th class.

Different types of MFs can be used in our algorithm, as long as they are differentiable. For simplicity, Gaussian MFs are considered in this paper, and the membership grade of $x_d$ on $X_{r,d}$ is:

$$\mu_{X_{r,d}}(x_d) = \exp\left(-\frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right), \tag{2}$$

where $m_{r,d}$ and $\sigma_{r,d}$ are the center and the standard deviation of the Gaussian MF, respectively.

The output of the TSK fuzzy classifier for the $c$-th class is:

$$y^c(\mathbf{x}) = \frac{\sum_{r=1}^{R} f_r(\mathbf{x})\, y_r^c(\mathbf{x})}{\sum_{r=1}^{R} f_r(\mathbf{x})}, \tag{3}$$

where

$$f_r(\mathbf{x}) = \prod_{d=1}^{D} \mu_{X_{r,d}}(x_d) = \exp\left(-\sum_{d=1}^{D} \frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right) \tag{4}$$

is the firing level of Rule $r$. We can also re-write (3) as:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(\mathbf{x})\, y_r^c(\mathbf{x}), \tag{5}$$

where

$$\bar{f}_r(\mathbf{x}) = \frac{f_r(\mathbf{x})}{\sum_{i=1}^{R} f_i(\mathbf{x})} \tag{6}$$

is the normalized firing level of Rule $r$.

Once the output vector $\mathbf{y}(\mathbf{x}) = [y^1(\mathbf{x}), ..., y^C(\mathbf{x})]^T$ is obtained, the input $\mathbf{x}$ is assigned to the class with the largest $y^c(\mathbf{x})$.

To optimize the TSK fuzzy classifier, we need to fine-tune the antecedent MF parameters $m_{r,d}$ and $\sigma_{r,d}$, and the consequent parameters $b_{r,0}^c$ and $b_{r,d}^c$, where $r = 1, ..., R$, $d = 1, ..., D$, and $c = 1, ..., C$.

B. Uniform Regularization (UR)

Mixture of experts (MoE) [29], which is functionally equivalent to TSK fuzzy systems [30]–[32], is a popular machine learning algorithm. Its model is shown in Fig. 1. It trains multiple local experts, each taking care of only a small local region of the input space. For a new input, the gating network determines the activations (weights) of the local experts, and the final output is a weighted average of the local expert outputs.

Fig. 1. Mixture of experts (MoE) [29].

Although MoE has been used successfully in many applications, it may suffer from the "rich get richer" effect [33], [34]: once an expert is slightly better than others, it is always picked by the gating network, whereas other experts starve and are rarely used. This is bad for the generalization performance of the overall model.

Since MoE and TSK fuzzy systems are functionally equivalent [32], TSK fuzzy systems may also suffer from the "rich get richer" effect, i.e., only a few rules are always activated with large firing levels, whereas others have very small firing levels, and hence are not adequately tuned in training.

A remedy to the "rich get richer" effect in TSK fuzzy systems is to force the rules to be fired at similar degrees in the input space, so that each rule contributes about equally to the output. Next, we propose UR to achieve this goal.

UR forces the rules to have similar average firing levels, by minimizing the following loss:

$$\ell_{UR} = \sum_{r=1}^{R} \left( \frac{1}{N} \sum_{n=1}^{N} \bar{f}_r(\mathbf{x}_n) - \tau \right)^2, \tag{7}$$

where $N$ is the number of training examples, and $\tau$ the expected firing level of each rule, which is set to $1/C$ in this paper (recall that $C$ is the number of classes).

$\ell_{UR}$ can then be added to the original loss function in MBGD-based training of TSK fuzzy classifiers, i.e., for each mini-batch with $N$ training samples,

$$L = \ell + \alpha \ell_2 + \lambda \sum_{r=1}^{R} \left( \frac{1}{N} \sum_{n=1}^{N} \bar{f}_r(\mathbf{x}_n) - \tau \right)^2, \tag{8}$$

where $\ell$ is the cross-entropy loss between the estimated class probabilities [obtained by applying softmax to $\mathbf{y}(\mathbf{x})$] and the true class probabilities, $\ell_2$ the L2 regularization of the rule consequent parameters, and $\alpha$ and $\lambda$ the trade-off parameters.

C. Batch Normalization (BN)

BN [23] is a very powerful technique in optimizing deep neural networks [35]–[37]. It normalizes the data distribution in each mini-batch to accelerate the training. For a mini-batch $B = \{\mathbf{x}_n\}_{n=1}^{N}$, the output of BN is [23]:

$$\hat{\mathbf{x}}_n = BN(\mathbf{x}_n) = \gamma \frac{\mathbf{x}_n - \mathbf{m}_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta, \tag{9}$$

where $\mathbf{m}_B$ and $\sigma_B$ are the mean and the standard deviation of the samples in the mini-batch, respectively, $\gamma$ and $\beta$ are parameters to be learned during training, and $\epsilon$ is usually set to 1e−8 to avoid division by zero. During training, exponential weighted averages of $\mathbf{m}_B$ and $\sigma_B$ are recorded so that they can be used in the test phase.

Since TSK fuzzy systems and neural networks share many similarities [32], we can extend BN to the optimization of TSK fuzzy classifiers, as shown in Fig. 2. In the training phase, we first compute the firing level of each rule using the unmodified inputs, as in traditional TSK fuzzy systems. Then, we use BN to normalize the inputs, according to their mean and standard deviation in the current mini-batch. The normalized inputs are then used to compute the rule consequents. The final output is a weighted average of the rule consequents, the weights being the corresponding rule firing levels.

Fig. 2. BN in training a TSK fuzzy classifier. All rule consequents share the same BN layer.

At the testing phase, the BN operation can be merged into the consequent layer. Assume that after training, we obtain a BN layer with learned $\mathbf{m} = (m_1, ..., m_D)^T$, $\boldsymbol{\sigma} = (\sigma_1, ..., \sigma_D)^T$, $\gamma$ and $\beta$. Then, the output $y_r$ of the $r$-th rule with BN is:

$$y_r(BN(\mathbf{x}_n)) = b_{r,0} + \gamma \sum_{d=1}^{D} b_{r,d} \frac{x_{n,d} - m_d}{\sqrt{\sigma_d^2 + \epsilon}} + \beta D, \tag{10}$$

which can be re-written as:

$$y_r(BN(\mathbf{x}_n)) = b'_{r,0} + \sum_{d=1}^{D} b'_{r,d}\, x_{n,d}, \tag{11}$$

where

$$b'_{r,0} = b_{r,0} + \beta D - \gamma \sum_{d=1}^{D} \frac{m_d\, b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}, \tag{12}$$

$$b'_{r,d} = \gamma \frac{b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}. \tag{13}$$

By doing this, the original architecture of the TSK fuzzy classifier is kept unchanged.

We also tested two variants of BN, as shown in Fig. 3. The TSK with global BN (TSK-MBGD-UR-GBN) approach in Fig. 3(a) uses the BN normalized inputs in both antecedents and consequents to compute the final output. In this case, the output of TSK-MBGD-UR-GBN for Class $c$ is:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(BN(\mathbf{x}))\, y_r^c(BN(\mathbf{x})). \tag{14}$$

The TSK with rule-specific BN (TSK-MBGD-UR-RBN) approach in Fig. 3(b) uses the raw inputs to compute the antecedents, and rule-specific BN to compute each consequent individually. The output of TSK-MBGD-UR-RBN for Class $c$ is:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(\mathbf{x})\, y_r^c(BN_r(\mathbf{x})), \tag{15}$$

where $BN_r$ represents the BN operation for the $r$-th rule.

TSK-MBGD-UR-GBN has the same computational cost as TSK-MBGD-UR-BN, but TSK-MBGD-UR-RBN has $R$ times more BN parameters, and hence higher computational cost. Both of them can be re-expressed in the original TSK architecture. We also evaluate their performances in Section III-G.

Fig. 3. (a) TSK fuzzy system with global BN (TSK-MBGD-UR-GBN); and, (b) TSK fuzzy system with rule-specific BN (TSK-MBGD-UR-RBN).

III. EXPERIMENTS AND RESULTS

This section validates the performances of our proposed UR and BN on multiple datasets from various application domains, with varying size and feature dimensionality.

A. Datasets

We evaluated our proposed algorithms on 12 classification datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). Their characteristics are summarized in Table I. For each dataset, we randomly selected 70% of the samples as the training set and the remaining 30% as the test set, and repeated this 30 times to get 30 different data splits. We ran each algorithm on these 30 data splits and report the average performance.

TABLE I
SUMMARY OF THE 12 DATASETS.

| Index | Dataset | No. of Samples | No. of Features | No. of Classes |
| 1 | Vehicle | 846 | 18 | 4 |
| 2 | Biodeg | 1,055 | 41 | 2 |
| 3 | DRD | 1,151 | 19 | 2 |
| 4 | Yeast | 1,484 | 8 | 10 |
| 5 | Steel | 1,941 | 27 | 7 |
| 6 | IS | 2,310 | 19 | 7 |
| 7 | Abalone | 4,177 | 10 | 3 |
| 8 | Waveform21 | 5,000 | 21 | 3 |
| 9 | Page-blocks | 5,473 | 10 | 5 |
| 10 | Satellite | 6,435 | 36 | 6 |
| 11 | Clave | 10,798 | 16 | 4 |
| 12 | MAGIC | 19,020 | 10 | 2 |

Dataset sources, in the order of Table I:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29
https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation
https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set
https://archive.ics.uci.edu/ml/datasets/Yeast
https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults
https://archive.ics.uci.edu/ml/datasets/Image+Segmentation
https://archive.ics.uci.edu/ml/datasets/Abalone
https://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+1)
https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification
https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
https://archive.ics.uci.edu/ml/datasets/Firm-Teacher Clave- Direction Classification
https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Some datasets contain both numerical features and categorical features. The categorical features were converted into numerical ones by one-hot coding. We z-normalized each feature using the mean and standard deviation computed from the training set.

B. Algorithms

We compared nine algorithms to validate our proposed approaches. Among them, four were tree based approaches (DT, RF, PART, and JRip), one was a TSK fuzzy system optimized by a traditional approach (TSK-FCM-LSE), and the remaining four were TSK fuzzy systems optimized by MBGD based approaches (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, TSK-MBGD-UR-BN).

The details of these nine algorithms are as follows:

1) DT: Decision tree implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). We used 5-fold cross-validation to select the maximum depth of the tree from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
2) RF: Random forest implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We set the number of trees to 20 and used 5-fold cross-validation to select the maximum depth of the trees from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
3) PART [38]: The PART (partial decision tree) classifier implemented in RWeka (https://cran.r-project.org/web/packages/RWeka/index.html). All parameters were set by default.
4) JRip [39]: The RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classifier implemented in RWeka. All parameters were set by default.
5) TSK-FCM-LSE [40]: We used fuzzy c-means (FCM) clustering to estimate the antecedent parameters, and LSE with L2 regularization to estimate the consequent parameters.
6) TSK-MBGD: We used MBGD and AdaBound [22] to optimize both the antecedent and the consequent parameters.
7) TSK-MBGD-UR: We used MBGD, AdaBound and UR (Section II-B) to optimize both the antecedent and the consequent parameters. The UR weight $\lambda$ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.
8) TSK-MBGD-BN: We used MBGD, AdaBound and BN (Section II-C) to optimize both the antecedent and the consequent parameters.
9) TSK-MBGD-UR-BN: We used MBGD, AdaBound, BN and UR to optimize both the antecedent and the consequent parameters. The UR weight $\lambda$ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.

For TSK-FCM-LSE, TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the L2 regularization weight $\alpha = 0.05$, and the number of rules $R = 20$. For TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the learning rate of AdaBound to 0.01, following our previous work [6]. In order to make use of all data in the training set and to reduce overfitting simultaneously, we randomly sampled 20% of the data from the training set and trained the TSK model with early stopping, five times. The maximum epoch number was 2,000, and the patience of early stopping 40. We recorded the number of epochs at stopping in each run, and trained the final model with the average stopping epoch number on the entire training set.

k-means clustering was used in the MBGD-based algorithms (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, and TSK-MBGD-UR-BN) to initialize the antecedent parameters. We performed k-means clustering on the training set, where $k$ equaled $R$, the number of rules. We then initialized the rule centers to the cluster centers, and randomly initialized the standard deviation $\sigma_{r,d}$ from a Gaussian distribution $N(1, 0.2)$. For the consequent parameters, we set the initial bias of each rule to zero, and the attribute weight $b_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) randomly from a uniform distribution $U(-1, 1)$.

C. Performance Measures

The raw classification accuracy (RCA), which is the total number of correctly classified test samples divided by the total number of test samples, was used as our primary performance measure.

Since some datasets have significant class imbalance, in addition to the RCA, we also computed the balanced classification accuracy (BCA), which is the mean of the per-class RCAs, as our second performance measure.

D. Experimental Results

The average test RCAs and BCAs are shown in Tables II and III, respectively. The largest value (best performance) on each dataset is marked in bold. To facilitate the comparison, we also show the ranks of the RCAs and BCAs in Tables IV and V, respectively.

TABLE II
AVERAGE RCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
| Vehicle | 0.6907 | 0.7407 | 0.6892 | 0.7110 | 0.7411 | 0.6970 | 0.7354 | 0.7089 | 0.7907 |
| Biodeg | 0.8202 | 0.8572 | 0.8222 | 0.8362 | 0.8377 | 0.8523 | 0.8531 | 0.8539 | 0.8609 |
| DRD | 0.6283 | 0.6589 | 0.6240 | 0.6364 | 0.6824 | 0.6623 | 0.6618 | 0.6713 | 0.6720 |
| Yeast | 0.5564 | 0.5963 | 0.5731 | 0.5340 | 0.5851 | 0.5673 | 0.5770 | 0.5722 | 0.5725 |
| Steel | 0.7017 | 0.7328 | 0.7135 | 0.7120 | 0.6527 | 0.5864 | 0.7110 | 0.7248 | 0.7350 |
| IS | 0.9320 | 0.9529 | 0.9481 | 0.9608 | 0.9571 | 0.5762 | 0.7557 | 0.8559 | 0.9501 |
| Abalone | 0.7170 | 0.7314 | 0.7254 | 0.7104 | 0.7323 | 0.5821 | 0.7129 | 0.6238 | 0.7306 |
| Waveform21 | 0.7641 | 0.8369 | 0.7908 | 0.7843 | 0.8647 | 0.6779 | 0.8002 | 0.8363 | 0.8234 |
| Page-blocks | 0.9651 | 0.9688 | 0.9681 | 0.9677 | 0.9499 | 0.9375 | 0.9419 | 0.9515 | 0.9580 |
| Satellite | 0.8524 | 0.8863 | 0.8587 | 0.8592 | 0.8864 | 0.4890 | 0.8001 | 0.8929 | 0.8943 |
| Clave | 0.7103 | 0.7600 | 0.7344 | 0.7779 | 0.7690 | 0.8223 | 0.8427 | 0.8187 | 0.8192 |
| MAGIC | 0.8427 | 0.8531 | 0.8455 | 0.8488 | 0.8319 | 0.7347 | 0.7861 | 0.8574 | 0.8392 |
| Average | 0.7651 | 0.7979 | 0.7744 | 0.7782 | 0.7909 | 0.6821 | 0.7648 | 0.7806 | 0.8038 |

TABLE III
AVERAGE BCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
| Vehicle | 0.6936 | 0.744 | 0.6939 | 0.7131 | 0.7443 | 0.7010 | 0.7380 | 0.7127 | 0.7930 |
| Biodeg | 0.7973 | 0.8306 | 0.7899 | 0.8122 | 0.8205 | 0.8368 | 0.8318 | 0.8390 | 0.8439 |
| DRD | 0.634 | 0.6624 | 0.6227 | 0.6422 | 0.6845 | 0.6642 | 0.6634 | 0.6717 | 0.6729 |
| Yeast | 0.3998 | 0.4867 | 0.5203 | 0.4889 | 0.5102 | 0.4951 | 0.5184 | 0.4946 | 0.5332 |
| Steel | 0.7005 | 0.6937 | 0.7129 | 0.7267 | 0.6319 | 0.5933 | 0.7258 | 0.7245 | 0.7515 |
| IS | 0.932 | 0.9529 | 0.9481 | 0.9607 | 0.9571 | 0.5762 | 0.7557 | 0.8559 | 0.9501 |
| Abalone | 0.5319 | 0.5362 | 0.5371 | 0.5280 | 0.5402 | 0.4567 | 0.5236 | 0.4791 | 0.5402 |
| Waveform21 | 0.7637 | 0.8365 | 0.7905 | 0.7844 | 0.8645 | 0.6784 | 0.8003 | 0.8362 | 0.8233 |
| Page-blocks | 0.7986 | 0.7385 | 0.8192 | 0.8162 | 0.6003 | 0.5129 | 0.5609 | 0.6033 | 0.671 |
| Satellite | 0.8204 | 0.8480 | 0.8308 | 0.834 | 0.8558 | 0.4337 | 0.7651 | 0.8679 | 0.8700 |
| Clave | 0.4701 | 0.4878 | 0.4985 | 0.6507 | 0.4825 | 0.5876 | 0.6468 | 0.6374 | 0.6421 |
| MAGIC | 0.8058 | 0.8108 | 0.8052 | 0.8135 | 0.7886 | 0.6325 | 0.7128 | 0.8225 | 0.7934 |
| Average | 0.6956 | 0.7190 | 0.714 | 0.7309 | 0.7067 | 0.5974 | 0.6869 | 0.7120 | 0.7404 |

TABLE IV
RCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
| Vehicle | 8 | 3 | 9 | 5 | 2 | 7 | 4 | 6 | 1 |
| Biodeg | 9 | 2 | 8 | 7 | 6 | 5 | 4 | 3 | 1 |
| DRD | 8 | 6 | 9 | 7 | 1 | 4 | 5 | 3 | 2 |
| Yeast | 8 | 1 | 4 | 9 | 2 | 7 | 3 | 6 | 5 |
| Steel | 7 | 2 | 4 | 5 | 8 | 9 | 6 | 3 | 1 |
| IS | 6 | 3 | 5 | 1 | 2 | 9 | 8 | 7 | 4 |
| Abalone | 5 | 2 | 4 | 7 | 1 | 9 | 6 | 8 | 3 |
| Waveform21 | 8 | 2 | 6 | 7 | 1 | 9 | 5 | 3 | 4 |
| Page-blocks | 4 | 1 | 2 | 3 | 7 | 9 | 8 | 6 | 5 |
| Satellite | 7 | 4 | 6 | 5 | 3 | 9 | 8 | 2 | 1 |
| Clave | 9 | 7 | 8 | 5 | 6 | 2 | 1 | 4 | 3 |
| MAGIC | 5 | 2 | 4 | 3 | 7 | 9 | 8 | 1 | 6 |
| Average | 7.0 | 2.9 | 5.8 | 5.3 | 3.8 | 7.3 | 5.5 | 4.3 | 3.0 |

TABLE V
BCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
| Vehicle | 9 | 3 | 8 | 5 | 2 | 7 | 4 | 6 | 1 |
| Biodeg | 8 | 5 | 9 | 7 | 6 | 3 | 4 | 2 | 1 |
| DRD | 8 | 6 | 9 | 7 | 1 | 4 | 5 | 3 | 2 |
| Yeast | 9 | 8 | 2 | 7 | 4 | 5 | 3 | 6 | 1 |
| Steel | 6 | 7 | 5 | 2 | 8 | 9 | 3 | 4 | 1 |
| IS | 6 | 3 | 5 | 1 | 2 | 9 | 8 | 7 | 4 |
| Abalone | 5 | 4 | 3 | 6 | 1 | 9 | 7 | 8 | 2 |
| Waveform21 | 8 | 2 | 6 | 7 | 1 | 9 | 5 | 3 | 4 |
| Page-blocks | 3 | 4 | 1 | 2 | 7 | 9 | 8 | 6 | 5 |
| Satellite | 7 | 4 | 6 | 5 | 3 | 9 | 8 | 2 | 1 |
| Clave | 9 | 7 | 6 | 1 | 8 | 5 | 2 | 4 | 3 |
| MAGIC | 4 | 3 | 5 | 2 | 7 | 9 | 8 | 1 | 6 |
| Average | 6.8 | 4.7 | 5.4 | 4.3 | 4.2 | 7.3 | 5.4 | 4.3 | 2.6 |

The following observations can be made from the above four tables:

1) Generally, UR improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-UR, and TSK-MBGD-BN with TSK-MBGD-UR-BN, we can conclude that generally UR improved the classification performance, regardless of whether BN was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-UR (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-BN).
2) Generally, BN improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-BN, and TSK-MBGD-UR with TSK-MBGD-UR-BN, we can conclude that generally BN improved the classification performance, regardless of whether UR was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-BN (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-UR).
3) Generally, integrating BN and UR achieved further RCA and BCA improvements. Comparing TSK-MBGD-UR-BN with TSK-MBGD, TSK-MBGD-UR and TSK-MBGD-BN, we can conclude that TSK-MBGD-UR-BN almost always performed the best on both RCA and BCA, as shown in Fig. 4. This indicates that BN and UR are to some extent complementary, and hence integrating them may achieve better performance than using each one alone.
4) Overall, TSK-MBGD-UR-BN achieved the best performance among the nine algorithms. The last row of Table V shows that TSK-MBGD-UR-BN achieved the best average BCA performance, and the last row of Table IV shows that TSK-MBGD-UR-BN achieved the second best average RCA performance. Interestingly, RF had the best average rank on RCA, but ranked only fifth on BCA, suggesting that RF may tend to overlook the minority classes. On the contrary, TSK-MBGD-UR-BN performed well on both RCA and BCA.

Fig. 4. (a) RCAs and (b) BCAs of the four MBGD-based TSK fuzzy classifiers on the 12 datasets. Datasets were sorted according to the RCAs of the TSK-MBGD model. The indices along the horizontal axis denote the dataset indices in Table I.

E. Statistical Analysis

To further evaluate the performance improvement of our proposed TSK-MBGD-UR-BN over others, we also performed non-parametric multiple comparison tests on the RCAs and BCAs using Dunn's procedure [41], with a p-value correction using the False Discovery Rate method [42]. The results are shown in Table VI, where the statistically significant ones are marked in bold.

TABLE VI
p-VALUES OF NON-PARAMETRIC MULTIPLE COMPARISONS ON THE RCAS AND BCAS.

| Algorithm | Metric | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR |
| TSK-MBGD-BN | RCA | 0.0097 | 0.0628 | 0.1368 | 0.3239 | 0.2723 | 0.0000 | - | - |
| TSK-MBGD-BN | BCA | 0.1547 | 0.1627 | 0.4508 | 0.0981 | 0.4518 | 0.0000 | - | - |
| TSK-MBGD-UR | RCA | 0.0001 | 0.3740 | 0.0090 | 0.0460 | 0.2452 | 0.0000 | 0.1146 | - |
| TSK-MBGD-UR | BCA | 0.0036 | 0.2900 | 0.0912 | 0.4420 | 0.0921 | 0.0000 | 0.0731 | - |
| TSK-MBGD-UR-BN | RCA | 0.0000 | 0.2113 | 0.0002 | 0.0025 | 0.0404 | 0.0000 | 0.0094 | 0.1409 |
| TSK-MBGD-UR-BN | BCA | 0.0000 | 0.0291 | 0.0021 | 0.0730 | 0.0022 | 0.0000 | 0.0013 | 0.0986 |

Table VI demonstrates that our proposed BN and UR can significantly improve the generalization performance of the traditional MBGD optimization for TSK fuzzy classifiers. TSK-MBGD-UR-BN statistically significantly outperformed CART, JRip, PART, TSK-MBGD and TSK-MBGD-BN on RCA, and also statistically significantly outperformed CART, JRip, TSK-FCM-LSE, TSK-MBGD and TSK-MBGD-BN on BCA. Although the performance improvements of TSK-MBGD-UR-BN over RF and TSK-MBGD-UR were not statistically significant, they were quite close to the threshold, especially for the BCA.

F. Effect of UR

As mentioned in Section II-B, using MBGD to optimize the TSK fuzzy system may face the "rich get richer" problem. To demonstrate this, Fig. 5 shows the average normalized firing levels of the rules on the entire dataset after the four MBGD-based TSK models were trained, on three representative datasets. For TSK-MBGD, a few "richest" rules had much larger average firing levels than others, and hence the rules contributed significantly differently to the output. BN may help alleviate this problem slightly, as the average normalized rule firing levels in TSK-MBGD-BN were more uniform than those in TSK-MBGD, which also resulted in better classification performances, as demonstrated in the previous subsection. However, UR had the most direct effect on alleviating the "rich get richer" problem, as TSK-MBGD-UR (TSK-MBGD-UR-BN) had much more uniform average normalized rule firing levels than TSK-MBGD (TSK-MBGD-BN), and hence also better classification performance.

Fig. 5. Average normalized rule firing levels of TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN on (a) Satellite, (b) Vehicle, and (c) Biodeg datasets.

Note that we set $\tau = 1/C$ in (8), where $C = 6$ for Satellite, $C = 4$ for Vehicle, and $C = 2$ for Biodeg. However, the actual average normalized rule firing levels were not exactly $\tau$ on these datasets. Our experiments showed that although UR cannot guarantee the average normalized rule firing levels to be around $\tau$, it can indeed make the rules fire more uniformly.

Why might making the rules fire more uniformly help improve the generalization performance? In [32] we pointed out that a TSK fuzzy system may be functionally equivalent to an adaptive stacking ensemble model, in which each rule can be viewed as a base learner, and the aggregation weights equal the corresponding rule firing levels. When the rule firing levels are more uniform, generally more rules are utilized in computing the output, i.e., more base learners are used in the stacking ensemble model, which may help improve the generalization performance.

To demonstrate this, we computed the entropy of the normalized rule firing levels for each input example:

$$E = -\sum_{r=1}^{R} \bar{f}_r \log \bar{f}_r, \tag{16}$$

where $\bar{f}_r$ is the normalized firing level of the $r$-th rule. Generally, a larger entropy means more rules were fired.

Fig. 6 shows the histogram of the entropy distributions on the Satellite dataset. When training TSK fuzzy systems without UR, many samples had close to zero $E$, i.e., all except one rule had firing levels close to zero. When UR was added, the number of examples with close to zero $E$ decreased significantly, i.e., more rules with larger firing levels were used in computing the output.

Fig. 6. Histogram of the normalized rule firing level entropy $E$ of (a) TSK-MBGD and TSK-MBGD-UR, and, (b) TSK-MBGD-BN and TSK-MBGD-UR-BN, on the Satellite dataset.

G. Effect of BN

We also used the Satellite dataset to analyze the effect of BN. We set the UR weight $\lambda = 1$ and recorded the training loss and test BCA in the first 20 training epochs. This process was repeated 10 times, and the average results are shown in Figs. 7(a) and 7(b), respectively. BN resulted in smaller training losses and better generalization performances in testing.

There is still no agreement on theoretically why BN is helpful in optimizing deep neural networks [24]; thus, it is also challenging to analyze theoretically why BN can help the optimization of TSK fuzzy systems. Nevertheless, we performed an empirical study to look into this, by recording the L1 norm of the antecedent parameters' gradients and the L1 norm of the consequent parameters' gradients in the first 20 training epochs on the Satellite dataset. The results are shown in Figs. 7(c) and 7(d), respectively. BN significantly increased the gradients of both antecedent and consequent parameters. With the same learning rate, this can expedite the convergence.

Fig. 7. (a) Training loss, (b) test BCA, (c) L1 norm of the antecedent parameters' gradients, and (d) L1 norm of the consequent parameters' gradients, in the first 20 training epochs on the Satellite dataset. The horizontal axis starts from 3 epochs so that the differences among the curves can be more clearly visualized.
We also evaluated the performances of the two BN variants introduced in Section II-C. The BCAs of TSK-MBGD-UR, TSK-MBGD-UR-BN, TSK-MBGD-UR-GBN and TSK-MBGD-UR-RBN are shown in Table VII. TSK-MBGD-UR-BN performed the best, and TSK-MBGD-UR-GBN the worst. Since TSK-MBGD-UR-RBN had more parameters to optimize, its training was not as stable as TSK-MBGD-UR-BN and TSK-MBGD-UR-GBN. Therefore, TSK-MBGD-UR-BN is the best choice.

TABLE VII
AVERAGE BCAS OF THE THREE BN VARIANTS ON THE 12 DATASETS.

Dataset | TSK-MBGD-UR | TSK-MBGD-UR-BN | TSK-MBGD-UR-GBN | TSK-MBGD-UR-RBN
Vehicle | 0.7127 | 0.7930 | 0.7261 | 0.7679
Biodeg | 0.8390 | 0.8439 | 0.8422 | 0.8440
DRD | 0.6717 | 0.6729 | 0.6636 | 0.6650
Yeast | 0.4946 | 0.5332 | 0.4352 | 0.5339
Steel | 0.7245 | 0.7515 | 0.7332 | 0.7219
IS | 0.8559 | 0.9501 | 0.9115 | 0.8938
Abalone | 0.4791 | 0.5402 | 0.4924 | 0.5275
Waveform21 | 0.8362 | 0.8233 | 0.8232 | 0.8334
Page-blocks | 0.6033 | 0.6710 | 0.5912 | 0.6333
Satellite | 0.8679 | 0.8700 | 0.8679 | 0.8216
Clave | 0.6374 | 0.6421 | 0.6090 | 0.6442
MAGIC | 0.8225 | 0.7934 | 0.8319 | 0.8318
Average | 0.7121 | 0.7404 | 0.7106 | 0.7265

H. Effect of the Batch Size

The batch size is an important hyper-parameter in MBGD-based optimization. It determines the memory requirement and the convergence speed in training. A larger batch size leads to faster convergence but also requires more memory. In [43], the authors analyzed the effect of the batch size on the generalization performance. Their results showed that using a larger batch size causes degradation in the model generalization performance, because it tends to converge to a sharper minimum, which makes the model sensitive to noise. A similar finding was presented in [44], namely that a smaller batch size leads to more stable and reliable training. However, since we used the mean, standard deviation and mean firing level of each batch to compute the losses, too small a batch size may also lead to poor performance.

We validated our model on the Satellite dataset with batch size varying from 16 to 2,048. The test RCAs and BCAs averaged over 30 runs are shown in Fig. 8. The test performance decreased with too small or too large batch sizes. For TSK-MBGD-UR-BN, it seems that a batch size within [64, 256] is a good choice.

Fig. 8. Average RCAs and BCAs of TSK-MBGD-UR-BN on the Satellite dataset, using different batch sizes. [Horizontal axis: batch size, from 16 to 2,048; curves: RCA and BCA.]

IV. CONCLUSIONS AND FUTURE RESEARCH

TSK fuzzy systems are powerful and frequently used machine learning models, for both regression and classification. However, they may not be easily applicable to large and/or high-dimensional datasets. Our very recent research [6] proposed an MBGD-based efficient and effective training algorithm (MBGD-RDA) for TSK fuzzy systems for regression problems. This paper has proposed an MBGD-based algorithm, TSK-MBGD-UR-BN, to train TSK fuzzy systems for classification problems. It can deal with both small and big data with different dimensionalities, and may be the only algorithm that can train a TSK fuzzy classifier on big and high-dimensional datasets. TSK-MBGD-UR-BN integrates two novel techniques, which are also first proposed in this paper:

1) UR, which is a regularization term in the loss function to ensure that all rules are fired similarly on average, and hence to improve the generalization performance.
2) BN, which normalizes the inputs in computing the rule consequents to speed up the convergence and to improve the generalization.

Experiments on 12 UCI datasets from various domains, with varying size and feature dimensionality, demonstrated that each of UR and BN has its own unique advantages, and integrating them can achieve the best classification performance. TSK-MBGD-UR-BN, together with MBGD-RDA proposed in [6], shall greatly promote the applications of TSK fuzzy systems in both classification and regression, especially for big data problems.

The proposed TSK-MBGD-UR-BN also has some limitations, which will be addressed in our future research. First, for very high dimensional data, fuzzy partitions of the input space become very complicated, and numeric underflow may happen when the product t-norm is used. Further research shall consider rules that automatically select the most relevant attributes as the antecedents. Second, we shall investigate how to improve the interpretability of data-driven TSK fuzzy systems. This is also partially linked to the first problem, as reducing the number of antecedents can improve the interpretability of the rules.

REFERENCES

[1] A.-T. Nguyen, T. Taniguchi, L. Eciolaza, V. Campos, R. Palhares, and M. Sugeno, “Fuzzy control systems: Past, present and future,” IEEE Computational Intelligence Magazine, vol. 14, no. 1, pp. 56–68, 2019.
[2] Y. Shi, R. Eberhart, and Y. Chen, “Implementation of evolutionary fuzzy systems,” IEEE Trans. on Fuzzy Systems, vol. 7, no. 2, pp. 109–119.
[3] D. Wu and W. W. Tan, “Genetic learning and performance evaluation of interval type-2 fuzzy logic controllers,” Engineering Applications of Artificial Intelligence, vol. 19, no. 8, pp. 829–841, 2006.
[4] L.-X. Wang and J. M. Mendel, “Back-propagation of fuzzy systems as nonlinear dynamic system identifiers,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, San Diego, CA, Sep. 1992, pp. 1409–1418.
[5] J. S. R. Jang, “ANFIS: Adaptive-network-based fuzzy inference system,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665–685, 1993.
[6] D. Wu, Y. Yuan, J. Huang, and Y. Tan, “Optimize TSK fuzzy systems for big data regression problems: Mini-batch gradient descent with regularization, DropRule and AdaBound (MBGD-RDA),” IEEE Trans. on Fuzzy Systems, 2020, in press. [Online]. Available: https://arxiv.org/abs/1903.10951
[7] Y. Jin, “Fuzzy modeling of high-dimensional systems: complexity reduction and interpretability improvement,” IEEE Trans. on Fuzzy Systems, vol. 8, no. 2, pp. 212–221, 2000.
[8] Y. Deng, Z. Ren, Y. Kong, F. Bao, and Q. Dai, “A hierarchical fused fuzzy deep neural network for data classification,” IEEE Trans. on Fuzzy Systems, vol. 25, no. 4, pp. 1006–1012, 2016.
[9] F.-L. Chung, Z. Deng, and S. Wang, “From minimum enclosing ball to fast fuzzy inference system training on large datasets,” IEEE Trans. on Fuzzy Systems, vol. 17, no. 1, pp. 173–184, 2008.
[10] M. Nilashi, O. Bin Ibrahim, N. Ithnin, and N. H. Sarmin, “A multi-criteria collaborative filtering recommender system for the tourism domain using Expectation Maximization (EM) and PCA–ANFIS,” Electronic Commerce Research and Applications, vol. 14, no. 6, pp. 542–562.
[11] C. K. Lau, K. Ghosh, M. A. Hussain, and C. R. C. Hassan, “Fault diagnosis of Tennessee Eastman process with multi-scale PCA and ANFIS,” Chemometrics and Intelligent Laboratory Systems, vol. 120, pp. 1–14, 2013.
[12] Z. Deng, K.-S. Choi, Y. Jiang, J. Wang, and S. Wang, “A survey on soft subspace clustering,” Information Sciences, vol. 348, pp. 84–106, 2016.
[13] Z. Deng, K.-S. Choi, F.-L. Chung, and S. Wang, “Enhanced soft subspace clustering integrating within-cluster and between-cluster information,” Pattern Recognition, vol. 43, no. 3, pp. 767–781, 2010.
[14] M. J. Gacto, M. Galende, R. Alcalá, and F. Herrera, “METSK-HDe: A multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems,” Information Sciences, vol. 276, pp. 63–79, 2014.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Boston, MA: MIT Press, 2016.
[16] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[17] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. Int’l Conf. on Computational Statistics. Paris, France: Springer, Aug. 2010, pp. 177–186.
[18] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proc. Int’l Conf. on Machine Learning, Atlanta, GA, Jun. 2013, pp. 1139–1147.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int’l Conf. on Learning Representations, San Diego, CA, May.
[20] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The marginal value of adaptive gradient methods in machine learning,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 4148–4158.
[21] N. S. Keskar and R. Socher, “Improving generalization performance by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628, 2017.
[22] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” in Proc. Int’l Conf. on Learning Representations, New Orleans, LA, May 2019.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int’l Conf. on Machine Learning, Lille, France, Jul. 2015.
[24] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization?” in Proc. Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 2483–2493.
[25] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[26] L. Fan, “Revisit fuzzy neural network: Demystifying batch normalization and ReLU with generalized hamming network,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 1923–1932.
[27] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[28] Y. Wu and K. He, “Group normalization,” in Proc. European Conf. on Computer Vision, Munich, Germany, Sep. 2018, pp. 3–19.
[29] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87.
[30] H. Bersini and G. Bontempi, “Now comes the time to defuzzify neuro-fuzzy models,” Fuzzy Sets and Systems, vol. 90, no. 2, pp. 161–169.
[31] H. Andersen, A. Lotfi, and L. Westphal, “Comments on ‘functional equivalence between radial basis function networks and fuzzy inference systems’ [and author’s reply],” IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1529–1532, 1998.
[32] D. Wu, C.-T. Lin, J. Huang, and Z. Zeng, “On the functional equivalence of TSK fuzzy systems to neural networks, mixture of experts, CART, and stacking ensemble regression,” IEEE Trans. on Fuzzy Systems, 2020, in press. [Online]. Available: https://arxiv.org/abs/1903.10572
[33] T. Shen, M. Ott, M. Auli, and M. Ranzato, “Mixture models for diverse machine translation: Tricks of the trade,” arXiv preprint arXiv:1902.07816, 2019.
[34] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, Jun. 2016, pp. 770–778.
[36] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
[37] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, Jul. 2017, pp. 4700–4708.
[38] E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proc. Int’l Conf. on Machine Learning, San Francisco, CA, Jul. 1998.
[39] W. W. Cohen, “Repeated incremental pruning to produce error reduction,” in Proc. Int’l Conf. on Machine Learning, Tahoe City, CA, Jun.
[40] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, “Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence,” IEEE Trans. on Automatic Control, vol. 42, no. 10, pp. 1482–1484, 1997.
[41] O. J. Dunn, “Multiple comparisons using rank sums,” Technometrics, vol. 6, no. 3, pp. 241–252, 1964.
[42] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289–300, 1995.
[43] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. Int’l Conf. on Learning Representations, Toulon, France, Apr. 2017.
[44] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” arXiv preprint arXiv:1804.07612, 2018.

Optimize TSK Fuzzy Systems for Classification Problems: Mini-Batch Gradient Descent with Uniform Regularization and Batch Normalization

Statistics, Volume 2020 (1908) – Aug 1, 2019

ISSN
1063-6706
eISSN
ARCH-3347
DOI
10.1109/TFUZZ.2020.2967282

Abstract

Optimize TSK Fuzzy Systems for Classification Problems: Mini-Batch Gradient Descent with Uniform Regularization and Batch Normalization

Yuqi Cui, Dongrui Wu and Jian Huang

Y. Cui, D. Wu and J. Huang are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: yqcui@hust.edu.cn, drwu@hust.edu.cn, huang_jan@hust.edu.cn. D. Wu and J. Huang are the corresponding authors.

arXiv:1908.00636v3 [cs.LG] 9 Jan 2020

Abstract: Takagi-Sugeno-Kang (TSK) fuzzy systems are flexible and interpretable machine learning models; however, they may not be easily optimized when the data size is large, and/or the data dimensionality is high. This paper proposes a mini-batch gradient descent (MBGD) based algorithm to efficiently and effectively train TSK fuzzy classifiers. It integrates two novel techniques: 1) uniform regularization (UR), which forces the rules to have similar average contributions to the output, and hence to increase the generalization performance of the TSK classifier; and, 2) batch normalization (BN), which extends BN from deep neural networks to TSK fuzzy classifiers to expedite the convergence and improve the generalization performance. Experiments on 12 UCI datasets from various application domains, with varying size and dimensionality, demonstrated that UR and BN are effective individually, and integrating them can further improve the classification performance.

Index Terms: Batch normalization, mini-batch gradient descent, TSK fuzzy classifier, uniform regularization

I. INTRODUCTION

Takagi-Sugeno-Kang (TSK) fuzzy systems [1] have achieved great success in numerous applications, including both classification and regression problems. Many optimization approaches have been proposed for them.

There are generally three strategies for fine-tuning the TSK fuzzy system parameters after initialization: 1) evolutionary algorithms [2], [3]; 2) gradient descent (GD) based algorithms [4]; and, 3) GD plus least squares estimation (LSE), represented by the popular adaptive-network-based fuzzy inference system (ANFIS) [5]. However, these approaches may have challenges when the size and/or the dimensionality of the data increase. Evolutionary algorithms need to keep a large population of candidate solutions, and evaluate the fitness of each, which results in high computational cost and heavy memory requirement for big data. Traditional GD needs to compute the gradients from the entire dataset to iteratively update the model parameters, which may be very slow, or even impossible, when the data size is very large. The memory requirement and computational cost of LSE also increase rapidly when the data size and/or dimensionality increase. Additionally, as shown in [6], ANFIS may result in significant overfitting in regression problems.

Much effort has been spent on tackling the difficulty in optimizing TSK fuzzy systems on big and/or high-dimensional data [7]–[9]. Dimensionality reduction and/or feature selection are usually used to reduce the number of fuzzy partitions (rules). Traditional dimensionality reduction techniques such as principal component analysis (PCA) have been used for TSK fuzzy system optimization [10], [11]. There are also methods focusing on learning a sparse subspace of the original feature space to reduce the number of antecedents in each rule [12], [13]. Once the number of antecedents is determined, different optimization approaches can be used to tune the TSK fuzzy system on large datasets. For example, Chung et al. [9] utilized the equivalence between minimum enclosing ball and the Mamdani-Larsen fuzzy inference system to train the latter using the former. Gacto et al. [14] proposed a multi-objective evolutionary algorithm to optimize TSK fuzzy systems for high-dimensional large-scale regression problems.

Mini-batch gradient descent (MBGD) [15], [16] based optimization, which is particularly popular in deep learning, can also be a solution to training TSK fuzzy systems on large and high-dimensional datasets. In each iteration, MBGD computes the gradients from a randomly selected small batch of data, instead of the entire dataset [17]. Different batch sizes can be used, according to the trade-off among the available memory, the training speed, and the expected generalization performance. The original MBGD used a constant learning rate to update the model’s parameters [17]. Later, Sutskever et al. [18] found that adding momentum to MBGD can improve the final training performance. However, the learning rate still needs to be selected manually, and the convergence may be very slow at the beginning. Kingma and Ba [19] proposed the well-known Adam algorithm to automatically rescale the gradients to achieve an adaptive and individualized learning rate for each parameter, which leads to faster convergence. However, the generalization performance of Adam may not be as good as that of momentum [20]; so, Keskar and Socher [21] tried to combine the advantages of momentum and Adam to achieve both fast convergence and good generalization. Recently, Luo et al. [22] also proposed AdaBound to improve Adam. AdaBound uses an adaptive bound for the learning rate of each parameter to force the optimizer to behave like Adam at the beginning and like stochastic GD at the end. Our very recent research [6] has found that TSK fuzzy systems can achieve better performance with AdaBound than Adam for regression problems.

Although MBGD-based optimization has many advantages, it may be easily trapped into a local minimum, and may face the gradient vanishing problem. Many other techniques have been proposed to complement MBGD for better performance. In 2015, Ioffe and Szegedy [23] proposed the well-known batch normalization (BN) approach to accelerate the training of deep neural networks by reducing the internal covariate shift. BN normalizes the input distribution of each layer, so it also alleviates the gradient vanishing problem. It has been used almost ubiquitously in deep learning, and many variants [25]–[28] have also been proposed.

This paper, following our previous research [6] on MBGD-based optimization of TSK fuzzy systems for regression problems, considers classification problems. We use AdaBound, as in [6], to adjust the learning rates. Additionally, we propose two novel techniques for training TSK fuzzy systems for classification problems, namely, uniform regularization (UR) and BN. Our main contributions are:

1) We introduce a novel UR term to the cross-entropy loss function in training TSK fuzzy classifiers, which forces all rules to have similar average firing levels on the entire dataset. Experiments show that UR can improve the generalization performance of TSK fuzzy classifiers.
2) We extend BN from the training of deep neural networks to the training of TSK fuzzy classifiers, and show that it can speed up the convergence in training and improve the generalization performance in testing.
3) We further integrate UR and BN, and show that the combined approach outperforms each individual one.

The remainder of this paper is organized as follows: Section II introduces the proposed UR and BN approaches. Section III presents the experimental results to validate the performances of UR and BN. Section IV draws conclusions and points out some future research directions.

II. UR AND BN

This section introduces the details of the TSK fuzzy classifier under consideration, our proposed UR for regularizing the loss function, and BN for more efficient and effective training of the TSK fuzzy classifier. Python implementation of our algorithm can be downloaded at https://github.com/YuqiCui/TSK_BN_UR.

A. The TSK Fuzzy Classifier

Let the training dataset be $\mathcal{D} = \{x_n, y_n\}_{n=1}^{N}$, in which $x_n = [x_{n,1}, ..., x_{n,D}]^T \in \mathbb{R}^{D \times 1}$ is a D-dimensional feature vector, and $y_n \in \{1, 2, ..., C\}$ the corresponding class label for a C-class classification problem.

Suppose the TSK fuzzy classifier has R rules, in the following form:

$$
\begin{aligned}
\text{Rule}_r:\ & \text{IF } x_1 \text{ is } X_{r,1} \text{ and } \cdots \text{ and } x_D \text{ is } X_{r,D}, \\
& \text{THEN } y_r^1(x) = b_{r,0}^1 + \sum_{d=1}^{D} b_{r,d}^1 \cdot x_d \text{ and } \cdots \text{ and } y_r^C(x) = b_{r,0}^C + \sum_{d=1}^{D} b_{r,d}^C \cdot x_d,
\end{aligned} \tag{1}
$$

where $X_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) is the membership function (MF) for the d-th antecedent in the r-th rule, and $b_{r,0}^c$ and $b_{r,d}^c$ ($c = 1, ..., C$) are the consequent parameters for the c-th class.

Different types of MFs can be used in our algorithm, as long as they are differentiable. For simplicity, Gaussian MFs are considered in this paper, and the membership grade of $x_d$ on $X_{r,d}$ is:

$$\mu_{X_{r,d}}(x_d) = \exp\left(-\frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right), \tag{2}$$

where $m_{r,d}$ and $\sigma_{r,d}$ are the center and the standard deviation of the Gaussian MF, respectively.

The output of the TSK fuzzy classifier for the c-th class is:

$$y^c(x) = \frac{\sum_{r=1}^{R} f_r(x)\, y_r^c(x)}{\sum_{r=1}^{R} f_r(x)}, \tag{3}$$

where

$$f_r(x) = \prod_{d=1}^{D} \mu_{X_{r,d}}(x_d) = \exp\left(-\sum_{d=1}^{D} \frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right) \tag{4}$$

is the firing level of Rule r. We can also re-write (3) as:

$$y^c(x) = \sum_{r=1}^{R} \bar{f}_r(x)\, y_r^c(x), \tag{5}$$

where

$$\bar{f}_r(x) = \frac{f_r(x)}{\sum_{i=1}^{R} f_i(x)} \tag{6}$$

is the normalized firing level of Rule r.

Once the output vector $y(x) = [y^1(x), ..., y^C(x)]^T$ is obtained, the input x is assigned to the class with the largest $y^c(x)$.

To optimize the TSK fuzzy classifier, we need to fine-tune the antecedent MF parameters $m_{r,d}$ and $\sigma_{r,d}$, and the consequent parameters $b_{r,0}^c$ and $b_{r,d}^c$, where $r = 1, ..., R$, $d = 1, ..., D$, and $c = 1, ..., C$.
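The inference path of Section II-A, from the Gaussian membership grades to the class decision, can be sketched in plain Python as follows. This is only a minimal illustration, not the implementation in the authors' repository: the function names, the nested-list parameter layout (`bias[r][c]` for the constant consequent term, `weights[r][c][d]` for the feature weights) and the toy numbers are all assumptions made for the example.

```python
import math


def gaussian_mf(x_d, m, sigma):
    """Membership grade of x_d on a Gaussian MF with center m and std sigma."""
    return math.exp(-((x_d - m) ** 2) / (2.0 * sigma ** 2))


def firing_levels(x, centers, sigmas):
    """Firing level of each rule: product of its D membership grades."""
    levels = []
    for m_r, s_r in zip(centers, sigmas):
        f_r = 1.0
        for x_d, m, s in zip(x, m_r, s_r):
            f_r *= gaussian_mf(x_d, m, s)
        levels.append(f_r)
    return levels


def tsk_output(x, centers, sigmas, bias, weights):
    """Return the class outputs y^c(x) and the index of the largest one.

    bias[r][c] is the constant term of rule r for class c, and
    weights[r][c][d] the weight of feature d (hypothetical layout).
    """
    f = firing_levels(x, centers, sigmas)
    total = sum(f)
    f_bar = [f_r / total for f_r in f]  # normalized firing levels
    n_classes = len(bias[0])
    y = []
    for c in range(n_classes):
        y_c = 0.0
        for r, fb in enumerate(f_bar):
            # Linear rule consequent for class c, weighted by the firing level.
            y_rc = bias[r][c] + sum(w * x_d for w, x_d in zip(weights[r][c], x))
            y_c += fb * y_rc
        y.append(y_c)
    predicted = max(range(n_classes), key=lambda c: y[c])
    return y, predicted


# Toy example: R = 2 rules, D = 1 feature, C = 2 classes (made-up values).
x = [0.0]
centers = [[0.0], [2.0]]
sigmas = [[1.0], [1.0]]
bias = [[1.0, 0.0], [0.0, 1.0]]
weights = [[[0.0], [0.0]], [[0.0], [0.0]]]

y, predicted = tsk_output(x, centers, sigmas, bias, weights)
```

In this toy setting the first rule fires with level 1 and the second with exp(-2), so the first class output equals the first normalized firing level, 1 / (1 + exp(-2)), and the zero-based index `predicted` is 0, which corresponds to class 1 in the notation of the text.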
B. Uniform Regularization (UR)

Mixture of experts (MoE) [29], which is functionally equivalent to TSK fuzzy systems [30]–[32], is a popular machine learning algorithm. Its model is shown in Fig. 1. It trains multiple local experts, each taking care of only a small local region of the input space. For a new input, the gating network determines the activations (weights) of the local experts, and the final output is a weighted average of the local expert outputs.

Fig. 1. Mixture of experts (MoE) [29].

Footnote: Recently some researchers had different opinions on why BN works. For example, Santurkar et al. [24] argued that BN may not reduce the internal covariate shift; instead, it helps improve the Lipschitzness of both the loss and the gradients, and also reduces the dependency on the training hyper-parameters, such as the learning rate and the regularization weights.

Although MoE has been used successfully in many applications, it may suffer from the “rich get richer” effect [33], [34]: once an expert is slightly better than others, it is always picked by the gating network, whereas other experts starve and are rarely used. This is bad for the generalization performance of the overall model.

Since MoE and TSK fuzzy systems are functionally equivalent [32], TSK fuzzy systems may also suffer from the “rich get richer” effect, i.e., only a few rules are always activated with large firing levels, whereas others have very small firing levels, and hence are not adequately tuned in training.

A remedy to the “rich get richer” effect in TSK fuzzy systems is to force the rules to be fired at similar degrees in the input space, so that each rule contributes about equally to the output. Next, we propose UR to achieve this goal.

UR forces the rules to have similar average firing levels, by minimizing the following loss:

$$\ell_{UR} = \sum_{r=1}^{R} \left( \frac{1}{N} \sum_{n=1}^{N} \bar{f}_r(x_n) - \tau \right)^2, \tag{7}$$

where N is the number of training examples, and τ the expected firing level of each rule, which is set to 1/C in this paper (recall that C is the number of classes).

$\ell_{UR}$ can then be added to the original loss function in MBGD-based training of TSK fuzzy classifiers, i.e., for each mini-batch with N training samples,

$$L = \ell + \alpha \ell_2 + \lambda \sum_{r=1}^{R} \left( \frac{1}{N} \sum_{n=1}^{N} \bar{f}_r(x_n) - \frac{1}{R} \right)^2, \tag{8}$$

where ℓ is the cross-entropy loss between the estimated class probabilities [obtained by applying softmax to y(x)] and the true class probabilities, $\ell_2$ the L2 regularization of the rule consequent parameters, and α and λ the trade-off parameters.

C. Batch Normalization (BN)

BN [23] is a very powerful technique in optimizing deep neural networks [35]–[37]. It normalizes the data distribution in each mini-batch to accelerate the training. For a mini-batch $\mathcal{B} = \{x_n\}_{n=1}^{N}$, the output of BN is [23]:

$$\hat{x}_n = BN(x_n) = \gamma \frac{x_n - m_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta, \tag{9}$$

where $m_B$ and $\sigma_B$ are the mean and the standard deviation of the samples in the mini-batch, respectively, γ and β are parameters to be learned during training, and ε is usually set to 1e−8 to avoid division by zero. During training, exponential weighted averages of $m_B$ and $\sigma_B$ are recorded so that they can be used in the test phase.

Since TSK fuzzy systems and neural networks share lots of similarity [32], we can extend BN to the optimization of TSK fuzzy classifiers, as shown in Fig. 2. In the training phase, we first compute the firing level of each rule using the unmodified inputs, as in traditional TSK fuzzy systems. Then, we use BN to normalize the inputs, according to their mean and standard deviation in the current mini-batch. The normalized inputs are then used to compute the rule consequents. The final output is a weighted average of the rule consequents, the weights being the corresponding rule firing levels.

Fig. 2. BN in training a TSK fuzzy classifier. All rule consequents share the same BN layer.

At the testing phase, the BN operation can be merged into the consequent layer. Assume that after training, we obtain a BN layer with learned $m = (m_1, ..., m_D)^T$, $\sigma = (\sigma_1, ..., \sigma_D)^T$, γ and β. Then, the output $y_r$ of the r-th rule with BN is:

$$y_r(BN(x_n)) = b_{r,0} + \gamma \sum_{d=1}^{D} b_{r,d} \frac{x_{n,d} - m_d}{\sqrt{\sigma_d^2 + \epsilon}} + \beta D, \tag{10}$$

which can be re-written as:

$$y_r(BN(x_n)) = b'_{r,0} + \sum_{d=1}^{D} b'_{r,d}\, x_{n,d}, \tag{11}$$

where

$$b'_{r,0} = b_{r,0} + \beta D - \gamma \sum_{d=1}^{D} \frac{m_d\, b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}, \tag{12}$$

$$b'_{r,d} = \gamma \frac{b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}. \tag{13}$$

By doing this, the original architecture of the TSK fuzzy classifier is kept unchanged.

We also tested two variants of BN, as shown in Fig. 3. The TSK with global BN (TSK-MBGD-UR-GBN) approach in Fig. 3(a) uses the BN normalized inputs in both antecedents and consequents to compute the final output. In this case, the output of TSK-MBGD-UR-GBN for Class c is:

$$y^c(x) = \sum_{r=1}^{R} \bar{f}_r(BN(x))\, y_r^c(BN(x)). \tag{14}$$

The TSK with rule-specific BN (TSK-MBGD-UR-RBN) approach in Fig. 3(b) uses the raw inputs to compute the antecedents, and rule-specific BN to compute each consequent individually. The output of TSK-MBGD-UR-RBN for Class c is:

$$y^c(x) = \sum_{r=1}^{R} \bar{f}_r(x)\, y_r^c(BN_r(x)), \tag{15}$$

where $BN_r$ represents the BN operation for the r-th rule.

TSK-MBGD-UR-GBN has the same computational cost as TSK-MBGD-UR-BN, but TSK-MBGD-UR-RBN has R times more BN parameters, and hence higher computational cost. Both of them can be re-expressed in the original TSK architecture. We also evaluate their performances in Section III-G.

Fig. 3. (a) TSK fuzzy system with global BN (TSK-MBGD-UR-GBN); and, (b) TSK fuzzy system with rule-specific BN (TSK-MBGD-UR-RBN).

III. EXPERIMENTS AND RESULTS

This section validates the performances of our proposed UR and BN on multiple datasets from various application domains, with varying size and feature dimensionality.

A. Datasets

We evaluated our proposed algorithms on 12 classification datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). Their characteristics are summarized in Table I. For each dataset, we randomly selected 70% of the samples as the training set and the remaining 30% as the test set, and repeated this 30 times to get 30 different data splits. We ran each algorithm on these 30 data splits and report the average performance.

TABLE I
SUMMARY OF THE 12 DATASETS.

Index | Dataset | No. of Samples | No. of Features | No. of Classes
1 | Vehicle | 846 | 18 | 4
2 | Biodeg | 1,055 | 41 | 2
3 | DRD | 1,151 | 19 | 2
4 | Yeast | 1,484 | 8 | 10
5 | Steel | 1,941 | 27 | 7
6 | IS | 2,310 | 19 | 7
7 | Abalone | 4,177 | 10 | 3
8 | Waveform21 | 5,000 | 21 | 3
9 | Page-blocks | 5,473 | 10 | 5
10 | Satellite | 6,435 | 36 | 6
11 | Clave | 10,798 | 16 | 4
12 | MAGIC | 19,020 | 10 | 2

Dataset sources:
Vehicle: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29
Biodeg: https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation
DRD: https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set
Yeast: https://archive.ics.uci.edu/ml/datasets/Yeast
Steel: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults
IS: https://archive.ics.uci.edu/ml/datasets/Image+Segmentation
Abalone: https://archive.ics.uci.edu/ml/datasets/Abalone
Waveform21: https://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+1)
Page-blocks: https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification
Satellite: https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
Clave: https://archive.ics.uci.edu/ml/datasets/Firm-Teacher_Clave-Direction_Classification
MAGIC: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Some datasets contain both numerical features and categorical features. The categorical features were converted into numerical ones by one-hot coding. We z-normalized each feature using the mean and standard deviation computed from the training set.

B. Algorithms

We compared nine algorithms to validate our proposed approaches. Among them, four were tree based approaches (DT, RF, PART, and JRip), one was a TSK fuzzy system optimized by a traditional approach (TSK-FCM-LSE), and the remaining four were TSK fuzzy systems optimized by MBGD based approaches (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, TSK-MBGD-UR-BN).

The details of these nine algorithms are as follows:

1) DT: Decision tree implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). We used 5-fold cross-validation to select the maximum depth of the tree from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
2) RF: Random forest implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We set the number of trees to 20 and used 5-fold cross-validation to select the maximum depth of the trees from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
3) PART [38]: The PART (partial decision tree) classifier implemented in RWeka (https://cran.r-project.org/web/packages/RWeka/index.html). All parameters were set by default.
4) JRip [39]: The RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classifier implemented in RWeka. All parameters were set by default.
5) TSK-FCM-LSE [40]: We used fuzzy c-means (FCM) clustering to estimate the antecedent parameters, and LSE with L2 regularization to estimate the consequent parameters.
6) TSK-MBGD: We used MBGD and AdaBound [22] to optimize both the antecedent and the consequent parameters.
7) TSK-MBGD-UR: We used MBGD, AdaBound and UR (Section II-B) to optimize both the antecedent and the consequent parameters. The UR weight λ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.
8) TSK-MBGD-BN: We used MBGD, AdaBound and BN (Section II-C) to optimize both the antecedent and the consequent parameters.
9) TSK-MBGD-UR-BN: We used MBGD, AdaBound, BN and UR to optimize both the antecedent and the consequent parameters. The UR weight λ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.

For TSK-FCM-LSE, TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the L2 regularization weight α = 0.05, and the number of rules R = 20. For TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the learning rate of AdaBound to 0.01, following our previous work [6]. In order to make use of all data in the training set and to reduce overfitting simultaneously, we randomly sampled 20% data from the training set and trained the TSK model with early stopping five times. The maximum epoch number was 2,000, and the patience of early stopping 40. We recorded the number of epochs at stopping in each run, and trained the final model with the average stopping epoch number on the entire training set.

k-means clustering was used in the MBGD-based algorithms (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, and TSK-MBGD-UR-BN) to initialize the antecedent parameters. We performed k-means clustering on the training set, where k equaled R, the number of rules. We then initialized the rule centers to the cluster centers, and randomly initialized the standard deviation $\sigma_{r,d}$ from a Gaussian distribution N(1, 0.2). For the consequent parameters, we set the initial bias of each rule to zero, and the attribute weight $b_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) randomly from a uniform distribution U(−1, 1).

C. Performance Measures

The raw classification accuracy (RCA), which is the total number of correctly classified test samples divided by the total number of test samples, was used as our primary performance measure.

Since some datasets have significant class imbalance, in addition to the RCA, we also computed the balanced classification accuracy (BCA), which is the mean of the per-class RCAs, as our second performance measure.

D. Experimental Results

The average test RCAs and BCAs are shown in Tables II and III, respectively. The largest value (best performance) on each dataset is marked in bold. To facilitate the comparison, we also show the ranks of the RCAs and BCAs in Tables IV and V, respectively.

The following observations can be made from the above four tables:

1) Generally, UR improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-UR, and TSK-MBGD-BN with TSK-MBGD-UR-BN, we can conclude that generally UR improved the classification performance, regardless of whether BN was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-UR (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-BN).
2) Generally, BN improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-BN, and TSK-MBGD-UR with TSK-MBGD-UR-BN, we can conclude that generally BN improved the classification performance, regardless of whether UR was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-BN (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-UR).
3) Generally, integrating BN and UR achieved further RCA and BCA improvements. Comparing TSK-MBGD-UR-BN with TSK-MBGD, TSK-MBGD-UR and TSK-MBGD-BN, we can conclude that TSK-MBGD-UR-BN almost always performed the best on both RCA and BCA, as shown in Fig. 4. This indicated that BN and UR are somehow complementary, and hence integrating them may achieve better performance than using each one alone.
4) Overall, TSK-MBGD-UR-BN achieved the best performance among the nine algorithms. The last row of Table V shows that TSK-MBGD-UR-BN achieved the best average BCA performance, and the last row of Table IV shows that TSK-MBGD-UR-BN achieved the second best average RCA performance. Interestingly, RF had the best average rank on RCA, but only ranked the fifth on BCA, suggesting that RF may tend to overlook the minority classes. On the contrary, TSK-MBGD-UR-BN performed well on both RCA and BCA.

TABLE II
AVERAGE RCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN
Vehicle | 0.6907 | 0.7407 | 0.6892 | 0.7110 | 0.7411 | 0.6970 | 0.7354 | 0.7089 | 0.7907
Biodeg | 0.8202 | 0.8572 | 0.8222 | 0.8362 | 0.8377 | 0.8523 | 0.8531 | 0.8539 | 0.8609
DRD | 0.6283 | 0.6589 | 0.6240 | 0.6364 | 0.6824 | 0.6623 | 0.6618 | 0.6713 | 0.6720
Yeast | 0.5564 | 0.5963 | 0.5731 | 0.5340 | 0.5851 | 0.5673 | 0.5770 | 0.5722 | 0.5725
Steel | 0.7017 | 0.7328 | 0.7135 | 0.7120 | 0.6527 | 0.5864 | 0.7110 | 0.7248 | 0.7350
IS | 0.9320 | 0.9529 | 0.9481 | 0.9608 | 0.9571 | 0.5762 | 0.7557 | 0.8559 | 0.9501
Abalone | 0.7170 | 0.7314 | 0.7254 | 0.7104 | 0.7323 | 0.5821 | 0.7129 | 0.6238 | 0.7306
Waveform21 | 0.7641 | 0.8369 | 0.7908 | 0.7843 | 0.8647 | 0.6779 | 0.8002 | 0.8363 | 0.8234
Page-blocks | 0.9651 | 0.9688 | 0.9681 | 0.9677 | 0.9499 | 0.9375 | 0.9419 | 0.9515 | 0.9580
Satellite | 0.8524 | 0.8863 | 0.8587 | 0.8592 | 0.8864 | 0.4890 | 0.8001 | 0.8929 | 0.8943
Clave | 0.7103 | 0.7600 | 0.7344 | 0.7779 | 0.7690 | 0.8223 | 0.8427 | 0.8187 | 0.8192
MAGIC | 0.8427 | 0.8531 | 0.8455 | 0.8488 | 0.8319 | 0.7347 | 0.7861 | 0.8574 | 0.8392
Average | 0.7651 | 0.7979 | 0.7744 | 0.7782 | 0.7909 | 0.6821 | 0.7648 | 0.7806 | 0.8038

TABLE III
AVERAGE BCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.
Dataset      CART    RF      JRip    PART    TSK-FCM-LSE  TSK-MBGD  TSK-MBGD-BN  TSK-MBGD-UR  TSK-MBGD-UR-BN
Vehicle      0.6936  0.7440  0.6939  0.7131  0.7443       0.7010    0.7380       0.7127       0.7930
Biodeg       0.7973  0.8306  0.7899  0.8122  0.8205       0.8368    0.8318       0.8390       0.8439
DRD          0.6340  0.6624  0.6227  0.6422  0.6845       0.6642    0.6634       0.6717       0.6729
Yeast        0.3998  0.4867  0.5203  0.4889  0.5102       0.4951    0.5184       0.4946       0.5332
Steel        0.7005  0.6937  0.7129  0.7267  0.6319       0.5933    0.7258       0.7245       0.7515
IS           0.9320  0.9529  0.9481  0.9607  0.9571       0.5762    0.7557       0.8559       0.9501
Abalone      0.5319  0.5362  0.5371  0.5280  0.5402       0.4567    0.5236       0.4791       0.5402
Waveform21   0.7637  0.8365  0.7905  0.7844  0.8645       0.6784    0.8003       0.8362       0.8233
Page-blocks  0.7986  0.7385  0.8192  0.8162  0.6003       0.5129    0.5609       0.6033       0.6710
Satellite    0.8204  0.8480  0.8308  0.8340  0.8558       0.4337    0.7651       0.8679       0.8700
Clave        0.4701  0.4878  0.4985  0.6507  0.4825       0.5876    0.6468       0.6374       0.6421
MAGIC        0.8058  0.8108  0.8052  0.8135  0.7886       0.6325    0.7128       0.8225       0.7934
Average      0.6956  0.7190  0.7140  0.7309  0.7067       0.5974    0.6869       0.7120       0.7404

TABLE IV RCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.
Dataset      CART    RF      JRip    PART    TSK-FCM-LSE  TSK-MBGD  TSK-MBGD-BN  TSK-MBGD-UR  TSK-MBGD-UR-BN
Vehicle      8       3       9       5       2            7         4            6            1
Biodeg       9       2       8       7       6            5         4            3            1
DRD          8       6       9       7       1            4         5            3            2
Yeast        8       1       4       9       2            7         3            6            5
Steel        7       2       4       5       8            9         6            3            1
IS           6       3       5       1       2            9         8            7            4
Abalone      5       2       4       7       1            9         6            8            3
Waveform21   8       2       6       7       1            9         5            3            4
Page-blocks  4       1       2       3       7            9         8            6            5
Satellite    7       4       6       5       3            9         8            2            1
Clave        9       7       8       5       6            2         1            4            3
MAGIC        5       2       4       3       7            9         8            1            6
Average      7.0     2.9     5.8     5.3     3.8          7.3       5.5          4.3          3.0

TABLE V BCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.
Dataset      CART    RF      JRip    PART    TSK-FCM-LSE  TSK-MBGD  TSK-MBGD-BN  TSK-MBGD-UR  TSK-MBGD-UR-BN
Vehicle      9       3       8       5       2            7         4            6            1
Biodeg       8       5       9       7       6            3         4            2            1
DRD          8       6       9       7       1            4         5            3            2
Yeast        9       8       2       7       4            5         3            6            1
Steel        6       7       5       2       8            9         3            4            1
IS           6       3       5       1       2            9         8            7            4
Abalone      5       4       3       6       1            9         7            8            2
Waveform21   8       2       6       7       1            9         5            3            4
Page-blocks  3       4       1       2       7            9         8            6            5
Satellite    7       4       6       5       3            9         8            2            1
Clave        9       7       6       1       8            5         2            4            3
MAGIC        4       3       5       2       7            9         8            1            6
Average      6.8     4.7     5.4     4.3     4.2          7.3       5.4          4.3          2.6

TABLE VI p-VALUES OF NON-PARAMETRIC MULTIPLE COMPARISONS ON THE RCAS AND BCAS.
Method          Metric  CART    RF      JRip    PART    TSK-FCM-LSE  TSK-MBGD  TSK-MBGD-BN  TSK-MBGD-UR
TSK-MBGD-BN     RCA     0.0097  0.0628  0.1368  0.3239  0.2723       0.0000    -            -
TSK-MBGD-BN     BCA     0.1547  0.1627  0.4508  0.0981  0.4518       0.0000    -            -
TSK-MBGD-UR     RCA     0.0001  0.3740  0.0090  0.0460  0.2452       0.0000    0.1146       -
TSK-MBGD-UR     BCA     0.0036  0.2900  0.0912  0.4420  0.0921       0.0000    0.0731       -
TSK-MBGD-UR-BN  RCA     0.0000  0.2113  0.0002  0.0025  0.0404       0.0000    0.0094       0.1409
TSK-MBGD-UR-BN  BCA     0.0000  0.0291  0.0021  0.0730  0.0022       0.0000    0.0013       0.0986

Fig. 4. (a) RCAs and (b) BCAs of the four MBGD-based TSK fuzzy classifiers on the 12 datasets. Datasets were sorted according to the RCAs of the TSK-MBGD model. The indices along the horizontal axis denote the dataset indices in Table I.

E. Statistical Analysis

To further evaluate the performance improvement of our proposed TSK-MBGD-UR-BN over others, we also performed non-parametric multiple comparison tests on the RCAs and BCAs using Dunn’s procedure [41], with a p-value correction using the False Discovery Rate method [42]. The results are shown in Table VI, where the statistically significant ones are marked in bold.

Table VI demonstrates that our proposed BN and UR can significantly improve the generalization performance of the traditional MBGD optimization for TSK fuzzy classifiers. TSK-MBGD-UR-BN statistically significantly outperformed CART, JRip, PART, TSK-MBGD and TSK-MBGD-BN on RCA, and also statistically significantly outperformed CART, JRip, TSK-FCM-LSE, TSK-MBGD and TSK-MBGD-BN on BCA. Although the performance improvements of TSK-MBGD-UR-BN over RF and TSK-MBGD-UR were not statistically significant, they were quite close to the threshold, especially for the BCA.

F. Effect of UR

As mentioned in Section II-B, using MBGD to optimize the TSK fuzzy system may face the “rich get richer” problem. To demonstrate this, Fig. 5 shows the average normalized firing levels of the rules on the entire dataset after the four MBGD-based TSK models were trained, on three representative datasets. For TSK-MBGD, a few “richest” rules had much larger average firing levels than others, and hence the rules contributed significantly differently to the output. BN may help alleviate this problem a little bit, as the average normalized rule firing levels in TSK-MBGD-BN were more uniform than those in TSK-MBGD, which also resulted in better classification performances, as demonstrated in the previous subsection. However, UR had the most direct effect on alleviating the “rich get richer” problem, as TSK-MBGD-UR (TSK-MBGD-UR-BN) had much more uniform average normalized rule firing levels than TSK-MBGD (TSK-MBGD-BN), and hence also better classification performance.

Note that we set τ = 1/C in (8), where C = 6 for Satellite, C = 4 for Vehicle, and C = 2 for Biodeg. However, the actual average normalized rule firing levels were not exactly τ on these datasets. Our experiments showed that although UR cannot guarantee the average normalized rule firing levels to be around τ, it can indeed make the rules fire more uniformly.

Why may making the rules fire more uniformly help improve the generalization performance? In [32] we pointed out that a TSK fuzzy system may be functionally equivalent to an adaptive stacking ensemble model, in which each rule can be viewed as a base learner, and the aggregation weights equal the corresponding rule firing levels. When the rule firing levels are more uniform, generally more rules are utilized in computing the output, i.e., more base learners are used in the stacking ensemble model, which may help improve the generalization performance.

To demonstrate this, we computed the entropy of the normalized rule firing levels for each input example:

$$E = -\sum_{r=1}^{R} f_r \log f_r, \qquad (16)$$

where $f_r$ is the normalized firing level of the r-th rule. Generally, a larger entropy means more rules were fired. Fig. 6 shows the histogram of the entropy distributions on the Satellite dataset. When training TSK fuzzy systems without UR, many samples had close to zero E, i.e., all except one rule had firing levels close to zero. When UR was added, the number of examples with close to zero E decreased significantly, i.e., more rules with larger firing levels were used in computing the output.

G. Effect of BN

We also used the Satellite dataset to analyze the effect of BN. We set the UR weight λ = 1 and recorded the training loss and test BCA in the first 20 training epochs. This process was repeated 10 times, and the average results are shown in Figs. 7(a) and 7(b), respectively. BN resulted in smaller training losses and better generalization performances in testing.

There is still no agreement on why, theoretically, BN is helpful in optimizing deep neural networks [24]; thus, it is also challenging to analyze theoretically why BN can help the optimization of TSK fuzzy systems.
Nevertheless, we performed an empirical study to peek into this, by recording the L1 norm of the antecedent parameters’ gradients and the L1 norm of the consequent parameters’ gradients in the first 20 training epochs on the Satellite dataset. The results are shown in Figs. 7(c) and 7(d), respectively. BN significantly increased the gradients of both antecedent and consequent parameters. With the same learning rate, this can expedite the convergence.

Fig. 5. Average normalized rule firing levels of TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN on (a) Satellite, (b) Vehicle, and (c) Biodeg datasets.

Fig. 6. Histogram of the normalized rule firing level entropy E of (a) TSK-MBGD and TSK-MBGD-UR, and, (b) TSK-MBGD-BN and TSK-MBGD-UR-BN, on the Satellite dataset.

Fig. 7. (a) Training loss, (b) test BCA, (c) L1 norm of the antecedent parameters’ gradients, and (d) L1 norm of the consequent parameters’ gradients, in the first 20 training epochs on the Satellite dataset. The horizontal axis starts from 3 epochs so that the differences among the curves can be more clearly visualized.

We also evaluated the performances of the two BN variants introduced in Section II-C. The BCAs of TSK-MBGD-UR, TSK-MBGD-UR-BN, TSK-MBGD-UR-GBN and TSK-MBGD-UR-RBN are shown in Table VII. On average, TSK-MBGD-UR-BN performed the best, and TSK-MBGD-UR-GBN the worst. Since TSK-MBGD-UR-RBN had more parameters to optimize, its training was not as stable as TSK-MBGD-UR-BN and TSK-MBGD-UR-GBN. Therefore, TSK-MBGD-UR-BN is the best choice.

TABLE VII AVERAGE BCAS OF THE THREE BN VARIANTS ON THE 12 DATASETS.
Dataset      TSK-MBGD-UR  TSK-MBGD-UR-BN  TSK-MBGD-UR-GBN  TSK-MBGD-UR-RBN
Vehicle      0.7127       0.7930          0.7261           0.7679
Biodeg       0.8390       0.8439          0.8422           0.8440
DRD          0.6717       0.6729          0.6636           0.6650
Yeast        0.4946       0.5332          0.4352           0.5339
Steel        0.7245       0.7515          0.7332           0.7219
IS           0.8559       0.9501          0.9115           0.8938
Abalone      0.4791       0.5402          0.4924           0.5275
Waveform21   0.8362       0.8233          0.8232           0.8334
Page-blocks  0.6033       0.6710          0.5912           0.6333
Satellite    0.8679       0.8700          0.8679           0.8216
Clave        0.6374       0.6421          0.6090           0.6442
MAGIC        0.8225       0.7934          0.8319           0.8318
Average      0.7121       0.7404          0.7106           0.7265

H. Effect of the Batch Size

The batch size is an important hyper-parameter in MBGD-based optimization. It determines the memory requirement and the convergence speed in training. A larger batch size leads to faster convergence but also requires more memory. In [43], the authors analyzed the effect of the batch size on the generalization performance. Their results showed that using a larger batch size causes degradation in the model generalization performance, because it tends to converge to a sharper minimum, which makes the model sensitive to noise. A similar finding was presented in [44], namely that a smaller batch size leads to more stable and reliable training. However, since we used the mean, standard deviation and mean firing level of each batch to compute the losses, too small a batch size may also lead to poor performance.

We validated our model on the Satellite dataset with batch size varying from 16 to 2,048. The test RCAs and BCAs averaged over 30 runs are shown in Fig. 8. The test performance decreased with too small or too large batch sizes. For TSK-MBGD-UR-BN, it seems that a batch size within [64, 256] is a good choice.

Fig. 8. Average RCAs and BCAs of TSK-MBGD-UR-BN on the Satellite dataset, using different batch sizes.

IV. CONCLUSIONS AND FUTURE RESEARCH

TSK fuzzy systems are powerful and frequently used machine learning models, for both regression and classification. However, they may not be easily applicable to large and/or high-dimensional datasets. Our very recent research [6] proposed an MBGD-based efficient and effective training algorithm (MBGD-RDA) for TSK fuzzy systems for regression problems. This paper has proposed an MBGD-based algorithm, TSK-MBGD-UR-BN, to train TSK fuzzy systems for classification problems. It can deal with both small and big data with different dimensionalities, and may be the only algorithm that can train a TSK fuzzy classifier on big and high-dimensional datasets. TSK-MBGD-UR-BN integrates two novel techniques, which are also first proposed in this paper:

1) UR, which is a regularization term in the loss function to ensure that all rules are fired similarly on average, and hence to improve the generalization performance.
2) BN, which normalizes the inputs in computing the rule consequents to speed up the convergence and to improve the generalization.

Experiments on 12 UCI datasets from various domains, with varying size and feature dimensionality, demonstrated that each of UR and BN has its own unique advantages, and integrating them can achieve the best classification performance. TSK-MBGD-UR-BN, together with MBGD-RDA proposed in [6], shall greatly promote the applications of TSK fuzzy systems in both classification and regression, especially for big data problems.

The proposed TSK-MBGD-UR-BN also has some limitations, which will be addressed in our future research. First, for very high dimensional data, fuzzy partitions of the input space become very complicated, and numeric underflow may happen when the product t-norm is used. Further research shall consider rules that automatically select the most relevant attributes as the antecedents. Second, we shall investigate how to improve the interpretability of data-driven TSK fuzzy systems. This is also partially linked to the first problem, as reducing the number of antecedents can improve the interpretability of the rules.

REFERENCES

[1] A.-T. Nguyen, T. Taniguchi, L. Eciolaza, V. Campos, R. Palhares, and M. Sugeno, “Fuzzy control systems: Past, present and future,” IEEE Computational Intelligence Magazine, vol. 14, no. 1, pp. 56–68, 2019.
[2] Y. Shi, R. Eberhart, and Y. Chen, “Implementation of evolutionary fuzzy systems,” IEEE Trans. on Fuzzy Systems, vol. 7, no. 2, pp. 109–119,
[3] D. Wu and W. W. Tan, “Genetic learning and performance evaluation of interval type-2 fuzzy logic controllers,” Engineering Applications of Artificial Intelligence, vol. 19, no. 8, pp. 829–841, 2006.
[4] L.-X. Wang and J. M. Mendel, “Back-propagation of fuzzy systems as nonlinear dynamic system identifiers,” in Proc. IEEE Int’l Conf. on Fuzzy Systems, San Diego, CA, Sep. 1992, pp. 1409–1418.
[5] J. S. R. Jang, “ANFIS: Adaptive-network-based fuzzy inference system,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665–685, 1993.
[6] D. Wu, Y. Yuan, J. Huang, and Y. Tan, “Optimize TSK fuzzy systems for big data regression problems: Mini-batch gradient descent with regularization, DropRule and AdaBound (MBGD-RDA),” IEEE Trans. on Fuzzy Systems, 2020, in press. [Online]. Available: https://arxiv.org/abs/1903.10951
[7] Y. Jin, “Fuzzy modeling of high-dimensional systems: complexity reduction and interpretability improvement,” IEEE Trans. on Fuzzy Systems, vol. 8, no. 2, pp. 212–221, 2000.
[8] Y. Deng, Z. Ren, Y. Kong, F. Bao, and Q. Dai, “A hierarchical fused fuzzy deep neural network for data classification,” IEEE Trans. on Fuzzy Systems, vol. 25, no. 4, pp. 1006–1012, 2016.
[9] F.-L. Chung, Z. Deng, and S. Wang, “From minimum enclosing ball to fast fuzzy inference system training on large datasets,” IEEE Trans. on Fuzzy Systems, vol. 17, no. 1, pp. 173–184, 2008.
[10] M. Nilashi, O. Bin Ibrahim, N. Ithnin, and N. H. Sarmin, “A multi-criteria collaborative filtering recommender system for the tourism domain using Expectation Maximization (EM) and PCA–ANFIS,” Electronic Commerce Research and Applications, vol. 14, no. 6, pp. 542–562,
[11] C. K. Lau, K. Ghosh, M. A. Hussain, and C. R. C. Hassan, “Fault diagnosis of Tennessee Eastman process with multi-scale PCA and ANFIS,” Chemometrics and Intelligent Laboratory Systems, vol. 120, pp. 1–14, 2013.
[12] Z. Deng, K.-S. Choi, Y. Jiang, J. Wang, and S. Wang, “A survey on soft subspace clustering,” Information Sciences, vol. 348, pp. 84–106, 2016.
[13] Z. Deng, K.-S. Choi, F.-L. Chung, and S. Wang, “Enhanced soft subspace clustering integrating within-cluster and between-cluster information,” Pattern Recognition, vol. 43, no. 3, pp. 767–781, 2010.
[14] M. J. Gacto, M. Galende, R. Alcalá, and F. Herrera, “METSK-HDe: A multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems,” Information Sciences, vol. 276, pp. 63–79, 2014.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Boston, MA: MIT press, 2016.
[16] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[17] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. Int’l Conf. on Computational Statistics. Paris, France: Springer, Aug. 2010, pp. 177–186.
[18] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proc. Int’l Conf. on Machine Learning, Atlanta, GA, Jun. 2013, pp. 1139–1147.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int’l Conf. on Learning Representations, San Diego, CA, May
[20] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The marginal value of adaptive gradient methods in machine learning,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 4148–4158.
[21] N. S. Keskar and R. Socher, “Improving generalization performance by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628, 2017.
[22] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” in Proc. Int’l Conf. on Learning Representations, New Orleans, LA, May 2019.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int’l Conf. on Machine Learning, Lille, France, Jul. 2015.
[24] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization?” in Proc. Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 2483–2493.
[25] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[26] L. Fan, “Revisit fuzzy neural network: Demystifying batch normalization and ReLU with generalized hamming network,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 1923–1932.
[27] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[28] Y. Wu and K. He, “Group normalization,” in Proc. European Conf. on Computer Vision, Munich, Germany, Sep. 2018, pp. 3–19.
[29] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87,
[30] H. Bersini and G. Bontempi, “Now comes the time to defuzzify neuro-fuzzy models,” Fuzzy Sets and Systems, vol. 90, no. 2, pp. 161–169,
[31] H. Andersen, A. Lotfi, and L. Westphal, “Comments on ‘functional equivalence between radial basis function networks and fuzzy inference systems’ [and author’s reply],” IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1529–1532, 1998.
[32] D. Wu, C.-T. Lin, J. Huang, and Z. Zeng, “On the functional equivalence of TSK fuzzy systems to neural networks, mixture of experts, CART, and stacking ensemble regression,” IEEE Trans. on Fuzzy Systems, 2020, in press. [Online]. Available: https://arxiv.org/abs/1903.10572
[33] T. Shen, M. Ott, M. Auli, and M. Ranzato, “Mixture models for diverse machine translation: Tricks of the trade,” arXiv preprint arXiv:1902.07816, 2019.
[34] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, Jun. 2016, pp. 770–778.
[36] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
[37] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, Jul. 2017, pp. 4700–4708.
[38] E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proc. Int’l Conf. on Machine Learning, San Francisco, CA, Jul. 1998.
[39] W. W. Cohen, “Repeated incremental pruning to produce error reduction,” in Proc. Int’l Conf. on Machine Learning, Tahoe City, CA, Jun.
[40] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, “Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence,” IEEE Trans. on Automatic Control, vol. 42, no. 10, pp. 1482–1484, 1997.
[41] O. J. Dunn, “Multiple comparisons using rank sums,” Technometrics, vol. 6, no. 3, pp. 241–252, 1964.
[42] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289–300, 1995.
[43] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. Int’l Conf. on Learning Representations, Toulon, France, Apr. 2017.
[44] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” arXiv preprint arXiv:1804.07612, 2018.
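As a supplementary illustration of the two performance measures defined in Section III-C, the sketch below computes the RCA (correctly classified samples divided by all samples) and the BCA (mean of the per-class RCAs) from true and predicted labels. The function names `rca` and `bca` and the toy labels are hypothetical; this is a minimal sketch, not the evaluation code used in the experiments.

```python
from collections import defaultdict


def rca(y_true, y_pred):
    """Raw classification accuracy: correct samples / all samples."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def bca(y_true, y_pred):
    """Balanced classification accuracy: mean of the per-class RCAs."""
    total = defaultdict(int)  # number of samples per true class
    hit = defaultdict(int)    # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    per_class = [hit[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


# Toy imbalanced problem (hypothetical data): 8 samples of class 0,
# 2 samples of class 1, and a classifier that always predicts class 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
print(rca(y_true, y_pred))  # 0.8
print(bca(y_true, y_pred))  # 0.5
```

On this toy example, always predicting the majority class gives an RCA of 0.8 but a BCA of only 0.5, which illustrates why the BCA is reported alongside the RCA for datasets with significant class imbalance.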

Journal

Statistics, arXiv (Cornell University)

Published: Aug 1, 2019
