Statistics, Volume 2020 (1908) – Aug 1, 2019

- ISSN: 1063-6706
- eISSN: ARCH-3347
- DOI: 10.1109/TFUZZ.2020.2967282

Optimize TSK Fuzzy Systems for Classification Problems: Mini-Batch Gradient Descent with Uniform Regularization and Batch Normalization

Yuqi Cui, Dongrui Wu and Jian Huang

arXiv:1908.00636v3 [cs.LG] 9 Jan 2020

Y. Cui, D. Wu and J. Huang are with the Key Laboratory of the Ministry of Education for Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China. Email: yqcui@hust.edu.cn, drwu@hust.edu.cn, huang_jan@hust.edu.cn. D. Wu and J. Huang are the corresponding authors.

Abstract: Takagi-Sugeno-Kang (TSK) fuzzy systems are flexible and interpretable machine learning models; however, they may not be easily optimized when the data size is large, and/or the data dimensionality is high. This paper proposes a mini-batch gradient descent (MBGD) based algorithm to efficiently and effectively train TSK fuzzy classifiers. It integrates two novel techniques: 1) uniform regularization (UR), which forces the rules to have similar average contributions to the output, and hence increases the generalization performance of the TSK classifier; and, 2) batch normalization (BN), which extends BN from deep neural networks to TSK fuzzy classifiers to expedite the convergence and improve the generalization performance. Experiments on 12 UCI datasets from various application domains, with varying size and dimensionality, demonstrated that UR and BN are effective individually, and integrating them can further improve the classification performance.

Index Terms: Batch normalization, mini-batch gradient descent, TSK fuzzy classifier, uniform regularization

I. INTRODUCTION

Takagi-Sugeno-Kang (TSK) fuzzy systems [1] have achieved great success in numerous applications, including both classification and regression problems. Many optimization approaches have been proposed for them.

There are generally three strategies for fine-tuning the TSK fuzzy system parameters after initialization: 1) evolutionary algorithms [2], [3]; 2) gradient descent (GD) based algorithms [4]; and, 3) GD plus least squares estimation (LSE), represented by the popular adaptive-network-based fuzzy inference system (ANFIS) [5]. However, these approaches may have challenges when the size and/or the dimensionality of the data increase. Evolutionary algorithms need to keep a large population of candidate solutions, and evaluate the fitness of each, which results in high computational cost and heavy memory requirement for big data. Traditional GD needs to compute the gradients from the entire dataset to iteratively update the model parameters, which may be very slow, or even impossible, when the data size is very large. The memory requirement and computational cost of LSE also increase rapidly when the data size and/or dimensionality increase. Additionally, as shown in [6], ANFIS may result in significant overfitting in regression problems.

Much effort has been spent on tackling the difficulty in optimizing TSK fuzzy systems on big and/or high-dimensional data [7]–[9]. Dimensionality reduction and/or feature selection are usually used to reduce the number of fuzzy partitions (rules). Traditional dimensionality reduction techniques such as principal component analysis (PCA) have been used for TSK fuzzy system optimization [10], [11]. There are also methods focusing on learning a sparse subspace of the original feature space to reduce the number of antecedents in each rule [12], [13]. Once the number of antecedents is determined, different optimization approaches can be used to tune the TSK fuzzy system on large datasets. For example, Chung et al. [9] utilized the equivalence between minimum enclosing ball and the Mamdani-Larsen fuzzy inference system to train the latter using the former. Gacto et al. [14] proposed a multi-objective evolutionary algorithm to optimize TSK fuzzy systems for high-dimensional large-scale regression problems.

Mini-batch gradient descent (MBGD) [15], [16] based optimization, which is particularly popular in deep learning, can also be a solution to training TSK fuzzy systems on large and high-dimensional datasets. In each iteration, MBGD computes the gradients from a randomly selected small batch of data, instead of the entire dataset [17]. Different batch sizes can be used, according to the trade-off among the available memory, the training speed, and the expected generalization performance. The original MBGD used a constant learning rate to update the model's parameters [17]. Later, Sutskever et al. [18] found that adding a momentum to MBGD can improve the final training performance. However, it still needs a manually selected learning rate, and the convergence may be very slow at the beginning. Kingma and Ba [19] proposed the well-known Adam algorithm to automatically rescale the gradients to achieve an adaptive and individualized learning rate for each parameter, which leads to faster convergence. However, the generalization performance of Adam may not be as good as the momentum [20]; so, Keskar and Socher [21] tried to combine the advantages of momentum and Adam to achieve both fast convergence and good generalization. Recently, Luo et al. [22] also proposed AdaBound to improve Adam. AdaBound uses an adaptive bound for the learning rate of each parameter to force the optimizer to behave like Adam at the beginning and like stochastic GD at the end. Our very recent research [6] has found that TSK fuzzy systems can achieve better performance with AdaBound than Adam for regression problems.

Although MBGD-based optimization has many advantages, it may be easily trapped into a local minimum, and may face the gradient vanishing problem. Many other techniques have been proposed to complement MBGD for better performance. In 2015, Ioffe and Szegedy [23] proposed the well-known batch normalization (BN) approach to accelerate the training of deep neural networks by reducing the internal covariate shift (see Footnote 1). BN normalizes the input distribution of each layer, so it also alleviates the gradient vanishing problem. It has been used almost ubiquitously in deep learning, and many variants [25]–[28] have also been proposed.

Footnote 1: Recently some researchers had different opinions on why BN works. For example, Santurkar et al. [24] argued that BN may not reduce the internal covariate shift; instead, it helps improve the Lipschitzness of both the loss and the gradients, and also reduces the dependency on the training hyper-parameters, such as the learning rate and the regularization weights.

This paper, following our previous research [6] on MBGD-based optimization of TSK fuzzy systems for regression problems, considers classification problems. We use AdaBound, as in [6], to adjust the learning rates. Additionally, we propose two novel techniques for training TSK fuzzy systems for classification problems, namely, uniform regularization (UR) and BN. Our main contributions are:

1) We introduce a novel UR term to the cross-entropy loss function in training TSK fuzzy classifiers, which forces all rules to have similar average firing levels on the entire dataset. Experiments show that UR can improve the generalization performance of TSK fuzzy classifiers.
2) We extend BN from the training of deep neural networks to the training of TSK fuzzy classifiers, and show that it can speed up the convergence in training and improve the generalization performance in testing.
3) We further integrate UR and BN, and show that the combined approach outperforms each individual one.

The remainder of this paper is organized as follows: Section II introduces the proposed UR and BN approaches. Section III presents the experimental results to validate the performances of UR and BN. Section IV draws conclusions and points out some future research directions.

II. UR AND BN

This section introduces the details of the TSK fuzzy classifier under consideration, our proposed UR for regularizing the loss function, and BN for more efficient and effective training of the TSK fuzzy classifier. A Python implementation of our algorithm can be downloaded at https://github.com/YuqiCui/TSK_BN_UR.

A. The TSK Fuzzy Classifier

Let the training dataset be $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$, in which $\mathbf{x}_n = [x_{n,1}, ..., x_{n,D}]^T \in \mathbb{R}^{D \times 1}$ is a $D$-dimensional feature vector, and $y_n \in \{1, 2, ..., C\}$ the corresponding class label for a $C$-class classification problem.

Suppose the TSK fuzzy classifier has $R$ rules, in the following form:

$$\text{Rule}_r:\ \text{IF } x_1 \text{ is } X_{r,1} \text{ and } \cdots \text{ and } x_D \text{ is } X_{r,D},\ \text{THEN } y_r^1(\mathbf{x}) = b_{r,0}^1 + \sum_{d=1}^{D} b_{r,d}^1 \cdot x_d \text{ and } \cdots \text{ and } y_r^C(\mathbf{x}) = b_{r,0}^C + \sum_{d=1}^{D} b_{r,d}^C \cdot x_d \tag{1}$$

where $X_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) is the membership function (MF) for the $d$-th antecedent in the $r$-th rule, and $b_{r,0}^c$ and $b_{r,d}^c$ ($c = 1, ..., C$) are the consequent parameters for the $c$-th class.

Different types of MFs can be used in our algorithm, as long as they are differentiable. For simplicity, Gaussian MFs are considered in this paper, and the membership grade of $x_d$ on $X_{r,d}$ is:

$$\mu_{X_{r,d}}(x_d) = \exp\left(-\frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right), \tag{2}$$

where $m_{r,d}$ and $\sigma_{r,d}$ are the center and the standard deviation of the Gaussian MF, respectively.

The output of the TSK fuzzy classifier for the $c$-th class is:

$$y^c(\mathbf{x}) = \frac{\sum_{r=1}^{R} f_r(\mathbf{x})\, y_r^c(\mathbf{x})}{\sum_{r=1}^{R} f_r(\mathbf{x})}, \tag{3}$$

where

$$f_r(\mathbf{x}) = \prod_{d=1}^{D} \mu_{X_{r,d}}(x_d) = \exp\left(-\sum_{d=1}^{D} \frac{(x_d - m_{r,d})^2}{2\sigma_{r,d}^2}\right) \tag{4}$$

is the firing level of Rule $r$. We can also re-write (3) as:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(\mathbf{x})\, y_r^c(\mathbf{x}), \tag{5}$$

where

$$\bar{f}_r(\mathbf{x}) = \frac{f_r(\mathbf{x})}{\sum_{i=1}^{R} f_i(\mathbf{x})} \tag{6}$$

is the normalized firing level of Rule $r$.

Once the output vector $\mathbf{y}(\mathbf{x}) = [y^1(\mathbf{x}), ..., y^C(\mathbf{x})]^T$ is obtained, the input $\mathbf{x}$ is assigned to the class with the largest $y^c(\mathbf{x})$.

To optimize the TSK fuzzy classifier, we need to fine-tune the antecedent MF parameters $m_{r,d}$ and $\sigma_{r,d}$, and the consequent parameters $b_{r,0}^c$ and $b_{r,d}^c$, where $r = 1, ..., R$, $d = 1, ..., D$, and $c = 1, ..., C$.

B. Uniform Regularization (UR)

Mixture of experts (MoE) [29], which is functionally equivalent to TSK fuzzy systems [30]–[32], is a popular machine learning algorithm. Its model is shown in Fig. 1. It trains multiple local experts, each taking care of only a small local region of the input space. For a new input, the gating network determines the activations (weights) of the local experts, and the final output is a weighted average of the local expert outputs.

Fig. 1. Mixture of experts (MoE) [29].

Although MoE has been used successfully in many applications, it may suffer from the "rich get richer" effect [33], [34]: once an expert is slightly better than others, it is always picked by the gating network, whereas other experts starve and are rarely used. This is bad for the generalization performance of the overall model.

Since MoE and TSK fuzzy systems are functionally equivalent [32], TSK fuzzy systems may also suffer from the "rich get richer" effect, i.e., only a few rules are always activated with large firing levels, whereas others have very small firing levels, and hence are not adequately tuned in training.

A remedy to the "rich get richer" effect in TSK fuzzy systems is to force the rules to be fired at similar degrees in the input space, so that each rule contributes about equally to the output. Next, we propose UR to achieve this goal. UR forces the rules to have similar average firing levels, by minimizing the following loss:

$$\ell_{UR} = \sum_{r=1}^{R} \left(\frac{1}{N}\sum_{n=1}^{N} \bar{f}_r(\mathbf{x}_n) - \tau\right)^2, \tag{7}$$

where $N$ is the number of training examples, and $\tau$ the expected firing level of each rule, which is set to $1/C$ in this paper (recall that $C$ is the number of classes).

$\ell_{UR}$ can then be added to the original loss function in MBGD-based training of TSK fuzzy classifiers, i.e., for each mini-batch with $N$ training samples,

$$L = \ell + \alpha \ell_2 + \lambda \sum_{r=1}^{R} \left(\frac{1}{N}\sum_{n=1}^{N} \bar{f}_r(\mathbf{x}_n) - \tau\right)^2, \tag{8}$$

where $\ell$ is the cross-entropy loss between the estimated class probabilities [obtained by applying softmax to $\mathbf{y}(\mathbf{x})$] and the true class probabilities, $\ell_2$ the L2 regularization of the rule consequent parameters, and $\alpha$ and $\lambda$ the trade-off parameters.

C. Batch Normalization (BN)

BN [23] is a very powerful technique in optimizing deep neural networks [35]–[37]. It normalizes the data distribution in each mini-batch to accelerate the training. For a mini-batch $\mathcal{B} = \{\mathbf{x}_n\}_{n=1}^{N}$, the output of BN is [23]:

$$\hat{x}_n = BN(x_n) = \gamma \frac{x_n - m_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} + \beta, \tag{9}$$

where $m_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and the standard deviation of the samples in the mini-batch, respectively, $\gamma$ and $\beta$ are parameters to be learned during training, and $\epsilon$ is usually set to 1e-8 to avoid division by zero. During training, exponential weighted averages of $m_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are recorded so that they can be used in the test phase.

Since TSK fuzzy systems and neural networks share lots of similarity [32], we can extend BN to the optimization of TSK fuzzy classifiers, as shown in Fig. 2. In the training phase, we first compute the firing level of each rule using the unmodified inputs, as in traditional TSK fuzzy systems. Then, we use BN to normalize the inputs, according to their mean and standard deviation in the current mini-batch. The normalized inputs are then used to compute the rule consequents. The final output is a weighted average of the rule consequents, the weights being the corresponding rule firing levels.

Fig. 2. BN in training a TSK fuzzy classifier. All rule consequents share the same BN layer.

At the testing phase, the BN operation can be merged into the consequent layer. Assume that after training, we obtain a BN layer with learned $\mathbf{m} = (m_1, ..., m_D)^T$, $\boldsymbol{\sigma} = (\sigma_1, ..., \sigma_D)^T$, $\gamma$ and $\beta$. Then, the output $y_r$ of the $r$-th rule with BN is:

$$y_r(BN(\mathbf{x}_n)) = b_{r,0} + \gamma \sum_{d=1}^{D} b_{r,d} \frac{x_{n,d} - m_d}{\sqrt{\sigma_d^2 + \epsilon}} + \beta D, \tag{10}$$

which can be re-written as:

$$y_r(BN(\mathbf{x}_n)) = b'_{r,0} + \sum_{d=1}^{D} b'_{r,d}\, x_{n,d}, \tag{11}$$

where

$$b'_{r,0} = b_{r,0} + \beta D - \gamma \sum_{d=1}^{D} \frac{m_d\, b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}, \tag{12}$$

$$b'_{r,d} = \gamma \frac{b_{r,d}}{\sqrt{\sigma_d^2 + \epsilon}}. \tag{13}$$

By doing this, the original architecture of the TSK fuzzy classifier is kept unchanged.

We also tested two variants of BN, as shown in Fig. 3. The TSK with global BN (TSK-MBGD-UR-GBN) approach in Fig. 3(a) uses the BN normalized inputs in both antecedents and consequents to compute the final output. In this case, the output of TSK-MBGD-UR-GBN for Class $c$ is:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(BN(\mathbf{x}))\, y_r^c(BN(\mathbf{x})). \tag{14}$$

The TSK with rule-specific BN (TSK-MBGD-UR-RBN) approach in Fig. 3(b) uses the raw inputs to compute the antecedents, and rule-specific BN to compute each consequent individually. The output of TSK-MBGD-UR-RBN for Class $c$ is:

$$y^c(\mathbf{x}) = \sum_{r=1}^{R} \bar{f}_r(\mathbf{x})\, y_r^c(BN_r(\mathbf{x})), \tag{15}$$

where $BN_r$ represents the BN operation for the $r$-th rule.

TSK-MBGD-UR-GBN has the same computational cost as TSK-MBGD-UR-BN, but TSK-MBGD-UR-RBN has $R$ times more BN parameters, and hence higher computational cost. Both of them can be re-expressed in the original TSK architecture. We also evaluate their performances in Section III-G.

Fig. 3. (a) TSK fuzzy system with global BN (TSK-MBGD-UR-GBN); and, (b) TSK fuzzy system with rule-specific BN (TSK-MBGD-UR-RBN).

III. EXPERIMENTS AND RESULTS

This section validates the performances of our proposed UR and BN on multiple datasets from various application domains, with varying size and feature dimensionality.

A. Datasets

We evaluated our proposed algorithms on 12 classification datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). Their characteristics are summarized in Table I. For each dataset, we randomly selected 70% of the samples as the training set and the remaining 30% as the test set, 30 times, to get 30 different data splits. We ran each algorithm on these 30 data splits and report the average performance.

TABLE I. SUMMARY OF THE 12 DATASETS.

| Index | Dataset | No. of Samples | No. of Features | No. of Classes |
|---|---|---|---|---|
| 1 | Vehicle | 846 | 18 | 4 |
| 2 | Biodeg | 1,055 | 41 | 2 |
| 3 | DRD | 1,151 | 19 | 2 |
| 4 | Yeast | 1,484 | 8 | 10 |
| 5 | Steel | 1,941 | 27 | 7 |
| 6 | IS | 2,310 | 19 | 7 |
| 7 | Abalone | 4,177 | 10 | 3 |
| 8 | Waveform21 | 5,000 | 21 | 3 |
| 9 | Page-blocks | 5,473 | 10 | 5 |
| 10 | Satellite | 6,435 | 36 | 6 |
| 11 | Clave | 10,798 | 16 | 4 |
| 12 | MAGIC | 19,020 | 10 | 2 |

Dataset sources:
- Vehicle: https://archive.ics.uci.edu/ml/datasets/Statlog+%28Vehicle+Silhouettes%29
- Biodeg: https://archive.ics.uci.edu/ml/datasets/QSAR+biodegradation
- DRD: https://archive.ics.uci.edu/ml/datasets/Diabetic+Retinopathy+Debrecen+Data+Set
- Yeast: https://archive.ics.uci.edu/ml/datasets/Yeast
- Steel: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults
- IS: https://archive.ics.uci.edu/ml/datasets/Image+Segmentation
- Abalone: https://archive.ics.uci.edu/ml/datasets/Abalone
- Waveform21: https://archive.ics.uci.edu/ml/datasets/Waveform+Database+Generator+(Version+1)
- Page-blocks: https://archive.ics.uci.edu/ml/datasets/Page+Blocks+Classification
- Satellite: https://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)
- Clave: https://archive.ics.uci.edu/ml/datasets/Firm-Teacher_Clave-Direction_Classification
- MAGIC: https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope

Some datasets contain both numerical features and categorical features. The categorical features were converted into numerical ones by one-hot coding. We z-normalized each feature using the mean and standard deviation computed from the training set.

B. Algorithms

We compared nine algorithms to validate our proposed approaches. Among them, four were tree based approaches (DT, RF, PART, and JRip), one was a TSK fuzzy system optimized by a traditional approach (TSK-FCM-LSE), and the remaining four were TSK fuzzy systems optimized by MBGD based approaches (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, TSK-MBGD-UR-BN).

The details of these nine algorithms are as follows:

1) DT (denoted CART in the tables): Decision tree implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). We used 5-fold cross-validation to select the maximum depth of the tree from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
2) RF: Random forest implemented in scikit-learn in Python (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). We set the number of trees to 20 and used 5-fold cross-validation to select the maximum depth of the trees from {3, 4, 5, 6, 7} on the training set. Other parameters were set by default.
3) PART [38]: The PART (partial decision tree) classifier implemented in RWeka (https://cran.r-project.org/web/packages/RWeka/index.html). All parameters were set by default.
4) JRip [39]: The RIPPER (Repeated Incremental Pruning to Produce Error Reduction) classifier implemented in RWeka. All parameters were set by default.
5) TSK-FCM-LSE [40]: We used fuzzy c-means (FCM) clustering to estimate the antecedent parameters, and LSE with L2 regularization to estimate the consequent parameters.
6) TSK-MBGD: We used MBGD and AdaBound [22] to optimize both the antecedent and the consequent parameters.
7) TSK-MBGD-UR: We used MBGD, AdaBound and UR (Section II-B) to optimize both the antecedent and the consequent parameters. The UR weight $\lambda$ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.
8) TSK-MBGD-BN: We used MBGD, AdaBound and BN (Section II-C) to optimize both the antecedent and the consequent parameters.
9) TSK-MBGD-UR-BN: We used MBGD, AdaBound, BN and UR to optimize both the antecedent and the consequent parameters. The UR weight $\lambda$ in (8) was selected from {0.1, 1, 10, 20, 50} by cross-validation on the training set.

For TSK-FCM-LSE, TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the L2 regularization weight $\alpha = 0.05$, and the number of rules $R = 20$. For TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN, we set the learning rate of AdaBound to 0.01, following our previous work [6]. In order to make use of all data in the training set and to reduce overfitting simultaneously, we randomly sampled 20% data from the training set and trained the TSK model with early stopping five times. The maximum epoch number was 2,000, and the patience of early stopping 40. We recorded the number of epochs at stopping in each run, and trained the final model with the average stopping epoch number on the entire training set.

k-means clustering was used in the MBGD-based algorithms (TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR, and TSK-MBGD-UR-BN) to initialize the antecedent parameters. We performed k-means clustering on the training set, where $k$ equaled $R$, the number of rules. We then initialized the rule centers to the cluster centers, and randomly initialized the standard deviation $\sigma_{r,d}$ from a Gaussian distribution $N(1, 0.2)$. For the consequent parameters, we set the initial bias of each rule to zero, and the attribute weight $b_{r,d}$ ($r = 1, ..., R$; $d = 1, ..., D$) randomly from a uniform distribution $U(-1, 1)$.

C. Performance Measures

The raw classification accuracy (RCA), which is the total number of correctly classified test samples divided by the total number of test samples, was used as our primary performance measure.

Since some datasets have significant class imbalance, in addition to the RCA, we also computed the balanced classification accuracy (BCA), which is the mean of the per-class RCAs, as our second performance measure.

D. Experimental Results

The average test RCAs and BCAs are shown in Tables II and III, respectively. The largest value (best performance) on each dataset is marked in bold. To facilitate the comparison, we also show the ranks of the RCAs and BCAs in Tables IV and V, respectively.

TABLE II. AVERAGE RCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
|---|---|---|---|---|---|---|---|---|---|
| Vehicle | 0.6907 | 0.7407 | 0.6892 | 0.7110 | 0.7411 | 0.6970 | 0.7354 | 0.7089 | **0.7907** |
| Biodeg | 0.8202 | 0.8572 | 0.8222 | 0.8362 | 0.8377 | 0.8523 | 0.8531 | 0.8539 | **0.8609** |
| DRD | 0.6283 | 0.6589 | 0.6240 | 0.6364 | **0.6824** | 0.6623 | 0.6618 | 0.6713 | 0.6720 |
| Yeast | 0.5564 | **0.5963** | 0.5731 | 0.5340 | 0.5851 | 0.5673 | 0.5770 | 0.5722 | 0.5725 |
| Steel | 0.7017 | 0.7328 | 0.7135 | 0.7120 | 0.6527 | 0.5864 | 0.7110 | 0.7248 | **0.7350** |
| IS | 0.9320 | 0.9529 | 0.9481 | **0.9608** | 0.9571 | 0.5762 | 0.7557 | 0.8559 | 0.9501 |
| Abalone | 0.7170 | 0.7314 | 0.7254 | 0.7104 | **0.7323** | 0.5821 | 0.7129 | 0.6238 | 0.7306 |
| Waveform21 | 0.7641 | 0.8369 | 0.7908 | 0.7843 | **0.8647** | 0.6779 | 0.8002 | 0.8363 | 0.8234 |
| Page-blocks | 0.9651 | **0.9688** | 0.9681 | 0.9677 | 0.9499 | 0.9375 | 0.9419 | 0.9515 | 0.9580 |
| Satellite | 0.8524 | 0.8863 | 0.8587 | 0.8592 | 0.8864 | 0.4890 | 0.8001 | 0.8929 | **0.8943** |
| Clave | 0.7103 | 0.7600 | 0.7344 | 0.7779 | 0.7690 | 0.8223 | **0.8427** | 0.8187 | 0.8192 |
| MAGIC | 0.8427 | 0.8531 | 0.8455 | 0.8488 | 0.8319 | 0.7347 | 0.7861 | **0.8574** | 0.8392 |
| Average | 0.7651 | 0.7979 | 0.7744 | 0.7782 | 0.7909 | 0.6821 | 0.7648 | 0.7806 | 0.8038 |

TABLE III. AVERAGE BCAS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
|---|---|---|---|---|---|---|---|---|---|
| Vehicle | 0.6936 | 0.7440 | 0.6939 | 0.7131 | 0.7443 | 0.7010 | 0.7380 | 0.7127 | **0.7930** |
| Biodeg | 0.7973 | 0.8306 | 0.7899 | 0.8122 | 0.8205 | 0.8368 | 0.8318 | 0.8390 | **0.8439** |
| DRD | 0.6340 | 0.6624 | 0.6227 | 0.6422 | **0.6845** | 0.6642 | 0.6634 | 0.6717 | 0.6729 |
| Yeast | 0.3998 | 0.4867 | 0.5203 | 0.4889 | 0.5102 | 0.4951 | 0.5184 | 0.4946 | **0.5332** |
| Steel | 0.7005 | 0.6937 | 0.7129 | 0.7267 | 0.6319 | 0.5933 | 0.7258 | 0.7245 | **0.7515** |
| IS | 0.9320 | 0.9529 | 0.9481 | **0.9607** | 0.9571 | 0.5762 | 0.7557 | 0.8559 | 0.9501 |
| Abalone | 0.5319 | 0.5362 | 0.5371 | 0.5280 | **0.5402** | 0.4567 | 0.5236 | 0.4791 | **0.5402** |
| Waveform21 | 0.7637 | 0.8365 | 0.7905 | 0.7844 | **0.8645** | 0.6784 | 0.8003 | 0.8362 | 0.8233 |
| Page-blocks | 0.7986 | 0.7385 | **0.8192** | 0.8162 | 0.6003 | 0.5129 | 0.5609 | 0.6033 | 0.6710 |
| Satellite | 0.8204 | 0.8480 | 0.8308 | 0.8340 | 0.8558 | 0.4337 | 0.7651 | 0.8679 | **0.8700** |
| Clave | 0.4701 | 0.4878 | 0.4985 | **0.6507** | 0.4825 | 0.5876 | 0.6468 | 0.6374 | 0.6421 |
| MAGIC | 0.8058 | 0.8108 | 0.8052 | 0.8135 | 0.7886 | 0.6325 | 0.7128 | **0.8225** | 0.7934 |
| Average | 0.6956 | 0.7190 | 0.7140 | 0.7309 | 0.7067 | 0.5974 | 0.6869 | 0.7120 | 0.7404 |

TABLE IV. RCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
|---|---|---|---|---|---|---|---|---|---|
| Vehicle | 8 | 3 | 9 | 5 | 2 | 7 | 4 | 6 | 1 |
| Biodeg | 9 | 2 | 8 | 7 | 6 | 5 | 4 | 3 | 1 |
| DRD | 8 | 6 | 9 | 7 | 1 | 4 | 5 | 3 | 2 |
| Yeast | 8 | 1 | 4 | 9 | 2 | 7 | 3 | 6 | 5 |
| Steel | 7 | 2 | 4 | 5 | 8 | 9 | 6 | 3 | 1 |
| IS | 6 | 3 | 5 | 1 | 2 | 9 | 8 | 7 | 4 |
| Abalone | 5 | 2 | 4 | 7 | 1 | 9 | 6 | 8 | 3 |
| Waveform21 | 8 | 2 | 6 | 7 | 1 | 9 | 5 | 3 | 4 |
| Page-blocks | 4 | 1 | 2 | 3 | 7 | 9 | 8 | 6 | 5 |
| Satellite | 7 | 4 | 6 | 5 | 3 | 9 | 8 | 2 | 1 |
| Clave | 9 | 7 | 8 | 5 | 6 | 2 | 1 | 4 | 3 |
| MAGIC | 5 | 2 | 4 | 3 | 7 | 9 | 8 | 1 | 6 |
| Average | 7.0 | 2.9 | 5.8 | 5.3 | 3.8 | 7.3 | 5.5 | 4.3 | 3.0 |

TABLE V. BCA RANKS OF THE NINE ALGORITHMS ON THE 12 DATASETS.

| Dataset | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR | TSK-MBGD-UR-BN |
|---|---|---|---|---|---|---|---|---|---|
| Vehicle | 9 | 3 | 8 | 5 | 2 | 7 | 4 | 6 | 1 |
| Biodeg | 8 | 5 | 9 | 7 | 6 | 3 | 4 | 2 | 1 |
| DRD | 8 | 6 | 9 | 7 | 1 | 4 | 5 | 3 | 2 |
| Yeast | 9 | 8 | 2 | 7 | 4 | 5 | 3 | 6 | 1 |
| Steel | 6 | 7 | 5 | 2 | 8 | 9 | 3 | 4 | 1 |
| IS | 6 | 3 | 5 | 1 | 2 | 9 | 8 | 7 | 4 |
| Abalone | 5 | 4 | 3 | 6 | 1 | 9 | 7 | 8 | 2 |
| Waveform21 | 8 | 2 | 6 | 7 | 1 | 9 | 5 | 3 | 4 |
| Page-blocks | 3 | 4 | 1 | 2 | 7 | 9 | 8 | 6 | 5 |
| Satellite | 7 | 4 | 6 | 5 | 3 | 9 | 8 | 2 | 1 |
| Clave | 9 | 7 | 6 | 1 | 8 | 5 | 2 | 4 | 3 |
| MAGIC | 4 | 3 | 5 | 2 | 7 | 9 | 8 | 1 | 6 |
| Average | 6.8 | 4.7 | 5.4 | 4.3 | 4.2 | 7.3 | 5.4 | 4.3 | 2.6 |

The following observations can be made from the above four tables:

1) Generally, UR improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-UR, and TSK-MBGD-BN with TSK-MBGD-UR-BN, we can conclude that generally UR improved the classification performance, regardless of whether BN was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-UR (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-BN).
2) Generally, BN improved both RCA and BCA. Comparing TSK-MBGD with TSK-MBGD-BN, and TSK-MBGD-UR with TSK-MBGD-UR-BN, we can conclude that generally BN improved the classification performance, regardless of whether UR was used or not. The average ranks in the last row of Tables IV and V demonstrate this more clearly: the average rank of TSK-MBGD-BN (TSK-MBGD-UR-BN) was smaller than that of TSK-MBGD (TSK-MBGD-UR).
3) Generally, integrating BN and UR achieved further RCA and BCA improvements. Comparing TSK-MBGD-UR-BN with TSK-MBGD, TSK-MBGD-UR and TSK-MBGD-BN, we can conclude that TSK-MBGD-UR-BN almost always performed the best on both RCA and BCA, as shown in Fig. 4. This indicates that BN and UR are somehow complementary, and hence integrating them may achieve better performance than using each one alone.
4) Overall, TSK-MBGD-UR-BN achieved the best performance among the nine algorithms. The last row of Table V shows that TSK-MBGD-UR-BN achieved the best average BCA performance, and the last row of Table IV shows that TSK-MBGD-UR-BN achieved the second best average RCA performance. Interestingly, RF had the best average rank on RCA, but ranked only fifth on BCA, suggesting that RF may tend to overlook the minority classes. On the contrary, TSK-MBGD-UR-BN performed well on both RCA and BCA.

Fig. 4. (a) RCAs and (b) BCAs of the four MBGD-based TSK fuzzy classifiers on the 12 datasets. Datasets were sorted according to the RCAs of the TSK-MBGD model. The indices along the horizontal axis denote the dataset indices in Table I.

E. Statistical Analysis

To further evaluate the performance improvement of our proposed TSK-MBGD-UR-BN over others, we also performed non-parametric multiple comparison tests on the RCAs and BCAs using Dunn's procedure [41], with a p-value correction using the False Discovery Rate method [42]. The results are shown in Table VI, where the statistically significant ones are marked in bold.

TABLE VI. p-VALUES OF NON-PARAMETRIC MULTIPLE COMPARISONS ON THE RCAS AND BCAS.

| Algorithm | Metric | CART | RF | JRip | PART | TSK-FCM-LSE | TSK-MBGD | TSK-MBGD-BN | TSK-MBGD-UR |
|---|---|---|---|---|---|---|---|---|---|
| TSK-MBGD-BN | RCA | **0.0097** | 0.0628 | 0.1368 | 0.3239 | 0.2723 | **0.0000** | - | - |
| TSK-MBGD-BN | BCA | 0.1547 | 0.1627 | 0.4508 | 0.0981 | 0.4518 | **0.0000** | - | - |
| TSK-MBGD-UR | RCA | **0.0001** | 0.3740 | **0.0090** | 0.0460 | 0.2452 | **0.0000** | 0.1146 | - |
| TSK-MBGD-UR | BCA | **0.0036** | 0.2900 | 0.0912 | 0.4420 | 0.0921 | **0.0000** | 0.0731 | - |
| TSK-MBGD-UR-BN | RCA | **0.0000** | 0.2113 | **0.0002** | **0.0025** | 0.0404 | **0.0000** | **0.0094** | 0.1409 |
| TSK-MBGD-UR-BN | BCA | **0.0000** | 0.0291 | **0.0021** | 0.0730 | **0.0022** | **0.0000** | **0.0013** | 0.0986 |

Table VI demonstrates that our proposed BN and UR can significantly improve the generalization performance of the traditional MBGD optimization for TSK fuzzy classifiers. TSK-MBGD-UR-BN statistically significantly outperformed CART, JRip, PART, TSK-MBGD and TSK-MBGD-BN on RCA, and also statistically significantly outperformed CART, JRip, TSK-FCM-LSE, TSK-MBGD and TSK-MBGD-BN on BCA. Although the performance improvements of TSK-MBGD-UR-BN over RF and TSK-MBGD-UR were not statistically significant, they were quite close to the threshold, especially for the BCA.

F. Effect of UR

As mentioned in Section II-B, using MBGD to optimize the TSK fuzzy system may face the "rich get richer" problem. To demonstrate this, Fig. 5 shows the average normalized firing levels of the rules on the entire dataset after the four MBGD-based TSK models were trained, on three representative datasets. For TSK-MBGD, a few "richest" rules had much larger average firing levels than others, and hence the rules contributed significantly differently to the output. BN may help alleviate this problem a little bit, as the average normalized rule firing levels in TSK-MBGD-BN were more uniform than those in TSK-MBGD, which also resulted in better classification performances, as demonstrated in the previous subsection. However, UR had the most direct effect on alleviating the "rich get richer" problem, as TSK-MBGD-UR (TSK-MBGD-UR-BN) had much more uniform average normalized rule firing levels than TSK-MBGD (TSK-MBGD-BN), and hence also better classification performance.

Fig. 5. Average normalized rule firing levels of TSK-MBGD, TSK-MBGD-BN, TSK-MBGD-UR and TSK-MBGD-UR-BN on (a) Satellite, (b) Vehicle, and (c) Biodeg datasets.

Note that we set $\tau = 1/C$ in (8), where $C = 6$ for Satellite, $C = 4$ for Vehicle, and $C = 2$ for Biodeg. However, the actual average normalized rule firing levels were not exactly $\tau$ on these datasets. Our experiments showed that although UR cannot guarantee the average normalized rule firing levels to be around $\tau$, it can indeed make the rules fire more uniformly.

Why may making the rules fire more uniformly help improve the generalization performance? In [32] we pointed out that a TSK fuzzy system may be functionally equivalent to an adaptive stacking ensemble model, in which each rule can be viewed as a base learner, and the aggregation weights equal the corresponding rule firing levels. When the rule firing levels are more uniform, generally more rules are utilized in computing the output, i.e., more base learners are used in the stacking ensemble model, which may help improve the generalization performance.

To demonstrate this, we computed the entropy of the normalized rule firing levels for each input example:

$$E = -\sum_{r=1}^{R} \bar{f}_r \log \bar{f}_r, \tag{16}$$

where $\bar{f}_r$ is the normalized firing level of the $r$-th rule. Generally, a larger entropy means more rules were fired.

Fig. 6 shows the histogram of the entropy distributions on the Satellite dataset. When training TSK fuzzy systems without UR, many samples had close to zero $E$, i.e., all except one rule had firing levels close to zero. When UR was added, the number of examples with close to zero $E$ decreased significantly, i.e., more rules with larger firing levels were used in computing the output.

Fig. 6. Histogram of the normalized rule firing level entropy $E$ of (a) TSK-MBGD and TSK-MBGD-UR, and, (b) TSK-MBGD-BN and TSK-MBGD-UR-BN, on the Satellite dataset.

G. Effect of BN

We also used the Satellite dataset to analyze the effect of BN. We set the UR weight $\lambda = 1$ and recorded the training loss and test BCA in the first 20 training epochs. This process was repeated 10 times, and the average results are shown in Figs. 7(a) and 7(b), respectively. BN resulted in smaller training losses and better generalization performances in testing.

There is still no agreement on theoretically why BN is helpful in optimizing deep neural networks [24]; thus, it is also challenging to analyze theoretically why BN can help the optimization of TSK fuzzy systems. Nevertheless, we performed an empirical study to peek into this, by recording the L1 norm of the antecedent parameters' gradients and the L1 norm of the consequent parameters' gradients in the first 20 training epochs on the Satellite dataset. The results are shown in Figs. 7(c) and 7(d), respectively. BN significantly increased the gradients of both antecedent and consequent parameters. With the same learning rate, this can expedite the convergence.

Fig. 7. (a) Training loss, (b) test BCA, (c) L1 norm of the antecedent parameters' gradients, and (d) L1 norm of the consequent parameters' gradients, in the first 20 training epochs on the Satellite dataset. The horizontal axis starts from 3 epochs so that the differences among the curves can be more clearly visualized.

We also evaluated the performances of the two BN variants introduced in Section II-C. The BCAs of TSK-MBGD-UR, TSK-MBGD-UR-BN, TSK-MBGD-UR-GBN and TSK-MBGD-UR-RBN are shown in Table VII. TSK-MBGD-UR-BN performed the best, and TSK-MBGD-UR-GBN the worst. Since TSK-MBGD-UR-RBN had more parameters to optimize, its training was not as stable as TSK-MBGD-UR-BN and TSK-MBGD-UR-GBN. Therefore, TSK-MBGD-UR-BN is the best choice.

TABLE VII. AVERAGE BCAS OF THE THREE BN VARIANTS ON THE 12 DATASETS.

| Dataset | TSK-MBGD-UR | TSK-MBGD-UR-BN | TSK-MBGD-UR-GBN | TSK-MBGD-UR-RBN |
|---|---|---|---|---|
| Vehicle | 0.7127 | 0.7930 | 0.7261 | 0.7679 |
| Biodeg | 0.8390 | 0.8439 | 0.8422 | 0.8440 |
| DRD | 0.6717 | 0.6729 | 0.6636 | 0.6650 |
| Yeast | 0.4946 | 0.5332 | 0.4352 | 0.5339 |
| Steel | 0.7245 | 0.7515 | 0.7332 | 0.7219 |
| IS | 0.8559 | 0.9501 | 0.9115 | 0.8938 |
| Abalone | 0.4791 | 0.5402 | 0.4924 | 0.5275 |
| Waveform21 | 0.8362 | 0.8233 | 0.8232 | 0.8334 |
| Page-blocks | 0.6033 | 0.6710 | 0.5912 | 0.6333 |
| Satellite | 0.8679 | 0.8700 | 0.8679 | 0.8216 |
| Clave | 0.6374 | 0.6421 | 0.6090 | 0.6442 |
| MAGIC | 0.8225 | 0.7934 | 0.8319 | 0.8318 |
| Average | 0.7121 | 0.7404 | 0.7106 | 0.7265 |

H. Effect of the Batch Size

The batch size is an important hyper-parameter in MBGD-based optimization. It determines the memory requirement and the convergence speed in training. A larger batch size leads to faster convergence but also requires more memory.

In [43], the authors analyzed the effect of the batch size on the generalization performance. Their results showed that using a larger batch size causes degradation in the model generalization performance, because it tends to converge to a sharper minimum, which makes the model sensitive to noise. A similar finding was presented in [44]: a smaller batch size leads to more stable and reliable training.

IV. CONCLUSIONS AND FUTURE RESEARCH

TSK fuzzy systems are powerful and frequently used machine learning models, for both regression and classification. However, they may not be easily applicable to large and/or high-dimensional datasets. Our very recent research [6] proposed an MBGD-based efficient and effective training algorithm (MBGD-RDA) for TSK fuzzy systems for regression problems. This paper has proposed an MBGD-based algorithm, TSK-MBGD-UR-BN, to train TSK fuzzy systems for classification problems. It can deal with both small and big data with different dimensionalities, and may be the only algorithm that can train a TSK fuzzy classifier on big and high-dimensional datasets. TSK-MBGD-UR-BN integrates two novel techniques, which are also first proposed in this paper:

1) UR, which is a regularization term in the loss function to ensure that all rules are fired similarly on average, and hence to improve the generalization performance.
2) BN, which normalizes the inputs in computing the rule consequents to speed up the convergence and to improve the generalization.

Experiments on 12 UCI datasets from various domains, with varying size and feature dimensionality, demonstrated that each of UR and BN has its own unique advantages, and integrating them can achieve the best classification performance. TSK-MBGD-UR-BN, together with MBGD-RDA proposed in [6], shall greatly promote the applications of TSK fuzzy systems in both classification and regression, especially for big data problems.

The proposed TSK-MBGD-UR-BN also has some limitations, which will be addressed in our future research. First, for very high dimensional data, fuzzy partitions of the input space become very complicated, and numeric underflow may happen when the product t-norm is used. Further research shall consider rules that automatically select the most relevant attributes as the antecedents. Second, we shall investigate how to improve the interpretability of data-driven TSK fuzzy systems.
However, since This is also partially linked to the ﬁrst problem, as reducing we used the mean, standard deviation and mean ﬁring level the number of antecedents can improve the interpretability of of each batch to compute the losses, too small batch size may the rules. also lead to poor performance. We validated our model on the Satellite dataset with batch REFERENCES size varying from 16 to 2,048. The test RCAs and BCAs averaged over 30 runs are shown in Fig. 8. The test perfor- [1] A.-T. Nguyen, T. Taniguchi, L. Eciolaza, V. Campos, R. Palhares, and mance decreased with too small or too large batch sizes. For M. Sugeno, “Fuzzy control systems: Past, present and future,” IEEE Computational Intelligence Magazine, vol. 14, no. 1, pp. 56–68, 2019. TSK-MBGD-UR-BN, it seems that a batch size within [64, [2] Y. Shi, R. Eberhart, and Y. Chen, “Implementation of evolutionary fuzzy 256] is a good choice. systems,” IEEE Trans. on Fuzzy Systems, vol. 7, no. 2, pp. 109–119, 0.9 [3] D. Wu and W. W. Tan, “Genetic learning and performance evaluation of interval type-2 fuzzy logic controllers,” Engineering Applications of 0.85 0.85 Artiﬁcial Intelligence, vol. 19, no. 8, pp. 829–841, 2006. 0.8 [4] L.-X. Wang and J. M. Mendel, “Back-propagation of fuzzy systems 0.8 as nonlinear dynamic system identiﬁers,” in Proc. IEEE Int’l Conf. on 0.75 Fuzzy Systems, San Diego, CA, Sep. 1992, pp. 1409–1418. 0.75 0.7 [5] J. S. R. Jang, “ANFIS: Adaptive-network-based fuzzy inference system,” 0.7 0.65 IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665– 16 32 64 128 256 512 1024 2048 685, 1993. Batch Size [6] D. Wu, Y. Yuan, J. Huang, and Y. Tan, “Optimize TSK fuzzy systems for big data regression problems: Mini-batch gradient descent Fig. 8. Average RCAs and BCAs of TSK-MBGD-UR-BN on the Satellite with regularization, DropRule and AdaBound (MBGD-RDA),” IEEE dataset, using different batch sizes. Trans. on Fuzzy Systems, 2020, in press. [Online]. 
Available: https://arxiv.org/abs/1903.10951
[7] Y. Jin, “Fuzzy modeling of high-dimensional systems: complexity reduction and interpretability improvement,” IEEE Trans. on Fuzzy Systems, vol. 8, no. 2, pp. 212–221, 2000.
[8] Y. Deng, Z. Ren, Y. Kong, F. Bao, and Q. Dai, “A hierarchical fused fuzzy deep neural network for data classification,” IEEE Trans. on Fuzzy Systems, vol. 25, no. 4, pp. 1006–1012, 2016.
[9] F.-L. Chung, Z. Deng, and S. Wang, “From minimum enclosing ball to fast fuzzy inference system training on large datasets,” IEEE Trans. on Fuzzy Systems, vol. 17, no. 1, pp. 173–184, 2008.
[10] M. Nilashi, O. Bin Ibrahim, N. Ithnin, and N. H. Sarmin, “A multi-criteria collaborative filtering recommender system for the tourism domain using Expectation Maximization (EM) and PCA–ANFIS,” Electronic Commerce Research and Applications, vol. 14, no. 6, pp. 542–562,
[11] C. K. Lau, K. Ghosh, M. A. Hussain, and C. R. C. Hassan, “Fault diagnosis of Tennessee Eastman process with multi-scale PCA and ANFIS,” Chemometrics and Intelligent Laboratory Systems, vol. 120, pp. 1–14, 2013.
[12] Z. Deng, K.-S. Choi, Y. Jiang, J. Wang, and S. Wang, “A survey on soft subspace clustering,” Information Sciences, vol. 348, pp. 84–106, 2016.
[13] Z. Deng, K.-S. Choi, F.-L. Chung, and S. Wang, “Enhanced soft subspace clustering integrating within-cluster and between-cluster information,” Pattern Recognition, vol. 43, no. 3, pp. 767–781, 2010.
[14] M. J. Gacto, M. Galende, R. Alcalá, and F. Herrera, “METSK-HDe: A multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems,” Information Sciences, vol. 276, pp. 63–79, 2014.
[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Boston, MA: MIT press, 2016.
[16] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.
[17] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. Int’l Conf. on Computational Statistics. Paris, France: Springer, Aug. 2010, pp. 177–186.
[18] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in Proc. Int’l Conf. on Machine Learning, Atlanta, GA, Jun. 2013, pp. 1139–1147.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int’l Conf. on Learning Representations, San Diego, CA, May
[20] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The marginal value of adaptive gradient methods in machine learning,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 4148–4158.
[21] N. S. Keskar and R. Socher, “Improving generalization performance by switching from Adam to SGD,” arXiv preprint arXiv:1712.07628, 2017.
[22] L. Luo, Y. Xiong, Y. Liu, and X. Sun, “Adaptive gradient methods with dynamic bound of learning rate,” in Proc. Int’l Conf. on Learning Representations, New Orleans, LA, May 2019.
[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. Int’l Conf. on Machine Learning, Lille, France, Jul. 2015.
[24] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization?” in Proc. Advances in Neural Information Processing Systems, Montréal, Canada, Dec. 2018, pp. 2483–2493.
[25] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[26] L. Fan, “Revisit fuzzy neural network: Demystifying batch normalization and ReLU with generalized hamming network,” in Proc. Advances in Neural Information Processing Systems, Long Beach, CA, Dec. 2017, pp. 1923–1932.
[27] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv preprint arXiv:1511.07289, 2015.
[28] Y. Wu and K. He, “Group normalization,” in Proc. European Conf. on Computer Vision, Munich, Germany, Sep. 2018, pp. 3–19.
[29] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Computation, vol. 3, no. 1, pp. 79–87,
[30] H. Bersini and G. Bontempi, “Now comes the time to defuzzify neuro-fuzzy models,” Fuzzy Sets and Systems, vol. 90, no. 2, pp. 161–169,
[31] H. Andersen, A. Lotfi, and L. Westphal, “Comments on ‘functional equivalence between radial basis function networks and fuzzy inference systems’ [and author’s reply],” IEEE Trans. on Neural Networks, vol. 9, no. 6, pp. 1529–1532, 1998.
[32] D. Wu, C.-T. Lin, J. Huang, and Z. Zeng, “On the functional equivalence of TSK fuzzy systems to neural networks, mixture of experts, CART, and stacking ensemble regression,” IEEE Trans. on Fuzzy Systems, 2020, in press. [Online]. Available: https://arxiv.org/abs/1903.10572
[33] T. Shen, M. Ott, M. Auli, and M. Ranzato, “Mixture models for diverse machine translation: Tricks of the trade,” arXiv preprint arXiv:1902.07816, 2019.
[34] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, NV, Jun. 2016, pp. 770–778.
[36] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
[37] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, HI, Jul. 2017, pp. 4700–4708.
[38] E. Frank and I. H. Witten, “Generating accurate rule sets without global optimization,” in Proc. Int’l Conf. on Machine Learning, San Francisco, CA, Jul. 1998.
[39] W. W. Cohen, “Repeated incremental pruning to produce error reduction,” in Proc. Int’l Conf. on Machine Learning, Tahoe City, CA, Jun.
[40] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, “Neuro-fuzzy and soft computing-a computational approach to learning and machine intelligence,” IEEE Trans. on Automatic Control, vol. 42, no. 10, pp. 1482–1484, 1997.
[41] O. J. Dunn, “Multiple comparisons using rank sums,” Technometrics, vol. 6, no. 3, pp. 241–252, 1964.
[42] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society: Series B, vol. 57, no. 1, pp. 289–300, 1995.
[43] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” in Proc. Int’l Conf. on Learning Representations, Toulon, France, Apr. 2017.
[44] D. Masters and C. Luschi, “Revisiting small batch training for deep neural networks,” arXiv preprint arXiv:1804.07612, 2018.
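
The firing-level entropy analysis of Section III-F and the underflow limitation noted in Section IV can be illustrated with a minimal NumPy sketch. It rests on assumptions not stated in this part of the paper: Gaussian membership functions, and an entropy of the form E = -sum_r f_r log(f_r) over the normalized firing levels f_r. The function names are hypothetical, and the log-domain normalization is a standard numerical workaround, not the authors' proposed remedy (they suggest selecting fewer antecedents per rule).

```python
import numpy as np

def log_firing_levels(X, centers, sigmas):
    """Log of the un-normalized rule firing levels under a product t-norm.

    X: (N, D) inputs; centers, sigmas: (R, D) Gaussian MF parameters.
    Gaussian MFs are an assumption of this sketch.
    """
    # z[n, r, d] = (x_nd - c_rd) / sigma_rd, via broadcasting
    z = (X[:, None, :] - centers[None, :, :]) / sigmas[None, :, :]
    # A product of D Gaussian memberships is a sum of D log-memberships.
    return -0.5 * np.sum(z ** 2, axis=2)  # shape (N, R)

def normalized_firing_levels(log_f):
    """Normalize firing levels over rules, subtracting the row maximum first."""
    shifted = log_f - log_f.max(axis=1, keepdims=True)
    w = np.exp(shifted)  # the largest entry in each row is exactly 1
    return w / w.sum(axis=1, keepdims=True)

def firing_entropy(f_bar):
    """Per-sample entropy of normalized firing levels (assumed definition of E)."""
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(f_bar > 0, f_bar, 1.0)
    return -np.sum(f_bar * np.log(safe), axis=1)

# Illustrative data: 5 samples, 2000 features, 4 rules (all values arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2000))
centers = rng.normal(size=(4, 2000))
sigmas = np.ones((4, 2000))

# Multiplying 2000 memberships directly underflows in float64.
naive = np.prod(np.exp(-0.5 * ((X[:, None, :] - centers) / sigmas) ** 2), axis=2)
f_bar = normalized_firing_levels(log_firing_levels(X, centers, sigmas))

print("naive product is all zeros:", bool(np.all(naive == 0.0)))
print("normalized levels sum to 1:", bool(np.allclose(f_bar.sum(axis=1), 1.0)))
print("entropy per sample:", firing_entropy(f_bar))
```

Under this definition, an entropy near zero means a single rule dominates the output for that sample, while the maximum value log(R) is reached when all R rules fire equally, which is the behavior UR encourages on average.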

Statistics – arXiv (Cornell University)

**Published:** Aug 1, 2019
