An Experimental Evaluation of Large Scale GBDT Systems
Fu, Fangcheng;Jiang, Jiawei;Shao, Yingxia;Cui, Bin
2019-07-03 00:00:00
Fangcheng Fu, Jiawei Jiang, Yingxia Shao, Bin Cui

Department of Computer Science and Technology & Key Laboratory of High Confidence Software Technologies (MOE), Peking University
Center for Data Science & National Engineering Laboratory for Big Data Analysis and Applications, Peking University
Beijing University of Posts and Telecommunications
Tencent Inc.
{ccchengff,bin.cui}@pku.edu.cn, shaoyx@bupt.edu.cn, jeremyjiang@tencent.com

arXiv:1907.01882v2 [cs.LG] 5 Aug 2019

ABSTRACT

Gradient boosting decision tree (GBDT) is a widely-used machine learning algorithm in both data analytic competitions and real-world industrial applications. Further, driven by the rapid increase in data volume, efforts have been made to train GBDT in a distributed setting to support large-scale workloads. However, we find it surprising that the existing systems manage the training dataset in different ways, but none of them have studied the impact of data management. To that end, this paper aims to study the pros and cons of different data management methods regarding the performance of distributed GBDT.

We first introduce a quadrant categorization of data management policies based on data partitioning and data storage. Then we conduct an in-depth systematic analysis and summarize the advantageous scenarios of the quadrants. Based on the analysis, we further propose a novel distributed GBDT system named Vero, which adopts the unexplored composition of vertical partitioning and row-store and suits many large-scale cases. To validate our analysis empirically, we implement different quadrants in the same code base and compare them under extensive workloads, and finally compare Vero with other state-of-the-art systems over a wide range of datasets. Our theoretical and experimental results provide a guideline on choosing a proper data management policy for a given workload.

1. INTRODUCTION

Gradient boosting decision tree (GBDT) [12] is an ensemble model which uses decision tree as weak learner and improves model quality with a boosting strategy [11, 38]. It has achieved superior performance in various workloads, such as classification, regression, and ranking [27, 37, 7]. Not only do data scientists choose it as a favorite tool for data analytic competitions such as Kaggle, but users from industry have also shown interest in deploying GBDT in production environments [16, 43, 17].

With the rapid increase in data volume, distributed GBDT has been intensively studied to improve the performance. Recently, a range of distributed machine learning systems has been developed to train GBDT, such as XGBoost, LightGBM and DimBoost [8, 20, 43, 30, 23, 17]. However, in practical use, no single system is able to outperform the others in all cases. We notice that these systems manage the training dataset in different ways. This motivates us to conduct a study of the data management in distributed GBDT.

Consider the training dataset as a matrix, where each row represents one instance and each column refers to one dimension of feature. To make distributed machine learning possible, we need to partition the dataset among the workers in a cluster. Afterwards, each worker uses some storage structure to store the data partition. As a result, there are two orthogonal aspects in the data management of distributed GBDT: data partitioning and data storage.

Data Partitioning. Since the dataset is a two-dimensional matrix, there are two different schemes to partition the dataset over the workers. Horizontal partitioning, which is the de facto choice of most distributed machine learning algorithms, partitions the dataset by instances (rows). Vertical partitioning is an alternative to horizontal partitioning. The workers partition the dataset by features (columns) and each worker stores a feature subset.

Data Storage. After data partitioning, each worker has a portion of the training data, either a horizontal partition or a vertical partition. Without loss of generality, we assume the dataset is sparse. There are two avenues to store the data. Row-store is a popular choice in machine learning. Each instance is stored as a set of ⟨feature index, feature value⟩ pairs, a.k.a. Compressed Sparse Row (CSR) format. Many algorithms follow a row-based training routine which supports scanning the training data sequentially. Column-store puts together one column (feature) of the partition, and stores each column as a set of ⟨instance index, feature value⟩ pairs, a.k.a. Compressed Sparse Column (CSC) format.

If we revisit the methods of data management, there are two data partitioning choices and two data storage choices, yielding four possible combinations. Figure 1 summarizes the four combinations as four quadrants. Interestingly, three quadrants have been explored by existing systems, but none of these works study which is the best combination. As a result, researchers and engineers might be confused when they need to choose the platform for their specific workloads. To address this issue, we ask the question: what are the advantages and disadvantages of different data management schemes, and how can we make a proper choice facing different scenarios?

[Figure 1: Quadrants of existing works. QD1 (horizontal partitioning, column-store): XGBoost. QD2 (horizontal partitioning, row-store): LightGBM, DimBoost. QD3 (vertical partitioning, column-store): Yggdrasil. QD4 (vertical partitioning, row-store): Vero (this work).]

[Figure 2: An illustration of GBDT. Tree 1 splits on Married and Age<35, Tree 2 splits on Salary<100; the prediction of an instance is the sum of its leaf weights over the trees, e.g., 3+5=8 and 10+10=20.]

1.1 Summary of Contributions

We list the main contributions of this work below.

(Anatomy of existing systems) To answer the above questions, we first study how data management influences the performance of distributed GBDT. Specifically, we conduct a theoretical analysis of data partitioning and data storage.

Anatomy of data partitioning. The data partitioning directly affects the communication and memory cost due to a data structure called gradient histogram, which summarizes gradient statistics for fast and accurate split finding in GBDT. We find that vertical partitioning is more suitable for a range of workloads, including high-dimensional features, deep trees, and multi-classification. The fundamental reason is that these factors could cause extremely large gradient histograms, and vertical partitioning helps avoid intensive communication and memory overhead. In contrast, horizontal partitioning works better for datasets with low dimensionality and a large number of instances.

Anatomy of data storage. In GBDT, the training procedures, especially the construction of gradient histograms, involve complex data access and indexing, and the efficiency is influenced by the data storage. We carefully investigate the computation efficiency of row- and column-store in terms of data access and indexing. We find that although column-store seems more natural for vertical partitioning, as adopted by database design, the computation overhead is rather undesirable. Row-store is superior to column-store given a large number of instances, achieving a higher computation efficiency. In short, our main finding is that row-store is almost always a wiser choice unless the dataset is high-dimensional and meanwhile contains very few instances.

(Proposal of Vero) Unfortunately, although our study discovers that the fourth quadrant in Figure 1 is suitable for a wide range of large-scale scenarios, including high-dimensional datasets, multi-classification tasks, and deep trees, it is never investigated by previous works. In this work, we propose Vero, an end-to-end distributed GBDT system that uses vertical partitioning and row-store.

Horizontal-to-vertical transformation. We develop an efficient algorithm to transform horizontally stored datasets into vertically partitioned ones. To reduce the network overhead, we compress both feature indices and feature values, without any loss of model accuracy.

Training with Vertical Row-store. We redesign the training routine of GBDT to match the vertical partitioning and row-store policy. Specifically, we adapt the split finding and node splitting procedures to vertical partitioning, and adopt a node-to-instance index for row-store to construct the gradient histograms efficiently.

(Comprehensive Evaluation) We implement distributed GBDT on top of Spark [39], a popular distributed engine for large-scale data processing, and conduct extensive experiments to validate our analysis empirically.

Breakdown comparison of data management. To fairly evaluate each candidate in data partitioning and data storage, we implement different partitioning schemes and storage patterns in the same code base, and compare them under different circumstances using a wide range of datasets. Our experimental results regarding computation, communication, and memory cost validate our theoretical anatomy.

End-to-end evaluation. We compare Vero with other popular GBDT systems over extensive datasets, including public, synthetic, and industrial datasets. Empirical results show that our analytical comparison also holds for the state-of-the-art systems. Regarding the results, we provide suggestions on how to choose a proper platform for a given workload.

2. BACKGROUND

2.1 Preliminaries of GBDT

2.1.1 Overview of GBDT

Gradient boosting decision tree is a boosting algorithm that uses decision tree as weak learner. Figure 2 shows an illustration of GBDT. Given a training dataset with N instances and D features $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$ are the feature vector and label of an instance, GBDT trains a set of decision trees $\{f_t(x)\}_{t=1}^{T}$, puts each instance onto one leaf node, and sums the leaf predictions of all trees as the final instance prediction: $\hat{y}_i = \sum_{t=1}^{T} \eta f_t(x_i)$, where T denotes the total number of trees and $\eta$ is a hyper-parameter called learning rate (a.k.a. step size).

GBDT trains the decision trees sequentially. For the t-th tree, it tries to minimize the loss given the predictions of prior trees, defined by the regularized objective function:

$$F^{(t)} = \sum_i l(y_i, \hat{y}_i^{(t)}) + \Omega(f_t) = \sum_i l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t),$$

where l is usually a differentiable convex loss function that measures the loss given prediction and target, e.g., logistic loss or square loss. $\Omega$ is a regularization term to avoid over-fitting. We follow the popular choice in [8, 17], which is $\Omega(f_t) = \gamma J_t + \lambda \|\omega_t\|_2^2 / 2$, where $\omega_t$ denotes the weight vector comprised of the $J_t$ leaf values in the t-th tree. $\gamma$ and $\lambda$ are hyper-parameters that control the complexity of one tree.

[Figure 4(a): Horizontal partitioning. Workers construct local histograms for all features and aggregate them into global ones.]

To quickly optimize the objective function, LogitBoost [11] proposes to approximate $F^{(t)}$ with second-order Taylor expansion when training the t-th tree, i.e.,

$$F^{(t)} \approx \sum_i \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t).$$
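The second-order approximation of Section 2.1.1 can be sanity-checked numerically for a single instance. The following is a minimal sketch, not part of any system discussed here: it assumes the logistic loss on a raw score with labels in {0, 1}, takes g and h to be the first and second derivatives of the loss with respect to the previous prediction, and uses hypothetical function names throughout.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_loss(y, score):
    # Logistic loss for a label y in {0, 1} and a raw (pre-sigmoid) score.
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def logistic_grad_hess(y, score):
    # First- and second-order derivatives of the loss w.r.t. the score
    # (assumed to play the roles of g_i and h_i in the expansion).
    p = sigmoid(score)
    return p - y, p * (1.0 - p)

def taylor_loss(y, score, delta):
    # Second-order Taylor expansion of the loss around the previous score,
    # where delta stands for the new tree's output f_t(x_i).
    g, h = logistic_grad_hess(y, score)
    return logistic_loss(y, score) + g * delta + 0.5 * h * delta * delta

# One instance with label 1, previous prediction 0.3, new tree output 0.1.
exact = logistic_loss(1, 0.3 + 0.1)
approx = taylor_loss(1, 0.3, 0.1)
print(exact, approx)
```

For a small tree output the two values agree closely, which is why only the gradient statistics g and h need to be kept per instance when growing a tree.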
In this approximation, $g_i = \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})$ are the first- and second-order gradients. Denote $I_j = \{ i \mid x_i \in leaf_j \}$ as the set of instances classified onto the j-th leaf. Omitting the constant term, we should minimize

$$\tilde{F}^{(t)} = \sum_{j=1}^{J_t} \left[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \right] + \gamma J_t.$$

If the tree is not going to be expanded (no leaf to be split), we can obtain its optimal weight vector and minimal loss by

$$\omega_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad \tilde{F}^{(t)*} = -\frac{1}{2} \sum_{j=1}^{J_t} \frac{\big( \sum_{i \in I_j} g_i \big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma J_t. \qquad (1)$$

Equation 1 can be reckoned as a measurement to evaluate the performance of a decision tree, which is analogous to the impurity functions of decision tree algorithms, such as entropy for ID3 [31] or Gini-index for CART [6]. To grow a tree w.r.t. minimizing the total loss, the common approach is to select a tree node (beginning with the root node) and find the best split (a split feature and a split value) that can achieve the maximal split gain. The split gain is defined as

$$Gain = \frac{1}{2} \left[ \frac{\big( \sum_{i \in I_L} g_i \big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big( \sum_{i \in I_R} g_i \big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big( \sum_{i \in I} g_i \big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma, \qquad (2)$$

where $I_L$ and $I_R$ indicate the left and right child nodes after splitting, and $I = I_L \cup I_R$ is the instance set of the node being split. After the current tree finishes, the predictions of all instances are updated, the gradient statistics are re-computed, and the algorithm proceeds to the next tree.

[Figure 3: Histogram-based split finding for one feature.]

[Figure 4: Illustration of data partitioning and storage. (b) Vertical partitioning. Worker who proposes global best split broadcasts the instance placement after node splitting. (c) Row-store and column-store.]

2.1.2 Histogram-based Algorithm

Histogram-based split finding. It is vital to find the optimal split of a tree node efficiently, as enumerating every possible split in a brute-force manner is impractical. Current works generally adopt a histogram-based algorithm for fast and accurate split finding, as illustrated in Figure 3. The algorithm considers only q values for each feature f as candidate splits rather than all possible splits. The most common approach to propose the candidates is using the quantile sketch [15, 22, 14] to approximate the feature distribution. After candidate splits are prepared, we enumerate all instances on a tree node and accumulate their gradient statistics into two histograms, for first- and second-order gradients, respectively. The histogram consists of q bins, each of which sums the first- or second-order gradients of instances whose f-th feature values fall into that range. In this way, each feature is summarized by two histograms. We find the best split of feature f upon the histograms by Equation 2, and the global best split is the best split over all features.

Histogram subtraction technique. Another advantage of the histogram-based algorithm is that we can accelerate the algorithm by a histogram subtraction technique. The instances on two children nodes are non-overlapping and mutually exclusive, since an instance will be classified onto either the left or the right child node when the parent node gets split. Since the basic operation of a histogram is adding gradients, for feature f, the element-wise sum of the first- or second-order histograms of the children nodes equals that of the parent. Motivated by this, we can significantly accelerate training by first constructing the histograms of the child node with fewer instances, and then getting those of the sibling node via histogram subtraction (histograms of the parent node are persisted in memory). By doing so, we can skip at least one half of the instances. Since histogram construction usually dominates the computation cost, such a subtraction technique can speed up the training process considerably.

2.2 Data Management in GBDT

As aforementioned, the combinations of partitioning schemes and storage patterns together form four quadrants (QD). Although the four quadrants entail similar memory consumption to store the dataset in expectation, the manipulation (including computation, storing, and communication) of gradient histograms can be significantly different.

2.2.1 Data Partitioning in GBDT

Since gradient histograms can be reckoned as summaries of features, different partitioning choices affect the way we construct and exchange histograms.

Since values of each feature are scattered among workers in horizontal partitioning, as presented in Figure 4(a), each worker needs to construct histograms for all features based on its data shard. Then the local histograms are aggregated into global histograms via element-wise summation, so that all values of each feature are correctly summarized.

As shown in Figure 4(b), each worker maintains one or several complete columns in vertical partitioning, therefore there is no need to aggregate the histograms. Each worker obtains the local best split regarding its feature subset, and then all workers exchange the local best splits and choose the global best one. Nevertheless, since the feature values of an instance are partitioned, its placement after node splitting, i.e., left or right child node, is only known by the worker who proposes the global best split. As a result, the placement of instances must be broadcast to all workers.

2.2.2 Data Storage in GBDT

The most distinguished difference brought by storage pattern is the way we index and access the values during the construction of histograms, as shown in Figure 4(c).

With row-store, each worker iterates the data shard row-by-row, and accumulates the gradient statistics to corresponding histograms. When processing one instance, the worker needs to update multiple histograms of different features. To accelerate the construction, each worker further maintains an indexing between tree nodes and instances.

With column-store, as all values of one feature are held together, each worker constructs histograms one-by-one by processing the columns individually. However, the indexing between the values on a column and tree nodes must be maintained carefully. As we will discuss in Section 3.2, the data access and indexing might take extra efforts.

3. ANATOMY OF QUADRANTS

In this section, we provide an in-depth study of the four quadrants when training a GBDT model distributedly. To formally describe the results, we assume there are W workers, and the GBDT model is comprised of T decision trees, where each of them has L layers. The number of candidate splits is denoted by q. For classification tasks, we denote C as the dimension of a gradient, where C equals 1 in binary-classification or the number of classes in multi-classification.

3.1 Analysis of Partitioning Scheme

Here we theoretically analyze the performance of horizontal and vertical partitioning schemes, including memory and communication cost.

3.1.1 Histogram Size

The core operation of GBDT is the construction and manipulation of gradient histograms. We first study the size of histograms, which is determined by three factors. (1) Feature dimension. Since two histograms are built for each feature (one first-order gradient histogram and one second-order gradient histogram), the total size is proportional to 2 × D. (2) Number of candidate splits. The number of bins in one histogram equals the number of candidate splits q, which makes the histogram size proportional to q. (3) Number of classes. In multi-classification tasks, the gradient is a vector of partial derivatives on all classes. The histogram size is therefore proportional to C. To sum up, the histogram size on one tree node, denoted by $Size_{hist}$, is 2 × D × q × C × 8 bytes, where 8 bytes is the size of a double-precision floating-point number.

3.1.2 Memory Cost

Obviously, the memory cost for both partitioning schemes to store the dataset is similar. Nonetheless, the memory cost to store the gradient histograms is quite different. Here we focus on the memory consumed by storing the histograms.

In order to perform histogram subtraction, we have to conserve the histograms of the parent nodes. The maximum number of histograms to be held in memory equals the number of tree nodes in the last but one layer, which is $2^{L-2}$. With horizontal partitioning, each worker needs to construct the histograms of all features, thus the memory cost of histograms is $Size_{hist} \times 2^{L-2}$. Nevertheless, with vertical partitioning, each worker constructs the histograms of a portion of features. As a result, the expected memory cost is $Size_{hist} \times 2^{L-2} / W$, which is significantly smaller than the horizontal partitioning counterpart.

Footnote: We assume all histograms are preserved in memory.

3.1.3 Communication Cost

The dominant communication cost in the horizontal partitioning scheme is the aggregation of histograms. Despite the existence of different aggregation methods [36], such as map-reduce, all-reduce, and reduce-scatter, the minimal transferred data of each worker is the size of local histograms. Thus the total communication cost among the cluster for building one tree is at least $Size_{hist} \times W \times (2^{L-1} - 1)$. As the tree goes deeper, i.e., as L increases, the communication cost grows exponentially.

Unlike the horizontal partitioning scheme, the vertical partitioning scheme does not need to aggregate the histograms since each worker holds all the values of a specific feature. However, as described in Section 2, after splitting a tree node, the placement of instances must be broadcast to all workers. Since the communication cost is only affected by the number of instances, the overhead in one tree layer remains the same as the tree goes deeper. As we will elaborate in Section 4.2.2, the placement is encoded into a bitmap so that the communication overhead can be reduced sharply. To conclude, the communication cost for an L-layer tree is $\lceil N/8 \rceil \times W \times L$ bytes, where $\lceil N/8 \rceil$ bytes is the size of one bitmap.

3.1.4 Summary of Analysis

Undoubtedly, the choice of partitioning scheme highly depends on $Size_{hist}$. Horizontal partitioning works well for datasets with low dimensionality, since the resulting histograms are small. However, in both industry and academia, the following three cases become more and more popular: high dimensional features, deep trees, and multi-classification. In these cases, the histogram size can be very large. Therefore, vertical partitioning is far more memory- and communication-efficient than horizontal partitioning. Take an industrial dataset Age as an example, which is also used in our experimental study, and suppose we run GBDT on 8 workers. The dataset contains 48M instances, 330K features and 9 classes. The decision trees have 8 layers and the number of candidate splits is 20. Then the estimated size of histograms on one tree node can be up to 906MB. Using the horizontal approach, the memory consumption would be 56.6GB and the total communication cost would be 900GB for merely one tree in the worst case. To the contrary, when the vertical scheme is applied, the expected memory cost of histograms is 7.08GB per tree and the communication cost is merely 366MB for one tree.

[Figure 5: Illustration of different indexes: instance-to-node index, node-to-instance index, and column-wise node-to-instance index, shown for row-store and column-store.]

[Figure 6: Update of column-wise node-to-instance index (w.r.t. the first tree in Figure 2).]

3.2 Analysis of Storage Pattern

In this section, we discuss the impact brought by different storage patterns. Although there exist various works discussing the different storage patterns in database designs, the conclusion cannot be transferred to distributed GBDT. The choice of storage pattern only influences the computation cost, rather than communication or memory cost. The most time-consuming computation in GBDT is histogram construction. However, the data access in GBDT is different from other ML models. Specifically, since GBDT conducts tree splitting in a top-to-bottom way, we need to create an index between tree nodes and training instances, and update the index during the training. Below, we discuss how to design the index with different storage patterns.

3.2.1 Choice of Index

To understand the computation complexity of histogram construction, we first illustrate the possible index choices used in GBDT training. As illustrated in Figure 5, there are three commonly used indexes indicating the position of training instances in the tree.

Node-to-instance index maps a tree node to the corresponding training instances, meaning that the key is a tree node and the value is the instances on the tree node.

Instance-to-node index maps a training instance to the corresponding tree node.

Column-wise node-to-instance index maintains a node-to-instance index for each feature column.

3.2.2 Row-store

When building the gradient histogram with row-store, we adopt a row-wise access method to scan rows sequentially. Each row is an instance, which consists of the instance index and a list of nonzero ⟨feature id, feature value⟩ pairs.

Node-to-instance index is designed for row-store. We get the instance rows of one tree node from the index. For each row, we iterate the ⟨feature id, feature value⟩ pairs. For each pair, we add the instance gradient to the histograms of that tree node. Furthermore, the node-to-instance index enables the histogram subtraction technique since we can directly get the instances of any tree node. If two tree nodes are siblings, we only build the histogram for the tree node with fewer instances, and apply histogram subtraction for the other one. Consequently, combining the node-to-instance index and row-store can save a large amount of data accesses.

3.2.3 Column-store

When building the gradient histogram with column-store, a straightforward way is to use a column-wise access method to scan the columns. Each column summarizes the values of one feature, which includes the feature id and a list of ⟨instance id, feature value⟩ pairs.

Instance-to-node index. Since the key of each pair in column-store is the instance id, a natural idea is creating an instance-to-node index. As shown in Figure 5, for each ⟨instance id, feature value⟩ pair, we query the tree node it belongs to, and then update the corresponding histograms. Nonetheless, we find that using such a method is not efficient in practice. The reason is that in many real cases, the dataset is often sparse (especially for high-dimensional datasets). By default, given an optimal node split with feature f, instances with missing value on f are classified to the same child node, causing imbalanced sibling nodes. Histogram subtraction should be able to boost the performance; however, with the instance-to-node index, we cannot directly get the instances of two child nodes without queries, i.e., we need to access all instances of the two nodes. Therefore, a lot of time is wasted on scanning unnecessary data, resulting in poor performance.

Node-to-instance index. One solution to avoid scanning all instances is using the node-to-instance index for column-store. However, there still exists a fatal drawback. Once obtaining an instance id from the index, we need to locate the feature values of the instance from column-store. To that end, we have to perform a binary search on all the feature columns, which brings in a log(N) computation complexity. When N is large, the overhead becomes unacceptable.

Column-wise node-to-instance index. Another way to escape from both scanning unnecessary data and binary search is deploying an index on each column, which actually maintains a node-to-instance index for each column. When building histograms for one node, we can locate the ⟨instance id, feature value⟩ pairs on all columns directly. Nevertheless, although locating the instances is fast, updating the index is expensive. As shown in Figure 6, whenever we split some tree node, we have to update the indexes on all columns. The computation complexity of splitting tree nodes is about D times that of the two indexes described above. As a result, the column-wise node-to-instance index is only applicable for low-dimensional datasets.

3.2.4 Summary of Analysis

Table 1: Summary of advantageous scenarios among different quadrants. Technique Data Characteristics Model Quadrants Partitioning Storage High dim. Low dim. High ins. Low ins.
[Table 1, continued. Column headers visible here: Multi-class, Deep tree. Rows: QD1 (Horizontal, Column), no marks; QD2 (Horizontal, Row), two marks; QD3 (Vertical, Column), two marks; QD4 (Vertical, Row), four marks. The column position of each mark is not recoverable from this extraction.]

Here we summarize the computation complexity of different combinations by considering the number of accesses to the dataset or other data structures.

Cost of histogram construction. In histogram construction, we need to access the feature values on the data shard, and the expected number of key-value pairs is $Nd/W$, where $d$ is the average number of non-zeros of one instance. The complexity of histogram construction for one layer is therefore at least $O(Nd/W)$. There are three combinations that can theoretically achieve the lowest complexity, which are row-store with node-to-instance index, column-store with instance-to-node index, and column-store with column-wise index. However, as discussed above, column-store with instance-to-node index cannot benefit from the histogram subtraction technique, and thereby spends more time than row-store with node-to-instance index in practice; while column-store with column-wise index entails a much higher complexity in node splitting although it works well for histogram construction. The last combination, column-store with node-to-instance index, incurs a binary search on the feature columns whenever an instance is accessed. In expectation, the complexity of one binary search is approximately $O(\log(Nd/(WD)))$. Therefore, the overall complexity becomes $O((Nd/W)\log(Nd/(WD)))$.

Cost of split finding and node splitting. Apart from histogram construction, there are two other phases in GBDT, which are split finding and node splitting. To make the analysis self-contained, here we briefly analyze the computation cost in these two phases. For split finding, the algorithm needs to iterate over all split candidates, causing a computation complexity of $O(qD/W)$, regardless of the partitioning scheme. For node splitting, we need to update the index described above. The computation on one tree layer for both store patterns is proportional to the number of instances, if we do not use the column-wise node-to-instance index (see the footnote below). The complexity is $O(N/W)$ for horizontal partitioning and $O(N)$ for vertical partitioning. Obviously, both of the two phases have a significantly lower computation cost than histogram construction. Therefore, we should pay more attention to the impact of storage pattern on histogram construction.

Footnote: The complexity of column-wise node-to-instance index is $O(Nd/W)$, so we exclude it from our consideration.

Summary. As analyzed, column-store is not efficient with different index structures. On the contrary, the combination of row-store and node-to-instance index can achieve minimal computation, since it leverages histogram subtraction to reduce instance scanning and incurs the smallest cost of index update. As a result, unless the dataset contains very few instances so that the extra cost in indexing will not be large, we should choose row-store for distributed GBDT.

3.3 Take-away Results

We conclude the advantageous scenarios of different data management methods in Table 1. Considering that large-scale cases are becoming more and more ubiquitous, we have the following take-away results:

• Vertical partitioning is able to outperform horizontal partitioning for high-dimensional features, deep trees and multi-classification tasks, since it is more memory- and communication-efficient, while horizontal partitioning is better for low-dimensional datasets.

• Row-store is better than column-store unless the number of instances is very small, since it can achieve minimal computation complexity and avoid redundant data accesses.

• Overall, the composition of vertical partitioning and row-store (QD4) achieves optimal performance under many real-world large-scale cases as aforementioned. In Sections 5 and 6, we will validate this through extensive experiments.

4. REPRESENTATIVES OF QUADRANTS

In this section, we first introduce the representatives of QD1-3, and then propose Vero, a brand new distributed GBDT system with vertical partitioning and row-store (QD4).

4.1 Taxonomy of Existing Systems

XGBoost (QD1, Horizontal & Column). XGBoost [8] is a popular GBDT system that achieves great success, and it chooses the horizontal partitioning scheme and the column-store pattern. In XGBoost, each worker maintains an instance-to-node index. To construct the histograms of one layer, workers linearly scan the feature columns, accumulate the gradient statistics to the corresponding histogram bins, and finally aggregate the histograms in an all-reduce manner. After aggregation, the histograms are owned by a leader worker. Then it finds the best split by enumerating the candidate splits in the histograms. In the node splitting phase, each worker updates its own instance-to-node index.

LightGBM and DimBoost (QD2, Horizontal & Row). Both LightGBM [23] and DimBoost [17] belong to this quadrant. A node-to-instance index that maps tree nodes to instances is maintained. To construct the histograms of one node, the workers scan the feature vectors of the instances on that node, accumulate the gradient statistics to the corresponding histogram bins, and finally aggregate the histograms. LightGBM accomplishes the aggregation using reduce-scatter. Instead of aggregating all histograms on a single worker, each worker is responsible for a part of the features. All workers then find splits on the aggregated histograms and synchronize to obtain the global best one. While DimBoost, with parameter-server architecture [26, 18], aggregates the
histograms on parameter servers and enables server-side split finding. In either way we can avoid the single-point bottleneck in communication. The node-to-instance index is also updated during node splitting.

Yggdrasil (QD3, Vertical & Column). Although Yggdrasil [3] is designed for vanilla decision tree algorithms instead of GBDT, it is the first work that introduces vertical partitioning into distributed decision tree training. In Yggdrasil, each worker maintains several complete columns so that it can obtain the best split of its own feature (column) subset without histogram aggregation. All workers then exchange their local best splits and choose the global best one with maximal split gain. In this way, the communication in the split finding phase is far less than in horizontal-based methods. When splitting the tree nodes, Yggdrasil encodes the placement of each instance into a bitmap. Further, Yggdrasil utilizes a column-wise node-to-instance index. Based on the bitmap, the index for each column is updated. However, this brings in a large computation cost when the feature dimensionality is high.

4.2 Vero

As analyzed in Section 3, QD4 (Vertical & Row) is superior to the others under many large-scale scenarios but has been left unexplored. This drives us to develop a system, Vero, within the scope of QD4. Vero is built on top of Spark [39] and has been deployed at our industrial partner, Tencent Inc.

As shown in Figure 7, Vero follows the master-worker architecture. After loading the horizontally partitioned dataset from distributed file systems, we perform an efficient transformation operation to vertically repartition the dataset across workers. Then the master and the workers iteratively train a set of decision trees upon the repartitioned dataset.

[Figure 7: Overview of Vero]

4.2.1 Horizontal-to-Vertical Transformation

Naturally, training datasets are often horizontally partitioned and stored in distributed file systems such as HDFS and S3, which is obviously unfit for vertical partitioning. To solve this problem, we need to repartition the datasets vertically. To address the potential network overhead for large datasets, we develop an efficient transformation method that compresses both feature indices and feature values, without any loss of model accuracy. There are five main steps, as shown in Figure 8 and described below.

1. Build quantile sketches. After loading the dataset, each worker builds a quantile sketch for each feature. Then the local sketches are repartitioned among all workers, i.e., the local sketches of one feature are sent to the same worker. Finally, the workers merge the local sketches of the same feature into a global sketch.

2. Generate candidate splits. The workers generate candidate splits for each feature from the merged quantile sketch, using a set of quantiles, e.g., 0.1, 0.2, ..., 1.0. Then the master collects the candidate splits and broadcasts them to all workers for further use.

3. Column grouping. Each worker changes the representation of its local data shard by putting those features to be assigned to the same worker into one group. (The strategy of feature assignment will be described in Section 4.2.3.) The key-value pairs are encoded into a more compact form simultaneously. (i) For each feature, we assign a new feature id starting from 0 inside the column group. Suppose there are $p$ features in one group; we use $\lceil \log(p) \rceil$ bytes to encode the new feature id. (ii) We encode feature values with histogram bin indexes, each of which indicates the range between two consecutive splits. Since the histograms stay unchanged, the model accuracy will not be harmed. As the number of histogram bins $q$ is generally a small integer, we further encode bin indexes with $\lceil \log(q) \rceil$ bytes. After this operation, key-value pairs turn into ⟨new feature id, bin index⟩ pairs.

4. Repartition column groups. Similar to step 1, the column groups are repartitioned among workers. By doing so, each worker holds all values of the features it is responsible for. Further, the ordering of instances should be the same on all workers, so that we can coalesce the instances with their labels. This can be done by sorting the received column groups w.r.t. the original worker ids.

5. Broadcast instance labels. The master collects all instance labels and broadcasts them to all workers. Since the instance rows on each worker are ordered in step 4, we can therefore coalesce instance rows with instance labels.

Network overhead. Steps 1 and 2 prepare the candidate splits for step 3 to convert feature values into bin indexes. Quantile sketch is a widely-used data structure for approximate query [25, 34] and is usually small in size [15, 22, 14], so the network overhead is almost negligible. The communication bottleneck occurs in step 4. Nevertheless, by encoding feature id and feature value into fewer bytes, the size of a key-value pair is significantly decreased. According to our empirical results, it brings up to 4× compression. The time cost of step 5 is not dominant, as presented in the appendix of our technical report [13].

[Figure 8: Horizontal to vertical transformation]

4.2.2 Training Workflow

To fit the data management strategy of QD4, we revise the traditional training procedure of GBDT.

Histogram construction. Given the tree node(s) to process, the master first obtains the number of instances on each node, then it decides on which node(s) we can perform histogram subtraction and sends the schema to all workers. Each worker constructs histograms based on its data shard. Since Vero stores data in row manner, we use the node-to-instance index to achieve the best performance in histogram construction. For each tree node, each worker obtains a list of row indexes, and each row represents an instance that is currently classified onto that tree node. Then the worker adds the gradient statistics to the corresponding histograms. We also adopt the method proposed in [17] to handle instances with missing values. Finally, unlike horizontal-based works, Vero does not need to aggregate histograms among workers.

Split finding. To obtain the best split for some tree node, each worker first calculates a split for each histogram by Equation 2, and proposes the one with maximal split gain as the local best split. Finally, the master collects all local best splits and chooses the global best one. Note that the obtained feature id is not the original one, since we transform it in step 3 of Section 4.2.1. Hence, the master needs to recover the original feature afterwards.

Node splitting. As aforementioned, since only one worker owns the feature values of the best split, the placement of each instance (left or right child) after node splitting can only be computed by it. The master asks the worker who has proposed the global best split to compute and broadcast the instance placement. Since the placement of each instance has only two options, i.e., left or right child node, we use a bitmap to represent the instance placement, which can reduce the network overhead by 32×. All workers then update the node-to-instance index based on the bitmap.

4.2.3 Proposed Optimization

Load balance. There are various strategies for column grouping, such as round robin, hash-based, and range-based partition, yet these methods cannot guarantee exact load balance. We might suffer from the straggler problem if a worker contains far more key-value pairs than others. Therefore, we balance the workload on workers by averaging the total number of key-value pairs. In practice, the master collects the number of feature occurrences from the global quantile sketches, then the problem becomes assigning the feature pairs to $W$ groups so that the number of feature pairs in each group is as close as possible. This is obviously an NP-hard problem, so we use a greedy method to solve it [19].

Blockify of column group. Although the network overhead is reduced by compression, the overhead of (de)serialization is probably large if we represent column groups with a large amount of small vectors, since there are $W$ times the number of objects compared to the original dataset. To alleviate such overhead, we blockify the column groups before repartition, as shown in Figure 9. Each block consists of three arrays, i.e., feature indexes, histogram bin indexes, and instance pointers. By default, the file split in Spark is 128MB; therefore, we can always put a partial column group into one block, since the number of key-value pairs in one file split is far smaller than INT_MAX. We assign the index of the file split to the $W$ partial column groups. After repartition, each column group (the data sub-matrix of a worker) is comprised of several blocks, sorted by their file split indexes.

[Figure 9: Blockified column grouping and two-phase indexing]

Two-phase indexing and block merge. Since the data sub-matrix is now made up of a number of blocks, we adopt a two-phase index to access each instance. In initialization, the offset of the instance (row) id of each block is recorded. Given an instance id, we first binary search the block that contains that instance, then the instance id inside the block is calculated by subtracting the offset of the block, and finally we obtain the range of the instance by the instance pointers. Considering that the number of file splits can be very large (for instance, a 100GB dataset results in approximately 800 file splits), we merge the blocks when possible in order to reduce the data access time. In practice, the number of blocks after the merge operation is smaller than 5. Therefore, we can nearly omit the extra cost brought by two-phase indexing.

5. EVALUATION

In this section, we conduct experiments to empirically validate our analysis. We organize the experiments into two parts. In Section 5.2, we implement different quadrants in the same code base and assess their performance over a range of synthetic datasets. In Section 5.3, we compare Vero with other baselines over extensive public and synthetic datasets. For more experiments, including the efficiency of the horizontal-to-vertical transformation and the scalability of Vero, please refer to the appendix of our technical report [13].

5.1 Experimental Setup

Environment. We conduct the experiments on an 8-node laboratory cluster. Each machine is equipped with 32GB RAM, 4 cores and 1Gbps Ethernet. The maximum memory allowed for each run is limited to 30GB, and we use 4 threads to achieve parallel computation on each node.

Hyper-parameters. In specific experiments, we vary some hyper-parameters to assess the change in performance. However, unless otherwise stated, we set T = 100 (# trees), L = 8 (# layers), and q = 20 (# candidate splits).

5.2 Assessment of Quadrants

In order to validate the analysis in Section 3, we evaluate the impact of partitioning scheme and storage pattern. For partitioning scheme, we compare Vero with QD2 in terms of communication and memory efficiency. For storage pattern, we compare Vero with QD3 in terms of computation efficiency.

To achieve a fair and thorough comparison, we implement two optimized baselines in QD2 and QD3 on top of Spark, compare them with Vero over a range of synthetic datasets, and report the mean and standard deviation of the time for one tree.

[Figure 10, panels (a) to (f). (a) Impact of instance number (D=100, C=2, L=8); (b) Impact of dimensionality (N=50M, C=2, L=8); (c) Impact of tree depth (N=50M, D=100K, C=2); (d) Impact of multi-classes (N=50M, D=25K, L=8); (e) Memory consumption; (f) Memory consumption. Panels (a) to (d) compare Horizontal+Row (QD2) with Vertical+Row (QD4) on computation (Comp) and communication (Comm) time; panels (e) and (f) break memory down into data and histogram.]

The synthetic datasets are generated from random linear regression models. Specifically, given dimensionality D, informative ratio p, and number of classes C, we first randomly initialize the weight matrix W with size D × C, and each row of W contains pD nonzero values. Then for each instance,
the feature x is a randomly sampled D-dimensional vector with a given density, and its label y is determined by arg max xW. In our experiment, we set both p and the density to 20%.

[Figure 10, continued. (g) Impact of dimensionality; (h) Impact of instance number. Settings printed under the bottom row of panels, left to right: N=50M, C=2, L=8; N=50M, D=25K, L=8; N=10K, C=2, L=8; D=100K, C=2, L=8. Axes: time breakdown per tree (seconds) and memory breakdown (GB).]

Figure 10: Comparison of quadrants. Comp refers to computation, and Comm refers to communication.

5.2.1 Partitioning schemes

Impact of number of instances. We first assess the impact of the number of instances N using low-dimensional datasets, and present the time cost per tree in Figure 10(a). The computation time of QD2 and QD4 is close to each other, since the partitioning scheme does not have influence on computation. Nonetheless, the communication time varies. With D = 100, which is a fairly low dimensionality, the communication cost of QD2 is negligible since the size of the gradient histograms is small. In contrast, QD4 takes nearly half of the training time on network transmission. Besides, when N grows larger, the communication cost of QD4 also becomes higher. This is because vertical partitioning has to broadcast the placement of instances after node splitting, which results in network overhead proportional to N. Therefore, given a low-dimensional dataset containing a large number of instances, horizontal partitioning is a more appropriate choice.

Impact of dimensionality. To assess the impact of feature dimensionality D, we train distributed GBDT over datasets with varying D, as shown in Figure 10(b). The communication time of horizontal partitioning increases linearly w.r.t. D, since the histogram size grows linearly, while vertical partitioning gives almost the same communication time regardless of D. The result validates that vertical partitioning is more communication-efficient for high-dimensional datasets. Theoretically speaking, the computation cost of QD2 and QD4 is similar, which matches the case when D = 25K. However, when we use more features, the computation time of QD2 increases sharply while that of QD4 grows mildly. This is because when D gets higher, the histogram becomes larger and cannot fit in cache. Thus QD2 suffers from frequent cache misses, and therefore spends more time on histogram construction for larger D. QD4, instead, holds a much smaller histogram on each worker owing to vertical partitioning and shows a slow growth in computation time.

Impact of tree depth. We then assess the impact of the number of tree layers by changing L. As shown in Figure 10(c), when L increases from 8 to 9 and 10, the communication time of QD2 increases almost exponentially because the number of tree nodes grows exponentially. On the contrary, the communication time of QD4 increases linearly w.r.t. L, since the transmission on each layer remains the same. As for computation time, due to the histogram subtraction technique, the time to build histograms for a deep layer is very small. As a result, communication dominates when the decision tree goes deeper, and vertical partitioning reveals its superiority more for deep trees.

Impact of multi-classes. We next assess the impact of the number of classes C in multi-classification tasks. The experiments are conducted on several synthetic datasets with different numbers of classes. Since QD2 encounters an OOM (out-of-memory) error with D = 100K and C = 10, we lower the dimensionality to 25K. The results are presented in Figure 10(d). The computation time of QD2 and QD4 shows a similar increase when C increases from 3 to 5, and to 10. Nevertheless, the communication time of QD2 is approximately proportional to C, while that of QD4 remains unchanged. This validates our analysis that vertical partitioning is more suitable for multi-classification tasks than horizontal partitioning, as it saves a lot of communication.

Memory consumption. We record the memory consumption by monitoring the GC of the JVM. As analyzed in Section 3, vertical partitioning is more memory-efficient since each worker does not need to store the histograms of all features. Therefore, we break down the memory consumption into data and histogram. As shown in Figure 10(e) and Figure 10(f), QD2 and QD4 incur similar memory cost to store the dataset. QD4 allocates slightly more memory since it needs to store all instance labels. Nonetheless, the memory for histograms is much different. Compared to QD4, QD2 allocates approximately 6-8× the space to persist the histograms, showing that the memory cost of vertical partitioning can be alleviated given more workers. Moreover, in multi-classification tasks, the memory consumption of histograms in QD2 dominates the overall memory cost, since the histogram size grows linearly against C while the size of the dataset remains unchanged. QD4, on the contrary, is able to handle high-dimensional or multi-class datasets with limited memory resources.

Table 2: Public and synthetic datasets. LD refers to low-dimensional dense datasets; HS refers to high-dimensional sparse datasets; MC refers to multi-classification datasets.

Table 3: Average run time per tree scaled by Vero. We highlight the fastest ones in bold.
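As a rough illustration of the communication trends reported in Section 5.2.1, the sketch below estimates the bytes exchanged for one tree under the two partitioning schemes. It is a back-of-the-envelope model under stated assumptions, not Vero's or any baseline's implementation: the function names, the 4-byte statistic size, the choice of two statistics (gradient and hessian) per bin and class, and the treatment of the last layer as leaves are all hypothetical, and the model ignores the number of workers, the aggregation pattern, and histogram subtraction.

```python
"""Back-of-the-envelope communication model for one tree (illustrative only)."""


def horizontal_bytes_per_tree(D, q, C, L, bytes_per_stat=4):
    """Histogram bytes aggregated under horizontal partitioning.

    Assumption: every node on every split layer contributes a full histogram
    of D features x q bins x C classes x 2 statistics (gradient, hessian).
    """
    split_layers = L - 1  # assumption: the last layer holds leaves only
    nodes = sum(2 ** layer for layer in range(split_layers))
    return nodes * D * q * C * 2 * bytes_per_stat


def vertical_bytes_per_tree(N, L):
    """Placement-bitmap bytes broadcast under vertical partitioning.

    Assumption: one bit per instance on every split layer, independent of
    the dimensionality D and the number of classes C.
    """
    split_layers = L - 1
    return split_layers * ((N + 7) // 8)


if __name__ == "__main__":
    N, q = 50_000_000, 20
    settings = [(100, 2, 8), (100_000, 2, 8), (100_000, 2, 10), (25_000, 10, 8)]
    for D, C, L in settings:
        h = horizontal_bytes_per_tree(D, q, C, L) / 2**20
        v = vertical_bytes_per_tree(N, L) / 2**20
        print(f"D={D:>7} C={C:>2} L={L:>2}  horizontal={h:12.1f} MiB  vertical={v:8.1f} MiB")
```

Under these assumptions the horizontal estimate grows linearly with D and C and roughly doubles with each extra layer, while the vertical estimate depends only on N and grows linearly with L, which mirrors the qualitative behaviour described for QD2 and QD4 above.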
Table 2

Dataset          Size    # Ins   # Feat   # Labels   Type
SUSY             2GB     5M      18       2          LD
Higgs            8GB     11M     28       2          LD
Criteo           10GB    45M     39       2          LD
Epsilon          15GB    500K    2K       2          LD
RCV1             1.2GB   697K    47K      2          HS
Synthesis        60GB    50M     100K     2          HS
RCV1-multi       0.8GB   534K    47K      53         MC
Synthesis-multi  18GB    50M     25K      10         MC

Table 3

Dataset          XGBoost   LightGBM   DimBoost   Vero
SUSY             0.3       0.1        0.5        1.0
Higgs            0.5       0.2        0.8        1.0
Criteo           0.5       0.2        0.7        1.0
Epsilon          2.8       0.7        1.9        1.0
RCV1             17.3      5.6        4.0        1.0
Synthesis        18.9      5.0        2.0        1.0
RCV1-multi       34.7      9.7        -          1.0
Synthesis-multi  7.1       3.3        -          1.0

[Figure 11, panels: (a) SUSY, (b) Higgs, (c) Criteo, (d) Epsilon, (e) RCV1, (f) Synthesis, (g) RCV1-multi, (h) Synthesis-multi. Curves: Vero, DimBoost, LightGBM, XGBoost (no DimBoost curve on the multi-class datasets). x-axis: time (seconds); y-axis: validation AUC or validation accuracy.]

Figure 11: End-to-end evaluation. We report the convergence curves and draw a horizontal line to indicate the best model performance.

5.2.2 Storage patterns

Index plan. Since the column-wise node-to-instance index causes unacceptable overhead during update, we implement QD3 with a combination of node-to-instance and instance-to-node indexes. Specifically, when a column contains few values, we build the histogram for it by linear scanning; otherwise, we perform binary search on the column. In the appendix of our technical report [13], we compare our QD3 implementation with Yggdrasil to show that the combination of two indexes can achieve higher performance.

Impact of dimensionality. We first study the performance on datasets with only a few instances but a high dimensionality. Although such datasets are seldom seen in practice, conducting the comparison helps make our assessment complete. The result is given in Figure 10(g). Given a fixed N, the communication cost of QD3 and QD4 stays almost unchanged, due to the vertical partitioning they adopt. However, QD4 spends more time on computation than QD3 given a larger D. The reason is that QD3 stores the dataset column-by-column and constructs histograms one-by-one, thus it is more cache-friendly when writing to the histograms, while row-store constructs histograms for all features together, which suffers from heavy cache misses when D is large. As a result, the experiment results match our analysis in Section 3.2 that column-store performs better than row-store when the dataset is high-dimensional and meanwhile contains very few instances.

Impact of number of instances. We then assess the impact of the number of instances N. As shown in Figure 10(h), QD3 and QD4 have similar network time growing linearly against N, since both of them vertically partition the datasets and need to transmit the instance placement. The difference occurs in computation time. In general, QD3 spends 3-4× as much time on computation as QD4. Moreover, the computation time of QD3 oscillates heavily (high standard deviation of time per tree). This is because the binary searches on columns result in many CPU branch mispredictions. In contrast, when training with row-store, we iterate over the feature vectors row-by-row, which avoids the heavy branch misprediction penalty. In short, QD3 shares the same communication overhead as QD4, but QD3 is not as computation-efficient as QD4, owing to the column-store it adopts.

5.2.3 Summary

The experiments above validate the analysis in Section 3, that (i) horizontal partitioning works better when dimensionality is low, while vertical partitioning is more memory- and communication-efficient under the high-dimensional, deep-tree and multi-class cases; (ii) row-store is more efficient in computation than column-store except when the dataset is high-dimensional with few instances. In addition, we observe another two advantages of QD4 in practice, which are cache- and branch-friendliness. As a result, the composition of vertical partitioning and row-store can achieve optimal performance under a wide range of workloads.

5.3 End-to-end Evaluation

Baselines. We choose three open source GBDT implementations as our baselines, which are XGBoost, LightGBM and DimBoost. XGBoost and LightGBM are favorite toolkits in data-analytic competitions, while DimBoost is optimized for large-scale GBDT workloads and is able to achieve state-of-the-art performance.

Datasets. We run Vero and the baselines on six public datasets and two synthetic datasets, as listed in Table 2. We categorize the datasets into low-dimensional dense (LD), high-dimensional sparse (HS), and multi-classification (MC) datasets, and discuss the overall performance of the systems on different kinds of datasets. All systems are tuned to achieve comparable accuracy. We present the convergence curves in Figure 11 and report the running time in Table 3.

Low-dimensional Dense Datasets. We first conduct end-to-end evaluation on four datasets with low dimensionality and fully dense data. We use five workers to run on these four datasets. Corresponding to the analysis in Section 3, low dimensionality results in small histogram size and hence the communication time of horizontal partitioning does not dominate. Therefore, LightGBM, which belongs to QD2, achieves the fastest speed overall, since it is more computation-efficient than XGBoost (QD1) and communicates little compared to Vero (QD4). Vero suffers on extremely low-dimensional datasets, i.e., SUSY, Higgs, and Criteo; however, it catches up quickly and is comparable to LightGBM when the dimensionality gets higher, for instance on the Epsilon dataset, which also matches our analysis. DimBoost (QD2) runs slower than XGBoost on three datasets, which departs from our analysis. The unsatisfactory performance of DimBoost is caused by two factors: 1) DimBoost is designed for the high-dimensional case and always stores datasets as sparse matrices, which inevitably results in extra cost in data access and indexing; 2) DimBoost is implemented in Java, thus it is hard to achieve as good computation efficiency as the C++-based XGBoost and LightGBM.

High-dimensional Sparse Datasets. We then assess the systems on high-dimensional sparse datasets, RCV1 and Synthesis, with five and eight workers, respectively. In short, Vero runs the fastest, followed by DimBoost and LightGBM, while XGBoost is the slowest. XGBoost is about 18× slower than Vero, due to the inefficiency in both computation and communication. The speedup of Vero w.r.t. DimBoost and LightGBM is 2-5.6×. The relative performance of Vero on Synthesis is lower than on RCV1, since there is a large number of instances compared with the 330 thousand features. However, it can still achieve the fastest speed, owing to the superiority of QD4 under high-dimensional cases.

Multi-classification Datasets. Finally we consider the performance on multi-classification datasets using eight workers. Since DimBoost does not support multi-classification, we do not discuss it in this experiment. XGBoost and LightGBM are 8.6× and 7.4× slower on the multi-class dataset RCV1-multi than on the binary-class dataset RCV1, due to the 53× increment in network transmission. Vero, however, takes only 4× more time on RCV1-multi, since the network transmission of vertical partitioning does not increase w.r.t. the number of classes. Overall, Vero is 9.7× and 34.7× faster than LightGBM and XGBoost. The speedup of Vero on Synthesis-multi is smaller than on Synthesis due to the lower dimensionality; however, Vero still outperforms XGBoost and LightGBM by 7.1× and 3.3×, respectively. The experiment results match our analysis that QD4 is more suitable for multi-classification tasks.

Summary. The end-to-end evaluation reveals that we should choose the proper system for a given workload. To summarize, LightGBM achieves the highest performance on low-dimensional datasets, while Vero is the best choice for high-dimensional or multi-classification datasets.

6. EVALUATION IN THE REAL WORLD

As aforementioned, Vero has been integrated into the production pipeline of Tencent. In this section, we present some use cases to validate the ability of Vero to handle large-scale real-world workloads.

Environment. The experiments are carried out on a production cluster in Tencent. Each machine is equipped with 64GB RAM, 24 cores and 10Gbps Ethernet. Since the cluster is shared by other applications, the maximum resource for each Yarn container is restricted. Thus we use 20GB memory and 10 cores for each container.

Datasets. We use three datasets in Tencent. All three datasets are used to train models to complete the user persona. Gender contains 122 million instances. Age classifies 48 million users into 9 age ranges. Both of them have 330 thousand features. Taste, with 10 million instances and 15 thousand features, describes the user taste with 100 tags.

Hyper-parameters. We use 50 workers for Gender, and 20 workers for Age and Taste. We set T = 20 (# trees) and restrict the maximum running time to convergence to 1 hour. The other hyper-parameters are the same as in Section 5.

Baselines. Prior to Vero, XGBoost and DimBoost were the two candidates for GBDT in Tencent. As discussed in [17], LightGBM is impractical for production environments owing to the strict environment requirement and the lack of integration with the Hadoop ecosystem. Therefore, we choose XGBoost and DimBoost as our baselines in this section.

Gender dataset. We run the Gender dataset on all three systems and present the results in Figure 12 and Table 4. Unfortunately, Vero spends 1.5× the time of DimBoost to finish one tree. This is caused by two factors. First, the production cluster has a 10× higher network bandwidth compared to the laboratory cluster in Section 5, so the communication overhead is alleviated for DimBoost. Second, Gender contains an extremely large amount of instances, in which case horizontal partitioning can better distribute the workloads to workers. However, the time cost of Vero is comparable to that of DimBoost and Vero outperforms XGBoost by 5.5×, verifying that Vero can well support datasets with a large number of instances and low dimensionality.

Table 4: Run time per tree in seconds (fastest ones in bold).

System     Gender   Age    Taste
XGBoost    438      1738   627
DimBoost   52       -      -
Vero       79       207    139

[Figure 12 panels, left to right: Gender (Vero, DimBoost, XGBoost), Age (Vero, XGBoost), Taste (Vero, XGBoost). x-axis: time (seconds), from 0 to 3600.]

Figure 12: End-to-end evaluation over industrial datasets (left to right: Gender, Age, Taste).

Age dataset. We next assess the performance of Vero and XGBoost on the large-scale multi-class dataset. Figure 12 and Table 4 give the results. It takes 207 seconds for Vero to complete one tree, and it can get close to convergence within an hour. Nevertheless, XGBoost costs 1738 seconds for one tree, which is 8.3× slower. In many real applications, the allowed time is usually restricted. For instance, daily recurring jobs need to commit within a reasonable period of time so that the downstream jobs will not be affected. Obviously, XGBoost fails to converge within acceptable time on this dataset, whereas Vero can achieve better performance since it is more efficient in both communication and computation.

Taste dataset. Finally we conduct an experiment on a relatively small-scale multi-class dataset. As shown in Figure 12 and Table 4, Vero is 4.5× faster than XGBoost. Although the feature dimensionality of Taste is low, Vero can still outperform XGBoost, showing that Vero is more suitable for multi-classification tasks.

Summary. With the experimental results on three industrial datasets, we show that by careful investigation of the management of distributed datasets, we can achieve a better solution for a wide range of workloads. Currently Vero is designed for vertical partitioning and row-store, and is not able to achieve the highest performance in all cases. How to determine an optimal dataset management strategy given the size of the dataset (e.g., number of instances, feature dimensionality and number of classes) along with the application environment (e.g., network bandwidth, number of machines, number of cores) remains unsolved. We believe this problem can bring insight to both the machine learning and database communities and leave it as our future work.

7. RELATED WORK

Many works have implemented the algorithm, driven by either research interests or industrial needs. R-GBM and scikit-learn [32, 29] are stand-alone packages, so they cannot handle large-scale datasets.

TencentBoost and PSMART [20, 43] implement GBDT with parameter-server. DimBoost [17] further applies a series of optimization techniques and achieves state-of-the-art performance. However, it only supports binary classification.

There exist many works discussing the impact of data layout on databases. Column-oriented databases [35, 1] vertically partition the data and store them in columns, and outperform row-oriented databases on database analytics workloads. [2] discusses the performance difference between row-store and column-store. There are also works that take advantage of both vertical partitioning and row representation [4, 9]. Despite the extensive studies in the database community, how the way we manage the training datasets influences the performance of machine learning algorithms is rarely discussed. Yggdrasil [3] introduces vertical partitioning into the training of decision trees and showcases the reduction in network communication. Our work extends the analysis to both communication and memory overhead. In addition, Yggdrasil focuses on the case of deep decision trees. We further show that vertical partitioning combined with row-store benefits the high-dimensional and multi-classification cases. DimmWitted [40] analyzes the trade-off in access methods when training linear models under the NUMA architecture. However, instances are stored in row format without vertical partitioning in DimmWitted. In this work, we jointly discuss the data access and data index methods for both row-store and column-store data when training GBDT.

The analysis in this work is applicable to many other tree-based algorithms beyond GBDT, such as AdaBoost, random forest, and gcForest [10, 5, 45]. However, there are also algorithms that our analysis fails to support. For instance, neural decision forest [33, 24] utilizes neural networks (randomized multi-layer perceptron or fully-connected layers concatenated with a deep convolutional network) as splitting criteria. There is a big difference between this algorithm and vanilla decision trees. To discuss the impact on performance brought by data management methods, we need thorough
MLlib [28, 42] is a machine learn- investigation on deep neural network training, such as the ing package of Spark and implements GBDT. XGBoost [8] anatomy of data parallelism and model parallelism. More- achieves great success in various data analytics competitions, over, the qualitative study on how hardware environment and is also widely-used in companies due to the distributed influences the performance is remained undone. We leave learning supported by DMLC. LightGBM [23] is developed in these as future works and do not discuss them in this work. favor of data analytics. Although it supports parallel learning with MPI, LightGBM requires complex setup and is not a good fit for large scale workloads in commodity environment. 8. CONCLUSION Note that there is a feature-parallel version of LightGBM, which lets each worker process a feature subset like vertical In this paper, we systematically study the data manage- partitioning does. However, it requires all workers to load ment methods in distributed GBDT. Specifically, we propose the whole dataset into memory, i.e. dataset is never parti- the four-quadrant categorization along partitioning scheme tioned, which is impractical for large-scale workloads. In Ap- and storage pattern, analyze their pros and cons, and summa- pendix of our technical report [13] we conduct experiments on rized their advantageous scenarios in Table 1. Based on the small datasets with the feature-parallel LightGBM and Vero. findings, we further propose Vero, a distributed GBDT im- There is a surge of interests to introduce parameter-server plementation that partitions the dataset vertically and stores architecture into industrial applications [21, 44, 41]. Notably, data in row manner. Empirical results on extensive datasets Valid AUC Time Cost Per Tree (Second) Valid Accuracy Time Cost Per Tree (Second) Valid Accuracy Time Cost Per Tree (Second) validate our analysis and provide suggestive guidelines on [15] M. Greenwald and S. Khanna. 
Space-efficient online choosing a proper platform for a given workload. computation of quantile summaries. In ACM SIGMOD Acknowledgements. Jiawei Jiang is the corresponding Record, volume 30, pages 58–66. ACM, 2001. author. This work is supported by the National Key Research [16] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, and Development Program of China (No. 2018YFB1004403), A. Atallah, R. Herbrich, S. Bowers, et al. Practical NSFC(No. 61832001, 61702015, 61702016, 61572039), and lessons from predicting clicks on ads at facebook. In PKU-Tencent joint research Lab. Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pages 1–9. ACM, 9. REFERENCES 2014. [1] D. Abadi, S. Madden, and M. Ferreira. Integrating [17] J. Jiang, B. Cui, C. Zhang, and F. Fu. Dimboost: compression and execution in column-oriented database Boosting gradient boosting decision tree to higher systems. In Proceedings of the 2006 ACM SIGMOD dimensions. In Proceedings of the 2018 International international conference on Management of data, pages Conference on Management of Data, pages 1363–1376. 671–682. ACM, 2006. ACM, 2018. [2] D. J. Abadi, S. R. Madden, and N. Hachem. [18] J. Jiang, B. Cui, C. Zhang, and L. Yu. Column-stores vs. row-stores: how different are they Heterogeneity-aware distributed parameter servers. In really? In Proceedings of the 2008 ACM SIGMOD Proceedings of the 2017 ACM International Conference international conference on Management of data, pages on Management of Data, pages 463–478. ACM, 2017. 967–980. ACM, 2008. [19] J. Jiang, H. Deng, and X. Liu. A predictive dynamic [3] F. Abuzaid, J. K. Bradley, F. T. Liang, A. Feng, load balancing algorithm with service differentiation. In L. Yang, M. Zaharia, and A. S. Talwalkar. Yggdrasil: Communication Technology (ICCT), 2013 15th IEEE An optimized system for training deep decision trees at International Conference on, pages 372–377. IEEE, scale. 
In Advances in Neural Information Processing Systems, pages 3817–3825, 2016. [20] J. Jiang, J. Jiang, B. Cui, and C. Zhang. Tencentboost: [4] S. Agrawal, V. Narasayya, and B. Yang. Integrating A gradient boosting tree system with parameter server. vertical and horizontal partitioning into automated In Data Engineering (ICDE), 2017 IEEE 33rd physical database design. In Proceedings of the 2004 International Conference on, pages 281–284, 2017. ACM SIGMOD international conference on [21] J. Jiang, L. Yu, J. Jiang, Y. Liu, and B. Cui. Angel: a Management of data, pages 359–370. ACM, 2004. new large-scale machine learning system. National [5] L. Breiman. Random forests. Machine learning, Science Review, 5(2):216–236, 2017. 45(1):5–32, 2001. [22] Z. Karnin, K. Lang, and E. Liberty. Optimal quantile [6] L. Breiman. Classification and regression trees. approximation in streams. In Foundations of Computer Routledge, 2017. Science (FOCS), 2016 IEEE 57th Annual Symposium [7] C. J. Burges. From ranknet to lambdarank to on, pages 71–78. IEEE, 2016. lambdamart: An overview. Learning, 11(23-581):81, [23] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient [8] T. Chen and C. Guestrin. Xgboost: A scalable tree gradient boosting decision tree. In Advances in Neural boosting system. In Proceedings of the 22nd acm sigkdd Information Processing Systems, pages 3149–3157, international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016. [24] P. Kontschieder, M. Fiterau, A. Criminisi, and [9] B. Cui, J. Zhao, and D. Yang. Exploring correlated S. Rota Bulo. Deep neural decision forests. In subspaces for efficient query processing in sparse Proceedings of the IEEE international conference on databases. ieee transactions on knowledge and data computer vision, pages 1467–1475, 2015. engineering, 22(2):219–233, 2010. [25] K. Li and G. Li. Approximate query processing: What [10] Y. Freund and R. E. Schapire. 
A decision-theoretic is new and where to go? Data Science and Engineering, generalization of on-line learning and an application to 3(4):379–397, Dec 2018. boosting. Journal of computer and system sciences, [26] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, 55(1):119–139, 1997. A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and [11] J. Friedman, T. Hastie, R. Tibshirani, et al. Additive B.-Y. Su. Scaling distributed machine learning with the logistic regression: a statistical view of boosting (with parameter server. In OSDI, volume 14, pages 583–598, discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000. [27] P. Li. Robust logitboost and adaptive base class (abc) [12] J. H. Friedman. Greedy function approximation: a logitboost. arXiv preprint arXiv:1203.3491, 2012. gradient boosting machine. Annals of statistics, pages [28] X. Meng, J. Bradley, B. Yavuz, E. Sparks, 1189–1232, 2001. S. Venkataraman, D. Liu, J. Freeman, D. Tsai, [13] F. Fu, J. Jiang, S. Ying, and B. Cui. An experimental M. Amde, S. Owen, et al. Mllib: Machine learning in evaluation of large scale gbdt systems. arXiv preprint apache spark. The Journal of Machine Learning arXiv:1907.01882, 2019. Research, 17(1):1235–1241, 2016. [14] E. Gan, J. Ding, K. S. Tai, V. Sharan, and P. Bailis. [29] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, Moment-based quantile sketches for efficient high B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, cardinality aggregation queries. PVLDB, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine 11(11):1647–1660, 2018. 13 learning in python. Journal of machine learning and experiments. Science China Information Sciences, research, 12(Oct):2825–2830, 2011. 55(7):1551–1562, Jul 2012. [30] N. Ponomareva, S. Radpour, G. Hendry, S. Haykal, [39] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, T. Colthurst, P. Mitrichev, and A. Grushetsky. Tf M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. 
boosted trees: A scalable tensorflow based framework Resilient distributed datasets: A fault-tolerant for gradient boosting. In Joint European Conference on abstraction for in-memory cluster computing. In Machine Learning and Knowledge Discovery in Proceedings of the 9th USENIX conference on Databases, pages 423–427. Springer, 2017. Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012. [31] J. R. Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986. [40] C. Zhang and C. Ré. Dimmwitted: A study of main-memory statistical analytics. PVLDB, [32] G. Ridgeway. Generalized boosted models: A guide to 7(12):1283–1294, 2014. the gbm package. Update, 1(1):2007, 2007. [41] Z. Zhang, B. Cui, Y. Shao, L. Yu, J. Jiang, and [33] S. Rota Bulo and P. Kontschieder. Neural decision X. Miao. Ps2: Parameter server on spark. In forests for semantic image labelling. In Proceedings of Proceedings of the 2019 International Conference on the IEEE Conference on Computer Vision and Pattern Management of Data, pages 376–388. ACM, 2019. Recognition, pages 81–88, 2014. [42] Z. Zhang, J. Jiang, W. Wu, C. Zhang, L. Yu, and [34] G. Song, W. Qu, X. Liu, and X. Wang. Approximate B. Cui. Mllib*: Fast training of glms using spark mllib. calculation of window aggregate functions via global In 2019 IEEE 35th International Conference on Data random sample. Data Science and Engineering, Engineering (ICDE), pages 1778–1789. IEEE, 2019. 3(1):40–51, Mar 2018. [43] J. Zhou, Q. Cui, X. Li, P. Zhao, S. Qu, and J. Huang. [35] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, Psmart: parameter server based multiple additive M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, regression trees system. In Proceedings of the 26th E. O’Neil, et al. C-store: a column-oriented dbms. In International Conference on World Wide Web Proceedings of the 31st international conference on Companion, pages 879–880. International World Wide Very large data bases, pages 553–564. 
VLDB Web Conferences Steering Committee, 2017. Endowment, 2005. [44] J. Zhou, X. Li, P. Zhao, C. Chen, L. Li, X. Yang, [36] R. Thakur, R. Rabenseifner, and W. Gropp. Q. Cui, J. Yu, X. Chen, Y. Ding, et al. Kunpeng: Optimization of collective communication operations in Parameter server based distributed learning systems mpich. The International Journal of High Performance and its applications in alibaba and ant financial. In Computing Applications, 19(1):49–66, 2005. Proceedings of the 23rd ACM SIGKDD International [37] S. Tyree, K. Q. Weinberger, K. Agrawal, and J. Paykin. Conference on Knowledge Discovery and Data Mining, Parallel boosted regression trees for web search ranking. pages 1693–1702. ACM, 2017. In Proceedings of the 20th international conference on [45] Z.-H. Zhou and J. Feng. Deep forest: towards an World wide web, pages 387–396. ACM, 2011. alternative to deep neural networks. In Proceedings of [38] L. Wang, X. Deng, Z. Jing, and J. Feng. Further results the 26th International Joint Conference on Artificial on the margin explanation of boosting: new algorithm Intelligence, pages 3553–3559. AAAI Press, 2017. 14 APPENDIX C. COMPARISON WITH YGGDRASIL Since Yggdrasil can only train vanilla decision trees on low A. EFFICIENCY OF TRANSFORMATION dimensional datasets, we implement a representative of QD3 We first study the efficiency of our horizontal-to-vertical in Section 5 and assess the impact of storage pattern. To transformation algorithm. We show the time cost of data validate the ability of our implementation to represent QD3, loading, candidate split finding, label broadcasting and horizontal- we compare it with Yggdrasil in this section. to-vertical repartition in Table 5. The experiments are carried out on three low dimensional Effects of proposed techniques. To access the effects datasets listed in Table 7. 
We use 5 workers for all three of individual optimizations, we also implement the naïve datasets and other hyper-parameters are the same as in method that transmits original 12-byte key-value pairs and a Section 5. The results are also given in Table 7. As afore- compression method that compresses key-value pairs without mentioned, we combine instance-to-node index and node-to- the blockify technique. The results show that our algorithm instance index for optimization, therefore, our implementa- can complete transformation with minimal time cost. Taking tion in QD3 is able to outperform Yggdrasil on the three Synthesis as an example, the compression technique brings datasets. In addition, Vero is the fastest, verifying the QD4 a 16% reduction in time, and the blockify technique brings is more computation-efficient owing the row-store it adopts. another 42%. Analysis of transformation overhead. Note that both Dataset Size Yggdrasil QD3 (Ours) Vero horizontal and vertical partitioning need to calculate data Epsilon N=500K D=2K 137 24 5 sketches (calculate the candidate splits). Therefore, the extra SUSY N=5M D=18 32 9 5 overhead introduced by vertical partitioning is the sum of repartition time and label broadcasting time, which is only Higgs N=11M D=28 71 14 7 10% of data loading and sketching on small dataset like RCV1 and 24% on large dataset like Synthesis. The extra Table 7: Experiments on low dimensional datasets. The right- overhead in vertical partitioning is worth-while given the most three columns are time cost for one tree in seconds. overall performance improvement. D. COMPARISON WITH LIGHTGBM Load Get Repartition Broadcast Dataset Data Splits Label Naïve Compress Vero LightGBM supports both data-parallel and feature-parallel strategies. Data-parallel horizontally partitions the dataset RCV1 17 2 7 4 2 0.4 onto workers and stores the data in row-manner, which is RCV1-multi 12 2 5 3 2 0.3 also chosen as our baseline in Section 5. 
Feature-parallel, Synthesis 584 65 329 276 158 6 however, does not partition the dataset. It demands that all workers load a full copy of the dataset. In histogram Table 5: Time cost (in seconds) for data loading and prepro- construction and split finding, each worker independently cessing. We run three times and report the average. builds histogram for a feature subset and finds the local best split, as vertical partitioning does. In node splitting, each B. SCALABILITY OF Vero worker splits a node as the horizontal partitioning does, since We further conduct an experiment to assess the scalability it owns a full copy of dataset. Although such approach can of Vero. Since the Synthesis dataset cannot fit in memory of avoid heavy communication, it only works for small-scale two machines, we use two subsets of it, as Section 5.2 does. datasets. For many real-world workloads, the size of dataset Specifically, Synthesis-N10M refers to the subset of the first usually exceeds the memory of each machine, therefore the 10 million instances and Synthesis-D25K the subset of the feature-parallel implementation of LightGBM is impractical. first 25 thousand features. We present the results in Table 6. Here we conduct experiments on two small datasets, RCV1 In overall, Vero runs faster given more machines. However, and RCV1-multi. As shown in Table 8, the feature-parallel linear speedup is not observed on both datasets, since the version can outperform data-parallel, since it avoids the time cost of some operations in Vero have no relations to aggregation of histograms. However, Vero still achieves the number of machines. For instance, in node splitting, all fastest speed. Since the datasets contain smaller numbers of workers need to update the position of every instance, which instances, the communication cost of Vero does not dominant is not able to speedup given more workers. Therefore, the the overall run time. 
As a result, Vero is able to outperform speedup on Synthesis-D25K is lower as it contains more the feature-parallel LightGBM on small-scale datasets. instances, while on Synthesis-N10M we can achieve higher speedup. However, we can accelerate such computation with Dataset LightGBM (DP) LightGBM (FP) Vero multi-threading. Since the memory consumption of Vero RCV1 17 5 3 is much smaller than the horizontal-based implementations, we should consider using a small number of machines with RCV1-multi 127 23 13 multiple CPU cores to achieve higher speedup. Table 8: Time cost per tree in seconds. DP and FP refer to Dataset Synthesis-N10M Synthesis-D25K data-parallel and feature-parallel, respectively. # Machine 2 4 6 8 2 4 6 8 Run time 32.2 18.6 13.7 12.5 32.1 25.7 23.4 20.2 Speedup 1.0 1.7 2.4 2.6 1.0 1.2 1.4 1.6 Table 6: Scalability test. Run time in seconds.
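The difference between the data-parallel and feature-parallel strategies described in Appendix D can be illustrated with a minimal single-process sketch. It is not LightGBM's or Vero's code: the function names, the round-robin assignment of instances and features to workers, and the toy data are all assumptions made for illustration. It only shows that data-parallel workers must sum their partial histograms, whereas feature-parallel workers each produce complete histograms for the features they own.

```python
from typing import List, Sequence

Histogram = List[List[float]]  # hist[feature][bin] = sum of gradients


def build_histograms(rows: Sequence[Sequence[int]], grads: Sequence[float],
                     features: Sequence[int], num_features: int,
                     num_bins: int) -> Histogram:
    """Accumulate gradient sums per (feature, bin) over the given rows."""
    hist = [[0.0] * num_bins for _ in range(num_features)]
    for row, grad in zip(rows, grads):
        for f in features:
            hist[f][row[f]] += grad
    return hist


def data_parallel(rows, grads, num_workers, num_features, num_bins) -> Histogram:
    """Each simulated worker holds a shard of instances and all features."""
    total = [[0.0] * num_bins for _ in range(num_features)]
    for w in range(num_workers):
        # Hypothetical round-robin horizontal partitioning.
        local = build_histograms(rows[w::num_workers], grads[w::num_workers],
                                 range(num_features), num_features, num_bins)
        # Aggregation step: partial histograms must be summed across workers.
        for f in range(num_features):
            for b in range(num_bins):
                total[f][b] += local[f][b]
    return total


def feature_parallel(rows, grads, num_workers, num_features, num_bins) -> Histogram:
    """Each simulated worker holds all instances and owns a feature subset."""
    merged = [[0.0] * num_bins for _ in range(num_features)]
    for w in range(num_workers):
        owned = range(w, num_features, num_workers)
        local = build_histograms(rows, grads, owned, num_features, num_bins)
        # No summation needed: each owned histogram is already complete.
        for f in owned:
            merged[f] = local[f]
    return merged


# Toy data: 6 instances, 3 pre-binned features with 3 bins each.
ROWS = [[0, 1, 2], [1, 1, 0], [2, 0, 0], [0, 2, 1], [1, 0, 2], [2, 2, 1]]
GRADS = [0.5, -1.0, 0.25, 1.5, -0.5, 2.0]

dp = data_parallel(ROWS, GRADS, num_workers=2, num_features=3, num_bins=3)
fp = feature_parallel(ROWS, GRADS, num_workers=2, num_features=3, num_bins=3)
print(dp == fp)
```

Both strategies yield the same histograms here; the cost of the feature-parallel variant is that every worker reads the full `ROWS`, which mirrors the memory limitation noted in Appendix D.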
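The speedup row of Table 6 (Appendix B) can be recomputed from the run-time row. The sketch below assumes that speedup is measured relative to the two-machine run, which the 1.0 entries in the table suggest; the helper names and the parallel-efficiency measure are illustrative additions rather than quantities reported in the paper.

```python
MACHINES = [2, 4, 6, 8]
RUN_TIME = {  # seconds per tree, copied from Table 6
    "Synthesis-N10M": [32.2, 18.6, 13.7, 12.5],
    "Synthesis-D25K": [32.1, 25.7, 23.4, 20.2],
}


def speedups(run_times):
    """Speedup of each run relative to the first (two-machine) run."""
    base = run_times[0]
    return [round(base / t, 1) for t in run_times]


def parallel_efficiency(machines, run_times):
    """Observed speedup divided by the ideal (linear) speedup."""
    base_m, base_t = machines[0], run_times[0]
    return [(base_t / t) / (m / base_m) for m, t in zip(machines, run_times)]


for name, times in RUN_TIME.items():
    print(name, speedups(times), parallel_efficiency(MACHINES, times))
```

An efficiency well below 1 at eight machines is consistent with the observation that some operations, such as updating instance positions during node splitting, do not shrink as machines are added.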