Black-box combinatorial optimization using models with integer-valued minima

Laurens Bliek (l.bliek@tudelft.nl), Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands

Abstract
When a black-box optimization objective can only be evaluated with costly or noisy measurements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models. These models are usually continuous and smooth, which is beneficial for continuous optimization problems, but not necessarily for combinatorial problems. However, by choosing the basis functions of the surrogate model in a certain way, we show that it can be guaranteed that the optimal solution of the surrogate model is integer. This approach outperforms random search, simulated annealing and a Bayesian optimization algorithm on the problem of finding robust routes for a noise-perturbed traveling salesman benchmark problem, with similar performance as another Bayesian optimization algorithm, and outperforms all compared algorithms on a convex binary optimization problem with a large number of variables.

Keywords Surrogate models · Bayesian optimization · Black-box optimization

1 Introduction

Traditional optimization techniques such as first order methods or branch and bound make use of a known mathematical formulation of the objective function, for example by calculating the derivative or a lower bound. However, many objective functions in real-life situations have no complete mathematical formulation. For example, smart grids or railways are complex networks where every decision influences the whole network in such a way that the objective cannot be easily captured in one mathematical description. In such applications, we can observe the effect of decisions either in real life, or by running a simulation. Waiting for such a result can take some time, or may have some other cost associated with it. Furthermore, the outcome of two observations with the same decision variables may be different due to randomness that may be present in the real-life scenario, or due to artificial randomness in the simulation. Such problems have been approached using methods such as black-box or Bayesian optimization [1], simulation-based optimization [2], and derivative-free optimization [3]. Here, a model fits the relation between decision variables and objective function, and then standard optimization techniques are used on the model instead of the original objective. These so-called surrogate modeling techniques have been applied successfully to continuous optimization problems in signal processing [4], optics [4], machine learning [5], robotics [6], and more. However, how these techniques can be applied effectively to combinatorial optimization problems is still an open research question. A common approach is to simply round to the nearest integer, a method that is known to be sub-optimal in traditional optimization, and also in black-box optimization [7]. Another option is to use discrete surrogate models from machine learning such as regression trees [8] or linear model trees [9]. Although powerful, this makes both model fitting and optimization computationally expensive.
This work describes an approach where the surrogate model is still continuous, but where finding the optimum of the surrogate model gives an integer solution. The main contributions are as follows:

– A surrogate modeling algorithm, called IDONE, with two variants (one with a basic and one with a more complex surrogate model).
– A proof that finding the optimum of the surrogate model gives an integer solution.
– Experimental results that show when IDONE outperforms random search, simulated annealing and Bayesian optimization.

Section 2 gives a general description of the problem and an overview of related work. Section 3 describes the IDONE algorithm and the proof. In Section 4, IDONE is compared to random search, simulated annealing and Bayesian optimization on two different problems: finding robust routes in a noise-perturbed traveling salesman benchmark problem, and a convex binary optimization problem. Finally, Section 5 contains conclusions and future work.

2 Problem description and related work

Consider the problem of minimizing an objective f : R^d → R with integer and bound constraints:

    min_x f(x)
    s.t. x ∈ Z^d, l_i ≤ x_i ≤ u_i, i = 1, ..., d.    (1)

These bounds are also assumed to be integer. It is assumed that f does not have a known mathematical formulation, and can only be accessed via noisy measurements y = f(x) + ε, y ∈ R, with ε ∈ R zero-mean noise with finite variance. Furthermore, taking a measurement y is computationally expensive or is expensive due to a long measuring time, human decision making, deteriorating biological samples, or other reasons. Examples are hyper-parameter optimization in deep learning [10], contamination control of a food supply chain [11], and structure design in material science [12].
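As a minimal illustration of this setting (with a made-up objective standing in for the costly simulation, and our own function names), the optimizer only gets to query a noisy measurement of f at integer points within the bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
lower, upper = np.array([0, 0, 0]), np.array([5, 5, 5])   # integer bounds l and u

def f(x):
    # Stand-in for an expensive simulation or real-life experiment.
    return float(np.sum((x - 3) ** 2))

def measure(x):
    # The only access the optimizer has: y = f(x) + eps, with zero-mean noise.
    assert np.all((lower <= x) & (x <= upper))
    return f(x) + rng.normal(0.0, 0.1)
```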
Although many standard optimization methods are unsuitable for this problem, there exists a vast number of methods that were designed with most of the above assumptions in mind. For example, local search heuristics [13] such as hill-climbing, simulated annealing, or taboo search are general enough to be applied to this problem, and have the advantage of being easy to implement. These heuristics are considered as the baseline in this work, together with random search (simply assign the variables completely random values and keep the best results). Population-based heuristics such as genetic algorithms [14], particle swarm optimization [15], and ant colony optimization [16] operate in the same way as local search algorithms, but keep track of multiple candidate solutions that together form a population. These algorithms have been applied successfully in many applications, but are unsuitable for the problem described in this paper since evaluating a whole population of candidate solutions is not practical if each measurement is expensive. The same holds for algorithms that filter out the noise by averaging, such as COMPASS [17] or stochastic programming methods like sample average approximation [18], since they evaluate one candidate solution multiple times.

One of the most successful methods for solving decision problems involving integer decision variables is (Mixed) Integer Linear Programming [19, 20]. This field has brought forward strong and fast solvers that only need a problem description to compute solutions, even for nonlinear objective functions [21] as we consider in this work. However, if the problem cannot be completely formulated, because the objective function has no mathematical formulation, these solvers cannot be straightforwardly used. The core algorithm that makes these solvers so successful is the combined use of branch-and-bound search and (linear) relaxation: the value for a certain assignment to decision variables for a relaxed problem is compared to that of an earlier found solution. If the relaxed problem value is not better, that assignment can never lead to a better solution and the respective possibilities are pruned from the search. In our context, even though observations of the value are possible, these may be noisy. This seriously invalidates this approach: even if it would be possible to also observe values for relaxations of the problem, if these are noisy we may incorrectly prune parts of the search space. Reducing this effect by repeating the observation multiple times would be too expensive under the assumption that each evaluation of the objective is expensive.

Surrogate modeling techniques operate in a different way from the above methods: past measurements are used to fit a model, which is then used to select the next candidate solution. Bayesian optimization algorithms [1, 22], for example, have been successfully applied in many different fields. These methods use probability theory to determine the most promising candidate point according to the surrogate model. However, when the variables are discrete, the typical approach is to relax the integer constraints, which often leads to sub-optimal solutions [7]. The authors in [7] tackled this problem by modifying the covariance function used in the surrogate model. Another approach, based on semi-definite programming, is given in [11]. The HyperOpt algorithm [23, 24] takes yet a different approach by using a Tree-structured Parzen Estimator as the surrogate model, which is discrete in case the variables are discrete. HyperOpt is considered the main contender in this paper.

A downside of many Bayesian optimization algorithms is that the computation time per iteration scales quadratically with the number of measurements taken up until that point. This causes these methods to become slower over time, and after a certain number of iterations they may even violate the assumption that the bottleneck in the problem is the cost of evaluating the objective. This downside is recognized and overcome in two similar algorithms: COMBO [12] and DONE [4]. Both algorithms use the power of random features [25] to get a fixed computation time every iteration, but COMBO is designed for combinatorial optimization while DONE is designed for continuous optimization. A disadvantage of COMBO is that it evaluates the surrogate model at every possible candidate point, a number that grows exponentially with the input dimension d. Though evaluating the surrogate model takes very little time (compared to evaluating the original objective f), this still makes the algorithm unsuitable for problems where the input dimension d is large. A variant of DONE named CDONE has been applied to a mixed-integer problem, where the integer constraints were relaxed [10], but as mentioned earlier, this can lead to sub-optimal solutions. However, the downside of having to relax the integer constraints can be circumvented.
By choosing the basis functions in a certain way, we show how a model can be constructed for which it is known beforehand that the minima of the model exactly coincide with integer points. This makes it possible to apply the algorithm to combinatorial problems, as explained in the next section.

3 IDONE algorithm

The DONE algorithm [4] and its variants are able to solve problem (1) without the integer constraint by making use of a surrogate model. Every time a new measurement y = f(x) + ε comes in, the surrogate model is updated, the minimum of the surrogate model is found, and an exploration step is performed. To make the algorithm suitable for combinatorial optimization we propose a variant of DONE called IDONE (integer-DONE), where the surrogate model is guaranteed to have integer-valued minima. This section starts with a theoretical contribution where we propose a piece-wise linear surrogate model. The second theorem in this section is our main contribution, which guarantees integer-valued minima of the proposed surrogate model. The section proceeds with an explanation of the model fitting step, as well as a visualization of the surrogate model. Next, the section shows how the minimum of the surrogate model can be found, and how this minimum is then used to choose a new point of evaluation for the objective. The section ends with a short overview of the whole algorithm.

3.1 Piece-wise linear surrogate model

The proposed surrogate model g : R^d → R, with d the number of variables, is a linear combination of rectified linear units ReLU(z) = max(0, z), a basis function that is commonly used in the deep learning community as an activation function [26]:

    g(x) = Σ_{k=1}^{D} c_k max{0, z_k(x)},   z_k(x) = w_k^T x + b_k,    (2)

with x ∈ R^d, D ∈ N the number of basis functions, z_k : R^d → R the ReLU input functions, and w_k ∈ R^d, b_k ∈ R, c_k ∈ R for k = 1, ..., D. Unlike what is common practice in the deep learning community, the parameters w_k and b_k remain fixed in this surrogate model. This makes the model linear in its parameters c_k, allowing it to be trained via linear regression instead of iterative methods. This is explained in Section 3.2.
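As a small sketch (with our own variable names, not the authors' implementation), evaluating (2) amounts to a single fixed ReLU layer with a trainable linear read-out: the rows of W hold the fixed w_k, the vector b holds the b_k, and c holds the only trainable weights.

```python
import numpy as np

def surrogate(x, W, b, c):
    # g(x) = sum_k c_k * max(0, z_k(x)) with z_k(x) = w_k^T x + b_k, as in (2).
    z = W @ x + b                        # all ReLU input functions z_k(x) at once
    return float(c @ np.maximum(0.0, z))
```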
Because of the choice of basis functions, the surrogate model is actually piece-wise linear, which causes its local minima to be located in one of its corner points:

Theorem 1 Any strict local minimum of g is located in a point x* with z_k(x*) = 0 for d linearly independent z_k.

The reverse of this theorem is not necessarily true: if x̂ satisfies z_k(x̂) = 0 for d linearly independent z_k, then it depends on the parameters c_k of the model whether x̂ is actually a local minimum or not.

The number of local minima and their locations depend on the parameters w_k and b_k. In this work, we provide two options for choosing these parameters in such a way that the local minima are always found in integer solutions. In the first case, the functions z_k are simply chosen to have zeros on hyper-planes that together form an integer lattice:

Definition 1 (Basic model) Let g be as in (2). The parameters w_k, b_k of the basic model are chosen according to Algorithm 1. That is, every ReLU input function z_k is zero on a (d − 1)-dimensional hyper-plane with x_i = j for some dimension i ∈ {1, ..., d} and some integer l_i ≤ j ≤ u_i. This leads to ReLU input functions of the form z_k(x) = ±(x_i − j), i = 1, ..., d, j = l_i, ..., u_i, k = 1, ..., D. The total number of basis functions is D = 1 + 2 Σ_{i=1}^{d} (u_i − l_i). The D real-valued c_k's are the only parameters of this model.

Algorithm 1 Basic model parameters
  Input: d, l_i, u_i, i = 1, ..., d
  Output: w_k, b_k, k = 1, ..., D
  k ← 1
  w_k ← [0, ..., 0]^T; b_k ← 1; k ← k + 1        ▷ model bias
  for i = 1, ..., d do
    for j = l_i, ..., u_i do
      w ← e_i; b ← −j                            ▷ e_i = unit vector in dimension i
      if j = l_i then                            ▷ lower bound
        w_k ← w; b_k ← b; k ← k + 1
      else if j = u_i then                       ▷ upper bound
        w_k ← −w; b_k ← −b; k ← k + 1
      else                                       ▷ between the bounds
        w_k ← w; b_k ← b; k ← k + 1
        w_k ← −w; b_k ← −b; k ← k + 1

The basic model thus has D = 1 + 2 Σ_{i=1}^{d} (u_i − l_i) basis functions in total. The 1 comes from the model bias, a basis function that is equal to 1 everywhere. This allows the model to be shifted up or down. As an example, consider a problem with two variables in the range [2, 3]. Then D = 1 + 2 · ((3 − 2) + (3 − 2)) = 5 and thus the basic model takes the form g(x) = Σ_{k=1}^{5} c_k max{0, z_k(x)} with w_k and b_k as defined in Algorithm 1, so z_1(x) = 1 (model bias), z_2(x) = x_1 − 2, z_3(x) = −x_1 + 3, z_4(x) = x_2 − 2, and z_5(x) = −x_2 + 3. The exact location of the minimum of this model depends on the values of the parameters c_k, but as an example, suppose that c_2 = c_4 = 1 and c_1 = c_3 = c_5 = 0. In that case, for x ∈ [2, 3] × [2, 3] we have g(x) = max{0, x_1 − 2} + max{0, x_2 − 2}, with a non-strict local minimum located in x = (2, 2).

Since all the basis functions depend only on one variable, this basic model might not be suitable for problems where the decision variables have complex interactions. Therefore, in the advanced model, we use the same basis functions, but we also add basis functions that depend on two variables:

Definition 2 (Advanced model) Let g be as in (2). The parameters w_k and b_k of the advanced model are chosen according to Algorithm 2. That is, every ReLU input function z_k different from the ones in the basic model is zero on a (d − 1)-dimensional hyper-plane with x_i − x_{i−1} = j for some dimension i ∈ {2, ..., d} and some integer l_i − u_{i−1} ≤ j ≤ u_i − l_{i−1}. This leads to ReLU input functions of the form z_k(x) = ±(x_i − x_{i−1} − j), i = 2, ..., d, j = l_i − u_{i−1}, ..., u_i − l_{i−1}. This model has 2 Σ_{i=2}^{d} (u_i − l_i + u_{i−1} − l_{i−1}) more basis functions than the basic model. The ReLU input functions of this type are zero on diagonal hyper-planes through two variables, see Fig. 1.

Algorithm 2 Advanced model parameters
  Input: d, l_i, u_i, i = 1, ..., d
  Output: w_k, b_k, k = 1, ..., D
  Perform Algorithm 1.
  for i = 2, ..., d do
    for j = l_i − u_{i−1}, ..., u_i − l_{i−1} do
      w ← e_i − e_{i−1}; b ← −j
      if j = l_i − u_{i−1} then                  ▷ lower bound
        w_k ← w; b_k ← b; k ← k + 1
      else if j = u_i − l_{i−1} then             ▷ upper bound
        w_k ← −w; b_k ← −b; k ← k + 1
      else                                       ▷ between the bounds
        w_k ← w; b_k ← b; k ← k + 1
        w_k ← −w; b_k ← −b; k ← k + 1

To continue the example of the basic model above, we can turn it into an advanced model, which consequently has D = 9 input functions. Algorithm 2 defines additional input functions z_6(x) = x_2 − x_1 + 1, z_7(x) = x_2 − x_1, z_8(x) = −x_2 + x_1, and z_9(x) = −x_2 + x_1 + 1. Again, the location of the minimum of this advanced model depends on the values of the parameters c_k. For example, if c_5 = 1, c_7 = 1, and c_k = 0 for k ∉ {5, 7}, then we have g(x) = max{0, −x_2 + 3} + max{0, x_2 − x_1}, with a non-strict local minimum located in x = (3, 3).
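The construction of Algorithms 1 and 2 can be sketched in a few lines of code; the function names are our own, and the rows of the returned W are the w_k while the entries of b are the b_k, ready to be used in the surrogate sketched above.

```python
import numpy as np

def basic_model_parameters(lower, upper):
    """Parameters (w_k, b_k) of the basic model, following Algorithm 1."""
    d = len(lower)
    W, b = [np.zeros(d)], [1.0]                      # model bias: z(x) = 1 everywhere
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0                  # unit vector in dimension i
        for j in range(lower[i], upper[i] + 1):
            if j == lower[i]:                        # lower bound: z(x) = x_i - j
                W.append(e.copy());  b.append(-float(j))
            elif j == upper[i]:                      # upper bound: z(x) = -(x_i - j)
                W.append(-e.copy()); b.append(float(j))
            else:                                    # between the bounds: both signs
                W.append(e.copy());  b.append(-float(j))
                W.append(-e.copy()); b.append(float(j))
    return np.array(W), np.array(b)

def advanced_model_parameters(lower, upper):
    """Algorithm 2: the basic parameters plus diagonal terms in x_i - x_{i-1}."""
    W, b = map(list, basic_model_parameters(lower, upper))
    d = len(lower)
    for i in range(1, d):
        e = np.zeros(d); e[i] = 1.0; e[i - 1] = -1.0   # encodes x_i - x_{i-1}
        lo, hi = lower[i] - upper[i - 1], upper[i] - lower[i - 1]
        for j in range(lo, hi + 1):
            if j == lo:
                W.append(e.copy());  b.append(-float(j))
            elif j == hi:
                W.append(-e.copy()); b.append(float(j))
            else:
                W.append(e.copy());  b.append(-float(j))
                W.append(-e.copy()); b.append(float(j))
    return np.array(W), np.array(b)
```

For the two-variable example above, basic_model_parameters([2, 2], [3, 3]) produces the five pairs (w_k, b_k) listed there, and advanced_model_parameters adds the four diagonal terms z_6, ..., z_9.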
We now show our main theoretical result.

Theorem 2
(I) If x* is a strict local minimum of the basic model, then x* ∈ Z^d and l_i ≤ x*_i ≤ u_i, ∀i = 1, ..., d.
(II) If x* is a strict local minimum of the advanced model, then x* ∈ Z^d.
(III) If x* is a non-strict local minimum of the basic model, it holds that the model retains the same value when going from x* to the nearest point x̂ that satisfies x̂ ∈ Z^d and l_i ≤ x̂_i ≤ u_i, ∀i = 1, ..., d.
(IV) If x* is a non-strict local minimum of the advanced model, it holds that the model retains the same value when rounding x* to the nearest integer.

Fig. 1 Regions where the ReLU input functions z_k of the surrogate model are exactly zero, for the basic model (left) and the advanced model (right), for a problem with two variables with lower bounds (0, 0) and upper bounds (5, 3). The functions z_k have been chosen in such a way that they cross exactly at integer points within the bounded region. This ensures that the model has its minimum in one of these points, making the model more suitable for combinatorial optimization problems.

Proof (I) Let x* be a strict local minimum of the basic model. By Theorem 1, there are d linearly independent z_k with z_k(x*) = 0. From Algorithm 1 it can be seen that all functions z_k have the form z_k(x) = ±(x_i − j), for some i = 1, ..., d, j = l_i, ..., u_i. Since d of these functions are linearly independent, all d of them must have a different i. Since all d of them satisfy z_k(x*) = 0, it holds that x*_i = j for some j = l_i, ..., u_i, ∀i = 1, ..., d, which is what is claimed.

(II) Let x* be a strict local minimum of the advanced model. By Theorem 1, there are d linearly independent z_k with z_k(x*) = 0. This means that all these z_k together must depend on all x_i, i = 1, ..., d. From Algorithm 2 it can be seen that all functions z_k either have the same form as in the basic model, that is, z_k(x) = ±(x_i − j), i = 1, ..., d, j = l_i, ..., u_i, or they have the form z_k(x) = ±(x_i − x_{i−1} − j), i = 2, ..., d, j = l_i − u_{i−1}, ..., u_i − l_{i−1}. No matter the form, it thus holds that

    x*_i − x*_{i−1} ∈ Z,  ∀i = 2, ..., d.    (3)

To arrive at a contradiction, suppose that ∃s ∈ {1, ..., d} such that x*_s ∉ Z. Then by (3), x*_i ∉ Z for all i ∈ {1, ..., d}. However, this is only possible if none of the z_k have the form z_k(x) = ±(x_i − j), and all d of the z_k have the form z_k(x) = ±(x_i − x_{i−1} − j), for d different i. But by construction, there are only d − 1 of these last ones available, see Algorithm 2 (the for-loop starts at 2). Therefore, it is not true that ∃s ∈ {1, ..., d} such that x*_s ∉ Z. Hence, x*_i ∈ Z ∀i = 1, ..., d, which is what the theorem claims.

(III) Let x* be a non-strict local minimum of the basic model and suppose x*_s ∉ Z ∩ [l_s, u_s] for some s ∈ {1, ..., d}. Let L be the line segment from x* to the nearest point whose s-th coordinate x̂_s lies in the set Z ∩ [l_s, u_s], without including that point. Since the only z_k functions that depend on x_s have the form z_k(x) = ±(x_s − j), j = l_s, ..., u_s, it follows that z_k(x) = 0 does not happen on L for any z_k that depends on x_s. Therefore, model g is linear on this line segment, and since x* is a non-strict local minimum and g is continuous, g retains the same value when replacing x*_s by x̂_s. This can be repeated for all s for which x*_s ∉ Z ∩ [l_s, u_s], which proves the claim.

(IV) Let x* be a non-strict local minimum of the advanced model and suppose x* ∉ Z^d. We first show that rounding x* to the nearest integer does not change the sign of any z_k. Note that all the z_k of the advanced model have the form z_k(x) = ±(x_i − j) or z_k(x) = ±(x_i − x_{i−1} − j), for some i = 1, ..., d and some integer j. Let x̄_i denote rounding x_i to the nearest integer. Then we have (because j is integer):

    x_i ≤ j ⇒ x̄_i ≤ j,  and  x_i ≥ j ⇒ x̄_i ≥ j,
    x_i − x_{i−1} ≤ j ⇒ x̄_i ≤ x̄_{i−1} + j ⇒ x̄_i − x̄_{i−1} ≤ j,
    x_i − x_{i−1} ≥ j ⇒ x̄_i − x̄_{i−1} ≥ j.

Since the sign of none of the z_k changes when rounding, and model g is only nonlinear when going from z_k(x) < 0 to z_k(x) > 0 for some k = 1, ..., D, it follows that g is linear on the line segment from x* to the nearest integer. Together with the fact that x* is a non-strict local minimum, it follows that g retains the same value on this line segment. Finally, the claim is valid because g is continuous.

3.2 Fitting the model

Because the surrogate model g is linear in its parameters c_k, fitting the model can be done with linear regression. Given input-output pairs (x_n, y_n), n = 1, ..., N, this is done by solving the regularized linear least squares problem

    min_{c_N} Σ_{n=1}^{N} (y_n − g(x_n, c_N))^2 + λ ||c_N − c_0||^2,    (4)

with regularization parameter λ and initial weights c_0. The regularization part is added to overcome ill-conditioning, noise, and model over-fitting. Furthermore, by choosing c_0 = [0, 1, ..., 1]^T, it is ensured that the surrogate model is convex before the first iteration [10]. In this work, λ = 0.001 has been chosen.

To prevent having to solve this problem from scratch at every iteration (at a cost that grows with the number of measurements N), (4) is solved with the recursive least squares algorithm [27]. This algorithm has runtime O(D^2) per iteration, with D the number of basis functions used by the model. This implies that the computation time per iteration does not depend on the number of measurements, which is a big advantage compared to Bayesian optimization algorithms (which usually have complexity O(N^2) per iteration). The memory is also O(D^2), because a D × D covariance matrix needs to be stored. Since D scales linearly with the input dimension d and with the lower and upper bounds, the computational complexity of fitting the surrogate model is O(p^2 d^2), with p = max_i (u_i − l_i).

3.2.1 Model visualization

To visualize the surrogate model used by the IDONE algorithm, the fitting procedure is applied to a simple traveling salesman problem with four cities. The distance matrix for the cities is shown in Table 1. The decision variables are chosen as follows: the route starts at city 1, then variable x_1 ∈ {1, 2, 3} determines which of the three remaining cities is visited, then variable x_2 ∈ {1, 2} determines which of the two remaining cities is visited; then the one remaining city is visited, then city 1 is visited again. This problem has two optimal solutions: x = [1, 2]^T (route 1-2-4-3-1) and x = [2, 2]^T (route 1-3-4-2-1), both with a total distance of 80. All other solutions have a total distance of 95.

Table 1 Distance matrix for the simple traveling salesman problem (distances are symmetric)
    From city 1: to city 2 = 10, to city 3 = 15, to city 4 = 20
    From city 2: to city 3 = 35, to city 4 = 25
    From city 3: to city 4 = 30

Figure 2 shows what the surrogate model looks like after taking measurements in all possible data points for this problem, which is possible due to the low number of possibilities. It can be observed that this model is piece-wise linear and that any local minimum retains the same value when rounding to the nearest integer. Furthermore, the diagonal lines (see also Fig. 1) make the advanced model more accurate.

Fig. 2 Model output for the simple traveling salesman problem for the basic model (left) and the advanced model (right). The starting city is city 1, x_1 determines which remaining city is visited next, x_2 determines which remaining city is visited third, then the only remaining city is visited, and then city 1 is visited again.
3.3 Finding the minimum of the model

After fitting the model g at iteration N, the algorithm proceeds to find a local minimum using the new weights c_N:

    x* = arg min_x g(x, c_N)
    s.t. x ∈ Z^d, l_i ≤ x_i ≤ u_i, i = 1, ..., d.    (5)

The BFGS method [28] with a relaxation on the integer constraint was used to solve the above problem, with a provided analytical derivative of g. In this work, the derivative of the basis function ReLU(z) = max(0, z) has been chosen to be 0.5 at z = 0. The optimal solution was rounded to the nearest integer per Theorem 2.

3.4 Exploration

After fitting the model and finding its minimum, a new point x_{N+1} needs to be chosen to evaluate the function f. As in DONE [4], a random perturbation δ is added to the found minimum: x_{N+1} = x* + δ, but instead of a continuous random variable, δ ∈ {−1, 0, 1}^d is a discrete random variable with the following probabilities:

    P(δ_i = 0) = 1 − p,
    P(δ_i = 1) = p if x*_i = l_i,   0 if x*_i = u_i,   p/2 otherwise,
    P(δ_i = −1) = 0 if x*_i = l_i,  p if x*_i = u_i,   p/2 otherwise.    (6)

In this work, p = 1/d has been chosen (d is the number of variables).

3.5 IDONE algorithm

The IDONE algorithm iterates over three phases: updating the surrogate model with recursive least squares, finding the minimum of the model, and performing the exploration step. The pseudocode for the algorithm is shown in Algorithm 3. Depending on which subroutine is used in the first line, we refer to this algorithm as either IDONE-basic (using the basic model) or IDONE-advanced (using the advanced model).

Algorithm 3 IDONE-advanced, IDONE-basic
  Input: x_1 ∈ R^d, λ ∈ R, (l_i, u_i) ∀i = 1, ..., d, N ∈ N, p ∈ [0, 1]
  Output: x_N, y_N
  Get w_k, b_k, k = 1, ..., D from Algorithm 1 for IDONE-basic or from Algorithm 2 for IDONE-advanced
  c_0 ← [0, 1, ..., 1]^T ∈ R^D
  for n = 1, ..., N do
    Evaluate y_n = f(x_n) + ε
    Calculate c_n from c_{n−1} with recursive least squares
    Compute x* using (5)
    if n < N then
      x_{n+1} ← x* + δ, with δ as in Section 3.4
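To show how the pieces fit together, the following is a compact sketch of the loop in Algorithm 3, built on the helpers sketched earlier. It is an illustrative outline under our own naming, not the authors' released implementation: the recursive least-squares update follows the standard textbook form for problem (4), and SciPy's L-BFGS-B is used here as a stand-in for the BFGS step with relaxed integer constraints of Section 3.3.

```python
import numpy as np
from scipy.optimize import minimize

def relu_features(x, W, b):
    return np.maximum(0.0, W @ x + b)           # basis-function outputs max(0, z_k(x))

def idone(measure, lower, upper, n_iter, lam=1e-3, advanced=True, seed=0):
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower), np.asarray(upper)
    d = len(lower)
    p = 1.0 / d                                  # exploration probability, Section 3.4
    make = advanced_model_parameters if advanced else basic_model_parameters
    W, b = make(lower, upper)
    D = len(b)
    c = np.r_[0.0, np.ones(D - 1)]               # c_0 = [0, 1, ..., 1]^T (convex start)
    P = np.eye(D) / lam                          # recursive least squares, O(D^2) memory
    x = rng.integers(lower, upper + 1)           # initial guess
    best_x, best_y = None, np.inf
    for n in range(n_iter):
        y = measure(x)                           # costly, noisy measurement y = f(x) + eps
        if y < best_y:
            best_x, best_y = x.copy(), y
        phi = relu_features(x, W, b)             # one RLS update of c, O(D^2) time
        Pphi = P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)
        c += gain * (y - phi @ c)
        P -= np.outer(gain, Pphi)
        # Minimize the surrogate with the integer constraint relaxed, then round;
        # by Theorem 2 rounding does not change the surrogate value.
        g = lambda v: float(c @ np.maximum(0.0, W @ v + b))
        dg = lambda v: (c * np.where(W @ v + b > 0, 1.0, 0.5 * (W @ v + b == 0))) @ W
        res = minimize(g, x.astype(float), jac=dg, method="L-BFGS-B",
                       bounds=list(zip(lower, upper)))
        x_star = np.clip(np.round(res.x).astype(int), lower, upper)
        # Exploration step (6): perturb each coordinate by -1, 0 or +1.
        x = x_star.copy()
        for i in range(d):
            if rng.random() < p:
                if x_star[i] == lower[i]:  x[i] += 1
                elif x_star[i] == upper[i]: x[i] -= 1
                else:                       x[i] += rng.choice((-1, 1))
    return best_x, best_y
```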
4 Experimental results

The main idea put forward is to use a model that guarantees integer-valued minima. This idea is evaluated with two different models: a basic and an advanced model. We evaluate both models on two different benchmark problems: finding a robust route for a noise-perturbed asymmetric traveling salesman benchmark problem with 17 cities, and an artificial convex binary optimization problem with up to 150 integer variables. The first problem gives a first indication of the algorithm's performance on an objective function that follows from a simulation where there is a network structure: traveling between interconnected cities with uncertain travel times between them. The second problem shows an easier and more tangible situation, due to the convexity and the fact that we know the global optimum, which makes it easier to interpret results. It is also used to investigate the scalability of the proposed methods.

The algorithm is compared with two basic search strategies: random search (RS) and simulated annealing (SA), and two advanced algorithms representing the state of the art: Bayesian optimization [1, 22] (BO), using one of the existing Python implementations (https://github.com/fmfn/BayesianOptimization), and the Python library HyperOpt [23, 24] (HypOpt). HypOpt makes use of a Parzen estimator, which gives a probability distribution over the different discrete choices in the search space, based on how often they have been visited and whether the visited point was better or worse than the best solution so far. The other algorithms are also implemented in Python (we have made the IDONE algorithm available at https://bitbucket.org/lbliek2/idone), and for random search we used HypOpt's implementation. All experiments were done on a cluster (32 Intel Xeon E5-2650 2.0 GHz CPUs), without making use of parallelization of the algorithms themselves. For BO and HypOpt, we used the default settings. It should be noted that BO and HypOpt are both aimed at minimizing black-box functions using as few function evaluations as possible. For SA, the settings are explained below.

In the context of the IDONE algorithm, the SA algorithm essentially consists of just the exploration step of the IDONE algorithm (see Section 3.4), coupled with a probability of returning to the previous candidate solution. Suppose the current best solution is (x_b, y_b), and that the exploration step as defined in Section 3.4 gives a new candidate solution (x_c, y_c). If y_c < y_b, then x_c is accepted as the new best solution. Else, there is a probability that x_c is still accepted as the new best solution. This probability is equal to e^{(y_b − y_c)/T}, with T a so-called temperature. In this work, the simulated annealing algorithm starts out with a starting temperature T = T_0, and the temperature is multiplied with a factor T_f every iteration. This strategy is called a cooling schedule. For the asymmetric traveling salesman problem, T_0 = 4.48 and T_f = 0.996 have been chosen. For the convex binary optimization problem, T_0 = 1 and T_f = 0.95 have been chosen.

4.1 Robust routes for an asymmetric traveling salesman problem (17 cities)

Consider the asymmetric TSP benchmark called BR17. This benchmark was taken from the TSPLIB website [29], a library of sample instances for the traveling salesman problem. While there exist specific solvers developed for this problem, these solvers are not adequate if the objective to be minimized is perturbed by noise. Here, noise ε ∈ [0, 1], with a uniform distribution, was added to the distances between any two cities (for distances other than 0 or infinity, which both occurred once per city, the mean distance between cities is 16.43 for this instance). Furthermore, every time a sequence of cities has been chosen, we evaluate this route 100 times, with different noise samples. The objective is the worst-case (largest) route length of the chosen sequence of cities. Minimizing this objective should then result in a route that is robust to the noise in the problem. For the variables, the same encoding as in Section 3.2.1 has been used, giving 15 integer variables in total.
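To make this encoding concrete, the following sketch (our own helper, not part of the benchmark or the authors' code) decodes a vector of such successive-choice variables into a tour; for the 17-city BR17 instance the vector has 15 entries.

```python
def decode_route(x, n_cities):
    # x[k] in {1, ..., n_cities - 1 - k} picks the next city among those not yet visited.
    remaining = list(range(2, n_cities + 1))   # city 1 is the fixed starting city
    route = [1]
    for choice in x:
        route.append(remaining.pop(choice - 1))
    route.append(remaining.pop(0))             # the single city that is left
    route.append(1)                            # return to the start
    return route

# Example with the 4-city problem of Section 3.2.1: x = [1, 2] gives route 1-2-4-3-1.
print(decode_route([1, 2], 4))                 # [1, 2, 4, 3, 1]
```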
All algorithms were run 5 times on this problem, and the results are shown in Fig. 3. The BO algorithm was not included as it took over 80 hours per run. It can be seen that both HyperOpt and IDONE-advanced outperform the simpler benchmark methods. IDONE-advanced achieves similar results as HyperOpt while being twice as fast, and both are several orders of magnitude faster than Bayesian optimization. It seems IDONE-basic is unable to deal with the complex interaction between the variables due to the basic structure of the model, as it performs similarly to the simpler SA algorithm. All methods managed to outperform random search.

Fig. 3 Best found worst-case total distance (left) and corresponding computation time (right) of the noisy TSP with 17 cities for IDONE-advanced (IDONEa), IDONE-basic (IDONEb), random search (RS), simulated annealing (SA), and HyperOpt (HypOpt), averaged over 5 runs. The shaded area (left) visualizes the range across all 5 runs. The computation times [s] were: RS 80.776 ± 0.271, SA 47.684 ± 0.087, IDONEa 316.147 ± 7.902, IDONEb 140.689 ± 0.962, HypOpt 668.497 ± 19.198.

4.2 Convex binary optimization

To gain a better understanding of the different algorithms and their scalability, the second experiment is done on a function with a known mathematical formulation. Consider the problem of minimizing the function

    f(x) = (x − x*)^T A (x − x*),    (7)

with A a random positive semi-definite matrix, and x* ∈ {0, 1}^d a randomly chosen vector, with d the number of binary variables. The optimal solution x* and the structure of the function are not given to the different algorithms, only the number of variables and the fact that they are binary. Starting from a matrix U where each element is randomly generated from a uniform [0, 1] distribution, matrix A is constructed as

    A = (U + U^T)/d + I_{d×d},    (8)

with I_{d×d} the identity matrix. The function f can only be accessed via noisy measurements y = f(x) + ε, with ε ∈ [0, 1] a uniform random variable.
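A sketch of how such a benchmark instance can be generated and measured, following (7) and (8) with our own helper name and an arbitrary seed:

```python
import numpy as np

def make_convex_binary_problem(d, rng):
    U = rng.uniform(0.0, 1.0, size=(d, d))
    A = (U + U.T) / d + np.eye(d)               # matrix A as in (8)
    x_opt = rng.integers(0, 2, size=d)          # hidden optimum x* in {0, 1}^d
    def measure(x):
        diff = x - x_opt
        return float(diff @ A @ diff) + rng.uniform(0.0, 1.0)   # y = f(x) + eps
    return measure, x_opt

rng = np.random.default_rng(42)
measure, x_opt = make_convex_binary_problem(100, rng)
```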
We ran 100 experiments with this function, with IDONE and the other black-box optimization algorithms. For each run, A and x* were randomly generated, as well as the initial guess x_1. All algorithms were stopped after taking 1000 function evaluations, and the best found objective value was stored at each iteration.

Figure 4 shows a convergence plot for the case d = 100. It can be seen that the two variants of IDONE have the fastest convergence. The large number of variables is too much for a pure random search, but also for HyperOpt, even though the latter is designed for problems with hundreds of variables [24]. Simulated annealing still gives decent results on this problem.

Fig. 4 Lowest objective value found at each iteration of the binary convex optimization example with 100 binary variables, averaged over 100 runs. The shaded area indicates the standard deviation. For every run, the initial value, matrix A, and vector x* were chosen randomly.

Figure 5 shows the final objective value and computation time after 1000 iterations for the same problem for different values of d. The number of variables d was varied between 5 and 150. Bayesian optimization was only evaluated once, for d = 5, due to its large computation time. As can be seen, IDONE is the only algorithm that consistently gives a solution at or close to the optimal solution (which has an objective value between 0 and 1) for the highest dimensions. Where all algorithms get at or close to the optimal solution for problems with 10 variables or less, the difference between the algorithms becomes more distinguishable when more variables are considered. The strengths of HyperOpt, such as dealing with different types of variables that can have complex interactions, are not relevant for this particular problem, and the Parzen estimator surrogate model does not seem to scale well to higher dimensions compared to the piece-wise linear model used by IDONE. Both variations of IDONE also scale better than the other state-of-the-art algorithms in terms of computational time, with IDONE-basic being up to 20 times faster than HyperOpt, which is already several orders of magnitude faster than Bayesian optimization.

Fig. 5 Objective value (left) and computation time (right) of the convex binary optimization problem for the different algorithms after 1000 iterations, averaged over 100 runs, for problems with different numbers of variables d. Bayesian optimization (BO) was evaluated for only 1 run.

5 Conclusions and future work

The IDONE algorithm is a black-box optimization algorithm that is designed for combinatorial problems with binary or integer constraints, and has been shown to be useful in particular when the objective can only be accessed via costly and noisy evaluations. By using a surrogate model that is designed in such a way that its optimum coincides with an integer solution, the algorithm can be applied to combinatorial optimization problems without having to resort to rounding in the objective function. IDONE has a fixed computation time per iteration that scales with the number of variables but is not influenced by the number of times the function has been evaluated, which is an advantage compared to Bayesian optimization algorithms. One variant of the proposed algorithm, IDONE-advanced, has been shown to outperform random search and simulated annealing on the problem of finding robust routes in a noise-perturbed traveling salesman benchmark problem, and on a convex binary optimization problem with up to 150 variables. The other variant of the algorithm, IDONE-basic, mainly performed well in the second experiment, where it outperformed the state of the art. HyperOpt, a popular surrogate modeling algorithm for problems with hundreds of variables, performs similarly to IDONE-advanced on the traveling salesman benchmark problem, but does not scale as well on the binary optimization problem. On both problems, IDONE is faster than HyperOpt, which is already multiple orders of magnitude faster than regular Bayesian optimization algorithms. We conclude that the main idea to use surrogate models with integer-valued minima is successful, but that for smaller problems with many interactions between the variables, there seem to be some limitations, especially for simpler surrogate models. Understanding these limitations better is a challenging direction for future work.

The results show that there is room for improvement in the use of surrogate models for black-box combinatorial optimization, and that using continuous models with integer-valued local minima is a new and promising way forward. In future work, the special structure of the surrogate model will be further exploited to provide a faster implementation, and the algorithm will be tested on real-life applications of combinatorial optimization with expensive objective functions. The question also arises whether this algorithm would perform well in situations where the objective function is not expensive to evaluate, or does not contain noise.
Population-based methods perform particularly well on cheap black-box objective functions, so it would be interesting to see if they could be combined with the surrogate model used in this paper. As for the noiseless case, it is known that for continuous variables it then becomes easy to estimate the gradient and use more traditional gradient-based methods, but in the case of discrete variables the traditional combinatorial optimization methods might still benefit from IDONE's piece-wise linear surrogate model. Where surrogate-based optimization techniques have had great success in continuous optimization problems from many different fields, we hope that this work opens up the route to success of these techniques for the many open combinatorial problems in these fields.

Acknowledgments This work is part of the research programme Real-time data-driven maintenance logistics with project number 628.009.012, which is financed by the Dutch Research Council (NWO). The authors would also like to thank Arthur Guijt for helping with the Python code.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (1998)
2. Gosavi, A.: Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, vol. 55. Springer (2015)
3. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization, vol. 8. SIAM (2009)
4. Bliek, L., Verstraete, H.R.G.W., Verhaegen, M., Wahls, S.: Online optimization with costly and noisy measurements using random Fourier expansions. IEEE Transactions on Neural Networks and Learning Systems 29(1), 167–182 (2018)
5. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
6. Martinez-Cantin, R., de Freitas, N., Brochu, E., Castellanos, J., Doucet, A.: A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots 27(2), 93–103 (2009)
7. Garrido-Merchán, E.C., Hernández-Lobato, D.: Dealing with integer-valued variables in Bayesian optimization with Gaussian processes. arXiv:1706.03673 (2017)
8. Verwer, S., Zhang, Y., Ye, Q.C.: Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence 244, 368–395 (2017)
9. Verbeeck, D., Maes, F., De Grave, K., Blockeel, H.: Multi-objective optimization with surrogate trees. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 679–686. ACM (2013)
10. Bliek, L., Verhaegen, M., Wahls, S.: Online function minimization with convex random ReLU expansions. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE (2017)
11. Baptista, R., Poloczek, M.: Bayesian optimization of combinatorial structures. In: International Conference on Machine Learning, pp. 471–480 (2018)
12. Ueno, T., Rhone, T.D., Hou, Z., Mizoguchi, T., Tsuda, K.: COMBO: An efficient Bayesian optimization library for materials science. Materials Discovery 4, 18–21 (2016)
13. Aarts, E.H.L., Lenstra, J.K.: Local Search in Combinatorial Optimization. Princeton University Press (2003)
14. Rajeev, S., Krishnamoorthy, C.S.: Discrete optimization of structures using genetic algorithms. Journal of Structural Engineering 118(5), 1233–1250 (1992)
15. Kennedy, J., Eberhart, R.C.: A discrete binary version of the particle swarm algorithm. In: 1997 IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation, vol. 5, pp. 4104–4108. IEEE (1997)
16. Dorigo, M., Caro, G.D., Gambardella, L.M.: Ant algorithms for discrete optimization. Artificial Life 5(2), 137–172 (1999)
17. Hong, L.J., Nelson, B.L.: Discrete optimization via simulation using COMPASS. Operations Research 54(1), 115–129 (2006)
18. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM (2014)
19. Wolsey, L.A.: Integer Programming, vol. 52. John Wiley & Sons (1998)
20. Schrijver, A.: Theory of Linear and Integer Programming. John Wiley & Sons (1998)
21. Li, D., Sun, X.: Nonlinear Integer Programming, vol. 84. Springer Science & Business Media (2006)
22. Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications, vol. 37. Springer Science & Business Media (2012)
23. Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference, pp. 13–20 (2013)
24. Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: Proceedings of the 30th International Conference on Machine Learning. JMLR (2013)
25. Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561. IEEE (2008)
26. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)
27. Sayed, A.H., Kailath, T.: Recursive least-squares adaptive filters. The Digital Signal Processing Handbook, 21, 1 (1998)
28. Wright, S., Nocedal, J.: Numerical Optimization. Springer Science 35, 67–68 (1999)
29. TSPLIB: http://elib.zib.de/pub/mp-testdata/tsp/tsplib/tsplib.html (2019)

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

When a black-box optimization objective can only be evaluated with costly or noisy mea- surements, most standard optimization algorithms are unsuited to find the optimal solution. Specialized algorithms that deal with exactly this situation make use of surrogate models. These models are usually continuous and smooth, which is beneficial for continuous opti- mization problems, but not necessarily for combinatorial problems. However, by choosing the basis functions of the surrogate model in a certain way, we show that it can be guaran- teed that the optimal solution of the surrogate model is integer. This approach outperforms random search, simulated annealing and a Bayesian optimization algorithm on the prob- lem of finding robust routes for a noise-perturbed traveling salesman benchmark problem, with similar performance as another Bayesian optimization algorithm, and outperforms all compared algorithms on a convex binary optimization problem with a large number of variables. Keywords Surrogate models · Bayesian optimization · Black-box optimization 1 Introduction Traditional optimization techniques such as first order methods or branch and bound make use of a known mathematical formulation of the objective function, for example by cal- culating the derivative or a lower bound. However, many objective functions in real-life situations have no complete mathematical formulation. For example, smart grids or rail- ways are complex networks where every decision influences the whole network in such a way that the objective cannot be easily captured in one mathematical description. In such applications, we can observe the effect of decisions either in real life, or by running a simu- lation. Waiting for such a result can take some time, or may have some other cost associated with it. Furthermore, the outcome of two observations with the same decision variables may be different due to randomness that may be present in the real-life scenario, or due to Laurens Bliek l.bliek@tudelft.nl Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Van Mourik Broekmanweg 6, 2628 XE Delft, The Netherlands 640 L. Bliek et al. artificial randomness in the simulation. Such problems have been approached using meth- ods such as black-box or Bayesian optimization [1], simulation-based optimization [2], and derivative-free optimization [3]. Here, a model fits the relation between decision vari- ables and objective function, and then standard optimization techniques are used on the model instead of the original objective. These so-called surrogate modeling techniques have been applied successfully to continuous optimization problems in signal processing [4], optics [4], machine learning [5], robotics [6], and more. However, it is still an on-going research question on how these techniques can be applied effectively to combinatorial opti- mization problems. A common approach is to simply round to the nearest integer, a method that is known to be sub-optimal in traditional optimization, and also in black-box optimiza- tion [7]. Another option is to use discrete surrogate models from machine learning such as regression trees [8] or linear model trees [9]. Although powerful, this makes both model fitting and optimization computationally expensive. This work describes an approach where the surrogate model is still continuous, but where finding the optimum of the surrogate model gives an integer solution. 
The main contributions are as follows: – This surrogate modeling algorithm, called IDONE, with two variants (one with a basic and one with a more complex surrogate model). – A proof that finding the optimum of the surrogate model gives an integer solution. – Experimental results that show when IDONE outperforms random search, simulated annealing and Bayesian optimization. Section 2 gives a general description of the problem and an overview of related work. Section 3 describes the IDONE algorithm and the proof. In Section 4, IDONE is compared to random search, simulated annealing and Bayesian optimization on two different prob- lems: finding robust routes in a noise-perturbed traveling salesman benchmark problem, and a convex binary optimization problem. Finally, Section 5 contains conclusions and future work. 2 Problem description and related work Consider the problem of minimizing an objective f : R → R with integer and bound constraints: min f(x) s.t. x ∈ Z , l ≤ x ≤ u ,i = 1,...,d.(1) i i i These bounds are also assumed to be integer. It is assumed that f does not have a known mathematical formulation, and can only be accessed via noisy measurements y = f(x) + , y ∈ R, with  ∈ R zero-mean noise with finite variance. Furthermore, taking a mea- surement y is computationally expensive or is expensive due to a long measuring time, human decision making, deteriorating biological samples, or other reasons. Examples are hyper-parameter optimization in deep learning [10], contamination control of a food supply chain [11], and structure design in material science [12]. Although many standard optimization methods are unsuitable for this problem, there exists a vast number of methods that were designed with most of the above assumptions in mind. For example, local search heuristics [13] such as hill-climbing, simulated annealing, Black-box combinatorial optimization using models with integer-valued minima 641 or taboo search are general enough to be applied to this problem, and have the advantage of being easy to implement. These heuristics are considered as the baseline in this work, together with random search (simply assign the variables completely random values and keep the best results). Population-based heuristics such as genetic algorithms [14], particle swarm optimization [15], and ant colony optimization [16] operate in the same way as local search algorithms, but keep track of multiple candidate solutions that together form a population. These algorithms have been applied successfully in many applications, but are unsuitable for the problem described in this paper since evaluating a whole population of candidate solutions is not practical if each measurement is expensive. The same holds for algorithms that filter out the noise by averaging, such as COMPASS [17] or stochastic programming methods like sample average approximation [18], since they evaluate one candidate solution multiple times. One of the most successful methods for solving decision problems involving integer decision variables is (Mixed) Integer Linear Programming [19, 20]. This field has brought forward strong and fast solvers that only need a problem description to compute solutions, even for nonlinear objective functions [21] as we consider in this work. However, if the problem cannot be completely formulated, because the objective function has no mathemat- ical formulation, these solvers cannot be straightforwardly used. 
The core algorithm that makes these solvers so successful is the combined use of branch-and-bound search and (lin- ear) relaxation: the value for a certain assignment to decision variables for a relaxed problem is compared to that of an earlier found solution. If the relaxed problem value is not bet- ter, that assignment can never lead to a better solution and the respective possibilities are pruned from the search. In our context, even though observations of the value are possible, these may be noisy. This seriously invalidates this approach: even if it would be possible to also observe values for relaxations of the problem, if these are noisy we may incorrectly prune parts of the search space. Reducing this effect by repeating the observation multiple times would be too expensive under the assumption that each evaluation of the objective is expensive. Surrogate modeling techniques operate in a different way from the above methods: past measurements are used to fit a model, which is then used to select a next candidate solution. Bayesian optimization algorithms [1, 22], for example, have been successfully applied in many different fields. These methods use probability theory to determine the most promis- ing candidate point according to the surrogate model. However, when the variables are discrete, the typical approach is to relax the integer constraints, which often leads to sub- optimal solutions [7]. The authors in [7] tackled this problem by modifying the covariance function used in the surrogate model. Another approach, based on semi-definite program- ming, is given in [11]. The HyperOpt algorithm [23, 24] takes yet a different approach by using a Tree-structured Parzen Estimator as the surrogate model, which is discrete in case the variables are discrete. HyperOpt is considered the main contender in this paper. A downside of many Bayesian optimization algorithms is that the computation time per iteration scales quadratically with the number of measurements taken up until that point. This causes these methods to become slower over time, and after a certain number of iterations they may even violate the assumption that the bottleneck in the problem is the cost of evaluating the objective. This downside is recognized and overcome in two simi- lar algorithms: COMBO [12] and DONE [4]. Both algorithms use the power of random features [25] to get a fixed computation time every iteration, but COMBO is designed for combinatorial optimization while DONE is designed for continuous optimization. A dis- advantage of COMBO is that it evaluates the surrogate model at every possible candidate point, a number that grows exponentially with the input dimension d. Though evaluating 642 L. Bliek et al. the surrogate model takes very little time (compared to evaluating the original objective f ), this still makes the algorithm unsuitable for problems where the input dimension d is large. A variant of DONE named CDONE has been applied to a mixed-integer problem, where the integer constraints were relaxed [10], but as mentioned earlier, this can lead to sub-optimal solutions. However, the downside of having to relax the integer constraints can be circumvented. By choosing the basis functions in a certain way, we show how a model can be constructed for which it is known beforehand that the minima of the model exactly coincide with inte- ger points. This makes it possible to apply the algorithm to combinatorial problems, as explained in the next section. 
3 IDONE algorithm The DONE algorithm [4] and its variants are able to solve problem (1) without the integer constraint by making use of a surrogate model. Every time a new measurement y = f(x)+ comes in, the surrogate model is updated, the minimum of the surrogate model is found, and an exploration step is performed. To make the algorithm suitable for combinatorial optimization we propose a variant of DONE called IDONE (integer-DONE), where the surrogate model is guaranteed to have integer-valued minima. This section starts with a theoretical contribution where we propose a piece-wise linear surrogate model. The second theorem in this section is our main contribution, which guarantees integer-valued minima of the proposed surrogate model. The section proceeds with an explanation of the model fitting step, as well as a visualization of the surrogate model. Next, the section shows how the minimum of the surrogate model can be found, and how this minimum is then used to choose a new point of evaluation for the objective. This section ends with a short overview of the whole algorithm. 3.1 Piece-wise linear surrogate model The proposed surrogate model g : R → R, with d the number of variables, is a linear com- bination of rectified linear units ReLU(z) = max(0,z), a basis function that is commonly used in the deep learning community as an activation function [26]: g(x) = c max {0,z (x)} , k k k=1 z (x) = w x + b , (2) k k d d with x ∈ R , D ∈ N the number of basis functions, z : R → R the ReLU input function, and w ∈ R , b ∈ R, c ∈ R for k = 1,...,D. Unlike what is common practice in the deep k k k learning community, the parameters w and b remain fixed in this surrogate model. This k k makes the model linear in its parameters (c ), allowing it to be trained via linear regression instead of iterative methods. This is explained in Section 3.2. Because of the choice of basis functions, the surrogate model is actually piece-wise linear, which causes its local minima to be located in one of its corner points: ∗ ∗ Theorem 1 Any strict local minimum of g is located in a point x with z (x ) = 0 for d linearly independent z . k Black-box combinatorial optimization using models with integer-valued minima 643 Algorithm 1 Basic model parameters. Input d, l , u , i = 1,...,d i i Output w , b , k = 1,...,D k k k ← 1 w ←[0,..., 0] ; b ← 1; k ← k + 1  model bias k k for i = 1,...,d do for j = l ,...,u do i i w ← e  e = Unit vector in dimension i i i b =−j if j = l then  Lower bound w ← w; b ← b; k ← k + 1 k k else if j = u then  Upper bound w ←−w; b ←−b; k ← k + 1 k k else  Between the bounds w ← w; b ← b; k ← k + 1 k k w ←−w; b ←−b; k ← k + 1 k k The reverse of this theorem is not necessarily true: if x ˆ satisfies z (x ˆ ) = 0for d linearly independent z , then it depends on the parameters c of the model whether x ˆ is actually a k k local minimum or not. The number of local minima and their locations depend on the parameters w and b .In k k this work, we provide two options for choosing these parameters in such a way that the local minima are always found in integer solutions. In the first case, the functions z are simply chosen to have zeros on hyper-planes that together form an integer lattice: Definition 1 (Basic model ) Let g be as in (2). The parameters w , b of the basic model k k are chosen according to Algorithm 1. That is, every ReLU input function z is zero on a (d − 1)-dimensional hyper-plane with x = j for some dimension i ∈{1,...,d} and some integer l ≤ j ≤ u . 
This leads to ReLU input functions of the form z_k(x) = ±(x_i − j), i = 1,...,d, j = l_i,...,u_i, k = 1,...,D. The total number of basis functions is D = 1 + 2 Σ_{i=1}^{d} (u_i − l_i). The D real-valued c_k are the only parameters of this model.

The basic model thus has D = 1 + 2 Σ_{i=1}^{d} (u_i − l_i) basis functions in total. The 1 comes from the model bias, a basis function that is equal to 1 everywhere. This allows the model to be shifted up or down. As an example, consider a problem with two variables in the range [2, 3]. Then D = 1 + 2·((3 − 2) + (3 − 2)) = 5, and the basic model takes the form g(x) = Σ_{k=1}^{5} c_k max{0, z_k(x)} with w_k and b_k as defined in Algorithm 1, so z_1(x) = 1 (model bias), z_2(x) = x_1 − 2, z_3(x) = −x_1 + 3, z_4(x) = x_2 − 2, and z_5(x) = −x_2 + 3. The exact location of the minimum of this model depends on the values of the parameters c_k, but as an example, suppose that c_2 = c_4 = 1 and c_1 = c_3 = c_5 = 0. In that case, for x ∈ [2, 3] × [2, 3] we have g(x) = max{0, x_1 − 2} + max{0, x_2 − 2}, with a non-strict local minimum located in x = (2, 2).

Since all the basis functions depend only on one variable, this basic model might not be suitable for problems where the decision variables have complex interactions. Therefore, in the advanced model, we use the same basis functions, but we also add basis functions that depend on two variables:

Algorithm 2 Advanced model parameters.
Input: d, l_i, u_i, i = 1,...,d
Output: w_k, b_k, k = 1,...,D
  Perform Algorithm 1.
  for i = 2,...,d do
    for j = l_i − u_{i−1},...,u_i − l_{i−1} do
      w ← e_i − e_{i−1}    ▷ e_i = unit vector in dimension i
      b ← −j
      if j = l_i − u_{i−1} then    ▷ lower bound
        w_k ← w; b_k ← b; k ← k + 1
      else if j = u_i − l_{i−1} then    ▷ upper bound
        w_k ← −w; b_k ← −b; k ← k + 1
      else    ▷ between the bounds
        w_k ← w; b_k ← b; k ← k + 1
        w_k ← −w; b_k ← −b; k ← k + 1

Definition 2 (Advanced model) Let g be as in (2). The parameters w_k and b_k of the advanced model are chosen according to Algorithm 2. That is, every ReLU input function z_k different from the ones in the basic model is zero on a (d − 1)-dimensional hyper-plane with x_i − x_{i−1} = j for some dimension i ∈ {2,...,d} and some integer l_i − u_{i−1} ≤ j ≤ u_i − l_{i−1}. This leads to ReLU input functions of the form z_k(x) = ±(x_i − x_{i−1} − j), i = 2,...,d, j = l_i − u_{i−1},...,u_i − l_{i−1}. This model has 2 Σ_{i=2}^{d} (u_i − l_i + u_{i−1} − l_{i−1}) more basis functions than the basic model. The ReLU input functions of this type are zero on diagonal hyper-planes through two variables, see Fig. 1.

To continue the example of the basic model above, we can turn it into an advanced model, which consequently has D = 9 input functions. Algorithm 2 defines the additional input functions z_6(x) = x_2 − x_1 + 1, z_7(x) = x_2 − x_1, z_8(x) = −x_2 + x_1, and z_9(x) = −x_2 + x_1 + 1. Again, the location of the minimum of this advanced model depends on the values of the parameters c_k. For example, if c_5 = 1, c_7 = 1, and c_k = 0 for k ∉ {5, 7}, then we have g(x) = max{0, −x_2 + 3} + max{0, x_2 − x_1}, with a non-strict local minimum located in x = (3, 3).

We now show our main theoretical result.

Theorem 2
(I) If x* is a strict local minimum of the basic model, then x* ∈ Z^d and l_i ≤ x*_i ≤ u_i for all i = 1,...,d.
(II) If x* is a strict local minimum of the advanced model, then x* ∈ Z^d.
(III) If x* is a non-strict local minimum of the basic model, it holds that the model retains the same value when going from x* to the nearest point x̂ that satisfies x̂ ∈ Z^d and l_i ≤ x̂_i ≤ u_i for all i = 1,...,d.
(IV) If x* is a non-strict local minimum of the advanced model, it holds that the model retains the same value when rounding x* to the nearest integer.

Fig. 1 Regions where the ReLU input functions z_k of the surrogate model are exactly zero, for the basic model (left) and the advanced model (right), for a problem with two variables with lower bounds (0, 0) and upper bounds (5, 3). The functions z_k have been chosen in such a way that they cross exactly at integer points within the bounded region. This ensures that the model has its minimum in one of these points, making the model more suitable for combinatorial optimization problems

Proof (I) Let x* be a strict local minimum of the basic model. By Theorem 1, there are d linearly independent z_k with z_k(x*) = 0. From Algorithm 1 it can be seen that all functions z_k have the form z_k(x) = ±(x_i − j), for some i = 1,...,d, j = l_i,...,u_i. Since d of these functions are linearly independent, all d of them must have a different i. Since all d of them satisfy z_k(x*) = 0, it holds that x*_i = j for some j = l_i,...,u_i, for all i = 1,...,d, which is what is claimed.

(II) Let x* be a strict local minimum of the advanced model. By Theorem 1, there are d linearly independent z_k with z_k(x*) = 0. This means that all these z_k together must depend on all x_i, i = 1,...,d. From Algorithm 2 it can be seen that all functions z_k either have the same form as in the basic model, that is, z_k(x) = ±(x_i − j), i = 1,...,d, j = l_i,...,u_i, or they have the form z_k(x) = ±(x_i − x_{i−1} − j), i = 2,...,d, j = l_i − u_{i−1},...,u_i − l_{i−1}. No matter the form, it thus holds that

x*_i − x*_{i−1} ∈ Z   ∀i = 2,...,d.    (3)

To arrive at a contradiction, suppose that there exists s ∈ {1,...,d} such that x*_s ∉ Z. Then by (3), x*_i ∉ Z for all i ∈ {1,...,d}. However, this is only possible if none of the z_k have the form z_k(x) = ±(x_i − j), and all d of the z_k have the form z_k(x) = ±(x_i − x_{i−1} − j), for d different i. But by construction, there are only d − 1 of these last ones available, see Algorithm 2 (the for-loop starts at 2). Therefore, it is not true that there exists s ∈ {1,...,d} such that x*_s ∉ Z. Hence, x*_i ∈ Z for all i = 1,...,d, which is what the theorem claims.

(III) Let x* be a non-strict local minimum of the basic model and suppose x*_s ∉ Z ∩ [l_s, u_s] for some s ∈ {1,...,d}. Let L_s be the line segment from x*_s to the nearest point x̂_s in the set Z ∩ [l_s, u_s], without including that point. Since the only z_k functions that depend on x_s have the form z_k(x) = ±(x_s − j), j = l_s,...,u_s, it follows that z_k(x) = 0 does not happen on L_s for any z_k that depends on x_s. Therefore, model g is linear on this line segment, and since x* is a non-strict local minimum and g is continuous, g retains the same value when replacing x*_s by x̂_s. This can be repeated for all s for which x*_s ∉ Z ∩ [l_s, u_s], which proves the claim.

(IV) Let x* be a non-strict local minimum of the advanced model and suppose x* ∉ Z^d. We first show that rounding x* to the nearest integer does not change the sign of any z_k. Note that all the z_k of the advanced model have the form z_k(x) = ±(x_i − j) or z_k(x) = ±(x_i − x_{i−1} − j), for some i = 1,...,d and some integer j. Let x̄_i denote the result of rounding x*_i to the nearest integer. Then we have (because j is an integer):

x*_i ≤ j ⇒ x̄_i ≤ j, and x*_i ≥ j ⇒ x̄_i ≥ j,
x*_i − x*_{i−1} ≤ j ⇒ x̄_i ≤ x̄_{i−1} + j ⇒ x̄_i − x̄_{i−1} ≤ j,
x*_i − x*_{i−1} ≥ j ⇒ x̄_i − x̄_{i−1} ≥ j.
Since the sign of none of the z_k changes when rounding, and model g is only nonlinear when going from z_k(x) < 0 to z_k(x) > 0 for some k = 1,...,D, it follows that g is linear on the line segment from x* to the nearest integer. Together with the fact that x* is a non-strict local minimum, it follows that g retains the same value on this line segment. Finally, the claim is valid because g is continuous.

3.2 Fitting the model

Because the surrogate model g is linear in its parameters c_k, fitting the model can be done with linear regression. Given input-output pairs (x_n, y_n), n = 1,...,N, this is done by solving the regularized linear least squares problem

min_{c_N} Σ_{n=1}^{N} (y_n − g(x_n, c_N))^2 + λ ||c_N − c_0||^2,    (4)

with regularization parameter λ and initial weights c_0. The regularization part is added to overcome ill-conditioning, noise, and model over-fitting. Furthermore, by choosing c_0 = [0, 1,...,1]^T, it is ensured that the surrogate model is convex before the first iteration [10]. In this work, λ = 0.001 has been chosen.

To prevent having to solve this problem from scratch at every iteration (with a runtime of O(N D^2)), (4) is solved with the recursive least squares algorithm [27]. This algorithm has runtime O(D^2) per iteration, with D the number of basis functions used by the model. This implies that the computation time per iteration does not depend on the number of measurements, which is a big advantage compared to Bayesian optimization algorithms (which usually have complexity O(N^2) per iteration). The memory is also O(D^2), because a D × D covariance matrix needs to be stored. Since D scales linearly with the input dimension d and with the lower and upper bounds, the computational complexity of fitting the surrogate model is O(p^2 d^2), with p = max_i (u_i − l_i).

3.2.1 Model visualization

To visualize the surrogate model used by the IDONE algorithm, the fitting procedure is applied to a simple traveling salesman problem with four cities. The distance matrix for the cities is shown in Table 1. The decision variables are chosen as follows: the route starts at city 1, then variable x_1 ∈ {1, 2, 3} determines which of the three remaining cities is visited, then variable x_2 ∈ {1, 2} determines which of the two remaining cities is visited; then the one remaining city is visited, then city 1 is visited again. This problem has two optimal solutions: x = [1, 2]^T (route 1-2-4-3-1) and x = [2, 2]^T (route 1-3-4-2-1), both with a total distance of 80. All other solutions have a total distance of 95.

Table 1 Distance matrix for the simple traveling salesman problem

City    1    2    3    4
1       -   10   15   20
2      10    -   35   25
3      15   35    -   30
4      20   25   30    -

Figure 2 shows what the surrogate model looks like after taking measurements in all possible data points for this problem, which is possible due to the low number of possibilities. It can be observed that this model is piece-wise linear and that any local minimum retains the same value when rounding to the nearest integer. Furthermore, the diagonal lines (see also Fig. 1) make the advanced model more accurate.

3.3 Finding the minimum of the model

After fitting the model g at iteration N, the algorithm proceeds to find a local minimum using the new weights c_N:

x*_N = arg min_x g(x, c_N)
s.t. x ∈ Z^d, l_i ≤ x_i ≤ u_i, i = 1,...,d.    (5)

The BFGS method [28] with a relaxation on the integer constraint was used to solve the above problem, with a provided analytical derivative of g.
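To make this concrete, the sketch below continues the earlier snippet with the advanced model of Algorithm 2, a recursive least-squares update for (4), and the relaxed minimization of (5). NumPy and SciPy are assumed; all class and function names are ours, L-BFGS-B stands in for plain BFGS because it handles the bound constraints directly, and the random starting point for the inner minimization is a simplification of ours.

```python
from scipy.optimize import minimize

def advanced_model_parameters(d, lower, upper):
    """Algorithm 2: the basic model plus ReLU input functions that vanish on the
    diagonal hyper-planes x_i - x_{i-1} = j."""
    W, b = map(list, basic_model_parameters(d, lower, upper))
    for i in range(1, d):
        lo, hi = lower[i] - upper[i - 1], upper[i] - lower[i - 1]
        for j in range(lo, hi + 1):
            w = np.zeros(d); w[i], w[i - 1] = 1.0, -1.0     # e_i - e_{i-1}
            if j == lo:
                W.append(w); b.append(-j)
            elif j == hi:
                W.append(-w); b.append(j)
            else:
                W.append(w); b.append(-j)
                W.append(-w); b.append(j)
    return np.array(W), np.array(b, dtype=float)

class RLSModel:
    """Surrogate model (2) fitted with recursive least squares: O(D^2) per update."""
    def __init__(self, W, b, lam=1e-3):
        self.W, self.b = W, b
        D = len(b)
        self.c = np.ones(D); self.c[0] = 0.0   # c_0 = [0, 1, ..., 1]^T gives a convex start
        self.P = np.eye(D) / lam               # inverse of the regularized information matrix

    def features(self, x):
        return np.maximum(0.0, self.W @ x + self.b)

    def update(self, x, y):
        """One recursive least-squares step for the regularized problem (4)."""
        phi = self.features(x)
        Pphi = self.P @ phi
        gain = Pphi / (1.0 + phi @ Pphi)
        self.c += gain * (y - phi @ self.c)
        self.P -= np.outer(gain, Pphi)

    def predict(self, x):
        return self.c @ self.features(x)

def minimize_model(model, lower, upper):
    """Approximate (5): relax the integer constraint, run a quasi-Newton method from a
    random feasible point, then round the result (justified by Theorem 2)."""
    x0 = np.array([np.random.randint(l, u + 1) for l, u in zip(lower, upper)], dtype=float)
    res = minimize(model.predict, x0, method="L-BFGS-B", bounds=list(zip(lower, upper)))
    return np.clip(np.round(res.x), lower, upper)
```

The recursive update keeps the per-iteration cost at O(D^2) regardless of how many measurements have been taken, which matches the complexity discussion above.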
In this work, the derivative of the basis function ReLU(z) = max(0, z) has been chosen to be 0.5 at z = 0. The optimal solution was then rounded to the nearest integer, which is justified by Theorem 2.

Fig. 2 Model output for the simple traveling salesman problem for the basic model (left) and the advanced model (right). The starting city is city 1, x_1 determines which remaining city is visited next, x_2 determines which remaining city is visited third, then the only remaining city is visited, and then city 1 is visited again

3.4 Exploration

After fitting the model and finding its minimum, a new point x_{N+1} needs to be chosen to evaluate the function f. As in DONE [4], a random perturbation δ is added to the found minimum: x_{N+1} = x*_N + δ, but instead of a continuous random variable, δ ∈ {−1, 0, 1}^d is a discrete random variable with the following probabilities:

P(δ_i = 0) = 1 − p,
P(δ_i = 1) = p if x*_i = l_i, 0 if x*_i = u_i, and p/2 otherwise,
P(δ_i = −1) = 0 if x*_i = l_i, p if x*_i = u_i, and p/2 otherwise.    (6)

In this work, p = 1/d has been chosen (d is the number of variables).

3.5 IDONE algorithm

The IDONE algorithm iterates over three phases: updating the surrogate model with recursive least squares, finding the minimum of the model, and performing the exploration step. The pseudocode for the algorithm is shown in Algorithm 3; a code sketch of the exploration step and of this loop is given further below. Depending on which subroutine is used in the first line, we refer to this algorithm as either IDONE-basic (using the basic model) or IDONE-advanced (using the advanced model).

Algorithm 3 IDONE-advanced, IDONE-basic.
Input: x_1 ∈ R^d, λ ∈ R, (l_i, u_i) ∀i = 1,...,d, N ∈ N, p ∈ [0, 1]
Output: x_N, y_N
  Get w_k, b_k, k = 1,...,D from Algorithm 1 for IDONE-basic or from Algorithm 2 for IDONE-advanced
  c_0 ← [0, 1,...,1]^T ∈ R^D
  for n = 1,...,N do
    Evaluate y_n = f(x_n) + ε
    Calculate c_n from c_{n−1} with recursive least squares
    Compute x*_n using (5)
    if n < N then
      x_{n+1} ← x*_n + δ, with δ as in Section 3.4

4 Experimental results

The main idea put forward is to use a model that guarantees integer-valued minima. This idea is evaluated with two different models: a basic and an advanced model. We evaluate both models on two different benchmark problems: finding a robust route for a noise-perturbed asymmetric traveling salesman benchmark problem with 17 cities, and an artificial convex binary optimization problem with up to 150 integer variables. The first problem gives a first indication of the algorithm's performance on an objective function that follows from a simulation with a network structure: traveling between interconnected cities with uncertain travel times between them. The second problem presents an easier and more tangible situation - due to the convexity and the fact that we know the global optimum - which makes it easier to interpret the results. It is also used to investigate the scalability of the proposed methods.

The algorithm is compared with two basic search strategies: random search (RS) and simulated annealing (SA), and with two advanced algorithms representing the state of the art: Bayesian optimization [1, 22] (BO), using one of the existing Python implementations, and the Python library HyperOpt [23, 24] (HypOpt). HypOpt makes use of a Parzen estimator, which gives a probability distribution over the different discrete choices in the search space, based on how often they have been visited and whether the visited point was better or worse than the best solution so far.
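To connect the pieces, the exploration step (6) and the loop of Algorithm 3 described above could be sketched as follows, reusing the earlier snippets; the function names and minor details (such as the handling of the final iteration) are our own simplifications, not the authors' released implementation.

```python
def explore(x_min, lower, upper, p):
    """Exploration step of Section 3.4: perturb each coordinate of the model minimum
    by -1, 0 or +1 according to (6), never stepping outside the bounds."""
    x_new = np.array(x_min, dtype=float)
    for i in range(len(x_new)):
        r = np.random.rand()
        if r < 1 - p:
            continue                                  # delta_i = 0 with probability 1 - p
        if x_new[i] == lower[i]:
            x_new[i] += 1                             # at the lower bound: only +1 possible
        elif x_new[i] == upper[i]:
            x_new[i] -= 1                             # at the upper bound: only -1 possible
        else:
            x_new[i] += 1 if r < 1 - p / 2 else -1    # otherwise +1 or -1, each with prob. p/2
    return x_new

def idone(f, x1, lower, upper, n_evals, advanced=True, p=None):
    """Sketch of Algorithm 3: model update, model minimization, exploration."""
    d = len(x1)
    p = 1.0 / d if p is None else p
    build = advanced_model_parameters if advanced else basic_model_parameters
    model = RLSModel(*build(d, lower, upper))
    x = np.array(x1, dtype=float)
    best_x, best_y = None, np.inf
    for n in range(n_evals):
        y = f(x.astype(int))                          # costly and/or noisy evaluation
        if y < best_y:
            best_x, best_y = x.astype(int), y
        model.update(x, y)                            # recursive least squares, O(D^2)
        x_min = minimize_model(model, lower, upper)   # solve (5)
        if n < n_evals - 1:
            x = explore(x_min, lower, upper, p)       # perturbation as in Section 3.4
    return best_x, best_y
```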
The other algorithms are also implemented in Python, and for random search we used HypOpt's implementation. (The BO implementation used is available at https://github.com/fmfn/BayesianOptimization; we have made the IDONE algorithm available at https://bitbucket.org/lbliek2/idone.) All experiments were done on a cluster (32 Intel Xeon E5-2650 2.0 GHz CPUs), without making use of parallelization of the algorithms themselves. For BO and HypOpt, we used the default settings. It should be noted that BO and HypOpt are both aimed at minimizing black-box functions using as few function evaluations as possible. For SA, the settings are explained below.

In the context of the IDONE algorithm, the SA algorithm essentially consists of just the exploration step of the IDONE algorithm (see Section 3.4), coupled with a probability of returning to the previous candidate solution. Suppose the current best solution is (x_b, y_b), and that the exploration step as defined in Section 3.4 gives a new candidate solution (x_c, y_c). If y_c < y_b, then x_c is accepted as the new best solution. Else, there is a probability that x_c is still accepted as the new best solution. This probability is equal to e^{(y_b − y_c)/T}, with T a so-called temperature. In this work, the simulated annealing algorithm starts out with a starting temperature T = T_0, and the temperature is multiplied with a factor T_f every iteration. This strategy is called a cooling schedule. For the asymmetric traveling salesman problem, T_0 = 4.48 and T_f = 0.996 have been chosen. For the convex binary optimization problem, T_0 = 1 and T_f = 0.95 have been chosen (a code sketch of this baseline is given below).

4.1 Robust routes for an asymmetric traveling salesman problem (17 cities)

Consider the asymmetric TSP benchmark called BR17. This benchmark was taken from the TSPLIB website [29], a library of sample instances for the traveling salesman problem. While there exist specific solvers developed for this problem, these solvers are not adequate if the objective to be minimized is perturbed by noise. Here, uniformly distributed noise ε ∈ [0, 1] was added to the distances between any two cities (for distances other than 0 or infinity, which both occur once per city, the mean distance between cities is 16.43 for this instance). Furthermore, every time a sequence of cities has been chosen, we evaluate this route 100 times, with different noise samples. The objective is the worst-case (largest) route length of the chosen sequence of cities. Minimizing this objective should then result in a route that is robust to the noise in the problem. For the variables, the same encoding as in Section 3.2.1 has been used, giving 15 integer variables in total.

All algorithms were run 5 times on this problem, and the results are shown in Fig. 3. The BO algorithm was not included as it took over 80 hours per run. It can be seen that both HyperOpt and IDONE-advanced outperform the simpler benchmark methods. IDONE-advanced achieves similar results as HyperOpt while being twice as fast, and both are several orders of magnitude faster than Bayesian optimization. IDONE-basic seems unable to deal with the complex interactions between the variables due to the basic structure of its model, as it performs similarly to the simpler SA algorithm. All methods managed to outperform random search.
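For completeness, the simulated-annealing baseline described above (IDONE's exploration step as the move, acceptance probability e^{(y_b − y_c)/T}, geometric cooling) is small enough to sketch in full; it reuses explore from the IDONE sketch, and the function name is ours.

```python
def simulated_annealing(f, x1, lower, upper, n_evals, T0, Tf, p=None):
    """SA baseline: explore() generates a candidate, which is always accepted when better,
    and accepted with probability exp((y_b - y_c)/T) otherwise; T is cooled by factor Tf."""
    d = len(x1)
    p = 1.0 / d if p is None else p
    x_b = np.array(x1, dtype=float); y_b = f(x_b.astype(int))
    T = T0
    for _ in range(n_evals - 1):
        x_c = explore(x_b, lower, upper, p)           # neighbour from Section 3.4
        y_c = f(x_c.astype(int))
        if y_c < y_b or np.random.rand() < np.exp((y_b - y_c) / T):
            x_b, y_b = x_c, y_c
        T *= Tf                                       # cooling schedule
    return x_b.astype(int), y_b
```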
Fig. 3 Best found worst-case total distance (left) and corresponding computation time (right) of the noisy TSP with 17 cities for IDONE-advanced (IDONEa), IDONE-basic (IDONEb), random search (RS), simulated annealing (SA), and HyperOpt (HypOpt), averaged over 5 runs. The shaded area (left) visualizes the range across all 5 runs. The average computation times (right) were: RS 80.776 ± 0.271 s, SA 47.684 ± 0.087 s, IDONEa 316.147 ± 7.902 s, IDONEb 140.689 ± 0.962 s, HypOpt 668.497 ± 19.198 s

4.2 Convex binary optimization

To gain a better understanding of the different algorithms and their scalability, the second experiment is done on a function with a known mathematical formulation. Consider the problem of minimizing the function

f(x) = (x − x*)^T A (x − x*),    (7)

with A a random positive semi-definite matrix and x* ∈ {0, 1}^d a randomly chosen vector, with d the number of binary variables. The optimal solution x* and the structure of the function are not given to the different algorithms, only the number of variables and the fact that they are binary. Starting from a matrix U in which each element is randomly generated from a uniform [0, 1] distribution, matrix A is constructed as

A = (U + U^T)/d + I_{d×d},    (8)

with I_{d×d} the identity matrix. The function f can only be accessed via noisy measurements y = f(x) + ε, with ε ∈ [0, 1] a uniform random variable. We ran 100 experiments with this function, with IDONE and the other black-box optimization algorithms. For each run, A and x* were randomly generated, as well as the initial guess x_1. All algorithms were stopped after 1000 function evaluations, and the best found objective value was stored at each iteration.

Figure 4 shows a convergence plot for the case d = 100. It can be seen that the two variants of IDONE have the fastest convergence. The large number of variables is too much for a pure random search, but also for HyperOpt, even though the latter is designed for problems with hundreds of variables [24]. Simulated annealing still gives decent results on this problem.

Fig. 4 Lowest objective value found at each iteration of the binary convex optimization example with 100 binary variables, averaged over 100 runs. The shaded area indicates the standard deviation. For every run, the initial value, matrix A, and vector x* were chosen randomly

Figure 5 shows the final objective value and computation time after 1000 iterations for the same problem for different values of d. The number of variables d was varied between 5 and 150. Bayesian optimization was only evaluated once, for d = 5, due to its large computation time. As can be seen, IDONE is the only algorithm that consistently gives a solution at or close to the optimum (which has an objective value between 0 and 1) for the highest dimensions. Where all algorithms get at or close to the optimum for problems with 10 variables or fewer, the differences between the algorithms become more pronounced when more variables are considered. The strengths of HyperOpt, such as dealing with different types of variables that can have complex interactions, are not relevant for this particular problem, and its Parzen estimator surrogate model does not seem to scale as well to higher dimensions as the piece-wise linear model used by IDONE. Both variants of IDONE also scale better than the other state-of-the-art algorithms in terms of computation time, with IDONE-basic being up to 20 times faster than HyperOpt, which is itself already several orders of magnitude faster than Bayesian optimization.
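The benchmark of (7)-(8) is easy to reproduce. The sketch below follows the construction described above (uniform U, A = (U + U^T)/d + I, a hidden binary optimum, and uniform measurement noise); the function name is our own.

```python
def make_convex_binary_problem(d, rng=None):
    """Construct a noisy instance of the convex binary benchmark (7)-(8)."""
    rng = np.random.default_rng() if rng is None else rng
    U = rng.random((d, d))
    A = (U + U.T) / d + np.eye(d)              # random positive definite matrix, see (8)
    x_star = rng.integers(0, 2, size=d)        # hidden binary optimum
    def f(x):
        diff = np.asarray(x) - x_star
        return diff @ A @ diff + rng.random()  # uniform [0, 1) measurement noise
    return f, x_star
```

Calling make_convex_binary_problem(100) gives an instance of the kind used for Figs. 4 and 5; at the optimum the measured value consists of the noise term alone, which is why the best attainable objective value lies between 0 and 1.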
Fig. 5 Objective value (left) and computation time (right) of the convex binary optimization problem for the different algorithms after 1000 iterations, averaged over 100 runs, for problems with different numbers of variables d. Bayesian optimization (BO) was evaluated for only 1 run

5 Conclusions and future work

The IDONE algorithm is a black-box optimization algorithm designed for combinatorial problems with binary or integer constraints, and it has been shown to be useful in particular when the objective can only be accessed via costly and noisy evaluations. By using a surrogate model that is designed in such a way that its optimum coincides with an integer solution, the algorithm can be applied to combinatorial optimization problems without having to resort to rounding in the objective function. IDONE has a fixed computation time per iteration that scales with the number of variables but is not influenced by the number of times the function has been evaluated, which is an advantage compared to Bayesian optimization algorithms. One variant of the proposed algorithm, IDONE-advanced, has been shown to outperform random search and simulated annealing on the problem of finding robust routes in a noise-perturbed traveling salesman benchmark problem, and on a convex binary optimization problem with up to 150 variables. The other variant of the algorithm, IDONE-basic, mainly performed well in the second experiment, where it outperformed the state of the art. HyperOpt, a popular surrogate modeling algorithm for problems with hundreds of variables, performs similarly to IDONE-advanced on the traveling salesman benchmark problem, but does not scale as well on the binary optimization problem. On both problems, IDONE is faster than HyperOpt, which is itself multiple orders of magnitude faster than regular Bayesian optimization algorithms. We conclude that the main idea of using surrogate models with integer-valued minima is successful, but that for smaller problems with many interactions between the variables there seem to be some limitations, especially for simpler surrogate models. Understanding these limitations better is a challenging direction for future work.

The results show that there is room for improvement in the use of surrogate models for black-box combinatorial optimization, and that using continuous models with integer-valued local minima is a new and promising way forward. In future work, the special structure of the surrogate model will be further exploited to provide a faster implementation, and the algorithm will be tested on real-life applications of combinatorial optimization with expensive objective functions. The question also arises whether this algorithm would perform well in situations where the objective function is not expensive to evaluate, or does not contain noise. Population-based methods perform particularly well on cheap black-box objective functions, so it would be interesting to see whether they could be combined with the surrogate model used in this paper. As for the noiseless case, it is known that for continuous variables it then becomes easy to estimate the gradient and use more traditional gradient-based methods, but in the case of discrete variables the traditional combinatorial optimization methods might still benefit from IDONE's piece-wise linear surrogate model.
While surrogate-based optimization techniques have had great success in continuous optimization problems from many different fields, we hope that this work opens up the route to success for these techniques on the many open combinatorial problems in those fields.

Acknowledgments This work is part of the research programme Real-time data-driven maintenance logistics with project number 628.009.012, which is financed by the Dutch Research Council (NWO). The authors would also like to thank Arthur Guijt for helping with the Python code.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (1998)
2. Gosavi, A.: Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Springer, vol. 55 (2015)
3. Conn, A.R., Scheinberg, K., Vicente, L.N.: Introduction to Derivative-Free Optimization. SIAM, vol. 8 (2009)
4. Bliek, L., Verstraete, H.R.G.W., Verhaegen, M., Wahls, S.: Online optimization with costly and noisy measurements using random Fourier expansions. IEEE Transactions on Neural Networks and Learning Systems 29(1), 167–182 (2018)
5. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
6. Martinez-Cantin, R., de Freitas, N., Brochu, E., Castellanos, J., Doucet, A.: A Bayesian exploration-exploitation approach for optimal online sensing and planning with a visually guided mobile robot. Autonomous Robots 27(2), 93–103 (2009)
7. Garrido-Merchán, E.C., Hernández-Lobato, D.: Dealing with integer-valued variables in Bayesian optimization with Gaussian processes. arXiv:1706.03673 (2017)
8. Verwer, S., Zhang, Y., Ye, Q.C.: Auction optimization using regression trees and linear models as integer programs. Artificial Intelligence 244, 368–395 (2017)
9. Verbeeck, D., Maes, F., De Grave, K., Blockeel, H.: Multi-objective optimization with surrogate trees. In: Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 679–686. ACM (2013)
10. Bliek, L., Verhaegen, M., Wahls, S.: Online function minimization with convex random ReLU expansions. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE (2017)
11. Baptista, R., Poloczek, M.: Bayesian optimization of combinatorial structures. In: International Conference on Machine Learning, pp. 471–480 (2018)
12. Ueno, T., Rhone, T.D., Hou, Z., Mizoguchi, T., Tsuda, K.: COMBO: An efficient Bayesian optimization library for materials science. Materials Discovery 4, 18–21 (2016)
13. Aarts, E.H.L., Lenstra, J.K.: Local Search in Combinatorial Optimization. Princeton University Press (2003)
14. Rajeev, S., Krishnamoorthy, C.S.: Discrete optimization of structures using genetic algorithms. Journal of Structural Engineering 118(5), 1233–1250 (1992)
15. Kennedy, J., Eberhart, R.C.: A discrete binary version of the particle swarm algorithm. In: 1997 IEEE International Conference on Systems, Man, and Cybernetics: Computational Cybernetics and Simulation, vol. 5, pp. 4104–4108. IEEE (1997)
16. Dorigo, M., Di Caro, G., Gambardella, L.M.: Ant algorithms for discrete optimization. Artificial Life 5(2), 137–172 (1999)
17. Hong, L.J., Nelson, B.L.: Discrete optimization via simulation using COMPASS. Operations Research 54(1), 115–129 (2006)
18. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling and Theory. SIAM (2014)
19. Wolsey, L.A.: Integer Programming. John Wiley & Sons, vol. 52 (1998)
20. Schrijver, A.: Theory of Linear and Integer Programming. John Wiley & Sons (1998)
21. Li, D., Sun, X.: Nonlinear Integer Programming. Springer Science & Business Media, vol. 84 (2006)
22. Mockus, J.: Bayesian Approach to Global Optimization: Theory and Applications. Springer Science & Business Media, vol. 37 (2012)
23. Bergstra, J., Yamins, D., Cox, D.D.: Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In: Proceedings of the 12th Python in Science Conference, pp. 13–20 (2013)
24. Bergstra, J., Yamins, D., Cox, D.D.: Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In: Proceedings of the 30th International Conference on Machine Learning. JMLR (2013)
25. Rahimi, A., Recht, B.: Uniform approximation of functions with random bases. In: 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561. IEEE (2008)
26. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
27. Sayed, A.H., Kailath, T.: Recursive least-squares adaptive filters. The Digital Signal Processing Handbook 21(1) (1998)
28. Wright, S., Nocedal, J.: Numerical Optimization. Springer (1999)
29. TSPLIB: http://elib.zib.de/pub/mp-testdata/tsp/tsplib/tsplib.html (2019)

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
