Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Spacecraft collision avoidance challenge: Design and results of a machine learning competition

Spacecraft collision avoidance challenge: Design and results of a machine learning competition Astrodynamics https://doi.org/10.1007/s42064-021-0101-5 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 1 1 2 3 4 4 Thomas Uriot , Dario Izzo (B), Lu s F. Sim~ oes , Rasit Abay , Nils Einecke , Sven Rebhan , 5 5 5 5 Jose Martinez-Heras , Francesca Letizia , Jan Siminski , and Klaus Merz 1. The European Space Agency, Noordwijk, 2201 AZ, the Netherlands 2. ML Analytics, Lisbon, Portugal 3. FuturifAI, Canberra, Australia 4. Honda Research Institute Europe GmbH, O enbach 63073, Germany 5. ESOC, Space Debris Oce, Darmstadt 64293, Germany ABSTRACT KEYWORDS Spacecraft collision avoidance procedures have become an essential part of satellite space operations. Complex and constantly updated estimates of the collision risk between debris orbiting objects inform various operators who can then plan risk mitigation measures. collision avoidance Such measures can be aided by the development of suitable machine learning (ML) competition models that predict, for example, the evolution of the collision risk over time. In October kelvins 2019, in an attempt to study this opportunity, the European Space Agency released a large curated dataset containing information about close approach events in the form of conjunction data messages (CDMs), which was collected from 2015 to 2019. This dataset Research Article was used in the Spacecraft Collision Avoidance Challenge, which was an ML competition where participants had to build models to predict the nal collision risk between orbiting Received: 1 October 2020 objects. This paper describes the design and results of the competition and discusses the Accepted: 27 January 2021 © The Author(s) 2021 challenges and lessons learned when applying ML methods to this problem domain. 1 Introduction to de ning guidelines to mitigate collision risk and pre- serve the space environment for future generations [6]. The overcrowding of the low Earth orbit (LEO) has been As a result, agencies, as well as operators and manufac- extensively discussed in the scienti c literature [1, 2]. turers, have been assessing a number of approaches and More than 900,000 small debris objects with a radius technologies in an attempt to alleviate this problem [7{9]. of at least 1 cm have been estimated to be currently Despite all the e orts to actively control debris and orbiting uncontrolled in the LEO , posing a threat to satellite populations, this problem is still of increasing operational satellites [3]. The consequences of an im- concern today. To illustrate the crowding of some areas pact between orbiting objects can be dramatic, as the of the LEO, we have visualized, as of 22 May 2020, the 2009 Iridium-33/Cosmos-2251 collision demonstrated [4]. position of all 19,084 objects monitored by the radar and While shielding a satellite may be e ective for impacts optical observations of the United States Space Surveil- with smaller objects [5], any impact of an active satellite lance Network (SSN) in Fig. 1. The gure clearly shows with objects that have cross-sections larger than 10 cm the density of objects at low altitudes, as well as the is most likely to result in its complete destruction. Over density drop around the northern and southern polar the past decades, international institutions and agencies caps owing to the orbital dynamics being dominated by have become increasingly concerned with and contributed the main perturbations that, in LEO, act primarily on B dario.izzo@esa.int the argument of perigee and on the right ascension of the ¬ Data from https://sdup.esoc.esa.int/discosweb/statistics/ (ac- cessed on 3 June 2020). ascending node [10]. 2 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. To obtain a rst assessment of the risk posed to events, including the destruction of Fengyun-1C (2007), an active satellite operating, for example, in a Sun- the Iridium-33/Cosmos-2251 collision (2009), and the synchronous orbit, we computed the closest distance of a Briz-M explosion (2012), convinced most satellite op- Sun-synchronous satellite to the LEO population and its erators to include the possibility of collision avoidance distribution at random epochs and within a two-year win- maneuvers in the routine operation of their satellites [12]. In addition, the actual number of active satellites is dow. Figure 2 shows the results for Sentinel-3B. In most steadily increasing, and plans for mega-constellations of the epochs, the satellite was far from other objects, but such as Starlink, OneWeb, and Project Kuiper [13] in- in some rare scenarios, the closest distance approached dicate that the population of active satellites is likely values that were of concern. A Weibull distribution can to increase in the coming decades. Thus, satellite colli- be tted to the obtained data, where results from ex- sion avoidance systems are expected to be increasingly treme value statistics justify its use to make preliminary important, and their further improvement, in particular inferences on collision probabilities [11]. Such inferences their full automation, will be a priority in the coming are very sensitive to the Weibull distribution parameters decades [14]. and, in particular, to the behavior of its tail close to the origin. 1.1 Spacecraft collision avoidance challenge This type of inference, as well as a series of resounding To advance the research on the automation of preven- tive collision avoidance maneuvers, the European Space Agency (ESA) released a unique real-world dataset con- taining a time series of events representing the evolution of collision risks related to several actively monitored satellites. The dataset was made available to the public as part of a machine learning (ML) challenge called the Collision Avoidance Challenge, which was hosted on the Kelvins online platform . The challenge occurred over two months, with 96 teams participating, resulting in 862 submissions. It attracted a wide range of people, from students to ML practitioners and aerospace engi- neers, as well as academic institutions and companies. In Fig. 1 Visualization of the density of objects orbiting the this challenge, the participants were requested to predict low Earth orbit as of 2020-May-22 (data from www.space- the nal risk of collision at the time of closest approach track.org). (TCA) between a satellite and a space object using data cropped at two days to the TCA. In this paper, we analyze the competition's dataset and results, highlighting problems to be addressed by the scienti c community to advantageously introduce ML in collision avoidance systems in the future. The paper is structured as follows: In Section 2, we describe the collision avoidance pipeline currently in place at ESA, introducing important concepts used throughout the pa- per and crucial to the understanding of the dataset. In Section 3, we describe the dataset and the details of its acquisition. Subsequently, in Section 4, we outline the competition design process and discuss some of the LN decisions made and their consequences. The competi- Fig. 2 Distribution of the distance between the closest ob- tion results, analysis of the received submissions, and ject and Sentinel-3B, and a tted Weibull distribution ( t skewed to represent the tail with higher accuracy). ¬ Hosted at https://kelvins.esa.int/. Spacecraft collision avoidance challenge: Design and results of a machine learning competition 3 challenges encountered when building statistical models and their associated uncertainties (i.e., covariances). The of the collision avoidance decision-making process are data contained in the CDMs are then processed to ob- the subjects of Section 5. In Section 6, we evaluate the tain risk estimates by applying algorithms such as the generalization of ML models in this problem beyond their Alfriend{Akella algorithm [17]. training data. In the days after the rst CDM, regular CDM updates are received, and over time, the uncertainties of the object positions become smaller as the knowledge on the close 2 Collision avoidance at ESA encounter is re ned. Typically, a time series of CDMs A detailed description of the collision avoidance process over one week is released for each unique close approach, currently implemented at ESA is available in previous with approximately three CDMs becoming available per reports [15, 16]. In this section, we brie y outline several day. For a particular close approach, the last obtained fundamental concepts. CDM can be assumed to be the best knowledge available The Space Debris Oce of ESA supports operational on the potential collision and the state of the two objects collision avoidance activities. Its activities primarily in question. If the estimated collision risk for a particular encompass ESA's missions Aeolus, Cluster II, Cryosat-2, event is close to or above the reaction threshold (e.g., the constellation of Swarm-A/B/C, and the Copernicus 10 ), the Space Debris Oce will alarm control teams Sentinel eet composed of seven satellites, as well as and begin planning a potential avoidance maneuver a the missions of third-party customers. The altitudes of few days prior to the close approach, as well as meeting these missions plotted against the background density of ¬ the ight dynamics and mission operations teams. While orbiting objects, as computed by the ESA MASTER , the Space Debris Oce at ESA provides a risk value are shown in Fig. 3. associated with each CDM, to date, it has not attempted to propagate the risk value into the future. Therefore, a practical baseline that can be considered as the current best estimate would be to use the latest risk value as the nal prediction. We introduce this estimate as the latest risk prediction (LRP) baseline in Section 4.4. 3 Database of conjunction events The CDMs collected by the ESA Space Debris Oce in support of collision avoidance operations between 2015 and 2019 were assembled into a database of conjunc- Fig. 3 Operational altitudes for the missions in LEO sup- tion events. Two initial phases of data preparation were ported by ESA Space Debris Oce, and the spatial density performed. First, the database of collected CDMs was of objects with a cross-section of > 10 cm. queried to consider only events where the theoretical The main source of information of the collision avoid- maximum collision probability (i.e., the maximum colli- ance process at ESA is based on conjunction data mes- sion probability obtained by scaling the combined target- sages (CDMs). These are ascii les produced and dis- chaser covariance) was greater than 10 . Here, the tributed by the United States based Combined Space target refers to the ESA satellites, while the chaser refers Operations Center (CSpOC). Each conjunction contains to the space debris or object to be avoided. In addition, information on one close approach between a monitored events related to intra-constellation conjunctions (e.g., space object (the \target satellite") and a second space for the Cluster II mission) and anomalous entries, such object (the \chaser satellite"). The CDMs contain multi- as scenarios with null relative velocity between the target ple attributes of the approach, such as the identity of the and chaser, were removed. Finally, some events may satellite in question, the object type of the potential col- cover a period during which the spacecraft performs a lider, the TCA, the positions and velocities of the objects, maneuver. In these scenarios, the last estimation of the ¬ Available at https://sdup.esoc.esa.int/master/. collision risk cannot be predicted from the evolution of 4 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. the CDM data, as the propulsive maneuver is not de- 4 Competition design scribed. These scenarios were addressed by removing all The database of conjunction events constitutes an impor- CDM data before the maneuver epoch. tant historical record of risky conjunction events that oc- The second step in the data preparation was the curred in LEO and creates the opportunity to test the use anonymization of the data. This involved transform- of ML approaches in the collision avoidance process. The ing absolute time stamps and position/velocity values in decision on whether to perform an avoidance maneuver relative values, respectively, in terms of time to the TCA is based on the best knowledge one has of the associated and state with respect to the target. The names of the collision risk at the time when the maneuver cannot be target mission were also removed, and a numerical mis- further delayed, i.e., the risk reported in the latest CDM sion identi er was introduced to group similar missions. available. Such a decision would clearly bene t from a A random event identi er was assigned to each event. forecast of the collision risk, enabling past evolution and The full list of the attributes extracted from the CDMs projected trends to be considered. During the design and released in the dataset, as well as their explanations, of the Spacecraft Collision Avoidance Challenge, it was are available on the Kelvins competition website. natural to begin from a forecasting standpoint, seeking Here, we brie y describe only a few attributes relevant an answer to the question: can an ML model forecast the to later discussions: collision risk evolution from available CDMs? • time to tca : time interval between the CDM creation Such a forecast could assist the decision of whether and the TCA (day). or not to perform an avoidance maneuver by providing • c object type : type of the object at a collision risk a better estimate of the future collision risk before fur- with the satellite. ther CDMs are released. Forecasting competitions are • t span : size of the target satellite used by the colli- widely recognized as an e ective means of determining sion risk computation algorithm (m). good predictive models and solutions for a particular pro- • miss distance : relative position between chaser and blem [18]. The successful designing of such competitions target. requires a good balance to be determined between the • mission id : identi er of the mission from which the desire to create an interesting and fair ML challenge, CDMs are obtained. motivating and involving a large community of data sci- • risk : self-computed value at the epoch of each CDM, entists worldwide, and ful lls the objective of furthering using the attributes contained in the CDM, as de- the current understanding by answering a meaningful scribed in Section 2. scienti c question [19]. Designing a competition to forecast r from the database Table 1 provides an overview of the resulting database, of conjunction events presents a few challenges. First, indicating the number of entries (i.e., CDMs) and unique the distribution of the nal risk r associated with all the close-approach events. The risk computed from the last conjunction events contained in the database is highly available CDM is denoted as r. skewed (Fig. 4), revealing how most events eventually Table 1 Database of conjunction events at a glance result in a negligible risk. Intuitively, the e ect is due to Characteristics Number the uncertainties being reduced as the objects get closer, which in most scenarios results in close approaches where Events 15,321 High-risk events (r > 10 ) 30 a safe distance is maintained between the orbiting objects. High-risk events (r > 10 ) 131 Furthermore, events that already require an avoidance High-risk events (r > 10 ) 515 maneuver are removed from the data, thus reducing the CDMs 199,082 number of high-risk events. This is particularly trouble- Average CDMs per event 13 some as the interesting events, the ones that are to be Maximum CDMs per event 23 Minimum CDMs per event 1 forecasted accurately, are the few ones for which the nal risk is signi cant. Second, there is signi cant heterogene- Attributes 103 ity in the various time series associated with di erent events, both in terms of the number of available CDMs Spacecraft collision avoidance challenge: Design and results of a machine learning competition 5 database, revealing an abrupt increase in the risk value of 6. In particular, there were 30 events with r > 4, 131 events with r > 5, and 515 events with r > 6 (Table 1). 4.2 Test and training sets ML algorithms learn relationships between inputs and outputs by maximizing a particular objective function. The aim is to automatically learn patterns from the training data that generalize to unseen data, known as the test set. Hence, the training and test sets must be obtained from similar data distributions. In addition, the Fig. 4 Histogram of the latest known risk value (logarithmic data in the test set should re ect the type of data that scale) for the entire dataset (training and testing sets). Note we care about when deploying the ML model in the real that there are 9505 events with a nal risk value of log r = 30 or lower, which are not displayed in this gure. world. While releasing the raw database of conjunction events and the actual time to tca at which CDMs are available, to the public was a priority, and thus provide the commu- and most importantly, of the time to tca of the last avail- nity with an unbiased set of information to learn from, able CDM that de nes the variable r to be predicted. the various models produced during the competition were Therefore, the test and training sets and the competition tested primarily on predictions of events deemed partic- metric were designed to alleviate these problems. ularly meaningful. Consequently, while the training and test sets originated from a split of the original database, 4.1 De nition of high-risk events they were not randomly sampled from it. Events corre- Many mission operators in LEO use 10 as a risk thres- sponding to useful operational scenarios appeared in the hold to implement an avoidance maneuver. Over time, test set. this value has been applied by default. However, the In particular, for some events, the latest available CDM selection of a suitable reaction threshold for a particular was days away from the (known) time to the closest mission depends on many di erent parameters (e.g., size approach, which made its prediction (also if correct) not of the chaser, target satellite), and its selection can be a good proxy for the risk at the TCA. Furthermore, driven by considerations of the risk reduction that an potential avoidance maneuvers were planned at least two days prior to the closest approach; thus, events that operator seeks to achieve [20]. Therefore, ESA missions contain several CDMs at least two days prior to the TCA in LEO adopt reaction thresholds ranging between 10 were more interesting. Overall, three constraints were and 10 . Events are monitored and highlighted when imposed on the events to be eligible in the test set: the collision risk is larger than a noti cation threshold, which is typically set to one order of magnitude lower 1. The event had to contain at least two CDMs, one to than the reaction threshold. Note that in the remainder infer from and one to use as the target. of this paper, the log of the risk value is used frequently, 2. The last CDM released for the event had to be within such that log r > 6 de nes high-risk events. Thus, we a day (time to tca < 1) of the TCA. often omit writing log and simply refer to r > 6. For 3. The rst CDM released for the event had to be at the objectives of the competition, a single noti cation least two days before the TCA (time to tca > 2) and threshold was used for all missions, and its value was all the CDMs that were within two days from the set at 10 . The threshold value was selected to have a TCA (time to tca < 2) were removed. higher number of high-risk events while maintaining its Figure 5 depicts an example of an event that satis ed value close to the more frequently used operational value the requirements. Note that by permitting only events of 10 . Figure 4 shows the risk computed from the last that satis ed the aforementioned requirements in the available CDM for all the close approach events in the test set, the number of high-risk events was considerably 6 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. was a highly unbalanced problem, where the proportion of low-risk events was much higher than that of high-risk events. The nal metric used considered these requirements and summarized them into one overall value to rank com- petitors. Eventually, the Spacecraft Collision Avoidance Challenge metric included both the classi cation and regression parts. Denoting the nal risk as r and the corresponding prediction as r ^, the metric can be de ned as Fig. 5 Diagram depicting the raw CDMs time series for one L(r ^) = MSE (r; r ^) (1) HR event (top), and the same series if it was selected for the test set (bottom): only the CDMs prior to two days to TCA were where F is computed over the entire test set using two made available (labeled as x) and the latest CDM was used classes (high nal risk: r > 6, low nal risk: r < 6) as the target (labeled as y). and MSE (r;) is only computed for high-risk events. HR More formally, we obtain diminished. After enforcing the three requirements de- scribed above, only 216 high-risk events (out of 515) were MSE (r; r ^) = 1 (r r ^ ) (2) HR i i i eligible for the test set. Note that the remaining 299 high- i=1 risk events were maintained in the training set without being necessarily representative of the test events. where N is the total number of events, N = 1 i=1 Because of the unbalanced nature of the dataset and is the number of high-risk events, r and r ^ are the true i i the small number of high-risk events eligible for the test and predicted risks for the ith event, respectively, and set, we decided to place most of the eligible events into 1; if r > 6 1 = (3) the test set. Speci cally, 150 eligible high-risk events 0; otherwise were included in the test set and 66 in the training set. Finally, the F score is de ned as To alleviate the risk of directly probing the test set and thus over tting, we limited the number of submissions p q F = (1 + ) (4) per team to two per day during the rst month of the (  p) + q competition and to a single submission per day during where essentially controls the trade-o between pre- the second month. cision and recall, denoted as p and q, respectively. A higher value of means that a recall has more weight 4.3 Competition metric than precision; thus, more emphasis is placed on false negatives. To penalize false negatives more, we set = 2. In this section, we introduce the metric used to rank the participants and discuss its advantages and drawbacks. While the metric encourages participants to have a higher F score and a lower mean squared error, it in- Several criteria were used to design a metric that could troduces many layers of subjectivity. This is because the be fair and reward models of interest for operational ob- metric contains multiple sub-objectives that are com- jectives. The Spacecraft Collision Avoidance Challenge bined into one meta-objective. In the denominator, the had two main objectives: (i) the correct classi cation of F score is already an implicit multiobjective metric, events into high- and low-risk events; (ii) the prediction of where precision and recall are maximized to 1. Thus, the risk value for high-risk events. In other words, when- there is a trade-o between precision and recall, which ever an event belonged to the low-risk class, the exact risk is controlled by . In the numerator, the mean squared value was not important, and if an event belonged to the high-risk class, its exact value was of interest. Further- error penalizes erroneous predictions for high-risk events. more, because in the context of collision avoidance, false The squaring is justi ed by the desire to penalize large negatives were much more disastrous than false positives, errors. their occurrences were to be penalized more. Finally, this In the metric de ned in Eq. (1), F functions as a scal- 2 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 7 ing factor for MSE , where F assumes values in [0; 1] known observation. This is known to be optimal for HR 2 and MSE in R , which means that the metric is largely random walk data, and it operates well on economic and HR dominated by MSE in the numerator. Nonetheless, as nancial time series. Based on this fact, a second baseline HR reported in Section 5, even the highest-ranked models solution, called latest risk prediction (LRP) baseline is achieved a relatively small MSE ; thus, the F scaling de ned as the clipped naive forecast: HR 2 factor is appropriate. r ; if r > 6 2 2 i i In conclusion, several objectives were combined into r ^ = (6) 6:001; otherwise one metric, which introduced some level of complexity and subjectivity. An alternative to the metric used in and has a score of L = 0:694 when evaluated on the Eq. (1) would have a simple weighted average for each complete test set. Of the 97 teams, 12 managed to sub-metric (F and MSE ). This scoring scheme is submit better solutions. A few di erent teams obtained 2 HR routinely used in public benchmarks such as the popular and utilized this baseline (or equivalent variants) in their GLUE [21] score used in natural language processing, and submissions. The score achieved by the LRP is also it presents similar problems in the selection of weights reported in Table 3 and is plotted as a horizontal line in that function as scaling factors. Fig. 9, along with the proposed solutions from the top ten Note that according to Eq. (1), as soon as an event is teams. The LRP was of interest in this competition, as in predicted to be low-risk (r ^ < 6), the optimal prediction any forecasting competition, because it provides a simple to assign to the event is r ^ = 6 , where  > 0. Thus, and yet surprisingly e ective benchmark to improve upon. for a false negative, we minimize MSE , and for a true HR 4.5 Data split negative, the actual value does not matter, as long as r ^ < 6. Consequently, all risk predictions can be clipped This section discusses the splitting of the original dataset at a value slightly lower than 10 to improve the overall into training and test sets. First, principal component score (or at least produce an equivalent score). In the analysis (PCA) is applied to the data and it demonstrates remainder of this paper, we utilize this clipping, and the that the attributes depend on the mission identi er. In scores of the various teams are reported after the clipping other words, attributes recorded during di erent missions has been applied, using  = 0:001. are not obtained from the same distribution, making it dicult to generalize from one mission to another. Next, 4.4 Baselines we study the e ect of di erent splits of the test data on To have a sense of the e ectiveness of a proposed solution, the leaderboard scores (evaluated on a portion of the test baseline solutions should be introduced. For the Space- set) and in the nal ranking (evaluated on the full test craft Collision Avoidance Challenge, two simple concepts set), using the LRP baseline solution. can be used to build such baselines. Let us denote r ^ and In Fig. 6, the PCA projection of the original data r as the predicted risk and the latest known risk for is shown by maintaining only the rst two principal the ith event, respectively (the subscript 2 reminds us components. While the rst two principal components that the latest known risk for a close-approach event is only account for 20% of the total variance, the projected associated with a CDM released at least two days before data can still be distinguished and crudely clustered by the TCA, as shown in Fig. 5). The rst baseline solution, mission id, in particular mission id : 7 and mission id : 2. called the constant risk prediction (CRP) baseline, is This unsurprisingly implies that the attributes from the then de ned as CDMs do not come from the same distribution, making it potentially dicult to generalize from one mission r ^ = 5 (5) to another. Thus, each mission id refers to a di erent and has an overall score of L = 2:5. It constantly predicts satellite orbiting at di erent altitudes in regions of space the same value for the risk, and it was highlighted during with varying space debris density (Fig. 3). Therefore, imbalances in mission type should not be the competition as a persistent entry in the leaderboard. Of the 97 teams, 38 managed to produce a better model. created when splitting the data into training and test One of the simplest approaches in time series prediction sets. Figure 7 shows that, for low-risk events, the missions is the naive forecast [22], i.e., forecasting with the last are proportionally represented in both the training and 8 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. whereas mission id : 1, mission id : 6, and mission id : Mission id Mission id 20 are under-represented. This is because the dataset Mission id was randomly split into training and testing, considering Mission id Mission id only the risk value and not the mission type. For future research, we recommended that mission type should be considered during the splitting of the dataset, or datasets with a higher homogeneity should be created with respect to the mission type. Note that further analysis of the dataset split and the correlation between the training and test sets are presented in Section 6. Fig. 6 Projection of the original CDMs, from the test set, 5 Competition results onto the rst two principal components, colored according to After the competition ended and the nal rankings were mission id. made public, a survey was distributed to all the partici- pating teams in the form of a questionnaire. The results of the survey, the nal rankings, the methods of a few of the best-ranking teams, and a brief meta-analysis of all the solutions are reported in this section. 5.1 Survey A total of 22 teams participated in the survey, including all top-ranked teams. Some questions from the survey were targeted to gather more information on the back- ground of the participants. The questions were phrased Mission id as follows: \How would you describe your knowledge in Fig. 7 Distribution of the mission type for the test and space debris/astrodynamics?" A similar approach was training sets for low-risk events. conducted for ML and data science. The possible an- swers were limited to \professional", \studying", and \amateur". The answers are reported in Table 2, which shows that most participants had a background in ML and less in orbital mechanics. Note that the top three teams all identi ed themselves as ML professionals and two as studying orbital mechanics. Table 2 Background of the participants, out of 22 respon- dents to the end of the competition questionnaire Pro ciency Discipline Professional Student Amateur Mission id Machine learning 10 10 4 Orbital mechanics 4 5 15 Fig. 8 Distribution of the mission type for the test and training sets for high-risk events. As mentioned in Section 4.2 and reported in Table 1, test sets. However, when we examine only high-risk the dataset for the collision avoidance challenge is highly events (Fig. 8), we observe that the missions are not well unbalanced, with the training and test sets not randomly distributed. In particular, mission id : 2, mission id : 15, sampled from the dataset. A question from the survey and mission id : 19 are over-represented in the test set, probed whether the participants explicitly attempted to Spacecraft collision avoidance challenge: Design and results of a machine learning competition 9 address class imbalance (e.g., by arti cially balancing beat the LRP baseline, most of the teams required ap- proximately 20 days to do so, implying that the LRP the classes, assigning importance weighing to samples) baseline was fairly strong. This was further supported by asking, \Did you apply any approach to compensate by the fact that the scores did not improve much below for the imbalances in the dataset?" A total of 65% of the LRP baseline, suggesting that the naive forecast is the participants answered positively. Furthermore, half an important predictor of the risk value at the closest of the participants reported attempting to build a vali- approach. dation set with similar properties and risk distribution The nal results, broken down into MSE , with the HR as the test set, albeit failing since most surveyed teams risk clipped at 6.001 and the F components, are shown lamented a poor correlation between training and test in Table 3 for the top ten teams. All teams managed set performances. to improve upon the LRP baseline score by obtaining a One of the main scienti c questions that this challenge better MSE . However, many teams failed to obtain a HR aimed at addressing was whether the temporal infor- better F value than the LRP baseline. mation contained in the time series of CDMs was used To further investigate the di erences between the F to infer the future risk of collision between the target score achieved by the teams and the LRP baseline so- and chaser. A speci c question from the survey asked lution, it is useful to examine the false positives and participants if they found the evolution of the attributes false negatives of each returned model (Fig. 10(b)). The over time useful to predict the nal risk value. Surpri- Pareto front is very heterogeneous and consists of several singly, 65% of the teams framed the learning problem as teams: DunderMiin, Valis, Magpies, DeCRA, diet- a static one, summarizing the information contained in the time series as an aggregation of attributes (e.g., using summary statistics, or simply the latest available CDM). This may have been a direct consequence of the great predictive strength of a naive forecast for this dataset, as outlined in the approaches implemented by the top teams in Section 5.3. Finally, because of the small number of high-risk events in the test set and the emphasis placed on false negatives induced by the F score, it is natural to ask whether teams probed the test set through a trial-and-error process. Overall, 30% of the participants (including the top-ranked team sesc, see Section 5.3) reported utilizing a trial-and- Fig. 9 Evolution of the scores of the submission of various error method to identify high-risk events, suggesting that top teams. the di erence between the test and training sets posed a Table 3 Final rankings (from best to worst) evaluated on signi cant problem for many teams, a fact that deserves the test set, for the top ten teams. The best results are shown some further insight, which we provide in Section 6. in bold Team Score MSE F HR 2 5.2 Final rankings sesc 0.556 0.407 0.733 96 teams participated to the challenge and produced a dietmarw 0.571 0.437 0.765 total of 862 di erent submissions during the competi- Magpies 0.585 0.441 0.753 Vidente 0.610 0.436 0.714 tion timeframe. The scores on the leaderboard changed DeCRA 0.615 0.457 0.743 frequently, and the nal ranking remained uncertain un- Valis 0.628 0.467 0.744 til the end of the competition. The evolution of the DunderMiin 0.628 0.451 0.718 madks 0.634 0.476 0.750 scores for the top ten teams throughout the competition vhrique 0.649 0.496 0.764 is shown in Fig. 9. Note how the top four teams closely Spacemeister 0.649 0.479 0.738 competed for rst place until the very last days. Another LRP baseline 0.694 0.513 0.739 observation is that while all the top teams managed to 10 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. the team attempted to use di erent methods, including extracting time series features [23], constructing an au- tomated ML pipeline via Genetic Programming [24, 25], and using random forests. All these approaches were ' reported to have a score of L 2 [0:83; 1:0] on the test set, but they performed radically better on the training set. Such a di erence was considered an indication that an automated, o -the-shelf ML pipeline was unlikely to be the appropriate way of learning from this dataset. Instead, the team resorted to a step-by-step approach informed by statistical analysis, utilizing the metric and (a) the constitution of the test set. Thus, the F score is biased toward false negatives, and there is a relatively higher proportion of high-risk events in the test set than in the training set. Furthermore, we can observe that, in the training set, most of the high-risk events misclassi ed by the naive forecast have the latest risk r only slightly below the threshold. A simple strategy is to promote borderline low-risk events to high-risk ones, thus improv- ing the recall (at the cost of penalizing precision), which is what the F score puts emphasis on. In practice, this strategy was implemented by introducing three thres- holds, referred to as step 0, step 1, and step 2, as shown (b) in Table 4 and Eq. (7). Fig. 10 On the top, in (a), the F score and the MSE 2 HR Additional incremental improvements were achieved are plotted for the top ten teams. On the bottom, in (b), the by assigning events to low risk whenever either the chaser F scores are broken into two components: false negatives type (c object type attribute) was identi ed as a payload (out of 150 positive events) and false positives (out of 2017 negative events). and the diameter of the satellite (t span attribute) was small (below 0.5) or the miss distance was greater than marw, vhrique, madks, and the baseline solution, de- 30,000 m. These steps are referred to as step 3, step 4, noted as Baseline. Although the baseline solution is in and step 5, respectively, in Table 4 and Eq. (7): the Pareto front, we can observe that the resulting F Finally, the risk value for high-risk events was clipped score in Fig. 10(a) is dominated by several teams. This is to a slightly lower risk value to enforce the general trend because the F score places more emphasis on penalizing of risk decrease over time, thus improving the MSE HR false negatives, which the baseline solution has the most while preserving the F score. This step is referred to as of. In Fig. 10(a), only two teams remain in the Pareto step 6 in Table 4, and Eq. (7). front: sesc and dietmarw. Interestingly, dietmarw has In summary, the aforementioned observations resulted the highest F score, and sesc has the lowest MSE , 2 HR in the introduction of a cascade of thresholds: suggesting that their methods can be combined to achieve a better overall score. 5:95; if 6:04 6 r < 6:00 (step 0) 5:60; if 6:40 6 r < 6:04 (step 1) 5.3 Methods used by top teams 5:00; if 7:30 6 r < 6:40 (step 2) > 2 6:00001; if c object type is \payload" (step 3) 5.3.1 Team sesc r ^ = 6:00001; if t span < 0:5 (step 4) The highest-ranking team was composed of scientists > >6:00001; if miss distance > 30000 (step 5) from diverse domains of expertise: evolutionary opti- >4:00; if 4:00 6 r < 3:50 (step 6) mization, ML, computer vision, data science, and energy 3:50; if r > 3:50 (step 6) management. In the early stages of the competition, (7) Spacecraft collision avoidance challenge: Design and results of a machine learning competition 11 Table 4 Evaluation of team sesc's approach, as additional steps were added Training set Test set Combinations of steps MSE F Loss MSE F Loss Leaderboard HR 2 HR 2 LRP baseline 0.330 0.411 0.804 0.513 0.739 0.694 0.718 Steps: 0 0.330 0.430 0.768 0.512 0.753 0.680 0.703 Steps: 0 + 1 0.305 0.392 0.779 0.498 0.764 0.653 0.670 Steps: 0 + 1 + 2 0.290 0.296 0.982 0.445 0.738 0.603 0.612 Steps: 0 + 1 + 2 + 3 0.290 0.301 0.966 0.426 0.735 0.579 0.587 Steps: 0 + 1 + 2 + 4 0.290 0.298 0.974 0.447 0.735 0.608 0.611 Steps: 0 + 1 + 2 + 5 0.325 0.304 1.070 0.444 0.733 0.607 0.613 Steps: 0 + 1 + 2 + 6 0.293 0.296 0.990 0.424 0.738 0.575 0.581 Steps: 0 + 1 + 2 + 3 + 4 + 5 + 6 0.327 0.311 1.050 0.414 0.728 0.569 0.564 Steps: 0 + 1 + 2 + 5 + 6 0.327 0.304 1.077 0.424 0.733 0.578 0.581 Steps: 0 + 1 + 2 + 5 + 6 + 7 0.327 0.304 1.077 0.407 0.733 0.555 0.555 5.3.2 Team Magpies the mean (mean risk CDMs ) and standard deviation (std risk CDMs ) of the risk values of the CDMs. The third-ranked team was composed of a space situa- Hyperbolic tangents were used as activation functions, tional awareness (SSA) researcher and Learning (ML) and Adam was used as the gradient descent optimizer [27]. engineer. The team achieved its nal score by lever- aging Manhattan-LSTMs [26] and a siamese architec- The training data were split using a three-fold cross- ture based on recurrent neural networks. Team Mag- validation (eight events were selected in each valida- pies began by analyzing the dataset and ltering the tion fold, from the 23 anomalous events). Subsequently, training data according to the test set requirements fnon-anomalous, non-anomalousg, and fnon-anomalous, described in Section 4.2. Subsequently, they selected anomalousg pairs were generated for the siamese network seven out of 103 features (time to tca, max risk estimate, to learn similar and dissimilar pairs, respectively. max risk scaling, mahalanobis distance, miss distance, For each validation fold, several networks were trained c position covariance det, and c obs used ) by comparing using di erent hyperparameters. Networks that attained the distribution di erence of the non-anomalous event a reasonably high performance were then used in a ma- (last available collision risk is low and ends up low at jority voting ensemble scheme with equal weights. The close approach, and vice-versa for high-risk events) and majority vote outcome was denoted as f , used ten fea- anomalous events (last available collision risk is low and tures as inputs, denoted as x, and predicted whether a ends up high at the close approach, and vice-versa for low-risk event was anomalous or not. Then, the nal high-risk events). Figure 11 shows the number of anoma- predictions of the test set were expressed as lous and non-anomalous scenarios. In addition to these >6:001; if r < 6 and f (x) = non-anomalous seven attributes, three new features were included: the r ^ = 5:35; if r < 6 and f (x) = anomalous number of CDMs (number CDMs ) issued before two days, r ; if r > 6; 2 2 (8) where 5.35 is the average risk value of all high-risk events in the training set. S S 5.4 Diculty of samples In this section, we investigate the events in the test set that were consistently misclassi ed by all the top ten teams. These events can be separated into two groups: Fig. 11 Number of anomalous and non-anomalous events false positives and false negatives. The false negatives from the training set. correspond to events that were incorrectly classi ed as low 12 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. risk and false positives correspond to events incorrectly have in common is high uncertainties in their associated classi ed as high risk. Figure 12 shows the evolution of measurements (e.g., position and velocity), resulting in the risk of events that were consistently misclassi ed. The very uncertain risk estimates, susceptible to large jumps gure shows that these events all experience a signi cant close to the TCA. Figure 13 shows the evolution of the change in their risk value, as they progressed to the uncertainty in the radial velocity of the target spacecraft closest approach, thus rendering the use of the latest risk (t sigma rdot ) for the 150 high-risk events in the test set. value misleading. Furthermore, as shown in Fig. 12, the The uncertainty values were generally higher for mis- temporal information is likely to be of little use to make classi ed events. Note that many more uncertainty at- good inferences in these scenarios: there is no visible tributes were recorded in the dataset, and Fig. 13 shows trend and the risk value jumped from one extreme to the only one of them. Higher uncertainties for the misclassi- other (from very low to very high risk in (a) and vice- ed events suggests that there may be value in building versa in (b)). One characteristic that all these events a model which takes these uncertainties into account at inference time, for instance, by outputting a risk distri- bution instead of a point prediction. (a) Fig. 13 Evolution of the uncertainty in the radial velocity of the target spacecraft (t sigma rdot ) over time, up until two days to the TCA. The evolution of the uncertainty of the 11 false negative events (Fig. 12) is indicated in red. The evolution of the uncertainty of the 139 remaining true positive events is indicated in black. 6 Post competition ML Further ML experiments were conducted on the dataset, both to analyze the competition and to further investi- gate the use of predictive models for collision avoidance. (b) The aim of these experiments was to understand the diculties experienced by competitors in this challenge Fig. 12 Events consistently misclassi ed by all the top ten teams: (a) false negatives, (b) false positives. In the top and to gain deeper insights into the ability of ML models panel, we show all false negatives (11 out of 150 high-risk to learn generalizable knowledge from these data. events). Each event is represented as a line, and the CDMs are marked with crosses. The evolution of the risk between 6.1 Training/test set loss correlation two CDMs is plotted as a linear interpolation. In the bottom panel, we show ten randomly sampled events out of 62 false The rst experiment was designed to analyze the cor- positives in total. These events were particularly dicult to classify because of the big leap in risk closer to the TCA, relation between the performance of ML models on the ranging from low risk to high risk in (a) and vice-versa in (b). training and test sets used during the competition. Only Spacecraft collision avoidance challenge: Design and results of a machine learning competition 13 the training set events conforming to the test set speci - classi cation metric, cared only for where the risk values cations (Section 4.2) were considered: nal CDM within lay with respect to the 6 threshold. a day of the TCA, and all other CDMs at least two For the type of regression problem outlined above, with days away from the TCA. The last CDM that is at least tabular data representation, gradient boosting meth- two days away from the TCA, the most recent CDM ods [28] o er state-of-the-art performance. Thus, we available to operators when making the nal planning selected the LightGBM gradient boosting framework [29] decisions, was used here as the sole input to the model. to train many models. To attain both training speed and model diversity, we changed the hyperparameters In other words, temporal information from the CDM as follows (with respect to the default in the LGBMRe- time series was not considered. From that CDM, only the numerical features were used; the two categorical fea- gressor of LightGBM 2.2.3): the n estimators was set to tures (mission id and object type ) were discarded. Thus, 25, feature fraction to 0.25, and learning rate to 0.05. for models to learn mission or object-speci c rules, they Together, these settings resulted in an ensemble with would have to utilize features encoding relevant prop- fewer decision trees (the default is 100), and each tree erties of that feature or object. It was hoped that this was trained exclusively on a small random subset of 25% would force the model to learn more generalizable rules. of the available features (the default is 100%), and each In addition to this step, no other transformations were successive tree had a reduced capability to overrule what applied to the CDM raw values (such as scaling or embed- previous trees had learned (the default learning rate, also known as shrinkage in the literature, is 0.1). ding). Similarly, no steps were implemented to impute the missing values that occurred at times in many of Figure 14 shows the evaluations of 5000 models, on the the CDM variables. We left these to be addressed by training and test sets, on the MSE and 1=F metrics, HR 2 the ML algorithm (LightGBM in our case) through its as well as their product (the competition's score, or loss own internal mechanisms. Most importantly, the model's metric). The risk values were clipped at 6.001 prior to target was de ned as the change in risk value between measuring the MSE . We compared the performance of HR the models on the training (x-axis) and test sets (y-axis). the input CDM and the event's nal CDM (r r ), As a reference, the dotted lines show the performance rather than the nal risk r itself. This facilitated the of the LRP baseline (Section 4.4). The Spearman rank- learning task as it implicitly reduced the bias toward the most represented nal risk (i.e., 30). Furthermore, it order correlation coecient was computed as an indicator enabled a direct comparison to the LRP baseline, as the of the extent to which performance in the training set various models were de facto tasked to predict a new generalizes to the test set. Only one model (0.02% of the trained models) outper- estimator (h) such that r = LRP + h. The quantity h formed the LRP baseline loss in both the training and was further encoded through a quantile transformer to test sets. With a test set loss of 0.684 (1.4% gain over assume a uniform distribution. Eventually, the training data consisted of a table of LRP), that model would have ranked 11 th in the ocial 8293 CDMs from as many events, each described by 100 competition. Overall, Fig. 14 shows several undesirable trends. The features. Each CDM was assigned a single numerical MSE plot exhibited a positive correlation: the training value that was to be predicted. Overall, these steps HR set performance was predictive of the performance on the resulted in a simpli ed data pipeline. Note the absence of any steps to address the existing class imbalance and test set. However, the models struggled to improve the the absence of any focus on high-risk events during the LRP in both sets. Most models degraded the risk estimate training process. Models were requested to learn the available in the most recent CDM. In 1=F , we observed evolution of risk across the full range of risk values, a strong negative correlation: the better the performance although they were mostly assessed on their performance on the training set, the worse it was on the test set. at one end of the range of risk values during evaluation. This was a clear sign of over tting. When aggregated, The competition's MSE metric was obtained only over we remained with a loss metric displaying essentially HR the true high-risk events, and the use of clipping at a no correlation. This observation, while bound to the risk of 6.001 further ignored where the nal predicted modeling choices made, o ers a possible explanation for risk lay if it fell below this value. In addition, F , a competitors' sentiment of disappointment over models 2 14 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. )3 )3 .4& .4& Fig. 14 Performance levels achieved by 5000 gradient boosting regression models trained using the competition's dataset split. that were good in their local training setups evaluating learning of their properties unlikely. Similarly, the object poorly on the leaderboard. type attribute had an identical imbalance, which led to similar challenges. Although this partitioning process 6.2 Simulating 10,000 virtual competitions made achieving a higher score on the performance metrics more dicult, it served the current aim of evaluating the To further our understanding of the absence of a signi- generalization capability. cant Spearman rank correlation between training and In each of the 10,000 virtual competitions, 100 regres- test set performances, as highlighted in Fig. 14, we simu- sion models were trained using the same data pipeline and lated 10,000 possible competitions that di ered in the model settings as described in the previous section. On data split. In each, a test size was randomly selected average, 526 competitions were simulated for each of the from a set of 19 options, containing the values from 19 di erent test size settings, each with its own distinct 0.05 through 0.95 in steps of 0.05. This setting indicates data split. In total, 1 million models were trained. Al- the fraction of events that should be randomly selected though framed here as virtual recreations of the Kelvins to be moved to the test set. The full dataset being par- competition, this process implemented, per test size set- titioned was composed solely of the 10,460 events that ting, a Monte Carlo cross-validation or repeated random conformed to the ocial competition's test set speci- sub-sampling validation [30]. If the number of random cations. We adopted a di erent splitting procedure data splits approached in nity, the results would tend from that reported in Section 4.5. A strati ed shue toward those of leave-p-out cross-validation. splitter was used, so the proportions of nal high-risk The experiment's results are shown in Figs. 15{19. events in both the training and test sets would always Figure 15 shows statistics on the Spearman rank-order match the proportion observed in the dataset being par- correlation coecients between model evaluations in the titioned (2.07%) as closely as possible. For reference, a training and test sets per evaluation metric. A positive test size of 0.2 results in 172.8 high-risk events on average Spearman correlation signals the ability to use the met- in the training set, and 43.2 in the test set (and 8195.2 ric for model selection. The better a model is on the and 2048.8 low-risk events, respectively). In the train- training set, the better we expect it to be on the unseen ing and test sets, no allowances were made to preserve events of the test set. A negative correlation is a sign of the event distributions of mission ids and chaser object over tting or inability to generalize beyond the training types present in the full dataset being partitioned. As set data. Figure 16 complements the analysis in Fig. 15 shown in Figs. 7 and 8, the fraction of events from the by showing the statistics on the percentage of the models di erent missions had such an imbalance that many of these generated splits likely either resulted in some mis- per simulated competition that outperformed the LRP sions being entirely unrepresented in either the training baseline in their respective training and test sets. The or test set or having such low volumes as to render the curves show the mean performance as a function of the Spacecraft collision avoidance challenge: Design and results of a machine learning competition 15 Fig. 15 Extent to which di erent data splits a ected the ability to infer test set performance from the training set performance. Expected Spearman rank-order correlation coecients between training and test set evaluations, as data sets vary in the fraction of events assigned to both (shown in the x-axis). Correlations measured in the MSE and 1=F metrics, as well as HR 2 their product. Fig. 16 Expected percentage of ML models that would outperform the LRP baseline in both the training and test set, as data sets varied in the fraction of events assigned to both (shown in the x-axis). Performance measured in the MSE and HR 1=F metrics, as well as their product. test size, and the shaded areas represent the region within by themselves displayed a low correlation between the one standard deviation. Figure 17 also shows the corre- training and test set and aggregated them into a single lations between the training and test set evaluations, but value, which was even less correlated. Furthermore, as now matches MSE correlations to 1=F correlations. shown in Fig. 17, the highest loss correlations tended HR 2 Thus, an overview of the e ect of the same data split to occur when the MSE was highly correlated. The HR on models' capabilities to learn generalizable knowledge MSE was of the three metrics the one that tended to HR display a higher rank correlation. However, as shown in simultaneously with respect to regression and classi ca- Fig. 16, few models outperformed the MSE obtained tion objectives can be obtained. The red star places the HR Kelvins competition's unstrati ed (with respect to high- by the LRP baseline on both sets. This indicated an risk events) data split in the context of 10,000 strati ed identical scenario to that shown in Fig. 14, in which we splits, indicating how much of an outlier it turned out obtained a high positive correlation, but the models were to be. not particularly successful. Predicting the actual nal risk The rst conclusion to be drawn from these gures is value was dicult; therefore, the further our predictions that the aggregated loss metric, MSE =F , was deci- moved away from the most recent risk estimate in high- HR 2 dedly uninformative in terms of identifying models that risk events (the only events scored by this metric), the were likely to generalize. It required two metrics that worse we were likely to perform, both in the training 16 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. By normalizing models' 1=F evaluations with respect to the LRP baseline 1=F values, performance became comparable across di erent data splits. Figure 18 shows the mean and standard deviation of models' percentage gains in performance over the LRP baseline in the training and test sets. Statistics were calculated across all models trained over all the data splits that used the same test size setting. Figure 19 shows models' 1=F evaluations in the training and test sets, normalized against their respective LRP 1=F baseline, for selected test size settings. Over 50,000 ML models are shown in each subplot, trained over 500 data splits on average with that test size setting. At one end, with a test size of 0.95, training sets had merely 523.0 events to learn from (10.8 of which are of Fig. 17 Extent to which di erent data splits a ected the high risk). With insucient data to learn from, the mod- ability to infer test set performance from the training set els quickly over t and failed to learn generalizable rules. performance, as observed through simultaneous evaluations This was indicated by a mean gain in performance of of regression and classi cation metrics. Spearman rank-order 12.61% on the training set, but a 7.73% mean loss in correlation coecients between training and test set evalua- tions of the three performance metrics in 10,000 di erent data performance on the test set, both with respect to the splits using di erent test size fractions. The red star corre- LRP baseline. As we increased the amount of data avail- sponds to the data split of the ocial competition (Fig. 14). able for training models, the training set performance decreased (more data patterns to learn from and harder to incorporate individual event idiosyncrasies into the model), but test performance increased. At the other end, with a test size of 0.05, most data were available for training, but the small test set was no longer rep- resentative (523.0 events to evaluate models on, 10.8 of high risk). Depending on the \predictability" of events that ended up on the test set, we either obtained a very high or very low performance: a mean gain of 0.04% on the test set, with a standard deviation of 9.83%. Here, the optimal trade-o lay in a test size of 0.2, where a 6.45% gain in the training set performance over the LRP baseline translated into a 1.12% gain in the test set. It Fig. 18 Expected 1=F performance gain (%) over the LRP is common for data scientists to use 80/20 splits of the baseline as data splits varied in the fraction of events assigned dataset, as a rule of thumb inspired by the Pareto princi- to either the training or test set. ple. Note we experimentally converged as this being the and test sets. Nonetheless, as shown in the 1=F plots in ideal setting. To establish an ML performance baseline, we now turn Figs. 15 and 16, even if the predicted nal risk values were not accurate, those perturbations moved the values across directly to the F score rather than its inverse (see the the 6 risk threshold to result in improved capability discussion in Section 4.3). F , which ranges in [0; 1], is to forecast the nal risk class. With a test size of 0.2, the harmonic mean of precision and recall, where recall 67.55% of the trained models outperformed the LRP (ability to identify all high-risk events) is valued two times baseline on both their training and test sets. This was in higher than precision (ability to ensure all events pre- stark contrast to Fig. 14, where only 0.04% of the models dicted as being of high-risk indeed are). A Monte Carlo (two out of 5000) outperformed the LRP baseline for the cross-validation with a test size of 0.2 and 505 strati- 1=F metric. ed random data splits evaluated the LRP baseline (the 2 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 17 Fig. 19 Variation in the training and test set performance of the ML models (1=F ) as a function of data availability. Performance normalized with respect to LRP baseline performance in the same datasets. direct use of an event's latest CDM's risk value as pre- 6.3 Feature relevance diction) to a mean F score of 0.59967 over the test set The experiment described in the previous section provides (standard deviation: 0.04391). Over the same data splits, a basis on which to quantify feature relevance that is a LightGBM regressor, acting over the same CDM raw independent from the speci cs of any particular data values (see data and model con gurations in Section 6.1), split or the decision-making of any individual model. We evaluated to a mean F score of 0.60732 over the test provide that information here, to illustrate what signal set (standard deviation: 0.04895; statistics over 50500 ML models use to arrive at their predictions, and to direct trained models). Therefore, a gain of 1.2743% over the future research towards the more important features to LRP baseline . The di erence in performance between train models on. both approaches is statistically signi cant: a paired Stu- Of the 1 million models trained in the previous sec- dent's t-test rejects the null hypothesis of equal averages tion's experiment, 47.75%, from across di erent test size (t-statistic: 10.23, two-sided p-value: 1:83 10 ; per settings, surpassed the 1=F LRP baseline on both their data split, the LRP F score was paired with the mean training and test sets. We selected all those models and LightGBM model F score to ensure independence across used LightGBM to quantify their \gain" feature rele- pairs). vance. This process did not measure how frequently a This is the strongest evidence yet that ML models can feature was used across a model's decision trees, but indeed learn generalizable knowledge in this problem. rather the gains in loss it enabled when that feature was used (loss here refers to the objective function op- In a domain that safeguards assets valued in millions, timized by the algorithm while building the model, not a 1% gain in risk classi cation can already be transfor- to the competition's MSE =F scoring function). For mative. Furthermore, this would be a 1% gain on top HR 2 each model, relevance values were normalized over the of approaches for risk determination that have been de- features' total gains and converted to percentages. Sub- veloped for decades. Note that these results, obtained sequently, the values were aggregated through weighted using a classi cation metric, were achieved through a statistics across the selected models, resulting in the rele- regression modeling approach. Furthermore, it had an vance assessments shown in Table 5 (only the top twenty intentionally limited modeling capability, was trained features are shown, out of the 100 used). The models' over a basic data preparation process, and was evaluated fractional gains in performance over the test sets' LRP under adverse conditions (owing to imbalances in the 1=F baseline were used as weights. mission id and type of chaser object). Thus, we expect it The LRP is a strong predictor, as previously dis- will be possible to signi cantly surpass these performance cussed. However, relevance measurements indicated that levels with more extensive research in data preparation in the ML models, features directly related to risk (risk, and modeling. max risk scaling, max risk estimate ) together accounted ¬ For comparison, cross-validation of team sesc's method (Sec- for only half (54.44%) of the models' gains in loss. Models tion 5.3.1), over the same 505 data splits with test size of 0.2, widely used the information available to them, with the evaluated to a mean F score of 0.51563 over the test set. This was a performance loss of 14% with respect to the LRP baseline. top twenty features in Table 5 accounting for 78.32% of 18 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. Table 5 Feature relevance estimates. In the prediction of increasing relevance by 0.07%. near-term changes in risk, percentage of the reduction in error Note that models under consideration used CDM raw attributable to the feature. A description of the features is available on the Kelvins website values as inputs. After some feature engineering, the attributes presented in Table 5 may follow a di erent Feature Rank Mean Std. dev. ranking. A result of their information content with re- risk 1 29.275 9.557 spect to the prediction target becoming clearer to identify max risk scaling 2 22.544 8.979 and use by the ML algorithms. Note also that correlated mahalanobis distance 3 3.261 1.675 c sigma t 4 3.000 1.715 features may have split relevance values between them, max risk estimate 5 2.624 1.367 causing them to appear lower in this ranking. c sigma rdot 6 2.191 1.369 miss distance 7 2.089 1.112 c position covariance det 8 1.778 1.066 7 Conclusions c sigma n 9 1.312 0.625 time to tca 10 1.236 0.517 The Spacecraft Collision Avoidance Challenge enabled, c sigma r 11 1.177 0.739 c obs used 12 1.164 0.554 for the rst time, the study of the use of ML methods c sigma ndot 13 0.964 0.437 in the domain of spacecraft collision avoidance owing to relative position n 14 0.954 0.754 the public release of a unique dataset collected by the c recommended od span 15 0.945 0.423 relative position r 16 0.835 0.440 ESA Space Debris Oce over more than four years of c sedr 17 0.779 0.486 operations. Several challenges, mostly derived from the SSN 18 0.773 0.372 unavoidable unbalanced nature of the dataset, had to c crdot t 19 0.718 0.468 be accounted for to release the dataset in the form of relative speed 20 0.699 0.400 a competition and the use of automated, o -the-shelf the gains in loss, and only two of the 100 features having ML pipelines were limited. Nevertheless, the competition a relevance of 0.0. results and further experiments presented here clearly A set of 40 features had values for both the \target" demonstrated two things. On one hand, naive forecasting (the ESA satellite { pre x t ), and \chaser" that should models have surprisingly good performances and thus are be avoided (space debris/object { pre x c ), for a total established as an unavoidable benchmark for any future of 80 of the 100 features. Note the absence of \target" work in this subject; on the other hand, ML models can features in Table 5. The relevance of \target" features improve upon such a benchmark, hinting at the possibility summed to a total of 9.41%, while \chaser" features of using ML to improve the decision-making process in summed to 23.49%. If the models were to rely too much collision avoidance systems. on the properties of the \target", they would be learning mission-speci c rules. Instead, we observed a greater Acknowledgements reliance on properties of the \chaser", and in features with relative values, thus enabling better generalization The ESA would like to thank the Unite States Space across missions. Surveillance Network for the agreement that enabled the The mean relevance estimates were very stable. The public release of the dataset for the objectives of the unweighted aggregation of normalized relevance values in competition. the remaining 52.25% of trained models not included in The authors would like to thank all the scientists that the selection above had a total of 10.51% absolute di e- participated in the Spacecraft Collision Avoidance Chal- rence across features. The higher-performing models from lenge and that dedicated their time and knowledge to an which the statistics in Table 5 were drawn exhibited by important element of ESA's operated satellites. comparison a greater reliance on risk and max risk scaling In particular, we would like to acknowledge all members (+4.62%). The SSN, the Wolf sunspot number, at a rank of 18, was one of the most relevant features. It was also of team sesc, whose methodology is brie y described in one of the features with a greater increase with respect this paper: Ste en Limmer, Sebastian Schmitt, Viktor to the alternate ranking, climbing three positions, and Losing, Sven Rebhan, and Nils Einecke. Spacecraft collision avoidance challenge: Design and results of a machine learning competition 19 References [15] Merz, K., Bastida Virgili, B., Braun, V., Flohrer, T., Funke, Q., Krag, H., Lemmens, S., Siminski, J. Current [1] Liou, J. C., Johnson, N. L. Instability of the present collision avoidance service by ESA's Space Debris Oce. LEO satellite populations. Advances in Space Research, In: Proceedings of the 7th European Conference on 2008, 41(7): 1046{1053. Space Debris, 2017. [2] Krag, H. Consideration of space debris mitigation re- [16] Braun, V., Flohrer, T., Krag, H., Merz, K., Lemmens, quirements in the operation of LEO missions. In: Pro- S., Bastida Virgili, B., Funke, Q. Operational support ceedings of the SpaceOps 2012 Conference, 2012. to collision avoidance activities by ESA's space debris [3] Klinkrad, H. Space Debris. Springer-Verlag Berlin Hei- oce. CEAS Space Journal, 2016, 8(3): 177{189. delberg, 2006. [17] Alfriend, K. T., Akella, M. R., Frisbee, J., Foster, J. L., Lee, D. J., Wilkins, M. Probability of collision error [4] Anselmo, L., Pardini, C. Analysis of the consequences analysis. Space Debris, 1999, 1(1): 21{35. in low Earth orbit of the collision between Cosmos 2251 [18] Hyndman, R. J. A brief history of forecasting competi- and Iridium 33. In: Proceedings of the 21st International tions. International Journal of Forecasting, 2020, 36(1): Symposium on Space Flight Dynamics, 2009: 2009-294. 7{14. [5] Ryan, S., Christiansen, E. L. Hypervelocity impact test- [19] Kisantal, M., Sharma, S., Park, T. H., Izzo, D., M artens, ing of advanced materials and structures for microme- M., D'Amico, S. Satellite pose estimation challenge: teoroid and orbital debris shielding. Acta Astronautica, Dataset, competition design, and results. IEEE Transac- 2013, 83: 216{231. tions on Aerospace and Electronic Systems, 2020, 56(5): [6] IADC. IADC space debris mitigation guidelines. Avail- 4083{4098. able at https://www.iadc-home.org/ (cited in 2007). [20] Merz, K., Virgili, B. B., Braun, V. Risk reduction and [7] Walker, R., Martin, C. E. Cost-e ective and robust collision risk thresholds for missions operated at ESA. mitigation of space debris in low earth orbit. Advances In: Proceedings of the 27th International Symposium on in Space Research, 2004, 34(5): 1233{1240. Space Flight Dynamics (ISSFD), 2019. [8] Biesbroek, R., Innocenti, L., Wolahan, A., Serrano, S. [21] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bow- M. e. Deorbit-ESA's active debris removal mission. In: man, S. GLUE: A multi-task benchmark and analysis Proceedings of the 7th European Conference on Space platform for natural language understanding. In: Pro- Debris, 2017: 10. ceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, [9] Liou, J. C., Johnson, N. L., Hill, N. M. Controlling the 2018: 353{355. growth of future LEO debris populations with active [22] Hyndman, R. J., Athanasopoulos, G. Forecasting: Prin- debris removal. Acta Astronautica, 2010, 66(5{6): 2288{ ciples and Practice, 2nd edn. OTexts, 2018. [23] Christ, M., Braun, N., Neu er, J. tsfresh a python pack- [10] Izzo, D. E ects of orbital parameter uncertainties. Jour- age. Available at https://tsfresh.readthedocs.io. nal of guidance, control, and dynamics, 2005, 28(2): [24] Olson, R. S., Bartley, N., Urbanowicz, R. J., Moore, 298{305. J. H. Evaluation of a tree-based pipeline optimization [11] Smirnov, N. N. Space Debris: Hazard Evaluation and tool for automating data science. In: Proceedings of Debris. CRC Press, 2001. the Genetic and Evolutionary Computation Conference, [12] Flohrer, T., Braun, V., Krag, H., Merz, K., Lemmens, S., 2016: 485{492. Virgili, B. B., Funke, Q. Operational collision avoidance [25] Wang, C., B ack, T., Hoos, H. H., Baratchi, M., Limmer, at ESOC. In: Proceedings of the Deutscher Luft-und S., Olhofer, M. Automated machine learning for short- Raumfahrtkongress, 2015. term electric load forecasting. In: Proceedings of the 2019 [13] Logue, T. J., Pelton, J. Overview of commercial small IEEE Symposium Series on Computational Intelligence satellite systems in the \New Space" age. In: Handbook (SSCI), 2019: 314{321. of Small Satellites. Pelton J. Ed. Springer, Cham, 2019: [26] Mueller, J., Thyagarajan, A. Siamese recurrent archi- 1{18. tectures for learning sentence similarity. In: Proceedings [14] Flohrer, T., Krag, H., Merz, K., Lemmens, S. CREAM- of the 30th AAAI conference on arti cial intelligence, ESA's proposal for collision risk estimation and auto- 2016. mated mitigation. In: Proceedings of the Advanced Maui [27] Kingma, D. P., Ba, L. J. Adam: A method for stochastic Optical and Space Surveillance Technologies Conference optimization. arXiv preprint, 2014: arXiv:1412.6980. (AMOS), 2019. https://arxiv.org/abs/1412.6980. 20 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. [28] Hastie, T., Tibshirani, R., Friedman, J. Boosting and later joined the European Space Agency (ESA) and became additive trees. In: The Elements of Statistical Learning, the scienti c coordinator of its Advanced Concepts Team. 2nd edn. New York: Springer New York, 2008: 337{387. He devised and managed the Global Trajectory Optimiza- [29] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, tion Competitions, the ESA's Summer of Code in Space, and W., Ye, Q., Liu, T.-Y. Light-GBM: A highly ecient gra- the Kelvins innovation and competition platform for space dient boosting decision tree. In: Proceedings of the 31st problems. He published more than 180 papers in interna- Annual Conference on Neural Information Processing tional journals and conferences making key contributions to Systems, 2017: 3146{3154. the understanding of ight mechanics and spacecraft control [30] Kuhn, M. Johnson, K. Applied Predictive Modeling. New and pioneering techniques based on evolutionary and ma- York: Springer-Verlag New York, 2013. chine learning approaches. Dario Izzo received the Humies Gold Medal and led the team winning the 8th edition of Thomas Uriot graduated from the Uni- the Global Trajectory Optimization Competition. E-mail: versity of Oxford in the UK, where he Dario.izzo@esa.int. obtained his master degree in statistics Open Access This article is licensed under a Creative Com- and mathematics. Thomas worked as a mons Attribution 4.0 International License, which permits researcher at the ESA in the Advanced use, sharing, adaptation, distribution and reproduction in Concepts Team, where he conducted re- any medium or format, as long as you give appropriate credit search on evolutionary machine learning to the original author(s) and the source, provide a link to and spacecraft collision avoidance. E- the Creative Commons licence, and indicate if changes were mail: uriot.thomas@gmail.com. made. Dario Izzo graduated as a doctor of The images or other third party material in this article are aeronautical engineering from the Uni- included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material versity Sapienza of Rome (Italy). He then took a second master degree in satellite is not included in the article's Creative Commons licence and platforms at the University of Cran eld in your intended use is not permitted by statutory regulation or the UK and completed his Ph.D. degree exceeds the permitted use, you will need to obtain permission directly from the copyright holder. in mathematical modelling at the Univer- sity Sapienza of Rome where he lectured To view a copy of this licence, visit http://creativecoorg/ classical mechanics and space ight mechanics. Dario Izzo licenses/by/4.0/. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Astrodynamics Springer Journals

Spacecraft collision avoidance challenge: Design and results of a machine learning competition

Loading next page...
 
/lp/springer-journals/spacecraft-collision-avoidance-challenge-design-and-results-of-a-JApMiMyuaT
Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2021
ISSN
2522-008X
eISSN
2522-0098
DOI
10.1007/s42064-021-0101-5
Publisher site
See Article on Publisher Site

Abstract

Astrodynamics https://doi.org/10.1007/s42064-021-0101-5 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 1 1 2 3 4 4 Thomas Uriot , Dario Izzo (B), Lu s F. Sim~ oes , Rasit Abay , Nils Einecke , Sven Rebhan , 5 5 5 5 Jose Martinez-Heras , Francesca Letizia , Jan Siminski , and Klaus Merz 1. The European Space Agency, Noordwijk, 2201 AZ, the Netherlands 2. ML Analytics, Lisbon, Portugal 3. FuturifAI, Canberra, Australia 4. Honda Research Institute Europe GmbH, O enbach 63073, Germany 5. ESOC, Space Debris Oce, Darmstadt 64293, Germany ABSTRACT KEYWORDS Spacecraft collision avoidance procedures have become an essential part of satellite space operations. Complex and constantly updated estimates of the collision risk between debris orbiting objects inform various operators who can then plan risk mitigation measures. collision avoidance Such measures can be aided by the development of suitable machine learning (ML) competition models that predict, for example, the evolution of the collision risk over time. In October kelvins 2019, in an attempt to study this opportunity, the European Space Agency released a large curated dataset containing information about close approach events in the form of conjunction data messages (CDMs), which was collected from 2015 to 2019. This dataset Research Article was used in the Spacecraft Collision Avoidance Challenge, which was an ML competition where participants had to build models to predict the nal collision risk between orbiting Received: 1 October 2020 objects. This paper describes the design and results of the competition and discusses the Accepted: 27 January 2021 © The Author(s) 2021 challenges and lessons learned when applying ML methods to this problem domain. 1 Introduction to de ning guidelines to mitigate collision risk and pre- serve the space environment for future generations [6]. The overcrowding of the low Earth orbit (LEO) has been As a result, agencies, as well as operators and manufac- extensively discussed in the scienti c literature [1, 2]. turers, have been assessing a number of approaches and More than 900,000 small debris objects with a radius technologies in an attempt to alleviate this problem [7{9]. of at least 1 cm have been estimated to be currently Despite all the e orts to actively control debris and orbiting uncontrolled in the LEO , posing a threat to satellite populations, this problem is still of increasing operational satellites [3]. The consequences of an im- concern today. To illustrate the crowding of some areas pact between orbiting objects can be dramatic, as the of the LEO, we have visualized, as of 22 May 2020, the 2009 Iridium-33/Cosmos-2251 collision demonstrated [4]. position of all 19,084 objects monitored by the radar and While shielding a satellite may be e ective for impacts optical observations of the United States Space Surveil- with smaller objects [5], any impact of an active satellite lance Network (SSN) in Fig. 1. The gure clearly shows with objects that have cross-sections larger than 10 cm the density of objects at low altitudes, as well as the is most likely to result in its complete destruction. Over density drop around the northern and southern polar the past decades, international institutions and agencies caps owing to the orbital dynamics being dominated by have become increasingly concerned with and contributed the main perturbations that, in LEO, act primarily on B dario.izzo@esa.int the argument of perigee and on the right ascension of the ¬ Data from https://sdup.esoc.esa.int/discosweb/statistics/ (ac- cessed on 3 June 2020). ascending node [10]. 2 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. To obtain a rst assessment of the risk posed to events, including the destruction of Fengyun-1C (2007), an active satellite operating, for example, in a Sun- the Iridium-33/Cosmos-2251 collision (2009), and the synchronous orbit, we computed the closest distance of a Briz-M explosion (2012), convinced most satellite op- Sun-synchronous satellite to the LEO population and its erators to include the possibility of collision avoidance distribution at random epochs and within a two-year win- maneuvers in the routine operation of their satellites [12]. In addition, the actual number of active satellites is dow. Figure 2 shows the results for Sentinel-3B. In most steadily increasing, and plans for mega-constellations of the epochs, the satellite was far from other objects, but such as Starlink, OneWeb, and Project Kuiper [13] in- in some rare scenarios, the closest distance approached dicate that the population of active satellites is likely values that were of concern. A Weibull distribution can to increase in the coming decades. Thus, satellite colli- be tted to the obtained data, where results from ex- sion avoidance systems are expected to be increasingly treme value statistics justify its use to make preliminary important, and their further improvement, in particular inferences on collision probabilities [11]. Such inferences their full automation, will be a priority in the coming are very sensitive to the Weibull distribution parameters decades [14]. and, in particular, to the behavior of its tail close to the origin. 1.1 Spacecraft collision avoidance challenge This type of inference, as well as a series of resounding To advance the research on the automation of preven- tive collision avoidance maneuvers, the European Space Agency (ESA) released a unique real-world dataset con- taining a time series of events representing the evolution of collision risks related to several actively monitored satellites. The dataset was made available to the public as part of a machine learning (ML) challenge called the Collision Avoidance Challenge, which was hosted on the Kelvins online platform . The challenge occurred over two months, with 96 teams participating, resulting in 862 submissions. It attracted a wide range of people, from students to ML practitioners and aerospace engi- neers, as well as academic institutions and companies. In Fig. 1 Visualization of the density of objects orbiting the this challenge, the participants were requested to predict low Earth orbit as of 2020-May-22 (data from www.space- the nal risk of collision at the time of closest approach track.org). (TCA) between a satellite and a space object using data cropped at two days to the TCA. In this paper, we analyze the competition's dataset and results, highlighting problems to be addressed by the scienti c community to advantageously introduce ML in collision avoidance systems in the future. The paper is structured as follows: In Section 2, we describe the collision avoidance pipeline currently in place at ESA, introducing important concepts used throughout the pa- per and crucial to the understanding of the dataset. In Section 3, we describe the dataset and the details of its acquisition. Subsequently, in Section 4, we outline the competition design process and discuss some of the LN decisions made and their consequences. The competi- Fig. 2 Distribution of the distance between the closest ob- tion results, analysis of the received submissions, and ject and Sentinel-3B, and a tted Weibull distribution ( t skewed to represent the tail with higher accuracy). ¬ Hosted at https://kelvins.esa.int/. Spacecraft collision avoidance challenge: Design and results of a machine learning competition 3 challenges encountered when building statistical models and their associated uncertainties (i.e., covariances). The of the collision avoidance decision-making process are data contained in the CDMs are then processed to ob- the subjects of Section 5. In Section 6, we evaluate the tain risk estimates by applying algorithms such as the generalization of ML models in this problem beyond their Alfriend{Akella algorithm [17]. training data. In the days after the rst CDM, regular CDM updates are received, and over time, the uncertainties of the object positions become smaller as the knowledge on the close 2 Collision avoidance at ESA encounter is re ned. Typically, a time series of CDMs A detailed description of the collision avoidance process over one week is released for each unique close approach, currently implemented at ESA is available in previous with approximately three CDMs becoming available per reports [15, 16]. In this section, we brie y outline several day. For a particular close approach, the last obtained fundamental concepts. CDM can be assumed to be the best knowledge available The Space Debris Oce of ESA supports operational on the potential collision and the state of the two objects collision avoidance activities. Its activities primarily in question. If the estimated collision risk for a particular encompass ESA's missions Aeolus, Cluster II, Cryosat-2, event is close to or above the reaction threshold (e.g., the constellation of Swarm-A/B/C, and the Copernicus 10 ), the Space Debris Oce will alarm control teams Sentinel eet composed of seven satellites, as well as and begin planning a potential avoidance maneuver a the missions of third-party customers. The altitudes of few days prior to the close approach, as well as meeting these missions plotted against the background density of ¬ the ight dynamics and mission operations teams. While orbiting objects, as computed by the ESA MASTER , the Space Debris Oce at ESA provides a risk value are shown in Fig. 3. associated with each CDM, to date, it has not attempted to propagate the risk value into the future. Therefore, a practical baseline that can be considered as the current best estimate would be to use the latest risk value as the nal prediction. We introduce this estimate as the latest risk prediction (LRP) baseline in Section 4.4. 3 Database of conjunction events The CDMs collected by the ESA Space Debris Oce in support of collision avoidance operations between 2015 and 2019 were assembled into a database of conjunc- Fig. 3 Operational altitudes for the missions in LEO sup- tion events. Two initial phases of data preparation were ported by ESA Space Debris Oce, and the spatial density performed. First, the database of collected CDMs was of objects with a cross-section of > 10 cm. queried to consider only events where the theoretical The main source of information of the collision avoid- maximum collision probability (i.e., the maximum colli- ance process at ESA is based on conjunction data mes- sion probability obtained by scaling the combined target- sages (CDMs). These are ascii les produced and dis- chaser covariance) was greater than 10 . Here, the tributed by the United States based Combined Space target refers to the ESA satellites, while the chaser refers Operations Center (CSpOC). Each conjunction contains to the space debris or object to be avoided. In addition, information on one close approach between a monitored events related to intra-constellation conjunctions (e.g., space object (the \target satellite") and a second space for the Cluster II mission) and anomalous entries, such object (the \chaser satellite"). The CDMs contain multi- as scenarios with null relative velocity between the target ple attributes of the approach, such as the identity of the and chaser, were removed. Finally, some events may satellite in question, the object type of the potential col- cover a period during which the spacecraft performs a lider, the TCA, the positions and velocities of the objects, maneuver. In these scenarios, the last estimation of the ¬ Available at https://sdup.esoc.esa.int/master/. collision risk cannot be predicted from the evolution of 4 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. the CDM data, as the propulsive maneuver is not de- 4 Competition design scribed. These scenarios were addressed by removing all The database of conjunction events constitutes an impor- CDM data before the maneuver epoch. tant historical record of risky conjunction events that oc- The second step in the data preparation was the curred in LEO and creates the opportunity to test the use anonymization of the data. This involved transform- of ML approaches in the collision avoidance process. The ing absolute time stamps and position/velocity values in decision on whether to perform an avoidance maneuver relative values, respectively, in terms of time to the TCA is based on the best knowledge one has of the associated and state with respect to the target. The names of the collision risk at the time when the maneuver cannot be target mission were also removed, and a numerical mis- further delayed, i.e., the risk reported in the latest CDM sion identi er was introduced to group similar missions. available. Such a decision would clearly bene t from a A random event identi er was assigned to each event. forecast of the collision risk, enabling past evolution and The full list of the attributes extracted from the CDMs projected trends to be considered. During the design and released in the dataset, as well as their explanations, of the Spacecraft Collision Avoidance Challenge, it was are available on the Kelvins competition website. natural to begin from a forecasting standpoint, seeking Here, we brie y describe only a few attributes relevant an answer to the question: can an ML model forecast the to later discussions: collision risk evolution from available CDMs? • time to tca : time interval between the CDM creation Such a forecast could assist the decision of whether and the TCA (day). or not to perform an avoidance maneuver by providing • c object type : type of the object at a collision risk a better estimate of the future collision risk before fur- with the satellite. ther CDMs are released. Forecasting competitions are • t span : size of the target satellite used by the colli- widely recognized as an e ective means of determining sion risk computation algorithm (m). good predictive models and solutions for a particular pro- • miss distance : relative position between chaser and blem [18]. The successful designing of such competitions target. requires a good balance to be determined between the • mission id : identi er of the mission from which the desire to create an interesting and fair ML challenge, CDMs are obtained. motivating and involving a large community of data sci- • risk : self-computed value at the epoch of each CDM, entists worldwide, and ful lls the objective of furthering using the attributes contained in the CDM, as de- the current understanding by answering a meaningful scribed in Section 2. scienti c question [19]. Designing a competition to forecast r from the database Table 1 provides an overview of the resulting database, of conjunction events presents a few challenges. First, indicating the number of entries (i.e., CDMs) and unique the distribution of the nal risk r associated with all the close-approach events. The risk computed from the last conjunction events contained in the database is highly available CDM is denoted as r. skewed (Fig. 4), revealing how most events eventually Table 1 Database of conjunction events at a glance result in a negligible risk. Intuitively, the e ect is due to Characteristics Number the uncertainties being reduced as the objects get closer, which in most scenarios results in close approaches where Events 15,321 High-risk events (r > 10 ) 30 a safe distance is maintained between the orbiting objects. High-risk events (r > 10 ) 131 Furthermore, events that already require an avoidance High-risk events (r > 10 ) 515 maneuver are removed from the data, thus reducing the CDMs 199,082 number of high-risk events. This is particularly trouble- Average CDMs per event 13 some as the interesting events, the ones that are to be Maximum CDMs per event 23 Minimum CDMs per event 1 forecasted accurately, are the few ones for which the nal risk is signi cant. Second, there is signi cant heterogene- Attributes 103 ity in the various time series associated with di erent events, both in terms of the number of available CDMs Spacecraft collision avoidance challenge: Design and results of a machine learning competition 5 database, revealing an abrupt increase in the risk value of 6. In particular, there were 30 events with r > 4, 131 events with r > 5, and 515 events with r > 6 (Table 1). 4.2 Test and training sets ML algorithms learn relationships between inputs and outputs by maximizing a particular objective function. The aim is to automatically learn patterns from the training data that generalize to unseen data, known as the test set. Hence, the training and test sets must be obtained from similar data distributions. In addition, the Fig. 4 Histogram of the latest known risk value (logarithmic data in the test set should re ect the type of data that scale) for the entire dataset (training and testing sets). Note we care about when deploying the ML model in the real that there are 9505 events with a nal risk value of log r = 30 or lower, which are not displayed in this gure. world. While releasing the raw database of conjunction events and the actual time to tca at which CDMs are available, to the public was a priority, and thus provide the commu- and most importantly, of the time to tca of the last avail- nity with an unbiased set of information to learn from, able CDM that de nes the variable r to be predicted. the various models produced during the competition were Therefore, the test and training sets and the competition tested primarily on predictions of events deemed partic- metric were designed to alleviate these problems. ularly meaningful. Consequently, while the training and test sets originated from a split of the original database, 4.1 De nition of high-risk events they were not randomly sampled from it. Events corre- Many mission operators in LEO use 10 as a risk thres- sponding to useful operational scenarios appeared in the hold to implement an avoidance maneuver. Over time, test set. this value has been applied by default. However, the In particular, for some events, the latest available CDM selection of a suitable reaction threshold for a particular was days away from the (known) time to the closest mission depends on many di erent parameters (e.g., size approach, which made its prediction (also if correct) not of the chaser, target satellite), and its selection can be a good proxy for the risk at the TCA. Furthermore, driven by considerations of the risk reduction that an potential avoidance maneuvers were planned at least two days prior to the closest approach; thus, events that operator seeks to achieve [20]. Therefore, ESA missions contain several CDMs at least two days prior to the TCA in LEO adopt reaction thresholds ranging between 10 were more interesting. Overall, three constraints were and 10 . Events are monitored and highlighted when imposed on the events to be eligible in the test set: the collision risk is larger than a noti cation threshold, which is typically set to one order of magnitude lower 1. The event had to contain at least two CDMs, one to than the reaction threshold. Note that in the remainder infer from and one to use as the target. of this paper, the log of the risk value is used frequently, 2. The last CDM released for the event had to be within such that log r > 6 de nes high-risk events. Thus, we a day (time to tca < 1) of the TCA. often omit writing log and simply refer to r > 6. For 3. The rst CDM released for the event had to be at the objectives of the competition, a single noti cation least two days before the TCA (time to tca > 2) and threshold was used for all missions, and its value was all the CDMs that were within two days from the set at 10 . The threshold value was selected to have a TCA (time to tca < 2) were removed. higher number of high-risk events while maintaining its Figure 5 depicts an example of an event that satis ed value close to the more frequently used operational value the requirements. Note that by permitting only events of 10 . Figure 4 shows the risk computed from the last that satis ed the aforementioned requirements in the available CDM for all the close approach events in the test set, the number of high-risk events was considerably 6 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. was a highly unbalanced problem, where the proportion of low-risk events was much higher than that of high-risk events. The nal metric used considered these requirements and summarized them into one overall value to rank com- petitors. Eventually, the Spacecraft Collision Avoidance Challenge metric included both the classi cation and regression parts. Denoting the nal risk as r and the corresponding prediction as r ^, the metric can be de ned as Fig. 5 Diagram depicting the raw CDMs time series for one L(r ^) = MSE (r; r ^) (1) HR event (top), and the same series if it was selected for the test set (bottom): only the CDMs prior to two days to TCA were where F is computed over the entire test set using two made available (labeled as x) and the latest CDM was used classes (high nal risk: r > 6, low nal risk: r < 6) as the target (labeled as y). and MSE (r;) is only computed for high-risk events. HR More formally, we obtain diminished. After enforcing the three requirements de- scribed above, only 216 high-risk events (out of 515) were MSE (r; r ^) = 1 (r r ^ ) (2) HR i i i eligible for the test set. Note that the remaining 299 high- i=1 risk events were maintained in the training set without being necessarily representative of the test events. where N is the total number of events, N = 1 i=1 Because of the unbalanced nature of the dataset and is the number of high-risk events, r and r ^ are the true i i the small number of high-risk events eligible for the test and predicted risks for the ith event, respectively, and set, we decided to place most of the eligible events into 1; if r > 6 1 = (3) the test set. Speci cally, 150 eligible high-risk events 0; otherwise were included in the test set and 66 in the training set. Finally, the F score is de ned as To alleviate the risk of directly probing the test set and thus over tting, we limited the number of submissions p q F = (1 + ) (4) per team to two per day during the rst month of the (  p) + q competition and to a single submission per day during where essentially controls the trade-o between pre- the second month. cision and recall, denoted as p and q, respectively. A higher value of means that a recall has more weight 4.3 Competition metric than precision; thus, more emphasis is placed on false negatives. To penalize false negatives more, we set = 2. In this section, we introduce the metric used to rank the participants and discuss its advantages and drawbacks. While the metric encourages participants to have a higher F score and a lower mean squared error, it in- Several criteria were used to design a metric that could troduces many layers of subjectivity. This is because the be fair and reward models of interest for operational ob- metric contains multiple sub-objectives that are com- jectives. The Spacecraft Collision Avoidance Challenge bined into one meta-objective. In the denominator, the had two main objectives: (i) the correct classi cation of F score is already an implicit multiobjective metric, events into high- and low-risk events; (ii) the prediction of where precision and recall are maximized to 1. Thus, the risk value for high-risk events. In other words, when- there is a trade-o between precision and recall, which ever an event belonged to the low-risk class, the exact risk is controlled by . In the numerator, the mean squared value was not important, and if an event belonged to the high-risk class, its exact value was of interest. Further- error penalizes erroneous predictions for high-risk events. more, because in the context of collision avoidance, false The squaring is justi ed by the desire to penalize large negatives were much more disastrous than false positives, errors. their occurrences were to be penalized more. Finally, this In the metric de ned in Eq. (1), F functions as a scal- 2 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 7 ing factor for MSE , where F assumes values in [0; 1] known observation. This is known to be optimal for HR 2 and MSE in R , which means that the metric is largely random walk data, and it operates well on economic and HR dominated by MSE in the numerator. Nonetheless, as nancial time series. Based on this fact, a second baseline HR reported in Section 5, even the highest-ranked models solution, called latest risk prediction (LRP) baseline is achieved a relatively small MSE ; thus, the F scaling de ned as the clipped naive forecast: HR 2 factor is appropriate. r ; if r > 6 2 2 i i In conclusion, several objectives were combined into r ^ = (6) 6:001; otherwise one metric, which introduced some level of complexity and subjectivity. An alternative to the metric used in and has a score of L = 0:694 when evaluated on the Eq. (1) would have a simple weighted average for each complete test set. Of the 97 teams, 12 managed to sub-metric (F and MSE ). This scoring scheme is submit better solutions. A few di erent teams obtained 2 HR routinely used in public benchmarks such as the popular and utilized this baseline (or equivalent variants) in their GLUE [21] score used in natural language processing, and submissions. The score achieved by the LRP is also it presents similar problems in the selection of weights reported in Table 3 and is plotted as a horizontal line in that function as scaling factors. Fig. 9, along with the proposed solutions from the top ten Note that according to Eq. (1), as soon as an event is teams. The LRP was of interest in this competition, as in predicted to be low-risk (r ^ < 6), the optimal prediction any forecasting competition, because it provides a simple to assign to the event is r ^ = 6 , where  > 0. Thus, and yet surprisingly e ective benchmark to improve upon. for a false negative, we minimize MSE , and for a true HR 4.5 Data split negative, the actual value does not matter, as long as r ^ < 6. Consequently, all risk predictions can be clipped This section discusses the splitting of the original dataset at a value slightly lower than 10 to improve the overall into training and test sets. First, principal component score (or at least produce an equivalent score). In the analysis (PCA) is applied to the data and it demonstrates remainder of this paper, we utilize this clipping, and the that the attributes depend on the mission identi er. In scores of the various teams are reported after the clipping other words, attributes recorded during di erent missions has been applied, using  = 0:001. are not obtained from the same distribution, making it dicult to generalize from one mission to another. Next, 4.4 Baselines we study the e ect of di erent splits of the test data on To have a sense of the e ectiveness of a proposed solution, the leaderboard scores (evaluated on a portion of the test baseline solutions should be introduced. For the Space- set) and in the nal ranking (evaluated on the full test craft Collision Avoidance Challenge, two simple concepts set), using the LRP baseline solution. can be used to build such baselines. Let us denote r ^ and In Fig. 6, the PCA projection of the original data r as the predicted risk and the latest known risk for is shown by maintaining only the rst two principal the ith event, respectively (the subscript 2 reminds us components. While the rst two principal components that the latest known risk for a close-approach event is only account for 20% of the total variance, the projected associated with a CDM released at least two days before data can still be distinguished and crudely clustered by the TCA, as shown in Fig. 5). The rst baseline solution, mission id, in particular mission id : 7 and mission id : 2. called the constant risk prediction (CRP) baseline, is This unsurprisingly implies that the attributes from the then de ned as CDMs do not come from the same distribution, making it potentially dicult to generalize from one mission r ^ = 5 (5) to another. Thus, each mission id refers to a di erent and has an overall score of L = 2:5. It constantly predicts satellite orbiting at di erent altitudes in regions of space the same value for the risk, and it was highlighted during with varying space debris density (Fig. 3). Therefore, imbalances in mission type should not be the competition as a persistent entry in the leaderboard. Of the 97 teams, 38 managed to produce a better model. created when splitting the data into training and test One of the simplest approaches in time series prediction sets. Figure 7 shows that, for low-risk events, the missions is the naive forecast [22], i.e., forecasting with the last are proportionally represented in both the training and 8 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. whereas mission id : 1, mission id : 6, and mission id : Mission id Mission id 20 are under-represented. This is because the dataset Mission id was randomly split into training and testing, considering Mission id Mission id only the risk value and not the mission type. For future research, we recommended that mission type should be considered during the splitting of the dataset, or datasets with a higher homogeneity should be created with respect to the mission type. Note that further analysis of the dataset split and the correlation between the training and test sets are presented in Section 6. Fig. 6 Projection of the original CDMs, from the test set, 5 Competition results onto the rst two principal components, colored according to After the competition ended and the nal rankings were mission id. made public, a survey was distributed to all the partici- pating teams in the form of a questionnaire. The results of the survey, the nal rankings, the methods of a few of the best-ranking teams, and a brief meta-analysis of all the solutions are reported in this section. 5.1 Survey A total of 22 teams participated in the survey, including all top-ranked teams. Some questions from the survey were targeted to gather more information on the back- ground of the participants. The questions were phrased Mission id as follows: \How would you describe your knowledge in Fig. 7 Distribution of the mission type for the test and space debris/astrodynamics?" A similar approach was training sets for low-risk events. conducted for ML and data science. The possible an- swers were limited to \professional", \studying", and \amateur". The answers are reported in Table 2, which shows that most participants had a background in ML and less in orbital mechanics. Note that the top three teams all identi ed themselves as ML professionals and two as studying orbital mechanics. Table 2 Background of the participants, out of 22 respon- dents to the end of the competition questionnaire Pro ciency Discipline Professional Student Amateur Mission id Machine learning 10 10 4 Orbital mechanics 4 5 15 Fig. 8 Distribution of the mission type for the test and training sets for high-risk events. As mentioned in Section 4.2 and reported in Table 1, test sets. However, when we examine only high-risk the dataset for the collision avoidance challenge is highly events (Fig. 8), we observe that the missions are not well unbalanced, with the training and test sets not randomly distributed. In particular, mission id : 2, mission id : 15, sampled from the dataset. A question from the survey and mission id : 19 are over-represented in the test set, probed whether the participants explicitly attempted to Spacecraft collision avoidance challenge: Design and results of a machine learning competition 9 address class imbalance (e.g., by arti cially balancing beat the LRP baseline, most of the teams required ap- proximately 20 days to do so, implying that the LRP the classes, assigning importance weighing to samples) baseline was fairly strong. This was further supported by asking, \Did you apply any approach to compensate by the fact that the scores did not improve much below for the imbalances in the dataset?" A total of 65% of the LRP baseline, suggesting that the naive forecast is the participants answered positively. Furthermore, half an important predictor of the risk value at the closest of the participants reported attempting to build a vali- approach. dation set with similar properties and risk distribution The nal results, broken down into MSE , with the HR as the test set, albeit failing since most surveyed teams risk clipped at 6.001 and the F components, are shown lamented a poor correlation between training and test in Table 3 for the top ten teams. All teams managed set performances. to improve upon the LRP baseline score by obtaining a One of the main scienti c questions that this challenge better MSE . However, many teams failed to obtain a HR aimed at addressing was whether the temporal infor- better F value than the LRP baseline. mation contained in the time series of CDMs was used To further investigate the di erences between the F to infer the future risk of collision between the target score achieved by the teams and the LRP baseline so- and chaser. A speci c question from the survey asked lution, it is useful to examine the false positives and participants if they found the evolution of the attributes false negatives of each returned model (Fig. 10(b)). The over time useful to predict the nal risk value. Surpri- Pareto front is very heterogeneous and consists of several singly, 65% of the teams framed the learning problem as teams: DunderMiin, Valis, Magpies, DeCRA, diet- a static one, summarizing the information contained in the time series as an aggregation of attributes (e.g., using summary statistics, or simply the latest available CDM). This may have been a direct consequence of the great predictive strength of a naive forecast for this dataset, as outlined in the approaches implemented by the top teams in Section 5.3. Finally, because of the small number of high-risk events in the test set and the emphasis placed on false negatives induced by the F score, it is natural to ask whether teams probed the test set through a trial-and-error process. Overall, 30% of the participants (including the top-ranked team sesc, see Section 5.3) reported utilizing a trial-and- Fig. 9 Evolution of the scores of the submission of various error method to identify high-risk events, suggesting that top teams. the di erence between the test and training sets posed a Table 3 Final rankings (from best to worst) evaluated on signi cant problem for many teams, a fact that deserves the test set, for the top ten teams. The best results are shown some further insight, which we provide in Section 6. in bold Team Score MSE F HR 2 5.2 Final rankings sesc 0.556 0.407 0.733 96 teams participated to the challenge and produced a dietmarw 0.571 0.437 0.765 total of 862 di erent submissions during the competi- Magpies 0.585 0.441 0.753 Vidente 0.610 0.436 0.714 tion timeframe. The scores on the leaderboard changed DeCRA 0.615 0.457 0.743 frequently, and the nal ranking remained uncertain un- Valis 0.628 0.467 0.744 til the end of the competition. The evolution of the DunderMiin 0.628 0.451 0.718 madks 0.634 0.476 0.750 scores for the top ten teams throughout the competition vhrique 0.649 0.496 0.764 is shown in Fig. 9. Note how the top four teams closely Spacemeister 0.649 0.479 0.738 competed for rst place until the very last days. Another LRP baseline 0.694 0.513 0.739 observation is that while all the top teams managed to 10 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. the team attempted to use di erent methods, including extracting time series features [23], constructing an au- tomated ML pipeline via Genetic Programming [24, 25], and using random forests. All these approaches were ' reported to have a score of L 2 [0:83; 1:0] on the test set, but they performed radically better on the training set. Such a di erence was considered an indication that an automated, o -the-shelf ML pipeline was unlikely to be the appropriate way of learning from this dataset. Instead, the team resorted to a step-by-step approach informed by statistical analysis, utilizing the metric and (a) the constitution of the test set. Thus, the F score is biased toward false negatives, and there is a relatively higher proportion of high-risk events in the test set than in the training set. Furthermore, we can observe that, in the training set, most of the high-risk events misclassi ed by the naive forecast have the latest risk r only slightly below the threshold. A simple strategy is to promote borderline low-risk events to high-risk ones, thus improv- ing the recall (at the cost of penalizing precision), which is what the F score puts emphasis on. In practice, this strategy was implemented by introducing three thres- holds, referred to as step 0, step 1, and step 2, as shown (b) in Table 4 and Eq. (7). Fig. 10 On the top, in (a), the F score and the MSE 2 HR Additional incremental improvements were achieved are plotted for the top ten teams. On the bottom, in (b), the by assigning events to low risk whenever either the chaser F scores are broken into two components: false negatives type (c object type attribute) was identi ed as a payload (out of 150 positive events) and false positives (out of 2017 negative events). and the diameter of the satellite (t span attribute) was small (below 0.5) or the miss distance was greater than marw, vhrique, madks, and the baseline solution, de- 30,000 m. These steps are referred to as step 3, step 4, noted as Baseline. Although the baseline solution is in and step 5, respectively, in Table 4 and Eq. (7): the Pareto front, we can observe that the resulting F Finally, the risk value for high-risk events was clipped score in Fig. 10(a) is dominated by several teams. This is to a slightly lower risk value to enforce the general trend because the F score places more emphasis on penalizing of risk decrease over time, thus improving the MSE HR false negatives, which the baseline solution has the most while preserving the F score. This step is referred to as of. In Fig. 10(a), only two teams remain in the Pareto step 6 in Table 4, and Eq. (7). front: sesc and dietmarw. Interestingly, dietmarw has In summary, the aforementioned observations resulted the highest F score, and sesc has the lowest MSE , 2 HR in the introduction of a cascade of thresholds: suggesting that their methods can be combined to achieve a better overall score. 5:95; if 6:04 6 r < 6:00 (step 0) 5:60; if 6:40 6 r < 6:04 (step 1) 5.3 Methods used by top teams 5:00; if 7:30 6 r < 6:40 (step 2) > 2 6:00001; if c object type is \payload" (step 3) 5.3.1 Team sesc r ^ = 6:00001; if t span < 0:5 (step 4) The highest-ranking team was composed of scientists > >6:00001; if miss distance > 30000 (step 5) from diverse domains of expertise: evolutionary opti- >4:00; if 4:00 6 r < 3:50 (step 6) mization, ML, computer vision, data science, and energy 3:50; if r > 3:50 (step 6) management. In the early stages of the competition, (7) Spacecraft collision avoidance challenge: Design and results of a machine learning competition 11 Table 4 Evaluation of team sesc's approach, as additional steps were added Training set Test set Combinations of steps MSE F Loss MSE F Loss Leaderboard HR 2 HR 2 LRP baseline 0.330 0.411 0.804 0.513 0.739 0.694 0.718 Steps: 0 0.330 0.430 0.768 0.512 0.753 0.680 0.703 Steps: 0 + 1 0.305 0.392 0.779 0.498 0.764 0.653 0.670 Steps: 0 + 1 + 2 0.290 0.296 0.982 0.445 0.738 0.603 0.612 Steps: 0 + 1 + 2 + 3 0.290 0.301 0.966 0.426 0.735 0.579 0.587 Steps: 0 + 1 + 2 + 4 0.290 0.298 0.974 0.447 0.735 0.608 0.611 Steps: 0 + 1 + 2 + 5 0.325 0.304 1.070 0.444 0.733 0.607 0.613 Steps: 0 + 1 + 2 + 6 0.293 0.296 0.990 0.424 0.738 0.575 0.581 Steps: 0 + 1 + 2 + 3 + 4 + 5 + 6 0.327 0.311 1.050 0.414 0.728 0.569 0.564 Steps: 0 + 1 + 2 + 5 + 6 0.327 0.304 1.077 0.424 0.733 0.578 0.581 Steps: 0 + 1 + 2 + 5 + 6 + 7 0.327 0.304 1.077 0.407 0.733 0.555 0.555 5.3.2 Team Magpies the mean (mean risk CDMs ) and standard deviation (std risk CDMs ) of the risk values of the CDMs. The third-ranked team was composed of a space situa- Hyperbolic tangents were used as activation functions, tional awareness (SSA) researcher and Learning (ML) and Adam was used as the gradient descent optimizer [27]. engineer. The team achieved its nal score by lever- aging Manhattan-LSTMs [26] and a siamese architec- The training data were split using a three-fold cross- ture based on recurrent neural networks. Team Mag- validation (eight events were selected in each valida- pies began by analyzing the dataset and ltering the tion fold, from the 23 anomalous events). Subsequently, training data according to the test set requirements fnon-anomalous, non-anomalousg, and fnon-anomalous, described in Section 4.2. Subsequently, they selected anomalousg pairs were generated for the siamese network seven out of 103 features (time to tca, max risk estimate, to learn similar and dissimilar pairs, respectively. max risk scaling, mahalanobis distance, miss distance, For each validation fold, several networks were trained c position covariance det, and c obs used ) by comparing using di erent hyperparameters. Networks that attained the distribution di erence of the non-anomalous event a reasonably high performance were then used in a ma- (last available collision risk is low and ends up low at jority voting ensemble scheme with equal weights. The close approach, and vice-versa for high-risk events) and majority vote outcome was denoted as f , used ten fea- anomalous events (last available collision risk is low and tures as inputs, denoted as x, and predicted whether a ends up high at the close approach, and vice-versa for low-risk event was anomalous or not. Then, the nal high-risk events). Figure 11 shows the number of anoma- predictions of the test set were expressed as lous and non-anomalous scenarios. In addition to these >6:001; if r < 6 and f (x) = non-anomalous seven attributes, three new features were included: the r ^ = 5:35; if r < 6 and f (x) = anomalous number of CDMs (number CDMs ) issued before two days, r ; if r > 6; 2 2 (8) where 5.35 is the average risk value of all high-risk events in the training set. S S 5.4 Diculty of samples In this section, we investigate the events in the test set that were consistently misclassi ed by all the top ten teams. These events can be separated into two groups: Fig. 11 Number of anomalous and non-anomalous events false positives and false negatives. The false negatives from the training set. correspond to events that were incorrectly classi ed as low 12 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. risk and false positives correspond to events incorrectly have in common is high uncertainties in their associated classi ed as high risk. Figure 12 shows the evolution of measurements (e.g., position and velocity), resulting in the risk of events that were consistently misclassi ed. The very uncertain risk estimates, susceptible to large jumps gure shows that these events all experience a signi cant close to the TCA. Figure 13 shows the evolution of the change in their risk value, as they progressed to the uncertainty in the radial velocity of the target spacecraft closest approach, thus rendering the use of the latest risk (t sigma rdot ) for the 150 high-risk events in the test set. value misleading. Furthermore, as shown in Fig. 12, the The uncertainty values were generally higher for mis- temporal information is likely to be of little use to make classi ed events. Note that many more uncertainty at- good inferences in these scenarios: there is no visible tributes were recorded in the dataset, and Fig. 13 shows trend and the risk value jumped from one extreme to the only one of them. Higher uncertainties for the misclassi- other (from very low to very high risk in (a) and vice- ed events suggests that there may be value in building versa in (b)). One characteristic that all these events a model which takes these uncertainties into account at inference time, for instance, by outputting a risk distri- bution instead of a point prediction. (a) Fig. 13 Evolution of the uncertainty in the radial velocity of the target spacecraft (t sigma rdot ) over time, up until two days to the TCA. The evolution of the uncertainty of the 11 false negative events (Fig. 12) is indicated in red. The evolution of the uncertainty of the 139 remaining true positive events is indicated in black. 6 Post competition ML Further ML experiments were conducted on the dataset, both to analyze the competition and to further investi- gate the use of predictive models for collision avoidance. (b) The aim of these experiments was to understand the diculties experienced by competitors in this challenge Fig. 12 Events consistently misclassi ed by all the top ten teams: (a) false negatives, (b) false positives. In the top and to gain deeper insights into the ability of ML models panel, we show all false negatives (11 out of 150 high-risk to learn generalizable knowledge from these data. events). Each event is represented as a line, and the CDMs are marked with crosses. The evolution of the risk between 6.1 Training/test set loss correlation two CDMs is plotted as a linear interpolation. In the bottom panel, we show ten randomly sampled events out of 62 false The rst experiment was designed to analyze the cor- positives in total. These events were particularly dicult to classify because of the big leap in risk closer to the TCA, relation between the performance of ML models on the ranging from low risk to high risk in (a) and vice-versa in (b). training and test sets used during the competition. Only Spacecraft collision avoidance challenge: Design and results of a machine learning competition 13 the training set events conforming to the test set speci - classi cation metric, cared only for where the risk values cations (Section 4.2) were considered: nal CDM within lay with respect to the 6 threshold. a day of the TCA, and all other CDMs at least two For the type of regression problem outlined above, with days away from the TCA. The last CDM that is at least tabular data representation, gradient boosting meth- two days away from the TCA, the most recent CDM ods [28] o er state-of-the-art performance. Thus, we available to operators when making the nal planning selected the LightGBM gradient boosting framework [29] decisions, was used here as the sole input to the model. to train many models. To attain both training speed and model diversity, we changed the hyperparameters In other words, temporal information from the CDM as follows (with respect to the default in the LGBMRe- time series was not considered. From that CDM, only the numerical features were used; the two categorical fea- gressor of LightGBM 2.2.3): the n estimators was set to tures (mission id and object type ) were discarded. Thus, 25, feature fraction to 0.25, and learning rate to 0.05. for models to learn mission or object-speci c rules, they Together, these settings resulted in an ensemble with would have to utilize features encoding relevant prop- fewer decision trees (the default is 100), and each tree erties of that feature or object. It was hoped that this was trained exclusively on a small random subset of 25% would force the model to learn more generalizable rules. of the available features (the default is 100%), and each In addition to this step, no other transformations were successive tree had a reduced capability to overrule what applied to the CDM raw values (such as scaling or embed- previous trees had learned (the default learning rate, also known as shrinkage in the literature, is 0.1). ding). Similarly, no steps were implemented to impute the missing values that occurred at times in many of Figure 14 shows the evaluations of 5000 models, on the the CDM variables. We left these to be addressed by training and test sets, on the MSE and 1=F metrics, HR 2 the ML algorithm (LightGBM in our case) through its as well as their product (the competition's score, or loss own internal mechanisms. Most importantly, the model's metric). The risk values were clipped at 6.001 prior to target was de ned as the change in risk value between measuring the MSE . We compared the performance of HR the models on the training (x-axis) and test sets (y-axis). the input CDM and the event's nal CDM (r r ), As a reference, the dotted lines show the performance rather than the nal risk r itself. This facilitated the of the LRP baseline (Section 4.4). The Spearman rank- learning task as it implicitly reduced the bias toward the most represented nal risk (i.e., 30). Furthermore, it order correlation coecient was computed as an indicator enabled a direct comparison to the LRP baseline, as the of the extent to which performance in the training set various models were de facto tasked to predict a new generalizes to the test set. Only one model (0.02% of the trained models) outper- estimator (h) such that r = LRP + h. The quantity h formed the LRP baseline loss in both the training and was further encoded through a quantile transformer to test sets. With a test set loss of 0.684 (1.4% gain over assume a uniform distribution. Eventually, the training data consisted of a table of LRP), that model would have ranked 11 th in the ocial 8293 CDMs from as many events, each described by 100 competition. Overall, Fig. 14 shows several undesirable trends. The features. Each CDM was assigned a single numerical MSE plot exhibited a positive correlation: the training value that was to be predicted. Overall, these steps HR set performance was predictive of the performance on the resulted in a simpli ed data pipeline. Note the absence of any steps to address the existing class imbalance and test set. However, the models struggled to improve the the absence of any focus on high-risk events during the LRP in both sets. Most models degraded the risk estimate training process. Models were requested to learn the available in the most recent CDM. In 1=F , we observed evolution of risk across the full range of risk values, a strong negative correlation: the better the performance although they were mostly assessed on their performance on the training set, the worse it was on the test set. at one end of the range of risk values during evaluation. This was a clear sign of over tting. When aggregated, The competition's MSE metric was obtained only over we remained with a loss metric displaying essentially HR the true high-risk events, and the use of clipping at a no correlation. This observation, while bound to the risk of 6.001 further ignored where the nal predicted modeling choices made, o ers a possible explanation for risk lay if it fell below this value. In addition, F , a competitors' sentiment of disappointment over models 2 14 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. )3 )3 .4& .4& Fig. 14 Performance levels achieved by 5000 gradient boosting regression models trained using the competition's dataset split. that were good in their local training setups evaluating learning of their properties unlikely. Similarly, the object poorly on the leaderboard. type attribute had an identical imbalance, which led to similar challenges. Although this partitioning process 6.2 Simulating 10,000 virtual competitions made achieving a higher score on the performance metrics more dicult, it served the current aim of evaluating the To further our understanding of the absence of a signi- generalization capability. cant Spearman rank correlation between training and In each of the 10,000 virtual competitions, 100 regres- test set performances, as highlighted in Fig. 14, we simu- sion models were trained using the same data pipeline and lated 10,000 possible competitions that di ered in the model settings as described in the previous section. On data split. In each, a test size was randomly selected average, 526 competitions were simulated for each of the from a set of 19 options, containing the values from 19 di erent test size settings, each with its own distinct 0.05 through 0.95 in steps of 0.05. This setting indicates data split. In total, 1 million models were trained. Al- the fraction of events that should be randomly selected though framed here as virtual recreations of the Kelvins to be moved to the test set. The full dataset being par- competition, this process implemented, per test size set- titioned was composed solely of the 10,460 events that ting, a Monte Carlo cross-validation or repeated random conformed to the ocial competition's test set speci- sub-sampling validation [30]. If the number of random cations. We adopted a di erent splitting procedure data splits approached in nity, the results would tend from that reported in Section 4.5. A strati ed shue toward those of leave-p-out cross-validation. splitter was used, so the proportions of nal high-risk The experiment's results are shown in Figs. 15{19. events in both the training and test sets would always Figure 15 shows statistics on the Spearman rank-order match the proportion observed in the dataset being par- correlation coecients between model evaluations in the titioned (2.07%) as closely as possible. For reference, a training and test sets per evaluation metric. A positive test size of 0.2 results in 172.8 high-risk events on average Spearman correlation signals the ability to use the met- in the training set, and 43.2 in the test set (and 8195.2 ric for model selection. The better a model is on the and 2048.8 low-risk events, respectively). In the train- training set, the better we expect it to be on the unseen ing and test sets, no allowances were made to preserve events of the test set. A negative correlation is a sign of the event distributions of mission ids and chaser object over tting or inability to generalize beyond the training types present in the full dataset being partitioned. As set data. Figure 16 complements the analysis in Fig. 15 shown in Figs. 7 and 8, the fraction of events from the by showing the statistics on the percentage of the models di erent missions had such an imbalance that many of these generated splits likely either resulted in some mis- per simulated competition that outperformed the LRP sions being entirely unrepresented in either the training baseline in their respective training and test sets. The or test set or having such low volumes as to render the curves show the mean performance as a function of the Spacecraft collision avoidance challenge: Design and results of a machine learning competition 15 Fig. 15 Extent to which di erent data splits a ected the ability to infer test set performance from the training set performance. Expected Spearman rank-order correlation coecients between training and test set evaluations, as data sets vary in the fraction of events assigned to both (shown in the x-axis). Correlations measured in the MSE and 1=F metrics, as well as HR 2 their product. Fig. 16 Expected percentage of ML models that would outperform the LRP baseline in both the training and test set, as data sets varied in the fraction of events assigned to both (shown in the x-axis). Performance measured in the MSE and HR 1=F metrics, as well as their product. test size, and the shaded areas represent the region within by themselves displayed a low correlation between the one standard deviation. Figure 17 also shows the corre- training and test set and aggregated them into a single lations between the training and test set evaluations, but value, which was even less correlated. Furthermore, as now matches MSE correlations to 1=F correlations. shown in Fig. 17, the highest loss correlations tended HR 2 Thus, an overview of the e ect of the same data split to occur when the MSE was highly correlated. The HR on models' capabilities to learn generalizable knowledge MSE was of the three metrics the one that tended to HR display a higher rank correlation. However, as shown in simultaneously with respect to regression and classi ca- Fig. 16, few models outperformed the MSE obtained tion objectives can be obtained. The red star places the HR Kelvins competition's unstrati ed (with respect to high- by the LRP baseline on both sets. This indicated an risk events) data split in the context of 10,000 strati ed identical scenario to that shown in Fig. 14, in which we splits, indicating how much of an outlier it turned out obtained a high positive correlation, but the models were to be. not particularly successful. Predicting the actual nal risk The rst conclusion to be drawn from these gures is value was dicult; therefore, the further our predictions that the aggregated loss metric, MSE =F , was deci- moved away from the most recent risk estimate in high- HR 2 dedly uninformative in terms of identifying models that risk events (the only events scored by this metric), the were likely to generalize. It required two metrics that worse we were likely to perform, both in the training 16 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. By normalizing models' 1=F evaluations with respect to the LRP baseline 1=F values, performance became comparable across di erent data splits. Figure 18 shows the mean and standard deviation of models' percentage gains in performance over the LRP baseline in the training and test sets. Statistics were calculated across all models trained over all the data splits that used the same test size setting. Figure 19 shows models' 1=F evaluations in the training and test sets, normalized against their respective LRP 1=F baseline, for selected test size settings. Over 50,000 ML models are shown in each subplot, trained over 500 data splits on average with that test size setting. At one end, with a test size of 0.95, training sets had merely 523.0 events to learn from (10.8 of which are of Fig. 17 Extent to which di erent data splits a ected the high risk). With insucient data to learn from, the mod- ability to infer test set performance from the training set els quickly over t and failed to learn generalizable rules. performance, as observed through simultaneous evaluations This was indicated by a mean gain in performance of of regression and classi cation metrics. Spearman rank-order 12.61% on the training set, but a 7.73% mean loss in correlation coecients between training and test set evalua- tions of the three performance metrics in 10,000 di erent data performance on the test set, both with respect to the splits using di erent test size fractions. The red star corre- LRP baseline. As we increased the amount of data avail- sponds to the data split of the ocial competition (Fig. 14). able for training models, the training set performance decreased (more data patterns to learn from and harder to incorporate individual event idiosyncrasies into the model), but test performance increased. At the other end, with a test size of 0.05, most data were available for training, but the small test set was no longer rep- resentative (523.0 events to evaluate models on, 10.8 of high risk). Depending on the \predictability" of events that ended up on the test set, we either obtained a very high or very low performance: a mean gain of 0.04% on the test set, with a standard deviation of 9.83%. Here, the optimal trade-o lay in a test size of 0.2, where a 6.45% gain in the training set performance over the LRP baseline translated into a 1.12% gain in the test set. It Fig. 18 Expected 1=F performance gain (%) over the LRP is common for data scientists to use 80/20 splits of the baseline as data splits varied in the fraction of events assigned dataset, as a rule of thumb inspired by the Pareto princi- to either the training or test set. ple. Note we experimentally converged as this being the and test sets. Nonetheless, as shown in the 1=F plots in ideal setting. To establish an ML performance baseline, we now turn Figs. 15 and 16, even if the predicted nal risk values were not accurate, those perturbations moved the values across directly to the F score rather than its inverse (see the the 6 risk threshold to result in improved capability discussion in Section 4.3). F , which ranges in [0; 1], is to forecast the nal risk class. With a test size of 0.2, the harmonic mean of precision and recall, where recall 67.55% of the trained models outperformed the LRP (ability to identify all high-risk events) is valued two times baseline on both their training and test sets. This was in higher than precision (ability to ensure all events pre- stark contrast to Fig. 14, where only 0.04% of the models dicted as being of high-risk indeed are). A Monte Carlo (two out of 5000) outperformed the LRP baseline for the cross-validation with a test size of 0.2 and 505 strati- 1=F metric. ed random data splits evaluated the LRP baseline (the 2 Spacecraft collision avoidance challenge: Design and results of a machine learning competition 17 Fig. 19 Variation in the training and test set performance of the ML models (1=F ) as a function of data availability. Performance normalized with respect to LRP baseline performance in the same datasets. direct use of an event's latest CDM's risk value as pre- 6.3 Feature relevance diction) to a mean F score of 0.59967 over the test set The experiment described in the previous section provides (standard deviation: 0.04391). Over the same data splits, a basis on which to quantify feature relevance that is a LightGBM regressor, acting over the same CDM raw independent from the speci cs of any particular data values (see data and model con gurations in Section 6.1), split or the decision-making of any individual model. We evaluated to a mean F score of 0.60732 over the test provide that information here, to illustrate what signal set (standard deviation: 0.04895; statistics over 50500 ML models use to arrive at their predictions, and to direct trained models). Therefore, a gain of 1.2743% over the future research towards the more important features to LRP baseline . The di erence in performance between train models on. both approaches is statistically signi cant: a paired Stu- Of the 1 million models trained in the previous sec- dent's t-test rejects the null hypothesis of equal averages tion's experiment, 47.75%, from across di erent test size (t-statistic: 10.23, two-sided p-value: 1:83 10 ; per settings, surpassed the 1=F LRP baseline on both their data split, the LRP F score was paired with the mean training and test sets. We selected all those models and LightGBM model F score to ensure independence across used LightGBM to quantify their \gain" feature rele- pairs). vance. This process did not measure how frequently a This is the strongest evidence yet that ML models can feature was used across a model's decision trees, but indeed learn generalizable knowledge in this problem. rather the gains in loss it enabled when that feature was used (loss here refers to the objective function op- In a domain that safeguards assets valued in millions, timized by the algorithm while building the model, not a 1% gain in risk classi cation can already be transfor- to the competition's MSE =F scoring function). For mative. Furthermore, this would be a 1% gain on top HR 2 each model, relevance values were normalized over the of approaches for risk determination that have been de- features' total gains and converted to percentages. Sub- veloped for decades. Note that these results, obtained sequently, the values were aggregated through weighted using a classi cation metric, were achieved through a statistics across the selected models, resulting in the rele- regression modeling approach. Furthermore, it had an vance assessments shown in Table 5 (only the top twenty intentionally limited modeling capability, was trained features are shown, out of the 100 used). The models' over a basic data preparation process, and was evaluated fractional gains in performance over the test sets' LRP under adverse conditions (owing to imbalances in the 1=F baseline were used as weights. mission id and type of chaser object). Thus, we expect it The LRP is a strong predictor, as previously dis- will be possible to signi cantly surpass these performance cussed. However, relevance measurements indicated that levels with more extensive research in data preparation in the ML models, features directly related to risk (risk, and modeling. max risk scaling, max risk estimate ) together accounted ¬ For comparison, cross-validation of team sesc's method (Sec- for only half (54.44%) of the models' gains in loss. Models tion 5.3.1), over the same 505 data splits with test size of 0.2, widely used the information available to them, with the evaluated to a mean F score of 0.51563 over the test set. This was a performance loss of 14% with respect to the LRP baseline. top twenty features in Table 5 accounting for 78.32% of 18 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. Table 5 Feature relevance estimates. In the prediction of increasing relevance by 0.07%. near-term changes in risk, percentage of the reduction in error Note that models under consideration used CDM raw attributable to the feature. A description of the features is available on the Kelvins website values as inputs. After some feature engineering, the attributes presented in Table 5 may follow a di erent Feature Rank Mean Std. dev. ranking. A result of their information content with re- risk 1 29.275 9.557 spect to the prediction target becoming clearer to identify max risk scaling 2 22.544 8.979 and use by the ML algorithms. Note also that correlated mahalanobis distance 3 3.261 1.675 c sigma t 4 3.000 1.715 features may have split relevance values between them, max risk estimate 5 2.624 1.367 causing them to appear lower in this ranking. c sigma rdot 6 2.191 1.369 miss distance 7 2.089 1.112 c position covariance det 8 1.778 1.066 7 Conclusions c sigma n 9 1.312 0.625 time to tca 10 1.236 0.517 The Spacecraft Collision Avoidance Challenge enabled, c sigma r 11 1.177 0.739 c obs used 12 1.164 0.554 for the rst time, the study of the use of ML methods c sigma ndot 13 0.964 0.437 in the domain of spacecraft collision avoidance owing to relative position n 14 0.954 0.754 the public release of a unique dataset collected by the c recommended od span 15 0.945 0.423 relative position r 16 0.835 0.440 ESA Space Debris Oce over more than four years of c sedr 17 0.779 0.486 operations. Several challenges, mostly derived from the SSN 18 0.773 0.372 unavoidable unbalanced nature of the dataset, had to c crdot t 19 0.718 0.468 be accounted for to release the dataset in the form of relative speed 20 0.699 0.400 a competition and the use of automated, o -the-shelf the gains in loss, and only two of the 100 features having ML pipelines were limited. Nevertheless, the competition a relevance of 0.0. results and further experiments presented here clearly A set of 40 features had values for both the \target" demonstrated two things. On one hand, naive forecasting (the ESA satellite { pre x t ), and \chaser" that should models have surprisingly good performances and thus are be avoided (space debris/object { pre x c ), for a total established as an unavoidable benchmark for any future of 80 of the 100 features. Note the absence of \target" work in this subject; on the other hand, ML models can features in Table 5. The relevance of \target" features improve upon such a benchmark, hinting at the possibility summed to a total of 9.41%, while \chaser" features of using ML to improve the decision-making process in summed to 23.49%. If the models were to rely too much collision avoidance systems. on the properties of the \target", they would be learning mission-speci c rules. Instead, we observed a greater Acknowledgements reliance on properties of the \chaser", and in features with relative values, thus enabling better generalization The ESA would like to thank the Unite States Space across missions. Surveillance Network for the agreement that enabled the The mean relevance estimates were very stable. The public release of the dataset for the objectives of the unweighted aggregation of normalized relevance values in competition. the remaining 52.25% of trained models not included in The authors would like to thank all the scientists that the selection above had a total of 10.51% absolute di e- participated in the Spacecraft Collision Avoidance Chal- rence across features. The higher-performing models from lenge and that dedicated their time and knowledge to an which the statistics in Table 5 were drawn exhibited by important element of ESA's operated satellites. comparison a greater reliance on risk and max risk scaling In particular, we would like to acknowledge all members (+4.62%). The SSN, the Wolf sunspot number, at a rank of 18, was one of the most relevant features. It was also of team sesc, whose methodology is brie y described in one of the features with a greater increase with respect this paper: Ste en Limmer, Sebastian Schmitt, Viktor to the alternate ranking, climbing three positions, and Losing, Sven Rebhan, and Nils Einecke. Spacecraft collision avoidance challenge: Design and results of a machine learning competition 19 References [15] Merz, K., Bastida Virgili, B., Braun, V., Flohrer, T., Funke, Q., Krag, H., Lemmens, S., Siminski, J. Current [1] Liou, J. C., Johnson, N. L. Instability of the present collision avoidance service by ESA's Space Debris Oce. LEO satellite populations. Advances in Space Research, In: Proceedings of the 7th European Conference on 2008, 41(7): 1046{1053. Space Debris, 2017. [2] Krag, H. Consideration of space debris mitigation re- [16] Braun, V., Flohrer, T., Krag, H., Merz, K., Lemmens, quirements in the operation of LEO missions. In: Pro- S., Bastida Virgili, B., Funke, Q. Operational support ceedings of the SpaceOps 2012 Conference, 2012. to collision avoidance activities by ESA's space debris [3] Klinkrad, H. Space Debris. Springer-Verlag Berlin Hei- oce. CEAS Space Journal, 2016, 8(3): 177{189. delberg, 2006. [17] Alfriend, K. T., Akella, M. R., Frisbee, J., Foster, J. L., Lee, D. J., Wilkins, M. Probability of collision error [4] Anselmo, L., Pardini, C. Analysis of the consequences analysis. Space Debris, 1999, 1(1): 21{35. in low Earth orbit of the collision between Cosmos 2251 [18] Hyndman, R. J. A brief history of forecasting competi- and Iridium 33. In: Proceedings of the 21st International tions. International Journal of Forecasting, 2020, 36(1): Symposium on Space Flight Dynamics, 2009: 2009-294. 7{14. [5] Ryan, S., Christiansen, E. L. Hypervelocity impact test- [19] Kisantal, M., Sharma, S., Park, T. H., Izzo, D., M artens, ing of advanced materials and structures for microme- M., D'Amico, S. Satellite pose estimation challenge: teoroid and orbital debris shielding. Acta Astronautica, Dataset, competition design, and results. IEEE Transac- 2013, 83: 216{231. tions on Aerospace and Electronic Systems, 2020, 56(5): [6] IADC. IADC space debris mitigation guidelines. Avail- 4083{4098. able at https://www.iadc-home.org/ (cited in 2007). [20] Merz, K., Virgili, B. B., Braun, V. Risk reduction and [7] Walker, R., Martin, C. E. Cost-e ective and robust collision risk thresholds for missions operated at ESA. mitigation of space debris in low earth orbit. Advances In: Proceedings of the 27th International Symposium on in Space Research, 2004, 34(5): 1233{1240. Space Flight Dynamics (ISSFD), 2019. [8] Biesbroek, R., Innocenti, L., Wolahan, A., Serrano, S. [21] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bow- M. e. Deorbit-ESA's active debris removal mission. In: man, S. GLUE: A multi-task benchmark and analysis Proceedings of the 7th European Conference on Space platform for natural language understanding. In: Pro- Debris, 2017: 10. ceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, [9] Liou, J. C., Johnson, N. L., Hill, N. M. Controlling the 2018: 353{355. growth of future LEO debris populations with active [22] Hyndman, R. J., Athanasopoulos, G. Forecasting: Prin- debris removal. Acta Astronautica, 2010, 66(5{6): 2288{ ciples and Practice, 2nd edn. OTexts, 2018. [23] Christ, M., Braun, N., Neu er, J. tsfresh a python pack- [10] Izzo, D. E ects of orbital parameter uncertainties. Jour- age. Available at https://tsfresh.readthedocs.io. nal of guidance, control, and dynamics, 2005, 28(2): [24] Olson, R. S., Bartley, N., Urbanowicz, R. J., Moore, 298{305. J. H. Evaluation of a tree-based pipeline optimization [11] Smirnov, N. N. Space Debris: Hazard Evaluation and tool for automating data science. In: Proceedings of Debris. CRC Press, 2001. the Genetic and Evolutionary Computation Conference, [12] Flohrer, T., Braun, V., Krag, H., Merz, K., Lemmens, S., 2016: 485{492. Virgili, B. B., Funke, Q. Operational collision avoidance [25] Wang, C., B ack, T., Hoos, H. H., Baratchi, M., Limmer, at ESOC. In: Proceedings of the Deutscher Luft-und S., Olhofer, M. Automated machine learning for short- Raumfahrtkongress, 2015. term electric load forecasting. In: Proceedings of the 2019 [13] Logue, T. J., Pelton, J. Overview of commercial small IEEE Symposium Series on Computational Intelligence satellite systems in the \New Space" age. In: Handbook (SSCI), 2019: 314{321. of Small Satellites. Pelton J. Ed. Springer, Cham, 2019: [26] Mueller, J., Thyagarajan, A. Siamese recurrent archi- 1{18. tectures for learning sentence similarity. In: Proceedings [14] Flohrer, T., Krag, H., Merz, K., Lemmens, S. CREAM- of the 30th AAAI conference on arti cial intelligence, ESA's proposal for collision risk estimation and auto- 2016. mated mitigation. In: Proceedings of the Advanced Maui [27] Kingma, D. P., Ba, L. J. Adam: A method for stochastic Optical and Space Surveillance Technologies Conference optimization. arXiv preprint, 2014: arXiv:1412.6980. (AMOS), 2019. https://arxiv.org/abs/1412.6980. 20 T. Uriot, D. Izzo, L. F. Sim~ oes, et al. [28] Hastie, T., Tibshirani, R., Friedman, J. Boosting and later joined the European Space Agency (ESA) and became additive trees. In: The Elements of Statistical Learning, the scienti c coordinator of its Advanced Concepts Team. 2nd edn. New York: Springer New York, 2008: 337{387. He devised and managed the Global Trajectory Optimiza- [29] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, tion Competitions, the ESA's Summer of Code in Space, and W., Ye, Q., Liu, T.-Y. Light-GBM: A highly ecient gra- the Kelvins innovation and competition platform for space dient boosting decision tree. In: Proceedings of the 31st problems. He published more than 180 papers in interna- Annual Conference on Neural Information Processing tional journals and conferences making key contributions to Systems, 2017: 3146{3154. the understanding of ight mechanics and spacecraft control [30] Kuhn, M. Johnson, K. Applied Predictive Modeling. New and pioneering techniques based on evolutionary and ma- York: Springer-Verlag New York, 2013. chine learning approaches. Dario Izzo received the Humies Gold Medal and led the team winning the 8th edition of Thomas Uriot graduated from the Uni- the Global Trajectory Optimization Competition. E-mail: versity of Oxford in the UK, where he Dario.izzo@esa.int. obtained his master degree in statistics Open Access This article is licensed under a Creative Com- and mathematics. Thomas worked as a mons Attribution 4.0 International License, which permits researcher at the ESA in the Advanced use, sharing, adaptation, distribution and reproduction in Concepts Team, where he conducted re- any medium or format, as long as you give appropriate credit search on evolutionary machine learning to the original author(s) and the source, provide a link to and spacecraft collision avoidance. E- the Creative Commons licence, and indicate if changes were mail: uriot.thomas@gmail.com. made. Dario Izzo graduated as a doctor of The images or other third party material in this article are aeronautical engineering from the Uni- included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material versity Sapienza of Rome (Italy). He then took a second master degree in satellite is not included in the article's Creative Commons licence and platforms at the University of Cran eld in your intended use is not permitted by statutory regulation or the UK and completed his Ph.D. degree exceeds the permitted use, you will need to obtain permission directly from the copyright holder. in mathematical modelling at the Univer- sity Sapienza of Rome where he lectured To view a copy of this licence, visit http://creativecoorg/ classical mechanics and space ight mechanics. Dario Izzo licenses/by/4.0/.

Journal

AstrodynamicsSpringer Journals

Published: Apr 7, 2021

There are no references for this article.