Training sets for algorithm tuning: exploring the impact of structural distribution
Abstract
Algorithm tuning is a growing area of research and practice, particularly in the
context of metaheuristic algorithms, where the parameters of an algorithm are
configured on the basis of a "training set" of problem instances. There has been
significant interest (but, to date, limited research) in the relationship between the
training set and generalisation performance, often under the assumption that if an
algorithm is targeted at problems of a certain type, then the training set should be
a diverse collection of problem instances of that type. The current research
investigates this assumption, focusing on how certain structural aspects of
training set instances affect generalisation performance.
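The tuning setup described above can be sketched in a few lines. This is an illustrative sketch only, not the thesis's actual code: the configurations, instances, and `run_algorithm` callable are all assumed placeholders for whatever parameter space, benchmark set, and metaheuristic are being tuned.

```python
def tune(configurations, training_set, run_algorithm):
    """Return the configuration with the best mean cost on the training set.

    `run_algorithm(config, instance)` is an assumed callable returning a
    cost to be minimised (e.g. tour length for a TSP instance).
    """
    def mean_cost(config):
        return sum(run_algorithm(config, inst)
                   for inst in training_set) / len(training_set)
    return min(configurations, key=mean_cost)

# Illustrative usage with a toy "algorithm" whose cost depends on the config:
configs = [0.1, 0.5, 0.9]
instances = [1.0, 2.0, 3.0]
best = tune(configs, instances, lambda c, x: abs(c - 0.5) * x)
# best is the config minimising mean cost over the training set
```

The generalisation question the thesis studies is then: how well does `best` perform on *test* instances, and does that depend on the distribution the training set was drawn from?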
In our first study, using travelling salesman problem (TSP) instances drawn from both
uniform and Gaussian distributions, we tested the widely held assumption that
training sets should be drawn from the same distribution as the target problem
instances. We found evidence against this assumption: algorithms evaluated on
problem instances from the same distribution as their training sets performed no
better than those tuned on training sets from a different distribution. Algorithms
did, however, appear to perform consistently better when TSP distributions were
Gaussian rather than uniform. In our second study, we replicated the findings of
the first, showing again that algorithms evaluated on instances from the same
distribution as their training set performed no better. Extending these findings,
we explored whether problem difficulty played a role in generalisation, finding that
greater training-set difficulty, as measured by fitness-distance correlation, appears
to be linked to better performance, though only in two-region problems. Our third
and final study explored these effects in the context of vehicle routing problems
(VRPs). Here we again found evidence against the "widely held assumption", and
indeed found that algorithms trained on instances from a different distribution
sometimes performed better than those trained on the same distribution. We further
showed that distribution difficulty, again measured by fitness-distance correlation,
was tied to higher performance, confirming the result of the second
study. Overall, the primary finding across both TSP and VRP contexts runs counter
to the common prior assumption that algorithms must be tuned on instances drawn
from a distribution similar to that of the target problem instances. One significant
limitation is that the problem instances were synthetically generated and simplified
to a degree that differentiates them from real-world data. However, the findings
do have implications for the development of real-world optimisers for specific
problem classes, such as VRPs with particular depot and customer distribution setups.
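The difficulty measure used in the second and third studies, fitness-distance correlation (FDC), is the Pearson correlation between solution fitness and distance to the nearest known optimum. A minimal sketch follows; it assumes fitnesses and distances have already been computed for a sample of solutions, and does not reproduce the thesis's exact sampling or distance definitions.

```python
from math import sqrt

def fdc(fitnesses, distances):
    """Fitness-distance correlation: the Pearson correlation between each
    sampled solution's fitness and its distance to the nearest known
    optimum. For minimisation, values near +1 suggest a "big valley"
    structure (an easier landscape); values near -1 or 0 suggest a
    harder, deceptive or unstructured landscape.
    """
    n = len(fitnesses)
    mf = sum(fitnesses) / n
    md = sum(distances) / n
    cov = sum((f - mf) * (d - md)
              for f, d in zip(fitnesses, distances)) / n
    sf = sqrt(sum((f - mf) ** 2 for f in fitnesses) / n)
    sd = sqrt(sum((d - md) ** 2 for d in distances) / n)
    return cov / (sf * sd)
```

In practice the distances would be tour- or route-space distances (e.g. shared edges for TSP tours), and the fitnesses tour lengths or route costs, sampled from runs on each problem instance.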