Solving the Rubik's cube with stepwise deep learning

This paper explores a novel technique for learning the fitness function for search algorithms such as evolutionary strategies and hillclimbing. The aim of the new technique is to learn a fitness function (called a Learned Guidance Function) from a set of sample solutions to the problem. These functions are learned using a supervised learning approach based on deep neural network learning, that is, neural networks with a number of hidden layers. This is applied to a test problem: unscrambling the Rubik's Cube using evolutionary and hillclimbing algorithms. Comparisons are made with a previous LGF approach based on random forests, with a baseline approach based on traditional error‐based fitness, and with other approaches in the literature. This demonstrates how a fitness function can be learned from existing solutions, rather than being provided by the user, increasing the autonomy of AI search processes.

Another approach focuses on how the fitness function itself can be transformed to avoid these problems. This has a long history in evolutionary computation, typified by work on fitness scaling (Grefenstette, 1986; Hopgood & Mierzejewska, 2009; Kreinovich et al., 1993; Ware et al., 2003). The principal aim of fitness scaling is to prevent premature convergence of the search algorithm, by composing a scaling function with the fitness function that does not change the ranking of points in the search space but ensures a more even distribution of the fitness values allocated to those points. A more recent version of this transformation approach is exemplified by geometric semantic genetic programming (GSGP) (Moraglio et al., 2012), which attempts to reconfigure the problem so that a much simpler search process such as hillclimbing can be used.
The aim of these approaches is to create a transformed problem that can be solved more effectively. However, these transformations sometimes involve some kind of tradeoff. For example, in basic GSGP, this tradeoff is against the size of the solution, though more recent implementations have used a caching strategy to make implementation more efficient (Vanneschi et al., 2013). This idea of reconfiguring the fitness function prior to the main evolutionary algorithms being run is one source of inspiration for the work in this paper; this has been explored elsewhere in evolutionary computation in work showing how a good choice of genotype-phenotype mapping can be used to create a smoother landscape (Asselmeyer et al., 1996).
A domain-specific approach to smoothing the search landscape is the use of pattern databases (Culberson & Schaeffer, 1998). These consist of patterns in the search space such that any point matching the pattern has the same cost of solution; typically, these represent symmetries of the underlying problem. If a solution of a particular cost is known for one problem that matches the pattern, then any other problem matching the pattern will have at most that cost to solve, because all of the moves to the solution can be similarly transformed. As with the approaches discussed earlier, the idea here is to transform the search space. The key difference is that pattern databases are based on domain knowledge. Some work has used learning methods to generalize from pattern databases, for example, by using neural networks to learn how pattern databases can be combined (Samadi & Felner, 2008).
Another important source of inspiration is the view that traditional fitness functions take a very narrow view of the problem; while a traditional fitness function is a good guide as to which elements of the population to choose for the next generation, it is a very simple representation of the complexity of a problem. Instead, it is argued, rather than a fitness function that returns a single number or a ranking, we should be using more complex fitness drivers that give us more information about the population member, allowing a more directed application of operators (Krawiec, 2016;Krawiec et al., 2016). However, such fitness drivers can require more domain-specific knowledge than a traditional fitness function. One of the aims of this paper is to give a generic method by which more information about problems can be incorporated into the evolutionary search, in this case by pre-training.
A more fundamental problem for evolutionary algorithms is that for some problems, defining the fitness function is difficult, because each problem has a different goal state. Call these non-oracular problems. As an example, consider the protein-folding problem in bioinformatics (Dobson, 2003). Biological proteins consist of a sequence of amino acids, which then fold into a three-dimensional shape, which is (with a few exceptions such as prions) entirely dependent on the sequence. To define this as a traditional evolutionary search is problematic, because we do not have access to a measure of how far a particular configuration is from the solution-indeed, if we did know what configuration we were searching for, we would have solved the problem! Therefore, evolutionary computing approaches to these types of problems have focused on learning parameters in, or functional forms of, a domain-specific model (Widera et al., 2010).
Another potential advantage to pre-training for simplifying the fitness landscape is that more extensive computational effort can be expended during an early training phase, and then, when evolution is applied to a specific problem, the evolutionary algorithm can run for fewer generations because more domain-specific information has been encoded into the fitness function. This may be of importance in some applications where running a traditional evolutionary algorithm might be infeasible because of the need for a large population and many generations to escape local minima, whereas a smaller population and fewer generations might be needed for the simpler function. This is particularly important where the evolutionary algorithm needs to run in a time-constrained situation.

| DEEP LEARNED GUIDANCE FUNCTIONS
Fitness functions in evolutionary learning are provided as part of the problem definition. Typically, these are then used directly-individuals are evaluated using the fitness function, and operators in the search are used to avoid problems in the fitness landscape such as local minima. However, an alternative approach has been applied in both evolutionary learning (Szubert et al., 2013) and reinforcement learning (Erez & Smart, 2008), where the fitness function is shaped so that it more directly represents routes through the fitness landscape from an arbitrary point to the desired target.
A form of this called Learned Guidance Functions (LGFs) was introduced by Johnson (2018). The input to this is a search space and a set of existing solution trajectories for the problem. For example, in the protein folding problem these would be sequences of points in the space of three-dimensional structures, going from a sequence to a completely folded structure. For an image denoising problem, this would be a sequence of images from a clean image to a very noisy one. These are an example of True Distance Heuristics (Sturtevant et al., 2009), but with a particular layered structure and the use of a predictive function to give the heuristic value rather than a look-up table.
These solution trajectories can be obtained in a number of ways. For some problems, we will have access to a set of already-solved examples.
For others, we can construct artificial examples by starting from a solved state and carrying out a number of moves from that solved state to generate trajectories in reverse (a similar approach has been called Autodidactic Iteration in McAleer et al., 2018).
These trajectories can then be used to create a training set for a supervised learning problem. Each trajectory will consist of a number of states in the search space of the problem, each of which is paired with the number of steps away from the target it took to reach that state.
These pairs then become the training set: so, the task for the supervised learning problem is to build a model that takes an arbitrary state of the system and assigns a number predicting how many steps it will take to get to the target state. The LGF is the model learned from this supervised learning process.
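As a concrete illustration, the construction of such a training set from trajectories generated in reverse can be sketched as follows. The "puzzle" here is a hypothetical toy permutation puzzle, not the full Rubik's Cube, and the move set and parameter names are illustrative assumptions only:

```python
import random

# Toy stand-in for the cube: a state is a tuple of 8 labels, and each
# "move" is a fixed permutation of positions (hypothetical moves, not
# real Rubik's Cube turns -- it is the construction that matters here).
MOVES = [
    (1, 2, 3, 0, 4, 5, 6, 7),   # cycle the first four positions
    (0, 1, 2, 3, 5, 6, 7, 4),   # cycle the last four positions
    (3, 0, 1, 2, 4, 5, 6, 7),   # inverse of the first move
    (0, 1, 2, 3, 7, 4, 5, 6),   # inverse of the second move
]

def apply_move(state, perm):
    return tuple(state[i] for i in perm)

def build_training_set(n_s, n_m, solved, rng=random):
    """Sketch of Algorithm 1: n_s trajectories of n_m states each,
    labelling every visited state with the number of scrambling moves
    made so far (an upper bound on its true distance from solved)."""
    training = []
    for _ in range(n_s):
        state = solved
        training.append((state, 0))          # the solved state is 0 moves away
        for depth in range(1, n_m):
            state = apply_move(state, rng.choice(MOVES))
            training.append((state, depth))  # label = scramble depth so far
    return training

solved = tuple(range(8))
data = build_training_set(n_s=1000, n_m=5, solved=solved)
```

Each (state, depth) pair then becomes one training example for the supervised learner.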
This LGF can then be used as a ranking function in an evolutionary algorithm. Take each member of the population, and apply the LGF to it. Then select the individuals that will form the parents of the next population from the lowest-scoring ones on the LGF; these are the ones predicted to be closest to the solution.
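A minimal sketch of this selection step, with a stand-in `lgf` callable in place of a trained model (the integer states and distance function below are illustrative assumptions):

```python
def select_parents(population, lgf, k):
    """Rank by predicted distance-to-goal and keep the k lowest scorers,
    i.e. those the LGF predicts are closest to the solution.  `lgf` is
    any callable mapping a state to a non-negative number."""
    ranked = sorted(population, key=lgf)
    return ranked[:k]

# Illustrative stand-in LGF: for integer "states" with goal 0, predict
# the absolute distance (a real LGF would be a trained model).
population = [7, 2, 9, 4, 1]
parents = select_parents(population, lgf=abs, k=2)
```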
A similar approach has been taken in the recent papers by McAleer et al. (2018) and Agostinelli et al. (2019), though these are grounded in a reinforcement learning approach rather than a supervised learning approach. Compared with the work in this paper, their algorithm learns a mapping from points in the state space of the cube to a pair consisting of a value and a policy. The approach that we take in this paper can be seen as the equivalent of learning just the value function. Indeed, the greedy experiment in McAleer et al. (2018, fig. 5a) demonstrates a similar percentage-solved behaviour to the experiments in this paper. However, their full approach, which combines value function approximation and policy learning, and then uses a Monte Carlo Tree Search (Browne et al., 2012) approach, continues to work for problems of much higher complexity than the experiments in this paper, but at the cost of much longer search times (around 10 min, as shown by Figure 5b and the description on Page 6 of McAleer et al., 2018) compared to a few seconds for our approach. In summary, their approach is scalable to larger problems (more scrambles), but at the cost of increased computation time, whereas the approach in this paper offers a quicker solution to simpler problem instances but fails for larger problems.

| Formalization
Now we formalize this idea. Consider a search space S consisting of a set of points, which is the node set of a directed graph M_S whose edges represent the possible moves (mutations) from each point in the search space. Identify one or more of these points as goal states; these might be the only goal states, or they might represent a sample of the class of states that the eventual problem is trying to solve. From a set of solution trajectories through M_S, construct a set X of pairs (s, d), where s ∈ S is a state on a trajectory and d is the number of moves along that trajectory from s to the goal state. X can now be used as a training set for a supervised learning algorithm. The trained model from that supervised learning algorithm, L : S → ℤ≥0, is a function that takes a point in the search space and predicts how many moves are needed to get to the goal state. This function will be used as an alternative kind of fitness function in the experiments below.
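Assuming each trajectory is indexed backwards from its goal state (an indexing convention adopted here for concreteness, not taken from the original formalization), the training set and learned function can be written as:

```latex
X = \bigl\{ (s_i, i) \;:\; (s_d, \ldots, s_1, s_0) \text{ a solution trajectory in } M_S,\;
            s_0 \text{ a goal state},\; 0 \le i \le d \bigr\},
\qquad
L : S \to \mathbb{Z}_{\ge 0} \text{ fitted to } X \text{ so that } L(s_i) \approx i .
```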

| EXAMPLE: APPLYING DEEP LGFS TO THE RUBIK'S CUBE
In Johnson (2018) we applied the LGF to the problem of unscrambling the Rubik's Cube. We used a number of classifiers from the scikit-learn library (scikit-learn, n.d.) to implement LGFs, and demonstrated (1) that the LGF can learn to recognize the number of turns that have been made to a cube to a decent level of accuracy; and (2) that this LGF can then be used to unscramble particular states of the cube in a sensible number of moves. Unscrambling is not one of the non-oracular problems, because the goal state is known, but it has a complex fitness landscape with many local minima, and so is a good test for these kinds of algorithms.
The search space C consists of all possible configurations of coloured facelets on the six faces of the cube, each of which has a 3 × 3 set of facelets. The move set M consists of twelve 90° turns (in the notation of Singmaster, 1981), which are functions from C to C. We use the notation m(c) to denote the application of move m ∈ M to the cube c, returning the new state of the cube.
An earlier paper by the author (Johnson, 2018) applied a number of learning algorithms to the problem of learning LGFs for the Rubik's Cube, with random forests proving to be the best approach. That paper did not use any deep learning (Goodfellow et al., 2017) approaches; in this paper, we extend that work by using deep learning.

| Constructing the LGF
The LGF for this problem is constructed as follows (pseudocode in Algorithm 1). For n_s iterations, start with a solved cube and make n_m − 1 moves.
Each time a move is made (and in the initial state), the pair consisting of the current state and the number of moves made to get to that state is added to the training set. This is illustrated in Figure 1.
The LGF is then constructed from this training set by applying a supervised learning algorithm, specifically a deep neural network implemented in the Keras framework (Keras, n.d.) on TensorFlow (TensorFlow, n.d.). The specific network used is illustrated in Figure 2. This is a fairly standard deep learning network, with dropout (Srivastava et al., 2014) used to encourage generalization and prevent overfitting. The categorical crossentropy function was used for the loss function, and the Adam optimizer was applied. Future work will apply meta-learning of parameters and network shape to optimize the model produced (Hutter et al., 2019).
Once an LGF is learned, it can be applied to the task at hand, which is to take a scrambled state of the cube and move through the search space with the aim of finding the solved state. This is done using a variant on evolution strategies (ES). The initial state of the cube is duplicated to fill the population. Then, in each generation a number of mutants are generated by making a random move for each member of the population.
Any mutants that are predicted by the LGF to be closer to the solution than their parent are placed in an intermediate population pool, and a new generation is created by uniform random sampling with replacement from this pool to bring the population up to full size. This is repeated until one of three conditions occurs: (1) the solution is found; (2) none of the mutants produce any improvement, in which case the algorithm is restarted; or (3) the maximum number of generations is reached.
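The generation loop just described can be sketched as follows, using integer states with goal 0 and `abs()` as a stand-in for a trained LGF (both illustrative assumptions, not the cube itself):

```python
import random

def lgf_es(start, moves, lgf, n_p, max_gens, rng=random):
    """Sketch of the LGF-guided evolution strategy: the population starts
    as n_p copies of the scrambled state; each generation keeps only
    mutants the LGF predicts are closer to the goal, then resamples with
    replacement back up to size n_p.  Restarts when no mutant improves."""
    population = [start] * n_p
    for _ in range(max_gens):
        mutants = [rng.choice(moves)(s) for s in population]
        if any(lgf(m) == 0 for m in mutants):
            return True                       # goal reached
        pool = [m for m, s in zip(mutants, population) if lgf(m) < lgf(s)]
        if not pool:
            population = [start] * n_p        # no improvement: restart
        else:
            population = [rng.choice(pool) for _ in range(n_p)]
    return False                              # generation budget exhausted

# Toy problem: states are integers, the goal is 0, moves are +/-1.
moves = [lambda s: s + 1, lambda s: s - 1]
solved = lgf_es(start=5, moves=moves, lgf=abs, n_p=20, max_gens=200)
```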

Algorithm 1: Training set construction for the Rubik's cube (procedure ConstructTrainingSetRubik(n_s, n_m)).

This is summarized in pseudocode in Algorithm 2, where the inputs are: n_ℓ, the problem size (number of scrambling twists given); n_p, the population size; θ, the maximum number of generations; and L, the LGF used. In the results tables, this is referred to as Deep LGF + ES (ES refers to Evolution Strategies).
When Algorithm 2 discovers a solution, it is then easy to reconstruct a path through the various population iterations to create a solution trajectory. Each population consists of solutions that are mutations (moves) from the previous population, so by starting from the solution, identifying which cube it was a mutation of, and so on back to the original scrambled state, we can construct the trajectory (Figure 3).
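A minimal sketch of this backtracking step, assuming a hypothetical `parent_of` map that records which state each cube was mutated from (the real algorithm tracks this per population generation):

```python
def reconstruct_trajectory(solution, parent_of):
    """Walk parent links back from the solved state to the original
    scrambled state, then reverse to get the scrambled -> solved path.
    `parent_of` maps each generated state to the state it was mutated
    from (None for the starting state)."""
    path = []
    state = solution
    while state is not None:
        path.append(state)
        state = parent_of[state]
    path.reverse()
    return path

# Toy history: 3 was the scrambled state, mutated to 2, then 1, then 0.
parent_of = {3: None, 2: 3, 1: 2, 0: 1}
trajectory = reconstruct_trajectory(0, parent_of)
```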
A second approach is to use the same fitness function, but to remove the population-based aspect. So, instead of having a population of cube states, we start with a single cube state, and at each iteration generate all 12 possible moves from that state, using the deep network to predict how many scrambles each resulting state is from solved. If one of these predictions is closer to the solved state than the current state, then that move is made. Otherwise, a random (neutral) move is made. This is summarized in pseudocode in Algorithm 3, with the input parameters being a subset of those in Algorithm 2.
In the results tables, this is referred to as Deep LGF + Hillclimbing.
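The hillclimbing variant can be sketched in the same toy setting as before (integer states, `abs()` standing in for the trained LGF; both are illustrative assumptions):

```python
import random

def lgf_hillclimb(start, moves, lgf, max_iters, rng=random):
    """Sketch of the LGF-guided hillclimber: at each step score every
    neighbour; move to the best one if it improves on the current state,
    otherwise make a random (neutral) move."""
    state = start
    for _ in range(max_iters):
        if lgf(state) == 0:
            return state                      # solved
        neighbours = [m(state) for m in moves]
        best = min(neighbours, key=lgf)
        state = best if lgf(best) < lgf(state) else rng.choice(neighbours)
    return state

moves = [lambda s: s + 1, lambda s: s - 1]
final = lgf_hillclimb(start=5, moves=moves, lgf=abs, max_iters=100)
```

With a perfect stand-in LGF such as `abs`, the climber descends monotonically; with a learned (imperfect) LGF, the neutral moves provide the escape mechanism described above.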

| Sources of error
Note that if a perfect LGF existed for a problem, we could solve the problem in a minimal number of steps. Starting from an arbitrary scrambled state, we can examine all possible moves from that state. At least one of these will be closer in terms of number of moves to the target state, and so we can move the state of the system to the state which is closest, and repeat until we reach the target state.
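This argument can be made concrete on a toy state space small enough to compute exact distances by breadth-first search; the ring of integer states below is purely illustrative:

```python
from collections import deque

def exact_distances(goal, neighbours):
    """Breadth-first search outwards from the goal gives the true number
    of moves needed to solve every reachable state -- a perfect LGF."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        s = queue.popleft()
        for t in neighbours(s):
            if t not in dist:
                dist[t] = dist[s] + 1
                queue.append(t)
    return dist

def greedy_solve(start, goal, neighbours, lgf):
    """With a perfect LGF, always moving to the lowest-scoring neighbour
    reaches the goal in a minimal number of moves."""
    path = [start]
    while path[-1] != goal:
        path.append(min(neighbours(path[-1]), key=lambda s: lgf[s]))
    return path

# Toy state space: integers 0..20 on a ring with +/-1 moves, goal 0.
def neighbours(s):
    return [(s + 1) % 21, (s - 1) % 21]

lgf = exact_distances(0, neighbours)
path = greedy_solve(start=8, goal=0, neighbours=neighbours, lgf=lgf)
```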
In practice, there are two forms of error. The first is in the formation of the training set for the problem. A particular sequence of scrambling moves of length n might, nonetheless, end up with the cube in a state which could have been reached using fewer moves. A simple example of this is where one move is followed by a move which is the inverse of that move (this is explored in more detail by Johnson (2018)).
The second is where the model makes the wrong prediction. For these reasons, the fitness landscape created by a real LGF will still have local minima.
FIGURE 2: Keras deep learning network used for training.

| EXPERIMENTS AND RESULTS
The experiments were carried out as follows. The LGF L is the result of running Algorithm 1 with trajectory length n_m and 100,000 trajectories, then using the result as the training set for the Keras network in Figure 2, trained for 50 epochs.
The total time to run all of these experiments was under 3 h, not including time to train the models (training time was between 11 and 59 s per epoch, depending on the size of the model).
Results for the unscrambling experiments are presented in two tables. Table 1 shows, for each (n_ℓ, n_m) pair, the percentage of times that the problem was solved. Table 2 shows the number of generations taken by successful runs to unscramble the cube.
There are a number of observations. Firstly, for the smaller problem sizes, a solution to the problem is frequently found; for problems below size 9, at least half of the attempts are successful, and it is very reliable for small problem sizes. Secondly, the size of the model makes little difference-using a larger model than the problem size is of little value. Thirdly, the number of generations needed is small for the lower problem sizes, but increases for large problem sizes; this may be an effect of more re-initialisations needing to be carried out.
Fourthly, note that some of the average lengths in Table 2 are shorter than the problem size. This is because the problems were constructed by scrambling randomly n_ℓ times, but no check was made to ensure that the resulting state could not be solved in fewer than n_ℓ moves; indeed, such a check is rather complex. Therefore, the starting state for some runs may be a problem that can be solved in fewer than n_ℓ moves.
Tables 3 and 4 present the results for the Deep LGF combined with hillclimbing. The total time to run all of these experiments was considerably longer than for the previous experiments, at around 52 h, not including time to train the models (the models were reused from the previous experiment).
The results in terms of percentages solved are overall worse for the hillclimbing approach than for the evolutionary approach. The solution lengths are larger for the hillclimbing process on simpler problems, but are in many cases better for more complex problems.
Tables 5 and 6 compare the results of the experiments in this paper (a preliminary version of which was presented by Johnson (2019)) to two experiments in a previous paper (Johnson, 2018). The main experiments in the current paper (Deep LGF + ES) differed from the experiments in that earlier paper (Random Forest LGF + Hillclimbing) in two main ways. Firstly, the earlier models were trained using a random forest classifier (the implementation in the scikit-learn package (scikit-learn, n.d.)). The tables give the results for models trained on examples with up to 13 moves. Secondly, the unscrambling in the earlier paper was based only on the hill-climbing approach, whereas in this paper the evolution strategy approach is also used.

FIGURE 3: Reconstructing a single trajectory from the evolutionary search once the solution has been found.
These tables also contain a comparison with a baseline experiment, also described in detail in the earlier paper (Johnson, 2018) (Error + Hillclimbing). This also uses a simple hill-climbing method, but moves are chosen to maximize the number of correct facelets. This is a traditional error-based fitness function.
It is notable that the percentage of successes in the Deep LGF + ES approach is considerably higher than in the Random Forest LGF + Hillclimbing approach. However, the length of the solutions found by the new approach is much larger for larger problem sizes. This may well reflect the use of reinitialisation in the new approach; in the earlier paper, a search that did not terminate was considered a failure. The evolutionary approach using the Deep LGF is largely a better performer than the hillclimbing approach. All of these new methods clearly outperform the traditional error-based fitness measure, demonstrating the value of this pre-training step.

| WHAT IS BEING LEARNED BY THE DEEP NETWORK?
What is being learned by the network? Is it learning the structure of solutions, or relationships between fixed values in the network? This next set of experiments will investigate this.
Recent work by Geirhos et al. (2020) demonstrates that deep learning learns so-called "shortcuts" to classifying a particular class of items in the training set. For example, a deep learning network will misclassify an image of a cow when it is presented on a beach rather than in a field, whereas a human would not make the same mistake (Beery et al., 2018), and an object recognition system will prioritize the position of an object in the visual field over the content of the objects in that field (Geirhos et al., 2020). Deep learning demonstrates cognitive biases in learning (Geirhos et al., 2019) that might not always align with human cognitive biases.
TABLE 4: Average number of generations needed to solve a problem of each size using a trained model of each size.

We would expect an intelligent solution to a problem such as learning an LGF for the Rubik's Cube to learn the structure of the problem, rather than learning relationships between colours at specific positions. It seems reasonable that a system that has learned tout court that a cube with four sides each with a single row of miscoloured faces is likely to represent a single scramble has found a more intelligent and generalisable solution to the problem than one that has learned that relationship based on lines of specific colours, and is unable to generalize when those colours are permuted.
In this next set of experiments, we will explore these ideas in the kinds of models used in this paper. This will be done by creating a restricted training set, T_a, based solely on anticlockwise scrambles. This will then be tested on two test sets: one generated using anticlockwise scrambles, and one generated using clockwise scrambles.
The hypothesis being tested is that the deep learning system will not be able to develop a structural understanding from the restricted training set. If it did, it should be able to transfer that understanding from the anticlockwise training to the clockwise test.
In detail, a solved cube c was generated, and the pair (c, 0) added to the (initially empty) training set T_a. This was then scrambled 10 times (n = 1, …, 10) using only anticlockwise scrambles, and after each of these scrambles the pair (c, n) was added to T_a. This process was repeated 100,000 times to produce the training set. A model M was then generated using T_a and the network from earlier (Figure 2), again trained over 50 epochs.
The testing was as follows. For each of n = 2, …, 9, a testing set of 100 cubes was created using random anticlockwise turns only, and a second testing set was created in the same way except using clockwise turns. Predictions were then made using M, and a count made of how often the model made a correct prediction. The results are presented in Table 7.
Notably, the performance is overall considerably worse for the clockwise test set, though this evens out for more complex problems (7, 8, 9 scrambles), where the overall performance is lower. This demonstrates, at least in the regime seen at smaller problem sizes, that the learning does not generalize from the anticlockwise to the clockwise.
This points towards alternative approaches based on representations that have a more explicit representation of these structural relationships. For example, it would be interesting to explore whether a representation that contains explicit representations of relationships between facelets on the cube, such as program code manipulated with an approach such as genetic programming (GP) (Poli et al., 2008), would make this kind of generalization more possible.
More generally, an interpretable representation (Murdoch et al., 2019) such as those generated by GP (at least for small program sizes) would not only be useful for human interpretability, but would also expose the key meaningful ideas used in classification. This would then provide a representation that could be manipulated by another AI system, facilitating transfer learning (Pan & Yang, 2010) to other related problems.

| OTHER RELATED WORK
There are similarities between this approach and the idea of a learned value function in reinforcement learning (Sutton & Barto, 2018). In the metaheuristics literature, the idea of learning from a set of already-solved problems is explored in the idea of target analysis (Glover & Greenberg, 1989). This takes a set of solved problems from a problem class, and uses those known solutions to set the parameters of a metaheuristic. This differs from our approach in that it still relies on the metaheuristic operators to avoid local optima in the landscape, whereas the approach in this paper uses those already-solved problems to construct a new landscape based on a metric which is designed to have fewer such local optima. The idea of learning a metric from a large set of examples is explored in the literature on metric learning (Kulis, 2013), and it would be interesting to explore further connections between metric learning and the idea of constructing new fitness functions.
It should be noted that there are algorithms specifically for solving the Rubik's Cube, as summarized in the book by Slocum et al. (2011). However, comparisons with these are less relevant to this paper, which uses the Rubik's Cube as an example to see whether a learning algorithm can learn naïvely from it. The importance of methods that can learn without explicit human knowledge has been emphasized as an important route towards artificial general intelligence (Silver et al., 2017). More widely, this approach points towards a human-like approach where the machine learns concepts such as similarity from existing solutions. There is an interesting continuum of learning methods, from much contemporary machine learning that attempts to generalize from examples (whether labelled examples as in supervised learning or reward as in reinforcement learning), through the approaches in this paper that learn from solution trajectories, to expert systems where the input data explicitly attempt to articulate what aspects of successful solutions should be used by the machine.

| SUMMARY AND FUTURE WORK
We have developed the idea of deep learning for learned guidance functions, and shown how these can then be used as fitness drivers in evolutionary computation. This has been applied to a case study of solving the Rubik's Cube, and shown to have advantages in terms of frequency of finding a solution and the size of the models needed when compared to a random forest-based LGF; however, the number of generations needed is, for more complex problems, larger. Compared to the work by Agostinelli et al. (2019), the results for smaller problems are comparable but quicker to compute, but the combination of policy and value learning in that paper allows reliable solution of more complex problems than the approach in this paper, which relies solely on value function approximation.
There are a number of areas for future work. Firstly, there is much scope for optimizing the deep learning system using automated machine learning approaches, both to optimize the parameters and the structure of the system (Hutter et al., 2019). Secondly, there are a number of further experiments that would investigate the behaviour further: investigating the frequency and impact of the reinitialisation in this method, using measures of landscape smoothness to understand the effect of the LGF on the landscape, and experimenting with different population sizes.
Finally, there are a large number of other problems to which this approach could be applied, for example, protein folding, and de-noising of audio and video files, audio transcription (Souto-Rico et al., 2020), etc.

CONFLICT OF INTEREST
The author declares that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request. The code that supports the findings of this study is available in the supplementary material of this article.