rotate is this puzzle thing where you rotate pieces; it's adapted from some ancient Nokia phone game and is also rather reminiscent of Gripple.

For the sake of reducing confusion, let's consider rotate4 to be the 4×4 puzzle where the pieces don't have visible orientation, and rotate4o to be the puzzle where the pieces do. rotate4's group has nine generators, corresponding to the nine possible moves. rotate4o likewise has nine generators, but its group is isomorphic to a semidirect product of the orientation subgroup by the permutation group. (The orientation subgroup is not the full group of per-piece orientations, for parity reasons.) The nine moves may be labelled as before. Not all nine moves are necessary, though: rotate4's group can be generated with just five of them, and rotate4o's group can be generated with six.
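The post doesn't spell out what a move is, but nine moves on a 4×4 grid suggests rotations of a 2×2 block anchored at any of the 3×3 = 9 possible positions. A sketch of the moves as permutations under that assumption (the function name and clockwise convention are mine):

```javascript
// Cells are indexed 0..15 in row-major order; moves[m][i] is the cell that
// the piece in cell i ends up in after move m. Each move is assumed to be a
// clockwise quarter turn of a 2x2 block, giving a 4-cycle on the cells.
function makeMoves() {
  const moves = [];
  for (let r = 0; r < 3; r++) {
    for (let c = 0; c < 3; c++) {
      const perm = Array.from({length: 16}, (_, i) => i);
      // block corners: top-left, top-right, bottom-left, bottom-right
      const tl = 4 * r + c, tr = tl + 1, bl = tl + 4, br = tl + 5;
      // clockwise: tl -> tr -> br -> bl -> tl
      perm[tl] = tr; perm[tr] = br; perm[br] = bl; perm[bl] = tl;
      moves.push(perm);
    }
  }
  return moves;
}
```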

I wrote an optimal solver for rotate4 in Python last year, along with two different non-optimal solvers. I briefly mentioned these on the blog, too. To quickly review, the first non-optimal solver used three nested subgroups (first separating the top eight values from the bottom eight, then solving the top half, then solving the bottom half). While it had three phases, the coset spaces for the second and third phases are isomorphic, so I just termed it a two-phase solver. The other one also had three phases, using a chain of subgroups that solves the corners, then the left/bottom edges, then the rest. This generally produced solutions with far fewer moves than the "two-phase" solver, but the code was a fair bit messier. (These non-optimal solvers solved each phase optimally using lookup tables, which is feasible because each individual phase is quite small.)
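Solving a phase optimally from a lookup table boils down to a breadth-first search from the solved coset. A minimal sketch, with hypothetical names (`transitions[coset][move]` giving the coset reached from `coset` by `move`):

```javascript
// Build a table of optimal distances over a coset space by BFS from the
// solved coset. dist[c] is then the optimal number of moves to finish the
// phase from coset c; a phase solver just follows moves that decrease it.
function buildDistanceTable(transitions, solvedCoset) {
  const dist = new Array(transitions.length).fill(-1); // -1 = not yet seen
  dist[solvedCoset] = 0;
  let frontier = [solvedCoset];
  while (frontier.length > 0) {
    const next = [];
    for (const c of frontier) {
      for (const c2 of transitions[c]) {
        if (dist[c2] === -1) {
          dist[c2] = dist[c] + 1;
          next.push(c2);
        }
      }
    }
    frontier = next;
  }
  return dist;
}
```

This works because every move is invertible, so distance-to-solved equals distance-from-solved in the coset graph.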

The optimal solver tried move sequences of increasing length to simultaneously bring a position into the subgroup separating top/bottom pieces and the subgroup separating left/right pieces, stopping when it encountered a move sequence that produced a solved state. Note that the intersection of these two subgroups is *not* the trivial group, so it was necessary to keep the state around separately. I never figured out exactly why the optimal solver was as slow as it was, though when I ported/rewrote it in JavaScript, I did plenty of algorithmic optimisations that made it run a lot faster. I'll cover this later.

Making these solvers work with orientation was somewhat nontrivial. The two-phase solver does not adapt trivially: taking orientation into account, separating the puzzle into two independent halves is *not* the same as having the top pieces in the top half and the bottom pieces in the bottom half. The difficulty is in ensuring that the top half has zero total orientation, so that it can be solved using only moves that affect the top half, and likewise for the bottom half. It's not obvious how to index the corresponding coset space. The three-phase solver does adapt trivially, but writing the code for it was a pain in the arse so I gave up; the involved coset spaces were also considerably larger and it took about ten minutes to generate transition tables, which is not nice.

What I noticed was that the subgroup generated by five of the moves (the same five moves as in Gripple) is isomorphic to its permutation part. In other words, it's impossible to rotate a piece in place using only those five moves, even if you don't care about the orientation/permutation of the other fifteen pieces. This means that we can choose a reference orientation such that five of the nine moves (specifically, those five generators) don't affect orientation, while the remaining four moves do. The squares of these four moves can be represented with the other five moves, so we can take the five orientation-preserving moves together with the four orientation-affecting ones as our generators instead.

From this, the original two-phase solver for rotate4 can be adapted into a three-phase rotate4o solver: solve orientation (9f*), separate top/bottom pieces (9f*), then solve each half individually (16f* for each half). This produced solutions with a mean length of about 32.96 moves (in the "face"-turn metric, where half turns are counted as single moves) and a maximum length of 50 moves. The code for this isn't particularly complicated either, since the first phase is concerned only with orientation, while the other two phases are concerned only with permutation. The price to pay for this simplicity is that the move counts are quite high.

It is known that Gripple can be solved in 38 moves. We can divide the puzzle into four quadrants, each with a corner piece, two edge pieces and one centre piece. Solving the three non-centre pieces of a quadrant can be done in at most 11 moves (I checked this by hand; I don’t know the optimal value but it’s probably about 9), and doing this for two adjacent quadrants takes at most 22 moves. I did a brute-force search through the 10! permutations for the ten remaining pieces and found that those can be solved within 16 moves. Adding these gives an upper bound of 38 moves, which is somewhat better than the 41 moves obtained from the second/third phases in the previous paragraph.

The next thing I did was to merge the second and third phases. The resulting coset space is too big to fit in memory, so I wrote an optimal solver for it using IDA*. I didn't think it'd be feasible at all, seeing how my earlier attempt at an optimal solver for a very similar list of generators was too slow to be useful, but I went ahead anyway. The first attempt was a direct port of the Python code, which worked about as poorly as expected. Then came optimisations.
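A bare-bones sketch of the IDA* loop, with hypothetical names: `trans[h][c][m]` is the coset reached from coset `c` by move `m` in heuristic `h`'s coset space, and `dist[h]` is that heuristic's distance table. The termination test assumes the property described below, that all heuristics reporting zero distance certifies a solved state.

```javascript
function idaStar(startCosets, trans, dist, numMoves) {
  // admissible heuristic: each table is a lower bound, so their max is too
  function h(cosets) {
    let best = 0;
    for (let i = 0; i < cosets.length; i++)
      best = Math.max(best, dist[i][cosets[i]]);
    return best;
  }
  function search(cosets, g, bound, path) {
    const estimate = h(cosets);
    if (g + estimate > bound) return null; // prune this subtree
    if (estimate === 0) return path;       // all heuristics at zero => solved
    for (let m = 0; m < numMoves; m++) {
      // the state is carried only as one coset index per heuristic
      const next = cosets.map((c, i) => trans[i][c][m]);
      const result = search(next, g + 1, bound, path.concat(m));
      if (result !== null) return result;
    }
    return null;
  }
  // iterative deepening: retry with a larger bound until a solution appears
  for (let bound = 0; ; bound++) {
    const result = search(startCosets, 0, bound, []);
    if (result !== null) return result;
  }
}
```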

Just to reiterate: previously, the IDA* search used two heuristics. These corresponded to a coset space with 12870 cosets. One way to think of it is to tag eight of the pieces and their solved positions, and see how many moves it takes to bring all the tagged pieces to tagged positions (not necessarily their solved positions). The two heuristics corresponded to tagging the top half of the puzzle and tagging the left half of the puzzle. The first optimisation was to use two more heuristics of the same type: tagging every other row and tagging every other column. Adding these two heuristics also gave the property that, if all four heuristics report zero distance from the solved state, the state actually is solved, so instead of keeping a copy of the state for every node searched, we only had to store four integers (each denoting the index of a coset). The original Python code used separate transition tables for the different heuristics, which was unnecessary; using the same transition table for all four led to a somewhat faster initialisation.
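For concreteness, here's one standard way to index this kind of coset space: rank the 8-element set of tagged positions among all C(16,8) = 12870 subsets using the combinatorial number system. (This is just an illustration; the actual solver's indexing may differ.)

```javascript
// binomial(n, k) = C(n, k), computed iteratively; exact for these sizes
function binomial(n, k) {
  if (k < 0 || k > n) return 0;
  let r = 1;
  for (let i = 0; i < k; i++) r = r * (n - i) / (i + 1);
  return Math.round(r);
}

// positions: the 8 tagged cells in ascending order. Returns a rank in
// [0, 12870) via the combinatorial number system (colex order).
function subsetRank(positions) {
  let rank = 0;
  for (let i = 0; i < positions.length; i++)
    rank += binomial(positions[i], i + 1);
  return rank;
}
```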

The next optimisation was to "canonicalise" move sequences. Many pairs of moves commute, so it doesn't make sense to search both orderings of a commuting pair when they produce the same position. The moves are indexed from 0 to 8, and a move is forbidden if it has a smaller index than the previous move and it also commutes with the previous move. This does not, strictly speaking, canonicalise the move sequences, as there exist move sequences identical under permutation of commuting moves that have multiple such representations, but it's good enough as an optimisation. Doing a proper canonicalisation would involve looking more than one move into the past, which also means it'd be slower and not worth it. This optimisation reduces the number of visited nodes by about 85%. (Checking whether two moves commute was done with a lookup table, not by explicitly checking that they really do commute.)
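A sketch of how such a lookup table could be built and used. Here the table is filled in once by explicit composition, which is cheap since it happens only at startup; repeating the same move is always allowed (the sample solutions below contain runs like "6, 6, 6", which suggests quarter turns come in one direction only). `moves` is assumed to be an array of nine cell permutations.

```javascript
// commutes[a][b] is true iff moves a and b commute as permutations
function buildCommutes(moves) {
  const n = moves.length;
  const commutes = [];
  for (let a = 0; a < n; a++) {
    commutes.push([]);
    for (let b = 0; b < n; b++) {
      let eq = true;
      for (let i = 0; i < 16; i++)
        if (moves[b][moves[a][i]] !== moves[a][moves[b][i]]) eq = false;
      commutes[a][b] = eq;
    }
  }
  return commutes;
}

// During search: forbid `move` after `prev` when it has a strictly smaller
// index and the two commute; prev === -1 means no previous move.
function allowed(commutes, prev, move) {
  return prev === -1 || move >= prev || !commutes[prev][move];
}
```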

The last one was to just use better heuristics. There is inherently some trade-off involved here: choosing a smaller subgroup to reduce to (i.e. having a larger coset space) increases the initialisation time and memory usage, but generally results in more accurate heuristics, which in turn speeds up the actual solving. The new coset space I tried had 43680 cosets, corresponding to solving four pieces (and ignoring the remaining twelve). Again, I used four heuristics, each one solving the four pieces in one quadrant. This was about eight times as effective at pruning the search tree and also about eight times as fast. However, using the new heuristics in conjunction with the "tagging" heuristics (eight heuristics in total) did not provide much further improvement, pruning only 20% more nodes, which doesn't make up for having to update twice as many heuristics at every node.
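The 43680 comes from 16·15·14·13: a placement of four distinguished pieces among sixteen positions. One way to rank such a partial permutation (again, a hypothetical helper, not necessarily the solver's own indexing):

```javascript
// positions: the current cells of the 4 tracked pieces (distinct, 0..15).
// Returns a mixed-radix rank in [0, 16*15*14*13) = [0, 43680).
function rankPartial(positions) {
  let rank = 0;
  const used = [];
  for (let i = 0; i < positions.length; i++) {
    const p = positions[i];
    // the i-th piece has 16 - i free cells; its digit is p minus the
    // number of already-placed pieces occupying smaller cells
    let smaller = 0;
    for (const u of used) if (u < p) smaller++;
    rank = rank * (16 - i) + (p - smaller);
    used.push(p);
  }
  return rank;
}
```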

js> S8 = scramble(4)
[[8, 15, 9, 12, 13, 0, 6, 11, 2, 14, 3, 1, 4, 7, 5, 10], [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]]
js> prettyprint(S8,print,1) // the numbers below correspond to the orientation: 0 = upright, 1 = rotated right, 2 = upside down, 3 = rotated left
i2 p3 j3 m2
n3 a2 g2 l3
c0 o2 d1 b2
e0 h2 f3 k2
js> solve4a(S8) // three-phase solver
[0, 0, 1, 7, 5, 1, 1, 2, 3, 3, 5, 5, 6, 6, 6, 4, 7, 7, 1, 1, 4, 4, 0, 1, 1, 0, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 0, 0, 0, 6, 6, 8, 7, 7, 8, 6, 6, 7, 7, 8, 7, 7, 6, 6, 6, 7, 7]
js> solve4b(S8) // two-phase solver; the "time taken" is only for the second phase; it does not include initialisation time
L1: 5.844085693359375 / 9
L2a: 4.85003885003885 / 7
L2b: 6.056254856254856 / 9
L3a: 6.149130036630036 / 9
solving phase 1
solving phase 2
0 nodes searched, 0 nodes pruned
starting search at depth 9
107 nodes searched, 1312 nodes pruned
starting search at depth 10
2208 nodes searched, 25303 nodes pruned
starting search at depth 11
37393 nodes searched, 415976 nodes pruned
starting search at depth 12
565538 nodes searched, 6204791 nodes pruned
starting search at depth 13
7946215 nodes searched, 86625498 nodes pruned
starting search at depth 14
solution of length 14 moves found
69370866 nodes searched, 752203551 nodes pruned
time taken: 197.777 s
[0, 0, 1, 7, 5, 4, 4, 2, 7, 7, 4, 4, 2, 2, 4, 4, 4, 3, 3, 1, 1, 8, 8, 4, 4, 4, 6, 6, 3, 3, 8, 8, 4]

So as you can see, it still takes multiple minutes to solve random states. It’s also funny to note how the second/third phases in the three-phase solver took 35 moves when the optimal solution was only 14 moves.

What further improvements can be made here? One thing that stands out in the two-phase solver is that the work done in each phase is rather unbalanced: the first phase has only 32768 positions, while the second phase has 20922789888000 positions. It'd probably be better to have more balanced phases, like, say, using an intermediate subgroup that fixes orientation, one corner and the two edges adjacent to that corner, where the first phase has 110100480 cosets and the second phase has 6227020800 cosets. Calculating lookup tables for the first phase would probably take like half an hour and use 1 GB of memory unless I mess with typed arrays, so I guess both phases would have to be done with IDA*.

Addendum: Instead of using the above subgroup, I realised that using the square subgroup (generated by the squares of the moves) produces an almost even split between the two phases. The square subgroup is characterised by three conditions: every piece-position pair has a unique orientation; if you tag the pieces/positions like a checkerboard, the tagged pieces are in tagged positions; and the permutation parity of the whole puzzle is even. Taken together, these conditions pin the subgroup down, and the coset space has 812851200 elements. IDA* was used for both phases and the resulting solver was not unreasonably slow, generating a solution in 0.19 s on average. What was not so nice was that generating the transition tables for the four heuristics (each of which has about 100000 entries) took about four seconds. I'll see if using typed arrays speeds this up any.

Another bizarre thing I came across when trying to reduce the initialisation time (it was in fact originally around seven seconds) was that these two practically identical functions run at vastly different speeds on Chrome (which was the only environment I used a profiler in; the SpiderMonkey shell doesn't have one).

function compose_map(a,b)
{
return a.map(x => b[x]);
}
function compose_loop(a,b)
{
var c = [];
for (var i = 0; i < a.length; i++) c[i] = b[a[i]];
return c;
}

Spoiler: the second one ran *over a hundred times faster* and shaved off about a second from the initialisation time. Now, Chrome's profiler is actually sorta buggy and sometimes reports impossible times, but this is still weird. This function is called very often during initialisation (probably around half a million times), and I guess creating the lambda function for the map version at every call only to destroy it later is not very efficient. I also converted a slice/map call to an explicit for-loop in another function, which shaved off another second from initialisation. It's unfortunate that we have to write uglier code to get better performance.
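If you want to try this yourself, here's a rough self-contained harness along those lines (the two variants repeated so the snippet stands alone). Timings vary wildly across engines and versions, so the hundredfold gap may well not reproduce today.

```javascript
function compose_map(a, b) { return a.map(x => b[x]); }
function compose_loop(a, b) {
  var c = [];
  for (var i = 0; i < a.length; i++) c[i] = b[a[i]];
  return c;
}

// two arbitrary permutations of 0..15 (5 and 3 are coprime to 16)
const a = Array.from({length: 16}, (_, i) => (i * 5) % 16);
const b = Array.from({length: 16}, (_, i) => (i * 3) % 16);

function bench(fn, iters) {
  const t0 = Date.now();
  for (let i = 0; i < iters; i++) fn(a, b);
  return Date.now() - t0;
}

// sanity check: the two variants compute the same composition
console.log(compose_map(a, b).join() === compose_loop(a, b).join());
console.log('map: ', bench(compose_map, 500000), 'ms');
console.log('loop:', bench(compose_loop, 500000), 'ms');
```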