## Sorting networks on blocks

The latest CS2020 assignment contains this problem (slightly paraphrased): given $n$ blocks of data (with each block being of size $S$, with the primitive operation of sorting any two blocks of data together (i.e. take two blocks, concatenate, sort, then split back into two blocks), devise an algorithm to sort all $n$ blocks of data. Inspection of the elements within each block is not allowed.

The first thing to note is that we must have $n\ge2$, because we cannot sort a single block on its own. (A bizarre restriction.) The second thing to note is that because our sorting algorithm necessarily acts independent of the data, we may treat it as being similar to a sorting network and make use of the 0-1 principle (where every individual datum is either “0” or “1”) to simplify analysis. The third thing to note is that, without loss of generality, we may assume that the contents of the individual blocks are already sorted, and equivalently, we may also treat a block as containing an integer between 0 and $S$, inclusive.

Why does the 0-1 principle hold? This is not trivial (the standard proof does not immediately generalise), but we can always convert a block sorting network into a normal sorting network by substituting the sort-two-blocks operation with a sorting network; therefore the 0-1 principle remains valid.

Define a block comparator as the primitive operation of sorting two blocks together. In the case where $K=1$, this is obviously analogous to comparators in normal sorting networks.

A clean block is a block where all $K$ elements are identical; a mixed block is a block with both 0’s and 1’s. Any sorted block sequence has at most one mixed block.

We could directly adapt the bubble/insertion network to this problem, and this gives a block-sorting network with $\Theta(n^2)$ block comparators and depth $\Theta(n)$. Correctness may be proven via induction, as for the original sorting network. To be more precise, the insertion network for $n$ elements is given by the concatenation of the insertion passes over $k\in\{1,2,\dots,n-1\}$, where each insertion pass consists of the (block) comparators $(k-1,k),(k-2,k-1),(k-3,k-2),\dots,(0,1)$. Visually, it looks like this:

(Image source: https://en.wikipedia.org/wiki/File:Six-wire-insertion-sorting-network.svg; licensed under CC-BY 3.0.)

If $n=2$, the sole block comparator in the network sorts both blocks. Done. Suppose the network is correct for some $m\ge2$; we will proceed to show correctness for $n=m+1$, whence by induction correctness is proven for all $n\ge2$.

By the inductive hypothesis, after the first $m-1$ passes the first $m$ blocks are sorted. Consider what the $m$th pass does. If all the blocks are clean, this reduces to insertion sort and we are done. If the first $m$ blocks are all-zero, we are done. Otherwise, let $k$ be the index of the first block among the first $m$ that is not all-zero. As the subsequence of blocks over the indices $\{0,\dots,m-1\}$ is sorted, the blocks after index $k$ are all-one. Applying the comparators $(m,m+1),\dots,(k+1,k+2)$ shifts the last block next to block $k$. Applying the comparator $(k,k+1)$ completes the sort.

The insertion network is not the only one that still sorts correctly when adapted to block-sorting. A careful case analysis (>>cases) shows that the odd-even merge sort (with the generalisation to non-power-of-two sizes) has the same property; bitonic merge sort can also be proven to be correct when applied to block-sorting.

This does bring up the question: do all sorting networks work as block-sorting networks? It’s not obvious whether we can force a contradiction from a sorting network that works when $S=1$ but not for some larger value of $S$.

## Odd-even merge sort

def create_oemsort(n):
return create_oemsort_range(0, n)

def create_oemsort_range(lo, hi):
if hi - lo < 2:
return ()
if hi - lo == 2:
return ((lo, lo+1),)
mid = (lo + hi) // 2
sort0 = create_oemsort_range(lo, mid)
sort1 = create_oemsort_range(mid, hi)
merge = create_oemerge(lo, mid-lo, hi-mid, 1)
return sort0 + sort1 + merge

def create_oemerge(lo, n0, n1, pitch):
# create network for odd-even merge
# n0/n1 are the number of elements in the halves we are going to merge
if n0 == 0 or n1 == 0:
return ()
# e0/e1 are the number of even-indexed elements in the corresponding halves;
# o0/o1 are the number of odd-indexed elements in the corresponding halves.
# (even/odd refer to offset of the indices from lo, not the actual index.)
e0 = (n0 + 1) // 2
e1 = (n1 + 1 - n0%2) // 2
o0 = n0 - e0
o1 = n1 - e1
merge_even = create_oemerge(lo, e0, e1, 2*pitch)
merge_odd = create_oemerge(lo+pitch, o0, o1, 2*pitch)
# analysing this by the four possible combination of parities of n0/n1 is horrifically messy,
# but it turns out that the result can be simplified into a single range().
slide = tuple((lo+i*pitch, lo+(i+1)*pitch) for i in range(1-n0%2, n0 + n1 - 1, 2))
return merge_even + merge_odd + slide

As it turns out, analysing odd-even merge sort for power-of-two sizes is easy, and then trying to do anything else makes everything super-hairy because of case explosion. Well, it’s really only four cases, but we have to be more careful about dealing with indices since we can’t assume we’re accessing everything with power-of-two strides.

The idea behind OEMS is the same as the usual merge sort: we split the input array into two halves, sort them independently, then merge the sorted halves to get a fully sorted array. The difference between the two algorithms is in how the merging is done. In the ordinary merge sort, we traverse one half of the array and weave the other half into it; in OEMS, we merge the even-indexed elements of both halves, then the odd-indexed elements (collectively, the merging phase), and then test the elements against their neighbours (the sliding phase).

Merging in merge sort is linear time and merging in OEMS is linearithmic time (thanks to the recursion); what then is the use of OEMS, if it is asymptotically slower? The reason is that the sequence of comparisons made on the data is actually independent of the data, and so can be precomputed as a sorting network. Furthermore, it readily lends itself to parallelisation, where it becomes a $\Theta((\log n)^2)$ algorithm if arbitrarily many processors are available. Combined with not needing to communicate between processors during each phase, this has a very low overhead and is a common choice for a fast parallel sorting algorithm. (Also, the gap between any two elements being compared is always a power of two, but this probably doesn’t affect speed directly.)

The usual assumption for OEMS is that the input has a power-of-two number of elements. One way of relaxing this is to round the input size to the next highest power of two, generate the network, and then prune out-of-bounds comparators, but this is sort of ugly. Another is to pay a lot of attention to the base cases of the merging algorithm, as well as to how the parities of the two halves of the input array affect the sliding phase. This is also ugly, but I already wrote the code for that, so, yeah.

Bitonic merge sort is another sorting network with $\Theta(n(\log n)^2)$ comparators, and happens to have been devised by the same guy as OEMS. In terms of the number of comparators bitonic merge sort is worse, but (I think) in terms of the number of phases needed, bitonic merge sort is somewhat better. So there’s some sort of a trade-off there. Also note that bitonic merge sort is a bit trickier to implement, since in its direct form you have to force reverse sorting at some parts of the sorting algorithm. It’s possible to “flip” the comparators to make them all sort the “right way”, but this is somewhat confusing to reason with, especially in the case of non-power-of-two sizes.

## Playing Ataxx

(An implementation of) Ataxx used to be in the GNOME Games package up until around 2007, so while I’ve played the game for a while back then, it wasn’t included in newer distros and I kind of just forgot about it until now.

And then I decided to implement it on a whim yesterday when I was stuck in the train getting home, because why the hell are there Saturday classes for Analysis II again?

Anyway. The rules aren’t too complicated. This is a two-player game played on a 7×7 board, where we start with two black pieces at two opposite corners of the board and two white pieces on the remaining two corners. At a player’s turn, he may either “split” a piece of his colour to any unoccupied adjacent cell or “jump” to any unoccupied cell of distance 2. (Here the distance is measured with the infinity norm, so for example moving one cell to the left and one cell to the right counts as being distance 1, and a knight’s move is distance 2.) For splitting a new piece is placed, but for jumping the piece in the source cell is moved to the destination cell.

All opponent pieces adjacent to the destination cell are converted to his colour. If a player has no such legal move, he must pass (and may not pass in any other circumstance).

The game ends when either a player has no more pieces of his colour on the board, or when the board is completely filled. The winner is whoever has more pieces of his colour on the board.

I couldn’t be bothered to make a fancy UI for the game yet, so I wrote it in JavaScript with a read-eval-print-loop CLI. I also threw together an implementation of the alpha-beta negamax algorithm for the AI, which takes about a second at a search depth of 4 plies. Probably could be optimised a lot more, but it works.

The board evaluation function I used was to take the signed difference between the number of own pieces and the number of opponent pieces, and this turns out to have some interesting artifacts. Throughout most of a game, many pieces switch sides on every turn, so this simplistic evaluation function is heavily dependent on the parity of the number of plies we search. The only exception is during the first few moves, where trying to maximise the evaluation function leads the AI to choosing splits over jumps, and since split moves increase the number of pieces of its own colour.

We do get a reasonable estimate of which player’s ahead by doing two minimax searches at, say, depths 3 and 4, and then averaging these values. This kills the oscillation in the evaluation function and gives something a bit more reasonable. Haven’t implemented this yet, but it (probably) matters more for showing the evaluation result to a human observer than for the minimax algorithm itself.

Another thing I’d like to note is that, after nine or so years of not having played this game, I’m pretty awful at it now, to the extent of almost losing to an AI making completely random moves. Um. Yeah. This makes judging how powerful the AI really is pretty difficult, actually. Then again, I suck at board games in general, so make of that what you will.

Code will be uploaded to GitHub at some point, probably.

## Worrying about things, 2016 edition

Everything is screwed up in one way or another this semester.

I wanted to take both Analysis II (S) and Algebra II, but the tutorial sessions for the former coincide with the lecture sessions for the latter. Well, okay, guess I just won’t take Algebra II.

Guess what, half the Analysis II class is taking Algebra II, and now we have to reschedule the tutorials. And it looks like there are no common free time slots for every weekday, so now the tutorials are on Saturdays, and I’m still not taking Algebra II because bidding closed last week.

I wanted to take CS2020, but now we’re in the second week and I still haven’t fully sorted out everything and I’m still not enrolled yet. The supposed deadline for adding modules is apparently last Friday (hurr) so I have no idea whether I’m completely hosed and have to take CS1020 instead, in which case I’m doubly completely hosed because I didn’t bother going through all the admin matters for CS1020 (and there were a few important things to handle).

On the fifth I filed an appeal through CORS for CS2020, which was rejected. At that point I thought I didn’t have any chance of getting in, so I signed up for a bunch of other classes, and then on the eighth, my CS1010S lecturer last semester told me to email the lecturers for CS2020, which I did on the eleventh after revising my message a hundred times over the span of five hours. As if that wasn’t enough, I was told to attend the first lecture (well, this is reasonable) and had to fill in an online form to actually register for the class. That I did, and on the 14th I was informed that manual registration failed because the exam schedule conflicted with one of the backup classes I registered for, and on the 15th I sent the FoS CORS handling peeps an email saying I wanted to drop the conflicting class, which I couldn’t do through CORS by myself because that brought me below the minimum workload of 18 MCs.

Now I have no idea whether CS2020 is supposed to be added by Computing or Science, because the email message I got from the lecturer seemed to hint that this should be handled by FoS, but my impression is that SoC handles their modules. I basically froze in panic for the past two days and my productivity (school or otherwise) dropped to zero.

I chose the modules I chose to avoid having to deal with administrative nonsense, but here I am, procrastinating said administrative nonsense by writing a post about it. Everything is wrong.

On one hand I got into the Dean’s list for “amazing” performance last semester, and on the other I’m convinced the physical possession of the certificate is more incriminating than useful for anything. If the last time I got a “major” award counts for anything, that is—remember when I got the “Outstanding Serviceman of the Month” award during national service? People still won’t shut up about it, and that was in 2013. This bothers me. Well, part of it was that I felt that that award was entirely undeserved. Maybe being on the Dean’s list is different from that and I’m really that awesome? Ha, who am I kidding. I messed up on the ODE exam so thoroughly, plus I got a “completed; unsatisfactory” grade for the career planning module because I couldn’t be arsed to show up for the sessions. It’s pretty ironic that I actually got on the Dean’s list at all.

## Generating sudoku grids

This is an Interesting Problem and one I’ve put some thought into way back in 2013.

How would you write an algorithm to generate a random sudoku grid (with every cell filled in), so that the distribution of generated grids doesn’t have significant correlations between cells and the like? I mean, we already know that having a digit x in a cell precludes that same digit from being elsewhere in the same row/col/box, but the question is whether our algorithm has a predisposition to causing nontrivial correlations between cells that don’t share rows/cols/boxes. We’d like to thwart solving methods/algorithms that try to exploit such correlations.

Oh, and of course, it should be fast.

The basic approach to this is to first write a sudoku solver (using dancing links or whatever), and then do this:

2. Optionally seed it with some values.
3. For each empty cell:
1. Fill it with a random value.
2. If the solver says there’s no solution, fill it with a different value.
3. If none of the possible values lead to a solution, backtrack.

(An alternative is to start with a fixed grid, and then apply transformations that preserve the sudoku invariant. However, this results in correlations between cells. One workaround is to precompute multiple grids and randomly choose one to transform. This is the strategy used by one iOS sudoku app, and likely by many others too.)

As it is, if we don’t seed it with any values and we visit the cells in random order, it can take a ridiculously long amount of time to generate a single grid, as the algorithm backtracks and wastes a lot of time trying to find solutions to partial grids that don’t have solutions.

Even an extremely simple seeding algorithm helps considerably. If we take any group (row/box/col), then by symmetry, all 9! permutations of digits are equally likely among the seven sextillion grids, so we could just pick a fixed group (we don’t have to randomise this!) and fill it with a random permutation.

Naively, we might expect that seeding with a row/col is better than seeding a box if we want to curb the worst-case behaviour. (Which is the point, because the exponential growth means that we should care about minimising the branching ratio more than anything else.) If we seed with a row, then every unseeded cell will have at least one candidate removed, but if we seed with a box, only the 36 cells sharing a row or column with the box will have at least one candidate removed. The total number of removed candidates is the same in either case (108), but it’s more spread out when seeding with a row. I don’t know whether this actually matters since I’m too lazy to write the code for that, but it doesn’t really matter because we have something better.

If we think of the 9×9 sudoku board as a tensor product (sorta) of two 3×3 boards, this suggests that we can think of the board as a two-dimensional projection of what is really a 3×3×3×3 board. This also gives an alternative description of the rows/cols/boxes; with a particular choice of coords, we could say that a “row” in the 2D projection is a plane fixing the first and second coords, a “column” is a plane fixing the third and fourth coords, and a “box” is a plane fixing the first and third coords. (If we add regions corresponding to planes fixing the second and fourth coords, this gives us “disjoint groups” sudoku, and we get some interesting nonobvious sudoku invariant-preserving symmetries out of this.) Expressed this way, every row intersects with every column (duh) and less trivially, for a row and any other disjoint region (row/col/box), there must be a region that intersects nontrivially with both the row and the other region.

(Proof: A box intersects with some columns and with some rows, with the intersection necessarily being a line (since it fixes three coords out of four). A row intersects with every column and vice versa. Therefore if we have a box and a row/col, there exists a col/row intersecting with both regions. If we have two rows/cols, every col/row intersects with both regions.)

This means that if we want to seed our grid generation algorithm with two regions, these regions must both be boxes. In fact, we can fit in three boxes such that no two of them “see” each other: we just pick the three boxes lying on the main diagonal. (The antidiagonal and the broken diagonals would work just as well. All we need is that there’s no row/col intersecting more than one of these boxes.) We fill these three boxes with random permutations independently, then let our backtracking solver do the rest of the work. This kills off around four candidates per unsolved cell, which is a huge improvement over using only one row/col/box as a seed.

The number of sudoku grids is not divisible by 9!3, which implies that this seeding method does not give a uniform distribution over the roughly seven sextillion grids. However, if anything, this might have less correlation/predictability than a uniformly randomly sampled sudoku grid, since we force the diagonal boxes to be completely independent in this construction.

Now, I’ve not proven that all combinations of permutations lead to solvable puzzles, but it would be very weird if the 27 clues were enough to force an indirect contradiction. It might not be too hard to verify this assertion via brute force, since there’re essentially only (8!/4)(8!/4+1)/2 ≈ 51 million cases to consider if we exploit symmetry reductions.

Note that if we want to generate grids with nonstandard regions, the above approach might not work without some modification. Take for example sudoku X, where we have that the diagonals follow the digit uniqueness constraint. Even if we generate the three diagonal boxes so that there are no immediate conflicts within the diagonal boxes, it is possible to hit a case where a row, a column, and a diagonal intersect with a cell and kill every single candidate within it. (Fill A1:A3 with 1,2,3; D6,E5,F4 with 4,5,6; I7:I9 with 7,8,9. Then I1 has no valid candidate values.) Backtracking would be unavoidable should we proceed with seeding three diagonal boxes.

Another caveat is that seeding with three diagonal boxes seems to hit bad cases more often, where the generation algorithm wastes a lot of time backtracking through partial grids. I don’t have an explanation for why this happens, nor have I crunched the statistics demonstrating that this happens at all. Maybe it’s just my mind playing tricks on me.

The solver is used very many times while generating a grid, so it is essential that the solver is fast. In particular, since the main algorithm already involves backtracking, we don’t even need a full brute-force/backtracking sudoku solver. We could restrict ourselves to fast and simple techniques, such as hidden/naked singles, and this is in fact what I used in the sudoku solver/generator I wrote. Pruning more nodes of the search tree is pointless when it costs more CPU time than it saves, which is a truism we’ve also seen in the context of IDA* search for twisty puzzles.

The linked code also visits the cells in random order in an attempt to break up any possible correlation between any two cells, but it might be faster to bifurcate the search tree on cells with fewer candidates rather than going through them in a random order. We’d lose some nice guarantees on the lack of correlation, but it can’t be helped. Also, note that the grid generation algorithm can also be stated as trying to solve an empty or seeded grid, so the solve_full function in my code serves both as a backtracking sudoku solver and a grid generator.

Implementation details, such as utilising bitmask representations, can affect the overall speed dramatically, but we will not discuss that any further.

Surveying the first twenty Google search results for “generating sudoku” sans quotes, we have the following breakdown on methods of generating sudoku grids/puzzles:

1. Precompute a single grid and transform. [1][2][3]
2. Precompute multiple grids and transform. [4]
3. Precompute multiple grids. [5]
4. No explicit seeding; raster order. [6]
5. No explicit seeding; random order. [7][8]
6. No explicit seeding; minimum-branching order. [9]
7. Seeding a region (9 cells). [10]
8. Seeding a box row and a column (33 cells total). [11]
9. Seeding random values in random cells; retry if unsolvable. [12]
10. Fill random values in random cells; retry if no/nonunique solution. [13]
11. Fill random values in random cells; use simulated annealing until partial grid has unique solution. [14]
12. Invert the solving process to obtain a partial grid with a unique solution. [15]

(Grouped by similarity, not sorted by the order in which they appeared as Google results. Paywalled papers have been excluded because you don’t deserve readership if you put your shit behind paywalls.)

[10] did not explicitly specify whether the seeding region was a box or a row/col. [11] is interesting because it seeds more cells than my approach, and similarly the author didn’t prove whether their approach would always produce a solvable grid. (My only defence is that seeding the three diagonal boxes is a lot easier to implement than generating a random band.)

Note that even without explicit seeding, if we use a fixed raster order (as in [6]), this is essentially the same as seeding the first row, since it’s impossible for the backtracking algorithm to backtrack beyond the first row. (Iirc an earlier version of my sudoku solver exploited this fact, but it’s faster to explicitly seed the grid anyway.) Minimum-branching order (as in [9], where we choose the most constrained cells to visit first) should also implicitly seed the first row or column if ties are broken in raster order, but I’m not entirely sure about this.

[12] and [13] are similar, but the main difference is that in [12], it’s okay to have multiple solutions, and indeed, in that paper, they suggest using eleven random cells, so having a unique solution is impossible. These eleven cells are used as a seed to the generation algorithm described in the second section of this post if they don’t conflict. [13] suggests using seventeen random cells and stops if there is at least one solution. This is not acceptable for the current discussion (and [8], which references [13], agrees), so I’ve taken the liberty to reinterpret it as an algorithm generating a uniquely solvable puzzle by looping the algorithm until that happens. This happens with extremely low probability if we stick to filling only 17 cells, so this should be increased to, say, 20-25.

#1 and #3 are both terrible ways of generating sudoku grids, and while I think [5] is purely a mathematical curiosity that no one’s bothered to implement as a real sudoku grid generator, #1 is in three (!) of the links. This is bad because it reduces the puzzle to a mere one-step pattern recognition problem, rather than a chain of logical deductions.

[15] is a totally cool approach to this and sidesteps the whole problem of generating a grid then making a puzzle out of that. I haven’t read the paper in detail, but it’s also the only one typeset in TeX and therefore is automatically a lot more credible than the rest. It doesn’t mention expected running time and it’s likely that this approach is considerably slower than the conventional generate-grid-then-add-holes approach to generating a puzzle.

[14] is also cool since it uses simulated annealing and also does the same sidestepping as [15], but SA doesn’t come with a convergence guarantee so this algorithm has to be accompanied with an explicit timeout. It might not be as bad as it sounds, anyhow—backtracking solvers/generators can have terrible worst cases too.

On another note, [3], [12] and [15] are all attributed to four-digit teams. Did they originate from some high school sudoku research project or what?

For the closely-related problem of generating sudoku puzzles, some of the above approaches already do this directly. The more common approach is to first generate a filled grid, and then remove the clues one by one until we reach a state where removing any more clues renders the puzzle unsolvable with a certain set of techniques. If we want to generate only “easy” puzzles, we can restrict our solver to using the naked/hidden singles methods, rather than using a full backtracking solver.

Scare quotes around the “easy” are necessary because such puzzles can sometimes be even harder for human solvers if we get bottlenecked on, say, having to find a certain hidden n-tuple in order to proceed on the puzzle. Exhaustive searches over all cells and values are easy for computers, but human implementations are far more error prone. There’s been way too many times when I wasted minutes constantly overlooking hidden n-tuples.

Hand-crafted sudoku puzzles often feature symmetrically-located clues for aesthetic reasons, and we can easily modify the above algorithm to generate symmetric puzzles by removing the clues together with their opposites.

In [12] it is pointed out that traversing the cells in random order (without the symmetry restriction) has the tendency to leave clues scattered throughout the grid, which makes the generated puzzles easier on average. I can corroborate this statement, and this observation holds for symmetric clue removal as well. A suggestion provided within for generating harder puzzles is to use (i) raster or serpentine order, or (ii) serpentine order skipping every other cell on the first pass, then going back to those skipped cells on the second pass (let’s call this “parity serpentine order”). In the former case, the puzzles will necessarily have clues clumped together at the bottom, which can be made less obvious by applying a random transformation to the grid. The parity serpentine order also leads to a characteristic checkerboard pattern to the clues, but applying a random transformation is likely to make it uglier.

I think traversing the cells in a spiral might make for interesting results. In particular, we can enforce some artificial difficulty by spiralling inwards, since this would ensure that there are fewer clues on the outer part of the puzzle compared to the inner part, and so for a human solver, more eyeball movement would be needed to solve the puzzle. This also forces clustering of the clues, which might introduce legitimate difficulty as well.

As a point of optimisation, we can fix the cell traversal order before attempting removal of clues in the cells. This would avoid visiting a cell multiple times to remove a clue when it’s an “unremovable clue”, where removing it necessarily results in the puzzle having multiple solutions. This also eliminates the need for any backtracking (outside of the solving algorithm) in generating a puzzle.

Furthermore, when we’re testing whether we can remove the clue at cell x, we need not try to solve the whole puzzle in order to determine whether the puzzle has a unique solution; if we take as our inductive hypothesis that the current puzzle has a unique solution, then it is enough to show that the clues sans cell x can uniquely determine the values at that cell. This optimisation is, needless to say, implemented in my code. It generalises in an obvious manner if we remove multiple clues simultaneously (e.g. when generating symmetric grids); instead of checking that the clues sans cells x, y guarantee a unique solution, we need only check that they can uniquely determine cells x and y.

## Not that kind of cube

Problem: Let $p$ be any prime number, and $G$ be a group with order $p^3$. Show that for any $x\in G$, $x^p\in Z(G)$.

This came up in the Algebra midterm. My initial approach was to show that the cyclic case is trivial, and that the order-bounded-by-$p$ case is also trivial, which leaves the case where $G$ is nonabelian and has at least one element of order $p^2$. Unfortunately, that was also as far as I got. Well, we have somewhat more machinery to deal with this now, like direct and semidirect products.

The proof is as follows.

If $G$ contains an element of order $p^3$, then $G$ is cyclic and thus abelian. Consequently, $x^p\in Z(G)=G$. Suppose hereafter that the order of the elements is bounded by $p^2$.

If $G$ does not contain an element of order $p^2$, then $x^p$ is always the identity element and thus $x^p\in Z(G)$. Suppose hereafter that there exists some element with order $p^2$; let $x$ be this element and $H:=\langle x\rangle$.

As $H$ is a subgroup of index $p$, it is necessarily a normal subgroup. [0]

Side note: At this point, we can choose to complete the proof in a few different ways. We can start by tackling the case where every element of $G\backslash H$ has order $p^2$ and otherwise separately, or we can look at individual elements. The latter is what the suggested solution uses, so we’ll present that first.

Claim: $\forall z\in G\quad z^p\in\langle x^p\rangle$. (This obviously holds for the elements of $H$, so we only need to check this for the elements not in $H$.)

Let $y\in G\backslash H$. The order cannot take on the values 1 or $p^3$, so either $o(y)=p$ or $o(y)=p^2$. In the former case, $y^p=e\in\langle x^p\rangle$. Suppose hereafter that $o(y)=p^2$.

If $y^p\notin H$, then $K:=\langle y\rangle$ has trivial intersection with $H$. Since both $H$ and $K$ are normal subgroups, we have that the elements of $H$ commute with the elements of $K$ (because $hkh^{-1}k^{-1}=h(kh^{-1}k^{-1})\in H$ and $hkh^{-1}k^{-1}=(hkh^{-1})k^{-1}\in K$ implies $hkh^{-1}k^{-1}=e$), and hence $HK$ is a subgroup of order $p^2\cdot p^2$. However, this contradicts $|HK|\le|G|$. Therefore $y^p\in H$. Since $o(y^p)=p^2/p=p$ and the only elements of order $p$ in $H$ are in $\langle x^p\rangle$, it follows that $y^p\in\langle x^p\rangle$. The claim is proven. IOW, $\{z^p:z\in G\}=\langle x^p\rangle$.

Furthermore, we also have that $\langle y^p\rangle=\langle x^p\rangle$, since a prime-order cyclic group can be generated by any single non-identity element. This implies that $y$ commutes with $x^p$ for all $y\in G\backslash H$.

$x^k$ for $k\in\mathbb Z$ commuting with $x^p$ is trivial; therefore $x^p\in Z(G)$. Consequently, $\langle x^p\rangle\subset Z(G)$. QED.

Alternatively, consider the following cases. Either $G\backslash H$ is solely composed of order-$p^2$ elements or has at least one order-$p$ element.

In the first case, the $p$th power of any element of $G\backslash H$ must have order $p$ and thus be in $H$, where we can use the same approach as above to complete the proof.

In the second case, note that the $p$th power of any element of $G\backslash H$ must be the identity element and therefore also in the centre of $G$. Fix any element $y\in G\backslash H$ and let $K:=\langle y\rangle$. Since $H$ and $K$ have trivial intersection and $|H|\cdot|K|=p^3=|G|$, $G$ is a semidirect product $H\rtimes_\phi K$. The automorphism group of $H$ is cyclic with order $p(p-1)$, where the order-$p$ elements are $\{(x\mapsto x^{1+ip}):i\in\{1,2,\cdots,p-1\}\}$. (The order of these elements may be verified by taking a binomial expansion.) Note that the identity automorphism may also be written as $(x\mapsto x^{1+0p})$.

Any homomorphism $\phi:K\to\mathrm{Aut}~H$ must map $y$ to either $\mathrm{id}_H$ or one of the order-$p$ automorphisms. Either way, let $\phi(y)=(x\mapsto x^{1+ip})$. We have that $yx^py^{-1}=(yxy^{-1})^p=(x^{1+ip})^p=x^{p+ip^2}=x^p$, so $y$ commutes with $x^p$. Since $G$ is generated by $x$ and $y$ and $x^p$ commutes with both generators, $x^p$ must be in the centre of $G$. Therefore $\{z^p:z\in G\}\subset Z(G)$.

I wrote this post over a month ago and I don’t remember why I didn’t just publish it, but here you go.

## Rectangles

Every few months I reread all of the Quantum Computing Since Democritus notes because dang, these make for a good read.

One of the questions posed within was to prove that there do not exist five points $P=\{A,B,C,D,E\}\subset\mathbb R^2$ such that, for any subset $P'\subseteq P$, there exists an axis-aligned rectangle $R=[x_0,x_1]\times[y_0,y_1]\subset\mathbb R^2$ such that $P\cap R=P'$. This took a few minutes of thought, but it’s obvious once you look at the case of four or fewer points and try to generalise. Crucially, there does exist a set of four points such that you can select any subset with an appropriately chosen axis-aligned rectangle, so there’s something about the fifth point that forces failure. Or actually, I just tried to type out a proof and immediately noticed a flaw in the argument. Lemme ponder on this for a few more minutes.

Definition (stolen from QCSD): A set of points $P\subseteq S$ is shattered by a concept class $C\subseteq2^S$ iff for any subset $P'\subseteq P$ there exists $f\in C$ such that $f\cap P=P'$. The VC dimension of $C$ is the size of the largest possible $P$ that is shattered by $C$.

Definition: Let $\mathrm{Rect}$ be the set of closed axis-aligned rectangles. More specifically, $\mathrm{Rect}:=\{[a,b]\times[c,d]:a,b,c,d\in\mathbb R\wedge a\le b\wedge c\le d\}$.

Definition: The bounding box of a bounded set of points in the Euclidean plane is the axis-aligned rectangle $[x_{\min},x_{\max}]\times[y_{\min},y_{\max}]$ where $x_{\min},x_{\max},y_{\min},y_{\max}$ are the extrema of the coordinates of the points.

Lemma: For any finite set of at least four points $P\subset\mathbb R^2$ such that there are two points sharing the minimal x-coordinate among all points, $P$ is not shattered by $\mathrm{Rect}$. (As an obvious corollary, by symmetry, this also applies to sets with points sharing maximal x-coord, or minimal y-coord, or maximal y-coord.)

Proof: Let $P$ be a set as described, and let $A,B$ be two points with minimal x-coordinate such that $y_A. If $y_A$ is not the minimal y-coordinate, let $C$ be a point with minimal y-coordinate; then $B$ is contained in the bounding box of $\{A,C\}$, and so $P$ is not shattered. Likewise, if $y_B$ is not the maximal y-coordinate, then $P$ is not shattered.

Suppose $y_A,y_B$ are the extrema of the y-coordinates of the points in $P$. Let $C$ be a point with maximal x-coordinate. Then the bounding box of $\{A,B,C\}$ contains all of $P$, but there is at least one point other than $A,B,C$, so $P$ is not shattered.

Proposition: For any set of five points $P\subset\mathbb R^2$, $P$ is not shattered by $\mathrm{Rect}$. In other words, the VC dimension of $\mathrm{Rect}$ is at most 4.

Proof: Suppose for contradiction that there does exist a set of five points that is shattered by $\mathrm{Rect}$. By the lemma, there must be exactly one point with minimal x-coordinate, one point with maximal x-coordinate, one point with minimal y-coordinate, and one point with maximal y-coordinate. (A priori we cannot assume that these four points are distinct.) This leaves at least one point that does not lie on the bounding box of $P$, but since it must be in the bounding box, there cannot be a rectangle containing all points but that, a contradiction. QED.

Corollary: The VC dimension of $\mathrm{Rect}$ is 4. Proof: take the four points defined by (±1,0) and (0,±1); these are shattered by $\mathrm{Rect}$.

My initial broken proof just skipped to the last step and assumed there must be one point left over, but I realised that it didn’t account for the case where multiple points shared an extremal coordinate. Anyway, what if we allow non-axis-aligned rectangles as well? The VC dimension increases to at least five, because now the set of five points arranged in a regular pentagon is shattered. A regular hexagon doesn’t work because you can’t isolate three vertices forming an equilateral triangle with any rectangle, but this doesn’t rule out the VC dimension being 6. (I just asked my sister about this, and she came up with the equilateral triangle counterexample way faster than I did.)

To be continued?

If we further extend to the set of all finite convex regions, then it’s easy to see that the VC dimension is exactly $2^{\aleph_0}$, since the unit circle is shattered by the set of convex regions. On the other hand, if we consider the set of all open finite convex regions, then the VC dimension is at least $\aleph_0$. We can still explicitly construct a shattered set, but it’s a bit more involved. (I’m not sure if it’s possible to construct any larger shattered set, and it’s possible we get into the hairiness that is the continuum hypothesis should we delve further.)

Consider the semicircular arc $(\cos\pi t,\sin\pi t)$ over $t\in(0,1]$, and select the countably many points corresponding to $1/t\in\mathbb N$. We argue that this set $P:=\{(\cos\pi/k,\sin\pi/k):k\in\mathbb Z_{\ge1}\}$ is shattered by the set of all open finite convex regions. For any subset of the positive integers $N\subseteq\mathbb Z_{\ge1}$, we may index into it with the positive integers starting from 1 in ascending order, so $N=\{n_1,n_2,n_3,\cdots\}\wedge n_1. For any point $A_{n_k}=(\cos\pi/n_k,\sin\pi/n_k)$, we define a shadow point $A_{n_k}':=A_{n_k}(1+n_{k+1}^{-4})$, where we take $n_{|N|+1}:=\max N+1$ if $N$ is finite. Taking the open polygon with vertices at all the shadow points and at the origin, it’s clear that it contains $P_N$ (the restriction of $P$ to the points indexed by $N$). It remains to show that it does not contain $P\backslash P_N$.

An equivalent way of defining the polygon is as taking the union of (closed) triangles $\bigcup_{1\le k<|N|}\Delta OA_{n_k}'A_{n_{k+1}}'$ and then excluding the border.

Let $i be positive integers. If $k, then $A_{n_k}\notin\Delta OA_{n_i}'A_{n_j}'$ is obvious. Otherwise, wave hands around for a while, mumble a bit about how this should be obvious, give up and hit the “Update” button, and call it a day. QED!

Well, the idea is just that we define shadow points just slightly extending out of the original semicircle, so that our resulting open polygon is still convex and does not contain unwanted points. The “hard” part is finding exactly how much to extend these points. I think I got at least the order correct, but I’m not sure about constant factors and whatnot. But whatever, that’s not what the main question posed in this post, which is about freely-rotatable rectangles.

I thought that there might be something to do with the region (in the rectangles case) being bitonic (i.e. every straight line crosses the boundary of the region at most twice), thereby forcing impossibility for some $n$, but obviously convex regions are also bitonic and we don’t have similar impossibility results for the set of convex regions (regardless of whether we need them to be open).

If we extend to the set of parallelograms, then our concept space would be invariant under affine transforms. Does this help? Is this an easier problem than the original?

If that’s not a wide enough extension, maybe consider the set of all convex quadrilaterals. Now this is closed under projection transforms. (Wait, convexity is preserved by projection, right? Or is it not?)

## A generalisation of the alpha max beta min approximation to higher dimensions

Note: not original research. I stole this from the xkcd forums because it looked interesting. Actually, you know what, go read that thread instead.

To generalise α-max-β-min to n dimensions, given an n-dimensional vector, we take the absolute value of each coordinate and sort them in descending order, then take the dot product of that with a fixed reference vector to approximate the length (Euclidean norm) of the original vector. This essentially dissects n-space into 2nn! congruent slices, depending on the sign of each coordinate as well as the ordering; we may choose the slice where the coordinates are all already nonnegative and in descending order. Further, we may normalise the length of the input vector, so this leaves us with only a simplex on a (n−1)-sphere. (Hooray, arbitrary-dimension spherical geometry!)

To minimise the maximum relative error of α-max-β-min, the error should be the same on all the n vertices of the aforementioned simplex. These vertices are given by (1,0,0,0,…), (1,1,0,0,…), (1,1,1,0,…), etc. up to normalisation. Since the error is a monotonous function in terms of the angular deviation between the input vector and the reference vector (which is always strictly less than π/2 thanks to the coordinates being nonnegative), this means that (the direction of) the reference vector must be equidistant from all the vertices of the simplex. In other words, this amounts to finding the centre of a circumhypersphere, and to do this we simply find a vector orthogonal to all of the simplex’s edges (in Euclidean n-space).

At first it sounds daunting that we have to consider n(n−1)/2 edges to be orthogonal against, but note that the edges are not all linearly independent. Indeed, obviously they can’t, since n(n−1)/2 is way larger than the dimension of the vector space, which is only n. We can drop this by one more because the dimension of the vector space spanned by the edges is only n−1. Now all we have to do is Gram-Schmidt, after which we obtain w = (√1−√0,√2−√1,√3−√2,√4−√3,…) as the direction of our reference vector.

The next step is to find how much to scale this to obtain our reference vector b = rw. To do this, we equate the absolute value of the errors in the w direction and in the direction of any of our simplex’s vertices. The signs of these errors are guaranteed to be opposite, so this leaves a linear equation in r, which gives r = 2/(1+|w|).

Now, the form of w is actually independent of the number of dimensions, so it seems like maybe b could converge as the dimensionality tends to infinity. And in fact it does converge pointwise… to the zero vector. The reason is that |w| diverges to positive infinity as the number of dimensions increases, so dividing by that is essentially multiplying by zero.

In other words, this approximation to α-max-β-min gets unboundedly bad as the number of dimensions increase. Furthermore, we should also note that it’s also a lot less useful as a speed optimisation once we reach higher dimensions, because sorting the coordinates requires linearithmic time, whereas just directly computing the length of the vector is linear-time.

We could recursively apply the original two-dimensional α-max-β-min to pairs of coordinates, and then pairs of these pairs and so on, but I think the error here would be even larger than the “one-pass” n-dimensional generalisation given above.

## Solving twisty puzzles with neural networks, part 3 of ???

Maybe thinking about how to progress after training a neural network to solve the “two-face domino” (see part 1 for a description) was thinking too far ahead.

It takes some ridiculous amount of CPU power to perform backpropagation (increases at least quadratically with the number of neurons in each layer), and multiply that by however many times we have to feed random cases, and then multiply that by however many random cases we’re teaching the neural network with. I said 4900 was a small number before, but if we have to go through each of all these 4900 states tens of thousands of times, this suddenly becomes very nontrivial.

Using more than one hidden layer seems to have the effect that the network converges to returning almost exactly the same values regardless of input, though I also noticed this on another training set I was playing with yesterday, where cranking up the number of training iterations to 30000-ish actually started converging to something useful. Doing that many iterations on this network seems like it’d take forever, though.

Basically, this miniproject might as well be dead unless I can figure out how to make it work. Maybe trying to teach the NN optimal solutions was the wrong thing to do, as I’d been suspecting for a while, and we should go with teaching while scrambling.

## Solving twisty puzzles with neural networks, part 2 of ???

I side-tracked a bit on watching anime and stuff, so uh, yeah, there hasn’t been much progress on this little project. This post is just a minor status update.

I sat down for a couple of hours, tried to work out the equations for backpropagation on my own after consulting a whole bunch of sources, and then wrote an implementation of a feedforward neural network in JS. I think it works. Maybe.

Some common choices of activation functions for NNs are the logistic function $x\mapsto 1/(1+e^{-x})$ and the hyperbolic tangent $x\mapsto\tanh x\equiv\frac{e^{2x}-1}{e^{2x}+1}$. These satisfy the differential equations $y'=y(1-y)$ and $y'=1-y^2$, respectively. Being able to calculate the derivative directly from the value simplifies the implementation of backpropagation slightly, but it’s probably not all that big of a difference for other activation functions in general. Anyway, currently, to switch between these two activation functions requires commenting out a few lines and uncommenting a few others, so I spent a while testing this out.

I’ve noticed that, unlike all the success stories we hear about how awesome neural networks are, my implementation doesn’t seem to correctly learn from the input data if there’re more than five or so nodes in the hidden layer. I have no idea if it’s caused by a bug or if this is expected of real NNs and I just need to train for more iterations or something. tanh seems to be slightly more “robust” than the logistic function, but they’re both problematic.

## Solving twisty puzzles with neural networks, part 1 of ???

Sort of a continuation of the previous post.

But first, a note about left cosets versus right cosets. Last year I wrote that the de facto standard for solving to subgroups of twisty puzzles was to use right cosets rather than left cosets, which I found a tad strange. In the process of adapting some of my rotate4 solver code for the content of this post, I noticed that my compose function was actually composing permutations in the opposite order. (Maximum irony in that my original Python solver had the comment that “the argument order is reversed”.) With that rectified, it became more natural to treat the moves as right actions rather than left actions. This also meant that using right cosets was the “obviously correct” thing to do.

Consider just the U and D faces, and restrict ourselves to the 〈U, D, F2, B2, L2, R2〉 subgroup. In other words, we have four each of white corner stickers, white edge stickers, yellow corner stickers and yellow edge stickers, along with the fixed white/yellow centre stickers. The moves act on these stickers by permuting them. The corner stickers can permute in any way, as can the edge stickers, so this has a total of $\binom84\binom84=4900$ states. This is small enough that we can compute a lookup table for every single state very quickly, so that is in fact what I did. Note also that the D move is actually redundant, since that is equivalent to a U move followed by a rotation of the whole puzzle.

The next step would be to use this info to train the neural network, which is where I’m getting stuck because I have yet to write a single line of code for the neural network part of this whole experiment.

As outlined in the previous post, the number of input nodes for the neural network would be the number of stickers times the number of colours, so in this case, we would have 18⋅2 = 36 input nodes. An input node for a sticker location/colour pair would be activated (i.e. set to 1) if there is indeed a sticker of that colour at that location, and otherwise deactivated (i.e. set to 0). We would have one output node for every “primitive” move, along with one more to signify that the puzzle is solved, so that’s |{“solved”, U, U’, D, D’, F2, B2, L2, R2}| = 9 output nodes. What we want is that, after training the neural network, only the “solved” node will be activated if the puzzle is solved, and otherwise the nodes corresponding to moves that bring the puzzle closer to a solved state will be activated.

In other words, we reduce solving the puzzle to a classification problem about what move to make. Then we just keep applying the move the neural network tells us to make, until it tells up to stop.

In this case, we have only 4900 distinct states, so it’s (probably) feasible to train the NN using every single state. This feels a bit like defeating the entire purpose of having a neural network, but it’s not like we can continue doing this as we move on to more complicated puzzles.

The training strategy will also be a problem, as we get to puzzles where we cannot quickly produce optimal solutions, such as… the Rubik’s cube itself. It already takes anywhere from a few seconds to a few minutes to optimally solve a random state, and the algorithm used by Kociemba’s Cube Explorer has the particularly pathological case where the cube is in the “opposite colours” subgroup but not the square subgroup. (For example, the N-perm PLL takes over ten minutes for optimality to be proven, compared to only two minutes for the V-perm PLL, even though both are 14 moves away from being solved.)

A reasonable method of generating training samples could be to start with the solved state, and then scramble it by applying random moves. Along the way, we record the states traversed and the moves applied, and then use the inverse of the moves to train the NN. This still requires a “magic parameter” in the number of random moves to apply, but hopefully we can hard-code that as 40 or something and be done with it. Or, if we really want to go with the “teach a computer how to cube” idea, we feed in only a small number of random moves every time (say, 10 moves) and hope the NN can learn to extrapolate to solving the whole cube.

Too many ideas, too little code. I’m currently using JS for this project since I already had a sizeable chunk of code written in JS, but it looks like feedforward neural networks involve some linear algebra stuff and I’d hate to have to implement that from scratch. I’ll have to seriously read up on backpropagation later today. Oh, and recurrent neural networks seem like a better fit to this problem (feed in a fixed input, and the RNN spews out a list of outputs, rather than generating the output only once), but are also more complicated to implement.

## Some cubing thoughts

Something I’ve been wondering for a while was if it was feasible to train a neural network to solve the cube. There are a few things to care about here. What does it even mean to train an NN to solve a cube? We need to feed the NN some kind of training data, which can either be sourced from lots of human solutions or by optimally solving lots of cubes. Both of these are pretty slow, so perhaps we should start with a smaller puzzle than a whole Rubik’s Cube. The input to the NN would be the sticker colours (represented as 54 six-bit bitfields), and the output would be one of the twelve possible quarter turns.

One reason we would want to use sticker colours as the input rather than pieces is that thinking of the puzzle in terms of “pieces” seems to be too much of a high-level insight, and it feels like cheating to hard-code that into the NN. After all, most human beginners don’t instantly think of the puzzle in terms of corner cubies and edge cubies that can’t mix.

Further, if the NN only outputs one move every time it’s accessed and it doesn’t store “history” of the previous moves, there’s a potential problem in it returning some move, then undoing that, then reapplying it, etc., getting stuck in an infinite loop. It would seem to be quite annoying to try to encode the structure of “human” algorithms if we’re not even allowed to have memory of what the cube looked like a few moves ago, because human algorithms tend to make liberal use of hard-coded move sequences we recall from memory, or some other kind of abstraction that allows us to think in terms of groups of moves rather than each move individually.

But first, I need to do more reading up on neural networks. My currently-shallow understanding probably won’t get me very far.

On another note, I was fooling around with the idea of using two-cycles to solve the cube, one piece at a time. This is a technique used in some blindfold solving methods (classic Pochmann, M2/R2), and its popularity stems from the fact that setups tend to be very short (compared to some three-cycle methods) and that it’s (supposedly) very easy to keep track of the piece cycles during a blindfold solve. (Personally, using three-cycles to solve two pieces at a time works slightly better for me, but that might be because I never bothered learning a proper two-cycle-based method.)

The classic Pochmann method uses some of the PLLs that effect a two-cycle in the corners and a two-cycle in the edges, specifically J/T/Y. The other corner/edge two-cycle PLLs are F, R, N, V, but these have somewhat worse execution speed and it’s more annoying to set the pieces up appropriately to use them. I suck at executing Y perms too, so let’s also leave that out.

It’s essentially possible to solve the entire cube using just the J/T perm algs to solve one piece at a time via conjugations, though this is incredibly inefficient on any aspect other than the short setups. If we’re willing to sacrifice the advantage of having very short setups, then we can set up both a corner and an edge at every step, so we solve one corner piece and one edge piece at the same time. This would reduce the number of times we have to execute J/T from 20-ish to only 12-ish, which sounds impressive except for the fact that executing the PLL algs can be over twice as fast as all the other moves in the solve since they don’t require active thought. There are also a total of 462 combinations of positions for an unsolved corner and an unsolved edge, so the setup moves have to be conjured on the fly unless you want to memorise the setups for all these combinations. More thinking means less speed.

So that’s not a very useful idea. Let’s stick to one piece at a time. Suppose our buffer pieces are URF and UR, the corner target is URB and the edge target can be UL, UB or UF. If we’re doing a non-blindfold solve, we can start by placing as many pieces correctly by “intuition” as possible so as to reduce the length of the steps, but this is totally optional.

Step 1: solve the corners, one piece at a time. We always use UF and UR as the edge buffer here (i.e. we always use J perms). If the original permutation was odd, we swap URF and UBR after solving all the corners. (If we’re doing a sighted solve, then we don’t need to mentally keep track of which edges we’re moving where, so we can use any two-cycle PLL alg we prefer.)

By convention, we’re taking Ja to swap URF/UBR and UB/UR, Jb to swap URF/UBR and UF/UR, and T to swap URF/UBR and UL/UR. Also, we’ll take [x:y] to mean the conjugate x y x’.

target moves
UFL [L2 B2 : Jb] or [U’ : Ja]
ULB [B’ : Jb]
UBR Jb
DLF [D’ B2 : Jb]
DBL [B2 : Jb]
DRB [D B2 : Jb]
DFR [D2 B2 : Jb]

etc.

Only listing the seven of 21 cases where the corners are already oriented. Note that for the UFL case, we can set the pieces up to use Ja instead of Jb, which saves on two moves. (Or we can rotate the whole cube to save on four moves, but lol cube rotations.)

Step 2: solve the edges. This time we have some freedom in choosing where we want to set up the target edge.

Let Tf be (R2 B’ R’ B R’ F’ U’ F R U R’ U’); this is like a T-perm, but it flips the UL and UR edges. I can actually execute this slightly faster than a T-perm, though it’s more tiring to execute as well. This isn’t a common alg, so alternatives are listed. Note that [M’:Jb] and [M:Ja] have move cancellations and are just wide-move variants of the J perm.

target moves
UF Jb
FU [l’ : Ja]
UL T
LU Tf or [L2 D M’ : Jb] or [L2 D’ M : Ja]
UB Ja
BU [l : Jb]
FL [L’ : T]
LF [L’ : Tf] or [d’ L : T]
BL [L : T]
LB [L : Tf] or [d L’ : T]
BR [d2 L’ : T]
RB [d L : T]
FR [d2 L : T]
RF [d’ L’ : T]
DF [l2 : Ja]
FD [M’ : Jb]
DL [L2 : T]
LD [L2 : Tf] or [D M’ : Jb] or [D’ M : Ja]
DB [l2 : Jb]
BD [M : Ja]
DR [S’ : T] or (d l2 B2 L U L’ B2 l B’ l y)
RD [S’ : Tf] or [D’ M’ : Jb] or [D M : Ja]

For the LD and RD cases it might be better to use the non-Tf alternatives, because S moves suck and we have move cancellation in [M’:Jb] and its mirror. If we consider [M’:Jb] as a single alg, then the setup takes less than one move on average!

## Nonexistence of a simple group of order 144

Question 7 of the Algebra I exam; I screwed up because I forgot about prime-index normal subgroups and whatnot.

First, some lemmata (from the notes).

Lemma 1: Let $G$ be a simple group, and let $H$ be a proper subgroup of $G$. Then $G$ is isomorphic to a subgroup of the symmetric group over the cosets of $H$, $S_{G/H}$. Furthermore, $|G|\le[G:H]!$.

Proof: Consider the homomorphism $\pi:G\to S_{G/H}$ induced by left multiplication on the cosets of $H$. (Specifically, $\pi:x\mapsto(yH\mapsto xyH)$.) The kernel must be a normal subgroup of $G$, and hence $\ker\pi\in\{\{e\},G\}$.

If $\ker\pi=\{e\}$, then $\pi$ is an injective homomorphism, giving $\pi(G)$ (the image of $G$ under $\pi$) as the desired subgroup of $S_{G/H}$. If $\ker\pi=G$, then $\forall x\in G\quad xH=H$, and so $\forall x\in G\quad x\in xH\subseteq H$, but this implies $H=G$, contradicting the assumption that $H.

(Note that if $[G:H]>2$, then the subgroup of the symmetric group must contain only even permutations and hence is a subgroup of the alternating group, but we don’t need this fact here. This can be proven by showing that a nontrivial proper normal subgroup of $G$ exists after assuming it contains some odd permutation. The proof fails for $[G:H]=2$ because in that case, the normal subgroup obtained is the trivial subgroup, and indeed the only simple group that can have an index-2 subgroup is $\mathbb Z_2$.)

Lemma 2: Let $G$ be a finite group with a subgroup $H$ of index 2. Then $\forall x\in G\backslash H\quad 2|o(x)$. As an obvious consequence, at least half of the elements of $G$ have even order.

Proof: Let $x\in G\backslash H$. $x$ acts on the cosets $G/H=\{H,xH\}$ by interchanging them, so $H=x^{o(x)}H=x^{o(x)\,\bmod\,2}H$. Therefore $o(x)\equiv0\pmod2$.

(Not actually from the notes, but a homework problem. Again, this proof readily generalises. Let $G$ be a finite group with a subgroup $H$ of index $p$, where $p$ is the smallest prime factor of $|G|$. Then $\forall x\in G\backslash H\quad p|o(x)$. There’s a subtlety that makes this statement fail to be true if $p$ is not the smallest prime factor—in that case, the group action does not necessarily lead to a cyclic group. For instance, consider the complement of any $\mathbb Z_2$ in $S_3$. The same argument essentially also leads to the standard theorem that $H\trianglelefteq G$ if the index is the smallest prime factor.)

Lemma 3: Let $G$ be a finite group and $p$ be a prime dividing the order of $G$. Suppose that $n_p\not\equiv1\pmod{p^2}$. Then there exist $P,Q\in\mathrm{Syl}_p(G)$ such that $[P:P\cap Q]=[Q:P\cap Q]=p$. Furthermore, $P\cap Q$ is a normal subgroup of both $P$ and $Q$.

Proof: Since the Sylow theorems imply $n_p\equiv1\pmod{p}$, we have that $n_p\ge1+p>2$.

Suppose for contradiction that there does not exist two distinct Sylow p-subgroups with intersection of index $p$ in either Sylow p-subgroup. This implies that the intersection of any two Sylow p-subgroup must have index at least $p^2$. Let $P\in\mathrm{Syl}_p(G)$. Consider the orbits under $P$ acting on the set of Sylow p-subgroups by conjugation.

$\mathcal O(P)=\{P\}$, since the conjugation of $P$ by any element of itself must remain as $P$. Let $Q$ be any other Sylow p-subgroup. $\mathcal O(Q)$ must have cardinality divisible by $[P:P\cap Q]$ and hence by $p^2$, since $P\cap Q\le\mathrm{stab}(Q)$ and $|\mathrm{stab}(Q)||\mathcal O(Q)|=|P|$. Since $n_p$ equals the sum of the cardinalities of the disjoint orbits, and that every orbit other than $\{P\}$ has cardinality divisible by $p^2$, we have that $n_p\equiv1\pmod{p^2}$, a contradiction.

Therefore the initial supposition cannot hold, and there exist $P,Q\in\mathrm{Syl}_p(G)$ such that $[P:P\cap Q]=[Q:P\cap Q]=p$. Since an index-p subgroup of a p-group is necessarily normal, it follows that $P\cap Q\trianglelefteq P$ and $P\cap Q\trianglelefteq Q$.

Proposition: Let $G$ be a group of order 144 = 24 32. Then $G$ is not simple.

Proof: Suppose for contradiction that $G$ is simple.

We have from the Sylow theorems that $n_3\in\{1,4,16\}$. $n_3=1$ can be ruled out since that would imply that the Sylow 3-subgroups are normal. Noting that the index of the normaliser of any Sylow 3-subgroup is $n_3$, $n_3=4$ cannot hold, because otherwise the first lemma would imply $144=|G|\le4!=24$. Therefore $n_3=16$.

Since $n_3\not\equiv1\pmod9$, let $P,Q\in\mathrm{Syl}_3(G)$ such that $[P:P\cap Q]=[Q:P\cap Q]=3$. This implies $|P\cap Q|=9/3=3$. Let $R:=P\cap Q$. From the inclusion-exclusion principle, we have that $|P\cup Q|=|P|+|Q|-|R|=9+9-3=15$. Consider the normaliser of $R$.

$R\trianglelefteq P\implies P\le N_G(R)$ and likewise, $Q\le N_G(R)$. Therefore $P\cup Q\subseteq N_G(R)$ and $|N_G(R)|\ge|P\cup Q|=15$.

$P\le N_G(R)\le G\implies9|N_G(R)\wedge N_G(R)|144$, by Lagrange’s theorem. It follows that $|N_G(R)|\in\{18,36,72,144\}$, which we will consider separately.

If $|N_G(R)|=18$, then noting that $[N_G(R):P]=18/9=2$, by the second lemma, at least nine elements have even order. However, we know that every element in $P$ or in $Q$ has odd order (since these groups have odd order), and so the fifteen elements of $P\cup Q\subset N_G(R)$ have odd order, leaving at most three elements to have even order, a contradiction.

If $|N_G(R)|\in\{36,72\}$, then $[G:N_G(R)]\le4$ and by the first lemma, $G$ is isomorphic to a subgroup of $S_4$. This has already been earlier shown to be impossible.

If $|N_G(R)|=144$, then $R$ is a nontrivial proper normal subgroup of $G$, contradicting the supposition that $G$ is simple.

Therefore the initial supposition that $G$ is simple cannot hold, and $G$ is necessarily nonsimple.

## FMCをやってみた

#99

Scramble: R’ L F B2 D2 U R’ D L2 F B’ L2 U R’ L F B2 U2 D2 R2

Annotated solve:
inverse:
D’ L F B’ D B2 (6; 2×2×2 block)
U’ F’ U2 F’ U’ F2 (6/12; 2×2×3 block)
(F) U F2 L2 F (4/16; EO)
L U2 L2 U’ (L’) (4/20; F2L)
(L) R U2 L’ R’ B’ U F’ U2 B U’ F U’ (12/32; G perm)

Solve:
U F’ U B’ U2 F U’ B R L U2 R’ U L2 U2 L’ F’ L2 F2 U’ F U F U2 F U B2 D’ B F’ L’ D (32f)

Notes:
I tried to substitute the G perm with an antisune to get a skeleton (lol, skeleton with 20+ moves), but none of the four possible eight-mover insertions had any move cancellations and they all gave longer solutions. Besides, that’s one nice OCLL skip.

#98

Scramble: R F D U R2 D’ L’ R U B’ L’ U2 L D B U F B’ L’ U2

Annotated solve:
R D’ L F2 D2 F (6; 2×2×2 block)
U’ B L2 U L (5/11; 2×2×3 block)
B’ L U L’ (4/15; extra 2×2 block)
inverse:
L U L’ U’ (4/19; EO)

inverse:
B2 U2 B’ U’ B U (B’) (6/25; F2L)
(B U) B’ U B U’ B’ U B U2 B’ U2 (10/35; double Sune)
B2 (1/36; fix pseudoblock)

Solve: R D’ L F2 D2 F U’ B L2 U L B’ L U L’ B2 U2 B’ U’ B U2 B’ U B U’ B’ U B U2 B’ U2 B2 U L U’ L’ (36f)

Notes:
First time NISS was actually useful for me. Without the second inversion, F2L and 2GLL would’ve taken the same number of moves total, except there wouldn’t be move cancellations to reduce the move count. The bad thing about having 2GLL is that it’s impossible to use a short alg to fix edges and then insert a corner 3-cycle, but on the other hand I know a handful of optimal 2GLL algs. Also, it’s easy to write the moves with my right hand while executing them with my left.

I’ll try doing some more of the past weekly scrambles for practice.

#104 (2015-12-16)

Scramble: L’ D2 F2 B’ U’ L’ D’ U’ R2 B2 F R’ L’ F’ B’ U2 D R2 L’ B

Annotated solve:
D’ F’ @ U’ R B (5; EO)
L U2 F2 L2 U (5/10; partial F2L)
D’ R D (3/13; keyhole edge insertion)
L’ R’ U R (U’) (4/17; F2L-1)
(U R) U’ R U R2 U’ R’ L2 (7/24; edges)
Insert R D’ ^ R’ U’ R D R’ U at @ (8−4/28; corner 3-cycle, four moves cancel)
Insert D L’ D’ R D L D’ R’ at ^ (8−3/33; corner 3-cycle, three moves cancel)

Solve: D’ F’ R L’ D’ R D L D’ R2 U’ R D B L U2 F2 L2 U D’ R D L’ R’ U R2 U’ R U R2 U’ R’ L2 (33f)

I actually spent multiple hours before finding this skeleton that had nice cancellations. Also, my first EO-first attempt, as well as my first attempt with so many moves cancelled. There were a few blocks and almost-blocks right from the start, but after spending a lot of time trying to figure out how to make use of them, I gave up and trashed all but one corner-edge pair. I was going to use the U-D axis as my solving direction (as usual), but I noticed that the left face was mostly done and went with that instead.

One earlier attempt at this scramble using Petrus had a very nice-ish F2L finish (only seven moves), but the last layer was 13 moves optimal and couldn’t be broken up into commutators. Screw that.

#105 (2015-12-22)

Scramble: U2 L2 B2 R2 U’ D2 R’ B’ R2 L B2 R’ D2 L’ R U’ L2 F2 R2 L’

Annotated solve (#1):
L’ D2 (2; 2×2×2 block)
R U’ F U’ (R) (4/6; 2×2×3 block)
switch: R’ (1/7; premove)
switch:
(R’) F’ R F (3/10; slot)
U’ F U’ F’ L’ U’ (L) (6/16; EO and F2L)
(L’ U’) L U’ L’ U2 L / (L) U L’ U R’ U L U’ R U2 L’ U (5+11/32; two-alg ZBLL)

Annotated solve (#2):
B L’ D2 R U’ F U’ (7; 2×2×2 block and extra 2×2 block)
R F2 U’ R (4/11; EO and matching up the extra block)
F U B U’ B’ U2 B U2 B’ (9/20; slot)
U’ F’ U (3/23; F2L)
R2 U’ F B’ R2 F’ B U’ R2 (9/32; edge 3-cycle)

Annotated solve (#3):
@ B2 L’ D2 (3; 2×2×2 block)
R F2 U’ F’ (4/7; EO)
R2 U’ (2/9; 2×2×3 block + switch EO axis)
F R’ F R2 F2 (5/14; slot)
(F) R’ F’ (2/16; skeleton)
Insert D L’ U’ L D’ # L’ U L at @ (8/24; corner 3-cycle)
Insert D’ R D L’ D’ R’ D L at # (8−3/29; corner 3-cycle, three moves cancel)

I used Insertion Finder for the third attempt, so I guess that one doesn’t really count. 16-move skeleton is neat, though it’s mostly pure luck. I mean, the scramble did start with a 2×2 block ready to be exploited, and just like the #104 attempt above, I switched solving direction halfway through.

## Programming is hard(?)

Like I mentioned before, I’m taking CS1010S Programming Methodology this semester, which is an introductory course on programming taught in Python. It’s part of the program requirements (pun, etc.) for applied maths and statistics majors, but there’re also plenty of random people like myself around who are just taking it for fun.

Side note: I only recently found out that the “evil counterpart” CS1101S Programming Methodology, which has a functional programming focus, is taught in… JavaScript. And yes, it’s officially nicknamed CS1010S’s evil counterpart. Probably understandable, because how the heck would you teach functional programming in JavaScript?

I have about ten years of programming experience if you want to be really generous with what the words “programming” and “experience” mean, but that’s still a gigantic headstart compared to my classmates, most of whom seemingly have had zero programming knowledge.

The course materials for the module are not hosted on IVLE, but on this platform called Coursemology, which does this very buzzword-y thing of trying to gamify the learning process to better engage the students. (Not to be confused with trying to game the learning process, which would have the opposite effect.) I’ve mentioned about the leaderboard feature before, and of course I’m still near the top, though I probably won’t be able to claim first or second place by the end of the semester because I’ve lost too many points to careless mistakes. There’s still a contest for writing a 2048 AI open which I believe I have a high chance of winning (and getting extra points), but it’s not like I can tell what my classmates have implemented. We’ll see. (Edit: the results were already announced during the one lecture I overslept through, and apparently my code was so fantastic I got a bonus in addition to the first-place prize. The lecturer also flashed my smugshot on the projector, which was, uh, extra lols.)

The site also has forums, which were quite active until the post rate mysteriously plummeted about two weeks ago. There, we can start threads to clarify our doubts and to seek help in resolving problems, and, oh dear, the questions they ask. I don’t mean that the questions themselves are bad; it’s just that I hang out mostly with programmers on IRC and I’m a somewhat seasoned programmer myself, so it’s a bit of a culture shock to notice that there actually are a lot of people who aren’t programmers!

It seems that I’ve forgotten that I spent a few years completely struggling with code before I reached my current standard, and in comparison, the CS1010S module is taught over less than four months. No wonder so many people have difficulty keeping up with the pace. Taking this module also opened my eyes as to how hard teaching programming can be. Python is very simple to understand if you don’t touch anything remotely hairy. Like, say, if you restrict yourself to playing with integers. Except you can’t divide, because then you’d get into the territory of floating-point arithmetic, which is a 9 out of 10 on the Hairiness Scale. (It’s not hard to understand per se, but most people don’t even try and spread misconceptions about float arithmetic all the time.)

The very concept of a variable isn’t exactly easy to convey either. We could go with a bottom-up approach in teaching how variables work, e.g. “in Python a variable is really just a pointer to some data”, except Python doesn’t have pointers so now you’re referring to a concept that’s external to the language. It’s a notion that should be immediately familiar to anyone who’s used any mainstream programming language, but the problem is that a lot of people haven’t.

Functions are also seemingly hard for many students to grok. Functions were taught in calculus classes, right? Right?! Well, sort of. In introductory calculus classes, there often isn’t much of a strict distinction between the notion of variables being related by an equation and that of a variable being a function of other variables. They’re sort of interchangeable because we can divine the exact meanings through context, and that’s why you see the derivative of a function $f$ being written as $df/dx$ instead of $df(x)/dx$ even though the former is technically an abuse of notation. It might be because we’re so used to this abuse of notation that, when it’s taken away from us, our feeble minds start breaking down.

It doesn’t help that languages like Visual Basic and MATLAB try to support it by having f = result_of_doing_some_operations_on_the_input syntax, so the students who have programming experience but in the wrong languages might have even more difficulty understanding why mainstream high-level programming languages don’t work in the same way.

The brain damage caused by conventional high-school-level mathematics education to our notions of what functions are has the side effect that understanding “higher-order functions” is a lot harder than it should be. If you think of a function as just a thing that maps things to other things (as in conventional college-level mathematics education), higher-order functions are just the same as ordinary functions! And here, it doesn’t help that “higher-order function” is the disjunction of two related but distinct concepts: a function that takes in a function as input, and a function that returns a function as output. Of course students are confused.

A lot of this post is extrapolated from what I’m seeing people ask about in the forums and during the tutorial sessions, rather than any personal difficulty I’m experiencing with programming or this module, so there’s a reasonable chance that my interpretations are off the mark. I think there’s some interesting research to be done with the course data (especially with the Coursemology platform, which allows tutors to view every single student’s submissions), but I mainly just want to sate my own curiosity.