> It’s known that applying a butterfly of the following form to the input splits a DCT-II of $2n$ inputs into a DCT-II of $n$ inputs and a DCT-IV of $n$ inputs.

With the caveat that, if the butterfly is done in place (as with a lifting scheme), the DCT-IV operates on reversed input; ergo, the final permutation is probably not a bit-reversal permutation as stated in the last post. The overall construction ended up being right because I directly solved for three shears that would produce a correct four-input DCT-II, not for a correct two-input DCT-IV (which wouldn’t have been correct anyway, since it’s the wrong transform).

For that matter, a DCT-IV on reversed input is a DST-IV with every other output term negated.
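A quick numerical check of this, using scipy’s type-4 transforms (scipy assumed):

    import numpy as np
    from scipy.fft import dct, dst

    x = np.random.randn(8)
    lhs = dct(x[::-1], type=4, norm='ortho')                  # DCT-IV of reversed input
    rhs = dst(x, type=4, norm='ortho') * (-1)**np.arange(8)   # DST-IV, every other term negated
    assert np.allclose(lhs, rhs)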

I noticed the error when I was trying to work out time/frequency resolution switching (as in Opus and the Daala demos) on paper.

## Do you even lift?

Yesterday I was playing a bit with the DCT-II on four inputs, which turns out to factorise as a lifting transform quite readily, and along the way I noticed some stuff.

Focusing on 2×2 matrices with determinant 1: all such matrices can be decomposed into three shears, with the sole exception of the diagonal matrices (which, excluding the identity matrix, require four shears). Form the following equation:

$\displaystyle\begin{pmatrix}a&b\\c&d\end{pmatrix}=\begin{pmatrix}1&\\\alpha&1\end{pmatrix}\begin{pmatrix}1&\beta\\&1\end{pmatrix}\begin{pmatrix}1&\\\gamma&1\end{pmatrix}=\begin{pmatrix}1+\beta\gamma&\beta\\\alpha+\gamma+\alpha\beta\gamma&1+\alpha\beta\end{pmatrix}$

Then solve for $\alpha$, $\beta$ and $\gamma$: reading off the entries gives $\beta=b$, $\gamma=(a-1)/b$ and $\alpha=(d-1)/b$ (the $c$ entry then comes out right automatically, because the determinant is 1). This handles the case where $b\ne0$; for the case where $b=0$ but $c\ne0$, use the following instead.

$\displaystyle\begin{pmatrix}a&b\\c&d\end{pmatrix}=\begin{pmatrix}1&\alpha\\&1\end{pmatrix}\begin{pmatrix}1&\\\beta&1\end{pmatrix}\begin{pmatrix}1&\gamma\\&1\end{pmatrix}=\begin{pmatrix}1+\alpha\beta&\alpha+\gamma+\alpha\beta\gamma\\\beta&1+\beta\gamma\end{pmatrix}$

This doesn’t work if $b=c=0$, but then we can use something like two scaled rotations (each implemented with two shears).
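As a sanity check, here’s a sketch that reads the shear coefficients off the two factorisations above and verifies one case (plain Python; the helper names are mine):

    from math import cos, sin, pi

    def three_shears(a, b, c, d):
        # Decompose [[a, b], [c, d]] (determinant 1) into three shears.
        # 'L' means lower-upper-lower as in the first factorisation,
        # 'U' means upper-lower-upper as in the second.
        assert abs(a*d - b*c - 1) < 1e-12
        if b != 0:
            return 'L', (d - 1) / b, b, (a - 1) / b
        if c != 0:
            return 'U', (a - 1) / c, c, (d - 1) / c
        raise ValueError('diagonal matrix: four shears needed')

    def mul(m, n):
        return [[sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    # Verify on a rotation by pi/5.
    t = pi / 5
    kind, al, be, ga = three_shears(cos(t), sin(t), -sin(t), cos(t))
    assert kind == 'L'
    m = mul([[1, 0], [al, 1]], mul([[1, be], [0, 1]], [[1, 0], [ga, 1]]))
    # m is now [[cos t, sin t], [-sin t, cos t]] up to rounding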

The DCT-II on four inputs is special orthogonal, which theoretically means that it can be implemented with shears and no reflections, but common lifting implementations apparently use reflections as well because that makes things easier. (Reflections are also practically free since a reflection is just a sign flip.)

Define $c_1=\frac1{\sqrt2}\cos\frac\pi8$ and $c_3=\frac1{\sqrt2}\cos\frac{3\pi}8$. The orthonormal DCT-II matrix is then:

$\displaystyle\begin{pmatrix}\frac12&\frac12&\frac12&\frac12\\c_1&c_3&-c_3&-c_1\\\frac12&-\frac12&-\frac12&\frac12\\c_3&-c_1&c_1&-c_3\end{pmatrix}$

It’s known that applying a butterfly of the following form to the input splits a DCT-II of $2n$ inputs into a DCT-II of $n$ inputs and a DCT-IV of $n$ inputs (ignoring scaling factors):

    for i in range(n):
        j = (2*n-1) - i
        x[i], x[j] = x[i]+x[j], x[i]-x[j]

One way to think of this is to split the input into symmetric and antisymmetric components, which then satisfy the boundary conditions of the DCT-II (symmetric on both endpoints) and DCT-IV (symmetric on left endpoint and antisymmetric on right endpoint) on half of the original input length. This can be done recursively on the symmetric half, but it’s not obvious how the DCT-IV can be further factorised in general. In our case we only have a DCT-IV of length 2, which works out to just be a rotation.

Directly applying the butterfly in this manner increases the norm of the input, though, since $\begin{vmatrix}1&1\\1&-1\end{vmatrix}=-2$. It is possible to apply the butterfly with orthonormal scaling, but this turns out to be wasteful because we can absorb the scaling factors into later steps. (Kinda like what was done in the Daala demo I linked before.) An unevenly scaled butterfly can be implemented with zero non-power-of-two multiplies, like the following:

    for i in range(n):
        j = (2*n-1) - i
        x[j] = x[i] - x[j]   # x[j] <- old x[i] - old x[j]
        x[i] -= x[j] / 2     # x[i] <- (old x[i] + old x[j]) / 2

On the other hand, an evenly scaled butterfly would be something like the following, which uses three non-power-of-two multiplies.

    from math import sqrt

    A = sqrt(2) - 1   # tan(pi/8)
    B = sqrt(1/2)     # sin(pi/4)
    for i in range(n):
        j = (2*n-1) - i
        x[i] += A * x[j]
        x[j] = B * x[i] - x[j]   # shear plus a sign flip (a free reflection)
        x[i] -= A * x[j]
    # now x[i] = (old x[i] + old x[j])/sqrt(2), x[j] = (old x[i] - old x[j])/sqrt(2)

There’s a further complication: if you use the above butterfly as-is, the half-sized DCT-II and DCT-IV will have different scaling factors, so you’ll need loads of shears to correct this later on. One way out is to use differently scaled two-input Hadamard transforms. Long story short, the first butterfly stage is just:

    x[3] = x[0] - x[3]       # x[0], x[3] <- (x[0]+x[3])/2, x[0]-x[3]
    x[0] -= x[3] / 2
    x[1] += x[2]             # x[1], x[2] <- x[1]+x[2], (x[1]-x[2])/2
    x[2] = x[1] / 2 - x[2]

Now we need versions of the DCT-II on two inputs (i.e. yet another Hadamard transform) and the DCT-IV that accept scaled inputs and produce normalised outputs. The former is just the reverse of a Hadamard transform accepting normalised inputs and producing scaled outputs, i.e. the reverse of what we’ve been doing for the butterflies. The latter can be done with the three shears introduced at the start of the post. Then, after this is done, we need to rearrange the outputs. Again, long story short:

    from math import cos, pi, sqrt

    c_1, c_3 = cos(pi/8)/sqrt(2), cos(3*pi/8)/sqrt(2)
    Alpha, Beta, Gamma = (1-c_3)/c_1, c_1, (1-2*c_3)/c_1
    # Butterfly stage (two differently scaled two-input Hadamards, as above).
    x[3] = x[0] - x[3]
    x[0] -= x[3] / 2
    x[1] += x[2]
    x[2] = x[1] / 2 - x[2]
    # Two-input DCT-II on x[0], x[1]: a reversed Hadamard absorbing the scaling.
    x[0] += x[1] / 2
    x[1] = x[0] - x[1]
    # Two-input DCT-IV on x[2], x[3] via the three shears.
    x[3] -= Gamma * x[2]
    x[2] += Beta * x[3]
    x[3] -= Alpha * x[2]
    # Rearrange the outputs into natural order.
    x[1], x[2] = x[2], x[1]
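As a sanity check, here’s a small verification of the lifted version against the orthonormal matrix given earlier (numpy assumed; the wrapper function is mine):

    import numpy as np

    def lifted_dct2_4(x):
        # The lifting steps from above, on a copy of the input.
        x = list(x)
        c_1, c_3 = np.cos(np.pi/8)/np.sqrt(2), np.cos(3*np.pi/8)/np.sqrt(2)
        Alpha, Beta, Gamma = (1-c_3)/c_1, c_1, (1-2*c_3)/c_1
        x[3] = x[0] - x[3]
        x[0] -= x[3] / 2
        x[1] += x[2]
        x[2] = x[1] / 2 - x[2]
        x[0] += x[1] / 2
        x[1] = x[0] - x[1]
        x[3] -= Gamma * x[2]
        x[2] += Beta * x[3]
        x[3] -= Alpha * x[2]
        x[1], x[2] = x[2], x[1]
        return np.array(x)

    c_1, c_3 = np.cos(np.pi/8)/np.sqrt(2), np.cos(3*np.pi/8)/np.sqrt(2)
    M = np.array([[0.5,  0.5,  0.5,  0.5],
                  [c_1,  c_3, -c_3, -c_1],
                  [0.5, -0.5, -0.5,  0.5],
                  [c_3, -c_1,  c_1, -c_3]])
    v = np.random.randn(4)
    assert np.allclose(lifted_dct2_4(v), M @ v)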

Not unexpectedly, this gives a decent speed boost over the “brute-force” implementation of a DCT-II in Python (about 17%), though there’s plenty of overhead involved in Python already, so that’s not exactly a fair comparison. In C, however, the lifting version turns out to be slower (by about 30%!) for both single-precision and double-precision floats, though of course all four C versions are still two orders of magnitude faster than either Python version.

That said, the lifting version does have one useful property: error appears to accumulate more slowly. To test, I applied the orthonormal DCT-II a million times over on the input vector (2,3,4,14), which has norm 15. The single-precision brute-force version ended up with a norm of 15.2877, while the single-precision lifting version ended up with a norm of 14.9962. (Both double-precision versions still had a norm of 15.0000000, to seven decimal places.) Pushing the test to a billion iterations causes the brute-force version to just implode on itself (norm over two billion), though the lifting implementation doesn’t fare very well either (norm ~261); the double-precision lifting implementation does have (slightly) lower error than the brute-force one, though (norm 15.0000000 versus the brute-force’s 15.0000003).
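For reference, the test loop was essentially this (reusing `M` and `lifted_dct2_4` from the sketch above; exact norms will vary with dtype handling and other implementation details):

    x = np.array([2, 3, 4, 14], dtype=np.float32)   # norm 15
    M32 = M.astype(np.float32)
    for _ in range(10**6):
        x = M32 @ x          # swap in the lifting version to compare
    print(np.linalg.norm(x)) # orthonormal, so this "should" still be 15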

Of course, this is unrealistic. I can’t think of any scenario requiring the application of DCT-II more than just once, and here we’re running it a billion times over. For all practical purposes, either double-precision version may be considered to be almost exact. One thing to ponder over here is why a lifting implementation would be slower. Maybe using integer arithmetic (rather than floating point) would switch the ranking? DCTs are also usually applied across multiple blocks (rather than on one input vector), so maybe SIMD would change things too.

PS: the constants for the C version were precalculated with bc to 25 digits of precision, which is a couple more than necessary for double precision and a lot more than necessary for single precision, but just in case!

## I was wrong (again)

> Normalised orthogonal transforms (with some exceptions) would be penalised by having to force points in a lattice to another, rotated at an angle such that (usually) at most one point coincides between both lattices. It follows that unless you have a clever trick to create a bijection between the two lattices, a reversible integer-to-integer variant is impossible.

This is somewhat incorrect, and I really should have known.

It’s known that special orthogonal matrices can be factorised into a (finite) product of 2D rotations, where we define a 2D rotation as an orthogonal matrix of the same size which differs from the identity matrix in four entries, lying in two rows and two columns.

On the other hand, it’s even better known that any 2D rotation can be factorised into three shears. [1] Other than using three shears, a rotation can also be implemented with two shears and one scale (as in [2], though that uses two scales for some reason).

A lifting scheme is exactly a composition of shears and (optionally) reflections. We’re not allowed to directly scale our basis vectors within a lifting scheme, but as above, using a couple of shears allows us to scale our axes arbitrarily as long as the product of the scaling factors is 1.

Note that depending on exactly what’s considered a lifting scheme, direct scaling may in fact be allowed (which in turn means that all linear transforms would have such a lifting scheme), but this affects reversibility. With a shear/reflection-only lifting scheme, reversibility is guaranteed if addition and negation never lose precision. (True if you’re working with integer arithmetic (with caveats regarding overflow), not so true if you’re working with floating point.) On the other hand, if you add in a scaling operator, you now need to require that multiplication and division by all the relevant scaling constants not lose precision, which is impossible with integer arithmetic.

Arbitrary scaling of two axes (where the product is constrained to be 1) can be done with four shears; for example, if we have two inputs $x$ and $y$, the following shears would halve $x$ and double $y$.

$\displaystyle\begin{pmatrix}1&-{1\over4}\\&1\end{pmatrix}\begin{pmatrix}1&\\2&1\end{pmatrix}\begin{pmatrix}1&1\over2\\&1\end{pmatrix}\begin{pmatrix}1&\\-1&1\end{pmatrix}\begin{pmatrix}x\\y\end{pmatrix}=\begin{pmatrix}x/2\\2y\end{pmatrix}$

One can imagine that the series of additions and subtractions serves to “transfer” the precision from $x$ to $y$ if implemented in integer arithmetic. In this particular case, you could think of it as grouping the points of a lattice into pairs, then rotating each pair ninety degrees. I suppose a similar visualisation could also be done for noninteger (or even irrational) scaling factors, but that seems less obvious.
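To make the integer-arithmetic point concrete, here’s a sketch of those four shears with explicit rounding (the rounding conventions are my choice); since each step adds only a rounded function of the other variable, every step can be undone exactly:

    def halve_double(x, y):
        # The four shears from the formula above, applied right to left:
        # (x, y) -> (~x/2, 2y), exactly reversibly in integer arithmetic.
        y -= x
        x += (y + 1) // 2   # x += y/2, rounded
        y += 2 * x
        x -= (y + 2) // 4   # x -= y/4, rounded
        return x, y

    def undo_halve_double(x, y):
        x += (y + 2) // 4
        y -= 2 * x
        x -= (y + 1) // 2
        y += x
        return x, y

    assert all(undo_halve_double(*halve_double(x, y)) == (x, y)
               for x in range(-50, 50) for y in range(-50, 50))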

Many mistakes could’ve been avoided if only I read more carefully. :v

## ⏳

I just got reminded of how I used to spend my time in the school library reading issues of American Scientist. There weren’t a lot of issues stocked up and it didn’t take very long to finish reading all the interesting articles, most of which unfortunately I’ve already forgotten. Curses, memory.

If you take any two unit vectors, their sum and difference are orthogonal. This extends to all pairs of vectors of the same length. This is so obvious it doesn’t seem like it’d need a proof, but it’s not hard anyway. Let the vectors be $a,b$. Then $(a+b)\cdot(a-b)=a\cdot a+b\cdot a-a\cdot b-b\cdot b$, but since $b\cdot a=a\cdot b$ (commutativity) and $a\cdot a=b\cdot b$ (by the stipulation that they have the same length), the dot product of the sum and difference is zero, and therefore they’re orthogonal.

I recall seeing this fact exploited in a not-too-technical overview of Opus. Iirc the left/right channels are normalised to unit norm (the normalisation factors are separately transmitted) and then mid/side values are calculated in the usual way (sum/difference respectively), then the arctangent of |S|/|M| (which determines how directional the audio is) is transmitted along with M and S normalised to unit norm. I don’t really remember the point of doing this, but the guaranteed orthogonality of M and S is probably used somehow.

## Asdf

I got myself a one-year VPN subscription for like an eighth of a bitcoin and somehow it seems that wordpress.com has no more loading issues. I don’t really know what’s up with that but if it works, it works! (I might or might not have to retract this statement in the future.) My connection wasn’t fast to begin with so the extra lag isn’t a big deal.

Addendum: I tried logging into Google and they gave me an extra check because I was signing in from an unusual location. Which is pretty cool imo, even if it’s not a very strong security measure.

## Placeholder

I’m feeling kinda tired so I’ll just wrap this up quickly.

I’m writing a post on another site about deconvolution, and somehow using (an approximation to) the error function as a window function is optimal for a specific case. So I think of searching for things like “erf window function” (without the quotes) and such, but Google tries way too goddamn hard to decide I meant “Windows” (as in the OS) and brings up a bunch of function references on Microsoft’s site and such.

It’s even worse when I try to search “error function window” (also without the quotes) because now I get shit about JavaScript where window is the most common global object, function is a keyword, and errors are everywhere because it’s a terrible language everyone uses.

In the past you used to be able to specify that “I really mean this word so stop autocorrecting it” by prefixing words or phrases with a plus sign, but this no longer works for some reason. In fact, it hasn’t been working in a long time.

## What’s a sop?

So, it’s been a bit over a month since I stopped being a slave to the country.

Have I ever mentioned how there’s like no SOP for anything on the admin side, ever?

I don’t know how the operations side functions, and thank goodness I’m never going to have to know, because it’s very definitively in the category of things I very much do not care about. But I’ve been in the manpower section for twenty-three months, and if anyone is qualified to criticise that, that person would be me. (Versus, say, the people who’ve been in it for decades yet don’t actively try to make their jobs less mind-straining.)

Instructions on how to do things are usually word-of-mouth; written instructions aren’t quite the norm. I planned on writing my own guide for all the things I was doing, but as time passed, the software we use updated and my partial guide got outdated after mere months.

I’ve described my primary job scope on this blog before; for about fifteen months I was the only guy who knew what was going on with anything and everything regarding leave. Part of this job was data entry and document filing, which was for the most part nothing difficult. Well, sometimes people submit forms with illegible handwriting. (Isn’t it frightening how the people protecting our country can’t write?) Sometimes I guess what was meant, but when I simply can’t make out what they wrote, the document gets fed to the shredder. Sorry, I’m not your primary school teacher who should’ve given you penmanship classes.

The hard part is doing all the calculations. Well, not hard for me, except where the Leave Policy is obviously not describing the calculations people actually use. (Like I said, word of mouth.) The entitlement is prorated by duration, and while this proration is not exactly specified, you could sorta kinda figure out the general rule from the examples provided in annex… E, I think. (The examples aren’t non-normative?!) Of course, the examples are wrong sometimes (there’s one wrongly rounded value in the aforementioned annex), sometimes they contain information that’s entirely irrelevant in the current context, and sometimes they have red herrings. Red herrings? In examples? Is this supposed to be a reading comprehension test or a policy document?

The leave system is mostly computerised so people didn’t have to go through me to apply for leave (which presumably wasn’t the case prior to 2006 or thereabouts), but if they were intending to do anything outside the norm or had any issues, I was the person to look for. And boy did issues crop up often. Very. I’ve probably mentioned that I had two predecessors; the people in camp could be grouped into two general classes (very rarely did a person cross from one group to the other), and my two predecessors handled one class each. One of those classes (let’s call it class A) had on the whole very few issues, so for those people leave administration was as easy as plain data entry and filing. The other class (class B) was fraught with leave issues. On the other hand, these issues were mostly easy; when class A people had issues, their issues were the worst and most annoying to handle.

These issues were typically bugs in the computerised system. Or logic gaps, as their service desk personnel call them. Like I mentioned, the Leave Policy is inexact in many places, but computers work on logic, so they definitely had to fudge the text of the Policy in some way to get explicitly specified rules. They’re just the backend guys though; how would they learn of all the word-of-mouth rules that every leave administrator knows about? That was a trick question; they didn’t, and for the longest time there was blatant disagreement between “our” rules and their Policy-based rules. Now, there were leave clerks from other camps who followed their Policy interpretation, so whenever the time came to check leave records from other camps, I basically had to recalculate and check every single line for the class B people. (The class A people usually had no issues, because their records typically extend multiple years before the current computerised system was implemented.)

I’m sure you can tell how much I loved this job which is all about duplicating work.

There were also a couple of occasions where I argued with those backend guys. Those people are sore losers, really. There was also that one time we argued about the aforementioned red herring in the Policy, which was actually something they brought up. (I actually can read; why would I be fooled by a red herring? Then again, I also had the expectation of examples being non-normative, so I never paid much attention to those examples anyway.) This happened a couple of days before my last, and after a couple of emails both ways, the person on the other end gave me a call and I decided to just pretend to be wrong. I had loads of backlog to clear, you see (the result of procrastinating, d’oh!); spending time arguing on the phone doesn’t help.

Anyway, there’s this thing I call the fourteen-day rule in the Policy, which surprisingly is properly specified, but the word-of-mouth variant of it was significantly different. (And apart from the Policy, there was an FAQ published on the intranet circa 2005 (!) which actually seemed to support a literal interpretation of the Policy version.) This rule came into effect very rarely, so people just kinda didn’t care about it. Until there was a case where someone (from class B) complained after he was informed that he had taken a day of leave more than he should have and that he needed to refund the pay for that one day. (Incidentally, apart from the Chinese names, his name was identical to one of my classmates’.)

This generated loads of drama and was sparked off when one of the backend people decided to regenerate all the internal leave records for that guy. Their program reported that he had taken a day extra, but my manual calculation (using the Policy’s fourteen-day rule, and my interpretation of annex E) disagreed with that. This person’s leave records were actually meant to be handled by my predecessor (predecessor B, let’s call him, for he was the one who dealt with class B people), but alas, he had already left for greener pastures. I was already at that point aware of the bug in the system that caused the miscalculation, so I made some leave records of my own (using a format that was designed by the leave clerk three or four generations prior) and predecessor A looked through it. It was then that he figured out the word-of-mouth fourteen-day rule applied, which we had to explain to our three-levels-up superior. (She was the one overseeing the case on our end because the drama blew up to a member of parliament; I usually soloed the smaller and less “important” cases.)

An email was sent, and then there was loads of discussion (in which neither I nor predecessor A was included, for reasons I’m still unaware of) over interpretations of the Policy. In the end, the word-of-mouth variant got implemented in the system. Ever since then, there’d been a couple of bugs with the implementation that prevented people from applying for leave directly through the computerised system, so paper forms it was. Hooray, even more duplicated work. Around mid-2013 I got fed up with this and sent the service desk a bunch of screenshots and a report on some of the people the bug was affecting. They asked me to elaborate, and I did so in a reply, ELI5-style. (Me, condescending? That goes without saying.) Then one of the service desk guys called me up and was like “that was very rude” and blogged about how I should’ve written more politely. Over the phone. By the way, have I mentioned how much I hate talking on phones?

That was one of those phone-call-making-me-mad incidents; I think I’ve written about every other one on the blog already.

Months passed with no reply, so I shot them a reminder asking them to fix it. And then this guy came up with his own interpretation of the fourteen-day rule—i.e. what was implemented in the system at that point—which disagreed with the common word-of-mouth version, and which he argued for in a couple of emails. More months passed without a definitive response on whether it got fixed, until I had to process the affected guys’ leave records again and found that it had finally been fixed. Sore losers, like I said; they couldn’t even be bothered to let me know. (The fourteen-day rule is rarely in effect, as earlier mentioned, and even more rarely so in other camps; I was in fact the first to bring this up to them.)

All that tl;dr above was just me blogging and didn’t really pertain to the topic, sorry. I just felt like writing about that. Anyway, about SOPs.

The Leave Policy actually does provide an SOP for handling, well, leave. Except for dealing with the computerised leave system, because it was written before the computerised leave systems existed. Said SOP was also incredibly tedious and in practice I never bothered following it to the letter. All changes to a person’s leave records must be approved by his commanding officer, but making a one-minute job take any longer just because I need someone’s rubberstamping is way too inefficient. (People also trusted me and rubberstamped my stuff anyway. That trust has never been misplaced, and that is something I’m proud to admit.)

Except! There’re more SOPs, and they just happen not to be included (or even referenced at all) in the Leave Policy. I didn’t know about them until fairly late, and right until the end I still never bothered with them, because they’re plain bureaucratic inefficiency.

On the other hand, being a leave administrator wasn’t the only role I had. These other roles were basically constructed entirely out of word-of-mouth rules. Rewind our timeline a bit to around October last year, when I was supposed to be pushing all my work onto the new recruit. He accepted that he was going to take over leave administration, but he also steadfastly refused to take over all the other things I was doing. The end result was that as I left, he did end up taking over most of my roles, which I never bothered briefing him on because he wouldn’t listen.

This wouldn’t be an issue if only there were official SOPs. There weren’t.

I did mention that I went back the day after my last, but I forgot about all these other roles on that day and didn’t brief him on those. (We were already busy with leave until about 7 pm that day and neither of us wanted to stay back any longer.) Just last week he texted me with questions about one of those roles, and all I had to offer was a sympathetic “glhf”. (I’m not insane enough to explain through 160-character messages; luckily the guy who delegated that role to me also left behind a comprehensive guide.)

The crux of the issue is that even though we’re asked to do things, there isn’t always an official document designating the responsibility of doing said things nor an official method of doing them. I get by fine with the latter—it makes more sense to specify requirements, rather than methods—but when even the exact requirements aren’t published, problems just keep appearing. On the other hand, when there are SOPs, they sometimes epitomise the phrase “bureaucratic inefficiency”, which is not a good thing either! (If people follow it, things get done slowly. If people don’t, then what’s the point of having an SOP? Lose-lose.)

This is the part where I point out I have nothing more to say and like two thirds of this post wasn’t actually related to SOPs. I suck at writing.

## Help

Okay I’m in a bit of a bind because there’s so much literature to browse through and so little I’m actually understanding/absorbing. My very qualitative understanding of signal processing theory in general helps to get an intuition on things, but isn’t so useful in understanding things that evade intuition. (At least, until those things eventually become intuition, but the problem is that I’m not getting there yet.)

My desk had been mostly free of paper until yesterday, when I started scribbling stuff furiously; it led me to the realisation that my 2 TB external hard disk is taking up precious desk estate, which makes doing things (other than using the computer) on the desk much less convenient. (Now there’re two sheets of A4 paper on it, one almost entirely packed on both sides and one with barely one sixth of the page filled on one side.)

Suppose we have a DWT implemented with a two-step lifting scheme, à la the discrete Haar wavelet transform. Obviously we don’t care about the continuous Haar wavelet transform here, so assume the discrete version unless otherwise specified.

Let the number of input samples be $N$, and let $n=\lfloor N/2\rfloor$, $n'=\lceil N/2\rceil$. A two-step lifting scheme would be defined by two matrices: an $n\times n'$ predictor matrix $P$ and an $n'\times n$ update matrix $U$. (Recall that the size of a matrix is conventionally specified as height-by-width.) By splitting the input into the even and odd samples (which we treat as column vectors $E$ and $O$ respectively), we compute the low-pass and high-pass outputs ($S$ and $D$ respectively) as:

$D=O-PE$ (predict odd from even and store the difference)
$S=E+UD$ (update even from odd)

Due to the lifting structure, $S$ and $D$ may be stored in the memory locations for $E$ and $O$ (respectively) to make this an in-place algorithm. One property of the Haar wavelet is that, if you ignore boundary conditions, the mean of the low-pass output is equal to the mean of the input; this is termed the update preserving the zeroth moment. Of course we can also talk about preserving higher moments, but let’s not get ahead of ourselves here.
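As code, the whole scheme is tiny. Here’s a numpy sketch (function names are mine), using the Haar choice $P=I$, $U=I/2$ (which satisfies the column-sum conditions below) as the concrete example:

    import numpy as np

    def lift_fwd(x, P, U):
        E, O = x[0::2], x[1::2]   # even and odd samples
        D = O - P @ E             # predict: high-pass
        S = E + U @ D             # update: low-pass
        return S, D

    def lift_inv(S, D, P, U):
        E = S - U @ D             # undo the update
        O = D + P @ E             # undo the predict
        x = np.empty(len(E) + len(D))
        x[0::2], x[1::2] = E, O
        return x

    # Haar on N = 4: predict each odd sample from its left neighbour,
    # update with half the difference.
    P, U = np.eye(2), np.eye(2) / 2
    x = np.array([2., 3., 4., 14.])
    S, D = lift_fwd(x, P, U)      # S = pairwise means, D = pairwise differences
    assert np.allclose(lift_inv(S, D, P, U), x)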

The update preserving the mean places some constraints on $P$ and $U$. Each of the $n'$ columns of $P$ must sum to $n/n'$, and each of the $n$ columns of $U$ must sum to $n'/N$.

As a further restriction, we could have the predictor exactly predict constant inputs; this forces each of the $n$ rows of $P$ to sum to 1. This and the above place a total of $n'+n-1=N-1$ restrictions on the coefficients of $P$.

$N=2$ is not very interesting: $P$ and $U$ both have one restriction on a one-by-one matrix, so $P=(1)$ and $U=(1/2)$, which gives the following matrix as the transform:
$\displaystyle\begin{pmatrix}\frac12&\frac12\\-1&1\end{pmatrix}$

The two rows are orthogonal. You can normalise them to make this an orthonormal transform. There isn’t really much else to say.

$N=3$ is slightly more interesting. $P$ is still fully determined: $P=(1/2~1/2)$. But we have one degree of freedom with $U$; symmetry is nice, so let’s just take $U=(1/3~1/3)^{\mathrm T}$. This makes our transform matrix:
$\displaystyle\frac16\begin{pmatrix}5&2&-1\\-3&6&-3\\-1&2&5\end{pmatrix}$

This isn’t orthogonal but who cares about orthogonality anyway. We can then apply the two-input transform as derived above on the first and third rows to get our very own three-input wavelet transform, which is the following.
$\displaystyle\frac16\begin{pmatrix}2&2&2\\-3&6&-3\\-6&0&6\end{pmatrix}$

Interestingly, this three-input wavelet transform does have orthogonal rows, and if you normalise, then swap the second and third rows, you get the three-input DCT-II (up to the signs of the rows).
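A quick numerical check of that claim (numpy assumed; the sign flips show up in the diagonal):

    import numpy as np

    W = np.array([[ 2.,  2.,  2.],
                  [-3.,  6., -3.],
                  [-6.,  0.,  6.]]) / 6
    W /= np.linalg.norm(W, axis=1, keepdims=True)   # normalise the rows
    W = W[[0, 2, 1]]                                # swap the second and third rows

    # Orthonormal 3-point DCT-II matrix.
    k, n = np.mgrid[0:3, 0:3]
    D = np.sqrt(2/3) * np.cos(np.pi * k * (2*n + 1) / 6)
    D[0] /= np.sqrt(2)

    print(np.round(W @ D.T, 6))   # diag(1, -1, -1): same rows up to sign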

Earlier yesterday afternoon I was trying to probe the properties of a four-input lifting scheme, but didn’t get very far.

I’ll finish off this post later. (By that I mean write a new post, not updating this one.)

## Sparsity

So, I mentioned about orthogonal transforms in the last post. I may or may not be fudging the distinction between orthogonality and orthonormality, but bear with me.

(Unrelatedly, is there a style guide which has recommendations on whether the leading “the” (and other articles in general) should be included in hyperlinks? Very unrelatedly, fun fact: Blogger strips articles out of permalinks.)

We need some measure of sparsity. We could be literal with the definition of “sparse” and just count the number of nonzero entries (i.e. the L0 norm-except-not-quite), but this isn’t very useful: the presence of even a little bit of noise would throw it off, and even in the complete absence of noise, most inputs would not fall exactly into a subspace spanned by a proper subset of the basis vectors. A sparsity measure that gives identical outputs for almost every input does not help.

If we use L2, we end up with the problem that the sum of squares of the transform output is equal to the sum of squares of the input (because we specified that the transform was orthogonal), so the transform drops out of the equation altogether. So we use L1, a.k.a. the sum of absolute values. Let’s very unscientifically review the performance of a couple of orthogonal transforms under the L1 metric, where lower is better. The data point we’re going to use is Lenna, with the R, G and B channels averaged and rounded to the nearest integer.

We have, for starters, the identity transform. Think of it as a control. Let’s also add DCT-II, the Haar transform and the Hadamard transform into the mix. (All suitably normalised to be orthogonal, of course.) Heck, we might as well make one transform of our own while we’re at it. We’ll need the Gram-Schmidt process for this.

Suppose that we have $n$ inputs. We have as an ansatz that the inputs are roughly correlated with each other, so we can take $(1,1,\dots,1)$ as our first basis vector; the projection in the Gram-Schmidt process will cause the transformed outputs to be concentrated entirely within one coefficient if the input is constant. (Note also that all the other transforms mentioned above, save the identity transform, also have this as a basis vector.) Then because we’re uncreative let the rest of the vectors be $(1,0,0,\dots,0)$, $(0,1,0,\dots,0)$, and so on until $(0,0,0,\dots,1,0)$. We don’t need a $(0,0,0,\dots,1)$ vector in this case because then we’d be overspecifying. After applying Gram-Schmidt (and leaving normalisation for later), we end up with these basis vectors (read by rows, not columns):

$\begin{pmatrix} 1&1&1&\cdots&1&1&1\\ n-1&-1&-1&\cdots&-1&-1&-1\\ &n-2&-1&\cdots&-1&-1&-1\\ &&n-3&\cdots&-1&-1&-1\\ &&&\ddots&&&\vdots\\ &&&&2&-1&-1\\ &&&&&1&-1 \end{pmatrix}$

To verify that the rows are in fact orthogonal to each other, just take the dot product of any two rows, which is readily checked to be nonzero iff the two rows are identical. To finish off the orthogonalisation, we just need to normalise each row. (Not included.)

For convenience, I implemented it in my code with the first row negated and the matrix rotated upside down. The calculated L1 norms also reflect this change. (DCT-II, the Haar wavelet transform and the Hadamard transform are identical modulo sign and output rearrangement on flipped input (i.e. the L1 norm doesn’t change), but our toy transform isn’t.)

Note that this toy transform can be implemented with just $\Theta(n)$ operations, same as fast wavelet transforms; the fast Hadamard transform and fast cosine transforms are $\Theta(n\log n)$ algorithms. Another thing to note is that the matrix above (after normalisation) has determinant $(-1)^{n-1}$; if we cared about making the transform special orthogonal for every $n$, we could just negate one row whenever $n$ is even, but this isn’t very important and we make no use of this fact.
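The $\Theta(n)$ claim is easy to see: one pass of suffix sums feeds every row. A sketch of the unnormalised version (without the negation and flip my code uses):

    def toy_transform(x):
        # y[0] = sum(x); y[k] = (n-k)*x[k-1] - (x[k] + ... + x[n-1])
        n = len(x)
        suffix = [0] * (n + 1)
        for i in reversed(range(n)):
            suffix[i] = suffix[i + 1] + x[i]
        return [suffix[0]] + [(n - k) * x[k - 1] - suffix[k] for k in range(1, n)]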

Our source is two-dimensional, so to make our 1D transforms 2D we can just apply them horizontally then vertically. Our source is also 512×512, so to simplify things let’s just look at $n\times n$ non-overlapping blocks where $n$ is a power of two, up to 64. The L1 norms of the transforms over all blocks are added together, and values are presented rounded to two decimal places. The percentages refer to the relative value compared to the identity transform.
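Concretely, the measurement loop is something like the following sketch (numpy/scipy assumed; scipy’s dctn stands in for whichever transform is being tested, and a random array stands in for the actual image):

    import numpy as np
    from scipy.fft import dctn

    def block_l1(img, n, transform2d):
        # Sum of L1 norms of the transform over non-overlapping n-by-n blocks.
        total = 0.0
        for i in range(0, img.shape[0], n):
            for j in range(0, img.shape[1], n):
                total += np.abs(transform2d(img[i:i+n, j:j+n])).sum()
        return total

    img = np.random.randint(0, 256, (512, 512)).astype(float)  # stand-in input
    print(block_l1(img, 8, lambda b: dctn(b, norm='ortho')))   # 2D DCT-II
    print(block_l1(img, 8, lambda b: b))                       # identity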

| Block size | Identity | DCT-II | Haar | Toy |
|:----------:|---------:|-------:|-----:|----:|
| 2×2 | 33614394 | | | |
| 4×4 | 33614394 | 9564337.23 (~28.45%) | 9649082.17 (~28.71%) | 9699372.97 (~28.85%) |
| 8×8 | 33614394 | 5580317.24 (~16.60%) | 5756032.31 (~17.12%) | 5980414.12 (~17.79%) |
| 16×16 | 33614394 | 3644920.79 (~10.84%) | 3885408.48 (~11.56%) | 4479501.74 (~13.33%) |
| 32×32 | 33614394 | 2732115.33 (~8.13%) | 2987479.44 (~8.89%) | 4146229.00 (~12.33%) |
| 64×64 | 33614394 | 2359805.35 (~7.02%) | 2576377.94 (~7.66%) | 4691806.63 (~13.96%) |

Are we supposed to draw a conclusion from this? Idklol. I put the source code on a gist.

Anyway, it’s quite clear that our toy transform is relatively bad at sparsifying this one test image, which is expected since it wasn’t really designed for anything in particular. DCT-II is optimal (in a KLT sense) for inputs that are AR(1), and natural images are pretty close to that so it’s no surprise that DCT-II happens to be the best. I played a bit with the Hadamard transform circa 2010, and these results match my expectations that it’d be somewhat worse than the Haar transform. (Very vaguely put, a Hadamard transform is like a Haar transform except after splitting the signal into high/low subbands, instead of just continuing the splitting on the low subband, the splitting proceeds on both; the low-pass filter is very crude so aliasing accumulates rather quickly.)

It’d be interesting to compare better wavelet transforms, but having to make sure they’re orthogonal would be troublesome. Allowing nonorthogonal (but still linear) transforms would lead to unfair comparison, if we continue using L1 as a metric. If, instead, we were measuring something practical like bits per pixel for a specified PSNR under a specific lossy compression scheme, we could compare arbitrary linear transforms.

This leads me to bring up an issue with quite a lot of image compression papers. Many of them mention compressing to a specific size but without exactly specifying how this is done. Normally, compression is done with quantisation (after a suitable transform), so we can assume they just choose a quantisation matrix, divide the transformed input by it and round to nearest integer then entropy encode, right? Nope, wrong. Sophisticated image encoders should be able to account for rate-distortion optimisation, which is very broad and includes things like choosing to truncate floating-point values instead of rounding. (Because if you round away from 0, you might end up wasting a fraction of a bit to encode a slightly larger value for little quality gain; rounding towards 0 is almost always cheaper since smaller values require less information to transmit.) Then there’s also the question of how the quantisation matrix is chosen; is it necessarily a multiple of a fixed “base” matrix (in which case, how are noninteger multiples rounded?), or can it vary with the source image?

Basically, there’re way too many parameters involved in lossy image compression and authors don’t do anywhere near enough to document them. (No, “download the accompanying source code” is not documentation… That is, if source code is provided at all!)

Oh yeah, one last fun fact. JPEG uses full-range BT.601 coefficients for converting between RGB and YCbCr, but it doesn’t specify that the YCbCr values must be rounded to integers. Implementations that round these values before converting them to RGB (in the case of decoding) or before applying the DCT-II (in the case of encoding) are more inexact than they have to be, but of course every JPEG library rounds the YCbCr intermediates because floating-point arithmetic is slow compared to integer arithmetic.

Another thing to note regarding floats versus integers in image compression is that the input is pretty much invariably in an integer format, rather than floating point. Lifting schemes would allow implementing reversible integer-to-integer transforms which are almost linear (in that they approach linearity as the input precision tends to infinity). Normalised orthogonal transforms (with some exceptions) would be penalised by having to force points in a lattice to another, rotated at an angle such that (usually) at most one point coincides between both lattices. It follows that unless you have a clever trick to create a bijection between the two lattices, a reversible integer-to-integer variant is impossible. (A clever trick would for example be scaling the inputs and outputs to create a reversible normalised integer-to-integer variant of the four-input Hadamard transform, as illustrated in the third Daala demo.)

Stealth edit: “the projection in the Gram-Schmidt process will cause the transformed outputs to be concentrated entirely within one coefficient” → “[…] if the input is constant”

Stealth edit #2: “checked to be zero iff the two rows are identical” → “checked to be nonzero iff the two rows are identical” Also added a link to the Wikipedia article on PCA.

## There’s this one category I don’t use a lot and maybe I should use it more

I was thinking a bit about discrete orthogonal transforms, like the DFT or DCT. (Technically the DFT is unitary rather than orthogonal but whatever.)

The usual property exploited with such transforms wrt lossy/lossless compression is that the data becomes sparser after applying an appropriate transform. Not every orthogonal transform has such properties, and obviously no transform can sparsify every possible input because of the pigeonhole principle.

Taking normalisation into account, for an orthogonal transform over $n$ samples where each sample has the same range, we may conceive of it as an isometric transform of an $n$-cube in $n$-space. (Which is just a fancy way of saying a rotation and optionally a reflection, in the context of Euclidean geometry.)

Unless the transform is trivial (relabelling axes, reflecting along axes or any composition of such), the resulting $n$-cube will certainly not coincide exactly with the original. Practically, what this means is that there’s a range expansion involved, and if any one of the transformed coordinates is particularly high or low, it affects the possible range of the other coordinates.

Transforms are used to remove correlation, so this seems counterproductive, but that’s just how things work.

YCbCr is not orthogonal, but this also applies; there’s been some research on how to use this “extra” information on how one coordinate affects the others to better reconstruct lossily compressed images, and also for watermarking/steganographic purposes. (For TV-range YCbCr, if Y=0/255 this leaves a bit of room for Cb and Cr to vary while still having the decoded RGB be black/white (respectively).)

There was a nonlinear variant of the Hadamard transform (which is orthogonal after suitable normalisation) where the input and output had exactly the same range, but it is only $C^0$, i.e. continuous but without a continuous derivative. (Maybe it was for the Haar transform instead, but they’re identical for two inputs.) I can’t find the paper where I originally read about this now, but picture this. For two inputs, our input space is a square. Starting at the centre, divide this square into square “annuli”; the nonlinear transform was then defined as shifting each “annulus” $n$ spots clockwise/anticlockwise, where $n$ was about one eighth the “length” of the “annulus”. This length is an odd multiple of 4 if the input range is even, so some rounding convention needs to be specified. This is akin to rotating the square by 45 degrees, then squishing the diamond back into an axis-aligned square.