Something else

Stats

Games: 498
Play time: 146.6 hrs

Four-player

Games: 271
Play time: 106.3 hrs

Three-player

Games: 227
Play time: 40.3 hrs

I’ve probably spent too much time playing on Tenhou. Also, as mentioned in the last post about mahjong, I dropped from 2dan to shodan in both four-player and three-player, though through some amazing amount of luck I’ve since managed to grind my way up to 3dan in three-player. Three-player (with Tenhou’s rules) involves a lot of luck anyway—a string of repeated dealerships can readily give one player a huge lead over the other two (this happens more often than in four-player simply because there are fewer players), and hand values are greatly inflated compared to four-player, since a larger proportion of the tiles are yakuhai or dora and hon’itsu/chin’itsu are much easier to get. Yakuman and kazoe yakuman are also much more common; I got a kokushi musou and a suuankou tanki some time ago (along with a few more non-tanki suuankous and daisangens).

Anyway, I was thinking about this: suppose we generalise mahjong to a hand with arbitrarily many tiles (possibly restricted to 3n+2 as in normal mahjong); what is the computational complexity of determining whether a given hand can be interpreted as a group of melds and a pair? This doesn’t seem entirely trivial (figuring out the waits for single-suit hands with only 13/14 tiles still confuses me occasionally), but on the other hand it almost seems like it obviously should be in P.
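For ordinary nine-rank suits, at least, the usual check is a left-to-right recursion over tile counts (honour tiles are simpler still, since they can only form triplets and pairs). Here’s a rough sketch—my own code, so take the names and the generalisation with a grain of salt; the memoisation is overkill for real hands but keeps the arbitrarily-many-tiles version from exploding:

from functools import lru_cache

@lru_cache(maxsize=None)
def suit_decomposes(counts):
    # counts[i] = number of tiles of rank i+1 in one suit.
    # True iff the counts split entirely into runs (i, i+1, i+2) and triplets.
    counts = list(counts)
    for i, c in enumerate(counts):
        if c:
            break
    else:
        return True  # no tiles left
    # Whatever remains of the lowest occupied rank must be used now:
    # some runs starting here, and the rest as whole triplets.
    max_runs = min(counts[i:i+3]) if i + 2 < len(counts) else 0
    for runs in range(max_runs + 1):
        if (c - runs) % 3:
            continue
        rest = counts[:]
        rest[i] = 0
        if runs:
            rest[i+1] -= runs
            rest[i+2] -= runs
        if suit_decomposes(tuple(rest)):
            return True
    return False

def hand_decomposes(counts):
    # Standard-form check for a single-suit hand: melds plus exactly one pair.
    counts = tuple(counts)
    if sum(counts) % 3 != 2:
        return False
    return any(c >= 2 and suit_decomposes(counts[:i] + (c - 2,) + counts[i+1:])
               for i, c in enumerate(counts))

# Sanity check: the pure nine gates shape 1112345678999 wins on any tile of the suit.
chuuren = (3, 1, 1, 1, 1, 1, 1, 1, 3)
print(all(hand_decomposes(chuuren[:i] + (chuuren[i] + 1,) + chuuren[i+1:]) for i in range(9)))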

I once again blog about how bad Windows is

Two evenings ago I set the computer to sleep, with the battery about half full. My limited experience tells me the battery would theoretically take about two weeks to go from fully charged to fully discharged while suspended, so just a day of suspension should have been all right.

About an hour ago I switched it on, only to be greeted by “Resuming Windows”, which indicates that for some reason it decided to hibernate instead of sleep. Whatever, it’s not the first time this has happened. Now, for some retarded reason I splurged on RAM upgrades (seriously, what the hell do I need 16 GiB of RAM for on a laptop), so hibernating takes a nontrivial amount of time, and so does resuming from it. But okay, whatever.

Then right after it finished starting up, Windows hurfed a durf about configuring updates and promptly—by which I mean without a prompt—restarted on its own.

Whatever.

Unrelatedly, I had the QET two days ago. I actually tried to do it seriously for a while before they announced that there were only fifteen minutes left on the test, at which point I just gave up because it’d’ve been impossible to finish the rest of what I had in mind in that little time. Guess I’ll have to take extra English modules.

RIP uptime

> 12:14:59 up 63 days, 18:52, 4 users, load average: 0.70, 0.61, 0.50

Guess I’ll take the old computer off the grid for a while.

Windows a shit, part 3

I just remembered I already wrote two posts on this topic, so I’m just going to call this part 3.

IE 11 does keep keyboard layout within a window consistent (IE 8 doesn’t)

J/k, it doesn’t. The location bar apparently keeps track of keyboard layout separately. Yes, that is as baffling as it sounds.

Anyway, I have a couple of things to say regarding package managers and Windows 7’s lack of an inbuilt one. Package managers are freakin’ awesome and make installing and updating things a lot less annoying. Instead, what I end up doing is searching Google for an executable to download (90% of the first page of results inevitably being shady), then running the installer (quite often without “setup” or “install” in the name, because who needs to know whether that’s the installer or the program itself?), only to realise the retarded installer didn’t add the program to PATH (because having to type out the full path to a CLI tool every time is very user-friendly).

Speaking of PATH.

[screenshot: the Windows environment variable editor]

The “official” way of editing PATH is to open up this dialogue box and edit whatever is in that text field. Notice how it can’t display one full path from a typical Windows setup; it’s not even resizeable. Do I need to say any more regarding how dumb this is?

[screenshot: also the Windows environment variable editor]

Oh, and the dialogue from which the previous dialogue is launched is this one. It’s so long that it has a scrollbar, but you’re not allowed to use the scrollwheel on your mouse to scroll it. Every time I want to edit PATH (which is depressingly often) I have to use my mouse to drag the scrollbar. You know, that action that leads to a lot of jeering and mockery when people who know about scrollwheels see you doing that.

Now, about ramdisks. I tried ImDisk and it didn’t work. So now I’m using this “SoftPerfect RAM Disk” thing, which does work, but doesn’t exactly do what I want. Ideally the ramdisk shouldn’t actually be using RAM if it isn’t filled, but SoftPerfect RAM Disk allocates it all at once because it needs to be formatted as FAT/NTFS/something else Windows can read. (This also means that it takes a nontrivial amount of time to initialise if the ramdisk is a couple of gigabytes large.) By the way, Linux has inbuilt ramdisk support and treats the ramfs/tmpfs filesystem types as first-class citizens among other filesystem types. (Memory usage for ramfs/tmpfs is also dynamic, of course. Linux is actually sane.)

Because I had the brilliant idea of formatting both of my external hard disks as ext4, I can’t use them on Windows directly. So I plug them into the old computer, set up win-sshfs on the new one, and access files over SSHFS like any sane person wouldn’t. This turns out to be awful for a couple of reasons. One, the router/modem SingTel provided is absolute horseshit and can’t handle speeds beyond around 6 to 7 megabits per second. Not even within the home network. Two, win-sshfs itself is not exactly the best SSHFS software ever.

It has some documented quirks like not recognising symlinks, and that’s okay because at least those are documented. It also has some not-documented quirks like crashing if you happen to select multiple files and access them simultaneously at 4 o’clock on a Sunday afternoon. That, for example, is not okay. Furthermore, unlike the *nix SSHFS, it apparently doesn’t cache at all, so everything is always slow. (It has an approximately 14-second delay on opening previously-unaccessed files, but my experience with SSHFS on Linux hasn’t been much better. That said, said previous experience with SSHFS on Linux involved servers located in another continent, which contributed at least a little to the delay.)

Also, re: gdipp; I have that disabled now. I can’t stand the broken font rendering anymore. I mean, whatever Windows is doing is obviously also broken, but it’s less so. It’s 2014 already, why don’t we have 200 ppi displays on the desktop? Instead we get phones with absurdly and unnecessarily high resolutions.

Anyway, the last couple of paragraphs don’t really relate directly to Windows. One of the other things I was going to hurf durfs about (and forgot when I initially published this post) is that Windows does not provide workspaces, a common Linux DE feature I use a lot: everything on the main workspace, IRC on a secondary workspace. Now, I usually use this Windows laptop with an external monitor attached. It turns out that the stand that came with the U2412M isn’t very tall, so a dual-monitor setup forces me to look at the laptop monitor at a fairly extreme angle (which makes the colours look bad, since the laptop panel is TN), and that alone is somewhat unpleasant. I did, however, experiment with it for a while.

My major gripe is that Windows does not, by default, use a colour scheme that allows easy differentiation of active and inactive windows. (I don’t know if this is related to the whole laptop-uses-TN-so-colours-all-look-like-junk thing, but hey.) So what ends up happening is that sometimes I’m looking at the wrong monitor and typing things into the wrong window. This is frustrating and basically a thing that shouldn’t be happening. In theory using two monitors should provide a better workflow than using two workspaces, but this turns out not to be the case with Windows. (Do people with multiple-monitor setups actually use multiple monitors concurrently? I find it hard to focus on what I’m doing if I have too many other things open.)

Untitled

Tomorrow I will stop procrastinating things and get the things I need to get done done.

Maybe.

We’ll see when the time comes.

(By “tomorrow” I naturally mean when I wake up, not literally tomorrow.)

Anyway, I have to go take this “Qualifying English Test” thing because I couldn’t be bothered with English classes in school before. I had a look at the sample question and the topic seemed awfully dreary. Failure is pretty much guaranteed at this point since I suck at writing (despite having produced over a thousand blog posts). Or maybe it’s that I suck at writing about things I don’t care about.

Yeah, let’s go with that.

Something about functional equations

\displaystyle f:\mathbb R\to\mathbb R\\  f(x)=f(1/x)\\  f(0)=0\\  f(1)=1\\  f'(0)=1\\  f(-x)=-f(x)

Assuming f is smooth, solve for f.

A nonsolution is f(x)=\exp(-(\log|x|)^2)=|x|^{-\log|x|} with the removable singularity at x=0 removed. It’s smooth, but f'(0)=0. (In fact, every derivative is zero at x=0; it’s also even rather than odd, though multiplying by \mathrm{sgn}\,x fixes that without breaking smoothness.) Another nonsolution is f(x)=[|x|\le1]x+[|x|>1]/x, which satisfies every condition other than smoothness.

What I tried was to expand f as a Taylor series around x=1 and then solve for the coefficients using only the constraint f(x)=f(1/x), but this was fruitless. Let f(x)=\sum_{k\ge0}a_k(x-1)^k. Then:

a_1=0\\  a_2+a_3=0\\  a_3+2a_4+a_5=0\\  a_4+3a_5+3a_6+a_7=0\\  a_5+4a_6+6a_7+4a_8+a_9=0\\  \vdots

Any finite truncation of this list of equations has more unknowns than equations and hence doesn’t pin down the coefficients. From another point of view, this means that there are infinitely many degrees of freedom to exploit in order to make f satisfy all the other conditions, and I’m not sure what exactly to do here. (Incidentally, this led to the previous post about symmetric polynomials, where the relation to polynomials stems from treating the LHSes as generating functions.)

One sort-of-obvious thing to note is that we somehow have to choose the coefficients such that they don’t grow too quickly, so that the radius of convergence is nonzero. For instance, fixing a_k=0 for all even k\ge4 and a_2=1, the odd coefficients end up being:

a_3=-1\\  a_5=1\\  a_7=-3\\  a_9=17\\  a_{11}=-155\\  a_{13}=2073\\  \vdots

This is A001469, which grows superexponentially and therefore gives a zero radius of convergence. Nope, useless. On the other hand, fixing a_2=a_4=\cdots and a_3=a_5=\cdots and adding the oddness constraint results in a_2=1/2, which leads to f(x)=\frac12(x+x^{-1}), which has a pole at x=0. Durr.
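(A sanity check of mine, not part of the original scribbles: the equations listed above follow the pattern \sum_{j=0}^{m-1}\binom{m-1}{j}a_{m+j}=0, and the a_2=1/2 solution has a_1=0 and a_k=\frac12(-1)^k for k\ge2, so

\sum_{j=0}^{m-1}\binom{m-1}{j}a_{m+j}=\tfrac12(-1)^m\sum_{j=0}^{m-1}\binom{m-1}{j}(-1)^j=\tfrac12(-1)^m(1-1)^{m-1}=0\qquad(m\ge2),

i.e. it satisfies every one of those relations, as it must; it merely fails the conditions at x=0.)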

Anything involving \log x is automatically unnice due to the singularity at x=0, even though converting it to a function of (\log x)^2 seems like the obvious course of action.

Update: Clearly no analytic function is going to make the cut: since it’d have to be symmetric in x and 1/x, the function can be expressed as a Taylor series in terms of x+1/x, which necessarily causes problems at x=0. That said, explicitly constructing a function satisfying all the requirements is not difficult through the use of bump functions and the like; what I wanted was a simple function that can be defined without resorting to such hacks.

Something about symmetric polynomials

In the middle of researching something I discovered that (1+x)^{2n+1}-x^{2n+1} can be expressed as a polynomial in x(1+x) for certain small values of n. (This is reminiscent of and probably also related to Faulhaber’s formula and the Bernoulli numbers, incidentally.)

The very first thing to note is that this is not a symmetric formula in 1+x and x. On the other hand, it is symmetric in 1+x and -x, whence we may apply the fundamental theorem of symmetric polynomials, which implies that (1+x)^{2n+1}-x^{2n+1} may be expressed in terms of (1+x)+(-x) and (1+x)(-x). Since the former is just 1, this also means that (1+x)^{2n+1}-x^{2n+1} is a polynomial function of -x(1+x), and hence also of x(1+x).
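For concreteness (small cases I worked out, not from the original research):

(1+x)^3-x^3=3x^2+3x+1=3\,x(1+x)+1,\qquad (1+x)^5-x^5=5\,[x(1+x)]^2+5\,x(1+x)+1.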

I’m ashamed to admit that noticing this took me about an hour (interrupted by a shower, but still). The previous approach I had was—ugh—induction, which didn’t work out because I copied a sign wrongly somewhere. Having fixed that, a proof by induction is simple, and also suggests the more general statement that (1+x)^n+(-x)^n is a polynomial in x(1+x)… and this general statement makes the symmetry extremely obvious.

(Among other goofs I made today, I wrote 26 for 30−2 somewhere. That was dumb, and what made it even dumber was how long it took me to notice. I might be getting just slightly out of touch with basic arithmetic.)

TANOSHII

[screenshot: daisangen]

[screenshot: suuankou]

Of the three main yakuman hands the only one I haven’t gotten yet is kokushi, which I’ve been in tenpai for exactly once. That said, I still haven’t gotten any yakuman at all in four-player mahjong; been tenpai for shousuushii once, but that happened on my second last draw and all the winning tiles were either discarded or in my opponents’ hands.

Anyway, my rate (for four-player) has been plummeting since the start of the month. I initially attributed it to plain bad luck but having placed badly for most of the last fifty games made me adjust my priors a bit. So yeah I just sort of suck at this game. At this rate (pun intended) I’ll probably get demoted from 2dan to shodan.

Some ~notes~ about that suuankou win, for my personal reference. There are three tiles that complete the hand, namely 1/6/9s. Winning on 1s or 9s by self-pick gives suuankou (8000 basic points; double basic points from every player). Winning on either of those tiles by discard would instead lead to riichi (1 han) + toitoi (2 han) + san’ankou (2 han) + north (1 han), for 18000 points (3000 basic points; six times basic points from the player who dealt in). Possibly more with ura dora, but the ura dora indicator turned out to be 9p, so whatever.

Toitoi is usually cheap (640 basic points without other yaku, though personally I try to ensure I have at least yakuhai or tan’yao first), but the dora can appear in multiples of three, which makes the hand expensive very quickly, especially with kan or kan dora.
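(For reference, a quick sketch of the basic-point arithmetic behind the numbers in these paragraphs: standard riichi scoring with no rounded-up mangan, which matches the 1920 figure above. The actual payments are a separate multiply-and-round step that I’m not bothering with here.)

def basic_points(han, fu):
    # Standard riichi basic points; fu rounds up to the next 10 (chiitoi's fixed 25 fu aside).
    if han >= 13:
        return 8000  # (kazoe) yakuman
    if han >= 11:
        return 6000  # sanbaiman
    if han >= 8:
        return 4000  # baiman
    if han >= 6:
        return 3000  # haneman
    if fu != 25:
        fu = -(-fu // 10) * 10  # round fu up to the next multiple of 10
    return min(fu * 2 ** (2 + han), 2000)  # anything that would exceed 2000 caps at mangan

print(basic_points(2, 40), basic_points(4, 24))  # 640 1920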

With a 9s the hand can be grouped in two different ways: 777888999s can be interpreted as either 789s 789s 789s or 777s 888s 999s. The former interpretation leads to riichi (1 han) + menzen tsumo (1 han) + iipeikou (1 han) + north (1 han) with 24 fu for “only” 1920 basic points, while the latter leads to an instant 8000 basic points.

With a 1s the only possible interpretation is as four triplets (also 8000 basic points for tsumo, 3000 for ron), and with a 6s the only possible interpretation is 777p 11s 678s 789s 789s, which is also only 1920 basic points (1280 without menzen tsumo). Winning on 6s by discard would lead to 7700 points, or 7800 by self-pick.

Going solely by the discards and my own hand, there were two 1s, four 6s and one 9s live, so (assuming I didn’t get any new dora or luck-based yaku other than menzen tsumo) my expected hand value was 296800/21≈14100 points. Looking back, calling riichi turned out to be a good idea, though I was playing under speed rules (3 seconds per turn + 5 extra seconds per round), so I didn’t really manage to think it through.

The first round of that particular game was won by me with a riichi chiitoi, and the second was an open toitoi san’ankou, then the third was the suuankou, and the left player got busted by the right player on the fourth round with an open hon’itsu after his fourth(!) discard. Incidentally, toitoi and chiitoi hands (and also hands with a shanpon wait) can be almost impossible to read because the tiles don’t need to have any relation with each other; setting up traps for players assuming you have a two-sided wait is particularly easy with these. (That said, chiitoi/toitoi inherently have bad waits; having better waits is likely more useful than making traps.)

Trivia (part 2)

3|111 and 1|113;
4|112 and 2|114;
6|114 and 4|116;
9|117 and 7|119.

This was a thing I noticed some time ago when I couldn’t sleep and started mentally factoring integers. For obvious reasons anything ending in 3 or 7 is not divisible by 5, and at most one of 6\mid(10n+8) and 6\mid(10n+4) can be true; hence, of the seven possible statements of this form (d\mid(10n+d+2) and (d+2)\mid(10n+d) where d\in[1,7]\cap\mathbb Z), at most four can be true. This is attained exactly when n\equiv11\pmod{126}.
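A quick brute force, just to double-check myself:

best = {}
for n in range(2520):   # comfortably covers the period of every divisibility condition here
    hits = sum((10*n + d + 2) % d == 0 and (10*n + d) % (d + 2) == 0 for d in range(1, 8))
    best[n] = hits
print(max(best.values()))                                    # 4
print(sorted({n % 126 for n, c in best.items() if c == 4}))  # [11]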

T, part 2

Re: T.

So the thing is over, and guess what, I ended up in 22nd place. Despite only participating for a single day.

This probably would’ve been better blogging material if I somehow managed to be top ten, but alas.

There were over 400 participants who cleared at least ten races… but half of them didn’t even manage to clear twenty. Put it this way: anyone who didn’t even clear a hundred races obviously didn’t really care about the prize that’d require actual effort. How many people managed that? Only 61. (The person ranked directly above me didn’t clear a hundred races, but he’s also Sean Wrona. No other person in the top 48 cleared fewer than a hundred races.)

I have no idea how large TypeRacer’s player base is, but this kind of seems to indicate that this promotional event was a bit of a disaster.

On P versus NP

Disclaimer: things may be wrong, etc., but considering that this is another of those posts where I talk about how other people are wrong I’ll be trying a bit harder to be accurate.

P and NP are complexity classes of decision problems; i.e. problems with a Boolean answer. It is not, for example, meaningful to claim that factoring is in P or NP or PSPACE or EXPTIME because the output of factoring isn’t Boolean.

NP contains P. The usual confusion stems from reading P as “polynomial-time” and NP as “not-polynomial-time”, when the N in NP really stands for “nondeterministic”. (Whether P is a strict subset is, of course, exactly the open question in the title.)

For some reason, discussion of co-NP is often omitted in pop science media, despite the fact that NP is asymmetric with respect to true/false outcomes. A problem being in NP means that whenever the answer is “true”, there exists a polynomially-sized, polynomial-time-verifiable proof of that; no such proof is required for a “false” outcome. co-NP is the complement: a proof must exist for a “false” outcome, with no requirement of a proof when the answer is “true”. The result of omitting co-NP from plebeian discussion is that people end up saying “NP” when what they really mean is NP-intersect-co-NP.

Here are two different formulations of the travelling salesman problem. A: Is the shortest route of length at most x? B: Is the shortest route of length at least x? A is in NP, B is in co-NP. Yet the optimisation version of the TSP (to find the shortest route) is sometimes called NP-complete, when it’s not even a decision problem and hence cannot be in NP. Oops!
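To make the “polynomial-time-verifiable proof” part concrete, here’s a toy verifier for formulation A (my own illustration; the certificate is just a proposed route):

def verify_tsp_certificate(dist, route, x):
    # dist[(a, b)] is the length of edge a -> b (assumed present for every ordered pair
    # of distinct cities); route is a proposed tour listing every city exactly once.
    # Accept iff the tour is valid and its total length is at most x.
    cities = {c for edge in dist for c in edge}
    if sorted(route) != sorted(cities):
        return False
    total = sum(dist[route[i], route[(i + 1) % len(route)]] for i in range(len(route)))
    return total <= x

Checking a certificate like this is easy; finding one (for A) or showing that none exists (for B) is the hard part.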

As another example, it’s clear that primality testing (outcome is true if the input is a prime, false otherwise) is in co-NP. To prove compositeness, simply provide a nontrivial factor! Verifying the proof takes no longer than O(n^2) time if long division is used, which is polynomial. It was also known well before AKS that primality testing is in NP, thanks to succinct primality certificates (Pratt certificates, and later ECPP). Of course, now that AKS exists (where “now” means “since 2002”), we know that PRIMES∈P, so this example is somewhat moot.
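In code, the co-NP certificate check is about as short as these things get (again my own illustration):

def verify_composite(n, d):
    # d is the purported nontrivial factor; trial division is polynomial in the bit length.
    return 1 < d < n and n % d == 0

print(verify_composite(561, 3))  # True: 561 = 3 * 11 * 17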

There’s a bit of a mental block when it comes to thinking about whether the proof is provided by an adversary; the definition only requires that some valid proof exists when the answer is “true”, and that no purported proof can pass verification when the answer is “false”. This can be restated with an alternative definition: instead of requiring a proof for a “true” answer, we can allow randomisation and say that there exists a polynomial-time algorithm such that if the answer is “true” it outputs “true” with nonzero probability, but if the answer is “false” it must output “false”. The use of the “nondeterministic” adjective is most obvious in this interpretation—if the algorithm guesses the correct random numbers to use every time then it provides a polynomial-sized proof for every “true” answer, where the proof is the list of “random” numbers used. (Guessing correctly happens with vanishingly low probability, possibly on the order of 2^{-n} for an n-bit input.)

It also highlights the similarity to RP (randomised polynomial-time), which is the same, except that for a “true” answer a “true” value must be output with probability bounded below by a constant independent of the input. co-NP and co-RP are defined analogously. P problems are automatically in RP, and RP problems are automatically in NP, so we have P⊆RP⊆NP and also P⊆co-RP⊆co-NP. It is believed that derandomisation is possible and therefore that P=RP=co-RP, but like every other open question in complexity theory, a proof in either direction is unlikely to surface for a pretty long time.

The next thing people misunderstand a lot is what it means for a problem to be NP-hard. For a problem to be NP-hard, it must be at least as hard to solve as every problem in NP. We could similarly define co-NP-hard, but with the usual Turing-reduction notion of “at least as hard”, the two classes turn out to be identical, since NP and co-NP are complements of each other. As an NP-hard problem is not necessarily a decision problem, it consequently is not necessarily in either NP or co-NP. Those which are in NP are called NP-complete, and those which are in co-NP are called co-NP-complete. It is somewhat unfortunate that the term “NP-hard” leads people to infer that a problem is in NP, and I certainly have made that mistake before somewhere.

More than I can chew

The thing about entropy or information is that you cannot meaningfully measure how much entropy a system has without a prior distribution, since that is exactly how entropy is defined. This means that entropy depends on the chosen prior; for any particular input, it may be tempting to choose a prior that minimises the entropy for that input, but then you have this metadata which you cannot encode for free. So the prior leaks information into the main data itself (since any specific data has a specific optimal prior), which sounds dumb, and then you need a prior prior, which sounds even dumber. Not to mention that by choosing a prior distribution that is your input with probability 1, you end up with zero entropy.
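To put a number on that last bit (a toy example of my own):

from math import log2

def entropy_bits(prior):
    # Shannon entropy, in bits, of a discrete prior given as {outcome: probability}.
    return sum(-p * log2(p) for p in prior.values() if p > 0)

print(entropy_bits({w: 1 / 2048 for w in range(2048)}))  # 11.0 bits: uniform over 2048 outcomes
print(entropy_bits({"hunter2": 1.0}))                    # 0.0 bits: the prior already knows the answer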

I first noted this when I was exploring password complexity. (Circa 2008? I’m not logged in to my Google account so I can’t check the old blog, assuming I even bothered to write about it there.) Unless you know how the attacker is going to attack, you cannot exactly say how many bits of entropy a particular password has. On the other hand, if you want to check password complexity, being pessimistic (i.e. underestimating the entropy) is a good thing—take a couple of common password schemes as priors and use the one giving the least entropy as the estimate.

Anyway, suppose your password generation scheme is the “ideal” one where you use a CSPRNG / real RNG to generate a bunch of bits (e.g. 100 bits) which you then encode using hexadecimal or base64 or the like. The problem is that if you do this and, in the overwhelmingly unlikely case, your password ends up matching a dictionary word (or has some “pattern” to it), you get hit, because many more people use insecure password schemes! The point is that the password prior distribution should not be uniform; to maximise entropy you have to deliberately choose something unlikely, but not do so deterministically. This somewhat antagonistic password-choosing procedure affects the password prior in turn, but the main point is that the effect is small because most people don’t use good passwords.

The topic of passwords wasn’t what I thought I’d write about when I started this post, but that’s just how the post evolved. That is tangentially related to what I was originally going to write about, but it’s late and I want to sleep so maybe I’ll finish it off another day, but who am I kidding.

Habits

I have this habit of not really researching something in depth before writing about it on this here blog.

This has a couple of consequences, the main one being that whatever I was going to write about often turns out not to be novel.

On a handful of occasions I have trashed otherwise mostly completed posts because I found another website that described whatever I was writing about better. (Not a very high bar, since I’m not good at writing.) I don’t really like putting effort into things that never see the light of day, but somehow that’s what most of my things end up as anyway.

Currently I have a draft post about median selection networks (comparator networks that return the median of the input values), sort of as a follow-up to the running minimum/maximum post. Actually, I had two draft posts on that; the first one got trashed and converted into the running minimum/maximum post, since at that point I had more information to write about regarding that than the (running) median. I only lately (like, yesterday) discovered there actually was a term (“selection network”) to refer to a comparator network for rank order statistics. The more you know!

Unfortunately “selection network” is an extremely poor search term and just gives retarded results on which mobile network to select or something.

Anyway, a median selection network construction with \Theta(n\log n) comparators exists and is optimal, based on expander graphs. (This is apparently the same technique used for proving that \Theta(n\log n) sorting networks exist, albeit without insanely large hidden constants.) This also means that a branchless running median necessarily uses at least \Omega(\log n) operations. (If a more efficient running median existed, it could be adapted to determine non-running medians with an overhead factor of n, i.e. o(n\log n), contradicting the above.)

I reenabled search engine access to this blog soon after I purged it of a huge chunk of the chuuni posts. Of course, the view counts are as abysmal as before. I’m not sure what exactly I was hoping for.

Notes

I am finally getting around to purging old posts that I no longer deem fit to see the light of day. I think I got rid of about thirty already. Some posts may reference older posts that no longer exist, but if I nuked a post it most likely wasn’t worth reading anyway.

There was some serious edgemaster going on in those pre-2011 posts, damn.

Remember when I said that hypercube graphs contain every bipartite graph as a subgraph? That was an incorrect statement, and I updated the post accordingly. (The post is quite a few months old but this was a major (and majorly wrong) statement I made so I might as well admit it instead of just leaving it as an edit no one will ever read.) The Nauru graph is allegedly a counterexample to the original statement.

Remember when I mentioned an idea about linked lists using wider links? A usable implementation of that idea exists as skip lists. I also brought up the initial idea of using fixed ±1/2/4/8/… links on IRC once and was promptly told how impractical it is. (Access is \Theta(\log n) (good), but updating is \Theta(n) average/worst-case (extremely bad).)
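For my own future reference, the skip-list version of the idea in miniature (a sketch; search and insert only, no deletion):

import random

class SkipNode:
    __slots__ = ("key", "forward")
    def __init__(self, key, levels):
        self.key = key
        self.forward = [None] * levels  # forward[i] = next node at level i

class SkipList:
    # Each node gets a random number of levels, so level-i links skip about 2**i nodes
    # on average; this is what makes both access and update O(log n) in expectation.
    MAX_LEVEL = 32

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)

    def _random_levels(self):
        levels = 1
        while levels < self.MAX_LEVEL and random.random() < 0.5:
            levels += 1
        return levels

    def contains(self, key):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL  # rightmost node before the new key, per level
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        new = SkipNode(key, self._random_levels())
        for i in range(len(new.forward)):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new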

Remember that digital clock I had in my bedroom? Black patches started appearing on the display around two years ago, and just a couple of weeks ago I finally made the decision to toss it out. Yeah, I kept a useless clock in my room for more than three years. At least it didn’t take up much space.

Constant-time running minimum/maximum

Section 0: Abstract

A branchless (sort of) algorithm is presented to compute the running minimum/maximum of a discrete 1D signal with asymptotically constant running time per output sample.

Section 1: Introduction

A running minimum (resp. maximum) filter is a nonlinear filter defined over a signal X for a given window size n as the signal \displaystyle t\mapsto\!\!\min_{\tau\in[t,t+n)}\!\!X(\tau) (resp. \max). The focus will be on half-infinite discrete signals, i.e. where t\in\mathbb N\cup\{0\}. WLOG only the running minimum will be considered. There is an alternative definition of the running minimum/maximum as the cumulative minimum/maximum, but a constant-time algorithm for that is trivial and will not be considered.

Taking the minimum of two values naïvely would require a comparison and a branch, but the branch turns out to be unnecessary. x86’s SIMD minimum and maximum instructions were introduced with SSE over a decade ago; while x86 has no scalar minimum or maximum instructions, they can be emulated branchlessly with CMP and CMOV*. (Or so I’m told by an x86 assembly expert.) Furthermore, even without CMOV or similar, branchlessness is still possible with a couple of extra operations on integer types.
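For the record, the usual integer trick looks like this (a sketch; whether the comparison itself compiles to a flag-setting instruction rather than a jump is up to the compiler):

def branchless_min(a, b):
    # -(a < b) is an all-ones mask when a < b and zero otherwise, so the XOR either
    # turns b into a or leaves it alone. No conditional anywhere.
    return b ^ ((a ^ b) & -(a < b))

assert all(branchless_min(a, b) == min(a, b) for a in range(-8, 9) for b in range(-8, 9))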

Such an algorithm would likely be of use in SIMD contexts, i.e. taking the running minimum of multiple 1D signals simultaneously. Branching is relatively cheap on scalar code in modern CPU architectures—especially so if branch prediction happens to be useful—but comparatively expensive in SIMD. (Or so I’m told by the same x86 assembly expert.) There are alternative branch-using algorithms for the running minimum that may perform better in scalar contexts.

Python will be used as pseudocode. Assume that the min and max functions are constant-time with two inputs and treat n-input calls as a wrapper around n-1 two-input calls. Input validation is simplified in the interest of brevity. Likewise, iterators are assumed to be infinite unless otherwise specified to avoid handling edge cases.

Section 2: Algorithmics

def running_minimum(iterable,n):
    iterable = iter(iterable)
    buffer = [next(iterable) for i in range(n)]
    while True:
        yield min(buffer)
        buffer = buffer[1:] + [next(iterable)]

The naïve algorithm has a time complexity of \Theta(n) per output. More specifically, it involves n-1 min operations per output. This is optimal if no information can be reused across loop iterations apart from the buffer of past elements, and only one output value is yielded in each loop iteration.

This suggests an optimisation in yielding multiple output values per loop iteration, where calculations may be reused.

def running_minimum_slightly_better(iterable,n):
    if n < 3:
        raise ValueError
    iterable = iter(iterable)
    buffer = [next(iterable) for i in range(n+1)]
    while True:
        tmp = min(buffer[1:n])
        yield min(buffer[0], tmp) # == min(buffer[0:n])
        yield min(tmp, buffer[n]) # == min(buffer[1:n+1])
        buffer = buffer[2:] + [next(iterable), next(iterable)]

This optimisation reduces the operation count to n/2 mins per output by exploiting the overlap of n-1 elements. One direct extension of this into yielding k output values per loop iteration is to simply reduce the overlap accordingly.

def running_minimum_slightly_betterer(iterable,n,k=2):
    if n < 3 or k > n+1:
        raise ValueError
    iterable = iter(iterable)
    buffer = [next(iterable) for i in range(n+k-1)]
    while True:
        tmp = min(buffer[k-1:n]) if k <= n else None
        for i in range(k):
            yield min(buffer[i:k-1] + [tmp]*(k <= n) + buffer[n:n+i]) # == min(buffer[i:n+i])
        buffer = buffer[k:] + [next(iterable) for i in range(k)]

The operation count is \frac1k(n-k+k(k-1))=n/k+k-2 mins per output; this is minimised when k=\sqrt n, where the operation count is 2\sqrt n-2=\Theta(\sqrt n). This is not particularly close to being constant-time, but there is some low-hanging fruit left to be picked.

def running_minimum_even_better(iterable,n,k=2):
    if n < 3 or k > n+1:
        raise ValueError
    iterable = iter(iterable)
    buffer = [next(iterable) for i in range(n+k-1)]
    while True:
        tmp = min(buffer[k-1:n]) if k <= n else None
        tmp1 = [None] * (k-1)
        tmp2 = [None] * (k-1)
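        # tmp1[j] will end up as min(buffer[j:k-1]) (suffix minima of the window head);
        # tmp2[j] will end up as min(buffer[n:n+j+1]) (prefix minima of the window tail).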
        for i in range(k-1):
            if i == 0:
                tmp1[k-2] = buffer[k-2]
                tmp2[0] = buffer[n]
            else:
                tmp1[k-2-i] = min(tmp1[k-1-i], buffer[k-2-i])
                tmp2[i] = min(tmp2[i-1], buffer[n+i])
        for i in range(k):
            yield min(tmp1[i:i+1] + [tmp]*(k <= n) + tmp2[i-1:i])
        buffer = buffer[k:] + [next(iterable) for i in range(k)]

Using a cumulative minimum shaves the operation count to \frac1k(n-k+(k-2)\cdot2+2k-2)=\frac1k(n-6)+3 mins per output when k\le n. The optimal choice for k is just the maximum possible value, namely k=n+1, where the operation count is \frac{3n-3}{n+1}=3-6/(n+1) mins per output. This can also be seen as reducing the overlap to zero.

A constant-time algorithm is thus obtained. It is not difficult to rearrange the operations to make it zero-latency, which, incidentally, also makes it correctly handle finite iterators.

def running_minimum_zerolatency(iterable,n):
    if n < 3:
        raise ValueError
    iterable = iter(iterable)
    buffer = [next(iterable) for i in range(n)]
    while True:
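        # tmp1[i] ends up as min(buffer[i:n]) (suffix minima of the current window);
        # tmp2 accumulates the prefix minimum of the newly read samples.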
        tmp1 = [None] * (n-1) + buffer[n-1:]
        tmp2 = None
        for i in range(1, n):
            tmp1[n-1-i] = min(tmp1[n-i], buffer[n-1-i])
        yield tmp1[0]
        for i in range(n):
            buffer.append(next(iterable))
            tmp2 = buffer[-1] if i == 0 else min(tmp2, buffer[-1])
            yield tmp2 if i == n-1 else min(tmp1[i+1], tmp2)
        buffer.append(next(iterable))
        del buffer[:n+1]

It can be further modified to use only (1+o(1))n auxiliary memory instead of (3+o(1))n, which is left as an exercise for the reader.

Section 3: Conclusion

See section 0.

Section 4: Extensions

Other order statistics may be considered as well, the median being the obvious one. Computing the median is much less simple than computing the extrema, as it is not directly decomposable. As previously covered, medians may be decomposed into smaller medians, but with superexponential blowup, which suggests that that is not the right direction to pursue. Using a self-balancing binary search tree (which inherently involves data-dependent branches), running order statistics may be computed with O(\log n) time complexity per output value. It is not currently known if this bound can be achieved with branchless algorithms, or whether O(\log n) is optimal even among branch-using algorithms.

The running median will be covered in a later post.

Section 5: Postscript

The presented algorithm is not the first constant-time running minimum I had devised, though it is far simpler than my first attempt, which involved recursively exploiting the overlaps of overlaps to attain a bound of 4 mins per output when n is odd and 5 mins per output when n is even, with a structure depending on the bijective binary representation of n. Thank goodness I didn’t have to write that one up.

I’m not sure how badly this fares against an algorithm which allows data-dependent branches. I’ve tried searching around a bit, which points to a paper by Daniel Lemire titled “Streaming Maximum-Minimum Filter Using No More than Three Comparisons per Element”, which is worth a read too. (More so than this blog post, perhaps.) The comparison table in Lemire’s paper seems to suggest that my algorithm is at least not too bad, and further research reveals that it is in fact almost identical to van Herk’s, differing only in the handling of edge cases. (Of course, van Herk’s article itself is paywalled, despite being over two decades old; I base the statement off Lemire’s code.) Other than this, I haven’t managed to find a single useful paper covering constant-time 1D running minimum/maximum that isn’t also paywalled.

Paywalls. a. shit.

Section 6: Addendum

It appears that \Theta(\log n) worst-case is optimal for a running median (without a branchlessness restriction) from a quick search of available literature, though, again, paywalls make verification difficult. There is a proven \Theta(\log n) worst-case complexity for the two-dimensional running median from reduction to sorting. I’ve yet to digest the proof and I’m not sure if it adapts easily to the one-dimensional case.

It also seems that my variation of van Herk’s algorithm can be implemented in-place, which reduces the auxiliary memory usage to just o(n). This is, of course, assuming that in-place editing is possible; otherwise a (1+o(1))n bound on auxiliary memory is optimal (ignoring lower-order terms).

Previously.