Modern perfect hashing
Wojciech Muła posted about modern perfect hashing for strings and I wanted to make some comments about my own implementation (that sadly never got productionized because doubling the speed compared to gperf wasn't really that impactful in the end).
First, let's define the problem, just so we're all on the same page; the goal is to create code that maps a known, fixed set of strings to a predefined integer (per string), and rejects everything else. This is essentially the same as a hash table, except that since the set of strings is known ahead of time, we can do better than a normal hash table. (So no “but I heard SwissTables uses SIMD and thus cannot be beat”, please. :-) ) My use case is around a thousand strings or so, and we'll assume that a couple of minutes of build time is OK (shorter would be better, but we can probably cache somehow). If you've got millions of strings, and you don't know them compile-time (for instance because you want to use your hash table in the join phase of a database), see this survey; it's a different problem with largely different solutions.
Like Wojciech, I started splitting by length. This means that we can drop all bounds checking after this, memcmp will be optimized by the compiler to use SIMD if relevant, and so on.
But after that, he recommends using PEXT (bit extraction, from BMI2), which has two problems: First, the resulting table can get quite big if your input set isn't well-behaved. (You can do better than the greedy algorithm he suggests, but not infinitely so, and finding the optimal mask quickly is sort of annoying if you don't want to embed a SAT solver or something.) Second, I needed the code to work on Arm, where you simply don't have this instruction or anything like it available. (Also, not all x86 has it, and on older Zen, it's slow.)
So, we need some other way, short of software emulation of PEXT (which
exists, but we'd like to do better), to convert a sparse set of bits
into a table without any collisions. It turns out the computer chess
community has needed to grapple with this for a long time (they want to
convert from “I have a \ on \ and there are pieces on
relevant squares \, give me an index that points to an array
of squares I can move to”), and their solution is to use… well,
magic. It turns out
that if you do something like ((value & mask) * magic), it is very
likely that the upper bits will be collision-free between your different
values if you try enough different numbers for magic. We can use this
too; for instance, here is code for all length-4 CSS keywords:
static const uint8_t table[] = {
6, 0, 0, 3, 2, 5, 9, 0, 0, 1, 0, 8, 7, 0, 0,
};
static const uint8_t strings[] = {
1, 0, 'z', 'o', 'o', 'm',
2, 0, 'c', 'l', 'i', 'p',
3, 0, 'f', 'i', 'l', 'l',
4, 0, 'l', 'e', 'f', 't',
5, 0, 'p', 'a', 'g', 'e',
6, 0, 's', 'i', 'z', 'e',
7, 0, 'f', 'l', 'e', 'x',
8, 0, 'f', 'o', 'n', 't',
9, 0, 'g', 'r', 'i', 'd',
10, 0, 'm', 'a', 's', 'k',
};
uint16_t block;
memcpy(&block, str + 0, sizeof(block));
uint32_t pos = uint32_t(block * 0x28400000U) >> 28;
const uint8_t *candidate = strings + 6 * table[pos];
if (memcmp(candidate + 2, str, 4) == 0) {
return candidate[0] + (candidate[1] << 8);
}
return 0;
There's a bit to unpack here; we read the first 16 bits from our value
with memcpy (big-endian users beware!), multiply it with the magic value 0x28400000U found by trial
and error, shift the top bits down, and now all of our ten candidate
values (“zoom”, “clip”, etc.) have different top four bits. We use that to index
into a small table, check that we got the right one instead of a random collision
(e.g. “abcd”, 0x6261, would get a value of 12, and table[12] is 7,
so we need to disambiguate that from “flex”, which is what we are actually looking
for in that spot), and then return the 16-bit identifier related to the match (or zero,
if we didn't find it).
We don't need to use the first 16 bits; we could have used any other consecutive 16 bits, or any 32 bits, or any 64 bits, or possibly any of those masked off, or even XOR of two different 32-bit sets if need be. My code prefers smaller types because a) they tend to give smaller code size (easier to load into registers, or can even be used as immediates), and b) you can bruteforce them instead of doing random searches (which, not the least, has the advantage that you can give up much quicker).
You also don't really need the intermediate table; if the fit is particularly good, you can just index directly into the final result without wasting any space. Here's the case for length-24 CSS keywords, where we happened to have exactly 16 candidates and we found a magic giving a perfect (4-bit) value, making it a no-brainer:
static const uint8_t strings[] = {
95, 0, 'b', 'o', 'r', 'd', 'e', 'r', '-', 'b', 'l', 'o', 'c', 'k', '-', 's', 't', 'a', 'r', 't', '-', 'w', 'i', 'd', 't', 'h',
40, 0, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 't', 'e', 'x', 't', '-', 'o', 'r', 'i', 'e', 'n', 't', 'a', 't', 'i', 'o', 'n',
115, 1, 's', 'c', 'r', 'o', 'l', 'l', '-', 'p', 'a', 'd', 'd', 'i', 'n', 'g', '-', 'b', 'l', 'o', 'c', 'k', '-', 'e', 'n', 'd',
198, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 't', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', '-', 'o', 'r', 'i', 'g', 'i', 'n',
225, 0, '-', 'i', 'n', 't', 'e', 'r', 'n', 'a', 'l', '-', 'o', 'v', 'e', 'r', 'f', 'l', 'o', 'w', '-', 'b', 'l', 'o', 'c', 'k',
101, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 'b', 'o', 'r', 'd', 'e', 'r', '-', 'e', 'n', 'd', '-', 's', 't', 'y', 'l', 'e',
93, 0, 'b', 'o', 'r', 'd', 'e', 'r', '-', 'b', 'l', 'o', 'c', 'k', '-', 's', 't', 'a', 'r', 't', '-', 'c', 'o', 'l', 'o', 'r',
102, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 'b', 'o', 'r', 'd', 'e', 'r', '-', 'e', 'n', 'd', '-', 'w', 'i', 'd', 't', 'h',
169, 1, 't', 'e', 'x', 't', '-', 'd', 'e', 'c', 'o', 'r', 'a', 't', 'i', 'o', 'n', '-', 's', 'k', 'i', 'p', '-', 'i', 'n', 'k',
156, 0, 'c', 'o', 'n', 't', 'a', 'i', 'n', '-', 'i', 'n', 't', 'r', 'i', 'n', 's', 'i', 'c', '-', 'h', 'e', 'i', 'g', 'h', 't',
201, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 't', 'r', 'a', 'n', 's', 'i', 't', 'i', 'o', 'n', '-', 'd', 'e', 'l', 'a', 'y',
109, 1, 's', 'c', 'r', 'o', 'l', 'l', '-', 'm', 'a', 'r', 'g', 'i', 'n', '-', 'i', 'n', 'l', 'i', 'n', 'e', '-', 'e', 'n', 'd',
240, 0, '-', 'i', 'n', 't', 'e', 'r', 'n', 'a', 'l', '-', 'v', 'i', 's', 'i', 't', 'e', 'd', '-', 's', 't', 'r', 'o', 'k', 'e',
100, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 'b', 'o', 'r', 'd', 'e', 'r', '-', 'e', 'n', 'd', '-', 'c', 'o', 'l', 'o', 'r',
94, 0, 'b', 'o', 'r', 'd', 'e', 'r', '-', 'b', 'l', 'o', 'c', 'k', '-', 's', 't', 'a', 'r', 't', '-', 's', 't', 'y', 'l', 'e',
196, 2, '-', 'w', 'e', 'b', 'k', 'i', 't', '-', 't', 'e', 'x', 't', '-', 's', 'i', 'z', 'e', '-', 'a', 'd', 'j', 'u', 's', 't',
};
uint32_t block;
memcpy(&block, str + 16, sizeof(block));
uint32_t pos = uint32_t(block * 0xe330a008U) >> 28;
const uint8_t *candidate = strings + 26 * pos;
if (memcmp(candidate + 2, str, 24) == 0) {
return candidate[0] + (candidate[1] << 8);
}
return 0;
You can see that we used a 32-bit value here (bytes 16 through 19 of the input), and a corresponding 32-bit magic (though still not with an AND mask). So we got fairly lucky, but sometimes you do that. Of course, we need to validate the entire 24-byte value even though we only discriminated on four of the bytes! (Unless you know for sure that you never have any out-of-distribution inputs, that is. There are use cases where this is true.)
(If you wonder what 95, 0 or similar is above; that's just
“the answer the user wanted for that input”. It corresponds to
a 16-bit enum in the parser.)
If there are only a few values, we don't need any of this; just like Wojciech, we do with a simple compare. Here's the generated code for all length-37 CSS keywords, plain and simple:
if (memcmp(str, "-internal-inactive-list-box-selection", 37) == 0) {
return 171;
}
return 0;
(Again 171 is “the desired output for that input”, not a value the code generator decides in any way.)
So how do we find these magic values? There's really only one way: Try lots of different ones and see if they work. But there's a trick to accelerate “see if they work”, which I also borrowed from computer chess: The killer heuristic.
See, to try if a magic is good, you generally try to hash all the different values and see if any two go into the same bucket. (If they do, it's not a perfect hash and the entire point of the exercise is gone.) But it turns out that most of the time, it's the same two values that collide. So every couple hundred candidates, we check which two values disproved the magic, and put those in a slot. Whenever we check magics, we can now try those first, and more likely than not, discard the candidate right away and move on to the next one (whether it is by exhaustive search or randomness). It's actually a significant speedup.
But occasionally, we simply cannot find a magic for a given group; either there is none, or we didn't have enough time to scan through enough of the 64-bit space. At this point, Wojciech suggests we switch on one of the characters (heuristically) to get smaller subgroups and try again. I didn't actually find this to perform all that well; indirect branch predictors are better than 20 years ago, but the pattern is typically not that predictable. What I tried instead was to have more of a yes/no on some character (i.e., a non-indirect branch), which makes for a coarser split.
It's not at all obvious where the best split would be. You'd intuitively think that 50/50 would be a good idea, but if you have e.g. 40 elements, you'd much rather split them 32/8… if you can find perfect hashes for both subgroups (5-bit and 3-bit, respectively). If not, a 20–20 split is most likely better, since you very easily can find magics that put 20 elements into 32 buckets without collisions. I ended up basically trying all the different splits and scoring them, but this makes the searcher rather slow, and it means you basically must have some sort of cache if you want to run it as part of your build system. This is the part I'm by far the least happy about; gperf isn't great by modern standards, but it never feels slow to run.
The end result for me was: Runtime about twice as fast as gperf, compiled code about half as big. That's with everything hard-coded; if you're pushed for space (or are icache-bound), you could make more generic code at the expense of some speed.
So, if anyone wants to make a more modern gperf, I guess this space is up for grabs? It's not exactly technology that will make your stock go to AI levels, though.

















