Lookups in cuckoo tables hit two uncorrelated ("random") locations in the hash t...

nkurz · on Sept 1, 2016

I lost you in the last paragraph. Are you maybe missing a word, or saying cuckoo where you mean Robin Hood?

Separately, aren't TLB misses pretty fast compared to hitting RAM? I'd think they would only dominate if the TLB itself it too large for cache, which I don't think is common. And if you are using so much memory that this is happening, moving to GB Hugepages would solve it

The "sensitivity to a bad hash function" seems like an odd weakness. For a given quality of hash, wouldn't cuckoo tend to be less susceptible than any single hash approach?

I ask because I'm currently bullish on d-ary cuckoo hashes, and think they'd be a good fit for the fast gather on Skylake.

pkhuong · on Sept 1, 2016

My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.

Why would TLB misses be small compared to RAM latency? A TLB miss must be serviced by reading more memory, and misses aren't handled in parallel, unlike random reads to RAM. Sure, you could use very large (1G) pages, but that's a pretty specialised setup that's not available on every platform and tends to require a reboot to enable/tweak. Not something we want to rely on in general.

Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed. That's particularly problematic with the usual interfaces that don't let the hash table specify a seed to the hash function and expect a machine word: we have to map values to hashes, and remix or split the hashes (in theory, that's defensible as long as the intermediate hash is strong and its codomain is at least (hash set size)^2), but there's nothing to remix away if we have too many values that map to the same intermediate hash. If the hash table specifies calls the hash function with two different seeds, that means double the time spent in hashing, and that overhead can cover for a lot of linear probing.

Simple deterministic probing technique just take a hit with a bigger cluster than expected; a performance degradation, but the table still work. You can also see theoretical analyses that'll lead you to similar conclusions if you look at the k-independence needed for each hashing technique.

Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line. You don't need SIMD instructions to expose memory level parallelism, you just need independent dependency graphs in your serial computation (until you're bottlenecked on an execution resource that's not duplicated and rarely pipelined, like the TLB miss logic).

nkurz · on Sept 1, 2016

My last paragraph says you're probably better off with a solution that doesn't achieve the same density but avoids the uncorrelated accesses. In other words, stick to some form of local probing and take the hit in density.

Thanks, I was misreading.

Why would TLB misses be small compared to RAM latency?

Because for recent CPU's (post-P5 for Intel) the page walks to service a TLB miss use the standard data caching mechanisms, thus for a frequently used hash table that is reading only a couple cachelines per lookup, the page tables usually remain in cache: http://electronics.stackexchange.com/a/67985.

So while the TLB miss requires a lookup, this lookup frequently doesn't require hitting RAM. My recollection is that this means a TLB miss usually costs only the relevant cache miss plus ~10 cycles. But this does require certain assumptions about the access pattern, and I've been meaning to retest this on recent hardware to be sure.

misses aren't handled in parallel

Based on your earlier phrasing you probably realize, but in case others don't, since Broadwell Intel CPU's do handle two page walks in parallel: http://www.anandtech.com/show/8355/intel-broadwell-architect....

Cuckoo is particularly sensitive to bad hash functions: if a few elements always hash to the same pair of values, you're screwed.

Yes, although if you can choose a good hash function this should be rare. And there are variations of cuckoo hashes that are much less susceptible to this. The first either increases the number of hashes (d-ary), and the second adds multiple "bins" as described by 'cmurphycode' in another comment. Then you can add a "failsafe" by adding a "stash" of last resort: https://www.eecs.harvard.edu/~michaelm/postscripts/esa2008fu...

If the hash table specifies calls the hash function with two different seeds, that means double the time spent in hashing, and that overhead can cover for a lot of linear probing.

If you can choose your own hash function, the hashing cost should be minimal even for a "perfect" hash. And a SIMD approach usually means that you can create 2, 4, or 8 hashes using different seeds in exactly the same time that you can create a single hash: http://xoroshiro.di.unimi.it/xoroshiro128plus.c

Finally, I don't think gathers are faster than independent memory accesses unless everything is in the same cache line.

They weren't any faster until Skylake, but they are significantly faster now: https://github.com/lemire/dictionary