It seems likely to me that multi-way merge is better. 256-way is a bit much, but, with appropriate queueing, you can get close. Among other reasons because, unlike multi-way partition, multi-way merge does benefit from vectorisation, and hence is more efficient. (Of course, for the 'very parallel' a la gpus, partition/radix is apparently interesting again, as you can afford the out-of-place parallel scan.)
Oh, I didn't know that about multi-way merge (regrettably I haven't really wrapped my head around SIMD merging yet). Got a source for this? https://vldb.org/pvldb/vol8/p1274-inoue.pdf mentions it but it looks more like interleaving several binary merges than true multi-way. Although I guess from a memory perspective there's no difference.
Yeah, that's logically what it is—point is just to reduce memory traffic. (And, for that matter, memory subsystem ops, which is amusing as distribution by contrast directly exploits the memory subsystem.) I haven't read anything about it, but I came up with the following scheme:
A basic efficient scheme for merging two arrays is: pop one vector from the left array, pop one from the right array, merge them with some oblivious sorting network, and then write out the low vector. Keep the high in reserve. Then repeatedly pop a vector from whichever array has the lower initial element, merge it with the reserve vector, and write out the low vector from the result.
We can imagine doing something similar to merge k inputs: pop one vector from each input, and merge them all together. Write out the low vector, keeping the remaining k-1 in reserve; then, repeatedly pop one vector from the input with the lowest initial element (can be determined with a tourney tree[0]), merge it with the k-1 reserve vectors, and write out the low result. The problem is that this is inefficient; we have to do a 1:k-1 merge. The solution is queueing. Pop k vectors, each time from the lowest remaining input. Merge the k vectors all together (if k is a power of two, then this is completely balanced), then merge the fresh k with the reserve k-1 (only slightly unbalanced), and write out the low k results, leaving the high k-1 as the reserve for the next iteration.
For very large inputs, this will be inadequate (I think I can get up to k=8 on avx512), but it can be layered using in-memory buffers; 8^2=64-way merge seems reasonable, especially given that at this point you probably want to use multiple cores.
Multiway powersort looks interesting, thanks! I was wondering how to solve that problem.