Intel Releases x86-SIMD-sort 2.0 With Faster AVX-512 Sorting, New Algorithms (phoronix.com)
120 points by ashvardanian on June 24, 2023 | hide | past | favorite | 56 comments


An interesting read related to this: Linus's thoughts on AVX-512 back in 2020: https://www.phoronix.com/news/Linus-Torvalds-On-AVX-512

My conclusion (feel free to enlighten me if I am wrong) is that a system would profit more from having extra cores than from AVX-512, for the same power consumption.


(Frequent AVX-512 user, opinions are my own.)

It is time to stop quoting this rant, which is rather far removed from actual practice and reality.

Specifically for sorting, Intel's sort and ours can be about 10x as fast as scalar.

AVX-512 has high power? Great! Power is work per time. Let's have more of that. It is _useful_ work per _energy spent_ that we want to maximize, and there scalar code falls flat. Consider that OoO scheduling/control overhead is the real culprit; executing one instruction costs far more energy than the actual operation. SIMD divides that fixed cost over 16 elements, leading to 5-10x higher power efficiency.
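To illustrate the amortization point with a sketch (not vqsort internals): a loop like the one below is what an auto-vectorizer turns into one 512-bit instruction per 16 int32 lanes under `-mavx512f`, so the fixed fetch/decode/schedule cost per instruction is paid one-sixteenth as often.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar form: one add instruction per element, so per-instruction
// overhead dominates the energy cost. Auto-vectorized for AVX-512, the
// same loop does 16 lanes per instruction, amortizing that fixed cost.
void add_arrays(const int32_t* a, const int32_t* b, int32_t* out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```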

Top frequency reduction? Not since the first implementation on Skylake, and even there a non-issue, see https://github.com/google/highway/blob/master/hwy/contrib/so....

More cores? The shared-memory fabric is already bursting at the seams (see on-chip bottlenecks of Genoa), we are already cramming in too many for the current memory interface.

What would actually be necessary: more bandwidth! A64FX CPUs actually beat GPUs in 2019 for supercomputing applications thanks to their HBM. Intel's Sapphire Rapids Max also has this. Let's build more of those, at a decent price. And start using the wide vector units; they are there precisely because lots of highly-clocked cores in one big NUMA domain is not always the best way forward.


Please, you are overstating your case in a way that distracts from the issues at hand. I do agree that Linus's opinions are irrelevant to a lot of computing, and that sorting must make some use of SIMD for the best performance, and that there's good work in vqsort. But the 10x factor is highly misleading. Sure, it's technically correct that you "can be about 10x as fast" as a scalar sort. In the same sense that a scalar sort can be 10x as fast as an AVX-512 one.

The 10x number is from comparing against std::sort on random data, yes? Both common std::sort implementations are branching quicksorts from the era when introsort was still considered exciting new technology, and are much slower than pdqsort and other branchless quicksorts. The measurements at [0] show, by my estimate, about a 3x gain against pdqsort and 2x against ipnsort. That's for vqsort only; Intel's barely beats ipnsort. I also pointed out to you at [1] benchmarks showing 1450MB/s for 32-bit radix sort at a million elements on M1, which is comparable to your Xeon measurements and nearly 3x faster than the 499MB/s for M1 in the vqsort paper.
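To make "branchless quicksort" concrete: the key trick is partitioning without data-dependent branches. A minimal sketch of one such scheme (branchless Lomuto; illustrative only, not the actual pdqsort or ipnsort code):

```cpp
#include <cstddef>

// Partition a[0..n) around `pivot`: returns j such that a[0..j) < pivot
// and a[j..n) >= pivot. The comparison result is consumed as an integer
// increment of the store boundary, so there is no conditional jump for
// the branch predictor to miss on random data.
size_t partition_branchless(int* a, size_t n, int pivot) {
    size_t j = 0;
    for (size_t i = 0; i < n; ++i) {
        int x = a[i];
        a[i] = a[j];       // unconditionally swap a[i] and a[j]...
        a[j] = x;
        j += (x < pivot);  // ...then advance j by 0 or 1, branch-free
    }
    return j;
}
```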

If you're going to quote a std::sort number, at least give the name instead of implying it's a good representative for scalar sorts! I would strongly recommend, however, that you benchmark against and study more modern scalar algorithms. I know SIMD and have even vectorized parts of pdqsort in the past; I've chosen to work with scalar algorithms for the past year or two because I think these methods still have a lot to offer.

[0] https://github.com/Voultapher/sort-research-rs/blob/main/wri...

[1] https://news.ycombinator.com/item?id=36309049


What exactly do you see as overstated - only the "10x"? I realize characterizing sort performance with one number is very lossy, but consider "can be" to be a rather modest claim.

> The 10x number is from comparing against std::sort on random data, yes?

That's just one example. For another, check out how many projects still use the even slower HeapSort: https://sourcegraph.com/search?q=context:global+HeapSort+cou... (198K search hits)

When comparing against ipnsort, that is also (partially) vectorized, right?

FWIW I agree that radix sort can be great in certain cases, especially if one core is allowed to soak up all the system memory bandwidth. Unfortunately this is rare from where I sit.

I'm curious which specific parts/aspects of scalar algorithms you'd recommend for study? From my perspective (admittedly skewed towards servers), anything as energy-inefficient as scalar code running on big OoO cores is a hard sell whenever vector alternatives exist.


Yes, only your "10x" and "5-10x" numbers are overstated. Which is to say, all the quantitative comparison you've shared here, or in the vqsort README (aside from the link to Lukas's measurements, which is linked with no indication it benchmarks against faster sorts). What you share is defensible in that it's technically correct, but it's not helpful.

As an example, consider an application that uses heapsort now, but ships a unified binary for x86. The programmers could easily switch to pdqsort, which is well-tested and available in many languages. If they read only material you publish, they may get the impression that SIMD support (at least SSE4) is required to get a substantial improvement, and stick with heapsort to avoid the need to ship multiple binaries or add architecture detection. All you need to correct this is one mention that you can do better than std::sort! By leaving this out, you instead reinforce the idea that it's a reasonable choice if you have to go scalar.

ipnsort has no explicit vectorization; some parts may be auto-vectorized. From a quick glance through perf results, I don't see any vector instructions used in the parts that take up time. I didn't see any anywhere, actually. Same story with other quicksorts, although I know fluxsort/crumsort's initial analyzer is designed to be auto-vectorized.

I don't get why you dismiss radix sort so quickly, when your own research shows that an MSD/LSD hybrid addresses the bandwidth problem well? Thanks for sharing that paper by the way, very elegant approach. The bandwidth problem is not one I've seen discussed although I see how it would arise. I just don't deal with big arrays that much—and it seems if you're going to market vqsort as general-purpose you should also care about this use case! However, a nice thing about radix sort is that it can be used as a base case for quicksort/samplesort, and any partitioning done so far allows fewer steps to be used. I have a case in SingeliSort to do 4-byte sorting that fits in 2-byte range nearly in-place[0]. It's only used on 2^16 elements or fewer to avoid possible cache associativity problems, so that should fit in L2 and not use system bandwidth.
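For readers following along, a textbook LSD byte-radix sort (a scalar sketch for illustration, not the SingeliSort code linked at [0]) looks like this: each pass counts byte occurrences, builds exclusive offsets with a prefix sum, then does a stable scatter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One least-significant-digit radix pass on the byte at `shift`:
// count occurrences, exclusive prefix sum, then a stable scatter.
void radix_pass(const std::vector<uint32_t>& in, std::vector<uint32_t>& out,
                int shift) {
    size_t count[257] = {0};
    for (uint32_t x : in) ++count[((x >> shift) & 0xFF) + 1];
    for (int i = 0; i < 256; ++i) count[i + 1] += count[i];
    for (uint32_t x : in) out[count[(x >> shift) & 0xFF]++] = x;
}

// Full 32-bit sort = four byte passes, ping-ponging between two buffers.
void radix_sort(std::vector<uint32_t>& a) {
    std::vector<uint32_t> tmp(a.size());
    radix_pass(a, tmp, 0);
    radix_pass(tmp, a, 8);
    radix_pass(a, tmp, 16);
    radix_pass(tmp, a, 24);
}
```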

Comparison-based algorithms I consider most relevant at the moment are fluxsort/crumsort/quadsort (all by the same author; they share many pieces), ipnsort (largely derived from those plus pdqsort but it may be easier to read), and glidesort somewhat (the gliding technique is cool but I think there are flaws to address, see [1]). For distribution sorting, there's radix, the version of counting sort where you reconstruct from counts instead of moving the data (output can be vectorized with a scan for large ranges!), and I've developed Robin Hood sort to take advantage of smooth-ish distributions such as uniform random. I keep notes at [2], which are pretty out of date. I should be writing there instead of here...
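The count-reconstruct variant of counting sort mentioned above can be sketched as follows (assuming plain integer keys in a known small range, with no attached values): after tallying, the array is rewritten from the counts alone, with no data movement, and the inner write loop is the part that can be vectorized for large ranges.

```cpp
#include <cstddef>
#include <vector>

// Counting sort for keys in [lo, hi]: tally each key, then reconstruct
// the sorted array directly from the counts instead of moving elements.
void counting_sort(std::vector<int>& a, int lo, int hi) {
    std::vector<size_t> count(hi - lo + 1, 0);
    for (int x : a) ++count[x - lo];
    size_t i = 0;
    for (int k = lo; k <= hi; ++k)                       // for each key value
        for (size_t c = count[k - lo]; c > 0; --c)       // emit it count times
            a[i++] = k;
}
```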

I of course don't think there's any requirement to know all this to contribute to sorting research. I take issue with making blanket claims like "I'd consider AVX2 to be table stakes for any sorting algorithm" and "anything as energy-inefficient as scalar code running on big OoO cores is a hard sell" without being familiar with the state of the art in scalar sorting.

[0] https://github.com/mlochbaum/SingeliSort/blob/master/src/rad...

[1] https://github.com/Voultapher/sort-research-rs/pull/6#issuec...

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...


> Yes, only your "10x" and "5-10x" numbers are overstated.

OK. I am curious: what is the desired outcome for you in this discussion? If we are arguing whether in some cases the speedup is 2x, I struggle to understand why anyone would use an 'only' 2x slower algorithm in a time-critical application.

> they may [..] stick with heapsort to avoid the need to ship multiple binaries or add architecture detection.

Huh, why would they do that? Calling VQSort() takes care of detection, and does not require multiple binaries.

> All you need to correct this is one mention that you can do better than std::sort! By leaving this out, you instead reinforce the idea that it's a reasonable choice if you have to go scalar.

When do people 'have to' go scalar? I suppose we can mention: "Note that other algorithms such as pdqsort can be about twice as fast as LLVM's std::sort as of 2023-06."

> ipnsort has no explicit vectorization; some parts may be auto-vectorized

Yes, Lukas mentioned it uses 128-bit SIMD so I am guessing it's via autovectorization.

> I don't get why you dismiss radix sort so quickly, when your own research shows that an MSD/LSD hybrid addresses the bandwidth problem well? Thanks for sharing that paper by the way, very elegant approach.

:) Thanks. Unfortunately virtual memory is quite finite (the OS does not leave much for users, especially on Windows), so this approach is problematic for large arrays. Also, many of the use cases we see are 64 or even 128-bit elements, with varied and unpredictable distributions. Sure, radix sort can be suitable when there are smaller keys.

> Comparison-based algorithms I consider most relevant at the moment are fluxsort/crumsort/quadsort .. I keep notes at [2]

Thanks. It's not clear to me how those ideas can interoperate with SIMD. For example, counting the range during partitioning turns out to be expensive, more so than it would be in scalar code (because fewer ports can run vector instructions than scalar instructions).

An is-sorted check could be vectorized as you note, but it's not clear to me that that is a net win (extra cost if not already sorted, and if it is, better to avoid the sort call in the first place).

> I take issue with making blanket claims like "I'd consider AVX2 to be table stakes for any sorting algorithm"

Here are some of my axioms: 1) all x86 CPUs we care about support AVX2 (which has been available for 10 years). 2) Lukas' measurements show that the fixed vqsort with AVX2 is the fastest starting at around 32 elements. (For less than that, I still struggle to understand how branchy insertion sorts can beat a single indirect branch to an AVX2 sorting network, and suspect this is due to remaining issues with measurement, including lack of fences and an unusual attempt to 'poison' branch predictors/caches.)

3) It seems unlikely that anything using less SIMD hardware than vqsort (for example only SSE4) is going to outperform the AVX2 version. 4) we currently have no reason to use a slower sort.
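For concreteness, here's what the branch-free sorting network alluded to in (2) looks like for 4 elements (a scalar sketch; a vector version applies the same compare-exchange pattern across whole registers using min/max instructions):

```cpp
#include <algorithm>

// Branchless compare-exchange: min/max compile down to conditional-move
// or vector min/max instructions, so there are no data-dependent jumps.
static inline void cx(int& a, int& b) {
    int lo = std::min(a, b), hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Optimal 5-comparator sorting network for 4 elements.
void sort4(int& a, int& b, int& c, int& d) {
    cx(a, c); cx(b, d);  // stage 1
    cx(a, b); cx(c, d);  // stage 2
    cx(b, c);            // stage 3
}
```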

Do you disagree with any of those? I suppose there are some CPUs without AVX2, but it's an open question whether one would want to ship an entirely different algorithm for them.


I just want you to accurately represent the performance, which as I've noted is not a big change. If a 2x speedup is definitive, why are you so desperate to push the number higher? You give the speedup of vqsort over std::sort in ideal conditions (AVX-512) but when asked to take into account faster scalar sorts, you go to pdqsort's 2x factor (again this implies that it's the best available!) rather than fluxsort's 3x or ipnsort's 4x.

No, I don't believe your current AVX2 performance will always be better than scalar algorithms. Lukas's benchmarks show... what, a 30% difference between vqsort and ipn in this case? I have 2-3x for random numbers (yes, other cases are not always this good) in Singeli sort, which I'm choosing not to promote until it's more mature. No, you don't need to range check in partitioning. If you don't understand how this can be achieved, that's your problem.


Did you used to have an axiom: "All CPUs we care about have AltiVec"? All those CPUs are deprecated. The AVX-512 CPUs are deprecated. I have 4 different Xeon systems and none of them have AVX-512. It's a niche and it's cool... But in 4.5 years, we will probably not be having this discussion, and it's sad. Make an AVX-512 coprocessor/transputer please? Multi... So I can cram 3 of those in a PCIe lane and run database systems far and fast - without Oracle/MS-SQL... And a broad range of compiler support... Put LAMP at the top of the heap... /End rant.


"can be", if taken as modestly as needed for it to be correct here, is pretty much a completely useless statement. Of course [insert modern sort] is [a ton] faster than [some other unspecified old sort that might not even be targeting perf for sorting plain integers]. So I'd assume most would think that's clearly the wrong interpretation, and read it as a stronger assertion, i.e. any currently known scalar sort is ~10x slower than vqsort/avx512sort, for at least some (this "some" being the justification for "can be" over "will be") input. (and naively this makes sense - 512-bit SIMD processes 16 32-bit elements in a single instruction, so a 10x improvement doesn't sound unreasonable)


I do not understand how one would jump from the readme's "As of 2022-06-07 this sorts large arrays of built-in types about ten times as fast as `std::sort`" (and this is now being clarified to add LLVM's std::sort) to 'any scalar sort is 10x slower'.


That's about the number in the post here. The readme is better at this, as it does actually mention std::sort, whereas here it's just "Intel's sort and ours can be about 10x as fast as scalar.". (not that I expect everyone to provide 100% of the necessary details on everything on every post anywhere on the internet, but here it's been explicitly pointed out to you that it's quite misleading)


Quite the robust defence of scalar, coming as it is from a vector guy!

https://github.com/mlochbaum/BQN


Vector registers are great, but it's important to remember that random-access cache is also very powerful hardware, and can be better-suited to a lot of searching and sorting tasks.

One task where, unlike sorting, I think vectors have no shot, is random shuffling. For small sizes (okay, but too large to fit all in registers) it's hard to see how anything could beat Knuth shuffle since it just does one swap per value. At large sizes it's not cache-friendly, and there are known methods based on quicksort and mergesort that work better. But I found that a radix pass beats these easily, and I'd expect this to hold even with AVX-512 versions of merge/partition, because it takes 8 steps of those to add up to one 1-byte radix step. Sorting runs into similar problems at very large sizes. The vqsort authors found that at 1e8 elements a hybrid of ips4o (samplesort, think many-way partition) beats pure vqsort.
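For context, the Knuth (Fisher-Yates) shuffle being referenced, as a minimal sketch with a toy LCG standing in for a real PRNG (not the writeup's implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One swap per element -- but each swap targets a random index, which is
// exactly the access pattern that stops being cache-friendly at large sizes.
void knuth_shuffle(std::vector<uint32_t>& a, uint64_t seed) {
    for (size_t i = a.size(); i > 1; --i) {
        seed = seed * 6364136223846793005ULL + 1442695040888963407ULL;  // toy LCG
        size_t j = static_cast<size_t>(seed >> 33) % i;  // index in [0, i)
        std::swap(a[i - 1], a[j]);
    }
}
```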

Writeup: https://mlochbaum.github.io/BQN/implementation/primitive/ran...

Timings (i5-6200U): https://matrix.org/_matrix/media/r0/download/matrix.org/AVtv...


> vqsort authors found that at 1e8 elements a hybrid of ips4o (samplesort, think many-way partition) beats pure vqsort.

Whoa, important caveat there: we did the hybrid with ips4o as an easy way of getting multicore support. We also mention pure scalar ips4o is slower than vqsort until it can use _19 threads_ (vs one for vqsort). It is the combination of multiple threads (to compensate for the much slower scalar code) and bandwidth friendliness that makes this hybrid useful.


Apologies, I misread that section and thought the only difference between tables 1 and 2 was the array size! Thanks for correcting my impression here, that's good to know. And very sorry for the misrepresentation.


It seems likely to me that multi-way merge is better. 256-way is a bit much, but, with appropriate queueing, you can get close. Among other reasons because, unlike multi-way partition, multi-way merge does benefit from vectorisation, and hence is more efficient. (Of course, for the 'very parallel' a la gpus, partition/radix is apparently interesting again, as you can afford the out-of-place parallel scan.)


Oh, I didn't know that about multi-way merge (regrettably I haven't really wrapped my head around SIMD merging yet). Got a source for this? https://vldb.org/pvldb/vol8/p1274-inoue.pdf mentions it but it looks more like interleaving several binary merges than true multi-way. Although I guess from a memory perspective there's no difference.

Also, not sure I've mentioned multiway Powersort to you, and it seems like a natural fit: https://www.wild-inter.net/publications/cawley-gelling-nebel... .

One thing that I'd imagine makes sense is to increase the number of merged arrays as the total size gets larger and the memory accesses get slow.


Yeah, that's logically what it is—point is just to reduce memory traffic. (And, for that matter, memory subsystem ops, which is amusing as distribution by contrast directly exploits the memory subsystem.) I haven't read anything about it, but I came up with the following scheme:

A basic efficient scheme for merging two arrays is: pop one vector from the left array, pop one from the right array, merge them with some oblivious sorting network, and then write out the low vector. Keep the high in reserve. Then repeatedly pop a vector from whichever array has the lower initial element, merge it with the reserve vector, and write out the low vector from the result.
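A scalar sketch of that two-array scheme (with `std::inplace_merge` on a 2W buffer standing in for the oblivious in-register merge network, and W a stand-in for the vector width):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr size_t W = 4;  // stand-in for the SIMD vector width

// Stand-in for the bitonic merge network: buf's two sorted halves
// become one sorted run of 2W elements.
static void merge_2w(int* buf) { std::inplace_merge(buf, buf + W, buf + 2 * W); }

// Merge two sorted inputs (lengths assumed to be nonzero multiples of W).
std::vector<int> merge_sorted(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    int buf[2 * W];
    size_t i = W, j = W;
    std::copy(a.begin(), a.begin() + W, buf);      // pop one vector from each input
    std::copy(b.begin(), b.begin() + W, buf + W);
    merge_2w(buf);
    while (i < a.size() || j < b.size()) {
        out.insert(out.end(), buf, buf + W);       // write out the low vector
        std::copy(buf + W, buf + 2 * W, buf);      // high vector becomes the reserve
        // pop from whichever input has the lower initial element
        if (j == b.size() || (i < a.size() && a[i] <= b[j])) {
            std::copy(a.begin() + i, a.begin() + i + W, buf + W); i += W;
        } else {
            std::copy(b.begin() + j, b.begin() + j + W, buf + W); j += W;
        }
        merge_2w(buf);
    }
    out.insert(out.end(), buf, buf + 2 * W);       // flush the final 2W elements
    return out;
}
```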

We can imagine doing something similar to merge k inputs: pop one vector from each input, and merge them all together. Write out the low vector, keeping the remaining k-1 in reserve; then, repeatedly pop one vector from the input with the lowest initial element (can be determined with a tourney tree[0]), merge it with the k-1 reserve vectors, and write out the low result. The problem is that this is inefficient; we have to do a 1:k-1 merge. The solution is queueing. Pop k vectors, each time from the lowest remaining input. Merge the k vectors all together (if k is a power of two, then this is completely balanced), then merge the fresh k with the reserve k-1 (only slightly unbalanced), and write out the low k results, leaving the high k-1 as the reserve for the next iteration.

For very large inputs, this will be inadequate (I think I can get up to k=8 on avx512), but it can be layered using in-memory buffers; 8^2=64-way merge seems reasonable, especially given that at this point you probably want to use multiple cores.

Multiway powersort looks interesting, thanks! I was wondering how to solve that problem.

0. https://en.wikipedia.org/wiki/K-way_merge_algorithm#Tourname...


Last time I checked, there are more than two C++ compilers.


More bandwidth, and more or wider cache lines, maybe. From my crude benchmarking, pretty sure a lot of the oomph in Apple's M1 is that 128-byte cache line.

512-bit SIMD operations are great, but if I'm constantly having to copy things over to SIMD registers to do it... am I really winning?

More cores just means more cache evictions / contention in many workloads.


AMD enabled AVX-512 without increasing their power consumption. To do that, their AVX-512 runs at only half the max speed of Intel's when using 512-bit registers (i.e. max FLOPS is the same as with AVX2 256-bit registers). Still, because the new instructions in AVX-512 are beneficial in many scenarios, the actual speed-up is often still 10-20% in code that benefits from them.

Then, 4 years later, AMD has the option to bump to full-speed AVX-512 if it is really needed. This is the same thing they did with AVX2 initially.


No, this (AMD running at half speed) is a myth. For most instructions AMD runs the 512-bit instructions at exactly the same speed as Intel. Both Intel and AMD execute either two 512-bit instructions per cycle (most simpler instructions) or one 512-bit instruction per cycle (a few of the more complex instructions). The reason is that AMD executes four 256-bit instructions per cycle (and Intel only three, with a fourth 256-bit section that can be used only by the AVX-512 instructions). Both Intel and AMD pair 256-bit sections to execute 512-bit operations at half rate, but equal throughput (the effective throughput actually increases due to lower overhead).

The main exception is that some of the models of Intel Sapphire Rapids, only the most expensive of them, i.e. Xeon Platinum and most of the Xeon Gold models, have a second FMA unit, so only these models are able to execute two 512-bit FMA instructions per cycle, while Zen 4 can execute only one 512-bit FMA instruction + one 512-bit FADD instruction per cycle.

The second advantage of Intel is that it has double bandwidth towards the L1 cache, which is absolutely necessary for Intel, because otherwise it would have been impossible to provide the operands for two FMA instructions per cycle.

While the more expensive Intel models have this FMA throughput advantage, there are also many AVX-512 instructions where AMD has lower latencies than Intel. AMD also has higher clock frequencies when all the cores are busy, and higher memory throughput, so the result is that for many AVX-512 workloads a Zen 4 easily beats a Sapphire Rapids.

The main applications where Intel wins are those that depend mainly on FMA, i.e. linear algebra operations on dense arrays, e.g. DGEMM or LINPACK.


A lot of the downclocking issues he was talking about then are less severe now on newer Intel cpus and AMD cpus, which changes the calculus a lot.

You could probably find a workload where your conclusion is correct but I think the vast majority of workloads would be faster with AVX-512 if you have the time to leverage it.


heavily depends on the workload

Some workloads can be accelerated via AVX-512 as shown here by Anandtech:

https://www.anandtech.com/show/17601/intel-core-i9-13900k-an...

See how AMD CPUs with AVX-512 enabled get a massive boost even with similar/fewer cores

I would agree that most typical workloads don't benefit much from AVX-512; it requires software support and a good use case (wide parallel SIMD)


"HPC doesn't exist and also FP performance is irrelevant" seems a wildly out of touch take from Linus.


Honestly HPC moved to GPUs for most of the heavy FP compute

for CPUs INT perf is king, even in HPC/enterprise


Unfortunately this is not true in numerics. Lots of stupidly heavy CFD/FEA-type workloads parallelize well but aren't GPU-accelerated. The reasons aren't clear to me, but a lot of the popular solvers are CPU-only and involve mostly FP calcs. There are a few solvers that use GPUs, but they tend to be less accurate in exchange.


Reasons: there is a significant amount of work needed to get codes to work in a distributed hybrid or GPU-only fashion. It's a completely different coding paradigm that needs significant study before commercial entities adopt GPU use at scale. All-GPU solvers are starting to be developed, such as FUN3D GPU[0], but features are very limited. GPU development is starting to catch up in the community, so it won't be long before a significant portion can operate heterogeneously or in GPU-only mode.

[0] https://fun3d.larc.nasa.gov/GPU_March_2021.pdf


'this is not true in numerics' - shows no evidence...

GPUs are gaining traction in FP workloads, it can be seen clearly with CPU/GPU data-center market share

Moore's law is pretty much over, we can't simply print more performance these days, we are going to see major shift to accelerators which would require some rewrites, otherwise you're going to be stuck


That would be ironic because Linus also predicted the death of discrete GPUs.


Fabian Giesen has a decent rant on AVX512. https://mastodon.gamedev.place/@rygorous/110572829749524388

Tldr it is not that it isn't useful, but the 512 bit part makes implementation prohibitively expensive.


It doesn't look like a rant to me. Quite the contrary, it is a very important myth busting against common rants regarding AVX512.


I won't say you're wrong, but the answer might depend heavily on the application.


Faster, but still 15-25% slower than Highway's vqsort according to my somewhat amateur tests:

  ------------------------------------------------------------
  Benchmark                        Time     Bytes    Items
  ------------------------------------------------------------
  Hwy/random                     2.16 s  474.1M/s  62.1M/s
  Hwy/sorted_runs                2.29 s  446.7M/s  58.5M/s
  IntelX86SIMD/random            2.90 s  353.9M/s  46.3M/s
  IntelX86SIMD/sorted_runs       2.68 s  382.4M/s  50.1M/s
  

  
https://github.com/funrollloops/parallel-sort-bench


See also "10~17x faster than what? A performance analysis of Intel's x86-simd-sort (AVX-512)" at https://github.com/Voultapher/sort-research-rs/blob/main/wri... posted here at https://news.ycombinator.com/item?id=36273544 and minor additional comments at https://news.ycombinator.com/item?id=36268877 .

When those benchmarks were first done, for random input, vqsort was faster once you get around ~20,000 items, but for 1,000 items the new AVX512 sort is 13x faster than vqsort.

As you read it, you'll see that the vqsort issue for small arrays has been fixed, and as of a few weeks ago or so, vqsort is now faster than the AVX512 sort for random.


Is there an equivalent to caniuse.com for gauging the level of instruction set extension support on x86 and ARM in the wild?


The Steam Hardware Survey "Other Settings" section lists the prevalence of various extensions among users that have taken the survey:

https://store.steampowered.com/hwsurvey/Steam-Hardware-Softw...

The data set is a mixed bag though. Some types of system are going to be over-represented.


Yes, I've recently heard some mention of this survey being skewed/buggy, which is a pity because I do not know anything better. IIRC Linux systems were underrepresented.


What kind of processes are heavily (or, at least, moderately) impacted by the performance improvement of sorting algorithms?

Given my lack of knowledge (and the heat of today :0 ), I can't guess what algorithms or processes do lots of sorting behind the scenes.


I worked in embedded development and sorting large lists of files was a surprisingly big bottleneck for many of our projects because of the very slow microcontrollers. Even worse we couldn't cache the results because of no memory.

So I was tasked to improve that and I had to write inline assembly to abuse specific cpu instructions that could effectively do much more char comparisons per clock cycle. We ended up not using it because of the usual reasons (not portable, hard to maintain) and our customers had to live with the 20-30s delay when entering a large directory, versus the 5-6s my code achieved.

(Sorry this has little to do with the topic at hand, I don't really know when sorting becomes a problem on a multicore 5ghz cpu)


Sorting is pretty common in the numerics world because a lot of algorithms or techniques can be optimized heavily with sorted inputs. You either get to skip steps or bisect the dataset. Sort of like how most fast fft implementations will run 10-20% faster if you pad vectors to reach a power of 2 length. A typical "preprocess pipeline" would involve splitting vectors into power of 2 sizes or padding them to maximize cache lines, normalizing (and often mapping to an integer domain), and sorting.


Databases would be my suggestion


I might be wrong, but I don't think any database operation is limited by sorting. Even naive quicksort with compiler optimization performs very fast, and it is much faster than reading from disk even with the extra log n factor. Maybe things like Redis sorted sets could benefit from it, but most implementations of things like these are priority queues.


Disk read access is less and less relevant in normal business application DBs.

It is common at the far ends of scalability spectrum: low end where people use $2 VPS, and high end where big tech can afford to invest a lot in engineering effort to build complex systems that are fine tuned for many levels of memory hierarchy.

Most systems can just add memory until the data fits. (Of course, frequently they still don't, and keep doing more expensive things like having many separate copies for different purposes, built in various different expensive projects... sometimes additionally faux-justified by using cloud platforms that charge prohibitively for memory.)


(Hopefully) gone are the days when DBs are spending the bulk of their time blocking in I/O waiting for spinning disks to spin. They spend a lot of intelligence on keeping active tuples paged-in, and memory sizes are exponentially bigger than they used to be, and intelligent query optimization is going to do as much as it can into memory and then apply join algorithms which... are usually sorts.

And even writes, there's sorting there. An LSM or a B-epsilon-tree, etc. they're doing sorting and ordered key traversals before paging out.


It’s pretty common for most SQL queries to be reading data that is entirely cached in memory. In these cases, the performance limitations often come from sorts or hash joins.


What isn't impacted by sorting algorithms would be more of the question?

Anything using an ordered data structure -- btrees, red-black tree etc maps and sets, they're doing sorting.

Which, well, databases. Huge one.


3D graphics.


Sadly, the client CPUs released to date do not have AVX-512. I'm not planning any hardware updates for the next 3 years at least, so it's not relevant for me for now.


Are these significantly faster than the 256 bit versions? They compare to non-simd, but that’s less interesting.

Also, it’s unfortunate that this stuff won’t run on most developer machines.


Yes, can be about 1.6x speedup: https://github.com/Voultapher/sort-research-rs/blob/main/wri... (cpp_intel_avx512).

Disclosure: I am the main author of vqsort, which is also a vectorized Quicksort but portable.


The 512 bit registers are the least important part of AVX-512. The much more important parts are the 32 registers, mask registers (these are great for quicksort), compressed displacement (which shrinks the instruction size for unrolled code), and a bunch of other goodies.


You don't need AVX-512 to implement a bitonic sort. I worked on several implementations with plain AVX.


AVX-512's VPCOMPRESS* instructions are very useful for binary-radix sorting though.


Or SSE2+.


It can often be more than you would expect from the width increase, as lots of masking tools and instructions allow you to vectorize things more efficiently than could be done before.



