I hate to voice a negative opinion but there are multiple red flags in this project.
There are lots of user-space spinlocks in the code, for the scheduler and the MPSC queue. Questionable use of atomics and memory-barrier orderings. An NIH futex built from atomics plus a WINAPI semaphore for backoff. No cpu_relax hints in the spin loops. And not great test coverage.
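For reference, a cpu_relax hint is just a pause instruction inside the spin loop. A minimal test-and-test-and-set spinlock sketch (not the project's actual code; `SpinLock` and `CPU_RELAX` are names invented here) would look like:

```cpp
#include <atomic>
#include <thread>

#if defined(_MSC_VER)
  #include <intrin.h>
  #define CPU_RELAX() _mm_pause()
#elif defined(__x86_64__) || defined(__i386__)
  #include <immintrin.h>
  #define CPU_RELAX() _mm_pause()
#else
  #define CPU_RELAX() std::this_thread::yield()
#endif

// Test-and-test-and-set spinlock: spin on a plain relaxed load so the
// cache line stays in the shared state, and only attempt the exchange
// when the lock looks free. CPU_RELAX() tells the core this is a
// spin-wait, which saves power and avoids a pipeline flush on SMT
// siblings.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        for (;;) {
            if (!locked.exchange(true, std::memory_order_acquire))
                return;
            while (locked.load(std::memory_order_relaxed))
                CPU_RELAX();
        }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};
```

Without the pause, every spinning hyperthread competes with the lock holder for execution resources, which is exactly the pathology the hint exists to avoid.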
All of this could be achieved with mutexes and condition variables: best-case performance would be just as good (with a futex-based mutex), it would be easier to test for reliability, and performance under high contention would be far more predictable.
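A minimal sketch of that alternative (`TaskQueue` is a name invented here, not enkiTS's API): workers sleep on a condition variable instead of spinning, and an uncontended lock on a futex-based `std::mutex` is a single atomic CAS in user space.

```cpp
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// Work queue guarded by a mutex, with a condvar so idle workers are
// parked by the kernel rather than burning cycles. Under contention
// the scheduler, not a spin loop, decides who runs next.
class TaskQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::function<void()>> tasks;
    bool closed = false;
public:
    void push(std::function<void()> t) {
        { std::lock_guard<std::mutex> lk(m); tasks.push_back(std::move(t)); }
        cv.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m); closed = true; }
        cv.notify_all();
    }
    // Blocks until a task is available; returns false once the queue
    // is closed and drained.
    bool pop(std::function<void()>& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !tasks.empty() || closed; });
        if (tasks.empty()) return false;
        out = std::move(tasks.front());
        tasks.pop_front();
        return true;
    }
};
```

The design trade-off: a spinlock can win a microbenchmark, but this version degrades gracefully when a worker is preempted, because waiters sleep instead of spinning on a lock held by a descheduled thread.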
To be fair, this seems to be targeted at games for Windows and Xbox and I'm not sure how good the concurrency primitives (mutex, condvar) are for those platforms. Maybe it makes sense to spin wait on a gaming console. Or maybe the typical intended use case never hits the high contention cases or it doesn't matter if a thread is pre-empted when holding a spin lock.
I am not claiming this code is buggy, but 15 minutes of reading the code and tests does not convince me that it is not.
I've written and tested a lot of this kind of code and it is not straightforward. Sometimes I hit a problem only after running a stress test on all CPU cores for 10 minutes. The tests/examples in this project run 10, 100 or 1000 times in a loop. That is inadequate to hit the pathological thread schedules needed to trigger race conditions with high probability.
On the other hand, the project has been in continuous development since 2015, was created by a games-industry veteran, and is used in at least one released game. IME, such projects born out of real-world requirements are much more robust (and easier to integrate) than the typical 'academic' or (worse) 'Google-scale' framework.
I agree. Absent contention, those primitives are manipulated via user-space operations anyway. The condition variable even has some optimistic spinning in the NPTL implementation, IIRC. So basically, he is reinventing the wheel.
From that point of view, without judging anything about enkiTS and playing devil's advocate: there is a lot of software proving its value in production without a single line of code ever having been subjected to all the best practices discussed here or at conferences.
So we should put aside all those YAGNI, TDD, endless review processes, and just focus on delivering actual product value.
For completeness, modern C and C++ compilers ship yet another thread pool in the OpenMP language extension. I sometimes write my own threading support for unusual use cases, but that's hard to do correctly even with experience. For this reason, in 80% of cases I pick an off-the-shelf implementation instead of rolling my own.
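The "thread pool you already ship" looks like this in practice: an OpenMP parallel-for reduction. Build with `-fopenmp` (GCC/Clang) or `/openmp` (MSVC); without the flag the pragma is ignored and the loop simply runs serially, which is handy for debugging. `parallel_sum` is a name invented for this sketch.

```cpp
#include <vector>

// The OpenMP runtime's worker pool splits the iteration range across
// threads; each thread accumulates a private partial sum that
// reduction(+:sum) combines at the end.
double parallel_sum(const std::vector<double>& v) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)v.size(); ++i)
        sum += v[i];
    return sum;
}
```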
I checked Intel TBB and could not find a way to see the source code or the licence, or even something to download and test, without buying (or downloading a demo of) a commercial product.
So far I have only had bad experiences with Intel libraries: huge, opaque, bloated, and difficult or impossible to use on other architectures.
They usually have very high performance code under the hood so this is a shame.
oneTBB is the continuation of TBB. Intel took all their OSS projects and put them under the "OneAPI" umbrella last year: https://www.oneapi.com/open-source/
This appears to be basically a callback scheduler with support for callback threading and sequencing. A fairly simple thing to code.
The catch here is that if you ever end up needing something like this, it's far more prudent to write your own version than to learn someone else's solution.