There is not some force of nature that makes 1 thread per core "real". On pointer-chasing workloads a hyperthread is as real as anything. HT is basically another way to exploit instruction-level parallelism that your compiler left on the table. You gotta pick what suits your program.
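To make the pointer-chasing claim concrete, here's a toy cost model (all the numbers — issue width, miss latency, work per iteration — are made-up illustrative values, not measurements): each load in a pointer chase depends on the previous one, so a lone thread spends most cycles stalled, and a second hardware thread's work slots neatly into those stalls.

```python
# Toy model (not a real simulator): a core that issues 1 instruction per
# cycle, running threads that each do `work` cycles of compute and then
# stall `miss` cycles on a dependent cache miss. Stalls from different
# hardware threads overlap in the memory system; compute from different
# threads is interleaved, never overlapped (they share one pipeline).
def throughput(threads, work=10, miss=300):
    compute = threads * work            # pipeline cycles needed per round
    period = max(work + miss, compute)  # can't beat one chain's latency
    return threads / period             # iterations/cycle across all threads

print(throughput(1))  # ~0.0032 iterations/cycle: mostly stalled
print(throughput(2))  # ~0.0065: the second thread runs inside the stalls
```

Note the flip side: with `miss=0` (a compute-bound workload that never stalls) the second thread gains you nothing — `throughput(2, miss=0) == throughput(1, miss=0)` — which is exactly why "what suits your program" is the right question.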
A hyperthread logical processor is a way of obtaining better efficiency by using surplus resources that the "real" logical processor leaves on the table and that would otherwise sit idle (read: wasted).
Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.
If you've got a workload that isn't characterized by missing cache and waiting around for main memory fairly often, then usually SMT is going to be pretty inefficient. As a crude rule of thumb, the single-thread performance you get scales as something like the square root of the number of transistors you throw at the problem. So if you want two threads of a certain performance, you can either use one big core with SMT-2 or two cores each a third or so the size. And the two cores will tend to be more power efficient too, with less data moving long distances through the caches.
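A back-of-envelope version of that square-root rule, as a sketch (the ~25% SMT throughput uplift is my assumed figure, not one stated above; the rule of thumb itself is the one from the comment):

```python
import math

def perf(transistors):
    # Rule of thumb from the comment: single-thread perf ~ sqrt(transistors)
    return math.sqrt(transistors)

target = 1.0         # per-thread performance we want, arbitrary units
small = target ** 2  # transistor budget for a small core hitting `target`

# One big SMT-2 core: assume ~1.25x total throughput, so each thread sees
# roughly perf(big) * 1.25 / 2. Solve perf(big) * 1.25 / 2 == target:
big = (2 * target / 1.25) ** 2

print(small / big)      # → 0.390625: each small core ~0.39 of the big one
print(2 * small / big)  # → 0.78125: two small cores, ~78% of the budget
```

So under these (assumed) numbers, each small core comes out around a third to two-fifths the size of the big SMT core, and the pair together still spends fewer transistors — consistent with the "two cores each a third or so the size" claim.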
Now, if you are hitting main memory often, it makes sense to go wild with threads and use SMT-8 like IBM's POWER cores or Sun's SPARC cores did.
And if you mostly just care about maxing out single threaded performance for user responsiveness then you do indeed have lots of unused resources most of the time and you might as well add SMT for when you're more throughput bound.
But while the design and transistor costs of adding SMT to a design are very modest, everything I've heard about the test and verification of it seems pretty hairy.
> Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.
It's as real as a Bulldozer CMT thread, and that was widely considered a "real core".
An integer ALU being pinned to a particular thread doesn't make it "real", especially when the pinning comes at the expense of shared frontend resources like a decoder that has to service each thread on alternate cycles (which, as Agner Fog's microarchitecture notes point out, massively bottlenecks both threads). And obviously, if one thread has ILP that can be exploited and the other "core" isn't using its ALU, it would be better to share that ALU! Once you allow that, what you get is SMT.
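The pinned-vs-shared argument can be sketched with a toy issue model (the widths are illustrative, not Bulldozer's actual configuration):

```python
ALUS = 4  # total integer ALUs in the module/core (illustrative)

def issued_per_cycle(ilp_a, ilp_b, pinned):
    """Instructions issued per cycle for two threads with given ILP."""
    if pinned:
        # CMT: each thread owns ALUS // 2 units, whether it uses them or not
        return min(ilp_a, ALUS // 2) + min(ilp_b, ALUS // 2)
    # SMT: the whole pool is shared; an idle thread's units go to the busy one
    a = min(ilp_a, ALUS)
    b = min(ilp_b, ALUS - a)
    return a + b

print(issued_per_cycle(4, 0, pinned=True))   # → 2: half the module sits idle
print(issued_per_cycle(4, 0, pinned=False))  # → 4: SMT lends the idle ALUs
print(issued_per_cycle(2, 2, pinned=True))   # → 4: pinning costs nothing
print(issued_per_cycle(2, 2, pinned=False))  # → 4: ...when load is balanced
```

When both threads have moderate ILP the two schemes tie; the difference only shows up when one thread could use resources the other is wasting — which is the whole case for sharing.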
At the end of the day that's all CMT is - SMT with inefficiently allocated (pinned) resources, and with even fewer dedicated resources on the frontend as well. And people will absolutely die on the hill that Bulldozer was a "real core". There are probably some scheduling advantages to doing CMT instead of SMT, but there are performance costs as well.
So what is a "real core" anyway in this context? Is a physical (unsharable, unchangeable) division of resources "inherently better" than (or even inherently different from) a logical/software-defined division of resources?
Then you've got this whole thing from IBM recently... and leaving aside the fact that it's cache, the question IBM is fundamentally asking us is: why not just allocate more resources to "cores" that need them? Why not execution units as well - why wouldn't that be better? Why is a hardware-defined core better than a software-defined core? https://www.anandtech.com/show/16924/did-ibm-just-preview-th...
And when you look at how you would implement that for non-cache resources, isn't the simplistic answer something very similar to SMT? Not sure POWER9 is all that far off base with runtime-configured SMT4/SMT8 and a big fat core; maybe that's how Intel can make better use of some of the gigacores they've built. Sure, it can run one thread really fast, but why not 8 threads on the same resources? Or you can go the other way and have one thread issue onto multiple cores - as long as there's ILP to cover it and the performance impact of crossing cores is not large, does the difference really matter?
And yeah, maybe that's still one core... but then so is a Bulldozer CMT module too lol. The whole "what is a core, exactly?" question is kind of trite; it doesn't really matter.
Especially because a hyperthreaded core can usually run 2-4 of any particular instruction in parallel, and 6-10 total instructions in parallel. Even if your workloads never stall, they can still get a reasonable core's worth of resources each - just a smaller core.