(Disclaimer: I don't design ISAs) This goes over a decent amount of nuance that is often lost in internet arguments over ISA design. I frequently hear (on other sites of course, nobody here would be stupid enough to argue this) people going "RISC-V is the best because it has the simplest decode" or "x86-64 has an easy way to do this kind of conditional branch", and everyone is just talking about some facet of the problem without really thinking about the broader picture. Or, worse, they actually have no idea what the problems are for other people, so they'll wave away serious problems in areas they're not familiar with: "oh we can just fuse the uops on big cores", "decoders don't actually matter these days", "nobody will emit that sequence so it doesn't matter if it's fast". To be fair, a lot of the actual practical results are not disclosed widely by the people who make the decisions, so it's easy to fool yourself into whatever you want to believe without the numbers to back it up.
Instruction fusion is a perfectly cromulent approach. These days RISC-V extensions are even written with instruction fusion in mind, such as the recently proposed Zicond, which just adds a couple of "conditionally move value or zero" three-register insns. It turns out that this is enough to support lots of "conditional insn" patterns that other ISAs have to encode explicitly.
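For the curious, here's a minimal sketch of the pattern Zicond targets (the C is just one way to coax this out of a compiler; the asm in the comment is illustrative, with hand-picked registers, not any particular compiler's output):

    /* Branchless conditional select: result = cond ? a : b.
       With Zicond this can lower to three ordinary instructions,
       which a core is free to treat as one logical conditional move:
           czero.eqz t0, a1, a0    # t0 = cond ? a : 0
           czero.nez t1, a2, a0    # t1 = cond ? 0 : b
           or        a0, t0, t1    # result = cond ? a : b */
    #include <stdint.h>

    uint64_t cond_select(uint64_t cond, uint64_t a, uint64_t b) {
        return cond ? a : b;
    }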
Look, I didn't want to name names, but if you willingly volunteer to serve as an example of what I was talking about, I am more than happy to let you do that.
...RISC-V spec literally says things like "we define the canonical sequence to be MULH/MUL, in this order. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies".
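Concretely, that's the full-width multiply case. A hedged sketch (function name mine, using the GCC/Clang __int128 extension):

    /* Both halves of a 64x64 -> 128-bit product. The spec's
       canonical sequence puts the high half first:
           mulhu a2, a0, a1    # upper 64 bits of a * b
           mul   a3, a0, a1    # lower 64 bits of a * b
       so a core that spots the adjacent pair with matching
       operands can do one multiply instead of two. */
    #include <stdint.h>

    void mul_full(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *hi = (uint64_t)(p >> 64);
        *lo = (uint64_t)p;
    }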
I'm not going to respond to your specific claim directly, because I just said I wouldn't do that above. If you really want to hear my opinions, bring this up in some other thread, or just wait for someone else to make the argument ;) But I would like to ask you whether what you're saying is validated by actual silicon, and if so, under which constraints. Does "microarchitectures can then fuse these" actually pan out? Do the (size, mainly?) savings actually help in the contexts it's claimed to target (embedded?)? How do other contexts (server, desktop) feel about this? Is it useful for them? Perhaps it is actively harmful for what they want to do?
Not even necessarily fuse in that case: just cache the operands and result in internal registers and don't run the multiply again if the operands are the same. Same for DIV/REM.
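For illustration, the quotient/remainder pair this refers to (function name mine; note RISC-V defines division by zero but C does not, so this sketch assumes b != 0):

    /* DIV and REM of the same operands, back to back:
           div a4, a0, a1      # quotient
           rem a5, a0, a1      # remainder
       A divider that remembers its last operands and result can
       answer the second instruction without dividing again. */
    #include <stdint.h>

    void divmod(int64_t a, int64_t b, int64_t *q, int64_t *r) {
        *q = a / b;   /* assumes b != 0 */
        *r = a % b;
    }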
Designing for fusion is valid, but RISC-V has a lot of cases that boil down to "fuse a 12-byte, three-instruction sequence where other architectures do it in a single 4-byte instruction".
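The usual example of this (mine, not necessarily the parent's) is an indexed load from an array of 64-bit elements:

    /* a[i] on base rv64g (no Zba, no fusion): scale, add, load,
       i.e. three 4-byte instructions:
           slli t0, a1, 3      # index * 8
           add  t0, a0, t0     # base + offset
           ld   a0, 0(t0)      # load
       versus one 4-byte instruction on AArch64:
           ldr  x0, [x0, x1, lsl #3] */
    #include <stdint.h>

    uint64_t index_load(const uint64_t *a, uint64_t i) {
        return a[i];
    }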
The complaining is about the number of dynamic instructions executed ("path length"), which can hit you if you don't fuse. To be clear, that's a runtime count, not code size: a 3-instruction loop body run a million times adds three million to path length. Of course, path length might not actually be the bottleneck for raw performance, but it's an easy metric to argue about, so a lot of people latch on to it.
Dunno about "great" - "For 6 out of 10 mini-app+compiler pairs, Arm has a shorter path length, with the overall average difference when weighting each benchmark equally being 2.3% longer for RISC-V."
No, winning 4 and losing 6, by a small margin, isn't "being worse than Arm". The paper's authors even explicitly conclude it is not losing to Arm.
This is even ignoring whether code is inside or outside loops, counting fuseable instructions as always non-fused, and not considering any instructions from extensions beyond rv64g as ratified in 2019 (actually unchanged since 2017)... accounting for any of those would have a favorable effect on RISC-V.
This is an excellent result for RISC-V, one that clears any doubts in terms of path length. On top of what we already know about RISC-V leading in code density among 64-bit ISAs.
Might not be "worse" (I'd definitely agree that the difference is small enough to be considered equal within error bounds), but it's certainly not something worthy of RISC-V being noted as doing "great" either.
Excluding extensions is perhaps a significant question, but, for example, Debian RISC-V currently targets rv64gc, which should have the same dynamic instruction counts as rv64g (the C extension only shrinks encodings, it doesn't remove instructions), so software compiled for Debian can't use the later extensions for most code anyway. (Never mind that ARMv8 also had extensions excluded, namely NEON, which is always present on ARMv8 and is not designed to be ignored.)
And, of course, even being better than ARM is not equivalent to being the best it could be; ARMv8 isn't some attempt at a magically optimal instruction set, it's designed for whatever ARM needed, and that includes being able to efficiently share hardware with ARMv7 for backwards compatibility.
Also, the difference in number of instructions on real programs is in the 10% range, which could well be compensated for by other factors. For example, keeping to simpler instructions might well allow a 10% higher clock speed, and lower silicon area too, equalising matters if not gaining an advantage.
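For the back-of-envelope behind that claim: execution time is roughly

    time = (instructions executed x cycles per instruction) / clock frequency

so 1.10x the instructions at 1.10x the clock is a wash, assuming CPI holds steady (which simpler instructions don't guarantee, but that's the shape of the argument).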