How to Design an ISA (acm.org)
146 points by eatonphil on Jan 18, 2024 | hide | past | favorite | 104 comments


(Disclaimer: I don't design ISAs) This goes over a decent amount of nuance that is often lost in internet arguments over ISA design. I frequently hear (on other sites of course, nobody here would be stupid enough to argue this) people going "RISC-V is the best because it has the simplest decode" or "x86-64 has an easy way to do this kind of conditional branch" and everyone is just talking about some facet of the problem without really thinking about the broader picture. Or, worse, they actually have no idea what the problems are for other people, so they'll wave away serious problems in areas they're not familiar with: "oh we can just fuse the uops on big cores", "decoders don't actually matter these days", "nobody will emit that sequence so it doesn't matter if it's fast". To be fair, a lot of the actual practical results are not disclosed widely by the people who make the decisions, so it's easy to fool yourself into whatever you want to believe without the numbers to back it up.


Instruction fusion is a perfectly cromulent approach. These days RISC-V extensions are even written with instruction fusion in mind, such as the recently proposed Zicond - which just adds a couple of "conditionally move value or zero" three-register insns. It turns out that this is enough to support lots of "conditional insn" patterns that other ISAs have to encode explicitly.
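For anyone unfamiliar with Zicond, here's a minimal Python sketch of the two primitives and how they compose into a full conditional select. The helper names and the `select` composition are illustrative, not spec text:

```python
def czero_eqz(rs1, rs2):
    # czero.eqz rd, rs1, rs2: rd gets 0 if rs2 == 0, otherwise rs1
    return 0 if rs2 == 0 else rs1

def czero_nez(rs1, rs2):
    # czero.nez rd, rs1, rs2: rd gets 0 if rs2 != 0, otherwise rs1
    return 0 if rs2 != 0 else rs1

def select(cond, a, b):
    # (cond ? a : b) from two Zicond ops plus an OR; exactly one
    # of the two OR operands is zero, so no bits can collide
    return czero_eqz(a, cond) | czero_nez(b, cond)
```

A core can execute the two Zicond ops in parallel, so the dependent chain is short even without a dedicated conditional-move instruction.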


Look I didn't want to name names but if you willingly volunteer to serve as an example of what I was talking about I am more than happy to let you do that.


...RISC-V spec literally says things like "we define the canonical sequence to be MULH/MUL, in this order. Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies".
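The semantics behind that fusion, modeled here for the analogous unsigned pair (MULHU/MUL) with XLEN = 64 - a sketch for illustration, not spec text:

```python
XLEN = 64
MASK = (1 << XLEN) - 1

def mulhu(a, b):
    # high XLEN bits of the full 2*XLEN-bit product
    return ((a * b) >> XLEN) & MASK

def mul(a, b):
    # low XLEN bits of the product
    return (a * b) & MASK

def wide_mul(a, b):
    # the canonical MULH[U]/MUL sequence reconstructs the full
    # product; a fusing core performs one multiply for the pair
    return (mulhu(a, b) << XLEN) | mul(a, b)
```

Without fusion a core performs two full multiplies for the pair; with it, one.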


I'm not going to respond to your specific claim directly, because I just said I wouldn't do that above. If you really want to hear my opinions bring this up in some other thread, or just wait for someone else to make the argument ;) But I would like to ask you if what you're saying is validated by actual silicon, and if so, under which constraints. Does "microarchitectures can then fuse these" actually pan out? Do the (size, mainly?) savings actually help in the contexts it is claimed it targets (embedded?). How do other contexts (server, desktop) feel about this? Is it useful for them? Perhaps it is actively harmful for what they want to do?


Not even necessarily fuse in that case, just cache the operands and result in internal registers and don't run the multiply again if the operands are the same. Same for DIV/REM.
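A sketch of that operand-caching idea as a hypothetical hardware model (the class and counter are made up for illustration, not any real core):

```python
class DivUnit:
    """Caches the last divide so a DIV/REM pair costs one divide."""

    def __init__(self):
        self._key = None
        self._q = self._r = 0
        self.divides = 0          # count of actual divide operations

    def _compute(self, a, b):
        if self._key != (a, b):
            self.divides += 1     # only here does real work happen
            self._q, self._r = divmod(a, b)
            self._key = (a, b)

    def div(self, a, b):
        self._compute(a, b)
        return self._q

    def rem(self, a, b):
        self._compute(a, b)
        return self._r
```

Note the cache is keyed on the operand values, so the two instructions need no canonical order and need not be adjacent.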


In that case you don't need a canonical order. The instructions do not even have to be next to each other.


Designing for fusion is valid, but RISC-V has a lot of cases that boil down to "use a 12-byte fused instruction where other architectures do it in 4 bytes".

L1i matters, people!


> L1i matters, people!

RISC-V consistently wins on L1i footprint.

The complaining is about number of dynamic instructions ("path length"), which can hit you if you don't fuse. Of course, path length might not actually be the bottleneck to raw performance, but it's an easy metric to argue, so a lot of people latch on to it.


>The complaining is about number of dynamic instructions ("path length"), which can hit you if you don't fuse.

Ironically, RISC-V does great there[0]. Note this is despite the researchers not even considering fusion.

0. https://dl.acm.org/doi/pdf/10.1145/3624062.3624233


Dunno about "great" - "For 6 out of 10 mini-app+compiler pairs, Arm has a shorter path length, with the overall average difference when weighting each benchmark equally being 2.3% longer for RISC-V."


Even applying the worst possible reading to RISC-V, and without considering fusion, it is not worse than ARM.

That's awesome.


Isn't shorter path length the goal here? And ARM is better by both those metrics. Am I misunderstanding something?

ARM of course would also benefit from fusion too; but camel-cdr's mention of it being only rv64g is a pretty significant caveat.


Yes, shorter path is the goal.

No, winning 4 and losing 6, by a small margin, isn't "being worse than Arm". The paper's authors even explicitly conclude it is not losing to ARM.

This is even ignoring whether code is within or outside loops, counting fuseable instructions as always non-fused, and not considering any instructions from extensions ratified after 2019's rv64g (actually unchanged since 2017)... any of those would have a favorable effect on RISC-V.

This is an excellent result for RISC-V, that clears any doubts in terms of path length. On top of what we already know about RISC-V leading in code density in 64bit.


Might not be "worse" (I'd definitely agree that the difference is plenty small enough to be considered equal within error bounds), but is certainly not something worthy of RISC-V being noted as doing "great" either.

Excluding extensions is perhaps a significant question, but, for example, Debian RISC-V currently targets rv64gc, which should have the same instruction counts as rv64g does, so software compiled for Debian can't use the later extensions for most code anyway. (never mind that ARMv8 also has excluded extensions, namely NEON, which is always present on ARMv8 and is not designed to be ignored)

And, of course, even being better than ARM is not equivalent to being the best it could be; ARMv8 isn't some attempt at a magical optimal instruction set, it's designed for whatever ARM needed, and that includes being able to efficiently share hardware with ARMv7 for backwards compatibility.


If RISC-V is not worse (it is not) and yet it is much simpler (it is), that is a huge win.

Simplicity has enormous value.


it's also targeting just rv64g


Right. Bitmanip would also, on its own, reduce instruction count considerably.


Also the difference in number of instructions on real programs is in the 10% range, which could well be compensated by other factors. For example, keeping to simpler instructions might well result in a 10% higher clock speed and lower silicon area too, equalising matters if not gaining an advantage.
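Back-of-envelope with the classic iron law (time = instructions × CPI / frequency), plugging in the hypothetical 10% figures from above:

```python
def exec_time(insns, cpi, freq_hz):
    # iron law of processor performance: seconds to run a program
    return insns * cpi / freq_hz

# hypothetical numbers: 10% more instructions, 10% higher clock
base  = exec_time(1.00e9, 1.0, 3.0e9)
other = exec_time(1.10e9, 1.0, 3.3e9)
# the two effects cancel: other / base == 1.0 (within float error)
```

Of course the real question is whether the simpler instructions actually buy that clock-speed or CPI headroom, which is exactly the kind of data nobody publishes.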


From the article

> When it started testing simulations of early Pentium prototypes, Intel discovered that a lot of game designers had found that they could shave one instruction off a hot loop by relying on a bug in the flag-setting behavior of Intel's 486 microprocessor. This bug had to be made part of the architecture: If the Pentium didn't run popular 486 games, customers would blame Intel, not the game authors.

Does anybody know the details of this?


From https://lobste.rs/s/v8xovv/how_design_isa#c_wepeeu

> Sadly, my source for this was a former Intel chief architect, and I don’t think he ever said it anywhere that was recorded. […]


Bob Colwell touches on an issue that sounds like it is this one in his talk Engineering Lessons from the Pittsburgh Steelers [1] at around 1:05:02 (link prepositioned). I'll bet that he went into more detail in his book The Pentium Chronicles, but I don't have it handy at the moment, so can't cite a page. It does not seem to be in his oral history [2] (PDF), which I had thought contained pretty much all the meat of his chronicles though.

If you're more interested in just concrete examples of this kind of thing, apparently IBM's System/360 team ran into heaps of this kind of issue when emulating the IBM 1401. It's mentioned in Frederick P. Brooks, Jr.'s book The Mythical Man-Month in the Formal Definitions section of Chapter 6, and I think he probably discussed it in more detail in his (and Blaauw's) book Computer Architecture.

Edit: pasting in a cleaned up version of the relevant part of the transcript:

1:04:56 It's required to run every one of those well, how do I know it does? I can't test them all, so you say well you as long as you design to the architecture spec that should be enough. Right? Ha no for example, inside the architecture spec there are places where it says this condition flag is undefined as a result of this operation so you'll do – I don't know what it was anymore, add operation or something, no, it can't be add pick a different [thing] – there was some instruction that would say I do not guarantee what the carry flag will look like when I'm finished and you go as an architect hey that's cool it means I can do it either way whatever way is easiest. No it doesn't.

Yeah if you think that you're going to get in big trouble. Because what what will happen is – and this literally happened which is why I know about this – you put the chip out and then you discover oh it was easiest for my team to set the bit to a 1 didn't matter because it was undefined right I get to pick, but all the previous chips were setting it to a zero although they were calling it undefined. Now you're in trouble, because what you're going to discover some goofy app out there required that the bit be a zero after that operation even though the book said it was undefined. And they didn't notice because up until now it always was a zero. But your chip comes out, the software doesn't work anymore. Guess who's at fault? You are. Can you go "hey look at what the book says, can't you read?" and they'll say "I don't care what you say your chip doesn't work my software, you're a loser, your chips busted."

[1] https://youtu.be/jwzpk__O7uI?si=iy23ZM5tQX-hI87C&t=3903

[2] https://www.sigmicro.org/media/oralhistories/colwell.pdf


Classic example of Hyrum's law - adherence to a non-trivial specification is very difficult to check so the real spec is generally "works with that implementation".


That YouTube link was awesome, thanks


It seems most people designing an ISA will do RISC, but when I designed the ISA for the Vircon32 CPU I instead took 32-bit x86 as a basis. I had done ASM in MS-DOS so it was what I knew best. I simplified it a lot in various ways but one of them was key: it cannot split registers to use 8 or 16 bit data. This alone was enough for me to reduce the opcode list to just 64 (6 bits, so no unused opcodes). And that is even adding a few math instructions to handle floats.


Vircon32 looks very nice!

So what did you do w.r.t. addressing modes? Do you have memory-memory operations? Is it a load/store architecture?

I found the CPU specification (https://github.com/vircon32/Vircon32Documents/blob/main/Spec...), but I couldn't find an ISA specification.


Thank you! The CPU document should cover everything about the ISA itself. There are 8 addressing modes, which are covered in pages 19 and 20. MOV is the only instruction that explicitly uses them. Other instructions, at most, can only choose to either use registers or an immediate value.

There are no capabilities like DMA, so memory-memory operations are not really possible, the only exception being the MOVS instruction, which effectively acts as MOV [DR], [SR].


One particular thing that I keep seeing come up: it would be highly advantageous for the ISA to declare "after a jump, certain registers are clobbered". This should certainly include flags (at least, the arithmetic kind - clearly, things like DF/IF/TF are fundamentally a different kind of state even if the (rare!) explicit get-flags and set-flags instructions include both), but there's a decent argument that a couple accumulator-like registers should also be clobbered (but not if used for the RETURN value).

Resumable instructions are clearly also a big deal.


Andrew Waterman's PhD thesis "Design of the RISC-V Instruction Set Architecture" has a nice comparison of ISA designs and differences of RISC-V, MIPS, SPARC, Alpha, ARMv7/8, Thumb, OpenRISC, and x86/x86-64.

https://people.eecs.berkeley.edu/~krste/papers/EECS-2016-1.p...


This was a very good read! Many spot-on observations.

I actually started the design of the MRISC32 ISA before I knew about RISC-V (I even called it VRISC first, for "Vector-RISC", but had to give that name up for obvious reasons).

Initially I attacked the problem from a software developer perspective, trying to keep an open mind towards various possible solutions, but over the years I have learned lots of things and have scrapped many ideas that simply would not work well in hardware. Developing an FPGA implementation in parallel with developing the ISA certainly helped a lot.

One of the luxuries of running it as a one-man open source project is that you are not bound by business deadlines and goals, so it's perfectly OK to change your mind half-way in, and let ideas and concepts mature as you learn more. And unlike the very large body that is RISC-V, you don't have to struggle with competing interests either.


My conclusion from this: there won't be a close-to-optimal ISA for high perf (big cores) in the near-to-mid future, unless a completely new market creates demand.

+ Nobody bothers to publicly document semantics and timing information (except perhaps Agner Fog), let alone the design process.

+ Investment costs are too high.

+ There is no sane way to lower them without efficient emulation compatibility, which inherits the very same design flaws.

+ Formal models with explicit timing information would help, but there is little economic incentive (and plenty of incentive against) going beyond simulations.

+ Smaller ISAs like RISC-V may be feasible; see the seL4 project (embedded/security domains).

I wonder if learning an optimal ISA from given ones would require optimizing 1. time semantics, 2. instruction semantics, 3. source code under uncertain distributions from given examples or if it is completely intractable.


I've designed an experimental ISA that is highly similar to RISCV, but uses 2 operand instructions. It fits comfortably in a 16b instruction format and has room for a secondary instruction set. Technically it doesn't have 32b instructions, as for example jump offsets are cumulative (and would likely be fused). Another interesting feature: the instruction set doesn't depend much on the register width. I'm guessing (emphasis on guessing) the code density is around 50% better. Too good to be true?

But the big question I can't answer is: is this somehow a really bad ISA for high-end CPUs? I think it's likely, as roughly 30% more instructions are needed.


It's not necessarily an inefficient ISA wrt. insn count, since you could trivially fuse the sequence Rd ← Rs1, Rd ← op(Rd, Rs2) to the usual 3-address form Rd ← op(Rs1, Rs2). The better argument AIUI is that it's not really worth it; you might think that you can get away with 16bit insns only, but the encoding space gets quite tight already w/ 16 general registers. Moving to 32bit insns with a 16b compressed format is an easy choice, and then you can comfortably support 32 registers and three-address insns.
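That trivial fusion can be sketched as a toy peephole pass. The tuple encoding and register names here are made up for illustration:

```python
def fuse(insns):
    """Fuse ('mv', rd, rs1) followed by (op, rd, rd, rs2)
    into the three-address form (op, rd, rs1, rs2)."""
    out, i = [], 0
    while i < len(insns):
        cur = insns[i]
        nxt = insns[i + 1] if i + 1 < len(insns) else None
        if (cur[0] == "mv" and nxt is not None and len(nxt) == 4
                and nxt[1] == cur[1]      # writes the same rd
                and nxt[2] == cur[1]):    # and reads rd as first source
            op, rd, _, rs2 = nxt
            out.append((op, rd, cur[2], rs2))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

The point being that the fused form is exactly what a three-address ISA would have encoded in one instruction to begin with.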

Even the SuperH folks seem to have realized this since later versions do use 32bit insns with 16bit as a special case, much like Thumb2 and RISC-V. Some ISAs have 24bit insns but these are rather clunky in other ways.


It already uses 32 registers, just like RISCV, but only half of those can be used to access memory. It mostly cuts down on the number of load/store and branch encodings. The address registers are "intelligent" and can auto-de/increment, while branches have been mostly eliminated from the ISA and don't use many encodings.

But sure, there is always demand for new instructions, and there is limited space. I imagine things like SIMD could be shoved into the secondary instruction set. Though I imagined that secondary ISA would by default be a Forth jump list, it would come virtually for free.


That sounds very much like the SuperH ISA.


Raymond Chen wrote about it: https://devblogs.microsoft.com/oldnewthing/20190805-00/?p=10...

This is an intense 15 part series, from before the author became well known. It takes one through the SuperH ISA from the viewpoint of a compiler writer, for Microsoft Windows CE.


As a kid I designed a few imaginary CPUs, complete with instruction codes with bit fields for memory and register addresses and so on. Fun, but definitely not with a lot of scientific rigour. Never implemented one though, until relatively recently.


Me too! Having your own working CPU design running actual programs is a whole different rush. Hopefully this year I can get mine running on an FPGA.


Great to hear! I got a design running on a Xilinx devboard, though it's a very specialized architecture, not the general purpose CPUs I used to do as a kid.


What's the architecture specialized for? Like a GPU or something?


User programmable real time signal processing. It runs a fixed one instruction per cycle (but it can issue multiple operations per instruction) and the machine code is deliberately not Turing complete, similar to eBPF. Loops must have a fixed number of iterations and are unrolled by the compiler. In sum that means that program runtime in cycles equals length of program in instructions (plus a couple more to clear out the pipeline), so it is trivial to guarantee that any given program meets the timing requirements of the system.
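A sketch of why that guarantee holds, as a hypothetical toy "compiler" (the names and the flush constant are invented, not the commenter's actual toolchain):

```python
PIPELINE_FLUSH = 2  # extra cycles to drain the pipeline (made-up constant)

def compile_program(loops):
    """loops: list of (body_instructions, trip_count) pairs.
    Trip counts must be compile-time constants, so every loop
    is fully unrolled and the schedule is fully static."""
    program = []
    for body, trips in loops:
        program.extend(body * trips)   # full unroll, no branches remain
    return program

def cycles(program):
    # one instruction per cycle, by construction
    return len(program) + PIPELINE_FLUSH
```

Because `cycles` depends only on program length, meeting a deadline is a single comparison at compile time rather than a worst-case execution-time analysis.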


Oh that sounds cool. I guess that would make it a ZISC? What kind of applications did you have in mind for it?


I think it's not quite NISC. There's no microcode alright, but there is a predefined instruction set that is decoded into the appropriate control signals. OTOH instruction scheduling and hazard control is indeed handled by the compiler back-end, which makes it quite complex. The compiler may take up to a few seconds to find a compact and valid instruction sequence for a given program, but it works well enough. I'm currently working on a browser-based graphical programming environment. The intended application is audio signal processing and generation. Might be anything from guitar effects to sound synthesis from a MIDI source, depending on the available hardware resources (ADC and MIDI-UART, respectively).


> When I started, I had very little idea about what makes a good ISA, and, as far as I can tell, this isn’t formally taught anywhere.

ISA principles are covered in Hennessy and Patterson's Computer Architecture: A Quantitative Approach, but they've been relegated to an appendix there. In addition to not being formally taught, they're de-emphasized.


A very well balanced article! My biggest quibble would be that it didn't mention the self-synchronizing[1] nature of the variable-length instructions, which makes RISC-V easier to decode in batches than x86.

[1] https://en.wikipedia.org/wiki/Self-synchronizing_code


How is RISC-V self synchronizing? The encoding described in Chapter 1.5 Base Instruction-Length Encoding isn't self-synchronizing in general.


Interestingly, x86 is also kind of "naturally self-synchronizing" for real-world code, as you'll notice if you try to disassemble at an arbitrary address in GDB. It's not perfect of course - you can always craft a code sequence that is malicious - but in general it's a useful property, even if not one a processor can rely on for instruction decode.


Very good article! One of the gems from it:

> If you buy an NVIDIA GPU, you do not get a document explaining the instruction set. It, and many other parts of the architecture, are secret. If you want to write code for it and don't want to use NVIDIA's toolchain, you are expected to generate PTX, which is a somewhat portable intermediate language that the NVIDIA drivers can consume. This means that NVIDIA can completely change the instruction set between GPU revisions without breaking your code. In contrast, an x86 CPU is expected to run the original PC DOS (assuming it has BIOS emulation in the firmware) and every OS and every piece of user-space software released for PC platforms since 1978.


No matter how interesting and alive the topic seems to be, the situation right now is that ISA design is likely on pause for the next 25 years, or dead, as RISC-V is sucking all of the oxygen out of the room, and leaves virtually no space for competition.

Maybe a few minor issues will end up corrected, or maybe China will fork the project to do so, but that's about it. Compatibility will reign, and RISC-V is probably the last ISA you'll need to know. Thankfully.


I think you're onto something, but I'm not entirely sure that that's exactly how it's going to play out.

First of all I don't think that "compatibility will reign". It's more like once the industry really starts picking up RISC-V, fragmentation will reign (at least for a decade or so).

It's also quite likely (IMO) that we'll see a "next generation" rather sooner than later, i.e. "RISC-VI". RISC-V, with some agreed upon extensions, may become the norm for Android, mobile, automotive and so on, but for the high end (servers, gaming, etc) I think that the industry will push for a different philosophy than the RISC-V authors originally envisioned - and that could become a new "revision" if you will.


The baseline RISC-V ISA is tiny, it only really has a handful of integer insns. So as long as you're OK with RISC-V's choices wrt. basic principles such as insn length (32-bit, extensible) and number of general-purpose registers (you get to choose 16 or 32) there's almost no reason not to build on that work. Literally everything else is up for grabs, and could become a new standard extension if the argument for it is strong enough. That leaves out special-case approaches like VLIW with a higher insn length and more integer registers (also VLIWish things like "The Mill") but these are rare.


Completely right.

There is very little room to complain about what RISC-V does have, in RV32I/RV64I or even RV32G/RV64G. The core ISA works just fine and its primary attribute is that everyone is legally free to use and build on it, and that there is a large and growing body of software that runs on it.

The complaints are about things it doesn't have. No carry bit. No complex addressing modes. That kind of thing. If someone proves that those actually matter, for example by building a CPU with custom instructions that blows away everyone else's, then those can be added to RISC-V. There should never be any reason to need a "RISC-VI".

I'm actually quite sympathetic to Qualcomm's proposal to add some of the things Aarch64 has, as an optional but standardised extension. On the other hand I'm completely against the second part of their proposal, to do a "big bang" replacement of the C extension with their extension in e.g. the RVA23 profile.


I think that companies will continue to suggest a "big bang" that involves supporting integer instructions with three source operands and possibly dropping compressed instructions.

Maybe it's out of selfishness (e.g. because they are repurposing an existing microarchitecture for RISC-V), or maybe it's because that's how they want to build their hardware (e.g. for their particular performance target decoding a plain-old 32-bit instruction may be more silicon/power efficient than to fuse 2-3 16-bit instructions).

Whatever the reasons, I think that RISC-V will have to live with this critique for as long as it lives.

I'm not saying that those are poor design choices, but they will always be pain points (for small cores and big cores alike - but for different reasons).

I don't think that you should underestimate the drive to modify an architecture if it does not fit your needs - especially if it is a free and open architecture like RISC-V. For example LoongArch has already happened, and I can easily see how something similar can happen if a major player decides to move from x86 or ARM to "something else" (e.g. if NVIDIA wants full control over their next gen super AI solution).


>(e.g. because they are repurposing an existing microarchitecture for RISC-V)

I don't think we'll see a repeat of Qualcomm's attempt. This was a very special situation that they ended up with the NUVIA purchase/ARM lawsuit fiasco.

Fortunately, RISC-V foundation handled the situation well. As RISC-V continues to grow exponentially, an unlikely later attempt will meet even stronger resistance, not just from the set precedent, but from the larger already deployed ecosystem of software and hardware.


Well there is the chance that you might not want to advertise your processor as supporting "RISC-V" if it means that someone else's extensions are not the same as yours.


Custom extensions go into custom space.

RISC-V cares about its trademarks; a non-compliant core would not be allowed to use them.


That's not what I mean. I'm saying that if someone makes a program that uses custom extensions, they might label it as "RISC-V" but it doesn't actually work on my machine. Of course other ISAs also have extensions, but they are standard and programs typically check for them before using them. Does RISC-V have a way of looking for which extensions are available? Especially if there are a lot of entities that can implement them?


The ISA spec does indeed have a sane, easy to understand method.

The rest is a software problem. I know that e.g. Linux exposes the required information about the CPU's ISA via a dedicated syscall.

I do not know whether there's some ELF header or the like.


It does? Isn't it read the device tree isa string, the csr bits for the most common things, or catch the trap on illegal instruction?
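For illustration, a sketch of parsing the device-tree/`/proc/cpuinfo` style ISA string into an extension set (simplified; the real naming convention has more cases, e.g. version numbers, and the exact G expansion is an assumption here):

```python
def parse_isa_string(isa):
    """Parse a RISC-V ISA string like 'rv64imafdc_zicsr_zicond'."""
    isa = isa.lower()
    for prefix, xlen in (("rv32", 32), ("rv64", 64), ("rv128", 128)):
        if isa.startswith(prefix):
            rest = isa[len(prefix):]
            break
    else:
        raise ValueError("not a RISC-V ISA string")
    parts = rest.split("_")
    exts = set(parts[0])                  # single-letter extensions run together
    exts |= {p for p in parts[1:] if p}   # multi-letter ones: zicsr, zba, ...
    if "g" in exts:                       # G is shorthand for IMAFD+Zicsr+Zifencei
        exts.discard("g")
        exts |= set("imafd") | {"zicsr", "zifencei"}
    return xlen, exts
```

Discovering the string is the platform's job (device tree, `/proc/cpuinfo`, a syscall); interpreting it is then straightforward.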


Yes, but that's a big "if". When you start looking at it from the perspective "I don't need binary compatibility with microcontrollers", you realize that opcode space has been wasted on things that you don't need. It's similar to how x86 has wasted several single-byte encodings on instructions that are never used in modern programs. The extensions that you end up wanting to do overlap with the base ISA, but with slightly different semantics.

It's manageable and you can live with it, but already from the start you have an unnecessary legacy that needs to be handled.

If there is enough consensus in the industry, a new revision may be the best way forward.


This is a really great article. Good collection of (probably minor) mistakes in RISC-V, and I did not know that Apple has optional TSO memory ordering!


Mistakes if you want a high performance design but certainly the right approach for both academic projects and for the embedded systems where RISC-V has seen most of its commercial adoption.


There's been interesting related discussion in lobsters[0]. Turns out many of these "mistakes" are actually sensible decisions.

0. https://lobste.rs/s/v8xovv/how_design_isa


Thanks for the link. I don't think you can conclude that from the discussion at all.

The comment (https://lobste.rs/s/v8xovv/how_design_isa#c_pluxuy) about JALR doesn't convince me. Yes you can use other registers for millicode and coroutines, but that is also baked into the ABI. See "Return-address stack prediction hints encoded in the register operands of a JALR instruction." in the unprivileged spec.

It's an argument for supporting 2 link registers, not 32. I suppose you could argue "but there might be a future use that needs 3!" but it's definitely not clear cut.

I haven't seen a solid defence for the lack of conditional move or advanced addressing modes either.

Also the inclusion of compressed instructions in RVA22/23 seems to be a mistake: https://lists.riscv.org/g/tech-profiles/topic/101741936#297

I don't think Qualcomm have published their proposal & benchmarks in full, but I gather the pain of compressed instructions is very, very high, so it would be worth removing them from RVA22/23 even if there is a slight performance penalty. Though it is probably too late realistically, since the profiles are meant to be backwards compatible.

Overall I still think these are pretty minor mistakes and RISC-V is a nice ISA.


> I haven't seen a solid defence for the lack of conditional move or advanced addressing modes either.

That's backwards. It's the people who want those who need to prove they make a significant difference, and not just with hand-waving but with actual chips and data. "Everyone else does it" is not data.

Thus far, RISC-V cores come in very competitive and even faster than Arm cores with similar µarch e.g. SiFive U74 vs Arm A55, or THead C910 vs Arm A72.

> the pain of compressed instructions is very very very high

Only for people who bought a company with a fast Aarch64 core that Arm is suing them over, and who are now trying to convert it into a RISC-V core with minimal work.

The companies that are designing fast and wide RISC-V cores from scratch (Ventana, Tenstorrent, Rivos, ...) are saying the C extension is no big deal to implement, and of course does have real advantages in static and dynamic code size, icache size / performance / bandwidth etc.

The discussion you referenced has Qualcomm claiming Rivos is also against the C extension, and then a Rivos person coming back saying "Hang on just a minute ... we're fine with C, we're just open-minded enough to want to see real data on your proposal".


> THead C910

This has extensions that support conditional move and indexed memory access :-D

See https://sourceware.org/binutils/docs/as/RISC_002dV_002dCusto...

In any case you aren't going to be able to see the benefit by comparing totally different cores. There are too many other factors. The article says:

> Arm considered eliminating predicated execution entirely, but conditional move and a few other conditional instructions provided such a large performance win that Arm kept them.

It would definitely be great to see numbers here but I see no reason to doubt that.

> Only for people who bought a company with a fast Aarch64 core that Arm is suing them over using so are trying to convert it to be a RISC-V core instead, with minimal work.

I'm not sure what you are talking about here. I was referring to the fact that the C extension means uncompressed instructions may not be naturally aligned which leads to all sorts of complexities, e.g. fetching instructions that are split over a page boundary, different PMA regions, different PMP regions, etc. It adds a lot of complexity to CHERI too. It's not a "big deal" to implement, but it does add significant complexity which would have been nice to avoid if it wasn't actually necessary.


> This has an extensions that support conditional move and indexed memory access :-D

RISC-V has Zicond now.


>"Bjarne Stroustrup said, "There are only two kinds of languages: the ones people complain about and the ones nobody uses." "

That has to go into my list of favorite quotes!


That quote is programming language design's "We should improve society somewhat" – "Yet you participate in society! Curious! I am very intelligent."

I. e. it's stupid and embarrassing to watch people use it.


Well, I guess that I am stupid and embarrassing, then! :-) <g> :-)

You know, "I resemble that remark!" :-)

(But that doesn't change the fact that I actually like the Bjarne Stroustrup quote!) <g> :-) <g>)

Now let's understand your comment a little bit better. Your comment apparently arises from a Meme, specifically this one:

https://knowyourmeme.com/memes/we-should-improve-society-som...

While that meme is indeed entertaining(!) -- it is in no way actually relevant to the Bjarne Stroustrup quote!

It is a "Motte and Bailey" AKA, "bait-and-switch" argument/comparison.

Propagandistic agenda-driven AI chatbots seem to do this a lot -- but I'll be charitable (this time!) and assume, for the purposes of discussion, that you are human...

Perhaps we should all learn about what a "Motte and Bailey" argument/comparison/logical fallacy, is:

https://rationalwiki.org/wiki/Motte_and_bailey

>"Motte and bailey (MAB) is a combination of bait-and-switch and equivocation".

https://en.wikipedia.org/wiki/List_of_fallacies#:~:text=Equi...

https://pressbooks.ulib.csuohio.edu/eng-102/chapter/fallacie...

Phrased another way (in Billy Madison terms): "Mr. Madison... Everyone in this room is now dumber...":

https://www.imdb.com/title/tt0112508/characters/nm0235999#:~....

Incidentally, another one of my favorite Bjarne Stroustrup quotes is:

"Proof by analogy is fraud". :-)

https://www.stroustrup.com/quotes.html#:~:text=Proof%20by%20...


You seem to have serious issues, and I hope that you can work them out.

That's the prerequisite for me to engage any further with you.


Is anyone trying to design an ISA for a specific programming language, like the Lisp machines of old?

Would RISC-V be amenable to language-specific customizations or peripheral accelerators?


I think I read something about Apple making reference counting faster using hardware.

Machine learning ASICs are prone to having instructions that correspond to activation functions: tanh, sigmoid, etc.

A reasonable approach is to take a RISC ISA and add domain-specific instructions onto it. That gets you straightforward codegen and implementation for the 90% case, and magic instructions to make the important path very fast.


My concept is SmallTalk-ish. Control flow, object frames, and GC in one unit and all higher level math, array, function, FPGA, GPU cell are in one or more adjoined units. The control unit keeps the higher level units fed in a hyper-threading sense. Both unit sets are independently scalable so a unit-pair can be as small as a 8x8 pixel block in a display in a mesh network, an 8-, 12-, or 16-bit device controller, a 32-bit display GPU driver for the pixel block units, etc. All cores talk to other control cores through message passing, with object ID mapping at the interface. Security for small cores is also handled generally at interfaces through capabilities (access to certain core object mappings). The highest levels run as processes on existing OSs.


I would say that all modern "general purpose" ISAs (and microarchitectures) are optimized for C (and C++). That's what most OSes and high-profile applications are written in (web browsers, and by extension Electron apps, compilers, games, content creation, etc.).


RISC-V is designed to provide op-code space for custom instructions, so it can be extended for specific tasks.


Our problem is not a lack of ISAs. We have plenty of ISAs, and we can make lots more pretty fast. Our problem is with the actual implementation in silicon.


Perhaps an even bigger(?) problem is building out the software ecosystem. To be a serious alternative to something like ARM, or even RISC-V, you need support in OS kernels (at least Linux, depending on which market you're going for), LLVM and GCC backends and other toolchain support, backends for various widely used JIT compilers (openjdk, V8, spidermonkey, etc etc), optimized kernels for certain common operations (BLAS, FFT, video encoding, crypto), and so on and so on. A huge amount of manpower in getting all these pieces into shape.


> A huge amount of manpower in getting all these pieces into shape.

Not wrong, but a lot easier in recent years as compared to the past, as historically there were a lot of closed-source systems. Nowadays, with the popularity and ubiquity of open source, one can 'brute force' writing for Linux, GCC, and LLVM, and you've probably a sizeable portion of use cases covered.

(You may have difficulty in getting things into mainline if you've got a niche ISA, so there's a continuing overhead of maintaining patches.)


Quite a lot of the compiler is a deterministic function of the ISA. I know of a company selling "we'll generate a backend from your ISA" as a product, though I suspect there's more manual typing in the background than they'd like. Compiler backends are currently mostly written by engineers but they probably shouldn't be.


ISAs today don't need to be human-understandable.

ISAs of old would have an "ADD" instruction that added two numbers.

New ISAs should have crazy complex instructions that implement things helpful for making JavaScript/Python/whatever run fast.

We should set mostly-automated design-space-search programs off to consider millions of autogenerated new instructions and simultaneously figure out how they would impact compilers, CPU design, power consumption, performance, etc.


There is no "one" thing that makes JavaScript or Python run fast. And even if there was, how do you know the Python of 10, 20, 30 years from now still does that? What happens if someone buys your processor but they aren't running JavaScript on it? We have accelerator blocks for specialized tasks (media decoding, matrix multiplication, …) already. Doing this for other things is hard. The post itself talks about a failed attempt called Jazelle that was only really useful under a very specific set of constraints.


> There is no "one" thing that makes JavaScript or Python run fast.

IIRC, the M series processors were explicitly designed with extra instructions for commonly used JS functionality. Compilers for that platform can rely on such instructions for other languages as well.

Also, SPARC was designed to run C code fast - a function call would require little more than moving the register window to preserve the caller's context. Because of that, calling a function had a very small penalty compared to other architectures of the time.


> The M series processors were explicitly designed with extra instructions for commonly used JS functionality.

I think I know what you're thinking of and it's a little overblown. It's a single instruction - FJCVTZS - which performs a specific type of floating-point to integer conversion. It's a standard ARMv8.3 instruction, not an Apple extension.

https://developer.arm.com/documentation/dui0801/l/A64-Floati...
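For context, FJCVTZS implements JavaScript's ToInt32 conversion (truncate toward zero, wrap modulo 2^32) in one instruction, where the generic conversions would need extra fixup code for the out-of-range cases. A C sketch of the semantics, under the simplifying assumption that the input magnitude stays below 2^63 (the real instruction also handles larger values):

```c
#include <stdint.h>

/* Sketch of JavaScript's ToInt32, the conversion FJCVTZS performs:
   truncate toward zero, wrap modulo 2^32; NaN and +/-Inf become 0.
   Simplifying assumption: |d| < 2^63, so a plain int64 cast can do
   the truncation. */
static int32_t js_to_int32(double d) {
    if (d - d != 0) return 0;               /* NaN or +/-Inf */
    int64_t t = (int64_t)d;                 /* truncates toward zero */
    return (int32_t)(uint32_t)(uint64_t)t;  /* wrap mod 2^32, reinterpret */
}
```

Without the instruction, a JS engine emits this kind of multi-step sequence on every double-to-int32 conversion; with it, the common path collapses to one instruction.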


Compilers don't typically emit those instructions, it's people working on runtimes that need its semantics who do. Also, my understanding (to be fair I haven't really used SPARC much) is that register windows didn't really pan out as being efficient in practice. I think because people couldn't program for it?


Older ARM CPUs had a whole extension named Jazelle to execute JVM bytecode in hardware.

https://en.wikipedia.org/wiki/Jazelle


Interesting. In a hypothetical future we may have CPUs running Python/JS bytecodes directly.

We are nowhere near that point though. Probably would only happen if we have a grand unification of CPU architectures.


Yes, this is what I was referring to :)


It's feasible to have compilers automatically figure out how to use novel ISA features. But designing a microarchitecture & logic based on a proposed ISA is still years of expert human labor, so it's hard to evaluate the performance (both ILP and clock speed) of a novel ISA.


It's usually a better approach to link to libraries that decide at runtime (and save the decision somewhere) which is the optimal code path for a given architecture.
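One common shape for that, sketched in C with a hypothetical cpu_has_fast_path() probe (real code would query CPU features, e.g. __builtin_cpu_supports("avx2") with GCC/Clang on x86): resolve the choice once and cache it in a function pointer.

```c
#include <stddef.h>

/* Two implementations of the same operation; the "fast" one stands in
   for a SIMD or ISA-extension version. */
static long sum_scalar(const int *v, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}
static long sum_fast(const int *v, size_t n) { return sum_scalar(v, n); } /* stand-in */

/* Hypothetical feature probe; in real code this would check CPUID or an
   OS-provided capability vector. */
static int cpu_has_fast_path(void) { return 0; }

static long (*sum_impl)(const int *, size_t);

long sum(const int *v, size_t n) {
    if (!sum_impl)  /* resolve once, save the decision */
        sum_impl = cpu_has_fast_path() ? sum_fast : sum_scalar;
    return sum_impl(v, n);
}
```

glibc's IFUNC mechanism does the same resolution at dynamic-link time instead of on the first call.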


I dunno - in some architectures, a new instruction is just a couple of lines of verilog to implement in the silicon:

https://github.com/YosysHQ/picorv32/blob/master/picorv32.v#L...

(this is the 'subtract' alu instruction decoder in pico riscv for example)


That is basically a toy example, though. With a design like that you just do the simple thing, but it's not fast or efficient.


>There are good reasons why 32-bit Arm failed to compete with Intel for performance, and why x86 has failed to displace Arm in low-power markets. The things that you want to optimize for at different sizes are different.

What if x86 CPUs simply weren't designed for low power, e.g. due to a focus on competitiveness in performance-oriented markets?

Saying that one ISA is faster or more energy efficient is like saying that C++ syntax is faster than Java syntax.

While there are language features that enable things, almost everything is up to the implementation: compiler, libraries, runtime, and the program's code.

One's letters aren't faster than the other's. An ISA doesn't determine the performance characteristics of the end product.

Read this:

https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

Even if you start talking about decoders, then:

>Another oft-repeated truism is that x86 has a significant ‘decode tax’ handicap. ARM uses fixed length instructions, while x86’s instructions vary in length. Because you have to determine the length of one instruction before knowing where the next begins, decoding x86 instructions in parallel is more difficult. This is a disadvantage for x86, yet it doesn’t really matter for high performance CPUs because in Jim Keller’s words:


> Saying that one isa is faster or more energy efficient is like saying that c++ syntax is faster than java syntax

Kind of.

ISAs don't exist in a vacuum - for a given transistor budget, they'll force chip design choices that will drive power consumption and performance. Decoding instructions is one thing, but reordering them quickly and efficiently is more impactful for both power (if it can be done with fewer transistors) and performance (if it can be done better/faster so that more instructions from more instruction flows can be retired at the same time).

I designed a beautiful ISA in college. It was a (mostly) stack machine with instructions designed to make a FORTH compiler extremely easy to implement. Unfortunately, it wouldn't be easy to evolve it past the point processors got faster than memory (I did not see that coming). It would, as originally designed, end up being unavoidably slow unless some fairly complicated caching were to be implemented.

Another interesting example is the Intel 432 and its bit-aligned instructions. A lot of silicon that could be better used elsewhere was dedicated to fetching instructions. It was also slower to implement.

On the x86 not being designed for efficiency, Intel has a whole line of CPUs designed for low-power environments. At some point, there was even a Motorola phone running Android on x86.


>On the x86 not being designed for efficiency, Intel has a whole line of CPUs designed for low-power environments. At some point, there was even a Motorola phone running Android on x86.

Note it failed in the market, and was never competitive.


It failed on mobile phones. Atom and its descendants are in just about every cheap Windows tablet and small laptop, where being an x86 is advantageous.


Windows is (notably) not Android, and laptops/tablets do not run under the same power constraints a mobile phone does.

Conversation went off the rails.


> Saying that one isa is faster or more energy efficient is like saying that c++ syntax is faster than java syntax.

I think that could be a valid statement. APIs can influence performance by constraining the implementation. For instance, the syntax for constructing an object in C++ will, generally speaking, always yield faster code than Java, because Java objects are almost always allocated on the heap, while C++ objects can be allocated on the stack. Compare:

  // C++
  MyObj o{};

  // vs. Java
  MyObj o = new MyObj();

Sure, it's possible to write a Java allocator/GC that will yield similar performance to the C++ code, but in general, that will practically never be the case. The syntax of the language has constrained the implementation so that Java will practically always be slower. Presumably, similar design choices in an ISA could have the same effect.


>Sure, it's possible to write a Java allocator/GC that will yield similar performance to the C++ code,

Didn't you just agree with me that it depends on the impl/end product?

Because what would the reasons be, in the ISA world, to make it not desirable?

>The syntax of the language has constrained the implementation so that Java will practically always be slower.

The most interesting question is: by how much?

1%? 3%? 30%?


Java code is generally 2x slower than C++ code, unless you're operating entirely on primitive types. The JIT usually can't remove enough of the gratuitous memory accesses the language forces.


You're talking about Java end to end; we are talking about just the syntax.


> Didnt you just agree with me that it is dependent on the impl/end product?

No. I'm saying that it may be theoretically possible to tune performance in some cases, but not practical.


The more bondage your language has, the more information your compiler has to make optimisations.


> Saying that one isa is faster or more energy efficient is like saying that c++ syntax is faster than java syntax.

Which may be true, if we’re talking about compilation speed, although in this case the reverse is true.



