The M1 doesn't smoke Intel chips just because it's ARM - the latest chips from Broadcom and Samsung don't even come close. The M1 is good because it's good.
That's what I'm a little confused about. It isn't just because it is RISC, it's Apple magic? It seems weird that you can emulate other instruction sets with RISC underneath and get the performance they do. I assumed if you could recompile to the native instruction set you would get a really optimized app, but it seems like the interesting work always operates at a different layer. Fascinating stuff.
I am absolutely not an expert on microarchitecture, but I’ve had the same questions and tried my best to figure out answers. Here’s my understanding of the situation:
> It isn't just because it is RISC, it's Apple magic?
It’s both. We’ve known for decades that RISC was the “right” design, but x86 was so far ahead of everyone else that switching architectures was completely infeasible (even Intel themselves tried and failed with Itanium). It would have taken years to design a new CPU core that could match existing x86 designs, and breaking backwards compatibility is a non-starter in the Windows world. So we ended up with a 20-year-long status quo where ARM dominated the embedded world (due to its simplicity and efficiency) and x86 dominated the desktop world due to its market position.
However, with Apple, all the stars lined up perfectly for them to be able to pull off this transition in a way that no other company was able to accomplish.
- Apple sells both PCs and smartphones, and the smartphone market gave them a reason to justify spending 10 years and billions of dollars on a high-performance ARM core. The A series slowly evolved from a regular smartphone processor, into a high-end smartphone processor, and then into a desktop-class processor in a smartphone.
- Apple (co-)founded ARM, giving them a huge amount of control over the architecture. IIRC they had a ton of influence on the design of AArch64 and beat ARM’s own chips to market by a year.
- Intel’s troubles lately have given Apple a reason to look for an alternative source of processors.
- Apple’s vertical integration of hardware and software means they can transition the entire stack at once, and they don’t have to coordinate with OEMs.
- Apple does not have to worry about backwards compatibility very much compared to a Windows-based manufacturer. Apple has a history of successfully pulling off several architecture transitions, and all the software infrastructure was still in place to support another one. Mac users also tend to be less reliant on legacy or enterprise software.
> It seems weird that you can emulate other instruction sets with RISC underneath and get the performance they do.
As far as I understand it, the only major distinction between RISC and CISC is in the instruction decoder. CISC processors do not typically have any more advanced “hardware acceleration” or special-purpose instructions; the distinction between CISC and RISC is whether you support advanced addressing modes and prefix bytes that let you cram multiple hardware operations into a single software instruction.
For instance, on x86 you can write an instruction like ‘ADD [rax + 0x1234 + 8*rbx], rcx’. In one instruction you’ve performed a multi-step address calculation with two registers, read from memory, added a third register, and written the result back to memory. Whereas on a RISC, you would have to express the individual steps as 4 or 5 separate instructions.
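To make those steps concrete, here is the same operation modeled in Python (register names are dictionary keys and memory is a dict; this is purely illustrative, not real microcode, and a real AArch64 compiler would fold the shift and some of the adds into addressing modes):

```python
def cisc_add_to_mem(regs: dict, mem: dict) -> None:
    # x86 style: ADD [rax + 0x1234 + 8*rbx], rcx  -- one instruction
    addr = regs["rax"] + 0x1234 + 8 * regs["rbx"]
    mem[addr] = mem.get(addr, 0) + regs["rcx"]

def risc_add_to_mem(regs: dict, mem: dict) -> None:
    # RISC style: the same work spelled out as simple single-step ops
    t0 = regs["rbx"] << 3            # shift:  t0 = 8 * rbx
    t1 = regs["rax"] + t0            # add:    t1 = rax + t0
    t1 = t1 + 0x1234                 # add:    t1 = t1 + 0x1234
    t2 = mem.get(t1, 0)              # load:   t2 = mem[t1]
    t2 = t2 + regs["rcx"]            # add:    t2 = t2 + rcx
    mem[t1] = t2                     # store:  mem[t1] = t2
```

Both versions leave memory in the same state; the CISC encoding just packs the whole dependency chain into a single instruction.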
Crucially, you don’t have to do any more actual hardware operations to execute the 4 or 5 RISC instructions as compared to the one CISC instruction. All modern processors convert the incoming instruction stream into a RISCy microcode anyway, so the only performance difference between the two is how much work the processor has to spend decoding instructions. x86 requires a very complex decoder that is difficult to parallelize, whereas ARM uses a much more modern instruction set (AArch64 was designed in 2012) that is designed to maximize decoder throughput.
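A toy way to see the decoding difference in Python (illustrative only; real decoders are parallel hardware, not loops): with fixed-width instructions every instruction boundary is known up front, so eight decoders can each grab their own instruction immediately, while with variable-length encoding the start of each instruction depends on the length of the one before it.

```python
def fixed_width_boundaries(code: bytes, width: int = 4) -> list[int]:
    # ARM style: boundaries are a simple stride, computable in parallel.
    return list(range(0, len(code), width))

def variable_length_boundaries(code: bytes, length_of) -> list[int]:
    # x86 style: you only learn where instruction N+1 starts after
    # working out the length of instruction N -- a serial dependency chain.
    boundaries, pc = [], 0
    while pc < len(code):
        boundaries.append(pc)
        pc += length_of(code, pc)
    return boundaries
```

(`length_of` here is a stand-in for the length-determination logic; on real x86 it has to examine prefixes, the opcode, and ModRM/SIB bytes before it knows how long the instruction is.)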
So this helps us understand why Apple can emulate x86 code so efficiently: the JIT/AOT translator is essentially just running the expensive x86 decode stage ahead of time and converting it to a RISC instruction stream that is easier for a processor to digest. You’re right, though, that native code can always be more tightly optimized, since the compiler knows much more about the program than the JIT does and can produce code better tailored to the quirks and features of the target processor.
> We’ve known for decades that RISC was the “right” design, but x86 was so far ahead of everyone else that switching architectures was completely infeasible
All the experts I've listened to or read say that the instruction set doesn't matter and is insignificant. What matters is branch prediction, data prefetching, and caching. Also, even Intel transforms instructions into RISC-like micro-instructions internally.
> Apple does not have to worry about backwards compatibility very much compared to a Windows-based manufacturer
Windows is shit at backwards compatibility too. Try running any Windows 7-or-earlier program on Windows 10 and most of the time it won't work. Also, Windows can run on ARM as well, and unlike the Mac, ARM Windows didn't have x86 emulation for years.
> All the experts I've listened to or read say that the instruction set doesn't matter and is insignificant. What matters is branch prediction, data prefetching, and caching. Also, even Intel transforms instructions into RISC-like micro-instructions internally.
That's commonly repeated, but it's a misunderstanding. Up until this point the difference was mostly that an x86 decoder took up more chip area, which, given Intel's historical lead in process tech, was no big deal to them.
However, now we're pushing chips to go wider than ever. Intel and AMD haven't been able to push past a 4-wide superscalar decoder. The instruction set just has too many potential chained dependencies to make it work; you'd have to slow the cycle time or introduce additional pipeline stages such that net performance is worse. Meanwhile, the M1 decodes 8-wide.
This dovetails into what you're saying about stalls caused by prediction and caching. Once the stall is resolved, the M1 can race ahead, assigning work to the rename registers at potentially twice the peak rate.
You're being a bit hyperbolic about Windows backwards compatibility. Much of the enterprise software world is still running programs that were written against Windows XP just fine, and MS is not going to rock that boat any time soon.
The big difference with Apple's transition is precisely due to the translation (note not emulation). I've lived through 3 of their ISA changes now and they've all been nearly seamless. The big difference is Mac users have been ok with sunsetting the old apps ~5 years after the transition, something that's a total nonstarter in Windows land.
Rosetta 2 is so stinking fast I have not had to think one whit about what's native vs. translated.
> That's commonly repeated, but it's a misunderstanding. Up until this point the difference was mostly that an x86 decoder took up more chip area, which, given Intel's historical lead in process tech, was no big deal to them.
> However, now we're pushing chips to go wider than ever. Intel and AMD haven't been able to push past a 4-wide superscalar decoder. The instruction set just has too many potential chained dependencies to make it work; you'd have to slow the cycle time or introduce additional pipeline stages such that net performance is worse. Meanwhile, the M1 decodes 8-wide.
Thank you. For the love of Christ, why people regurgitate this [half-truth about the decoders] without realizing what they are implying is beyond me. Sure, in a world where Apple, ARM, et al. were slow, maybe it would be a relevant defense. But they're playing ball, and MS/Intel haven't been up to bat with the home turf truly on the line for years. Likely Intel will shift over to fabbing for third parties, and MS would be fine without Windows were it to fade out to Chrome OS/macOS in time (unlikely, but still).
I think there's another variable on MS's end too: paging. The M1 supports 16K page/allocation unit sizes, right? I strongly suspect this, plus the SSD speed and memory compression, plays a substantial role in the reported "differential use of RAM", which probably also explains the swap rates everyone keeps complaining about (obviously, they are built this way for the most part). On performance, though, I don't put much stock in EclecticLightCo's piece on QoS; at least, not enough to subscribe to the school of thought praising Apple for perfecting heterogeneous core scheduling, which is a bit much.
And yeah, I went from a 2020 x86 MBP to a 2020 M1 MBA. Seamless, and I really haven't thought about emulation other than, for instance, the apparent memory usage, which may be a bit more pronounced with Rosetta.
> All the experts I've listened to or read say that the instruction set doesn't matter and is insignificant. What matters is branch prediction, data prefetching, and caching. Also, even Intel transforms instructions into RISC-like micro-instructions internally.
I've heard this before, but I've also seen sources which indicate that x86 instruction decoding is definitely a bottleneck [1-5]. The M1 has a significantly wider pipeline/OoO window/reorder buffer than any other processor, and most sources seem to agree that this is because the simplicity of the ARM ISA allowed Apple to build an 8-wide instruction decoder (as compared to around 4-wide for x86 chips). [1] also mentions that Apple's impressive branch-prediction capabilities are at least partially because ARM's 4-byte-aligned instructions greatly simplify the design of the branch predictor.
So yes, it's true that an x86 processor really runs RISC-like uops under the hood. However, the best out-of-order execution pipeline in the world is limited by how far ahead it can see, and that depends on how fast the instruction decoder can feed it instructions.
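A toy two-stage pipeline model in Python makes the point (the widths are hypothetical, and real pipelines have buffers, stalls, and fusion this ignores): no matter how wide the execution back end is, sustained throughput is capped by decode width.

```python
def simulate_ipc(decode_width: int, exec_width: int, cycles: int = 1000) -> float:
    # Tiny two-stage model: the front end feeds a queue each cycle,
    # and the back end drains whatever is available, up to its width.
    queue = retired = 0
    for _ in range(cycles):
        queue += decode_width           # decode feeds the queue
        done = min(exec_width, queue)   # execute drains what it can see
        queue -= done
        retired += done
    return retired / cycles
```

With a 10-wide back end, a 4-wide decoder caps sustained throughput at 4 instructions per cycle; widening decode to 8 doubles the ceiling.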
Once again though, I am not a microarchitecture expert. I just read bits of information from people who do know what they're talking about and try to form it into a coherent mental model. If you have knowledge or sources that disagree with me, I would be happy to be proven wrong :)
> It’s both. We’ve known for decades that RISC was the “right” design, but x86 was so far ahead of everyone else that switching architectures was completely infeasible (even Intel themselves tried and failed with Itanium).
Neither ARM nor Itanium are RISC. RISC/CISC don't actually exist - CISC just means "x86" (variable length instructions, memory operands, 2-operand instructions) and RISC means "MIPS or PowerPC" (load store, fixed length 3-operand instructions, weird hardware exposures like delay slots.)
ARM is a load-store architecture and has a lot of registers so it's closer to MIPS but it has complex addressing modes and more instructions. Itanium is VLIW which is almost the opposite of how the M1 works.
Plus ARMv8 in the M1 is a total redesign so it's not exactly the same as older ARMs.
> Crucially, you don’t have to do any more actual hardware operations to execute the 4 or 5 RISC as compared to the one CISC instruction.
This isn't true because you can do a lot of that stuff in one step; just put an adder in the memory access unit. Some complex instructions really are worth putting in the ISA.
x86 uses this to its advantage; the µops can be very long and are not RISCy. RISC is actually harder to deal with here because it's easy to split up instructions into µops, but it's hard to fuse them together again. That's why ARM having condition codes and more complex memory operands is a win.
x86's variable length instructions also fit in memory better, which is good for performance, but they're worse on security because they're harder to parse.