At $previous_job we shifted a large workload from Intel to Graviton, a move projected to save ~$1.7M annually while keeping roughly equivalent performance (after some tuning).
I've seen first-hand validation on massive workloads moving to Graviton-based instances, including low-latency, high-TPS Java services and offline big-data compute on EMR.
All combined, the hype is quite real. Heck, even moving an Intel-based service to newer Nitro-based EC2 instances resulted in a drastic performance improvement. Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.
m5 instances already use the Nitro system. Also, m5.24xlarge is a quite quirky instance type: it uses two CPU sockets with 24 cores each in a NUMA configuration. Half of the RAM is attached to each CPU, and access from the other CPU is much slower. On top of that, the cores use a microarchitecture that is roughly eight years old, so they are quite slow in practice.

All of this means that a lot can go wrong when running code on those instances, resulting in lower performance. The usual advice is either to run a separate process on each NUMA domain, or to use NUMA-aware code (which Java code almost never is). The code (or the system) also needs to scale well to many CPU cores.

In addition, the cores are old enough to need Spectre/Meltdown patches and workarounds, which especially hurt syscall performance.
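For anyone who hasn't done the "separate process per NUMA domain" trick before, `numactl` is the usual tool. A minimal sketch for a two-socket box like m5.24xlarge (the JAR name and ports are placeholders, and the node IDs should be checked with `numactl --hardware` first):

```shell
# Pin one JVM per socket so CPU and memory stay on the same NUMA node.
numactl --cpunodebind=0 --membind=0 java -jar service.jar --port 8080 &
numactl --cpunodebind=1 --membind=1 java -jar service.jar --port 8081 &
wait
```

A load balancer in front of the two ports then uses the whole machine without cross-socket memory traffic inside either JVM.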
In our case this instance type is just about the only workhorse for the job: high TPS (it scales well to the core count), plus a large on-disk footprint needed for low-latency key-value retrieval of data deployed to disk.
Having seen your reply, I slightly misspoke about the instance move. We moved from m5.24xlarge to m6i.16xlarge. Sorry for the confusion.
That said, you shared some interesting information. I'd love to read up more on this; is there a specific place I can dig a bit deeper into the finer details of these instance types and architectures?
Yes, I'm aware. The service in question couldn't easily be moved, so we went to m6i, which isn't ARM-based but does leverage Nitro. We saw substantial improvements in that configuration too. I'm not sure what the difference is, since you said m5 uses Nitro as well; my assumption was that m6i's reduced hypervisor overhead from Nitro was why we saw the improvement.
m6i is a much newer CPU architecture, based on Intel Ice Lake rather than Skylake, and it is significantly faster from that alone. The CPU also has about 10% higher clock speed.

The 16xlarge size is also a single-socket, 32-core CPU, meaning there should be no NUMA issues. I would expect it to be much better than m5.24xlarge in most applications once you take the much faster single-threaded performance into account. Of course, nothing beats benchmarking and measuring yourself.
I have personally seen issues with NUMA systems and code that theoretically parallelizes very well. Any synchronized mutable state becomes a problem on these systems. For example, I once had an issue where third-party code used the C "rand" function for randomness. Even though it was not in a hot code path, on m5.24xlarge more than 90% of execution time was spent just on the lock guarding the internal random state. On a "normal" system with fewer cores this never showed up while profiling.
> Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.
I wonder if this is actually an Intel issue or if there are some other optimizations at play, such as in the virtualization layer.
Because at one point I wanted to try JetBrains' new "Gateway" product, which basically runs a remote IDE and only shows the GUI locally. I was curious, but I also wanted a machine with a bit more oomph for my occasional compilation needs (Rust on Linux, FWIW). I was really unimpressed: the c6i was comparable to my slim local laptop running an 11th-gen i7 U-series part, and my similarly slim AMD 5650U laptop is actually faster. IIRC, even c6i.metal wasn't particularly faster on this kind of single-threaded work.
The difference is in the pricing and the fact that the cores are "whole".

On Intel AWS instances, you pay per hyperthread. On Graviton, you pay per core.

But on this kind of workload, and with modern schedulers, the HT bump is rather limited. So in practice you are paying twice the price for the same number of cores.

This is the biggest contributing factor to that difference, and I keep being surprised that no one mentions it.
Not sure what you're talking about: on AWS x86 you pay by core (well, as much as you pay by core with ARM anyway; you can't just buy a 1 GB server with 64 cores).
AWS x64 'cores' are the virtual cores you see on hyperthreaded CPUs and map 2:1 to physical cores on the CPU, whereas the AWS ARM offering doesn't have hyperthreading, so its virtual cores map 1:1 to physical cores.

You can disable hyperthreading on the x64 instances, at the cost of halving the number of cores available in the instance you paid for.
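For what it's worth, AWS lets you request the one-thread-per-core layout at launch time via CPU options; a sketch (the AMI ID is a placeholder):

```shell
# m5.24xlarge normally shows 96 vCPUs (48 cores x 2 threads);
# this requests the 48 physical cores with hyperthreading off.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type m5.24xlarge \
  --cpu-options CoreCount=48,ThreadsPerCore=1
```

As the parent says, you still pay the full 96-vCPU price for it.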
$1.7 million may be a lot or may be a tiny drop depending on your overall (like-for-like) spend. The percentage saved is the important metric here, not the aggregate dollar amount.

Did you replace $5M of EC2 with $3.3M of EC2 and save $1.7M (impressive), or did you replace $50M of EC2 with $48.3M (not really so impressive)?
The same fixed amount of money is less noticeable in a larger organization. It's more likely that something more effective could have been done with the engineers' time.
On the other hand, if the larger organization hired an AWS specialist, as many do, the optimization might be "free" because the specialist wouldn't have been effective outside of their area.
I'd say the exact value, versus a percentage, can be more meaningful at many non-billion-plus companies.

Knowing you saved $1M/yr means you're closer to profitability, or able to hire more engineers. (I understand that in the VC world all you really care about is percentage growth, but whether that's how everything should be is up for debate.)
I've been impressed not only with cloud ARM performance but also with its usability. I had a hobby project using Minecraft, JavaScript, Bash, and such. IIRC absolutely nothing needed to be changed going from x86 to ARM, and I got better performance (for my odd application) at a better price.
In my experience it often does go smoothly, but sometimes it doesn't. The typical culprit is a container image dependency that isn't published for ARM. The Docker tooling then doesn't make it particularly easy to substitute your own ARM binary image for a public hub-published x64-only one. And sometimes the image won't build at all on ARM for <reasons> of the kind that waste days of your time and turn out to require building an entire Linux distribution to fix...
I didn't bother taking the time to compile from scratch and figure out a way to inject it into my container image; it was for a smaller client of mine. But it seems real simple to set up a second pipeline in their CI system and move on.
One person is compiling with -march=native (Intel) and using hand-tuned AVX-512 where appropriate. The other is running an application, and wants good performance for the money without too much vendor lock-in.
In other words, if you care about performance go with Intel? As for vendor lock-in, how is relying on AWS specific hw much better (see other comments below)?
If you care about performance to the extent of inlining hand-tuned assembly, you almost certainly care enough to do your own perf benchmarking on your own workloads.
Presumably, you aren't doing anything special in those ARM builds, so you can use an x86 build just as easily (or a non-Amazon ARM chip) depending on cost.
There is not some force of nature that makes 1 thread per core "real". On pointer-chasing workloads a hyperthread is as real as anything. HT is basically another way to exploit instruction-level parallelism that your compiler left on the table. You gotta pick what suits your program.
A hyperthread logical processor is a way of obtaining better efficiency by using surplus resources left on the table by the "real" logical processor that otherwise would remain idle (read: waste).
Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.
If you've got a workload that isn't characterized by missing cache and waiting around for main memory fairly often, then SMT is usually going to be pretty inefficient. As a crude rule of thumb, the single-thread performance you get scales as something like the square root of the number of transistors you throw at the problem. So if you want two threads of a certain performance, you can either use one big core with SMT-2 or two cores each a third or so the size. And the two cores will tend to be more power efficient too, with less data moving long distances through the caches.

Now, if you are hitting main memory often, it makes sense to go wild with threads and use SMT-8 like IBM's POWER cores or Sun's SPARC cores did.

And if you mostly just care about maxing out single-threaded performance for user responsiveness, then you do indeed have lots of unused resources most of the time and you might as well add SMT for when you're more throughput-bound.

But while the design and transistor costs of adding SMT are very modest, everything I've heard about testing and verifying it sounds pretty hairy.
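A back-of-the-envelope version of that square-root rule of thumb (treating the rule as an assumption, not a law):

```latex
% One big core of size N vs. two cores of size N/3 each:
\[
\text{perf} \propto \sqrt{N}, \qquad
\frac{\sqrt{N/3}}{\sqrt{N}} \approx 0.58, \qquad
\frac{2\sqrt{N/3}}{\sqrt{N}} = \frac{2}{\sqrt{3}} \approx 1.15
\]
```

So two third-size cores deliver roughly 15% more aggregate throughput than one big core, with each thread running at about 58% of the big core's speed, which is in the same ballpark as what an SMT sibling thread typically gets.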
> Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.
It's as real as a Bulldozer CMT thread, and that was widely considered a "real core".
An integer ALU being pinned to a particular thread doesn't make it "real", especially when it comes at the expense of shared frontend resources like a decoder that has to alternate between servicing each thread on alternate cycles (which, as Agner Fog's microarchitecture notes point out, massively bottlenecks both threads). And obviously, if one thread has exploitable ILP and the other "core" isn't using its ALU, it would be better to share that! When you allow that, what you get is SMT.

At the end of the day that's all CMT is: SMT with inefficiently allocated (pinned) resources, and with even fewer dedicated resources on the frontend. And people will absolutely die on the hill that Bulldozer was a "real core". There are probably some scheduling advantages to doing CMT instead of SMT, but there are performance costs as well.
So what is a "real core" anyway in this context? Is physical (unsharable, unchangeable) division of resources "inherently better" (or even inherently different) from logical/software-defined division of resources?
Then you've got this whole thing from IBM recently... and leaving aside the fact it's cache, the question IBM is fundamentally asking us is, why not just allocate more resources to "cores" that need them? Why not execution units as well, why wouldn't that be better? Why is hardware defined core better than software defined core? https://www.anandtech.com/show/16924/did-ibm-just-preview-th...
And when you look at how you would implement that for non-cache resources, isn't the simplistic answer something very similar to SMT? I'm not sure POWER9 is all that far off base with runtime-configured SMT4/SMT8 and a big fat core; maybe that's how Intel can make better use of some of the gigacores they've built. Sure, it can run one thread really fast, but why not 8 threads on the same resources? Or you can go the other way and have one thread issue onto multiple cores; as long as there's ILP to cover it and the performance impact of crossing cores is small, does the difference really matter?
And yeah maybe that's still one core... but then so is a bulldozer CMT module too lol. The whole "what is a core exactly" is kind of trite, it doesn't really matter.
Especially because a hyperthreaded core can usually run 2-4 of any particular instruction in parallel, and 6-10 total instructions in parallel. Even if your workloads never stall, they can still get a reasonable core worth of resources each, just a smaller core.
Terminology nit: Hyper-Threading(tm) is the Intel-ese marketing term for their implementation. The computer architecture term is SMT (simultaneous multithreading).
SMT is a synonym for multi-processing and applies to full cores as well as shared cores. AFAIK, there is no non-marketing term for Hyper-Threading; it's a very specific architecture.
That's purely an AWS choice (they could sell them only as real cores), as is the fact that AWS provides large discounts on their Graviton instances as a way to limit portability to other clouds.
If you have been able to convert your deployment from x64 to aarch64, you can do the same in the other direction, or choose another cloud provider that provides aarch64 stuff. It is really easy to build container images in multiple architectures nowadays.
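For example, with BuildKit a single invocation can produce and push both variants in one manifest (the registry and image name here are placeholders):

```shell
# One-time: create a builder that can target multiple platforms.
docker buildx create --use

# Build and push a single multi-arch image.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myservice:latest \
  --push .
```

Clients on either architecture then pull the same tag and get the right binary automatically.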
It doesn't appear to be in preview in the USA yet. It appears only to be available in the Hangzhou-I zone, and you can only obtain access to a single instance.
We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.
Until AWS breaks out their margins between X86 and ARM (not going to happen) or total AWS margins start to go down, we don’t really publicly know if they are “dumping” or if they are just passing on the savings to the consumer.
The cost of the CPU is a small part of a server's TCO. Graviton instances could be cheaper because the platform is cheaper, uses less power, and needs less cooling; I think we know from Apple Silicon that ARM chips can have these advantages.

Disclaimer: I work for AWS, but I don't have any internal knowledge about Graviton pricing or non-public performance data.
It could go a third way, where they use them internally, too, and they're cheaper to operate.
So a certain percentage of the costs could be amortized through internal usage.
Edit:
> We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.
Oh, and for this :-)
We do know that Apple's per unit costs must be cheap, since the iPhone SE 2022 retails at $429 for an entire phone, and it comes with A15, which is quite competitive with Android flagship chipsets from 2023. Since they can squeeze in an entire phone + profits in $429, the chipset itself can't be that expensive.
Yes, R&D is probably very expensive but they can spread that around to millions of units, just like Amazon :-)
Of course, Apple could also be dumping chipsets, in which case, I don't know ¯\_(ツ)_/¯
Apple even passed some of those cost savings on to consumers, most of the comparable M1 models were ~ $100 cheaper than the much slower Intel models they replaced. And Apple being Apple, if they passed on that much, their total cost savings almost certainly were significantly higher.
> We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.
We also know that Intel server CPUs aren't cheap to buy and include a significant profit margin for Intel.
More to the point, a lot about Graviton seems focused on keeping the total cost of ownership down; clock speed, for example, is kept modest to limit power consumption.

Perhaps AWS is pricing aggressively to drive adoption. I'd be astonished if they were offering a service like this at scale below their required return on capital.
My point was more about `limit portability to other clouds` than the pricing issue, even if it is known that Amazon has used its economic power in the past to push competitors out of the market.
By the way for anyone curious how the numbers in the article compare to a modern desktop computer, a 13th-generation Intel Core does 169ns/url, twice as fast as either of these server rigs. If I cap mine to 3GHz to reflect more realistic server scenarios, 310ns/url, a bit quicker than the Graviton 3. Even the "efficiency cores" on a desktop are faster than these server parts, at 286ns/url.
Cloud servers have many, many more cores, cache and memory channels than desktop chips. You can’t clock them that high, nor would hyperscalers want to. They would cost more in power and cooling. Cloud is about multi factor optimization.
To me it's clear in the post: they benchmarked how fast their piece of code is on c6i and c7g instances. It's not meant to prove Graviton 3 is faster than Intel in general, just in one of many cases.
I remember just changing one letter in the instance type in EMR to switch from Intel to ARM... and saw a 20% speedup as well as an additional 20% hourly cost saving, all in all a one-third cost decrease from a single character.
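For anyone curious what that looks like in practice, the switch really is just the instance type string. A rough sketch with the EMR CLI (release label, application, and sizes are illustrative, not the commenter's actual config):

```shell
# Same cluster definition; only the instance type changes,
# e.g. m6i.xlarge (Intel) -> m6g.xlarge (Graviton).
aws emr create-cluster \
  --release-label emr-6.9.0 \
  --applications Name=Spark \
  --instance-type m6g.xlarge \
  --instance-count 10 \
  --use-default-roles
```

This assumes the workload itself (Spark jobs, native libraries) already runs on arm64.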
We saw similar gains going from Intel to Graviton 2 on EMR: 15% to 40% speedups depending on the job. Saw similar gains in EMR serverless switching from x86 to arm as well.
The only issue I've had is actually getting enough on-demand instances in certain regions during peak times.
I can't wait for Graviton 3 to be available on EMR.
Just chiming in: also switched a lot of stuff at work from c5/c6/r5/r6 Intel to c6g/r6g Graviton. We are since seeing notably lower average load - read: more available capacity despite same number of instances - while also having a lower monthly cost.
Is this affected at all by "other people's code" running on the same hardware simultaneously?
I am not an expert on the various instance types and/or bursting, but I assume the ARM hardware is not as oversubscribed, simply for lack of fellow users.
Isn't the claim that ARM has nothing comparable to AVX-512 technically incorrect? AFAIK SVE and SVE2 are definitely "comparable", even though the actual vector size depends on the processor's hardware.

In currently available commercial cores, SVE and SVE2 are comparable to AVX but not AVX-512 specifically, because every SVE design out there that isn't in a Fujitsu supercomputer is limited to 256 bits. There may also be differences in scatter/gather that matter in some places.
When you control the pricing and want to drive people to your own hardware, of course it's gonna be cheaper. Gotta exercise that market power over your suppliers
As someone who knows very little about CPUs but is aware of the energy savings to be had on ARM, of reports like this about speed, and of my personal experience with my M1 Pro MBP (which feels like I won't need a new computer for a decade): can anyone explain why we're not seeing much wider adoption of ARM on servers? I'm primarily thinking of the lack of ARM runners on GitHub and the almost unicorn-level rarity of ARM options at VPS hosts.
Most VPS hosting services don't build their own hardware the way some of the giants can, and rely on the available offerings from server manufacturers. Some even use consumer-grade tower PCs. The ARM offering is growing but is still not as large as the Intel/AMD one.
I think this tells us that price/performance is not a critical factor to these vendors, most of the time. Perhaps because they have high margins, or because the CPU is not a significant part of their input costs?
When I was trying to launch c7g instances I was having trouble, but then I just added a capacity reservation and was able to launch fine. This was in us-west-2 about six months ago. Not sure if there are still problems in that region or if this hack still works.
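For reference, the workaround looks roughly like this (zone, type, and count are placeholders):

```shell
# Reserve the capacity first...
aws ec2 create-capacity-reservation \
  --instance-type c7g.2xlarge \
  --instance-platform Linux/UNIX \
  --availability-zone us-west-2a \
  --instance-count 4
# ...then launch as usual: with the default "open" match criteria,
# matching instances in that AZ land in the reservation automatically.
```

Note you pay for the reserved capacity whether or not instances are running in it.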