ARM vs. Intel on Amazon’s Cloud: A URL Parsing Benchmark (lemire.me)
142 points by nnx on March 2, 2023 | hide | past | favorite | 89 comments


At $previous_job we shifted a large workload from Intel to Graviton which was projected to save ~$1.7m annually while keeping roughly equivalent performance (after some tuning).


> which was projected to save ~$1.7m annually

Did it?


I've seen first hand validation on massive workloads moving to Graviton based instances. This includes low latency high TPS java services and offline big data compute on EMR.

All combined, the hype is quite real. Heck, even moving an Intel-based service to newer Nitro-based EC2 instances resulted in a drastic performance improvement. Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble, in my opinion.


m5 instances use the Nitro system. In addition, m5.24xlarge is a quite quirky instance type: it uses two CPUs with 24 cores each in a NUMA configuration. Half of the RAM is attached to each CPU, and access from the other CPU is much slower. On top of that, the CPU cores use a microarchitecture from 8 years ago, meaning the cores are quite slow in practice.

All of this means that a lot can go wrong when running code on those instances, resulting in lower performance. It is advisable either to run a separate process on each NUMA domain or to use NUMA-aware code (which Java code almost never is). In addition, the code (or the system) should scale well to many CPU cores.

The cores are also old enough to suffer from Spectre/Meltdown-related patches and workarounds, which especially hurt syscall performance.


In our case the instance type is about the only workhorse for the given job: high TPS (scales well with the core count) and a need for a large local-storage configuration for low-latency key-value retrieval of data deployed on disk.

Having seen your reply, I realize I slightly misspoke about the instance move. We moved from m5.24xlarge to m6i.16xlarge. Sorry for the confusion.

That said, you shared some interesting information. I'd love to read up more on this, any specific place I can dig in a bit deeper regarding the finer specifics of these instance types and architecture?


Just to note: m6i instances are Intel-based.

As for getting information on AWS instances, the best way in my opinion is just to spin up the instance and look up which exact CPU model it uses. Then you can go for example to WikiChip (https://en.wikichip.org/wiki/WikiChip) to see more information about the CPU. Other good sources include Anandtech (for example https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...) and Chips and Cheese (for example https://chipsandcheese.com/2022/05/29/graviton-3-first-impre...).

Things like NUMA configuration can be inspected with tools like numactl.


Yes, I'm aware. The service in question couldn't easily be moved, so we went to m6i, which isn't ARM-based but does leverage Nitro. We saw substantial improvements in that configuration too. Not sure what is different, because you said m5 uses Nitro as well, but my assumption was that m6i, with reduced hypervisor overhead from Nitro, was why we saw the improvement.


m6i is a much newer CPU architecture, based on Intel Ice Lake rather than Skylake. It is quite significantly faster just from that alone. In addition, the CPU has about 10% higher clock speed.

The 16xlarge version is also a single-socket 32-core CPU, meaning there should be no issues with NUMA. I would expect it to be much better than m5.24xlarge in most applications when taking the much faster single-threaded performance into account. Nothing beats benchmarking and measuring yourself, though.

I have personally seen issues with NUMA systems and code that theoretically parallelizes very well. Any synchronized mutable state becomes an issue with these kinds of systems. For example, I have had an issue where third-party code would use the C "rand" function for randomness. Even though this was not used in a hot code path, on m5.24xlarge >90% of the execution time would be spent just on the lock guarding the internal random state. On a "normal" system with fewer cores this never showed up while profiling.


> Moved from m5.24xlarge --> m6g.8xlarge with better service performance and improved latency characteristics. Intel is in trouble in my opinion.

I wonder if this is actually an Intel issue or if there are some other optimizations at play, such as in the virtualization layer.

Because at one point I wanted to try JetBrains' new "Gateway" product, which basically runs a remote IDE and only shows the GUI locally. I was curious on one hand, but I also wanted a machine with a bit more oomph for my occasional compilation needs (Rust on Linux, FWIW). I was really unimpressed: the c6i was comparable to my local slim laptop running an 11th-gen i7 U-series part. My similar slim AMD 5650U laptop is actually faster. IIRC, the c6i.metal wasn't particularly faster on this kind of single-threaded work.


The difference is in the pricing and the fact that the cores are "whole".

On Intel-based AWS instances, you pay per hyperthread. On Graviton, you pay per core.

But on this kind of workload and with modern schedulers, the HT bump is rather limited. So in practice you are paying twice the price for the same number of cores.

This is the biggest contributing factor to that difference, and I keep being surprised that no one mentions it.


Not sure what you're talking about - on AWS x86 you pay by core (well, as much as you pay by core with ARM anyway; you can't just buy a 1 GB server with 64 cores).


AWS x64 'cores' are the virtual cores you see on hyperthreaded CPUs and map 2:1 to physical cores on the CPU, but the AWS ARM offering doesn't have hyperthreading, so the virtual cores map 1:1 to cpu cores.

You can disable hyperthreading on the x64 instances at the cost of halving the number of cores you have available in the instance that you paid for.


Yes, but this doesn't really matter - just multiply the cost by two in your accounting. It's not like you get one core with 15 hyperthreads.


"Intel is in trouble" since the calxeda days and ARM is still insignificant to this day.


> ARM is still insignificant to this day.

Is it? The phone you use probably uses ARM. If you buy a mac now, it's probably gonna be ARM. It's very much different from the calxeda days!


OnlineOrNot (my company) saved about 30% moving DBs from Intel to ARM, so it sounds legit


1.7 million may be a lot or may be a tiny drop depending on your overall (like for like) spend. The % saved is the important metric here not the aggregate dollar amount.

Did you replace 5m of EC2 with 3.3m of EC2 and save 1.7m (impressive) or did you replace 50m of EC2 with 48.3m of EC2 (not really so impressive)?


> or did you replace 50m of EC2 with 48.3m of EC2 (not really so impressive)?

I'm failing to comprehend how that's not impressive. Bean counters would still love this type of savings.


The same fixed amount of money is less noticeable in a larger organization. It's more likely that something more effective could have been done with the engineers' time.

On the other hand, if the larger organization hired an AWS specialist, as many do, the optimization might be "free" because the specialist wouldn't have been effective outside of their area.


I'd say exact value versus % can be more meaningful in many non-billion+ companies.

Understanding that you saved $1M/yr means you're closer to profitability (I understand that in the VC world all you really care about is % growth, but whether that's how everything should be measured is up for debate) or able to hire more engineers.


I've been impressed not only with cloud arm performance but also usability. I had a hobby project that was using minecraft, javascript, bash, and such. IIRC absolutely nothing needed to be changed going from x86 to Arm. I got better performance (for my odd application) and better price.


In my experience it often does go smoothly, but otoh sometimes not. The typical reason for that is when a container image dependency is not published for ARM. And then the docker tooling does not make it particularly easy for you to substitute your own ARM binary container image for a public hub-published x64-only one. And then sometimes the image won't build at all on ARM for <reasons> of the kind that waste days of your time and turn out to require building an entire Linux distribution to fix...


Yeah like New Relic not packaging its PHP agent (which they make sound trivial and I'm pretty sure it is)

https://github.com/newrelic/newrelic-php-agent/issues/323

I didn't bother taking the time to compile from scratch and figure out a way to inject it into my container image. It was a smaller client of mine. But it seems real simple to set up a second pipeline in their CI system and move on.


Intel has published results, and "surprisingly" they are the opposite: https://www.intel.com/content/www/us/en/products/performance...


One person is compiling with -march=native (Intel) and using hand-tuned AVX-512 where appropriate. The other is running an application, and wants good performance for the money without too much vendor lock-in.


if(AMD){ slowPath(); }

"Wow Intel's so much faster..!"


In other words, if you care about performance go with Intel? As for vendor lock-in, how is relying on AWS specific hw much better (see other comments below)?


If you care about performance to the extent of inlining hand-tuned assembly, you almost certainly care enough to do your own perf benchmarking on your own workloads.


Presumably, you aren't doing anything special in those ARM builds, so you can use an x86 build just as easily (or a non-Amazon ARM chip) depending on cost.


The Graviton vCPUs are also real cores in the VM, unlike the Intel/AMD instances, where hyperthreading means a vCPU is half a core.


There is not some force of nature that makes 1 thread per core "real". On pointer-chasing workloads a hyperthread is as real as anything. HT is basically another way to exploit instruction-level parallelism that your compiler left on the table. You gotta pick what suits your program.


Or to put it another way:

A hyperthread logical processor is a way of obtaining better efficiency by using surplus resources left on the table by the "real" logical processor that otherwise would remain idle (read: waste).

Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.


> Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.

That’s cool, but when a vCPU is a hyperthread (rather than a physical core with two hyperthreads), that’s the reality of your cloud experience.


If you really need or want an entire physical CPU core, you pay extra for physical hardware instead of a virtualization.

Yes, it's expensive, but the performance of your server is a simple question of how much money you're willing to part with.


The point is that you can get that with graviton too, which is a much smaller transition.


If you've got a workload that isn't characterized by missing cache and waiting around for main memory fairly often, then usually SMT is going to be pretty inefficient. As a crude rule of thumb, the single-thread performance you get is going to scale as something like the square root of the number of transistors you throw at the problem. So if you want two threads of a certain performance, you can either use one big core with SMT-2 or two cores, each a third or so the size. And the two cores will tend to be more power efficient too, with less data moving long distances through the caches.

Now, if you are hitting main memory often, it makes sense to go wild with threads and use SMT-8 like IBM's POWER cores or Sun's SPARC cores did.

And if you mostly just care about maxing out single threaded performance for user responsiveness then you do indeed have lots of unused resources most of the time and you might as well add SMT for when you're more throughput bound.

But while the design and transistor costs of adding SMT to a design are very modest, everything I've heard about the test and verification of it seems pretty hairy.


> Considering hyperthreading as a way of obtaining an extra "real core" is a gross misunderstanding of what hyperthreading is meant to achieve.

It's as real as a Bulldozer CMT thread, and that was widely considered a "real core".

An integer ALU being pinned to a particular thread doesn't make it "real", especially when it comes at the expense of shared frontend resources like a decoder that has to alternate between servicing each thread on alternate cycles (which, as Agner Fog's microarchitecture manual notes, massively bottlenecks both threads). And obviously if one thread has ILP that can be exploited, and the other "core" isn't using its ALU, it would be better to share that! And when you allow that, what you get is SMT.

At the end of the day that's all CMT is - SMT with inefficiently allocated (pinned) resources, and it had even less dedicated resources on the frontend as well. And people will absolutely die on the hill that bulldozer was a "real core". There are probably some scheduling advantages to doing CMT instead of SMT, but also performance costs as well.

So what is a "real core" anyway in this context? Is physical (unsharable, unchangeable) division of resources "inherently better" (or even inherently different) from logical/software-defined division of resources?

Then you've got this whole thing from IBM recently... and leaving aside the fact it's cache, the question IBM is fundamentally asking us is, why not just allocate more resources to "cores" that need them? Why not execution units as well, why wouldn't that be better? Why is hardware defined core better than software defined core? https://www.anandtech.com/show/16924/did-ibm-just-preview-th...

And when you look at how you would implement that for non-cache resources, isn't the simplistic answer something very similar to SMT? Not sure POWER9 is all that far off base with runtime-configured SMT4/SMT8 and a big fat core; maybe that's how Intel can make better use of some of the gigacores they've built. Sure it can run one thread really fast, but why not 8 threads on the same resources? Or you can go the other way and have one thread issue onto multiple cores; as long as there's ILP to cover it and the performance impact of crossing cores is not large, does the difference really matter?

And yeah maybe that's still one core... but then so is a bulldozer CMT module too lol. The whole "what is a core exactly" is kind of trite, it doesn't really matter.


Especially because a hyperthreaded core can usually run 2-4 of any particular instruction in parallel, and 6-10 total instructions in parallel. Even if your workloads never stall, they can still get a reasonable core worth of resources each, just a smaller core.


Terminology nit: Hyper-Threading(tm) is the Intel-ese marketing term for their implementation. The computer architecture term is SMT (simultaneous multithreading).


SMT is synonymous with multi-processing and applies to full cores as well as shared cores. AFAIK, there is no non-marketing term for Hyper-Threading; it's a very specific architecture.


You're probably thinking of SMP or CMP? SMT is specifically about multiple threads on a core.


Ouch, yes, I was thinking of SMP, not SMT.

TIL that there is a general name.


That's purely an AWS choice (you can use them only as real cores), as is the fact that AWS has decided to provide large discounts for their Graviton instances as a way to limit portability to other clouds.


How does that limit portability?

If you have been able to convert your deployment from x64 to aarch64, you can do the same in the other direction, or choose another cloud provider that provides aarch64 stuff. It is really easy to build container images in multiple architectures nowadays.


All major cloud providers have arm64-based instances available now.


Graviton 3 is currently way ahead of the other offerings. The others are closer to Graviton 2.


No, it's a generation behind the Yitian 710 instances at Alibaba, which are now in preview even in the US.

https://www.alibabacloud.com/product/ecs/g8m

"As the first instance family that uses ARM v9 architecture CPUs,"

and "Arm-based Alibaba Cloud T-Head Yitian 710 Crushes SPECrate2017_int_base"

https://www.servethehome.com/arm-based-alibaba-cloud-t-head-...

The Graviton 3s use Neoverse V1 cores (Armv8.4), so yes, AWS is ahead of the companies using the N1 cores from Ampere. But it's a bit of a US-centric view.


It doesn't appear to be in preview in the USA yet. It appears only to be available in the Hangzhou-I zone, and you can only obtain access to a single instance.


> AWS has decided to provide large discounts for their graviton instance as a way to limit portability to other clouds.

Is this a negative aspect?


Absolutely yes.


So they're evil because they're offering things at low prices?


Are you familiar with the concept of predatory pricing?


Yes, and dumping.

However, you're making a bold assumption there. You don't know their cost structure. Their CPU time could literally be just that cheap for Graviton.


We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.

Until AWS breaks out their margins between X86 and ARM (not going to happen) or total AWS margins start to go down, we don’t really publicly know if they are “dumping” or if they are just passing on the savings to the consumer.

I’d bet that it’s a little of both


The cost of the CPU is a small part of a server's TCO. Graviton instances could be cheaper because the platform is cheaper, uses less power, and needs less cooling - I think we know from Apple Silicon that ARM chips can have these advantages.

Disclaimer: I work for AWS but I don't have any internal knowledge about Graviton pricing and non-public performance data.


It could go a third way, where they use them internally, too, and they're cheaper to operate.

So a certain percentage of the costs could be amortized through internal usage.

Edit:

> We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.

Oh, and for this :-)

We do know that Apple's per unit costs must be cheap, since the iPhone SE 2022 retails at $429 for an entire phone, and it comes with A15, which is quite competitive with Android flagship chipsets from 2023. Since they can squeeze in an entire phone + profits in $429, the chipset itself can't be that expensive.

Yes, R&D is probably very expensive but they can spread that around to millions of units, just like Amazon :-)

Of course, Apple could also be dumping chipsets, in which case, I don't know ¯\_(ツ)_/¯


> Apple Silicon → ARM not cheap

Well, maybe not cheap, but certainly cheaper.

"Apple’s move to M1 chips will save $2.5B this year, estimates IBM exec"

https://9to5mac.com/2020/11/18/apples-move-to-m1/

Apple even passed some of those cost savings on to consumers, most of the comparable M1 models were ~ $100 cheaper than the much slower Intel models they replaced. And Apple being Apple, if they passed on that much, their total cost savings almost certainly were significantly higher.


> We know enough about Apple Silicon to know that high-performance ARM chips aren’t cheap to manufacture.

We also know that Intel server CPUs aren't cheap to buy and include a significant profit margin for Intel.

More to the point lots about Graviton seems to be focused on keeping the total cost of ownership down - clock speed for example to limit power consumption.

Perhaps AWS is pricing aggressively to drive adoption. I'd be astonished if they are offering a service like this at scale at below their required return on capital.


My point was more about `limit portability to other clouds` than the pricing issue, even if it is known that Amazon has used its economic power in the past to push competitors out of the market.


By the way for anyone curious how the numbers in the article compare to a modern desktop computer, a 13th-generation Intel Core does 169ns/url, twice as fast as either of these server rigs. If I cap mine to 3GHz to reflect more realistic server scenarios, 310ns/url, a bit quicker than the Graviton 3. Even the "efficiency cores" on a desktop are faster than these server parts, at 286ns/url.


Cloud servers have many, many more cores, cache and memory channels than desktop chips. You can’t clock them that high, nor would hyperscalers want to. They would cost more in power and cooling. Cloud is about multi factor optimization.


Just to note, the Graviton 3 is on 5nm; the 13th Gen is on Intel 7. The Core is also a real core, compared to a vCPU, which is a thread.


I tried this benchmark on both my desktop (Ryzen 5800X3D) and phone (Sony Xperia 1 IV)

    Windows => time/url=364.626ns
    WSL2    => time/url=234.915ns
    Android => time/url=252.276ns
This benchmark feels flawed to me, but I'm not qualified to speculate why.


Another datapoint on a modern desktop CPU, this time ARM (M1 Max laptop, so the same TSMC 5nm process as Graviton 3): 175ns/url


Well the Graviton 3 runs at 2.6 GHz so not sure that really shows 13th-gen Core in a great light.


How about running as many threads as your cpu has SMT threads? Those are what AWS sells as vCPUs.


Our CI jobs run on AWS, building for arm64 and amd64, and arm64 runs slower. This might be because AWS uses Graviton 2 for cloud builds and not Graviton 3.


When I see the word "benchmark" and don't see a methodology I get a little wary. In this case the author ran a custom benchmark from one of their projects. https://github.com/ada-url/ada/blob/main/benchmarks/wpt_benc...

To be clear I'm not questioning the benchmark's accuracy or author's bona fides, but that post was a little short for my taste.


To me it's clear in the post: they benchmarked how fast their piece of code is on c6i and c7g instances. It's not meant to prove Graviton 3 is faster than Intel in general, just in one of many cases.


I remember just adding one letter to the instance type in EMR to switch from Intel to ARM... and saw a 20% speedup as well as additional 20% hourly cost saving, all in all 1/3 cost decrease from a single character.

Some migrations are easier than others.


We saw similar gains going from Intel to Graviton 2 on EMR: 15% to 40% speedups depending on the job. Saw similar gains in EMR serverless switching from x86 to arm as well.

The only issue I've had is actually getting enough on-demand instances in certain regions during peak times.

I can't wait for Graviton 3 to be available on EMR.


Just chiming in: also switched a lot of stuff at work from c5/c6/r5/r6 Intel to c6g/r6g Graviton. We are since seeing notably lower average load - read: more available capacity despite same number of instances - while also having a lower monthly cost.


Is this affected at all by "other people's code" running on the same hardware simultaneously?

I am not an expert on the various instance types and/or bursting, but I assume the ARM hardware is not as oversubscribed, simply from the lack of fellow users.


Isn't the claim that ARM has nothing comparable to AVX-512 technically incorrect? AFAIK SVE and SVE2 are definitely "comparable", even though the actual vector size depends on the actual hardware of the processor.


In current commercially available cores, SVE and SVE2 are comparable to AVX but not AVX-512 specifically, because every SVE design out there that isn't in a Fujitsu supercomputer is limited to 256 bits. There might also be differences in scatter/gather that are important in some places.


So around 15% cheaper and 15% faster, good stuff


Yeah I can't remember what AWS's marketing says exactly but they definitely say cheaper and faster (for Lambda as least).


When you control the pricing and want to drive people to your own hardware, of course it's gonna be cheaper. Gotta exercise that market power over your suppliers


As someone who knows very little about CPUs but is aware of the energy savings that can be had on ARM, along with reports like this about speed and my personal experience of my M1 Pro MBP which feels like I won't need a new computer for a decade, can anyone explain to me why we're not seeing much wider adoption of ARM on servers? I'm primarily thinking of the lack of ARM runners on GitHub and the almost unicorn-level rarity of ARM options on VPS hosts.


Most VPS hosting services don't build their own hardware the way some of the giants can, and rely on the available offerings from server manufacturers. Some even use regular consumer-grade tower PCs. The ARM offering is growing but is still not as large as the Intel/AMD one.


I think this tells us that price/performance is not a critical factor to these vendors, most of the time. Perhaps because they have high margins, or because the CPU is not a significant part of their input costs?


Also lack of ARM runners on GitLab CI


Good luck actually getting a c7g.


When I was trying to launch c7gs I was having trouble but then I just added a capacity reservation and was able to launch fine. This was in us-west-2 about 6 months ago. Not sure if there are still problems in that region or if this hack still works.


I would like to see more extended benchmarks.


Running on your own computer is much faster than cloud.


So how do you plan to get your own Graviton 3 computer?


Mac mini M2!


I guess Ampere One should be similar in 2-3 years.



