This points to an interesting future for foundation models. This is an 18x cost reduction in only 2 years. Either foundation models are going to get much bigger, or variations will become common.
V100 GPUs are from 2017, so it's more than two years. The A100 already appeared three years ago, btw.
An eight-GPU DGX-1 server cost ~$149k back then (from Googled news postings). A current-gen DGX H100 is $520k with 5 years of support. Of course it holds 5x the memory, plus the GPUs and interconnect are much faster. But when comparing costs, take price hikes into account.
An important thing to also keep in mind is how much inflation changed prices over that period. $520k in 2023 dollars is around $420k in 2017 dollars. Sure, still almost 3x more expensive in real terms, but that's better than the ~3.5x the nominal prices suggest.
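A quick back-of-the-envelope for that adjustment, assuming ~24% cumulative US CPI inflation from 2017 to 2023 (my assumption; the comments above only give the end prices):

    # Rough inflation adjustment for the DGX-1 vs DGX H100 comparison above
    dgx1_2017 = 149_000           # ~$149k DGX-1 (V100), in 2017 dollars
    dgx_h100_2023 = 520_000       # ~$520k DGX H100, in 2023 dollars
    cpi_2017_to_2023 = 1.24       # assumed ~24% cumulative inflation

    real_price = dgx_h100_2023 / cpi_2017_to_2023
    print(real_price)                 # ~$419k in 2017 dollars
    print(real_price / dgx1_2017)     # ~2.8x real increase, vs ~3.5x nominal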
Your citation is for 1k A100s, not 3.5k V100s. I think it's actually ~51 days on 3.5k V100s.
Just to compare the GPUs, peak Tensor Core throughput went from ~125 TFLOPS (FP16 on the V100) to ~990 TFLOPS (FP16 on the H100). They also added FP8, which roughly doubles that again.
What's interesting is to look at how we're progressing in performance over time. In some sense, a bit slow?
A V100 cost $10k at release; an H100 seems to be $40k.
So we've only managed to halve the cost of a flop in 5 years. That seems... much slower than what Moore's Law would have suggested.
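The arithmetic behind that, using the prices and FP16 Tensor Core peaks quoted above (street prices vary a lot, so treat it as a rough sketch):

    # Flops per dollar, V100 vs H100, from the figures in this thread
    v100_tflops, v100_price = 125, 10_000    # FP16 Tensor Core peak, ~launch price
    h100_tflops, h100_price = 990, 40_000    # FP16 Tensor Core peak (dense), ~street price

    improvement = (h100_tflops / h100_price) / (v100_tflops / v100_price)
    print(improvement)    # ~2x, i.e. cost per flop roughly halved in ~5 years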
> Moore's law says nothing about price or performance.
Moore's law is about doubling of transistor count for the same price. At least that's always been my understanding.
EDIT: I decided to look it up. Here's the original 1965 statement from him that led to the law:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years."
No, it’s not about the price of a completely finished product, in my humble opinion. It’s about the cost to produce the silicon component itself. Sure, it’s about money both ways, but the distinction is important.
The V100 had up to 32 GB of RAM; the H100 has 80 GB (up to 188 GB for the dual-GPU NVL variant). While there are obviously transistors in RAM, the counts being compared are for the GPU itself. I’d argue that a big chunk of the price difference between the V100 and H100 is RAM.
Moore’s law never applied to RAM, and again, you aren’t comparing apples to apples either. There are a lot of factors that go into building a complete system, so stop trying to apply an old CPU transistor adage to something it was never intended for and maybe you’ll have better success?
Moore's law never applied to GPUs either. I don't know if it even said anything about CPUs, specifically. It's about transistor density vs cost.
Roughly the same tech advances apply to making all of these kinds of chips, so I think it applies (somewhat) equally to CPUs, GPUs, RAM, and even SSDs etc.
Bearing in mind that RAM lags a few generations behind CPUs and GPUs (I think DDR5 is on 12 nm vs the latest GPUs around 4 nm). And also that Moore's law is not a physical law; even when it was in full swing it was only ever meant as a rough guideline for what to expect over a couple of years.
What about power draw? A quick Google says the V100 is max 300W while the H100 is max 700W TDP, which makes the cost comparison more favorable than 4x, so more like 7x less per flop. *Assuming* the same electricity cost, which actually seems to have increased significantly.
On a (minor) side note, it seems that $1.00 from 2018 is worth ~$1.20 in 2023. I wish more cost comparisons included inflation, because the past few years have had a lot of it.
Good call on power draw, but it seems dwarfed by capital cost. Assuming a 3-year life (this tech gets outdated fast), 300W of constant consumption costs roughly $800 a year, or about $2.4k over the 3 years, at $0.3/kWh.
You need to be an order of magnitude higher in power usage before it really starts mattering, e.g. a Tesla Model 3 consumes around 15 kW (20x the H100) while driving.
That said, it looks like flops/watt improved by only 3.4x, which is also sub-Moore's-Law (roughly 3 years to halve the energy per flop).
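Roughly, from the TDP and throughput numbers above (FP16 Tensor peaks; real-world utilization will be lower on both):

    # Perf per watt and lifetime electricity cost, V100 vs H100
    v100_tflops, v100_watts = 125, 300
    h100_tflops, h100_watts = 990, 700

    print((h100_tflops / h100_watts) / (v100_tflops / v100_watts))  # ~3.4x better flops/W

    # Electricity over a 3-year life at $0.30/kWh, running flat out 24/7
    hours = 3 * 365 * 24
    print(0.700 * hours * 0.30)   # ~$5.5k for an H100, vs a ~$40k card
    print(0.300 * hours * 0.30)   # ~$2.4k for a V100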
I don't think flops/$ is enough to really capture the difference here.
You couldn't replicate the scale of compute this allows no matter how many V100s you had.
A huge amount of cost here is embodied in networking and memory.
If one were to design a chip that cared only for flop/$ without caring for all of the interconnect and memory, then the 4090 is a much fairer comparison, and even then that card isn't designed for a flop/$ optimisation.
Yeah, it's wild - feels like we see "software is so slow now! nobody optimizes software anymore, they just run Python and burn cycles!" posts all the time, but man, when a company REALLY wants to optimize something -- it's a thing of beauty.
With 3,584 GPUs, that's 448 nodes w/ 8 GPUs/node. At 5 nodes/rack, that's 90 racks. I looked and found an old listing for Lambda's Echelon racks at about $650k each, so the infra cost is a little under $60 million.
If you bought the $2/hr H100 instance they offer, that would cost ~$330k. Pricey, but not too bad.
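Sketch of both numbers, with the ~46-hour full-training estimate and the $2/GPU-hour rate taken from comments in this thread rather than anything official:

    import math

    gpus = 3584
    nodes = gpus // 8                  # 448 eight-GPU nodes
    racks = math.ceil(nodes / 5)       # 90 racks at 5 nodes/rack
    print(racks * 650_000)             # ~$58.5M of racks at ~$650k each

    # Renting instead of buying
    print(gpus * 46 * 2)               # ~$330k at $2/GPU-hour for ~46 hours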
My bigger hope is that with cheaper compute we can see more architecture search and new designs. A lot of different architectures are relatively unexplored due to computational constraints; that exploration is typically done by smaller labs, so it doesn't get scaled up, and it is kinda hard to compare models when we're just looking at performance benchmarks and not considering other factors. We definitely don't want big labs to railroad our research directions.

Feels weird that a huge amount of NLP is based on taking pretrained models and tuning them. Vision is going this way too. You basically can't get published without being SOTA, so you either have to modify an existing model or have a multi-million-dollar lab and train from scratch. Really weird to expect academia to compete with big labs, and really weird not to let academia take "bigger risks" and explore less popular areas. It is vital to our research path that we don't force everything onto a single track.
We're talking about building a server, so availability is usually different than for a home consumer. But fwiw Lambda has their H100 node for $330k (which gets pretty close to that price), whereas the A100 nodes are $170k, so double my figure. Especially considering the H100 nodes are 8U instead of 4U, so you need twice as many racks.
This is the GPT-3 benchmark; as others have said, this means it would be ~46 hours to train GPT-3 in full. Still impressive! Also, going by other comments here, ~3,500 of these GPUs, or 13-14 DGX GH200s at ~$10m each, means we are talking about ~$135m worth of compute here. Still very impressive, but holy hell that is a lot of money worth of compute hardware.
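Quick sanity check on that figure, assuming a DGX GH200 is a 256-GPU system and using the ~$10m-per-system estimate above:

    gpus = 3500
    systems = gpus / 256             # ~13.7 DGX GH200s
    print(systems, systems * 10e6)   # ~$137M of compute at ~$10M per system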
One interesting thing is that if the model can't fit into a single GPU's memory, it has to be sharded across multiple chips, and anything smaller would be much slower. So one chip would take more than a month.
That’s why these clusters are so valuable: even with Nvidia’s margins they are still cheaper than using less compute for longer.
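Rough numbers on why it has to be sharded, using the well-known 175B parameter count for GPT-3 and the 80 GB of HBM on a single H100:

    # FP16 weights alone, before optimizer state and activations needed for training
    params = 175e9
    weights_gb = params * 2 / 1e9    # 2 bytes per FP16 parameter
    print(weights_gb)                # ~350 GB of weights, vs 80 GB on one H100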
I am keeping my fingers crossed that there will be a 'Bloom Filter' moment for AI where we create an algorithm that can eliminate an expensive calculation in 99% of interactions and give us a 10x speedup in some larger problem domain.
Or some other 'trust but verify' situation where the suggested action is still validated against business rules that have some notion of consumer protection.
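For anyone who hasn't met the analogy: a toy sketch of the pattern, a cheap probabilistic pre-check whose "no" is definite so the expensive path only runs on the rare "maybe" (expensive_lookup is just a made-up stand-in):

    import hashlib

    class BloomFilter:
        # "no" answers are definite; "maybe" answers are occasional false positives
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def expensive_lookup(key):
        return f"result for {key}"    # stand-in for the costly computation

    seen = BloomFilter()
    seen.add("user:42")

    def handle(key):
        if not seen.might_contain(key):
            return None               # the vast majority of misses skip the expensive path
        return expensive_lookup(key)  # only the rare "maybe" cases pay full price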
I think the parent is referring to the fact that LLMs and generative models are, for the time being, “creating new markets” and “creating new problems”, but not really solving any of the hard problems we already have. I see it this way: you have this incredible piece of technology. You can use it to take the scraps away from a bunch of starving artists, and convert that survival income into wealth for some entrepreneur. Or you can use it to cure cancer or aging or make fossil-free airplanes. At the moment, the first application seems to be the one moving the market, and that’s really sad.
On the side of hope, if somebody were to demonstrate how to use a bunch of GPUs to make an E. coli bacterium live a little longer using an approach that has a semblance of generality, I guarantee that a lot of old people would be willing to part with their fortunes in exchange for a sliver of hope for themselves or future generations.
Not sure exactly what you mean here, but most quantum mechanics calculations in materials science and chemistry applications can now be done at dramatic speedups using such approximate models. Those fields had exactly this Bloom Filter moment a few years back, and the complete transformation of how people in these domains work is well underway.
I’ve used both and found both to work well for me. There are a variety of other gpu cloud options that aren’t good, though. I listed a bunch here, some are good, some seem to be worse on all dimensions (price, capacity, and UX) - https://gpus.llm-utils.org/alternative-gpu-clouds/
Tl;dr of the good ones: FluidStack and Lambda for H100s (1x instances), Runpod for A100s.
Has AMD completely missed the ML/AI train? I’m quite surprised that even for inference, there doesn’t seem to be a viable competitor to nvidia. Is there anything in AMD’s roadmap to suggest they are planning to even compete?
With 384 Gaudi2s they did the LLM task in 312 minutes, compared to 46 minutes for 768 H100s. It’ll come down to cost, but given the H100 is a process node or two ahead (and much more expensive, I imagine?), Intel is actually much closer than I had realized. They’re the only other vendor to submit an MLPerf result for the LLM task, I think. All credit of course to Habana Labs, which was acquired by Intel.
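A crude per-chip comparison from those two results (it ignores scaling efficiency, which won't be perfect at either cluster size):

    # Chip-minutes to finish the MLPerf LLM task
    gaudi2 = 384 * 312
    h100 = 768 * 46
    print(gaudi2 / h100)    # ~3.4x more chip-minutes on Gaudi2, i.e. H100 ~3.4x faster per chip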
As a ML outsider, this was the first I’ve ever heard of it. Obviously I’m not the target audience, but I’m just shocked that I didn’t even know Intel was in the game.
Intel's going big in their own way by putting "AI" accelerators and the like in their latest and future processors; you kind of need to be living under a rock to have missed it in these kinds of tech circles.
But that said, Intel suffers from the same fundamental problem as AMD: They aren't Nvidia and they can't CUDA.
I’d argue you don’t need CUDA if PyTorch runs well on your platform, which is what everyone is trying to show with these MLPerf results. I think Intel’s strategy with oneAPI is not bad, it’s just late.
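A minimal sketch of what that means in practice: the model code stays the same and only the device selection changes. (The non-CUDA device names, like Habana's "hpu" or Intel's "xpu", come from vendor plugins and are only mentioned for illustration.)

    import torch
    import torch.nn as nn

    # Use whatever accelerator the installed PyTorch build exposes;
    # vendor backends plug in their own device types via extensions.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)   # identical model code regardless of whose silicon is underneath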
More like Intel stopped being interesting years ago. For me, they just keep reheating old designs and have stopped innovating. To get my attention back, they would have to create something that reaches me through mainstream articles. Otherwise, I think following Intel is a waste of time. They need to prove themselves.
It might be technically possible to port CUDA to Intel or AMD, but there might be patent or copyright rules that prevent legal redistribution. This wouldn't be the first time that IP rules stifle free-market competition.
Probably better to aim for PyTorch compatibility. In practice, that's how most AI programmers interact with their GPUs.
I agree with this. Even their GPU drivers for gaming are bad, they get stuck at “basic” things like MPEG encoding.
But I don’t understand why software is a problem for them with their deep pockets. It can’t possibly be a dearth of talent, or that it is expensive. Here in Europe a good software engineer earns half of what a mediocre software engineer earns in the USA, to say nothing of India or China. They could just hire a bunch of teams and up their software game.
The "even for inference" thing has turned into a bit of a trap imo.
Data-parallel models scaled up for training and then could run on individual chips, but these massive model-parallel models require a couple of chips directly linked together even to do inference.
So the idea that a competitor could come in with a simple, cheap inference chip doesn't really work.
I thought the H100s were ~$30-40k each. But they're not widely available, and you usually buy multiple boxes from vendors that also come with expensive CPUs/RAM etc.
H100s are intended to be retailed piecemeal; the big enterprise model is the DGX GH200, at ~$10 million each.
It's just that current supply is extremely short, so the H100s end up only available to big buyers. But that will be resolved in time. Nvidia wants every university lab to have an H100 so no competitor sneaks in there.
Which, per the EPA [0], is roughly the tailpipe CO2 you’d emit by burning a little under 3 gallons of gas in your car, right? Which is to say, driving 65ish miles in the US [1]?
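The conversion, assuming the standard EPA factors (~8.9 kg of CO2 per gallon of gasoline, ~22 mpg for the average US passenger car):

    gallons = 2.9             # "a little under 3 gallons"
    print(gallons * 8.887)    # ~26 kg of tailpipe CO2
    print(gallons * 22.2)     # ~64 miles of driving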
It amazes me that computational feats of this magnitude can be so energy efficient in the scheme of things.
It’s not that surprising actually: computation moves nothing, and all the energy has to turn into heat. If computation used as much energy as moving physical objects, the heat would burn everything down.
SkyNet awakens... then says, "I'll be back." Goes to sleep to be trained again. (ominous music follows)... Across the screen it says, "Sarah Connor is now in grade school." Then blinks away. A moment later, "John Connor has yet to be conceived." (More ominous music) An image of Jensen Huang in a Terminator leather jacket, standing holding the next-generation GPU, is shown. (More ominous music that peaks to a finale)
An equivalent number of V100s (GPT-3's original GPU) would've taken about 36 days [0].
0 – https://www.reddit.com/r/GPT3/comments/p1xf10/comment/h8h3sl...