H100 GPUs Set Standard for Gen AI in Debut MLPerf Benchmark (nvidia.com)
157 points by Anon84 on June 27, 2023 | 75 comments


These could train GPT-3 in 46 hours. (Edited from "11 mins" per https://news.ycombinator.com/item?id=36500154 )

An equivalent number of V100s (GPT-3's original GPU) would've taken about 36 days [0].

0 – https://www.reddit.com/r/GPT3/comments/p1xf10/comment/h8h3sl...


Kind of. This tweet better explains what they did:

https://twitter.com/abhi_venigalla/status/167381386318645248...


So it would be more like 46 hours to train GPT-3 from scratch.

    (300BT / 1.2BT) * 11 min * (1 hr / 60 min) = 45.8 hr
Still pretty incredible. That's an 18.8x speedup over 36 days.
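
A quick Python version of that napkin math, using the figures from the comments above:

    # Scale the 11-minute benchmark run (1.2B tokens) up to GPT-3's
    # full 300B-token training set, then compare with 36 days on V100s.
    full_hours = (300 / 1.2) * 11 / 60      # ~45.8 hours
    v100_hours = 36 * 24                    # 864 hours
    print(full_hours, v100_hours / full_hours)   # ~45.8, ~18.9x speedup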


This points to an interesting future for foundation models. This is an 18x cost reduction in only 2 years. Either foundation models are going to get much bigger, or variations will become common.


V100 GPUs are from 2017, so it's more than two years. The A100 already appeared three years ago, btw.

An eight-GPU DGX-1 server cost ~$149k back then (googled news postings). A current-gen DGX H100 is $520k with 5 years of support. Of course it holds 5x the memory, plus the GPUs and interconnect are much faster. But when comparing costs, take price hikes into account.


An important thing to also keep in mind is how much inflation changed prices over that period. $520k in 2023 dollars is around $420k in 2017 dollars. Sure, still almost 3x more expensive in real terms, but that's better than the ~3.5x it looks like in nominal dollars.
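
A rough sketch of that adjustment (the ~1.24 cumulative 2017-to-2023 CPI factor is my assumption, not an official figure):

    # Deflate the 2023 DGX H100 price back into 2017 dollars.
    dgx_h100_2023 = 520_000
    dgx1_2017 = 149_000
    cpi_factor = 1.24                        # assumed 2017->2023 inflation
    real_2017 = dgx_h100_2023 / cpi_factor   # ~$419k in 2017 dollars
    print(real_2017 / dgx1_2017)             # ~2.8x real vs ~3.5x nominal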


Variations, as in specializations, I guess.

For writing code you don't care about feeding world history to your model, so a smaller model might be better at a specialized task.

Sure, having a big multi-modal model is great, but with specialized models you can spread tasks better.


But I am sure prompt understanding improves with more text data. Same with reasoning ability.


Your citation is for 1k A100s, not 3.5k V100s. I think it's actually ~51 days on 3.5k V100s.

Just to compare the GPUs, tensor throughput went from ~125 TFLOPS (FP16 on the V100) to ~990 (TF32 with sparsity on the H100). It then looks like they also dropped the precision to FP8, which gives you another 4x win.

What's interesting is to look at how we're progressing in performance over time. In some sense, a bit slow?

A V100 cost $10k at release; an H100 seems to be $40k.

So we've only managed to halve the cost of a flop in 5 years. That seems.. much slower than what Moore's Law would have suggested.
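
A sketch of that comparison, using the list prices and TFLOPS figures above:

    # Dollars per tensor TFLOP at release.
    v100_price, v100_tflops = 10_000, 125    # FP16 tensor
    h100_price, h100_tflops = 40_000, 990    # TF32, with sparsity
    print(v100_price / v100_tflops)          # ~$80 per TFLOP
    print(h100_price / h100_tflops)          # ~$40 per TFLOP
    # Doubling every ~2 years over 5 years would have predicted
    # roughly 2**(5/2) ~= 5.7x cheaper, not ~2x.
    print(2 ** (5 / 2))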


> So we've only managed to halve the cost of a flop in 5 years. That seems.. much slower than what Moore's Law would have suggested.

Moore's law says nothing about price or performance.

H100 has about 4x as many transistors as V100, which is pretty close to what Moore's law would predict.


> Moore's law says nothing about price or performance.

Moore's law is about doubling of transistor count for the same price. At least that's always been my understanding.

EDIT: I decided to look it up. Here's the original 1965 statement from him that led to the law:

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years."

So yup, it's about transistor density vs price.


No, it’s not about price of a completely finished product in my humble opinion. It’s about the cost to produce the silicon component itself. Sure, it’s about money both ways, but the distinction is important.

The V100 had up to 32 GB of RAM; the H100 goes up to 188 GB (in the dual-GPU NVL variant). While there are obviously transistors in RAM, the counts being compared are for the GPU itself. I'd argue that a big chunk of the price difference in the V100 vs H100 is RAM.


> I’d argue that a big chunk of the price difference in the V100 vs H100 is RAM.

But that's the same Moore's Law problem when using the 'price' definition.

6x RAM 5 years later = 4x the cost? That's even slower than the FLOPS gains.


Moore's law never applied to RAM, and again, you aren't comparing apples to apples either. There are a lot of factors that go into building a complete system, so stop trying to apply an old CPU transistor adage to something it was never intended for and maybe you'll have better success?


Moore's law never applied to GPUs either. I don't know if it even said anything about CPUs, specifically. It's about transistor density vs cost.

Roughly the same tech advances apply to making all three kinds of chips so I think it applies (somewhat) equally to CPU, GPUs, RAM and even SSDs etc.

Bearing in mind that ram lags a few generations behind CPU and GPU (I think DDR5 is 12nm vs latest GPUs around 4nm). And also that Moore's law is not a physical law, even when it was in full swing it was only ever meant as a rough guideline for what to expect over a couple of years period.


What about power draw? A quick google says the V100 is max 300W while the H100 is max 700W TDP. Per flop that's 300W/125TF ≈ 2.4 W/TFLOP for the V100 vs 700W/990TF ≈ 0.7 W/TFLOP for the H100, so roughly 3.4x less energy per flop. *Assuming* the same electricity cost, which actually seems to have increased significantly.

On a (minor) side note, it seems that $1.00 from 2018 is worth ~$1.20 in 2023. I wish more cost comparisons included inflation, because the past few years have had a lot of it.


Good call on power draw, but it seems dwarfed by capital cost. Assuming a 3-year life (this tech gets outdated..), 300W of constant consumption costs about $800 a year, roughly $2.4k total, at $0.3/kWh.

You need to be an order of magnitude higher in power usage before it really starts mattering. E.g., a Tesla Model 3 consumes around 15 kW (20x the H100) while driving.

That said, it looks like flops/watt improved by only 3.4x, which is also sub Moore's Law (~3 years to halve the energy per flop).
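
Same numbers in Python, for anyone who wants to tweak the assumptions:

    # Flops per watt, plus electricity cost over a 3-year life.
    v100_fpw = 125 / 300            # ~0.42 TFLOPS/W
    h100_fpw = 990 / 700            # ~1.41 TFLOPS/W
    print(h100_fpw / v100_fpw)      # ~3.4x improvement
    kwh = 0.3 * 24 * 365 * 3        # 300W constant for 3 years ~= 7,884 kWh
    print(kwh * 0.3)                # ~$2,365 of electricity at $0.3/kWh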


I don't think flops/$ is enough to really capture the difference here.

You couldn't replicate the scale of compute this allows no matter how many V100s you had.

A huge amount of cost here is embodied in networking and memory.

If one were to design a chip that cared only for flop/$ without caring for all of the interconnect and memory, then the 4090 is a much fairer comparison, and even then that card isn't designed for a flop/$ optimisation.


Yeah, it's wild - feels like we see "software is so slow now! nobody optimizes software anymore, they just run Python and burn cycles!" posts all the time, but man, when a company REALLY wants to optimize something -- it's a thing of beauty.


With 3,584 GPUs, that's 448 nodes w/ 8 GPUs/node. At 5 nodes/rack, that's 90 racks. I looked and found an old listing for Lambda's Echelon racks at about $650k, so the infra cost is a little under $60 million.

If you rented the $2/hr H100 instances they offer instead, one training run would cost ~$330k. Pricey, but not too bad.
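
The arithmetic behind both figures (the 46-hour run length comes from the comments above):

    import math

    gpus = 3584
    nodes = gpus // 8              # 448 nodes at 8 GPUs each
    racks = math.ceil(nodes / 5)   # 90 racks at 5 nodes/rack
    print(racks * 650_000)         # ~$58.5M of infra at $650k/rack
    print(gpus * 2 * 46)           # ~$330k for one 46-hour run at $2/hr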

My bigger hope is that with cheaper compute we'll see more architecture search and new designs. A lot of different architectures are relatively unexplored due to computational constraints; that exploration is typically done by smaller labs, so it doesn't scale, and it's kinda hard to compare models when we're just looking at performance benchmarks and not considering other factors.

We definitely don't want big labs to railroad our research directions. It feels weird that a huge amount of NLP is based on taking pretrained models and tuning them, and vision is going this way too. You basically can't get published without being SOTA, so you either have to modify an existing model or have a multi-million dollar lab and train from scratch.

It's really weird to expect academia to compete with big labs, and really weird not to let academia take "bigger risks" and explore less popular areas. It is vital to our research path that we don't force everything onto a single track.


H100s are now around $30k USD if you're lucky, going up to $40k on eBay/Amazon.

3,584 GPUs at $30,000 is $107,520,000 USD in GPUs alone.


We're talking about building out servers, so availability is usually different than for a home consumer. But fwiw Lambda has their H100 node for $330k (which gets pretty close to that price), whereas the A100 nodes are $170k, so double my figure. Especially considering the H100 nodes are 8U instead of 4U, so you need twice as many racks.


30k is to businesses, not consumers. Consumers don't buy H100s.


3500 of them trained GPT-3 in 11 minutes. That number is still worth remarking on.


It's a GPT-3 benchmark; as others have said, this means it would be ~46 hours to train GPT-3, still impressive! Also, going by other comments here, ~3,500 of these, or 13-14 DGX GH200s at ~$10m each, means we are talking about ~$135m worth of compute. Still very impressive, but holy hell that is a lot of money worth of compute hardware.
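
Sanity check on the system count (a DGX GH200 links 256 GH200 superchips):

    # ~3,500 GPUs spread across 256-chip DGX GH200 systems.
    systems = 3584 / 256           # = 14 systems
    print(systems * 10_000_000)    # ~$140M at ~$10M each, same ballpark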


It's remarkable that one of these could train GPT-3 in under a month. However, it's good to note that its price is that of twenty 4090s.


One interesting thing is that if the model can't fit into GPU memory (sharded across multiple chips), it would be much slower. So one chip would take more than a month.

That's why these clusters are so valuable: even with Nvidia's margins they are still cheaper than using less compute for longer.


Could it actually? Or do you need the memory size of thousands of them running in parallel?


Yes I worded that extremely poorly.


title says: "a massive GPT-3-based benchmark"

that means they might pass a few batches through a GPT-3 sized random init network and time it


I am keeping my fingers crossed that there will be a 'Bloom Filter' moment for AI where we create an algorithm that can eliminate an expensive calculation in 99% of interactions and give us a 10x speedup in some larger problem domain.

Or some other 'trust but verify' situation where the suggested action is still validated against business rules that have some notion of consumer protection.


Dropping numerical precision seems to have done that to some degree?


I think the parent is referring to the fact that LLMs and generative models are, for the time being, "creating new markets" and "creating new problems", but not really solving any of the hard problems we already have. I see it this way: you have this incredible piece of technology. You can use it to take the scraps away from a bunch of starving artists, and convert that survival income into wealth for some entrepreneur. Or you can use it to cure cancer or aging, or make fossil-free airplanes. At the moment, the first application seems to be the one moving the market, and that's really sad.

On the side of hope, if somebody were to demonstrate how to use a bunch of GPUs to make E. coli bacteria live a little longer, using an approach that has a semblance of generality, I guarantee that a lot of old people would be willing to part with their fortunes in exchange for a sliver of hope for themselves or future generations.


That's only because the companies with marketing budgets and marketing teams are the ones you're seeing.

There is serious research going on right now, but you don't see your typical medical research department at your university buying ads on Instagram.


You have a deterministic AI program to show me?


Not sure exactly what you mean here, but most of the quantum mechanics calculations in materials science and chemistry applications can now be done at dramatic speedups using such approximate models. There was exactly this Bloom Filter moment a few years back, and the complete transformation of how people in these domains work is well underway.


If you want to try one, some good options are:

FluidStack or Lambda Labs

If you need to rent a supercluster and you’re not tied to one of the big 3 clouds, then talk with FluidStack, Lambda, Oracle, maybe CoreWeave.


Lambda seems like the easiest to work with at present.


I’ve used both and found both to work well for me. There are a variety of other gpu cloud options that aren’t good, though. I listed a bunch here, some are good, some seem to be worse on all dimensions (price, capacity, and UX) - https://gpus.llm-utils.org/alternative-gpu-clouds/

Tl;dr of the good ones: FluidStack and Lambda for H100s (1x instances), Runpod for A100s.


Has AMD completely missed the ML/AI train? I’m quite surprised that even for inference, there doesn’t seem to be a viable competitor to nvidia. Is there anything in AMD’s roadmap to suggest they are planning to even compete?


Intel is much closer with Habana Gaudi2. Sometimes I’m unsure whether they even recognize it.

They also published MLPerf results today: https://habana.ai/blog/gaudi2-demonstrates-competitive-llm-p...

With 384 Gaudi2s they did the LLM task in 312 minutes, compared to 46 minutes for 768 H100s. It'll come down to cost, but given the H100 is a process node or two ahead (and much more expensive, I imagine?), Intel is actually much closer than I had realized. They're the only other vendor to submit an MLPerf result for the LLM task, I think. All credit of course to Habana Labs, which was acquired by Intel.
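
One way to normalize the two submissions, ignoring cost and power (a rough sketch, not a proper comparison):

    # Total chip-minutes on the same MLPerf LLM task.
    gaudi2_chip_min = 384 * 312    # 119,808
    h100_chip_min = 768 * 46       # 35,328
    print(gaudi2_chip_min / h100_chip_min)   # H100 ~3.4x faster per chip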


As a ML outsider, this was the first I’ve ever heard of it. Obviously I’m not the target audience, but I’m just shocked that I didn’t even know Intel was in the game.


Intel's going big in their own way by putting "AI" accelerators and the like in their latest and future processors; you kind of need to be living under a rock to miss it in these kinds of tech circles.

But that said, Intel suffers from the same fundamental problem as AMD: They aren't Nvidia and they can't CUDA.


I’d argue you don’t need CUDA if PyTorch runs well on your platform, which is what everyone is trying to show with these MLPerf results. I think Intel’s strategy with oneAPI is not bad, it’s just late.


More like Intel stopped being interesting years ago. To me, they just keep reheating old designs and have stopped innovating. To get my attention back, they would have to create something that reaches me through mainstream articles. Otherwise, I think following Intel is a waste of time. They need to prove themselves.


OneAPI is quite pleasant to work with, far more so than ROCm.


I do live under a rock when it comes to Intel I guess, since I haven't really looked into buying anything with their processor in 7 years :)


I saw a project on GitHub a few weeks ago that claimed to run CUDA on Intel GPUs. It's named ZLUDA. Any thoughts on that? I have not tried it.


It might be technically possible to port CUDA to Intel or AMD, but there might be patent or copyright rules that prevent legal redistribution. This wouldn't be the first time that IP rules stifle free-market competition.

Probably better to aim for PyTorch compatibility. In practice, that's how most AI programmers interact with their GPUs.


AMD's upcoming MI300 APU looks very competitive with NVIDIA's GH200.

The problem remains the lack of software support.


AMD is bad at software and that is not changing so they will always be behind.


I agree with this. Even their GPU drivers for gaming are bad, they get stuck at “basic” things like MPEG encoding.

But I don’t understand why software is a problem for them with their deep pockets. It can’t possibly be dearth of talent, or that it is expensive. Here in Europe a good software engineer earns half of what a mediocre software engineer earns in USA, to say nothing of India or China. They could just hire a bunch of teams and up their software game.


Hardware development is very different to software development. AMD is a hardware company and they probably run their software development badly.


The "even for inference" thing has turned into a bit of a trap imo.

Data parallel models scaled up for training and then could run on individual chips, but these massive model parallel models require a couple of chips directly linked together even to do inference.

So the idea that a competitor could come in with a simple, cheap inference chip doesn't really work.


I have heard these are 100k each... anyone know if that's correct? Guessing that's list, and no one pays list, but still...


I'm writing a cool mega post on this at the moment – not published yet but here's the excerpt on pricing:

How much do these GPUs cost?

H100s are around $30-33k at the IT hardware resellers CDW and SHI (https://www.cdw.com/product/nvidia-h100-gpu-computing-proces..., https://www.shi.com/product/45671009/NVIDIA-H100-GPU-computi...)

Supermicro’s HGX H100 8x GPU server is $297k at the reseller Dihuni (https://www.dihuni.com/product/supermicro-8125gs-tnhr-server...)

DGX H100 is $521k at the reseller Insight (https://www.insight.com/en_US/shop/product/DGXH-G640F+P2CMI6...)

The DGX GH200 might cost in the range of $10mm-20mm or more (A guesstimate ballpark from an exec at a cloud company I talked with)

If anyone wants to pre-review the post and can offer thoughtful comments, my email's in my profile.

And to clarify the difference between all of these product names, I put together this diagram - https://gpus.llm-utils.org/dgx-gh200-vs-gh200-vs-h100/. I don't have the HGX H100 or the DGX H100 on there though - but the HGX H100 is a reference platform for OEMs to design and make H100 based servers with either 4x H100s or 8x H100s (https://nvdam.widen.net/s/5kgbjq2v2t/hpc-hgx-h100-datasheet-...) and the DGX H100 is the official Nvidia server with 8x H100s (https://resources.nvidia.com/en-us-dgx-systems/ai-enterprise...).


I don't get how Lambda makes money at those prices. $2 per hour seems too cheap for a >$30k GPU.
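
Napkin payback math at that price (assuming near-full utilization and the ~$33k reseller price mentioned elsewhere in the thread):

    # Gross revenue from renting one H100 at $2/hr, 24/7.
    yearly = 2 * 24 * 365          # ~$17.5k/year
    print(33_000 / yearly)         # ~1.9 years to recoup the card,
                                   # before power, DC, and networking costs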


it's marketing :)


from somebody who bought 2500 of them: https://news.ycombinator.com/item?id=36313960


I thought the H100s were ~30-40k each. But they're not widely available and you usually buy multiple boxes from vendors that also come with expensive CPUs/RAM etc.


H100s are intended to be retailed piecemeal. The big enterprise model is DGX GH200, at $10 mil each.

It's just that current supply is extremely short, so the H100s end up only available to big buyers. But that will be resolved in time. Nvidia wants every university lab to have an H100 so no competitor sneaks in there.


There's a second blog post here with substantially more technical detail: https://developer.nvidia.com/blog/breaking-mlperf-training-r...

Additionally, code for the actual submission is available here https://github.com/mlcommons/training_results_v3.0/tree/main...


Does anyone know how many kWh were used (or could plausibly have been used) in these 11 minutes?


H100s consume like 350W (the PCIe part; the SXM version is 700W). 3000 H100s = 1,050 kW. 1,050 kW * 0.2 h = 210 kWh, about $63 at $0.3/kWh (let's assume data centers have expensive electrical costs).
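
Same napkin math in Python:

    # 3,000 H100s at ~350W each for ~12 minutes (0.2 h).
    power_kw = 3000 * 0.350        # 1,050 kW
    kwh = power_kw * 0.2           # ~210 kWh
    print(kwh, kwh * 0.3)          # ~210 kWh, ~$63 at $0.3/kWh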


Ah, thanks! Taking the data from here[0] for the USA, this would be about 78 kg of CO2.

[0] https://ourworldindata.org/grapher/carbon-intensity-electric...
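
In code, using ~370 gCO2/kWh as a rough US grid average from that chart:

    # CO2 for the ~210 kWh computed above.
    print(210 * 0.370)             # ~78 kg of CO2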


Which, per the EPA [0], is roughly the tailpipe CO2 you'd emit by burning a little under 9 gallons of gas in your car, right? Which is to say, driving 190-ish miles in the US [1]?

It amazes me that computational feats of this magnitude can be so energy efficient in the scheme of things.

[0] https://www.epa.gov/greenvehicles/tailpipe-greenhouse-gas-em...

[1] https://www.epa.gov/greenvehicles/tailpipe-greenhouse-gas-em....


It's not that surprising actually: computation moves nothing, and all the energy has to turn into heat. If computation used as much energy as moving physical objects, the heat would burn everything down.


If you are on GCP you can choose low carbon data centers: https://cloud.google.com/sustainability/region-carbon



You can't really use 350W. It's very rare for a workload to consume the full TDP.


This is napkin math; for sure there are other components in the system that do not directly contribute to the computation – cooling systems, DC lights.


A large H100 cluster (>10k GPUs) could likely train an LLM with 10x the compute (in FP8) of GPT-4, which was apparently trained on a mix of A100s and V100s.


SkyNet awakens... then says, "I'll be back." Goes to sleep to be trained again. (Ominous music follows)... Across the screen it says, "Sarah Connor is now in grade school." Then blinks away. A moment later, "John Connor has yet to be conceived." (More ominous music.) An image of Jensen Huang in a Terminator leather jacket, standing holding the next-generation GPU, is shown. (More ominous music that peaks to a finale.)

Will you join the resistance?

LOL, I welcome our new overlord.


Is our new overlord Jensen Huang or Skynet? I prefer one over the other.


Yes.



