This points to an interesting future for foundation models. This is an 18x cost reduction in only 2 years. Either foundation models are going to get much bigger, or variations will become common.
V100 GPUs are from 2017, so it's more than two years. The A100 already appeared three years ago, btw.
An eight-GPU DGX-1 server cost ~$149k back then (from Googled news postings). A current-gen DGX H100 is $520k with 5 years of support. Of course it holds 5x the memory, plus the GPUs and interconnect are much faster. But when comparing costs, take price hikes into account.
An important thing to also keep in mind is how much inflation changed prices over that period. $520k in 2023 dollars is around $420k in 2017 dollars. Sure, still almost 3x more expensive in real terms, but that's better than the ~3.5x the nominal prices suggest.
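A quick back-of-the-envelope for that adjustment, assuming ~24% cumulative US CPI inflation from 2017 to 2023 (my assumption; the comments above only give the end prices):

    # Rough inflation adjustment for the DGX-1 vs DGX H100 comparison above
    dgx1_2017 = 149_000           # ~$149k DGX-1 (V100), in 2017 dollars
    dgx_h100_2023 = 520_000       # ~$520k DGX H100, in 2023 dollars
    cpi_2017_to_2023 = 1.24       # assumed ~24% cumulative inflation

    real_price = dgx_h100_2023 / cpi_2017_to_2023
    print(real_price)                 # ~$419k in 2017 dollars
    print(real_price / dgx1_2017)     # ~2.8x real increase, vs ~3.5x nominal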
Your citation is for 1k A100s, not 3.5k V100s. I think it's actually ~51 days on 3.5k V100s.
Just to compare the GPUs, peak Tensor Core throughput went from ~125 TFLOPS (FP16 on the V100) to ~990 TFLOPS (FP16 on the H100). They also added FP8, which roughly doubles that again.
What's interesting is to look at how we're progressing in performance over time. In some sense, a bit slow?
A V100 cost $10k at release; an H100 seems to be $40k.
So we've only managed to halve the cost of a flop in 5 years. That seems... much slower than what Moore's Law would have suggested.
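The arithmetic behind that, using the prices and FP16 Tensor Core peaks quoted above (street prices vary a lot, so treat it as a rough sketch):

    # Flops per dollar, V100 vs H100, from the figures in this thread
    v100_tflops, v100_price = 125, 10_000    # FP16 Tensor Core peak, ~launch price
    h100_tflops, h100_price = 990, 40_000    # FP16 Tensor Core peak (dense), ~street price

    improvement = (h100_tflops / h100_price) / (v100_tflops / v100_price)
    print(improvement)    # ~2x, i.e. cost per flop roughly halved in ~5 years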
> Moore's law says nothing about price or performance.
Moore's law is about doubling of transistor count for the same price. At least that's always been my understanding.
EDIT: I decided to look it up. Here's the original 1965 statement from him that led to the law:
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years."
No, it’s not about the price of a completely finished product, in my humble opinion. It’s about the cost to produce the silicon component itself. Sure, it’s about money both ways, but the distinction is important.
The V100 had up to 32 GB of RAM; the H100 has 80 GB (up to 188 GB for the dual-GPU NVL variant). While there are obviously transistors in RAM, the counts being compared are for the GPU itself. I’d argue that a big chunk of the price difference between the V100 and H100 is RAM.
Moore’s law never applied to RAM, and again, you aren’t comparing apples to apples either. There are a lot of factors that go into building a complete system, so stop trying to apply an old CPU transistor adage to something it was never intended for and maybe you’ll have better success?
Moore's law never applied to GPUs either. I don't know if it even said anything about CPUs, specifically. It's about transistor density vs cost.
Roughly the same tech advances apply to making all of these kinds of chips, so I think it applies (somewhat) equally to CPUs, GPUs, RAM, and even SSDs etc.
Bearing in mind that RAM lags a few generations behind CPUs and GPUs (I think DDR5 is on 12 nm vs the latest GPUs around 4 nm). And also that Moore's law is not a physical law; even when it was in full swing it was only ever meant as a rough guideline for what to expect over a couple of years.
What about power draw? A quick Google says the V100 is max 300W while the H100 is max 700W TDP, which makes the cost comparison more favorable than 4x, so more like 7x less per flop. *Assuming* the same electricity cost, which actually seems to have increased significantly.
On a (minor) side note, it seems that $1.00 from 2018 is worth ~$1.20 in 2023. I wish more cost comparisons included inflation, because the past few years have had a lot of it.
Good call on power draw, but it seems dwarfed by capital cost. Assuming a 3-year life (this tech gets outdated fast), 300W of constant consumption costs roughly $800 a year, or about $2.4k over the 3 years, at $0.3/kWh.
You need to be an order of magnitude higher in power usage before it really starts mattering, e.g. a Tesla Model 3 consumes around 15 kW (20x the H100) while driving.
That said, it looks like flops/watt improved by only 3.4x, which is also sub-Moore's-Law (roughly 3 years to halve the energy per flop).
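Roughly, from the TDP and throughput numbers above (FP16 Tensor peaks; real-world utilization will be lower on both):

    # Perf per watt and lifetime electricity cost, V100 vs H100
    v100_tflops, v100_watts = 125, 300
    h100_tflops, h100_watts = 990, 700

    print((h100_tflops / h100_watts) / (v100_tflops / v100_watts))  # ~3.4x better flops/W

    # Electricity over a 3-year life at $0.30/kWh, running flat out 24/7
    hours = 3 * 365 * 24
    print(0.700 * hours * 0.30)   # ~$5.5k for an H100, vs a ~$40k card
    print(0.300 * hours * 0.30)   # ~$2.4k for a V100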
I don't think flops/$ is enough to really capture the difference here.
You couldn't replicate the scale of compute this allows no matter how many V100s you had.
A huge amount of cost here is embodied in networking and memory.
If one were to design a chip that cared only for flop/$ without caring for all of the interconnect and memory, then the 4090 is a much fairer comparison, and even then that card isn't designed for a flop/$ optimisation.
Yeah, it's wild - feels like we see "software is so slow now! nobody optimizes software anymore, they just run Python and burn cycles!" posts all the time, but man, when a company REALLY wants to optimize something -- it's a thing of beauty.
With 3,584 GPUs, that's 448 nodes w/ 8 GPUs/node. At 5 nodes/rack, that's 90 racks. I looked and found an old listing for Lambda's Echelon racks at about $650k each, so the infra cost is a little under $60 million.
If you bought the $2/hr H100 instance they offer, that would cost ~$330k. Pricey, but not too bad.
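Sketch of both numbers, with the ~46-hour full-training estimate and the $2/GPU-hour rate taken from comments in this thread rather than anything official:

    import math

    gpus = 3584
    nodes = gpus // 8                  # 448 eight-GPU nodes
    racks = math.ceil(nodes / 5)       # 90 racks at 5 nodes/rack
    print(racks * 650_000)             # ~$58.5M of racks at ~$650k each

    # Renting instead of buying
    print(gpus * 46 * 2)               # ~$330k at $2/GPU-hour for ~46 hours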
My bigger hope is that with cheaper compute we can see more architecture search and new designs. A lot of different architectures are relatively unexplored due to computational constraints; that exploration is typically done by smaller labs, so it doesn't get scaled up, and it is kinda hard to compare models when we're just looking at performance benchmarks and not considering other factors. We definitely don't want big labs to railroad our research directions.

Feels weird that a huge amount of NLP is based on taking pretrained models and tuning them. Vision is going this way too. You basically can't get published without being SOTA, so you either have to modify an existing model or have a multi-million-dollar lab and train from scratch. Really weird to expect academia to compete with big labs, and really weird not to let academia take "bigger risks" and explore less popular areas. It is vital to our research path that we don't force everything onto a single track.
We're talking about building a server, so availability is usually different than for a home consumer. But fwiw Lambda has their H100 node for $330k (which gets pretty close to that price), whereas the A100 nodes are $170k, so double my figure. Especially considering the H100 nodes are 8U instead of 4U, so you need twice as many racks.
This is the GPT-3 benchmark; as others have said, this means it would be ~46 hours to train GPT-3 in full. Still impressive! Also, going by other comments here, ~3,500 of these GPUs, or 13-14 DGX GH200s at ~$10m each, means we are talking about ~$135m worth of compute here. Still very impressive, but holy hell that is a lot of money worth of compute hardware.
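Quick sanity check on that figure, assuming a DGX GH200 is a 256-GPU system and using the ~$10m-per-system estimate above:

    gpus = 3500
    systems = gpus / 256             # ~13.7 DGX GH200s
    print(systems, systems * 10e6)   # ~$137M of compute at ~$10M per system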
One interesting thing is that if the model can't fit into a single GPU's memory, it has to be sharded across multiple chips, and anything smaller would be much slower. So one chip would take more than a month.
That’s why these clusters are so valuable: even with Nvidia’s margins they are still cheaper than using less compute for longer.
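Rough numbers on why it has to be sharded, using the well-known 175B parameter count for GPT-3 and the 80 GB of HBM on a single H100:

    # FP16 weights alone, before optimizer state and activations needed for training
    params = 175e9
    weights_gb = params * 2 / 1e9    # 2 bytes per FP16 parameter
    print(weights_gb)                # ~350 GB of weights, vs 80 GB on one H100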
I am keeping my fingers crossed that there will be a 'Bloom Filter' moment for AI where we create an algorithm that can eliminate an expensive calculation in 99% of interactions and give us a 10x speedup in some larger problem domain.
Or some other 'trust but verify' situation where the suggested action is still validated against business rules that have some notion of consumer protection.
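For anyone who hasn't met the analogy: a toy sketch of the pattern, a cheap probabilistic pre-check whose "no" is definite so the expensive path only runs on the rare "maybe" (expensive_lookup is just a made-up stand-in):

    import hashlib

    class BloomFilter:
        # "no" answers are definite; "maybe" answers are occasional false positives
        def __init__(self, size_bits=1 << 20, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    def expensive_lookup(key):
        return f"result for {key}"    # stand-in for the costly computation

    seen = BloomFilter()
    seen.add("user:42")

    def handle(key):
        if not seen.might_contain(key):
            return None               # the vast majority of misses skip the expensive path
        return expensive_lookup(key)  # only the rare "maybe" cases pay full price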
I think the parent is referring to the fact that LLMs and generative models are, for the time being, “creating new markets” and “creating new problems”, but not really solving any of the hard problems we already have. I see it this way: you have this incredible piece of technology. You can use it to take the scraps away from a bunch of starving artists, and convert that survival income into wealth for some entrepreneur. Or you can use it to cure cancer or aging or make fossil-free airplanes. At the moment, the first application seems to be the one moving the market, and that’s really sad.
On the side of hope, if somebody were to demonstrate how to use a bunch of GPUs to make an E. coli bacterium live a little longer using an approach that has a semblance of generality, I guarantee that a lot of old people would be willing to part with their fortunes in exchange for a sliver of hope for themselves or future generations.
Not sure exactly what you mean here, but most quantum mechanics calculations in materials science and chemistry applications can now be done at dramatic speedups using such approximate models. Those fields had exactly this Bloom Filter moment a few years back, and the complete transformation of how people in these domains work is well underway.
I’ve used both and found both to work well for me. There are a variety of other gpu cloud options that aren’t good, though. I listed a bunch here, some are good, some seem to be worse on all dimensions (price, capacity, and UX) - https://gpus.llm-utils.org/alternative-gpu-clouds/
Tl;dr of the good ones: FluidStack and Lambda for H100s (1x instances), Runpod for A100s.
Has AMD completely missed the ML/AI train? I’m quite surprised that even for inference, there doesn’t seem to be a viable competitor to nvidia. Is there anything in AMD’s roadmap to suggest they are planning to even compete?
With 384 Gaudi2s they did the LLM task in 312 minutes, compared to 46 minutes for 768 H100s. It’ll come down to cost, but given the H100 is a process node or two ahead (and much more expensive, I imagine?), Intel is actually much closer than I had realized. They’re the only other vendor to submit an MLPerf result for the LLM task, I think. All credit of course to Habana Labs, which was acquired by Intel.
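A crude per-chip comparison from those two results (it ignores scaling efficiency, which won't be perfect at either cluster size):

    # Chip-minutes to finish the MLPerf LLM task
    gaudi2 = 384 * 312
    h100 = 768 * 46
    print(gaudi2 / h100)    # ~3.4x more chip-minutes on Gaudi2, i.e. H100 ~3.4x faster per chip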
As a ML outsider, this was the first I’ve ever heard of it. Obviously I’m not the target audience, but I’m just shocked that I didn’t even know Intel was in the game.
Intel's going big in their own way by putting "AI" accelerators and the like in their latest and future processors; you kind of need to be living under a rock to have missed it in these kinds of tech circles.
But that said, Intel suffers from the same fundamental problem as AMD: They aren't Nvidia and they can't CUDA.
I’d argue you don’t need CUDA if PyTorch runs well on your platform, which is what everyone is trying to show with these MLPerf results. I think Intel’s strategy with oneAPI is not bad, it’s just late.
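A minimal sketch of what that means in practice: the model code stays the same and only the device selection changes. (The non-CUDA device names, like Habana's "hpu" or Intel's "xpu", come from vendor plugins and are only mentioned for illustration.)

    import torch
    import torch.nn as nn

    # Use whatever accelerator the installed PyTorch build exposes;
    # vendor backends plug in their own device types via extensions.
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
    x = torch.randn(8, 1024, device=device)
    y = model(x)   # identical model code regardless of whose silicon is underneath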
More like Intel stopped being interesting years ago. For me, they just keep reheating old designs and have stopped innovating. To get my attention back, they would have to create something that reaches me through mainstream articles. Otherwise, I think following Intel is a waste of time. They need to prove themselves.
It might be technically possible to port CUDA to Intel or AMD, but there might be patent or copyright rules that prevent legal redistribution. This wouldn't be the first time that IP rules stifle free-market competition.
Probably better to aim for PyTorch compatibility. In practice, that's how most AI programmers interact with their GPUs.
I agree with this. Even their GPU drivers for gaming are bad, they get stuck at “basic” things like MPEG encoding.
But I don’t understand why software is a problem for them with their deep pockets. It can’t possibly be a dearth of talent, or that it is expensive. Here in Europe a good software engineer earns half of what a mediocre software engineer earns in the USA, to say nothing of India or China. They could just hire a bunch of teams and up their software game.
The "even for inference" thing has turned into a bit of a trap imo.
Data-parallel models scaled up for training and then could run on individual chips, but these massive model-parallel models require a couple of chips directly linked together even to do inference.
So the idea that a competitor could come in with a simple, cheap inference chip doesn't really work.
I thought the H100s were ~$30-40k each. But they're not widely available, and you usually buy multiple boxes from vendors that also come with expensive CPUs/RAM etc.
H100s are intended to be retailed piecemeal; the big enterprise model is the DGX GH200, at ~$10 million each.
It's just that current supply is extremely short, so the H100s end up only available to big buyers. But that will be resolved in time. Nvidia wants every university lab to have an H100 so no competitor sneaks in there.
Which, per the EPA [0], is roughly the tailpipe CO2 you’d emit by burning a little under 3 gallons of gas in your car, right? Which is to say, driving 65ish miles in the US [1]?
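The conversion, assuming the standard EPA factors (~8.9 kg of CO2 per gallon of gasoline, ~22 mpg for the average US passenger car):

    gallons = 2.9             # "a little under 3 gallons"
    print(gallons * 8.887)    # ~26 kg of tailpipe CO2
    print(gallons * 22.2)     # ~64 miles of driving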
It amazes me that computational feats of this magnitude can be so energy efficient in the scheme of things.
It’s not that surprising actually: computation moves nothing, and all the energy has to turn into heat. If computation used as much energy as moving physical objects, the heat would burn everything down.
SkyNet awakens... then says, "I'll be back." Goes to sleep to be trained again. (ominous music follows)... Across the screen it says, "Sarah Connor is now in grade school." Then blinks away. A moment later, "John Connor has yet to be conceived." (More ominous music) An image of Jensen Huang in a Terminator leather jacket, standing holding the next-generation GPU, is shown. (More ominous music that peaks to a finale)
An equivalent number of V100s (GPT-3's original GPU) would've taken about 36 days [0].
0 – https://www.reddit.com/r/GPT3/comments/p1xf10/comment/h8h3sl...