It's fantastic that more orgs are releasing open-source models trained on more than ~300B tokens. Here's my take, based on the details I could find.
Pros
- 4096 context width (vs 2048 for llama, gpt-j, etc)
- 3B to 65B released or in progress
- RL tuned models available
- Trained on more tokens than existing non-llama models
- 128 head dim, so can use flash attention (unlike GPT-J)
Cons
- No benchmarks released, or details about the model
- Somewhat restrictive license on the base models, and NC license on the RL models
- Small models only trained on 800B tokens, compared to 1T for llama-7B, and potentially more for other upcoming alternatives (RedPajama, etc). I'd like to see their loss curves to see why they chose 800B.
High-level, this is likely to be more accurate than existing non-llama open source models. It's hard to say without benchmarks (but benchmarks have been gamed by training on benchmark data, so really it's just hard to say).
Some upcoming models in the next few weeks may be more accurate than this, and have less restrictive licenses. But this is a really good option nonetheless.
FYI, I'm running lm-eval now w/ the tests Bellard uses (lambada_standard, hellaswag, winogrande, piqa, coqa) on the biggest 7B on a 40GB A100 atm (non-quantized version, requires 31.4GB), so it will be directly comparable to what the various LLaMAs look like: https://bellard.org/ts_server/
(UPDATE: the run took 1:36 to complete but failed at the end with a TypeError, so I'll need to poke around and rerun.)
Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392
The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on the training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (the releases don't support hf-causal) for those looking to replicate/debug.
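For those replicating, a run like this with the harness's Python API looks roughly like the following. The model id and exact arguments are my guesses from memory, so check them against the repo and the StableLM HF page before relying on this:

```python
# Rough sketch of the eval run described above, using the
# lm-evaluation-harness Python API (HEAD as of April 2023).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # the HF causal-LM adapter mentioned above
    model_args="pretrained=stabilityai/stablelm-base-alpha-7b",  # assumed repo id
    tasks=["lambada_standard", "hellaswag", "winogrande", "piqa", "coqa"],
    num_fewshot=0,
    batch_size=4,
    device="cuda:0",
)
print(results["results"])  # per-task accuracy / perplexity numbers
```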
How possible is it that every other model suffers from dataset contamination and this model is being unfairly penalized for having properly sanitized training data?
I'm still on the waitlist for GPT-4 API access. Note that text-davinci-003 cost about $90 to benchmark at $0.02/1K tokens, so if you're able to use a GPT-4 model (for completion, not just instruction) that'll probably be $270-$540 in credits to benchmark...
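For what it's worth, the arithmetic behind those numbers (the 3x-6x GPT-4 price multiple is my assumption inferred from the quoted range, not an official price):

```python
# Back-of-envelope behind the figures above.
tokens = 90 / 0.02 * 1_000          # ~4.5M tokens implied by the $90 davinci run
for multiple in (3, 6):             # assumed GPT-4 price as a multiple of davinci
    price_per_1k = 0.02 * multiple
    print(f"{multiple}x davinci price: ${tokens / 1_000 * price_per_1k:,.0f}")
# -> $270 and $540, matching the quoted range
```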
That's great news, but one would think that since they're behind Stable Diffusion, they'd apply the insights from it and scale the data even further, to get better quality out of a smaller model that can run on most people's machines.
Like... try 10 trillion or 100 trillion tokens (though that may be absurd, I never did the calculation) and a long context on a 7B parameter model, then see if that gets you better results than a 30B or 65B parameter model on 1.5 trillion tokens.
A lot of these open source projects just seem to be trying to follow and (poorly) reproduce OpenAI's breakthroughs instead of trying to surpass them.
You could've said the same to OpenAI when they were scaling GPT from 1 billion to 175 billion parameters. We're all grateful they didn't follow that line of thought.
But Stability does have access to a pretty big cluster, so it's (I assume) not paying for cloud compute, which keeps costs down. And of course data is not infinite... I never said it was.
But considering that 3.7 million videos are uploaded to YouTube every day, 2 million scientific articles are published every year, yada yada... that argument falls apart.
At the very least, implement spiral development... 1 trillion... 3 trillion... (oh, it seems to be getting WAY better! There seems to be a STEP CHANGE!)... 5 trillion... (holy shit, this really works, let's keep going)
The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV Bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, in terms of quality, prior vetting, and being in a standard format for ingestion, in your first trillion tokens of training data.
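Rough arithmetic behind that ballpark (the word count and tokens-per-word ratio are approximations):

```python
# Sanity check on the "million bibles" figure.
kjv_words = 783_000                        # approximate word count of the KJV
tokens_per_word = 1.3                      # rough BPE tokens per English word
kjv_tokens = kjv_words * tokens_per_word   # ~1M tokens per bible
extra_tokens = 1e12                        # one extra trillion tokens
print(f"{extra_tokens / kjv_tokens:,.0f} KJV bibles")  # ~1 million
```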
There’s a difference between telling someone they’re wasting their time with their current project, and asking them why they didn’t spend 6x - 60x as much budget on an already expensive project.
Nobody knows where to find 10 trillion tokens of good data. Publicly available / data without a license seems to cap at around 1.5 trillion tokens total. The internet isn't as big as you thought! (Or at least, all the good stuff is behind a walled garden, which I think we did know)
@thunderbird120 asked a Stability employee, and they say the plan is to keep training the models up to 1.5T. So I don't know where you read this.
I'm wondering what the sweet spot for parameters will be. Right now it feels like the Mhz race we had back in the CPU days, but 20 years later I am still using a 2-3GHz CPU.
There have also been quite a few developments on sparsity lately. Here's one technique, SparseGPT, which suggests that you can prune 50% of parameters with almost no loss in performance: https://arxiv.org/abs/2301.00774
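SparseGPT itself does layer-wise weight reconstruction using approximate second-order information; purely to illustrate what "50% sparsity" means mechanically, here's a much cruder unstructured magnitude-pruning sketch (not the SparseGPT algorithm):

```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude entries of a weight matrix in place.

    Plain magnitude pruning, far cruder than SparseGPT's reconstruction-based
    method; it only illustrates the 50% sparsity target mentioned above.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    weight.masked_fill_(weight.abs() <= threshold, 0.0)

w = torch.randn(4096, 4096)
magnitude_prune_(w, 0.5)
print((w == 0).float().mean())  # ~0.5
```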
I was wondering if the longer training thing was a similar phenomenon to the double-descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters) - but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).
Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. Also, FlashAttention is faster.
https://arxiv.org/abs/2205.14135 - Section 5 suggests that the biggest limitation is that custom CUDA kernels need to be coded on a per-GPU architecture basis.
FlashAttention is mathematically identical to standard attention, so in theory there's no downside. In practice, floating-point numerical inaccuracies mean that the results differ slightly. I don't know of any papers analyzing in depth what impact those variances have across a range of real models, but generally speaking deep models handle slight variances well. I've not noticed any difference in my own model training. And tons of people use FlashAttention as a drop-in replacement in models trained with standard attention (e.g. using xformers in StableDiffusion).
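To make the "slightly different results" concrete, here's a minimal sketch (assuming a CUDA GPU and PyTorch >= 2.0) comparing naive attention against torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel for fp16 inputs on supported GPUs:

```python
import math
import torch
import torch.nn.functional as F

# Naive attention vs. the fused kernel in PyTorch 2.0. The outputs match up
# to floating-point tolerance rather than bit-exactly.
q, k, v = (torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.float16)
           for _ in range(3))

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
naive = torch.softmax(scores, dim=-1) @ v

fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-2, rtol=1e-2))  # close, not identical
print((naive - fused).abs().max())
```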
Also, in practice FlashAttention is still relatively new, so it isn't well supported in libraries yet. Until PyTorch 2.0 you had to either implement it yourself or use something like xformers, which comes with a bag of caveats. PyTorch 2.0 now has it built in, and it's easy to use, but the implementation is incomplete, so you can't, for example, use it with an attention mask (which is needed in LLMs).
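A sketch of the PyTorch 2.0 backend selection, and of the attention-mask limitation just mentioned; the exact error behavior may vary by version, so treat this as illustrative:

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Force the FlashAttention backend only (PyTorch 2.0 API; later versions
# moved this to torch.nn.attention.sdpa_kernel).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # works

    # An explicit attention mask is where it breaks down: the flash kernel
    # in 2.0 doesn't support attn_mask, so with the other backends disabled
    # this errors out instead of dispatching to FlashAttention.
    mask = torch.ones(1, 1, 1024, 1024, device="cuda", dtype=torch.bool)
    try:
        F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    except RuntimeError as e:
        print("flash backend rejected attn_mask:", e)
```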
tl;dr: Basically none, but it just isn't well supported yet.
According to the paper, FlashAttention still makes a number of HBM accesses that is quadratic in sequence length:
> Let N be the sequence length, d be the head dimension, and M be the size of SRAM with d ≤ M ≤ Nd. Standard attention (Algorithm 0) requires Θ(Nd + N²) HBM accesses, while FlashAttention (Algorithm 1) requires Θ(N²d²M⁻¹) HBM accesses.
But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once, inference happens many times; stopping training at the point where it's cheaper to train a larger model for the same (proxy for) quality discounts the cost of inference to zero.
If I understand correctly, based on their prediction in Table 3 on page 8, they do have enough tokens, but they also need over an order of magnitude more compute time.
> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.
This is OP's comment you replied to - so I was responding under OP's assumption that the amount of compute time would be the same, which I apologize I didn't make clear; my response was very poorly worded.
My intent was to link the paper because I think it supports OP's statement: for the same amount of compute and an appropriate token ratio, a smaller model will perform better than a larger one (assuming they haven't converged yet, which they haven't at this size).
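To make that concrete, here's a rough back-of-envelope using the common C ≈ 6·N·D FLOPs approximation and the ~20 tokens/parameter Chinchilla rule of thumb; both are ballpark heuristics I'm assuming, not figures taken from the paper's tables:

```python
# Same compute budget, two model sizes: how many tokens each could see.
def train_flops(params, tokens):
    return 6 * params * tokens          # standard C ~= 6 * N * D approximation

budget = train_flops(175e9, 300e9)      # GPT-3-scale run: 175B params, 300B tokens
print(f"budget: {budget:.2e} FLOPs")

for params in (175e9, 65e9):
    tokens = budget / (6 * params)      # tokens affordable at this size
    ratio = tokens / params             # tokens per parameter
    print(f"{params/1e9:.0f}B params -> {tokens/1e9:.0f}B tokens "
          f"({ratio:.1f} tokens/param vs ~20 recommended)")
# The 65B model gets ~808B tokens (~12 tokens/param) from the same budget,
# much closer to compute-optimal than 175B params on 300B tokens (~1.7).
```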
> If you want it to just regurgitate training data, sure.
This paper was about showing Chinchilla performing on par with models many times larger than itself, showing you don't need a 175B model to get more performance than "regurgitating training data".
…but, a fully trained larger model is going to be better.
The only reasonable reason to prefer a smaller model is that it's cheaper and less intensive to train.
It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.
For a given compute budget, an under-trained large model and a well-trained small model may be comparable, right?
…but surely the law of diminishing returns applies here?
There’s an upper bound to how good your smaller model can ever be, right?
Over time, someone can take a larger model that is under-trained and keep refining it, right?
The “small model is just as good” narrative only holds up for a fixed, once-only training of a model at a fixed compute budget at the moment of release.
Over all of time that compute budget is not fixed.
> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is meant to be under OP's context of limited compute; the statement I'm trying to make is “a fully trained small model is just as good as an undertrained large model”.
> …but surely the law of diminishing returns applies here?
They do, but it's diminishing in the sense that the performance gains of larger models become smaller and smaller, while the required training time changes a lot. If I'm reading the first chart of Figure 2 (page 5) correctly, comparing 5B vs 10B, the 10B needs almost 10x the training time for a ~10% loss improvement, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and the loss gain from each 10x becomes gradually lower and lower.
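If I'm remembering the fitted constants from the Chinchilla paper right (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28), you can eyeball those diminishing returns with their parametric loss fit L(N, D) = E + A/N^α + B/D^β; treat the exact numbers loosely, the point is how slowly loss falls as N grows at a fixed token count:

```python
# Parametric loss fit from the Chinchilla paper (constants from memory).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(params, tokens):
    return E + A / params**alpha + B / tokens**beta

for params in (1e9, 5e9, 10e9, 65e9):
    print(f"{params/1e9:>4.0f}B params, 800B tokens -> loss {loss(params, 800e9):.3f}")
```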
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1000 GPUs for 30 days for a small model, you need 1000 GPUs for 300 days for that ~10% at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale it beyond what I think is a ~1-2T parameter model.
I'm sure there will be a bunch of different RL-tuned versions of them; RLHF isn't that expensive. IIRC Microsoft has software that will do it for a few thousand dollars for a model that size. I'm sure someone will release a non-lobotomized version, maybe OpenAssistant.
It's unclear which models will be trained to 1.5T tokens. The details of how many tokens each model saw in training are on Github - https://github.com/stability-AI/stableLM/ . But only for the ones that have been released.
I just asked a Stability employee and they said the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B tokens is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.
I've asked this question in a few places, and never been able to get an answer, maybe you know...
Q: Why are these LLMs trained for only a single epoch, and why do they perform worse if the dataset is repeated?
This seems maybe related to suspecting data duplication as a cause of overfitting.
Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!
Never repeating your training data is what you'd ideally like to do for training basically any ML model. If you do that you don't really need to worry about overfitting since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorizing it since each training step will involve data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.
So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc. augmentation, it's really better to think of this as one epoch over a unique set of images than as multi-epoch per se? I was really thinking of augmentation as a way to get coverage over the input space rather than ensuring the training data doesn't repeat, but I guess it serves both purposes.
It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on? It seems like a bit of a balancing act - wanting an LLM to generalize, yet also to serve as somewhat of a knowledge store for rare data it has only seen once.