It's fantastic that more orgs are releasing open-source models trained on more than ~300B tokens. Here's my take, based on the details I could find.
Pros
- 4096 context width (vs 2048 for llama, gpt-j, etc)
- 3B to 65B released or in progress
- RL tuned models available
- Trained on more tokens than existing non-llama models
- 128 head dim, so can use flash attention (unlike GPT-J)
Cons
- No benchmarks released, or details about the model
- Somewhat restrictive license on the base models, and NC license on the RL models
- Small models only trained on 800B tokens, compared to 1T for llama-7B, and potentially more for other upcoming alternatives (RedPajama, etc). I'd like to see their loss curves to see why they chose 800B.
High-level, this is likely to be more accurate than existing non-llama open source models. It's hard to say without benchmarks (but benchmarks have been gamed by training on benchmark data, so really it's just hard to say).
Some upcoming models in the next few weeks may be more accurate than this, and have less restrictive licenses. But this is a really good option nonetheless.
FYI, I'm running lm-eval now w/ the tests Bellard uses (lambada_standard, hellaswag, winogrande, piqa, coqa) on the biggest 7B on a 40GB A100 atm (non-quantized version, requires 31.4GB), so it will be directly comparable to what the various LLaMAs look like: https://bellard.org/ts_server/
(UPDATE: the run took 1:36 to complete but failed at the end with a TypeError, so I'll need to poke around and rerun.)
Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392
The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on the training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (the releases don't support hf-causal) for those looking to replicate/debug.
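For those replicating, a run like this with the harness's Python API looks roughly like the following. The model id and exact arguments are my guesses from memory, so check them against the repo and the StableLM HF page before relying on this:

```python
# Rough sketch of the eval run described above, using the
# lm-evaluation-harness Python API (HEAD as of April 2023).
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",  # the HF causal-LM adapter mentioned above
    model_args="pretrained=stabilityai/stablelm-base-alpha-7b",  # assumed repo id
    tasks=["lambada_standard", "hellaswag", "winogrande", "piqa", "coqa"],
    num_fewshot=0,
    batch_size=4,
    device="cuda:0",
)
print(results["results"])  # per-task accuracy / perplexity numbers
```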
How possible is it that every other model suffers from dataset contamination and this model is being unfairly penalized for having properly sanitized training data?
I'm still on the waitlist for GPT-4 API access. Note that text-davinci-003 cost about $90 to benchmark at $0.02/1K tokens, so if you're able to use a GPT-4 model (for completion, not just instruction) that'll probably be $270-$540 in credits to benchmark...
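For what it's worth, the arithmetic behind those numbers (the 3x-6x GPT-4 price multiple is my assumption inferred from the quoted range, not an official price):

```python
# Back-of-envelope behind the figures above.
tokens = 90 / 0.02 * 1_000          # ~4.5M tokens implied by the $90 davinci run
for multiple in (3, 6):             # assumed GPT-4 price as a multiple of davinci
    price_per_1k = 0.02 * multiple
    print(f"{multiple}x davinci price: ${tokens / 1_000 * price_per_1k:,.0f}")
# -> $270 and $540, matching the quoted range
```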
That's great news, but one would think that since they're behind Stable Diffusion, they'd apply the insights from it and scale the data even further, to get better quality out of a smaller model that can run on most people's machines.
Like... try 10 trillion or 100 trillion tokens (though that may be absurd, I never did the calculation) and a long context on a 7B parameter model, then see if that gets you better results than a 30B or 65B parameter model on 1.5 trillion tokens.
A lot of these open source projects just seem to be trying to follow and (poorly) reproduce OpenAI's breakthroughs instead of trying to surpass them.
You could've said the same to OpenAI when they were scaling GPT from 1 billion to 175 billion parameters. We're all grateful they didn't follow that line of thought.
But Stability does have access to a pretty big cluster, so it's (I assume) not paying for cloud compute, which keeps costs down. And of course data is not infinite... I never said it was.
But considering that 3.7 million videos are uploaded to YouTube every day, 2 million scientific articles are published every year, yada yada... that argument falls apart.
At the very least, implement spiral development... 1 trillion... 3 trillion... (oh, it seems to be getting WAY better! There seems to be a STEP CHANGE!)... 5 trillion... (holy shit, this really works, let's keep going)
The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV Bibles' worth of text formatted for ingestion. And you probably picked all of the low-hanging fruit, in terms of quality, prior vetting, and being in a standard format for ingestion, in your first trillion tokens of training data.
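Rough arithmetic behind that ballpark (the word count and tokens-per-word ratio are approximations):

```python
# Sanity check on the "million bibles" figure.
kjv_words = 783_000                        # approximate word count of the KJV
tokens_per_word = 1.3                      # rough BPE tokens per English word
kjv_tokens = kjv_words * tokens_per_word   # ~1M tokens per bible
extra_tokens = 1e12                        # one extra trillion tokens
print(f"{extra_tokens / kjv_tokens:,.0f} KJV bibles")  # ~1 million
```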
There’s a difference between telling someone they’re wasting their time with their current project, and asking them why they didn’t spend 6x - 60x as much budget on an already expensive project.
Nobody knows where to find 10 trillion tokens of good data. Publicly available / data without a license seems to cap at around 1.5 trillion tokens total. The internet isn't as big as you thought! (Or at least, all the good stuff is behind a walled garden, which I think we did know)
@thunderbird120 asked a Stability employee, and they say the plan is to keep training the models up to 1.5T. So I don't know where you read this.
I'm wondering what the sweet spot for parameters will be. Right now it feels like the Mhz race we had back in the CPU days, but 20 years later I am still using a 2-3GHz CPU.
There have also been quite a few developments on sparsity lately. Here's one technique, SparseGPT, which suggests that you can prune 50% of parameters with almost no loss in performance: https://arxiv.org/abs/2301.00774
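SparseGPT itself does layer-wise weight reconstruction using approximate second-order information; purely to illustrate what "50% sparsity" means mechanically, here's a much cruder unstructured magnitude-pruning sketch (not the SparseGPT algorithm):

```python
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude entries of a weight matrix in place.

    Plain magnitude pruning, far cruder than SparseGPT's reconstruction-based
    method; it only illustrates the 50% sparsity target mentioned above.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    weight.masked_fill_(weight.abs() <= threshold, 0.0)

w = torch.randn(4096, 4096)
magnitude_prune_(w, 0.5)
print((w == 0).float().mean())  # ~0.5
```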
I was wondering if the longer training thing was a similar phenomenon to the double-descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters) - but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).
Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. Also, FlashAttention is faster.
https://arxiv.org/abs/2205.14135 - Section 5 suggests that the biggest limitation is that custom CUDA kernels need to be coded on a per-GPU architecture basis.
FlashAttention is mathematically identical to standard attention, so in theory there's no downside. In practice, floating-point numerical inaccuracies mean that the results differ slightly. I don't know of any papers analyzing in depth what impact those variances have across a range of real models, but generally speaking deep models handle slight variances well. I've not noticed any difference in my own model training. And tons of people use FlashAttention as a drop-in replacement in models trained with standard attention (e.g. using xformers in StableDiffusion).
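To make the "slightly different results" concrete, here's a minimal sketch (assuming a CUDA GPU and PyTorch >= 2.0) comparing naive attention against torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel for fp16 inputs on supported GPUs:

```python
import math
import torch
import torch.nn.functional as F

# Naive attention vs. the fused kernel in PyTorch 2.0. The outputs match up
# to floating-point tolerance rather than bit-exactly.
q, k, v = (torch.randn(1, 8, 1024, 128, device="cuda", dtype=torch.float16)
           for _ in range(3))

scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
naive = torch.softmax(scores, dim=-1) @ v

fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(naive, fused, atol=1e-2, rtol=1e-2))  # close, not identical
print((naive - fused).abs().max())
```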
Also, in practice FlashAttention is still relatively new, so it isn't well supported in libraries yet. Until PyTorch 2.0 you had to either implement it yourself or use something like xformers, which comes with a bag of caveats. PyTorch 2.0 now has it built in, and it's easy to use, but the implementation is incomplete, so you can't, for example, use it with an attention mask (which is needed in LLMs).
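A sketch of the PyTorch 2.0 backend selection, and of the attention-mask limitation just mentioned; the exact error behavior may vary by version, so treat this as illustrative:

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Force the FlashAttention backend only (PyTorch 2.0 API; later versions
# moved this to torch.nn.attention.sdpa_kernel).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # works

    # An explicit attention mask is where it breaks down: the flash kernel
    # in 2.0 doesn't support attn_mask, so with the other backends disabled
    # this errors out instead of dispatching to FlashAttention.
    mask = torch.ones(1, 1, 1024, 1024, device="cuda", dtype=torch.bool)
    try:
        F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    except RuntimeError as e:
        print("flash backend rejected attn_mask:", e)
```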
tl;dr: Basically none, but it just isn't well supported yet.
According to the paper, FlashAttention still makes a number of HBM accesses that is quadratic in sequence length:
> Let N be the sequence length, d be the head dimension, and M be the size of SRAM with d ≤ M ≤ Nd. Standard attention (Algorithm 0) requires Θ(Nd + N²) HBM accesses, while FlashAttention (Algorithm 1) requires Θ(N²d²M⁻¹) HBM accesses.
But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once, inference happens many times; stopping training at the point where it's cheaper to train a larger model for the same (proxy for) quality discounts the cost of inference to zero.
If I understand correctly, based on their prediction in Table 3 on page 8, they do have enough tokens, but they also need over an order of magnitude more compute time.
> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.
This is OP's comment you replied to - so I was responding under OP's assumption that the amount of compute time would be the same, which I apologize I didn't make clear; my response was very poorly worded.
My intent was to link the paper because I think it supports OP's statement: for the same amount of compute and an appropriate token ratio, a smaller model will perform better than a larger one (assuming they haven't converged yet, which they haven't at this size).
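To make that concrete, here's a rough back-of-envelope using the common C ≈ 6·N·D FLOPs approximation and the ~20 tokens/parameter Chinchilla rule of thumb; both are ballpark heuristics I'm assuming, not figures taken from the paper's tables:

```python
# Same compute budget, two model sizes: how many tokens each could see.
def train_flops(params, tokens):
    return 6 * params * tokens          # standard C ~= 6 * N * D approximation

budget = train_flops(175e9, 300e9)      # GPT-3-scale run: 175B params, 300B tokens
print(f"budget: {budget:.2e} FLOPs")

for params in (175e9, 65e9):
    tokens = budget / (6 * params)      # tokens affordable at this size
    ratio = tokens / params             # tokens per parameter
    print(f"{params/1e9:.0f}B params -> {tokens/1e9:.0f}B tokens "
          f"({ratio:.1f} tokens/param vs ~20 recommended)")
# The 65B model gets ~808B tokens (~12 tokens/param) from the same budget,
# much closer to compute-optimal than 175B params on 300B tokens (~1.7).
```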
> If you want it to just regurgitate training data, sure.
This paper was about showing Chinchilla performing on par with models many times larger than itself, showing you don't need a 175B model to get more performance than "regurgitating training data".
…but, a fully trained larger model is going to be better.
The only reasonable reason to prefer a smaller model is that it's cheaper and less intensive to train.
It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.
For a given compute budget, an under-trained large model and a well-trained small model may be comparable, right?
…but surely the law of diminishing returns applies here?
There’s an upper bound to how good your smaller model can ever be, right?
Over time, someone can take a larger model that is under-trained and keep refining it, right?
The “small model is just as good” narrative only holds up for a fixed, once-only training of a model at a fixed compute budget at the moment of release.
Over all of time that compute budget is not fixed.
> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is meant to be under OP's context of limited compute; the statement I'm trying to make is “a fully trained small model is just as good as an undertrained large model”.
> …but surely the law of diminishing returns applies here?
They do, but it's diminishing in the sense that the performance gains of larger models become smaller and smaller, while the required training time changes a lot. If I'm reading the first chart of Figure 2 (page 5) correctly, comparing 5B vs 10B, the 10B needs almost 10x the training time for a ~10% loss improvement, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and the loss gain from each 10x becomes gradually lower and lower.
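If I'm remembering the fitted constants from the Chinchilla paper right (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28), you can eyeball those diminishing returns with their parametric loss fit L(N, D) = E + A/N^α + B/D^β; treat the exact numbers loosely, the point is how slowly loss falls as N grows at a fixed token count:

```python
# Parametric loss fit from the Chinchilla paper (constants from memory).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(params, tokens):
    return E + A / params**alpha + B / tokens**beta

for params in (1e9, 5e9, 10e9, 65e9):
    print(f"{params/1e9:>4.0f}B params, 800B tokens -> loss {loss(params, 800e9):.3f}")
```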
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1000 GPUs for 30 days for a small model, you need 1000 GPUs for 300 days for that ~10% at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale it beyond what I think is a ~1-2T parameter model.
I'm sure there will be a bunch of different RL-tuned versions of them; RLHF isn't that expensive. IIRC Microsoft has software that will do it for a few thousand dollars for a model that size. I'm sure someone will release a non-lobotomized version, maybe OpenAssistant.
It's unclear which models will be trained to 1.5T tokens. The details of how many tokens each model saw in training are on Github - https://github.com/stability-AI/stableLM/ . But only for the ones that have been released.
I just asked a Stability employee and they said the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B tokens is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.
I've asked this question in a few places, and never been able to get an answer, maybe you know...
Q: Why are these LLMs trained for only a single epoch, and why do they perform worse if the dataset is repeated?
This seems maybe related to suspecting data duplication as a cause of overfitting.
Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!
Never repeating your training data is what you'd ideally like to do for training basically any ML model. If you do that you don't really need to worry about overfitting since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorizing it since each training step will involve data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.
So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc. augmentation, it's really better to think of this as one epoch over a unique set of images than as multi-epoch per se? I was really thinking of augmentation as a way to get coverage over the input space rather than ensuring the training data doesn't repeat, but I guess it serves both purposes.
It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on? It seems like a bit of a balancing act - wanting an LLM to generalize, yet also to serve as somewhat of a knowledge store for rare data it has only seen once.