
It's fantastic that more orgs are releasing open-source models trained on more than 300B or so tokens. Here's my take from the details I could find.

Pros

  - 4096 context width (vs 2048 for llama, gpt-j, etc)
  - 3B to 65B released or in progress
  - RL tuned models available
  - Trained on more tokens than existing non-llama models
  - 128 head dim, so can use flash attention (unlike GPT-J)
Cons

  - No benchmarks released, or details about the model
  - Somewhat restrictive license on the base models, and NC license on the RL models
  - Small models only trained on 800B tokens, compared to 1T for llama-7B, and potentially more for other upcoming alternatives (RedPajama, etc).  I'd like to see their loss curves to see why they chose 800B.
High-level, this is likely to be more accurate than existing non-llama open source models. It's hard to say without benchmarks (but benchmarks have been gamed by training on benchmark data, so really it's just hard to say).

Some upcoming models in the next few weeks may be more accurate than this, and have less restrictive licenses. But this is a really good option nonetheless.
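
For anyone who wants to poke at it directly: the released alpha checkpoints are ordinary Hugging Face causal LMs, so something along these lines should work (a rough sketch; the model id is my assumption based on the Stability-AI/StableLM repo, and the 7B needs roughly 30+ GB in fp16):

  # Sketch for trying the 7B alpha checkpoint locally (assumes transformers
  # and accelerate are installed, plus enough GPU memory for fp16).
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  name = "stabilityai/stablelm-base-alpha-7b"   # assumed HF model id
  tok = AutoTokenizer.from_pretrained(name)
  model = AutoModelForCausalLM.from_pretrained(
      name, torch_dtype=torch.float16, device_map="auto")

  prompt = "The pros and cons of open-source language models:"
  inputs = tok(prompt, return_tensors="pt").to(model.device)
  out = model.generate(**inputs, max_new_tokens=64,
                       do_sample=True, temperature=0.7)
  print(tok.decode(out[0], skip_special_tokens=True))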



FYI, I'm running lm-eval now w/ the tests Bellard uses (lambada_standard, hellaswag, winogrande, piqa, coqa) on the biggest 7B on a 40GB A100 atm (non-quantized version, requires 31.4GB) so will be directly comparable to what various LLaMAs look like: https://bellard.org/ts_server/

(UPDATE: the run took 1:36 to complete but failed at the end with a TypeError, so I'll need to poke at it and rerun).

I'll place results in my spreadsheet (which also has my text-davinci-003 results): https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYp...
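
(For anyone looking to replicate, the run is roughly the following through the harness's Python entry point. The exact arguments are my assumptions and the API differs a bit between harness versions, so treat this as a sketch.)

  # Rough sketch of the eval described above, via the harness's Python API,
  # on a HEAD checkout of EleutherAI/lm-evaluation-harness (hf-causal adapter).
  from lm_eval import evaluator

  results = evaluator.simple_evaluate(
      model="hf-causal",
      model_args="pretrained=stabilityai/stablelm-base-alpha-7b",
      tasks=["lambada_standard", "hellaswag", "winogrande", "piqa", "coqa"],
      device="cuda:0",
  )
  print(results["results"])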


Looks like my edit window closed, but my results ended up being very low so there must be something wrong (I've reached out to StabilityAI just in case). It does however seem to roughly match another user's 3B testing: https://twitter.com/abacaj/status/1648881680835387392

The current scores I have place it between gpt2_774M_q8 and pythia_deduped_410M (yikes!). Based on training and specs you'd expect it to outperform Pythia 6.9B at least... this is running on a HEAD checkout of https://github.com/EleutherAI/lm-evaluation-harness (releases don't support hf-causal) for those looking to replicate/debug.

Note, another LLM currently being trained, GeoV 9B, already far outperforms this model at just 80B tokens trained: https://github.com/geov-ai/geov/blob/master/results.080B.md


Note that this is StableLM ALPHA (only 0.52 epochs into training).

The fully trained version will surely be much better.

Also, you should benchmark GPT-3 Babbage for a fair comparison since that is the same size as 7B.


How many epochs will they run?



Yeah, although it looks like it currently has some issues with coqa: https://github.com/EleutherAI/lm-evaluation-harness/issues/2...

There's also the bigscience fork, but I ran into even more problems (although I didn't try too hard) https://github.com/bigscience-workshop/lm-evaluation-harness

And there's https://github.com/EleutherAI/lm-eval2/ (not sure if it's just starting over w/ a new repo or what?) but it has limited tests available


How possible is it that every other model suffers from dataset contamination and this model is being unfairly penalized for having properly sanitized training data?


Do you also have results for GPT-4 somewhere? Or text-davinci-003-turbo?


I'm still on the waitlist for GPT-4 API access. Note that text-davinci-003 cost about $90 to benchmark at $0.02/1K tokens, so if you're able to use a GPT-4 model (for completion and not just instruction) that'll probably be $270-$540 in credits to benchmark...


I have GPT-4 8k access and am willing to run the evals if someone wants to pay. Email in my acc info (the character is h)

Just a note, I get errors semi-frequently when running queries against GPT-4 (timeouts mostly…), so any code would need to handle that well.


You should benchmark GPT-3 Curie (7B) for comparison since it is the same size as llama-7B and StableLM-7B.

That will give us some indication of how much better these models are than GPT-3 at the same size.


Just think about benchmarking 32K GPT4 haha


>- No [...] details about the model

You can see the model architecture here

https://github.com/Stability-AI/StableLM/blob/main/configs/s...


>Small models only trained on 800B tokens

"These models will be trained on up to 1.5 trillion tokens." on the Github repo.

https://github.com/stability-AI/stableLM/#stablelm-alpha


That's great news, but one would think that since they're behind Stable Diffusion, they'd apply the same insights and scale the data even further than that, to get better quality out of a smaller model that can run on most people's machines.

Like... try 10 trillion or 100 trillion tokens (although that may be absurd, I never did the calculation) and a long context on a 7B parameter model, then see if that gets you better results than a 30B or 65B parameter model on 1.5 trillion tokens.

A lot of these open source projects just seem to be trying to follow and (poorly) reproduce OpenAI's breakthroughs instead of trying to surpass them.


>try 10 trillion or 100 trillion tokens

Computation is not free and data is not infinite.


You could've said the same to OpenAI when they were scaling GPT from 1 billion to 175 billion parameters. We're all grateful they didn't follow that line of thought.

But Stability does have access to a pretty big cluster, so it's not paying for cloud compute (I assume) and the cost will be lower. And data of course is not infinite... I never stated that it was.

But considering 3.7 million videos are uploaded to YouTube every day, 2 million scientific articles are published every year, yada yada... that argument falls apart.

At the very least implement spiral development... 1 trillion... 3 trillion... (oh it seems to be getting WAY better! There seems to be a STEP CHANGE!)... 5 trillion... (holy shit this really works, lets keep going)


The training corpus is the problem. An extra trillion tokens is (ballpark) an extra million KJV Bibles' worth of text formatted for ingestion. And your first trillion tokens of training data probably already picked all of the low-hanging fruit: text with prior vetting for quality that's already in a standard format for ingestion.
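
(Rough arithmetic behind that ballpark; the word count and tokens-per-word ratio are assumptions:)

  # Back-of-the-envelope: the KJV is ~780K words; at ~1.3 tokens per word
  # (rough GPT-style tokenizer ratio, an assumption) that's ~1M tokens.
  kjv_tokens = 780_000 * 1.3
  extra_tokens = 1e12
  print(extra_tokens / kjv_tokens)   # ~1 million Bibles' worth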


There’s a difference between telling someone they’re wasting their time with their current project, and asking them why they didn’t spend 6x - 60x as much budget on an already expensive project.


They're loaded, and we know scaling works, they'd massively benefit... both in marketing and profit.

Although it is open source to be fair.


> Like... try 10 trillion or 100 trillion tokens (although that may be absurd, I never did the calculation)

But where’s the corpus supposed to come from?


Nobody knows where to find 10 trillion tokens of good data. Publicly available / data without a license seems to cap at around 1.5 trillion tokens total. The internet isn't as big as you thought! (Or at least, all the good stuff is behind a walled garden, which I think we did know)


Devs confirmed that the small ones use 800B; 1.5T is for the large ones.


@thunderbird120 asked a Stability employee and says that the plan is to keep training the models up to 1.5T. So I don't know where you read this.


That may be, but the weights you can download today were trained on 800B


I think they are “checkpoint” models in this case.

Will be fun to compare when completed!


Are not all models checkpoints? I think you may be interpreting it too colloquially.


Yes, of course - that's why they use "will be trained" on the GH repo.


https://github.com/Stability-AI/StableLM#stablelm-alpha shows that the 3B and 7B had 800B training tokens.


I'm wondering what the sweet spot for parameters will be. Right now it feels like the MHz race we had back in the CPU days, but 20 years later I am still using a 2-3 GHz CPU.


I think "sweet spot" is going to depend on your task, but here's a good recent paper that may give you some more context on thinking about training and model sizes: https://www.harmdevries.com/post/model-size-vs-compute-overh...

There have also been quite a few developments on sparsity lately. Here's a technique SparseGPT which suggests that you can prune 50% of parameters with almost no loss in performance for example: https://arxiv.org/abs/2301.00774
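
(To be clear, SparseGPT itself does one-shot pruning with a clever per-layer weight update; the snippet below is not SparseGPT, just plain 50% magnitude pruning with PyTorch's built-in utilities, to illustrate mechanically what "unstructured sparsity" means.)

  # Toy illustration only (not SparseGPT): zero the smallest 50% of weights.
  import torch
  import torch.nn.utils.prune as prune

  layer = torch.nn.Linear(4096, 4096)
  prune.l1_unstructured(layer, name="weight", amount=0.5)
  print((layer.weight == 0).float().mean())   # ~0.5 of weights are now zero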


I was wondering if the longer training thing was a similar phenomenon to the double-descent we see in other deep learning models. Training for a really long time can improve generalization (as can adding more parameters) - but I don't know enough about LLM architecture to know if that's relevant here. My skim of the blog post led me to think it's proposing a different mechanism (scaling laws).


Well, based on all the data we have available now, it seems like you don't get much benefit yet from going above 200 billion parameters.


> 128 head dim, so can use flash attention (unlike GPT-J)

mind explaining why this is so attractive/what the hurdle is for the laypeople in the audience? (me)


Standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. Also FlashAttention is faster.


So there must be a downside to FlashAttention. What is it?


https://arxiv.org/abs/2205.14135 - Section 5 suggests that the biggest limitation is that custom CUDA kernels need to be coded on a per-GPU architecture basis.


FlashAttention is mathematically identical to standard attention, so in theory there's no downside. In practice, numerical inaccuracies of floating point mean that the results differ slightly. I don't know of any papers going in depth to analyze what impact those variances have in a range of real models, but generally speaking deep models handle slight variances well. I've not noticed any difference in my applications training models. And tons of people use FlashAttention as a drop-in replacement on models trained with standard attention (e.g. using xformers in StableDiffusion).

Also in practice FlashAttention is still relatively new so it isn't well supported in libraries yet. Until PyTorch 2.0 you had to either implement it yourself, or use something like xformers which comes with a bag of caveats. PyTorch 2.0 now has it built-in, and it's easy to use, but the implementation is incomplete so you can't, for example, use it with an attention mask (which is needed in LLMs).

tl;dr: Basically none, but it just isn't well supported yet.
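
For anyone curious what the PyTorch 2.0 built-in looks like, a minimal sketch (shapes chosen to match the 4096-context, 128-head-dim discussion above; forcing the flash kernel just to surface the eligibility constraints):

  # Sketch: PyTorch 2.0's built-in scaled_dot_product_attention dispatching
  # to the FlashAttention kernel (assumes a CUDA GPU and PyTorch >= 2.0).
  import torch
  import torch.nn.functional as F

  B, H, T, D = 1, 32, 4096, 128   # batch, heads, seq length, head dim
  q = torch.randn(B, H, T, D, device="cuda", dtype=torch.float16)
  k, v = torch.randn_like(q), torch.randn_like(q)

  # Disable the fallback kernels so this errors out if flash isn't eligible
  # (fp16/bf16, head_dim <= 128, no explicit attn_mask).
  with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                      enable_math=False,
                                      enable_mem_efficient=False):
      out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
  print(out.shape)   # torch.Size([1, 32, 4096, 128])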


installing it is a nightmare


According to the paper Flash Attention also needs quadratic memory:

Let N be the sequence length, d be the head dimension, and M be the size of SRAM with d <= M <= Nd. Standard attention (Algorithm 0) requires Θ(Nd + N²) HBM accesses, while FlashAttention (Algorithm 1) requires Θ(N²d²M⁻¹) HBM accesses.


https://github.com/HazyResearch/flash-attention#memory

"standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length."

I guess what you've quoted is how many times the layer needs to access HBM, not how the memory usage scales with sequence length.
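
To make the accesses-vs-usage distinction concrete, here are rough numbers for one attention head (N, d, and the SRAM size are assumptions, not measurements):

  # HBM accesses vs. extra memory footprint for one attention head, fp16.
  N, d = 4096, 128                   # sequence length, head dimension
  M = 100 * 1024 // 2                # ~100 KB of SRAM, in fp16 elements

  standard_accesses = N * d + N * N           # Theta(N*d + N^2)
  flash_accesses = (N * N * d * d) // M       # Theta(N^2 * d^2 / M)
  print(standard_accesses / flash_accesses)   # ~3x fewer accesses for flash

  standard_extra_bytes = N * N * 2   # materialized N x N score matrix, ~32 MB
  flash_extra_bytes = N * 2 * 2      # O(N) running softmax stats, ~16 KB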


> Small models only trained on 800B tokens, compared to 1T for llama-7B

LLaMA is trained far beyond Chinchilla optimality, so this is not as surprising to me.


But Chinchilla optimality, while an interesting result, is a strange target for most practical purposes. Training happens once, inference happens many times; stopping training at the point where it becomes cheaper to train a larger model for the same (proxy for) quality discounts the cost of inference to zero.


Yep, but if Stability has the goal of training the best possible model, then that would explain the choices they made.


I mean, 800B tokens on a 3B or 7B model is still way beyond the Chinchilla scale.
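
(For a rough sense of scale, the Chinchilla rule of thumb of ~20 training tokens per parameter gives this back-of-the-envelope, not the paper's exact fit:)

  # Back-of-the-envelope using the ~20 tokens/parameter Chinchilla heuristic.
  for params_b in (3, 7, 65):
      optimal_b = 20 * params_b      # rough "compute-optimal" tokens, in billions
      print(f"{params_b}B params: ~{optimal_b}B tokens "
            f"(800B is {800 / optimal_b:.1f}x that)")
  # 3B: ~60B (13.3x), 7B: ~140B (5.7x), 65B: ~1300B (0.6x)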


They're going to 1.5T and possibly 3T. The 800B is just for the "Alpha" checkpoints released today. New checkpoints will be released later.


According to this LLaMA still didn't go far enough: https://www.harmdevries.com/post/model-size-vs-compute-overh...


Yep, it depends on what your goal is.


This doesn't say that LLaMA didn't go far enough.


Not exactly, but it did say they could have gone further than they did without wasting time and energy on infinitesimally small gains.


> - 3B to 65B released or in progress

Seems they want to do 3B to 175B, although 175B is not in progress yet.


It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.


If you want it to just regurgitate training data, sure. But more parameters will always be better for more complex tasks.


> But more parameters will always be better for more complex tasks.

I think you should check out this paper, which discusses the relationship between performance and the ratio of training tokens to parameter count.

https://arxiv.org/abs/2203.15556


StableLM already has an optimal parameter-to-token ratio, so what's your point? They should train the 65B model on even more tokens?

> StableLM is trained on a new experimental dataset built on The Pile, but three times larger with 1.5 trillion tokens of content


If I understand correctly, based on their prediction in Table 3 on page 8, they do have enough tokens, but they also need over an order of magnitude more compute time.

> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.

This is the OP comment you replied to - so I was responding under the OP's context that the amount of compute would be the same, which I apologize for not making clear; my response was very poorly worded.

My intent was to link the paper because I think it supports OP's statement that, for the same amount of compute and an appropriate token ratio, the performance of a smaller model will be better than that of a larger one (assuming they haven't converged yet, which they haven't at this size).

> If you want it to just regurgitate training data, sure.

This paper was about showing Chinchilla performing on par with models many times larger than itself, showing you don't need a 175B-size model for more performance than "regurgitating training data".


> you don't need to have a 175B size model…

Sure, that’s true.

…but, a fully trained larger model is going to be better.

The only reasonable reason to prefer a smaller model is that it’s cheaper and less intensive to train.

It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.

For a given compute budget an under-trained large model and a well-trained small model may be comparable, right?

…but surely, the laws of diminishing returns applies here?

There’s an upper bound to how good your smaller model can ever be, right?

Over time, someone can take a larger model which is under-trained and refine that model, right?

The “small model is just as good” narrative only holds up for a fixed, once-only training of a model with a fixed compute budget at the moment of release.

Over all of time that compute budget is not fixed.


> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.

You're absolutely right, a fully trained larger model _will_ be better. This is meant to be under the OP's context of "limited compute"; the statement I'm trying to make is “fully trained small models are just as good as an under-trained large model”.

> …but surely, the laws of diminishing returns applies here?

They do, but it's diminishing in that the performance gains of larger models become less and less, while the training time required changes a lot. If I'm reading the first chart of figure 2, page 5 correctly, comparing a 5B vs a 10B model, the 10B needs almost 10x the training time for a ~10% loss gain, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and that the loss gain from each 10x becomes gradually lower and lower.

> Over all of time that compute budget is not fixed.

Realistically there is an upper bound to your compute budget. If you needed 1,000 GPUs for 30 days for a small model, you'd need 1,000 GPUs for 300 days for that ~10% at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale it beyond what I think is a ~1-2T parameter model.


The optimal number of training tokens for 65B parameters is something like 80T.

Emad tweeted "Goin to train a 3B model on 3T tokens" last month. These 800B checkpoints are just early alpha training checkpoints.

The full training set is 1.5T currently and will likely grow.


Depends on your compute budget.


and also easy to deploy


Were you able to figure out if the RL models are going to be jailed? A 65B parameter model could be a bit frightening. That's about a third the size of GPT-3.


I'm sure there will be a bunch of different RL tuned versions of them, RLHF isn't that expensive. IIRC Microsoft has software that will do it for a few thousand dollars for a model that size. I'm sure someone will release a non-lobotomized version, maybe OpenAssistant.


It's not always about the size, but yeah, it's really good!


They mention 1.5T training tokens, perhaps for the largest model only?


It's unclear which models will be trained to 1.5T tokens. The details of how many tokens each model saw in training are on Github - https://github.com/stability-AI/stableLM/ . But only for the ones that have been released.


I just asked a Stability employee and they said that the current models ran into an overfitting issue, probably due to some duplicated data somewhere in their dataset, which consists of 1.5T tokens. The 800B is the number of tokens they've been trained on so far. The plan is to keep going and train on the rest of the data once the issue is resolved.


I've asked this question in a few places, and never been able to get an answer, maybe you know...

Q: Why are these LLMs trained for only a single epoch, and why do they perform worse if the dataset is repeated?

This seems maybe related to suspecting data duplication as a cause of overfitting.

Why don't LLMs need multi-epoch training at a low learning rate to generalize? If they are managing to learn from a single epoch, that sounds more like they may be memorizing!


Never repeating your training data is what you'd ideally like to do for training basically any ML model. If you do that you don't really need to worry about overfitting since the model is constantly trying to fit a stream of new data. To reduce its training error it actually has to model the structure of the data rather than just memorizing it since each training step will involve data it has never seen before. Larger models are more prone to overfitting but also learn several orders of magnitude faster. If you can use larger models without being concerned about overfitting it's generally desirable to do so. It's just that most tasks don't actually have enough data to support doing that. Thankfully, text modeling does have enough data.


Thanks.

So when, for example, we train an ImageNet model over multiple epochs using rotation/scaling/etc. augmentation, it's really better to think of this as one epoch over a unique set of images rather than multi-epoch per se? I was really thinking of augmentation as a way to get coverage over the input space rather than a way to ensure the training data doesn't repeat, but I guess it serves both purposes.

It does still seem that many LLMs are overfitting / memorizing to a fair degree though - maybe just because they are still too big for the amount of data they are trained on? It seems like a bit of a balancing act - wanting an LLM to generalize, yet also to serve as somewhat of a knowledge store for rare data it has only seen once.



