
It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.


If you want it to just regurgitate training data, sure. But more parameters will always be better for more complex tasks.


> But more parameters will always be better for more complex tasks.

I think you should check out this paper, which discusses the relationship between performance and the ratio of training tokens to parameter count.

https://arxiv.org/abs/2203.15556
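For reference, the paper's finding is often summarized as roughly 20 training tokens per parameter being compute-optimal, with training compute approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens. A minimal sketch of that heuristic (the function names are mine and the constants are the commonly cited approximations, not exact values from the paper):

    # Rough Chinchilla-style sizing heuristic (approximation, not the paper's exact fit):
    #   training compute C ~= 6 * N * D FLOPs for N parameters and D tokens,
    #   and compute-optimal training uses roughly D ~= 20 * N tokens.

    def optimal_tokens(params: float) -> float:
        """Approximate compute-optimal token count for a given parameter count."""
        return 20.0 * params

    def training_flops(params: float, tokens: float) -> float:
        """Approximate training compute in FLOPs."""
        return 6.0 * params * tokens

    for n in (3e9, 65e9, 175e9):
        d = optimal_tokens(n)
        print(f"{n / 1e9:.0f}B params -> ~{d / 1e12:.1f}T tokens, ~{training_flops(n, d):.2e} FLOPs")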


StableLM already has an optimal parameter-to-token ratio, so what's your point? That they should train the 65B model on even more tokens?

> StableLM is trained on a new experimental dataset built on The Pile, but three times larger with 1.5 trillion tokens of content


If I understand correctly, based on the predictions in Table 3 on page 8, they do have enough tokens, but they would also need over an order of magnitude more compute.

> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.

This is the OP comment you replied to, so I was responding under OP's premise that the amount of compute would be the same. I apologize for not making that clear; my response was poorly worded.

My intent in linking the paper was to support OP's statement that, for the same amount of compute and the appropriate token ratio, a smaller model will outperform a larger one (assuming neither has converged yet, which they haven't at this size).
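To make that concrete: with the C ≈ 6·N·D approximation, holding compute fixed means a 65B model can be fed about 175/65 ≈ 2.7x as many tokens as a 175B one. A rough sketch (the 175B-on-300B-tokens baseline is just GPT-3's commonly reported scale, used here for illustration):

    # For a fixed compute budget, a smaller model sees proportionally more tokens
    # (using the rough C ~= 6 * N * D approximation; illustrative numbers only).

    def tokens_for_budget(compute_flops: float, params: float) -> float:
        return compute_flops / (6.0 * params)

    budget = 6.0 * 175e9 * 300e9  # roughly a GPT-3-scale run: 175B params on 300B tokens

    for n in (175e9, 65e9):
        print(f"{n / 1e9:.0f}B params -> {tokens_for_budget(budget, n) / 1e9:.0f}B tokens")
    # 175B -> 300B tokens, 65B -> ~808B tokens: same compute, ~2.7x the data.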

> If you want it to just regurgitate training data, sure.

The paper shows Chinchilla performing on par with models many times larger than itself, which demonstrates that you don't need a 175B model to get more performance than "regurgitating training data".


> you don't need to have a 175B size model…

Sure, that’s true.

…but, a fully trained larger model is going to be better.

The only real reason to prefer a smaller model is that it’s cheaper and less intensive to train.

It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.

For a given compute budget, an undertrained large model and a well-trained small model may be comparable, right?

…but surely the law of diminishing returns applies here?

There’s an upper bound to how good your smaller model can ever be, right?

Over time, someone can take a larger, undertrained model and refine it, right?

The “small model is just as good” narrative only holds up for a single, one-off training run with a fixed compute budget at the moment of release.

Over time, that compute budget is not fixed.


> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.

You're absolutely right, a fully trained larger model _will_ be better. This is meant to be under OP's context of "limited compute"; the statement I'm trying to make is “fully trained small models are just as good as an undertrained large model”.

> …but surely the law of diminishing returns applies here?

They do, but it's diminishing in the sense that the performance gains from larger models become smaller and smaller, while the training compute required grows enormously. If I'm reading the first chart of Figure 2 (page 5) correctly, comparing 5B vs 10B, the 10B model needs almost 10x the training compute for roughly a 10% improvement in loss, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, so the loss gain from each 10x becomes gradually smaller.
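One way to see this is through the parametric loss fit from the paper, L(N, D) = E + A/N^α + B/D^β, with reported constants of approximately E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28. The numbers below are only illustrative; the function name and the specific (N, D) pairs are my own choices:

    # Chinchilla parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta,
    # using the approximate constants reported in the paper (illustration only).
    E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(params: float, tokens: float) -> float:
        return E + A / params**ALPHA + B / tokens**BETA

    # ~10x the compute (5B on 100B tokens vs 10B on 500B tokens) buys a fairly
    # small absolute drop in loss, and for a fixed model size the loss can never
    # fall below the floor E + A / N**ALPHA no matter how much data you add.
    print(f"5B on 100B tokens:  {loss(5e9, 100e9):.3f}")
    print(f"10B on 500B tokens: {loss(10e9, 500e9):.3f}")
    print(f"10B floor (infinite data): {E + A / 10e9**ALPHA:.3f}")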

> Over time, that compute budget is not fixed.

Realistically, there is an upper bound to your compute budget. If you needed 1,000 GPUs for 30 days for a small model, you need 1,000 GPUs for 300 days for that ~10% gain at these smaller sizes, or 10,000 GPUs for 30 days... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment; I don't think they can scale it beyond what I believe is a ~1-2T parameter model.
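The arithmetic behind that (just restating the hypothetical numbers from the comment above, nothing measured):

    # GPU-days scale linearly with compute: 10x the compute means 10x the
    # wall-clock time or 10x the GPUs (hypothetical numbers from the comment).
    base_gpu_days = 1_000 * 30            # small-model run: 30,000 GPU-days
    bigger_gpu_days = 10 * base_gpu_days  # 10x compute for a modest loss gain

    print(f"same GPUs: {bigger_gpu_days / 1_000:.0f} days on 1,000 GPUs")  # 300 days
    print(f"same days: {bigger_gpu_days / 30:,.0f} GPUs for 30 days")      # 10,000 GPUs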


The optimal number of training tokens for 65B parameters is something like 80T.

Emad tweeted "Goin to train a 3B model on 3T tokens" last month. These 800B checkpoints are just early alpha training checkpoints.

The full training set is currently 1.5T tokens and will likely grow.


Depends on your compute budget.


And also easier to deploy.



