If I understand correctly, based on their prediction in Table 3 on page 8, they do have enough tokens, but they also need over an order of magnitude more compute time.
> It's not efficient to do 175B. Training a smaller model (65B) on more data gives better performance for the same compute.
That's the OP comment you replied to, so I was responding under OP's premise that the amount of compute time would be the same. I apologize for not making that clear; my response was very poorly worded.
My intent was to link the paper because I think it supports OP's statement: for the same amount of compute, a smaller model trained on proportionally more tokens will perform better than a larger one (assuming the smaller model hasn't converged yet, which it hasn't at this size).
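To put rough numbers on that (this is just my own back-of-the-envelope using the common C ≈ 6·N·D FLOPs approximation; the 65B / 1.4T token figures are illustrative, not pulled from the paper's tables):

```python
# Back-of-the-envelope: at a fixed compute budget, a smaller model can be
# trained on far more tokens than a larger one.
# Uses the common approximation C ~= 6 * N * D (FLOPs ~= 6 * params * tokens).

def tokens_for_budget(compute_flops, n_params):
    """How many tokens a model of n_params can see within a FLOP budget."""
    return compute_flops / (6 * n_params)

# Budget sized so a 65B model gets ~1.4T tokens, i.e. roughly the
# ~20 tokens/param "compute-optimal" ballpark (illustrative numbers).
budget = 6 * 65e9 * 1.4e12

for n_params in (65e9, 175e9):
    d = tokens_for_budget(budget, n_params)
    print(f"{n_params / 1e9:.0f}B params -> {d / 1e12:.2f}T tokens "
          f"({d / n_params:.0f} tokens/param)")

# 65B  -> 1.40T tokens (~22 tokens/param, near compute-optimal)
# 175B -> 0.52T tokens (~3 tokens/param, far short of compute-optimal)
```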
> If you want it to just regurgitate training data, sure.
This paper is about Chinchilla matching the performance of models many times larger than itself, showing you don't need a 175B model to get more than "regurgitating training data".
…but a fully trained larger model is going to be better.
The only real reason to prefer a smaller model is that it’s cheaper and less intensive to train.
It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that.
For a given compute budget an undertrained large model and a well trained small model may be comparable, right?
…but surely, the law of diminishing returns applies here?
There’s an upper bound to how good your smaller model can ever be, right?
Over time, someone can take a larger model which is undertrained and refine it, right?
The “small model is just as good” narrative only holds up for a one-off training of a model at a fixed compute budget at the moment of release.
Over all of time that compute budget is not fixed.
> It sounds a lot like you’re saying “small models are just as good” … which is false. No one believes that. … a fully trained larger model is going to be better.
You're absolutely right, a fully trained larger model _will_ be better. This is all under OP's context of limited compute; the statement I'm trying to make is that “fully trained small models are just as good as an undertrained large model”.
> …but surely, the law of diminishing returns applies here?
It does, but the diminishing returns are that the performance gains of larger models become smaller and smaller while the training time required grows dramatically. If I'm reading the first chart of Figure 2 on page 5 correctly, comparing a 5B model vs a 10B model, the 10B needs almost 10x the training time for roughly a 10% gain in loss, and it's a similar jump from 1B to 5B. My understanding is that this also starts flattening out, and the loss gain from each 10x becomes gradually lower and lower.
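To illustrate the flattening, here's a quick sketch with the parametric loss fit from the paper, L(N, D) = E + A/N^α + B/D^β. I'm going from memory on the fitted constants (roughly E=1.69, A=406.4, B=410.7, α=0.34, β=0.28), so treat the exact values as approximate; the point is that each 10x in tokens buys less, and a fixed model size has a hard floor of E + A/N^α:

```python
# Diminishing returns, sketched with the Chinchilla parametric loss fit:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are roughly the paper's fitted values; treat them as approximate.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**alpha + B / n_tokens**beta

# Fix the model size and keep 10x-ing the tokens: each 10x buys less, and the
# loss never drops below the E + A / N**alpha floor for that model size.
n = 10e9  # a 10B-parameter model
floor = E + A / n**alpha
for d in (100e9, 1e12, 10e12, 100e12):
    print(f"{d / 1e12:>6.1f}T tokens -> loss {loss(n, d):.3f} (floor {floor:.3f})")
```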
> Over all of time that compute budget is not fixed.
Realistically there is an upper bound to your compute budget. If you needed 1,000 GPUs for 30 days for a small model, you need 1,000 GPUs for 300 days (or 10,000 GPUs for 30 days) for that ~10% gain at these smaller sizes... You're going to become limited very quickly by time and/or money. There's a reason OpenAI said they aren't training a model larger than GPT-4 at the moment - I don't think they can scale beyond what I think is a ~1-2T parameter model.
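Just to spell out the GPU-days arithmetic (purely illustrative numbers, not anyone's actual cluster):

```python
# GPU-days arithmetic: 10x the compute means 10x the GPUs or 10x the wall-clock time.
small_run_gpu_days = 1_000 * 30              # 1,000 GPUs x 30 days = 30,000 GPU-days
big_run_gpu_days = 10 * small_run_gpu_days   # 10x the compute = 300,000 GPU-days
print(big_run_gpu_days / 1_000, "days on 1,000 GPUs")  # -> 300.0
print(big_run_gpu_days / 30, "GPUs for 30 days")       # -> 10000.0
```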