
What hardware did you use to train 1.5B? And how do you know 1.5B is more effective than 774M?

I've suspected that 774M is equally effective to 1.5B. OpenAI's own human testing seemed to confirm that. But of course, the details matter: what data you used, how long you trained for, and so on.

We're trying to train 1.5B on poetry, and progress is slow. Did you have access to a massive GPU?



I did have access to a cluster of GPUs through my professor's lab, so compute wasn't as much of an issue. And it may be true that 774M is equally effective; I haven't played around with 774M enough to know. I had a decent amount of data, so I think that helped me get more out of the 1.5B-param model than if my dataset were sparser.

I did notice that having a large batch size and training slowly was important for me to get better results.


Can you share how large your dataset was (how many tokens), what batch size you used, and how many epochs you trained for? By training slowly, do you mean that you used a small learning rate? If so, what was it?

I've been reading up on batch size and people are all over the place: some say smaller is better, some say larger. With GPT-2 people mostly say larger is better, but there must come a point where increasing the batch size is no longer beneficial (or is it just that you use as large a batch as your memory will allow)?
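For what it's worth, the usual way to get a "large batch" without the memory is gradient accumulation. Here's a toy pure-Python sketch (made-up numbers, not anyone's actual training code) showing that accumulating micro-batch gradients reproduces the full-batch gradient exactly when the loss is a mean over examples:

```python
# Toy model: loss per example is 0.5*(w*x - y)^2, so the gradient
# w.r.t. w for example (x, y) is (w*x - y)*x.
def grad(example, w):
    x, y = example
    return (w * x - y) * x

data = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 0.5)]
w = 0.5

# Full-batch gradient: mean over all examples at once.
full = sum(grad(e, w) for e in data) / len(data)

# Same thing via micro-batches of 2: accumulate sums, divide once.
micro = 0.0
for i in range(0, len(data), 2):
    chunk = data[i:i + 2]
    micro += sum(grad(e, w) for e in chunk)
micro /= len(data)

print(abs(full - micro) < 1e-12)  # True: same effective gradient
```

So "batch size 32" can be 4 micro-batches of 8 with an optimizer step every 4th micro-batch; the memory question and the optimization question are somewhat separable.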


In fact, it’s an open question whether larger batch sizes are better. https://twitter.com/jeremyphoward/status/1189643170377658369...

Seconding all of your questions! Details about successful 1.5B training are really hard to come by.

In case it’s helpful, here are some details of how a Chinese 1.5B GPT-2 was trained: https://github.com/imcaspar/gpt2-ml

It looks like they used a batch size of 2 on a TPUv3-256 pod. It took 50 hours and 99,000 training steps, which works out to about 1.1 examples per second.
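A quick sanity check on those throughput numbers, assuming each training step processes exactly one batch:

```python
# Back-of-envelope throughput for the gpt2-ml run linked above.
steps = 99_000
batch_size = 2
hours = 50

examples_per_sec = steps * batch_size / (hours * 3600)
print(round(examples_per_sec, 2))  # 1.1
```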


Agreed there doesn’t seem to be a consensus. Thanks for the links


Had to go check my training file to remember.

Dataset size: around 30 MB, so roughly ~8,000,000 tokens? Can't remember exactly.

Learning rate: 1e-4, so I guess not that slow.

Steps: I trained for around 1,000 steps, but ended up liking the model from step 550, which I think worked out to around 2 full passes through my data.

There probably is a point where increasing the batch size is no longer helpful; my batch size was 32. When it was lower, I had issues with memorization/bias toward whatever parts of the training data the model had most recently seen.
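Taking those numbers at face value (~8M tokens, batch size 32, and assuming GPT-2's 1024-token context with full-length sequences per step), the "about 2 full passes" at step 550 roughly checks out:

```python
tokens_in_dataset = 8_000_000   # ~30 MB of text
batch_size = 32
seq_len = 1024                  # assumed GPT-2 context window
checkpoint_step = 550

tokens_seen = checkpoint_step * batch_size * seq_len
passes = tokens_seen / tokens_in_dataset
print(round(passes, 1))  # ~2.3 passes through the data
```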


Thanks, good to have this data point. I’ve been training a roughly similarly sized dataset for many tens of thousands of steps (but on 355M). Wondering if I need that many steps.


Only 30MB? If it's based on text adventures, can't you get way more data than that?


I scraped a bunch of stories from chooseyourstory.com, but I did curate them to make sure they had the right second-person format. I couldn't really find anywhere else with a format consistent enough to make scraping easy.


I did have access to a cluster of GPUs through my professor's lab so compute wasn't as much of an issue.

Out of curiosity, what was the specific hardware you used? Some V100s, or maybe a DGX cluster?

Also, how many days did it take to get the loss down to acceptable levels? Did you aim for a loss of ~2.5, or less?

For now I'm trying to train it on 100 TPUv2-8s thanks to TFRC. Unfortunately, each TPUv2-8 is roughly 11x slower than a K80 GPU. That means it takes about 10 TPUs working in parallel just to match the throughput of a single GPU. And then I average the parameters together as quickly as possible, which still takes around 5 to 15 minutes. (Training happens in parallel with all of that.)

It sort of seems to work, but it's hard to get the learning rate right. If it's set too high, various TPUs diverge. Too low and the loss stays constant.

But I imagine I'll crack it one of these days...


I used a DGX-1 to train. I trained for somewhere around 12-16 hours, down to a loss of around ~2.2-2.3.



