In terms of speed, we're talking about 140t/s for 7B models, and 40t/s for 33B models on a 3090/4090 now.[1] (1 token ~= 0.75 word) It's quite zippy. llama.cpp performs close on Nvidia GPUs now (but they don't have a handy chart) and you can get decent performance on 13B models on M1/M2 Macs.
You can take a look at a list of evals here: https://llm-tracker.info/books/evals/page/list-of-evals - for general usage, I think home-rolled evals like llm-jeopardy [2] and local-llm-comparison [3] by hobbyists are more useful than most of the benchmark rankings.
That being said, personally I mostly use GPT-4 for code assistance, so that's what I'm most interested in, and the latest code assistants are scoring quite well: https://github.com/abacaj/code-eval - a recent replit-3b fine-tune leads the human-eval results for open models (as a point of reference, GPT-3.5 gets 60.4 on pass@1 and 68.9 on pass@10 [4]) - I've only just started playing around with it since the replit model tooling is not as good as the llamas' (doc here: https://llm-tracker.info/books/howto-guides/page/replit-mode...).
I'm interested in potentially applying reflexion or some of the other techniques that have been tried to even further increase coding abilities. (InterCode in particular has caught my eye https://intercode-benchmark.github.io/)
As far as I can see, llama.cpp with CUDA is still a bit slower than ExLlama, but I never had the chance to do the comparison myself, and maybe it will change soon as these projects are evolving very quickly.
Also, I am not exactly sure whether the quality of the output is the same with these two implementations.
Until recently, exllama was significantly faster, but they're about on par now (with llama.cpp even pulling ahead on certain hardware or with certain compile-time optimizations).
There are a couple of big differences as I see it. llama.cpp uses the `ggml` format for its models. There were a few weeks where they kept making breaking revisions, which was annoying, but it seems to have stabilized and now also supports more flexible quantization w/ k-quants. exllama was built exclusively for 4-bit GPTQ quants (compatible w/ GPTQ-for-LLaMA, AutoGPTQ). exllama still has an advantage w/ the best multi-GPU scaling out there, but as you say, the projects are evolving quickly, so it's hard to say. It has a smaller focus/community than llama.cpp, which also has its pros and cons.
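To give a feel for what these 4-bit formats are doing under the hood, here's a toy group-wise quantizer in NumPy. This is purely illustrative: real formats like GPTQ and llama.cpp's k-quants add error-aware rounding, mixed precisions, and packed storage, none of which is shown here, and the group size of 128 is just a common choice, not a spec.

```python
import numpy as np

def quantize_4bit(weights, group_size=128):
    """Toy group-wise 4-bit quantization: one scale and min (zero-point)
    per group of `group_size` values. Illustrative only."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0  # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min).reshape(-1)
print(np.abs(w - w_hat).max())  # small per-group rounding error
```

The per-group scale/min is why quantized file sizes come out slightly above 4 bits per weight, and why smaller group sizes trade size for accuracy.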
It's good to have multiple viable options though, especially if you're trying to find something that works best w/ your environment/hardware, and I'd recommend giving HEAD checkouts of both a try to see which one works best for you.
Thank you for the update!
Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama?
Also, in terms of VRAM consumption, are they equivalent?
ExLlama still uses a bit less VRAM than anything else out there: https://github.com/turboderp/exllama#new-implementation - this is sometimes significant since from my personal experience it can support full context on a quantized llama-33b model on a 24GB GPU that can OOM w/ other inference engines.
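A back-of-envelope sketch of why full context on a 24GB card is tight: the quantized weights plus the per-token KV cache have to fit together. The function below is a rough estimate with hypothetical layer counts and overhead, not any engine's actual memory accounting.

```python
def estimate_vram_gb(n_params_b, bits_per_weight, n_layers, d_model,
                     context_len, kv_bytes=2, overhead_gb=1.0):
    """Rough VRAM estimate for quantized LLM inference.
    Back-of-envelope only: ignores activations, fragmentation,
    and engine-specific buffers (lumped into overhead_gb)."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: K and V tensors per layer, d_model values per token
    kv_gb = 2 * n_layers * d_model * context_len * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

# llama-33b-ish shape: ~32.5B params, 60 layers, d_model 6656,
# 2048 context, 4-bit weights, fp16 KV cache
print(round(estimate_vram_gb(32.5, 4, 60, 6656, 2048), 1))  # ~20.5
```

At ~20.5GB estimated, even a GB or two of extra overhead in one engine versus another is the difference between fitting on a 24GB GPU and OOMing.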
On wikitext, for llama-13b, the perplexity of a q4_K_M GGML on llama.cpp was within 0.3% of the perplexity of a 4-bit 128g desc_act GPTQ on ExLlama, so basically interchangeable.
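For anyone unfamiliar with how that comparison works: perplexity is just the exponential of the mean negative log-likelihood per token, so a 0.3% gap is tiny. A minimal sketch with made-up per-token logprobs (real eval harnesses differ in windowing and tokenization details):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity: exp of the mean negative log-likelihood
    per token (natural log). Illustrative only."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Two hypothetical quantization runs with slightly different logprobs:
run_a = [-1.65, -1.70, -1.60, -1.64]
run_b = [-1.66, -1.70, -1.61, -1.64]
ppl_a, ppl_b = perplexity(run_a), perplexity(run_b)
print(abs(ppl_a - ppl_b) / ppl_a)  # relative difference, under 1%
```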
There are some new quantization formats being proposed like AWQ, SpQR, SqueezeLLM that perform slightly better, but none have been implemented in any real systems yet (the paper for SqueezeLLM is the latest, and has comparison vs AWQ and SpQR if you want to read about it: https://arxiv.org/pdf/2306.07629.pdf)
Those GPUs are $1,200 and up. That's equivalent to 20,000,000 tokens on GPT-4. I don't think I will ever use that many tokens for my personal use.
I agree that everyone should do their own cost-benefit analysis, especially if they have to buy additional hardware (used RTX 3090s are ~$700 atm), but one important thing to note for those running the numbers is that all your tokens need to be resubmitted for every query. That means that if you end up using the OpenAI API for long-running tasks like, say, a code assistant or pair programmer, with an avg of 4K tokens of context, you will pay $0.18/query, or hit $1200 at about 7000 queries. [1] At 100 queries a day, you'll hit that in just over 2 months. (Note, that is 28M tokens. In general, tokens go much faster than you think. Even running a tiny subset of lm-eval against a model will use about 5M tokens.)
If people are mostly using their LLMs for specific tasks, then using cloud providers (Vast.ai and Runpod were the cheapest last time I checked) can be cheaper than dedicated hardware, especially if your power costs are high. If your needs are minimal, Google Colab offers a free tier with a GPU w/ 11GB of VRAM, so you can run 3B/7B quantized models easily.
There are of course reasons irrespective of cost to run your own model (offline access, fine-tuning/running task-specific models, large context/other capabilities OpenAI doesn't provide (eg, you can run multi-modal open models now), privacy/PII, BCP/not being dependent on a single vendor, some commercial or other non-ToS-allowed tasks, etc).
[1] https://github.com/turboderp/exllama#results-so-far
[2] https://github.com/aigoopy/llm-jeopardy
[3] https://github.com/Troyanovsky/Local-LLM-comparison/tree/mai...
[4] https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder