In terms of speed, we're talking about 140t/s for 7B models, and 40t/s for 33B models on a 3090/4090 now.[1] (1 token ~= 0.75 word) It's quite zippy. llama.cpp performs close on Nvidia GPUs now (but they don't have a handy chart) and you can get decent performance on 13B models on M1/M2 Macs.
You can take a look at a list of evals here: https://llm-tracker.info/books/evals/page/list-of-evals - for general usage, I think home-rolled evals like llm-jeopardy [2] and local-llm-comparison [3] by hobbyists are more useful than most of the benchmark rankings.
That being said, personally I mostly use GPT-4 for code assistance, so that's what I'm most interested in, and the latest code assistants are scoring quite well: https://github.com/abacaj/code-eval - a recent replit-3b fine-tune leads the human-eval results for open models (as a point of reference, GPT-3.5 gets 60.4 on pass@1 and 68.9 on pass@10 [4]) - I've only just started playing around with it since the replit model tooling is not as good as the llamas' (doc here: https://llm-tracker.info/books/howto-guides/page/replit-mode...).
I'm interested in potentially applying reflexion or some of the other techniques that have been tried to even further increase coding abilities. (InterCode in particular has caught my eye https://intercode-benchmark.github.io/)
As far as I can see, llama.cpp with CUDA is still a bit slower than ExLlama, but I never had the chance to do the comparison myself, and maybe it will change soon as these projects are evolving very quickly.
Also, I am not exactly sure whether the quality of the output is the same with these two implementations.
Until recently, exllama was significantly faster, but they're about on par now (with llama.cpp even pulling ahead on certain hardware or with certain compile-time optimizations).
There are a couple of big differences as I see it. llama.cpp uses the `ggml` format for its models. There were a few weeks where they kept making breaking revisions, which was annoying, but it seems to have stabilized and now also supports more flexible quantization w/ k-quants. exllama was built exclusively for 4-bit GPTQ quants (compatible w/ GPTQ-for-LLaMA, AutoGPTQ). exllama still has an advantage w/ the best multi-GPU scaling out there, but as you say, the projects are evolving quickly, so it's hard to say. It has a smaller focus/community than llama.cpp, which also has its pros and cons.
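To give a feel for what these 4-bit formats are doing under the hood, here's a toy group-wise quantizer in NumPy. This is purely illustrative: real formats like GPTQ and llama.cpp's k-quants add error-aware rounding, mixed precisions, and packed storage, none of which is shown here, and the group size of 128 is just a common choice, not a spec.

```python
import numpy as np

def quantize_4bit(weights, group_size=128):
    """Toy group-wise 4-bit quantization: one scale and min (zero-point)
    per group of `group_size` values. Illustrative only."""
    w = weights.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0  # 4 bits -> 16 levels (0..15)
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize_4bit(q, scale, w_min):
    """Reconstruct approximate float weights from the 4-bit codes."""
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scale, w_min = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, w_min).reshape(-1)
print(np.abs(w - w_hat).max())  # small per-group rounding error
```

The per-group scale/min is why quantized file sizes come out slightly above 4 bits per weight, and why smaller group sizes trade size for accuracy.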
It's good to have multiple viable options though, especially if you're trying to find something that works best w/ your environment/hardware, and I'd recommend giving HEAD checkouts of both a try to see which one works best for you.
Thank you for the update!
Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama?
Also, in terms of VRAM consumption, are they equivalent?
ExLlama still uses a bit less VRAM than anything else out there: https://github.com/turboderp/exllama#new-implementation - this is sometimes significant since from my personal experience it can support full context on a quantized llama-33b model on a 24GB GPU that can OOM w/ other inference engines.
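A back-of-envelope sketch of why full context on a 24GB card is tight: the quantized weights plus the per-token KV cache have to fit together. The function below is a rough estimate with hypothetical layer counts and overhead, not any engine's actual memory accounting.

```python
def estimate_vram_gb(n_params_b, bits_per_weight, n_layers, d_model,
                     context_len, kv_bytes=2, overhead_gb=1.0):
    """Rough VRAM estimate for quantized LLM inference.
    Back-of-envelope only: ignores activations, fragmentation,
    and engine-specific buffers (lumped into overhead_gb)."""
    weights_gb = n_params_b * 1e9 * bits_per_weight / 8 / 1e9
    # KV cache: K and V tensors per layer, d_model values per token
    kv_gb = 2 * n_layers * d_model * context_len * kv_bytes / 1e9
    return weights_gb + kv_gb + overhead_gb

# llama-33b-ish shape: ~32.5B params, 60 layers, d_model 6656,
# 2048 context, 4-bit weights, fp16 KV cache
print(round(estimate_vram_gb(32.5, 4, 60, 6656, 2048), 1))  # ~20.5
```

At ~20.5GB estimated, even a GB or two of extra overhead in one engine versus another is the difference between fitting on a 24GB GPU and OOMing.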
On wikitext, for llama-13b, the perplexity of a q4_K_M GGML on llama.cpp was within 0.3% of the perplexity of a 4-bit 128g desc_act GPTQ on ExLlama, so basically interchangeable.
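For anyone unfamiliar with how that comparison works: perplexity is just the exponential of the mean negative log-likelihood per token, so a 0.3% gap is tiny. A minimal sketch with made-up per-token logprobs (real eval harnesses differ in windowing and tokenization details):

```python
import math

def perplexity(token_logprobs):
    """Corpus perplexity: exp of the mean negative log-likelihood
    per token (natural log). Illustrative only."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Two hypothetical quantization runs with slightly different logprobs:
run_a = [-1.65, -1.70, -1.60, -1.64]
run_b = [-1.66, -1.70, -1.61, -1.64]
ppl_a, ppl_b = perplexity(run_a), perplexity(run_b)
print(abs(ppl_a - ppl_b) / ppl_a)  # relative difference, under 1%
```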
There are some new quantization formats being proposed like AWQ, SpQR, SqueezeLLM that perform slightly better, but none have been implemented in any real systems yet (the paper for SqueezeLLM is the latest, and has comparison vs AWQ and SpQR if you want to read about it: https://arxiv.org/pdf/2306.07629.pdf)
Those GPUs are $1,200 and up. That's equivalent to 20,000,000 tokens on GPT-4. I don't think I will ever use that many tokens for my personal use.
I agree that everyone should do their own cost-benefit analysis, especially if they have to buy additional hardware (used RTX 3090s are ~$700 atm), but one important thing to note for those running the numbers is that all your tokens need to be resubmitted for every query. That means that if you end up using the OpenAI API for long-running tasks like, say, a code assistant or pair programmer, with an avg of 4K tokens of context, you will pay $0.18/query, or hit $1200 at about 7000 queries. [1] At 100 queries a day, you'll hit that in just over 2 months. (Note, that is 28M tokens. In general, tokens go much faster than you think. Even running a tiny subset of lm-eval against a model will use about 5M tokens.)
If people are mostly using their LLMs for specific tasks, then using cloud providers (Vast.ai and Runpod were the cheapest last time I checked) can be cheaper than dedicated hardware, especially if your power costs are high. If your needs are minimal, Google Colab offers a free tier with a GPU w/ 11GB of VRAM, so you can run 3B/7B quantized models easily.
There are of course reasons irrespective of cost to run your own model (offline access, fine-tuning/running task-specific models, large context/other capabilities OpenAI doesn't provide (eg, you can run multi-modal open models now), privacy/PII, BCP/not being dependent on a single vendor, some commercial or other non-ToS-allowed tasks, etc).
[1] https://github.com/turboderp/exllama#results-so-far
[2] https://github.com/aigoopy/llm-jeopardy
[3] https://github.com/Troyanovsky/Local-LLM-comparison/tree/mai...
[4] https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder