
Obviously I don't know the story, but yeah... that founder and potential company are in for a rude awakening.

> I don’t really think ollama scales to production workloads anyway

Not even close. At the risk of gatekeeping: in terms of production/commercial serving of LLMs, Ollama (and llama.cpp) are basically toys. They serve a purpose and are fantastic projects for their intended use cases (serving a user or two), but for production workloads they're basically "my first LLM".

If that founder isn't at least aware of vLLM or HF TGI (let alone llama.cpp!!), they'll have a really tough time being even remotely competitive in the space, to the point of "it doesn't work and it's not going to".

Obviously there is much, much more that goes into startup success but this is pretty fundamental.



I pointed them towards vLLM, but it sounded like they were set on Ollama

I’m curious though, why do you think llama.cpp is a toy compared to vLLM?

I understand that vLLM is also a server, but could someone not build a similar high-throughput server on llama.cpp?

I’ve been looking for a way to serve small-scale-but-still-production workloads (using quantized Phi models) on CPU, and llama.cpp seems to be the only player in town.
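
For reference, what I have in mind is llama.cpp's built-in HTTP server, which exposes an OpenAI-compatible API. A minimal sketch of a client call; the server binary name, flags, and GGUF filename in the comments vary by llama.cpp version and are placeholders here:

    # Assumes llama.cpp's server was started with something like
    # (binary name/flags vary by version; the GGUF filename is a placeholder):
    #   llama-server -m phi-3-mini-q4_k_m.gguf --port 8080 --threads 8 -c 4096
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # OpenAI-compatible endpoint
        json={
            "model": "phi-3-mini",  # some llama.cpp versions ignore this field
            "messages": [
                {"role": "user", "content": "Summarize yesterday's report data."}
            ],
            "max_tokens": 256,
            "temperature": 0.2,
        },
        timeout=600,  # CPU inference is slow; allow plenty of time
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])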


> I pointed them towards vLLM, but it sounded like they were set on Ollama

I'm baffled how someone could be so set on Ollama. Being married to a tool is always weird to me, and being set on the (very) wrong tool for the job, even when faced with good advice, is even weirder.

Maybe they'll change their mind the first time a VC, customer, or hire sees Ollama and laughs ;). Kind of kidding but not.

> I’m curious though, why do you think llama.cpp is a toy compared to vLLM?

llama.cpp is downright incredible for supporting things you would never do in a multi-user production environment:

- Support for Nvidia GPUs going back to Maxwell(!)

- CPU inference (waaaay too slow)

- Splitting layers between GPU and CPU (still way too slow)

- Wild quantization methods

- Support for all kinds of random platforms you'd never deploy to in production (Apple Silicon, etc.)

- Much, much more

Whereas the emphasis for vLLM is:

- High scale serving of LLMs in production environments
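
To make the contrast concrete, here's a minimal sketch of vLLM's offline Python API (the model name is just an example, it assumes a CUDA-capable GPU, and exact defaults depend on your vLLM version):

    # Minimal vLLM sketch. The throughput comes from continuous batching and
    # PagedAttention: many requests are packed onto the GPU at once instead of
    # being processed one at a time.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

    # Submit a large batch in one call; vLLM schedules it dynamically.
    prompts = [f"Write a one-line summary of ticket #{i}." for i in range(200)]
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text)

For actual serving you'd typically run vLLM's OpenAI-compatible HTTP server, which sits on top of the same engine.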

llama.cpp does really well when used in Ollama-type use cases: "I want to run this on my MacBook and send a request every once in a while" or "load a huge model across VRAM and RAM on my desktop", with the understanding that being hosted locally is more important than being at least as fast as ChatGPT (which is more or less considered the bare-minimum standard in the industry).

I said "at least isn't aware of vLLM" because you can take it even further than this (like Cloudflare, Amazon, Mistral, Phind, Databricks, etc) and use something like TensorRT-LLM with Triton Inference Server which kicks performance and production suitability up yet another couple of notches.

It's a right tool for the job kind of thing.

At the risk of sounding elitist, I have no idea how a dozen total tokens/s on CPU (or whatever) is going to be acceptable to users.

Especially in the case of the original scenario (an AI startup): if you go into a highly competitive and crowded space with Ollama (CPU or not), you're going to get beaten up by people deploying solutions that are fundamentally, drastically better.

All of this said, I have no idea what you mean by "small-scale-but-still-production" and no idea of your users or use case(s). I suppose there's always a chance llama.cpp on CPU could be fine in some cases. I just can't imagine what they would be, but that could be my own experience and bias talking.


I’m working on an internal tool. Maybe 30-40 “customers” total. I say it’s production because it has to be reliable.

We just don’t want to rent a GPU for this little thing. It draws up reports once a day, so it’s okay if it takes a couple of minutes. It’s work that took a single person maybe 2 hours to do before.

I’ll need to look into Triton; I haven’t heard of that yet!

If you have any resources for running models in production that you’d be willing to share, I’d appreciate them.



