
Obviously I don't know the story, but yeah... that founder and potential company are in for a rude awakening.

> I don’t really think ollama scales to production workloads anyway

Not even close. At the risk of gatekeeping: in terms of production/commercial serving of LLMs, Ollama (and llama.cpp) are basically toys. They serve a purpose and are fantastic projects for their intended use cases (serving a user or two), but for production workloads they're basically "my first LLM".

If that founder isn't at least aware of vLLM or HF TGI (let alone llama.cpp!!), they'll have a really tough time being even remotely competitive in the space, to the point of "it doesn't work and it's not going to".

Obviously there is much, much more that goes into startup success but this is pretty fundamental.



I pointed them towards vLLM, but it sounded like they were set on Ollama

I’m curious though, why do you think llama.cpp is a toy compared to vLLM?

I understand that vLLM is also a server, but could someone not build a similar high-throughput server on llama.cpp?

I’ve been looking for a way to serve small-scale-but-still-production workloads (using quantized Phi models) on CPU, and llama.cpp seems to be the only player in town.
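
For reference, what I have in mind is llama.cpp's built-in HTTP server, which exposes an OpenAI-compatible API. A minimal sketch of a client call; the server binary name, flags, and GGUF filename in the comments vary by llama.cpp version and are placeholders here:

    # Assumes llama.cpp's server was started with something like
    # (binary name/flags vary by version; the GGUF filename is a placeholder):
    #   llama-server -m phi-3-mini-q4_k_m.gguf --port 8080 --threads 8 -c 4096
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # OpenAI-compatible endpoint
        json={
            "model": "phi-3-mini",  # some llama.cpp versions ignore this field
            "messages": [
                {"role": "user", "content": "Summarize yesterday's report data."}
            ],
            "max_tokens": 256,
            "temperature": 0.2,
        },
        timeout=600,  # CPU inference is slow; allow plenty of time
    )
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])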


> I pointed them towards vLLM, but it sounded like they were set on Ollama

I'm baffled how someone could be so set on Ollama. Being married to a tool is always weird to me, and being set on the (very) wrong tool for the job, even when faced with good advice, is even weirder.

Maybe they'll change their mind the first time a VC, customer, or hire sees Ollama and laughs ;). Kind of kidding but not.

> I’m curious though, why do you think llama.cpp is a toy compared to vLLM?

llama.cpp is downright incredible for supporting things you would never do in a multi-user production environment:

- Support for Nvidia GPUs going back to Maxwell(!)

- CPU inference (waaaay too slow)

- Splitting layers between GPU and CPU (still way too slow)

- Wild quantization methods

- Support for all kinds of random platforms you'd never deploy to in production (Apple Silicon, etc.)

- Much, much more

Whereas the emphasis for vLLM is:

- High scale serving of LLMs in production environments
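
To make the contrast concrete, here's a minimal sketch of vLLM's offline Python API (the model name is just an example, it assumes a CUDA-capable GPU, and exact defaults depend on your vLLM version):

    # Minimal vLLM sketch. The throughput comes from continuous batching and
    # PagedAttention: many requests are packed onto the GPU at once instead of
    # being processed one at a time.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model
    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

    # Submit a large batch in one call; vLLM schedules it dynamically.
    prompts = [f"Write a one-line summary of ticket #{i}." for i in range(200)]
    outputs = llm.generate(prompts, params)

    for out in outputs:
        print(out.outputs[0].text)

For actual serving you'd typically run vLLM's OpenAI-compatible HTTP server, which sits on top of the same engine.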

llama.cpp does really well when used in Ollama-type use cases: "I want to run this on my MacBook and send a request every once in a while" or "load a huge model across VRAM and RAM on my desktop", with the understanding that being hosted locally is more important than being at least as fast as ChatGPT (which is more or less considered the bare-minimum standard in the industry).

I said "at least isn't aware of vLLM" because you can take it even further than this (like Cloudflare, Amazon, Mistral, Phind, Databricks, etc) and use something like TensorRT-LLM with Triton Inference Server which kicks performance and production suitability up yet another couple of notches.

It's a right tool for the job kind of thing.

At the risk of sounding elitist, I have no idea how a dozen total tokens/s on CPU (or whatever) is going to be acceptable to users.

Especially in the case of the original scenario (an AI startup): if you go into a highly competitive and crowded space with Ollama (CPU or not), you're going to get beaten up by people deploying solutions that are fundamentally, drastically better.

All of this said, I have no idea what you mean by "small-scale-but-still-production" and no idea of your users or use case(s). I suppose there's always a chance llama.cpp on CPU could be fine in some cases. I just can't imagine what they would be, but that could be my own experience and bias talking.


I’m working on an internal tool. Maybe 30-40 “customers” total. I say it’s production because it has to be reliable.

We just don’t want to rent a GPU for this little thing. It draws up reports once a day, so it’s okay if it takes a couple of minutes. It’s work that took a single person maybe 2 hours to do before.

I’ll need to look into Triton; I haven’t heard of that yet!

If you have any resources for running models in production that you’d be willing to share, I’d appreciate them.



