Ollama 0.1.32: WizardLM 2, Mixtral 8x22B, macOS CPU/GPU model split (github.com/ollama)
107 points by tosh on April 17, 2024 | 55 comments


Related

"Am I the only one who finds it a bit sus that @ollama claims new support of models (dbrx), but all they do is update their llama.cpp commit

What makes it worse for me is that they don't thank or kudos @ggerganov and the team at all in their README nor in any acknowledgment. Ollama wouldn't exist without llama.cpp."

https://twitter.com/_philschmid/status/1780509972092121242


What's frustrating is that they're good at the value they do add: documentation, UI/UX, their model zoo, etc. That alone is something to be proud of. As of this writing they have 2,377 commits - there is quite a bit of effort and resulting value in what they're doing.

However, IMO it is pretty sleazy that they frequently make claims like "Ollama now supports X" with zero mention of llama.cpp[0] - an incredible project that makes what they're doing possible in the first place and largely enables these announcements. They don't even mention llama.cpp in their GitHub README or release notes, which cranks the sleaze up a few notches.

I don't know who they are or what their "angle" is, but this reeks of "some business opportunity/VC/something is going to come along and we'll cash in on AI hype while potentially misrepresenting what we're actually doing". To a more naive audience that doesn't quite understand the shoulders of giants they're standing on, it makes it seem like they're doing far more than they actually are.

Of course I don't know that this is the case, but it sure looks like it. It would be trivial for them to address, but they're also very good at marketing and I assume that takes priority.

[0] - https://ollama.com/blog


First of all, they are not violating any license or terms in any form. They add value and enable thousands of people to use local LLMs who would not be able to do so as easily otherwise. Maybe llama.cpp should mention that Ollama provides easy, workable access to their functionality…


> First of all, they are not violating any license or terms in any form.

IANAL, but from what I understand that's likely debatable at the very least. You'll notice I said "sleazy" and didn't touch on the license, potential legal issues, etc.

I'm pointing out that other projects that are substantially based on/dependent on other software to do the "heavy lifting" nearly always acknowledge it. A good parallel is faster-whisper, which actually has "with CTranslate2" right in the heading[0], with direct links to whisper.cpp and CTranslate2 immediately following.

Ollama is the diametric opposite of this - unless you go spelunking through commits, etc., you'd have no idea that Ollama doesn't do much in terms of the underlying LLM. Take a look at llama.cpp to see just how much "Ollama functionality" it provides.

Then look at /r/LocalLLaMA, HN, etc to see just how many Ollama users (most) have no idea that llama.cpp even exists.

I don't know how this could be anything other than an attempt to mislead people into thinking Ollama is uniquely and directly implementing all of the magic. It's pretty glaring and has been pointed out repeatedly. It's not some casual oversight.

> They add value and enable thousands of people to use local LLMs who would not be able to do so as easily otherwise.

That was the very first thing I said, going so far as to mention the commit count, the model zoo, etc. while specifically acknowledging the level of effort and the added value.

> Maybe llama.cpp should mention that Ollama provides easy, workable access to their functionality…

Are you actually suggesting that enabling software should mention, track, or even be aware of the likely countless projects that are built on it?

[0] - https://github.com/SYSTRAN/faster-whisper


The llama.cpp license does actually require attribution, and I'm not sure exactly how Ollama is complying with that.


The PR they do is very creepy; it literally reads as if all the work is being done by Ollama themselves. And when I saw they started doing meet-ups and integrations with other companies (I presume with paid support), then imho, coupled with the previous points, this crosses a red line. Do the freaking attribution.

It's the same behaviour Amazon showed with OSS, which in turn forced companies to adopt more restrictive licenses.

https://www.forbes.com/sites/davidjeans/2021/03/01/elastic-w...


I agree with everything except for "forced"


Ollama is an ergonomic "frontend" to a lower level library (llama.cpp).

The way they are operating is extremely common. If I built a webservice on top of, say, Warp in Rust, people generally wouldn't expect much acknowledgement of Warp. And should I also acknowledge Hyper, which Warp is built on?

Actually, on the flip side, Warp is a good example of giving acknowledgement, since they mention Hyper in their readme (of course it's made by the same author, so he's just linking his own work):

https://crates.io/crates/warp

Maybe Ollama should add a similar type of acknowledgement?

I will see about opening an issue on their github.



I had a consulting call with a young founder trying to start an AI company built on Ollama. I don't really think ollama scales to production workloads anyway, but they had no idea what llama.cpp was.

Kinda made me sad.


Hopefully what made you sad was that somebody could aim to start an AI company without knowing what llama.cpp is, not that they simply didn't know about llama.cpp :-)


Obviously I don't know the story but yeah... That founder and potential company are in for a rude awakening.

> I don’t really think ollama scales to production workloads anyway

Not even close. At the risk of gatekeeping: in terms of production/commercial serving of LLMs, Ollama (and llama.cpp) are basically toys. They serve a purpose and are fantastic projects for their intended use cases (serving a user or two), but compared to production workloads they're basically "my first LLM".

If that founder isn't at least aware of vLLM or HF TGI (let alone llama.cpp!!) they'll have a really tough time being even remotely competitive in the space, to the point of "it doesn't work and it's not going to".

Obviously there is much, much more that goes into startup success but this is pretty fundamental.


I pointed them towards vLLM, but it sounded like they were set on ollama

I’m curious though, why do you think llama.cpp is a toy compared to vllm?

I understand that vllm is also a server, but could someone not build a similar high throughput server on llama.cpp?

I’ve been looking for a way to serve small-scale-but-still-production workloads (using quantized phi models) on CPU and llama.cpp seems to be the only player in town.
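
For context, this is roughly what CPU-only inference through llama.cpp's Python bindings (llama-cpp-python) looks like; the GGUF filename, thread count, and prompt are placeholders for illustration, not anything from this thread:

    # Minimal CPU-only inference with llama-cpp-python (pip install llama-cpp-python).
    # The GGUF path, thread count, and prompt are illustrative placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./phi-2.Q4_K_M.gguf",  # any quantized GGUF model file
        n_ctx=2048,                        # context window size
        n_threads=8,                       # CPU threads to use
    )

    out = llm(
        "Summarize yesterday's sales numbers in three bullet points.",
        max_tokens=256,
        temperature=0.2,
    )
    print(out["choices"][0]["text"])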


> I pointed them towards vLLM, but it sounded like they were set on ollama

I'm baffled how someone could be so set on Ollama. Being married to a tool is always weird to me and being set on the (very) wrong tool for the job even when faced with good advice is even weirder.

Maybe they'll change their mind the first time a VC, customer, or hire sees Ollama and laughs ;). Kind of kidding but not.

> I’m curious though, why do you think llama.cpp is a toy compared to vllm?

llama.cpp is downright incredible for supporting things you would never do in a multi-user production environment:

- Support Nvidia GPUs going back to Maxwell(!)

- CPU (waaaay too slow)

- Split layers between GPU and CPU (still way too slow)

- Wild quantization methods

- Support all kinds of random platforms you'd never deploy to in production (Apple Silicon, etc)

- Much, much more

Whereas the emphasis for vLLM is:

- High scale serving of LLMs in production environments
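
To make the contrast concrete, here is roughly what offline batched generation with vLLM looks like; the model name, prompts, and sampling settings are placeholders, and this is a sketch rather than a deployment recipe (the production win comes from its continuous-batching server built on top of this engine):

    # Rough sketch of batched generation with vLLM (pip install vllm); requires a CUDA GPU.
    # Model name, prompts, and sampling parameters are illustrative placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    prompts = ["Explain paged attention in one sentence."] * 32  # batched in one call
    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)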

llama.cpp does really well in Ollama-type use cases - "I want to run this on my MacBook and send a request every once in a while" or "load a huge model across VRAM and RAM on my desktop" - with the understanding that being hosted locally matters more than being at least as fast as ChatGPT (which is more or less considered the bare-minimum standard in the industry).

I said "isn't at least aware of vLLM" because you can take it even further (like Cloudflare, Amazon, Mistral, Phind, Databricks, etc.) and use something like TensorRT-LLM with Triton Inference Server, which kicks performance and production suitability up yet another couple of notches.

It's a right tool for the job kind of thing.

At the risk of sounding elitist, I have no idea how a dozen total tokens/s on CPU (or whatever) is going to be acceptable to users.

Especially in the case of the original scenario (AI startup) - if you go into a highly competitive and crowded space with Ollama (CPU or not) you're going to get beaten up by people deploying with solutions that are so fundamentally drastically better.

All of this said I have no idea what you mean by "small-scale-but-still-production" and no idea of your users or use case(s). I suppose there's always a chance llama.cpp on CPU could be fine in some cases. I just can't possibly imagine what they would be but that could just be my own experience and bias talking.


I’m working on an internal tool. Maybe 30-40 “customers” total. I say it’s production because it has to be reliable.

We just don’t want to rent a GPU for this little thing. It draws up reports once a day, so it’s okay if it takes a couple mins. It’s work that took a single person maybe 2 hours to do before.

I'll need to look into Triton; I haven't heard of that yet!

If you have any resources for running models in production that you’d be willing to share, I’d appreciate them.


They ended up addressing (or attempting to address) this issue by including it on the last line of their README as one of the "Supported backends[sic]".

https://github.com/ollama/ollama/issues/3697 https://github.com/ollama/ollama/commit/9755cf9173152047030b...


New commit with acknowledgement in README.

https://github.com/ollama/ollama/pull/3700


What did they do to support WizardLM 2? It seems to work with an earlier llama.cpp version. (I have an app in production that uses a llama.cpp version from before the WizardLM 2 release.)


Quite possible that llama.cpp already supports WizardLM 2: https://github.com/ggerganov/llama.cpp/issues/6691


They're not even sure if they'll keep llama.cpp as a dep https://github.com/ollama/ollama/issues/2534#issuecomment-19...

The way it's currently vendored is already a bit dodgy.


Do they even contribute back to llama.cpp in any meaningful way?


They don't.

I just checked: there's exactly one user who has contributed (only typo fixes) to both ollama and llama.cpp, according to GitHub's contributors graphs.


This doesn't seem correct to me. In an Ollama issue mentioned in another comment, an Ollama contributor said:

> As you pointed out, we carry patches, although in general we try to upstream those.

https://github.com/ollama/ollama/issues/2534#issuecomment-19...

So I followed the link to his profile and saw that he has opened some non-documentation pull requests for llama.cpp:

https://github.com/ggerganov/llama.cpp/pull/5244

https://github.com/ggerganov/llama.cpp/pull/5576

I didn’t dig any deeper, but it took me less than thirty seconds to find those so I expect there are more.


Ah, thanks for this! I can't edit my parent comment that you replied to any longer unfortunately.

As I said, I only compared the contributors graphs [0] and checked for overlaps. But those apparently only go back about a year and only list at most 100 contributors, ranked by number of commits.

[0]: https://github.com/ollama/ollama/graphs/contributors and https://github.com/ggerganov/llama.cpp/graphs/contributors


Isn’t llama.cpp low level and highly optimized? There may not be that much overlap in the required skill sets.


Are they under any obligation to?


There’s no legal obligation.

…but I think it's fair to say there's a social obligation to tip your hat to the shoulders you stand on.

This has come up before, and they still do exactly the same thing, and there's absolutely zero chance they haven't heard the critique about it.

So you must presume it’s a deliberate choice, rather than “oops we didn’t think of that”… /shrug

If you don’t want to be called out for it, don’t do it. I’m not particularly sympathetic to them in this case.


There is no license that obligates contributing upstream.


This is just like Tanenbaum getting mad that nobody credited him for the Intel Management Engine (which he feels makes him the posthumous victor of the Linux/Minix debates).

Bro, you shouldn't have chosen a non-attribution license if you wanted to be attributed.

Just like with Tanenbaum: if you wanted your ego stroked, attribution is what that looks like in this context.


Er. Minix is under a BSD license, which does require attribution. Also, my distant memory is that Tanenbaum wasn't even mad about Intel not actually fulfilling the terms of the license, but I may be misremembering.


Can you demonstrate that there is not an attribution in the Intel Management Engine documentation? ;)

Sneaky or not, that's the license Tanenbaum chose, and he has to live with it. Same deal here.

Anyway, no, Tanenbaum isn't mad, per se, or at least not at Intel. He's sniping back at Linus Torvalds (remember the Torvalds-Tanenbaum debates? it was a thing) about how he was right after all about Minix being the most widely used OS in the world. It's not anger, it's gloating - it's not even really a letter that's meant for Intel at all.

https://www.cs.vu.nl/~ast/intel/

And again, that is the point of the entire BSD/MIT vs GPL debate - which went completely over Tanenbaum's head. BSD/MIT provides maximum freedom to the developer... sometimes including the freedom to deny freedoms to the user. He is critiquing Intel (obliquely) for doing the specific thing that makes this license desirable to these customers, and the specific thing Torvalds argued against.

It's a gloat about how his OS is more popular, but it also backhandedly shows why Torvalds was right. And the same is true here. Want attribution? Choose a license that requires it.


Yeah, I know they have attribution now, but I was fairly confident that they added that because someone called them out on shipping it without attribution before.


Ollama is seriously cool: it handles models like Docker images; you just pull and run. It's especially nice when combined with a frontend app like Open WebUI, where you can set up your ChatGPT key and other providers to have all open and closed source models in one place.
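
For anyone who hasn't tried it, the workflow really is just `ollama pull <model>` and then talking to the local server. A minimal sketch against the documented /api/generate endpoint (the model name and prompt are placeholders):

    # Minimal sketch of querying a locally running Ollama server (default port 11434).
    # Assumes `ollama pull mistral` has already been run; model and prompt are placeholders.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral",
            "prompt": "Why is the sky blue?",
            "stream": False,  # return a single JSON object instead of a token stream
        },
    )
    print(resp.json()["response"])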


> When running larger models that don't fit into VRAM on macOS, Ollama will now split the model between GPU and CPU to maximize performance.

Can anyone explain what this means? I thought Apple Silicon Macs have unified memory. Or is this only related to macOS running on Intel?


This might be referencing older Intel Macs, though it could also be referencing the default ceiling of roughly 2/3 of unified memory for GPU usage on Apple silicon. That said, the latter limit can be raised to a higher hard limit, so it's kinda moot.


Does this version use the GPU on recent Intel Macs?

I know folks have been looking into it, but don’t know if it works yet.

As is, it works, but the performance doesn't make it particularly usable on the last model of the Intel iMac.


...uh, not everyone runs a Mac? I run an Nvidia GPU in Win11/WSL, so this should prove helpful in running models that push my VRAM limits.


Most people who don't run a Mac don't care much about "models that don't fit into VRAM on macOS."


this part of the announcement specifically mentions macOS


For someone who hasn't been paying attention for the last three months, what are the specific differentiators and strengths/weaknesses of Ollama?

How does it compare to llama.cpp or llama-cpp-python, particularly with Metal support?


Ollama uses llama.cpp in the background. It's more of a tool for using models: you download models as if they were Docker images, you have a centralized way to organize your models, you can create new models with a `Modelfile`... It's the Docker of AI.
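
A hedged sketch of the `Modelfile` part: the FROM/PARAMETER/SYSTEM directives follow Ollama's documented Modelfile format, and the /api/create call mirrors what `ollama create` does on the CLI, at least as I understand the API docs (the model names are placeholders):

    # Sketch of building a custom model from a Modelfile via Ollama's local API.
    # Endpoint and fields are as I understand Ollama's API docs; names are placeholders.
    import requests

    modelfile = (
        "FROM mistral\n"
        "PARAMETER temperature 0.2\n"
        "SYSTEM You are a terse assistant that answers in one sentence.\n"
    )

    requests.post(
        "http://localhost:11434/api/create",
        json={"name": "terse-mistral", "modelfile": modelfile},
    )
    # Roughly equivalent to `ollama create terse-mistral -f Modelfile` on the CLI.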


ollama is a wrapper around llama.cpp


> When running larger models that don't fit into VRAM on macOS, Ollama will now split the model between GPU and CPU to maximize performance.

Is there a way to achieve this on a PC (preferably Windows)?


On Linux, at least, it already works like that.


I use WSL2 on Win11 and it uses my GPU, no issues. I'm not a power user so YMMV.


Has anyone tried (or does it make any sense) to use a small model to get a first opinion, and then, if it is unsure what token to emit next, run a bigger model? It seems like a lot of tokens ought to be very obvious and not require as much computation. E.g. most words are split into multiple tokens, but most of the time, in context, only one follow-on token makes sense after the first token of a word.
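
What's being described is essentially the cascade variant of speculative decoding: escalate to the big model only when the small one is unsure (standard speculative decoding instead has the big model verify the small model's draft tokens in parallel). A toy sketch of the idea; the threshold and the stand-in "models" are made up purely for illustration:

    # Toy sketch of confidence-gated ("escalate when unsure") decoding.
    # Both "models" are stand-ins returning a token -> probability dict.
    def small_model(context):
        # Pretend it's unsure on the first token, confident on follow-on tokens.
        return {"Hello": 0.5, "Hi": 0.5} if not context else {"the": 0.9, "a": 0.1}

    def big_model(context):
        return {"Hello": 0.8, "Hi": 0.2} if not context else {"the": 0.95, "a": 0.05}

    def generate(n_tokens, threshold=0.8):
        context, big_calls = [], 0
        for _ in range(n_tokens):
            dist = small_model(context)
            if max(dist.values()) < threshold:  # small model is unsure
                dist = big_model(context)       # pay for the big model only here
                big_calls += 1
            context.append(max(dist, key=dist.get))
        return context, big_calls

    tokens, big_calls = generate(5)
    print(tokens, f"(big model consulted {big_calls}/5 times)")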


On WizardLM 2: the weights have been pulled from the HF Hub by Microsoft, as there seems to be internal debate about whether to really publish it. There's a risk that it never gets published again, so keep your local copies if you have any!


WizardCoder didn't seem that hot to me,

and I expected better from Microsoft than piggybacking off of Llama.

They can keep it.


The Wizard 8x22B is definitely for the high end, even the 2-bit version. I attempted to run it on a workstation with an RTX 3090 and the performance was as bad as one word per two seconds. Probably a good candidate for a Groq accelerator.


You mean a few hundred Groq accelerators ;-) (they have 230 MB of SRAM per accelerator).
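
Back-of-envelope, under assumptions not stated in this thread (Mixtral 8x22B at roughly 141B total parameters, 8-bit weights, and ignoring KV cache and activations entirely):

    # Rough chip count for holding Mixtral 8x22B weights entirely in Groq SRAM.
    # Assumptions: ~141B total params, 1 byte per param (8-bit), 230 MB of SRAM per chip,
    # and no space reserved for KV cache or activations.
    total_params = 141e9
    bytes_per_param = 1
    sram_per_chip_bytes = 230e6

    chips_needed = total_params * bytes_per_param / sram_per_chip_bytes
    print(f"~{chips_needed:.0f} chips")  # on the order of 600, i.e. "a few hundred"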


The H100 has 50MB SRAM (L2 cache) and does just fine.

https://docs.nvidia.com/launchpad/ai/h100-mig/latest/h100-mi...


...and 80GB of very high speed VRAM.


Sure, but the point of the comment was SRAM. There is some confusion among a subset of ML people about hardware memories, their latencies, and their bandwidths. We don't all need to write kernels like Tri Dao to make transformers efficient on GPUs, but it would be great if more people were aware of the theoretical compute constraints of each type of model on a given piece of hardware, and if a subset of them then worked towards building better pipelines.


Your parent comment (by my reading) implied the H100 "does just fine" when it has 50MB SRAM.

The reason Groq needs multiple racks of chips to serve up models that fit in a single H100 is that Groq chips are SRAM-only, while the H100 has 80GB of HBM VRAM bolted onto it in addition to SRAM.


I see. You are right. I also don't think Groq would be friendly to the home user.



