Yes, I understand all that. I was saying the claim is incorrect. My understanding of DeepSeek is mechanically correct, but apparently they use 3B models as experts, per your sibling comment. I don't buy it, regardless of what they put in the paper - 3B models are pretty dumb, and R1 isn't dumb. No amount of shuffling between "dumb" experts will make the output not dumb. It's more likely 32x32B experts, based on the quant sizes I've seen.
I'm not a DeepSeek employee, but I think more clarification is needed on what an "expert" is before the conversation can make any sense. Much like physics, one needs to at least take a glance at how the math is going to be used to be able to sanity-check a claim.
A model includes many components. There are bits that encode/decode tokens as vectors, transformer blocks which do the actual processing on the data, some post-transformer-block filtering to normalize the output, and maybe some other stuff depending on the model architecture. The part we're interested in is inside the transformer block, which uses encoded relational information about the vectors involved (part 1, attention) to transform the input vector with a feed-forward network of weights (part 2), and then post-processes all of that in various ways (part 3). A model chains these transformer blocks together, and each link in the chain is called a layer. A vector runs from the first layer through to the last, being modified by each block along the way.
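To make the part 1/2/3 split concrete, here's a schematic sketch. Every function body is a stand-in (the real operations are large matrix computations); only the shape of the chaining is the point:

```python
# Schematic only: each function is a stand-in for real tensor math.
def attention(x):        # part 1: mix relational info across token vectors
    return x

def feed_forward(x):     # part 2: per-token transformation (MoE replaces this)
    return [v * 2 for v in x]

def postprocess(x):      # part 3: normalization, residuals, etc.
    return x

def transformer_block(x):
    return postprocess(feed_forward(attention(x)))

def model(x, num_layers=10):
    # The vector runs from the first layer through to the last.
    for _ in range(num_layers):
        x = transformer_block(x)
    return x
```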
In an MoE model the main change to the transformer block is in part 2, which goes from "using a feed-forward network of weights" to "using a subset of the feed-forward weights, chosen by a router for the given token, and then recombining the outputs". In the MoE case, each such subset of weights per feed-forward layer is what's called an "expert". Importantly, an expert is not a whole model - it's just a group of the weights available in a given layer, and each layer's router chooses which group(s) of weights to use independently. As an example, if a 10-layer model had a total of 10 billion 8-bit parameters in the feed-forward layers (so a >10 billion parameter model overall) and 10 experts per layer, each expert would be ~100 MB (10 billion bytes / 10 layers / 10 experts per layer). These 10 billion parameters are referred to as the sparse parameters (not all used for each token) while the rest of the model is referred to as the dense parameters (always used for each token). Note: folks on the internet have a strong tendency to mislabel this as "10x1B" or "10x{ActiveParameters}".
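The expert-size arithmetic from that example checks out directly (these numbers are the hypothetical 10-layer model above, not DeepSeek's actual configuration):

```python
# Hypothetical model from the example above, not DeepSeek's real config.
sparse_params = 10_000_000_000   # 10B feed-forward ("sparse") parameters
bytes_per_param = 1              # 8-bit parameters
layers = 10
experts_per_layer = 10

expert_bytes = sparse_params * bytes_per_param // layers // experts_per_layer
print(expert_bytes)  # 100_000_000 bytes, i.e. ~100 MB per expert
```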
The "Mixture" part of MoE extends a bit further than "the parameter groups sit next to each other in the transformer block", though. Similar in concept to how attention heads are combined in part 1, more than one expert can be activated and the outputs combined. Typically there is at least one expert which is always used in a layer (the "shared expert") and at least one expert which is selected by the router (the "routed experts"). The shared expert exists to even out utilization of the routed experts: base information that's needed all the time gets dedicated capacity, so the router can spread tokens more evenly across the other experts, which improves training performance.
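Here's a toy sketch of how a shared expert and top-k routed experts combine in one MoE feed-forward layer. All names and the score-weighted combination are illustrative, not DeepSeek's exact gating math:

```python
# Toy MoE feed-forward layer: one shared expert plus top-k routed experts.
# Illustrative only - real experts are feed-forward networks, not scalars,
# and real routers compute scores from the token vector.

def expert(weight, x):
    # Stand-in for a feed-forward network: just scale the input vector.
    return [weight * v for v in x]

def moe_layer(x, shared_weight, routed_weights, router_scores, top_k=2):
    # 1. The shared expert always runs.
    out = expert(shared_weight, x)
    # 2. The router picks the top-k routed experts for this token.
    ranked = sorted(range(len(routed_weights)),
                    key=lambda i: router_scores[i], reverse=True)[:top_k]
    # 3. Routed expert outputs are combined, weighted by router score.
    total = sum(router_scores[i] for i in ranked)
    for i in ranked:
        gate = router_scores[i] / total
        contrib = expert(routed_weights[i], x)
        out = [o + gate * c for o, c in zip(out, contrib)]
    return out

y = moe_layer([1.0, 2.0], shared_weight=0.5,
              routed_weights=[1.0, 2.0, 3.0],
              router_scores=[0.1, 0.6, 0.3], top_k=2)
```

Note that the unselected expert (index 0) contributes nothing: its weights simply aren't read for this token, which is where the compute savings come from.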
With that understanding, the important takeaways are:
- Experts are groups of parameters within each layer's feed-forward network, not sub-models carved out of the whole.
- More than one expert can be used, and the outputs are combined not by feeding the output of one low-parameter LLM into another, but more like how the multiple attention heads in the first part of the transformer block are combined to give the full attention information.
- A model has a lot of weights beyond the sparse weights used by experts. The same is true of the equivalent portions of a dense model, though, which is what makes comparing a sparse model's active parameters to a dense model's total parameters a valid comparison.
As to the original conversation: DeepSeek v3/R1 has 37 billion active parameters, so that should set the floor for a comparative dense model - not the size of an individual expert in a single transformer layer (which acts as a bit of a red herring in these conversations, doubly so since more than one expert's worth of weights is used anyway). In reality the fair comparison is a bit above 37B, though definitely less than if all 671 billion parameters were dense. While we don't have much concrete public information about modern versions of ChatGPT, one thing we're relatively certain of is that ChatGPT 4.5 has a TON more parameters and basically nothing to show for it. Meanwhile Gemma 3, a locally runnable 27B model from Google, is a hair behind DeepSeek v3 on many popular benchmarks. The truth is there are a lot of things beyond parameter count that go into making a well-performing model, and DeepSeek v3/R1 hit (or invented) a lot of them. If I had to place a bet on this whole comparison, I'd say OpenAI is very likely also using sparse/MoE-style architectures in its current products anyway.
As a final note, don't just dismiss DeepSeek's architecture with "regardless of what they put in the paper"! The model files are publicly available and they include this layer/parameter information so your device can run them (well, "run" requires a fair bit of RAM in this case... but you can still read the model file on disk and see that it's laid out as the paper claims regardless).
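For instance, the safetensors format the weights ship in starts each shard with a length-prefixed JSON header listing every tensor's name and shape, so you can inspect the expert layout without loading any weights. A sketch (the demo header below is fabricated; real shards list tensors for every layer and expert):

```python
# Peek at a safetensors header: 8 bytes little-endian length, then JSON.
import io
import json
import struct

def list_tensors(f):
    header_len = struct.unpack("<Q", f.read(8))[0]
    header = json.loads(f.read(header_len))
    return {name: meta["shape"] for name, meta in header.items()
            if name != "__metadata__"}

# Demo with a tiny fabricated header; the tensor name is made up, but real
# DeepSeek shards contain per-layer, per-expert entries in the same style.
fake = json.dumps({"layer.0.expert.0.w": {"dtype": "F8", "shape": [4, 4],
                                          "data_offsets": [0, 16]}}).encode()
buf = io.BytesIO(struct.pack("<Q", len(fake)) + fake)
shapes = list_tensors(buf)
```

Counting the expert tensors per layer in a real shard is exactly how you'd verify the paper's claimed expert count against the files on disk.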
A deepseek employee is welcome to correct me.