DeepSeek V3/R1 are MoE models with 256 routed experts per layer, activating 1 shared expert and 8 routed experts per layer https://arxiv.org/html/2412.19437v1#S2:~:text=with%20MoE%20l... so you can't just take the active parameter count and assume it's close to the size of a single expert (never mind that experts are per layer anyway, and that there are still dense parameters to count).
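A back-of-envelope sketch of why this doesn't work (the numbers below are hypothetical round figures for illustration, not DeepSeek's actual config): the active count sums activated experts across every layer, plus the dense parameters, so it dwarfs any single expert.

```python
# Hypothetical illustrative numbers, NOT DeepSeek's real configuration.
N_LAYERS = 60          # assumed number of MoE layers
EXPERT_PARAMS = 40e6   # assumed parameters in one routed expert
ROUTED_PER_TOKEN = 8   # routed experts activated per token per layer
SHARED_PER_LAYER = 1   # always-on shared expert per layer
DENSE_PARAMS = 10e9    # assumed attention/embedding/other dense parameters

# Active parameters = dense parts + (activated experts per layer) * layers
active = DENSE_PARAMS + N_LAYERS * (ROUTED_PER_TOKEN + SHARED_PER_LAYER) * EXPERT_PARAMS
print(f"active ≈ {active / 1e9:.1f}B parameters vs one expert ≈ {EXPERT_PARAMS / 1e6:.0f}M")
```

With these made-up numbers, ~31.6B parameters are active per token, while a single expert holds only 40M; dividing active parameters by the per-layer expert count tells you nothing about expert size.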

Despite the connotations of specialized intelligence that the term "expert" provokes, it's really mostly about the scalability/efficiency of running large models. By splitting up sections of the layers and not activating all of them on each pass, a single query takes less bandwidth, can be distributed across compute, and can be parallelized with other queries on the same nodes.
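A minimal sketch of that routing idea (toy sizes and randomly initialized weights, not any real model): a learned router scores all experts per token, but only the top-k of them, plus the shared expert, actually execute, so per-token compute scales with k rather than with the total expert count.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 256, 8  # toy dimensions for illustration

# One tiny linear "expert" per slot; a real MoE expert would be an MLP.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
shared = rng.standard_normal((d, d)) / np.sqrt(d)   # always-on shared expert
router = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_layer(x):
    logits = x @ router
    top = np.argsort(logits)[-top_k:]        # indices of the top-k experts
    w = np.exp(logits[top])
    w /= w.sum()                             # softmax gate over chosen experts
    out = x @ shared                         # shared expert always runs
    for gate, idx in zip(w, top):
        out += gate * (x @ experts[idx])     # only k of n_experts execute
    return out, top

x = rng.standard_normal(d)
y, chosen = moe_layer(x)
print(f"ran {len(chosen) + 1} of {n_experts + 1} experts for this token")
```

Since only 9 of 257 expert matrices are touched per token, the untouched ones never need to be loaded into the compute units for that token, which is where the bandwidth and parallelism wins come from.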


