Been working on an autoencoder that converts the hidden states of transformer models into a spatial representation that can be visualized. I started at toy scale, but now I'm trying to scale it beyond my humble 3060. I've been using LLMs to help with torch and such, but they're limited when it comes to the details of tensor twiddling.
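The rough shape of the idea, as a minimal sketch (the layer sizes, bottleneck dimension, and MSE objective here are illustrative stand-ins, not the actual weightscan architecture):

    import torch
    import torch.nn as nn

    class HiddenStateAutoencoder(nn.Module):
        # Compress transformer hidden states into a low-dim "spatial" code
        def __init__(self, d_model=768, d_space=3):
            super().__init__()
            # Bottleneck = plottable coordinates (2D/3D for visualization)
            self.encoder = nn.Sequential(
                nn.Linear(d_model, 256), nn.GELU(),
                nn.Linear(256, d_space),
            )
            self.decoder = nn.Sequential(
                nn.Linear(d_space, 256), nn.GELU(),
                nn.Linear(256, d_model),
            )

        def forward(self, h):
            z = self.encoder(h)             # (batch, seq, d_space)
            return z, self.decoder(z)

    # Train on hidden states captured from a forward pass of the model
    ae = HiddenStateAutoencoder()
    h = torch.randn(4, 128, 768)            # stand-in for real hidden states
    z, h_hat = ae(h)
    loss = nn.functional.mse_loss(h_hat, h)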
Gemini is probably using ring attention. But scaling to that size requires engineering effort around the interconnect that goes beyond the purpose of this release from Mistral.
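For anyone unfamiliar: in ring attention each device keeps its query block while key/value blocks rotate around a ring, and each hop folds into a flash-attention-style online softmax, so no device ever materializes the full attention matrix. A single-process toy simulation of the arithmetic (not anyone's production code):

    import torch

    def ring_attention_sim(q_blocks, k_blocks, v_blocks, scale):
        # Each "device" i holds one query block; KV blocks arrive one hop
        # at a time, updating a running online-softmax accumulator.
        n = len(q_blocks)
        outputs = []
        for i, q in enumerate(q_blocks):                  # q: (Lq, d)
            m = torch.full((q.shape[0],), float("-inf"))  # running row max
            l = torch.zeros(q.shape[0])                   # running denominator
            o = torch.zeros_like(q)                       # running numerator
            for step in range(n):
                j = (i + step) % n                        # KV block this hop
                s = q @ k_blocks[j].T * scale             # local scores
                m_new = torch.maximum(m, s.max(dim=-1).values)
                p = torch.exp(s - m_new[:, None])
                o = o * torch.exp(m - m_new)[:, None] + p @ v_blocks[j]
                l = l * torch.exp(m - m_new) + p.sum(dim=-1)
                m = m_new
            outputs.append(o / l[:, None])
        return torch.cat(outputs)

    # Sanity check against full (non-causal) attention
    d = 16
    qs, ks, vs = ([torch.randn(8, d) for _ in range(4)] for _ in range(3))
    full = torch.softmax(torch.cat(qs) @ torch.cat(ks).T * d**-0.5, -1) @ torch.cat(vs)
    assert torch.allclose(ring_attention_sim(qs, ks, vs, d**-0.5), full, atol=1e-4)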
According to the paper, the dataset goes up to 1978 because that's when copyright law was updated to automatically apply to newswires. It's unfortunate that we've ended up in a situation where academia has to play by the rules wrt copyright while big private labs flout them.
This is, unfortunately, how our (US) justice system works.
The entire concept of an "NDA" has been bastardized in this manner, if you think about it. Conceptually you may think of an NDA as protecting sensitive data from disclosure, a sort of intellectual property right. However, it's been co-opted by folks to do nothing more than cover up inconvenient truths because they realize most people cannot afford to either (a) give up money they are promised in the future or (b) bankrupt themselves in their own defense.
So it's basically a game where whoever has the most money can ensure their narrative wins out in the end, because competing narratives can simply be "bought out".
In that case there are two attractors - one towards the Golden Gate Bridge and one towards the harmless, helpful, honest assistant persona. Techniques like this probably produce weirder results with model scale, but there's no reason to think the effect gets wiped out.
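For reference, this style of steering boils down to adding a feature direction into the residual stream mid-forward-pass. A minimal sketch with a PyTorch forward hook; the layer index, coefficient, and the direction vector itself are placeholder assumptions (in practice the direction comes from something like an SAE feature or a difference of means over contrastive prompts):

    import torch

    def make_steering_hook(direction, coeff):
        # Add a fixed unit-norm feature direction to the layer's output
        direction = direction / direction.norm()
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + coeff * direction.to(hidden.dtype)
            return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
        return hook

    # Usage, assuming a HuggingFace-style decoder stack (hypothetical names):
    # handle = model.model.layers[20].register_forward_hook(
    #     make_steering_hook(bridge_direction, coeff=8.0))
    # ... generate, and outputs drift toward the steered concept ...
    # handle.remove()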
Unary operators are still operators; the integer parsing rules are just different. In Lisp, for example, -x is read as a symbol while -123 is read as a negative integer literal, so negating x has to be written (- x).
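For comparison, Python keeps unary minus as an operator even in front of literals - you can see it in the AST (folding -123 into a constant only happens later, in the bytecode compiler):

    import ast

    # Both parse as a unary minus applied to an operand
    print(ast.dump(ast.parse("-x", mode="eval").body))
    # UnaryOp(op=USub(), operand=Name(id='x', ctx=Load()))
    print(ast.dump(ast.parse("-123", mode="eval").body))
    # UnaryOp(op=USub(), operand=Constant(value=123))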
The bigger size probably comes from the bigger vocabulary in the tokenizer. But most people are running this model quantized to at least 8 bits, and it still holds up reasonably down to 3-4 bpw.
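Back-of-the-envelope, with made-up numbers, to show how the two effects combine - the embedding and unembedding tables grow linearly with vocab size, and quantization scales the whole footprint by bytes per weight:

    # All numbers hypothetical; assumes untied input embeddings + lm_head
    hidden_dim = 4096
    other_params = 7e9                      # non-embedding parameters

    for vocab in (32_000, 128_000):
        embed = 2 * vocab * hidden_dim      # embeddings + unembedding
        total = other_params + embed
        for bpw in (16, 8, 4):
            gb = total * bpw / 8 / 1e9
            print(f"vocab={vocab:>7} bpw={bpw:>2} -> {gb:5.1f} GB")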
If you're in Silicon Valley and spend most of your day on the Internet, $10 for superb search is a no-brainer.
If you're a student or postgrad somewhere in the US, it still likely feels worth the price, even though $10 is already past the threshold of observability.
If you are, say, somewhere in Thailand, or Uzbekistan, or Botswana, the $10/mo becomes unironically noticeable: not prohibitive, but you really want to get a lot of value in exchange.
And basically anywhere in the world, if you're a kid and mom and dad won't buy Kagi for you, you're out of luck.
https://github.com/ristew/weightscan