Would there be any way to distribute RAG across multiple smaller models? Rather than one giant model handling your entire document base, have it be more of a tree where the top level classifies the docs into top-level categories and sends them to submodels to subclassify, etc.? (Doesn't have to be 1:1 classification.) And the same for q/a search?
These could all presumably be the same physical instance, just each query would use a different system prompt and perhaps different embeddings. (I'm guessing; I don't actually know how RAG works.) So, a little slower and clunkier, but presumably way more efficient. And match quality could be anywhere from horrible to better-than-one-large-model. This would be more like how businesses organize docs.
Or maybe there's no real benefit to this, and each subclassifier would require just as big of a model as if you were to throw all docs into a single model anyway. I assume it's probably been tried before.
There's just been a Twitter post by Omar Khattab (@lateinteraction), from the work on ColBERT, about encoding documents into a scoring function instead of a simple vector - and maybe at some point using a DNN as the scoring function.
So, yes, maybe there's a way to "distribute" RAG. (I still wonder if that isn't just MoE taken to its logical conclusion)
So, dig for ColBERT papers, might be helpful. (I wish I had the time to do that)
Short answer: Yes, there are ways it can be done. Multiple. Needs to be custom built though, given no one has explored it deeply yet.
One simple way is what Omar Khattab (ColBERT) mentioned about using a scoring function instead of a simple vector.
Another is to use a classifier at the start directing queries to the right model. You will have to train the classifier though. (I mean a language model kind of does this implicitly, you are just taking more control by making it explicit.)
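To make the "classifier at the start" idea concrete, here is a minimal sketch of a query router. Everything in it is an assumption for illustration: the labels ("billing", "support", "general"), the toy training queries, and the word-overlap scoring, which stands in for whatever real classifier you would train.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Deliberately naive: lowercase and split on whitespace.
    return text.lower().split()

def train(examples):
    """examples: list of (query, label). Returns per-label word counts."""
    counts = defaultdict(Counter)
    for query, label in examples:
        counts[label].update(tokenize(query))
    return counts

def route(query, counts, default="general"):
    """Pick the label whose training vocabulary overlaps the query most."""
    words = tokenize(query)
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    if not scores:
        return default
    best = max(scores, key=scores.get)
    # Fall back to the default model when nothing matches at all.
    return best if scores[best] > 0 else default

# Hypothetical labeled queries; each label would map to one downstream model.
examples = [
    ("how do I reset my password", "support"),
    ("app crashes on login", "support"),
    ("refund for my last invoice", "billing"),
    ("invoice shows the wrong amount", "billing"),
]
router = train(examples)
print(route("wrong invoice amount", router))
print(route("weather tomorrow", router))
```

In practice you would swap the overlap score for a small trained text classifier, but the shape stays the same: classify first, then send the query to the model (or index) that owns that label.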
Another is how you index your docs. Today, most RAG approaches do not encode enough information. If you have defined domains/models already, you can encode the same in metadata for your docs at the time of indexing, and you pick the model based on the metadata.
These approaches would work pretty well, given that a model as small as 100M parameters can regurgitate what is in your docs, and is faster than your larger models.
Benefit wise, I don't see a lot of benefit except preserving privacy and gaining more control.
I was originally thinking about it as something like a Bazel plugin for large codebases. Each module would have its own LLM context, and it might make it easier to put whole modules into the context, plus summaries of the dependencies. That could work better than a single huge context attempting to summarize the whole monorepo.
The general idea is probably better for the code use case too, since having the module's whole codebase in context likely allows for more precise edits. Whereas RAG is just search, not edit.
That said, code assistants probably do something like this already, though it must be more ad hoc. Obviously they wouldn't be able to do any completions if they didn't have detailed context of the adjacent code.
What I meant was that at the time of indexing, you can add more information to any chunk. This[1] is a simple example by Anthropic where they add more relevant context. In our case, say you have two models, D1 and D2. At the time of creating the vector store, you can record which model is more suitable for each chunk, so that when you retrieve that chunk, you use the same model for inference. This is custom built and very dependent on datasets, but would get you to the functionality described. I suggest this approach when there are linkages between various docs (e.g. financial statements, earnings calls, etc.).
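A minimal sketch of that metadata idea, under loud assumptions: the `Chunk` type, the toy 2-d vectors, and the majority-vote rule in `pick_model` are all made up for illustration. Only the D1/D2 tags come from the description above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    vector: List[float]  # embedding computed at indexing time
    model: str           # metadata added at indexing time: "D1" or "D2"

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, store, k=2):
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(store, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:k]

def pick_model(chunks):
    """Assumed rule: majority vote over the model tags of the retrieved chunks."""
    return Counter(c.model for c in chunks).most_common(1)[0][0]

# Toy store; real vectors would come from an embedding model.
store = [
    Chunk("Q3 revenue rose 12%", [1.0, 0.0], "D1"),
    Chunk("Earnings call: guidance raised", [0.9, 0.1], "D1"),
    Chunk("Office relocation notice", [0.0, 1.0], "D2"),
]
hits = retrieve([1.0, 0.1], store, k=2)
print(pick_model(hits), [h.text for h in hits])
```

Most vector stores let you attach arbitrary metadata to a chunk, so the tag would normally live there rather than in a hand-rolled dataclass.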
Thanks... I also have another lingering doubt about the ability of RAG to make sense of "history", i.e. how to make sure that a more recent document on a given topic has more "weight" than older documents on the same issue.
This is done at a reranking step, and it's again custom. You have two variables: 1/ relevance (which most algos focus on) and 2/ date. Create a new score from a weighted combination of the two, e.g. 50% relevance and 50% date. If a document has 70% relevance but was published yesterday (so its date score is close to 100%), its overall score would be about 85%. (A conceptual idea.) This is similar to how you do weighted sorting anywhere.
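Here is a rough sketch of that weighted score. The exponential decay with a one-year half-life is my assumption for turning a publication date into a 0..1 "date score"; the 50/50 weighting and the 70%-relevance example are the ones from the comment above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Hit:
    doc_id: str
    relevance: float  # 0..1, as returned by the retriever
    published: date

def recency(published, today, half_life_days=365.0):
    """1.0 for a document published today, halving every half_life_days (assumed decay)."""
    age_days = max((today - published).days, 0)
    return 0.5 ** (age_days / half_life_days)

def combined_score(hit, today, date_weight=0.5):
    """Weighted mix of relevance and recency."""
    return (1 - date_weight) * hit.relevance + date_weight * recency(hit.published, today)

def rerank(hits, today, date_weight=0.5):
    return sorted(hits, key=lambda h: combined_score(h, today, date_weight), reverse=True)

today = date(2024, 11, 1)
hits = [
    Hit("old-but-relevant", 0.9, today - timedelta(days=3 * 365)),
    Hit("fresh", 0.7, today - timedelta(days=1)),
]
for h in rerank(hits, today):
    print(h.doc_id, round(combined_score(h, today), 3))
```

With these numbers the fresh document scores roughly 0.85 and outranks the older, more relevant one, which is the behavior asked about. The half-life and the date weight are knobs you would tune per dataset.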
I think he might be saying: have metadata in your vector retrieval that describes the domain of the retrieved chunk, and use that to decide which model to use downstream. Sounds like a very interesting improvement to RAG.
TL;DR: It's a very interesting line of thought that as late as Q2 2024, there were a couple thought leaders who pushed the idea we'd have, like 16 specialized local models.
I could see that in the very long term, but as it stands, it works the way you intuited: 2 turkeys don't make an eagle, i.e. there's some critical size where it's speaking coherently, and it's at least an OOM bigger than it needs to be in order to be interesting for products.
fwiw RAG for me in this case is:
- user asks q.
- llm generates search queries.
- search api returns urls.
- web view downloads urls.
- app turns html to text.
- local embedding model turns text into chunks.
- app decides, based on "character" limit configured by user, how many chunks to send.
- LLM gets all the chunks, instructions + original question, and answers.
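The last two steps of the list above can be sketched like this. It assumes the chunks are already ranked best-first; the function names and the instruction wording are hypothetical, not what any particular app uses.

```python
def select_chunks(chunks, char_limit):
    """Take chunks in ranked order until the character budget is used up."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > char_limit:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected

def build_prompt(question, chunks, char_limit):
    """Assemble instructions + snippets + the original question."""
    picked = select_chunks(chunks, char_limit)
    context = "\n\n".join(f"[snippet {i}]\n{c}" for i, c in enumerate(picked, 1))
    return (
        "Answer the question using only the web page snippets below. "
        "Do not continue the snippets.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Three 40-character chunks against a 100-character budget: only two fit.
chunks = ["a" * 40, "b" * 40, "c" * 40]
prompt = build_prompt("What is in the snippets?", chunks, char_limit=100)
print(prompt)
```

Labeling each snippet and ending on "Answer:" is one cheap way to nudge a model toward answering rather than continuing the retrieved text.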
It's incredibly interesting how many models fail this simple test; there have been multiple Google releases in the last year that just couldn't handle it.
- Some of it is basically being too small to be coherent; bigcos don't make that mistake though.
- There's another critical threshold where the model doesn't wander off doing the traditional LLM task of completing rather than answering. What I mean is, throwing in 6 pages' worth of retrieved webpages will cause some models to just start rambling like they're writing more web pages, i.e. they're not able to "identify the context" of the web page snippets, and they ignore the instructions.