> we use Sentence Transformers (all-MiniLM-L6-v2) as our default (solid all-around performer for speed and retrieval, English-only).
Huh, interesting. I might be building a German-language RAG at some point in my future and I never even considered that some models might not support German at all. Does anyone have any experience here? Do many models underperform or not support non-English languages?
Check the "Retrieval" section of the leaderboard: either RTEB Multilingual or RTEB German (under the language-specific benchmarks).
You may also want to filter by model size (under "Advanced Model Filters"). For instance, if you are self-hosting and running on a CPU, it may make sense to limit results to models with <=100M parameters.
> Do many models underperform or not support non-English languages?
Yes they do. However:
1. German is one of the more common languages to train on, so more models will support it than, say, Bahasa Indonesia.
2. There should still be a reasonable number of multilingual models available, particularly if you're OK with using proprietary models via API. AFAIK all the frontier (closed-source) embedding and reranking models are multilingual.