Hacker News

> we use Sentence Transformers (all-MiniLM-L6-v2) as our default (solid all-around performer for speed and retrieval, English-only).

Huh, interesting. I might be building a German-language RAG at some point in my future and I never even considered that some models might not support German at all. Does anyone have any experience here? Do many models underperform or not support non-English languages?



You can refer to the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard) and use it to guide your selection.

Check under the "Retrieval" section, either RTEB Multilingual or RTEB German (under language specific).

You may also want to filter by model size (under "Advanced Model Filters"). For instance, if you are self-hosting and running on a CPU, it may make sense to limit yourself to models with <=100M parameters.
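One reason the size filter matters for self-hosting: weight memory scales linearly with parameter count. A back-of-envelope sketch (my own arithmetic, not from the leaderboard; assumes fp32 weights and ignores activations and tokenizer overhead):

```python
def model_memory_mb(n_params: float, bytes_per_param: int = 4) -> float:
    """Approximate RAM needed for model weights (fp32 by default)."""
    return n_params * bytes_per_param / 1e6

# all-MiniLM-L6-v2 has ~22.7M parameters.
print(model_memory_mb(22.7e6))   # -> 90.8 (MB)

# A 100M-parameter cap keeps fp32 weights around 400 MB,
# comfortably within most CPU boxes.
print(model_memory_mb(100e6))    # -> 400.0 (MB)

# Halve it again if the model ships in fp16.
print(model_memory_mb(100e6, bytes_per_param=2))  # -> 200.0 (MB)
```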


Thanks, that's really useful, I had no idea this table existed.


> Do many models underperform or not support non-English languages?

Yes they do. However:

1. German is one of the more common languages to train on, so more models will support it than, say, Bahasa Indonesia

2. There should still be a reasonable number of multilingual models available, particularly if you're OK with using proprietary models via API. AFAIK all the frontier (closed-source) embedding and reranking models are multilingual
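Whichever multilingual model you pick, the retrieval step itself is language-agnostic: rank documents by cosine similarity between the query embedding and the document embeddings. A minimal sketch with toy vectors standing in for real model outputs (the names and numbers are illustrative, not from any actual model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d vectors standing in for embeddings of a German query
# and two German documents.
query = [0.9, 0.1, 0.0]
docs = {
    "passport_info_de": [0.8, 0.2, 0.1],  # on-topic document
    "weather_de":       [0.0, 0.1, 0.9],  # off-topic document
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(ranked[0])  # the on-topic document scores highest
```

A model that wasn't trained on German will still produce vectors, but the similarities stop tracking meaning, which is why the leaderboard's language-specific retrieval scores are the thing to check.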


Yes, I can confirm that; we resorted to a multilingual embedding model back in the day. https://link.springer.com/chapter/10.1007/978-3-031-77918-3_...



