I kind of agree in principle, but there are a multitude of clever benchmarks that try to measure lots of different aspects: robustness, knowledge, understanding, hallucinations, tool-use effectiveness, coding performance, multimodal reasoning and generation, and so on. All of these have lots of limitations, but together they paint a pretty compelling picture that complements the “vibes”, which are also important.