It's certainly not my intent to undermine the efforts of Robin Rombach, Andreas Blattmann, Katherine Crowson (and many others).
Katherine's work on CLIP-guided diffusion over the `guided-diffusion` ImageNet checkpoints was effectively the first time the public got to see what text-to-image generation via diffusion, rather than purely transformer-based approaches (like DALL-E 1/dalle-mini), would look like. And it happened well before GLIDE was published (where it gets a mention/citation).
The CompVis team (Blattmann, Rombach, etc.) has been able to not just compete with, but in some ways surpass (it's nuanced), the work of the big American research labs (OpenAI in particular) through solid, novel research. Their `VQGAN` outperformed the autoencoder from the DALL-E 1 paper, and they've been competing directly in the vision space ever since.
Incredibly talented people.