It's certainly not my intent to undermine the efforts of Robin Rombach, Andreas Blattmann, Katherine Crowson (and many others).
Katherine's work on CLIP-guided diffusion over the `guided-diffusion` ImageNet checkpoints was effectively the first time the public got to see what text-to-image generation via diffusion, rather than purely transformer-based approaches (like DALL-E 1/dalle-mini), would look like. And it happened well before GLIDE was published (where it gets a mention/citation).
The CompVis team (Blattmann, Rombach, etc.) has been able to not just compete with, but in some ways surpass (it's nuanced), the work of the big American research labs (OpenAI in particular) through solid, novel research. Their `VQGAN` outperformed the autoencoder from the DALL-E 1 paper, and they've been competing directly in the vision space ever since.
Incredibly talented people.