
Note that Paella is a bit old in image model terms (Nov 2022) and modern stable diffusion tools have access to optimized workflows.

My 3060 can generate a 256x256, 8-step image in 0.5 seconds, no A100 needed. A 3090 is double the performance of a 3060 at 512x512, and an A100 is 50% faster than a 3090...

If you have access to a high-end consumer GPU (4090) you can generate 512x512 images in less than a second; it's reached the point where you can increase the batch size and have it show 2-4 images per prompt without adversely affecting your workflow.

Too bad SD1.5 is too small* and we'll need models with more parameters if we want a true general-purpose image model. If SD1.5 were the end-game, we'd have truly instant high-res image generation in just a couple more generations of GPUs: think generating images in real time as you type the prompt, or dragging sliders that affect the strength of certain tokens and seeing the effects in real time, etc. Though I've heard SDXL is actually faster at higher resolutions (>1024x1024) due to removing attention on the first layer, making it scale better with resolution even though SDXL has 4x the parameter count.

* Current SD1.5 models that can generate consistently high quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost. E.g. they can be great at generating landscapes but lacking at humans, or very good at a certain style like comics but only that style, losing the ability to generate more varied, dynamic faces, etc.



Personally I think the SD1.5 trade-off (loss of general knowledge in exchange for surprisingly high quality images on consumer hardware) is worth it.

It's fairly impressive to me what the community has made possible with SD1.5. Sure, on a vanilla task something like Dall-E 2 generally performs better, but with some tweaking you can easily beat out Dall-E on a home gaming PC.

The fact that you can fine-tune SD1.5 on a 4090 is incredible to me.

Given how much powerful AI is locked behind fees and private APIs, it's refreshing to see so much cool stuff coming out of the OSS world again. Best of all, it's not being driven exclusively by people with an ML background, but more so by curious amateurs. It really brings me back to a time when playing around with software/the web felt exciting.


I am really pleasantly surprised the field is kind of open and not drowning in software patents.


Fully correct. Also, v2 of the paper introduced a model that is bigger and slower but generates better images, so the 500ms was only for the first model we introduced in v1. I also want to mention our new work, as it is very much related to this whole topic of speeding up models (either training or sampling): Würstchen: https://github.com/dome272/wuerstchen/ With a current version we are training at the moment, we can sample (using torch.compile) 4 1024x1024 images in 4 seconds. Training this kind of model is also very fast because we compress images spatially much, much more: 3x512x512 images -> 16x12x12 latents => 42x spatial compression, whereas Stable Diffusion has an 8x compression (3x512x512 -> 4x64x64).
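
To make those compression numbers concrete, here is the back-of-the-envelope arithmetic behind the figures above (a sketch in Python, using only the numbers already quoted):

  # Per-axis spatial compression implied by the latent shapes above.
  sd_factor = 512 / 64     # Stable Diffusion: 512x512 pixels -> 64x64 latents => 8x
  wu_factor = 512 / 12     # Würstchen:        512x512 pixels -> 12x12 latents => ~42.7x

  # Total element counts, for a sense of how much smaller the latents are overall.
  image_elems = 3 * 512 * 512   # 786,432 values per RGB image
  sd_elems    = 4 * 64 * 64     #  16,384 values (~48x fewer than the image)
  wu_elems    = 16 * 12 * 12    #   2,304 values (~341x fewer than the image)

Since the diffusion model runs on those latents, each training or sampling step touches far fewer values, which is where the speedup comes from.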


All diffusion models are quite inefficient due to running in PyTorch eager mode (with torch.compile being kinda janky in practice on 2.0/2.1).

I would be more interested to see Paella vs SD running on a ML compiler framework, like TVM or AITemplate. Maybe one or the other is more amenable to optimization.


A framework could make the programmer think they are in eager mode while actually building graphs behind the scenes.

The trick is to simply not do any calculations until the last possible moment, i.e. the point at which the program tries to convert the finished image to a JPEG. Only at that point do you compile the graph and run the actual computation on the GPU.

You then also cache the graph, so that the compilation step can be avoided if the program tries to do the same computation again with different data.
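
A minimal toy sketch of that idea in Python (purely illustrative, not how torch.compile, TVM, or any real framework is implemented): operations are recorded lazily, compilation only happens when a concrete result is demanded, and the compiled graph is cached by structure so repeating the computation with different data skips the expensive step.

  _graph_cache = {}

  class Lazy:
      def __init__(self, op, *args):
          self.op, self.args = op, args
      def __add__(self, other): return Lazy("add", self, other)
      def __mul__(self, other): return Lazy("mul", self, other)

  def leaf(value):
      return Lazy("leaf", value)

  def structure(node):
      # Cache key: the shape of the graph, ignoring concrete leaf values.
      if node.op == "leaf":
          return ("leaf",)
      return (node.op,) + tuple(structure(a) for a in node.args)

  def compile_graph(node):
      # Stand-in for an expensive compilation step; returns an evaluator.
      def run(n):
          if n.op == "leaf":
              return n.args[0]
          left, right = run(n.args[0]), run(n.args[1])
          return left + right if n.op == "add" else left * right
      return run

  def realize(node):
      # The "convert to a jpeg" moment: only now do we compile (or hit the cache) and execute.
      key = structure(node)
      if key not in _graph_cache:
          _graph_cache[key] = compile_graph(node)
      return _graph_cache[key](node)

  x = leaf(2) * leaf(3) + leaf(4)   # nothing computed yet
  print(realize(x))                 # compiles the graph, prints 10
  y = leaf(5) * leaf(6) + leaf(7)   # same structure, different data
  print(realize(y))                 # reuses the cached graph, prints 37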


... Kinda like torch.compile?

That approach is limited though. AITemplate and TVM take a looong time to compile, but they produce standalone executable files, hence the gains are much larger than with torch's Triton backend.


I find your post intriguing. What would you say are the major janks with torch.compile, and what issues are addressed by TVM/AITemplate but not by torch.compile?

EDIT: If I understand correctly, these libraries target deployment performance, while torch.compile is also/mostly for training performance?


- The gain in stable diffusion is modest (15%-25% last I checked?)

- Torch 2.0 only supports static inputs. In actual usage scenarios, this means frequent lengthy recompiles.

- Eventually, these recompiles will overload the compilation cache and torch.compile will stop functioning.

- Some common augmentations (like TomeSD) break compilation, force recompiles, make compilation take forever, or kill the performance gains.

- There are other miscellaneous bugs, like compilation freezing the Python thread and causing networking timeouts in web UIs, or errors with embeddings.

- Dynamic input in Torch 2.1 nightly fixes many of these issues, but was only maybe working a week ago? See https://github.com/pytorch/pytorch/issues/101228#issuecommen...

- TVM and AITemplate have massive performance gains. ~2x or more for AIT, not sure about an exact number for TVM.

- AIT supported dynamic input before torch.compile did, and requires no recompilation after the initial compile. Also, weights (models and LORAs) can be swapped out without a recompile.

- TVM supports very performant Vulkan inference, which would massively expand hardware compatibility.

Note that the popular SD Web UIs don't support any of this, with two exceptions I know of: VoltaML (with WIP AIT support) and the Windows DirectML fork of A1111 (which uses optimized ONNX models, I think). There is about 0% chance of ML compilation support in A1111, and the HF diffusers UIs are less bleeding edge and performance/compatibility focused.

And yes, Triton-based torch.compile is aimed at training. There is an alternative backend (Hidet) that explicitly targets inference, but it does not work with Stable Diffusion yet.
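
For reference, the knobs being discussed above look roughly like this from user code (a hedged sketch; the tiny stand-in module is mine, in practice it would be the diffusers UNet):

  import torch
  import torch.nn as nn

  # Tiny stand-in model; in real usage this would be the SD UNet.
  model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(), nn.Conv2d(8, 4, 3, padding=1))

  # Torch 2.0-style compile: shapes get specialized, so a new resolution or
  # batch size triggers the lengthy recompiles mentioned above.
  compiled = torch.compile(model)

  # Torch 2.1 dynamic shapes: ask the compiler for shape-generic kernels,
  # avoiding most recompiles when resolution or batch size changes.
  compiled_dyn = torch.compile(model, dynamic=True)

  # The inference-oriented backend mentioned above (requires `pip install hidet`).
  # compiled_hidet = torch.compile(model, backend="hidet")

  out = compiled_dyn(torch.randn(1, 4, 64, 64))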


Thanks for the info. I didn't know about the TomeSD stuff, really interesting. Why do you think that AITemplate is so much faster?


(And for reference, AITemplate roughly doubles SD 1.5's speed. Not sure if that's good, or if Paella would have even more room for auto optimization.)


> Current SD1.5 models that can generate consistent high quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost,

You say that as if it were a bad thing, but it's actually good: with GPU memory being a limiting factor, general knowledge is mostly overhead. It is much better to have 50 specialized models (which you can all store on disk for cheap), each taking 5 times less GPU memory, than one big general model that you'll constantly under-use but still have to load entirely into GPU memory. And it's even more true for LLMs.



