
Note that Paella is a bit old in image model terms (Nov 2022) and modern stable diffusion tools have access to optimized workflows.

My 3060 can generate a 256x256, 8-step image in 0.5 seconds, no A100 needed. A 3090 is double the performance of a 3060 at 512x512, and an A100 is 50% faster than a 3090...

If you have access to a high-end consumer GPU (4090) you can generate 512x512 images in less than a second; it's reached the point where you can increase the batch size and have it show 2-4 images per prompt without adversely affecting your workflow.

Too bad SD1.5 is too small* and we'll need models with more parameters if we want a true general-purpose image model. If SD1.5 were the end-game, we'd have truly instant high-res image generation in just a couple more generations of GPUs: think generating images in real time as you type the prompt, or dragging sliders that affect the strength of certain tokens and seeing the effects in real time, etc. Though I've heard SDXL is actually faster at higher resolutions (>1024x1024) due to removing attention on the first layer, making it scale better with resolution even though SDXL has 4x the parameter count.

* Current SD1.5 models that can generate consistently high quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost. E.g. they can be great at generating landscapes but lacking at humans, or very good at a certain style like comics but only that style, losing the ability to generate more varied, dynamic faces, etc.



Personally I think the SD1.5 trade-off (loss of general knowledge in exchange for surprisingly high quality images on consumer hardware) is worth it.

It's fairly impressive to me what the community has made possible with SD1.5. Sure, on a vanilla task something like Dall-E 2 generally performs better, but with some tweaking you can easily beat out Dall-E on a home gaming PC.

The fact that you can fine-tune SD1.5 on a 4090 is incredible to me.

Given how much powerful AI is locked behind fees and private APIs, it's refreshing to see so much cool stuff coming out of the OSS world again. Best of all, it's not being driven exclusively by people with an ML background, but more so by curious amateurs. It really brings me back to a time when playing around with software/the web felt exciting.


I am really pleasantly surprised the field is kind of open and not drowning in software patents.


Fully correct. Also, v2 of the paper introduced a model that is bigger and slower but generates better images, so the 500ms was only for the first model we introduced in v1. I also want to mention our new work, as it is very much related to this whole topic of speeding up models (either training or sampling): Würstchen: https://github.com/dome272/wuerstchen/ With a current version we are training at the moment, we can sample (using torch.compile) 4 1024x1024 images in 4 seconds. Training this kind of model is also very fast because we compress images spatially much, much more: 3x512x512 images -> 16x12x12 latents => 42x spatial compression, whereas Stable Diffusion has an 8x compression (3x512x512 -> 4x64x64).
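
To make those compression numbers concrete, here is the back-of-the-envelope arithmetic behind the figures above (a sketch in Python, using only the numbers already quoted):

  # Per-axis spatial compression implied by the latent shapes above.
  sd_factor = 512 / 64     # Stable Diffusion: 512x512 pixels -> 64x64 latents => 8x
  wu_factor = 512 / 12     # Würstchen:        512x512 pixels -> 12x12 latents => ~42.7x

  # Total element counts, for a sense of how much smaller the latents are overall.
  image_elems = 3 * 512 * 512   # 786,432 values per RGB image
  sd_elems    = 4 * 64 * 64     #  16,384 values (~48x fewer than the image)
  wu_elems    = 16 * 12 * 12    #   2,304 values (~341x fewer than the image)

Since the diffusion model runs on those latents, each training or sampling step touches far fewer values, which is where the speedup comes from.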


All diffusion models are quite inefficient due to running in PyTorch eager mode (with torch.compile being kinda janky in practice on 2.0/2.1).

I would be more interested to see Paella vs SD running on a ML compiler framework, like TVM or AITemplate. Maybe one or the other is more amenable to optimization.


A framework could make the programmer think they are in eager mode while actually building graphs behind the scenes.

The trick is to simply not do any calculations until the last possible moment, i.e. the point at which the program tries to convert the finished image to a JPEG. Only at that point do you compile the graph and run the actual computation on the GPU.

You then also cache the graph, so that the compilation step can be avoided if the program tries to do the same computation again with different data.
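
A minimal toy sketch of that idea in Python (purely illustrative, not how torch.compile, TVM, or any real framework is implemented): operations are recorded lazily, compilation only happens when a concrete result is demanded, and the compiled graph is cached by structure so repeating the computation with different data skips the expensive step.

  _graph_cache = {}

  class Lazy:
      def __init__(self, op, *args):
          self.op, self.args = op, args
      def __add__(self, other): return Lazy("add", self, other)
      def __mul__(self, other): return Lazy("mul", self, other)

  def leaf(value):
      return Lazy("leaf", value)

  def structure(node):
      # Cache key: the shape of the graph, ignoring concrete leaf values.
      if node.op == "leaf":
          return ("leaf",)
      return (node.op,) + tuple(structure(a) for a in node.args)

  def compile_graph(node):
      # Stand-in for an expensive compilation step; returns an evaluator.
      def run(n):
          if n.op == "leaf":
              return n.args[0]
          left, right = run(n.args[0]), run(n.args[1])
          return left + right if n.op == "add" else left * right
      return run

  def realize(node):
      # The "convert to a jpeg" moment: only now do we compile (or hit the cache) and execute.
      key = structure(node)
      if key not in _graph_cache:
          _graph_cache[key] = compile_graph(node)
      return _graph_cache[key](node)

  x = leaf(2) * leaf(3) + leaf(4)   # nothing computed yet
  print(realize(x))                 # compiles the graph, prints 10
  y = leaf(5) * leaf(6) + leaf(7)   # same structure, different data
  print(realize(y))                 # reuses the cached graph, prints 37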


... Kinda like torch.compile?

That approach is limited though. AITemplate and TVM take a looong time to compile, but they produce standalone executable files, hence the gains are much larger than with torch's Triton backend.


I find your post intriguing. What would you say are the major janks with torch.compile, and what issues are addressed by TVM/AITemplate but not by torch.compile?

EDIT: If I understand correctly, these libraries target deployment performance, while torch.compile is also/mostly for training performance?


- The gain in stable diffusion is modest (15%-25% last I checked?)

- Torch 2.0 only supports static inputs. In actual usage scenarios, this means frequent lengthy recompiles.

- Eventually, these recompiles will overload the compilation cache and torch.compile will stop functioning.

- Some common augmentations (like TomeSD) break compilation, force recompiles, make compilation take forever, or kill the performance gains.

- There are other miscellaneous bugs, like compilation freezing the Python thread and causing networking timeouts in web UIs, or errors with embeddings.

- Dynamic input in Torch 2.1 nightly fixes many of these issues, but was only maybe working a week ago? See https://github.com/pytorch/pytorch/issues/101228#issuecommen...

- TVM and AITemplate have massive performance gains. ~2x or more for AIT, not sure about an exact number for TVM.

- AIT supported dynamic input before torch.compile did, and requires no recompilation after the initial compile. Also, weights (models and LORAs) can be swapped out without a recompile.

- TVM supports very performant Vulkan inference, which would massively expand hardware compatibility.

Note that the popular SD Web UIs don't support any of this, with two exceptions I know of: VoltaML (with WIP AIT support) and the Windows DirectML fork of A1111 (which uses optimized ONNX models, I think). There is about 0% chance of ML compilation support in A1111, and the HF diffusers UIs are less bleeding edge and performance/compatibility focused.

And yes, Triton-based torch.compile is aimed at training. There is an alternative backend (Hidet) that explicitly targets inference, but it does not work with Stable Diffusion yet.
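
For reference, the knobs being discussed above look roughly like this from user code (a hedged sketch; the tiny stand-in module is mine, in practice it would be the diffusers UNet):

  import torch
  import torch.nn as nn

  # Tiny stand-in model; in real usage this would be the SD UNet.
  model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.SiLU(), nn.Conv2d(8, 4, 3, padding=1))

  # Torch 2.0-style compile: shapes get specialized, so a new resolution or
  # batch size triggers the lengthy recompiles mentioned above.
  compiled = torch.compile(model)

  # Torch 2.1 dynamic shapes: ask the compiler for shape-generic kernels,
  # avoiding most recompiles when resolution or batch size changes.
  compiled_dyn = torch.compile(model, dynamic=True)

  # The inference-oriented backend mentioned above (requires `pip install hidet`).
  # compiled_hidet = torch.compile(model, backend="hidet")

  out = compiled_dyn(torch.randn(1, 4, 64, 64))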


Thanks for the info. I didn't know about the TomeSD stuff, really interesting. Why do you think that AITemplate is so much faster?


(And for reference, AITemplate roughly doubles SD 1.5's speed. Not sure if that's good, or if Paella would have even more room for auto optimization.)


> Current SD1.5 models that can generate consistent high quality images have been fine-tuned and merged so many times that a lot of general knowledge has been lost,

You say that as if it were a bad thing, but it's actually good: with GPU memory being a limiting factor, general knowledge is mostly overhead. It is much better to have 50 specialized models (which you can all store on disk for cheap), each taking 5 times less GPU memory, than one big general model that you'll constantly under-use but still have to load entirely into GPU memory. And it's even more true for LLMs.



