
Presumably a transformer model (or similar) that uses positional encodings for the tokens could do that, but the U-Net decoder here uses a fixed-shape output and learns relationships between tokens (and the sizes of image features) from the positions of those tokens in a fixed-size vector. You could still apply this process convolutionally and slide the entire network around to generate an image that is an arbitrary multiple of the token size, but image content in one region of the image would only be "aware" of content within a fixed-size neighborhood (e.g. 256×256).


This is not how it works: there is no sliding window. The model has no restrictions on the W and H dimensions, only on the C dimension, so the U-Net can be applied directly to images with different aspect ratios and resolutions, and the attention layers attend over the entire image.
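A minimal sketch of that point (my own toy PyTorch code, not the SD/Paella implementation): a conv + attention block's weights only constrain the channel dimension, so the same block runs on any H×W.

    import torch
    import torch.nn as nn

    class ConvAttnBlock(nn.Module):
        def __init__(self, c=64, heads=4):
            super().__init__()
            self.norm = nn.GroupNorm(8, c)
            self.conv = nn.Conv2d(c, c, kernel_size=3, padding=1)   # weights depend only on C
            self.attn = nn.MultiheadAttention(c, heads, batch_first=True)

        def forward(self, x):                     # x: (B, C, H, W) for any H, W
            x = self.conv(self.norm(x))
            b, c, h, w = x.shape
            seq = x.flatten(2).transpose(1, 2)    # (B, H*W, C): attention spans the whole image
            seq, _ = self.attn(seq, seq, seq)
            return x + seq.transpose(1, 2).view(b, c, h, w)

    block = ConvAttnBlock()
    for shape in [(1, 64, 32, 32), (1, 64, 48, 96)]:   # different resolutions and aspect ratios
        print(block(torch.randn(*shape)).shape)        # the same weights handle both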


From Sections 3 and 4 of the VQGAN paper[1] upon which this work is built: "To generate images in the megapixel regime, we ... have to work patch-wise and crop images to restrict the length of [the quantized encoding vector] s to a maximally feasible size during training. To sample images, we then use the transformer in a sliding-window manner as illustrated in Fig. 3." ... "The sliding window approach introduced in Sec. 3.2 enables image synthesis beyond a resolution of 256×256 pixels."
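For illustration, a rough sketch of the sliding-window sampling that quote describes; the `prior` callable, the vocabulary size, and the sampling details are stand-ins of mine, not the paper's code.

    import torch

    def sample_latent_grid(prior, H, W, vocab=1024, window=16):
        """Autoregressively fill an H x W grid of token indices, conditioning each
        position only on a window x window crop of the grid (the sliding window)."""
        grid = torch.zeros(H, W, dtype=torch.long)
        for i in range(H):
            for j in range(W):
                top = max(0, min(i - window // 2, H - window))
                left = max(0, min(j - window // 2, W - window))
                crop = grid[top:top + window, left:left + window]
                logits = prior(crop)                     # fixed-length context per step
                probs = torch.softmax(logits, dim=-1)
                grid[i, j] = torch.multinomial(probs, 1).item()
        return grid

    # Toy stand-in for the transformer prior: uniform logits over the codebook.
    toy_prior = lambda crop: torch.zeros(1024)
    print(sample_latent_grid(toy_prior, H=32, W=48).shape)   # torch.Size([32, 48])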

From the Paella paper[2]: "Our proposal builds on the two-stage paradigm introduced by Esser et al. and consists of a Vector-quantized Generative Adversarial Network (VQGAN) for projecting the high dimensional images into a lower-dimensional latent space... [w]e use a pretrained VQGAN with an f=4 compression and a base resolution of 256×256×3, mapping the image to a latent resolution of 64×64 indices." After training, in describing their token predictor architecture: "Our architecture consists of a U-Net-style encoder-decoder structure based on residual blocks, employing convolutional[sic] and attention in both, the encoder and decoder pathways."
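The shape arithmetic in that quote is just the f=4 spatial compression; a quick check (the second resolution is my own example, not from the paper):

    f = 4
    assert (256 // f, 256 // f) == (64, 64)   # 256x256x3 image -> 64x64 index grid
    print(384 // f, 640 // f)                 # a 384x640 input would give 96x160 indices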

U-Net, of course, is a convolutional neural network architecture [3]. The "down" and "up" encoder/decoder blocks in the Paella code are batch-normed CNN layers. [4]
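For readers not following the link, a down/up block pair of this kind looks roughly like the following; this is an illustrative sketch, not the contents of the linked modules.py.

    import torch.nn as nn

    class Down(nn.Module):
        """Strided conv block: halves H and W, changes the channel count."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.GELU(),
            )

        def forward(self, x):
            return self.block(x)

    class Up(nn.Module):
        """Transposed conv block: doubles H and W and adds the encoder skip."""
        def __init__(self, c_in, c_out):
            super().__init__()
            self.block = nn.Sequential(
                nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.GELU(),
            )

        def forward(self, x, skip):
            return self.block(x) + skip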

[1] https://arxiv.org/pdf/2012.09841.pdf [2] https://arxiv.org/pdf/2211.07292.pdf [3] https://arxiv.org/abs/1505.04597 [4] https://github.com/dome272/Paella/blob/main/src/modules.py#L...


The transformer model used in the VQGAN paper has nothing to do with the autoencoder; it is only used to predict the quantized tokens. So no, you don't need to slide a window with the U-Net: you can directly predict images with different aspect ratios and resolutions, just as with SD.
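To make that two-stage split concrete, a toy sketch (arbitrary sizes and module choices of mine, not the VQGAN/Paella code): the stage-1 decoder only sees a latent grid and decodes it convolutionally, whatever its size; the stage-2 prior (transformer in VQGAN, token-predicting U-Net in Paella) is a separate model that fills in that grid.

    import torch
    import torch.nn as nn

    # Stage 1: VQ codebook plus a fully convolutional decoder (toy sizes).
    codebook = nn.Embedding(1024, 4)                                        # 1024 codes, 4-dim latents
    decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=4, padding=2)  # f=4 upsampling

    # Stage 2 would produce the index grid; here we just fake one at two latent resolutions.
    for h, w in [(64, 64), (48, 96)]:
        idx = torch.randint(0, 1024, (1, h, w))
        z = codebook(idx).permute(0, 3, 1, 2)     # (1, 4, h, w)
        print(decoder(z).shape)                   # pixel output scales with h, w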



