We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.

Diffusion models have seen wide success in image generation. Autoregressive models, GANs, and VQ-VAE Transformer based methods have all made remarkable progress in text-to-image research. More recently, diffusion models have been explored for text-to-image generation, including the concurrent work of DALL-E 2. DALL-E 2 uses a diffusion prior on CLIP latents, and cascaded diffusion models to generate high-resolution 1024×1024 images. We believe Imagen is much simpler, as Imagen does not need to learn a latent prior, yet achieves better results in both MS-COCO FID and side-by-side human evaluation on DrawBench. GLIDE also uses cascaded diffusion models for text-to-image, but Imagen uses larger pretrained frozen language models, which we found to be instrumental to both image fidelity and image-text alignment. XMC-GAN also uses BERT as a text encoder, but we scale to much larger text encoders and demonstrate the effectiveness thereof. The use of cascaded diffusion models is also popular throughout the literature, and has been used with success in diffusion models to generate high-resolution images. Finally, Imagen is part of a series of text-to-image work at Google Research, including its sibling model Parti.

There are several ethical challenges facing text-to-image research broadly. We offer a more detailed exploration of these challenges in our paper and offer a summarized version here. First, downstream applications of text-to-image models are varied and may impact society in complex ways. The potential risks of misuse raise concerns regarding responsible open-sourcing of code and demos. At this time we have decided not to release code or a public demo. In future work we will explore a framework for responsible externalization that balances the value of external auditing with the risks of unrestricted open-access. Second, the data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized the LAION-400M dataset, which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models.
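The overall pipeline — a frozen pretrained text encoder feeding a base diffusion model and a cascade of super-resolution diffusion stages — can be illustrated with a toy sketch. Everything here is a hypothetical stand-in: `frozen_text_encoder`, `denoise_step`, and `sample` are not Imagen's real components (Imagen uses a frozen T5 encoder and trained U-Net denoisers); the sketch only shows how text conditioning and the 64→256→1024 cascade fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_text_encoder(prompt: str, dim: int = 16) -> np.ndarray:
    """Stand-in for a frozen pretrained text encoder (e.g. T5).
    Maps a prompt to a fixed embedding; toy character-sum seeding,
    not a real language model."""
    seed = sum(ord(c) for c in prompt) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

def denoise_step(x: np.ndarray, t: int, text_emb: np.ndarray) -> np.ndarray:
    """Stand-in for one reverse-diffusion step of a trained denoiser:
    nudges the noisy image toward a (fake) text-dependent target."""
    target = np.broadcast_to(text_emb.mean(), x.shape)
    return x + 0.1 * (target - x)

def sample(shape, text_emb, steps=10, base=None):
    """Run a toy reverse process. `base` is the previous stage's
    lower-resolution output, which super-resolution stages upsample
    and condition on."""
    x = rng.standard_normal(shape)
    if base is not None:
        # naive nearest-neighbour upsample of the previous stage
        factor = shape[0] // base.shape[0]
        x = x + np.kron(base, np.ones((factor, factor)))
    for t in reversed(range(steps)):
        x = denoise_step(x, t, text_emb)
    return x

# Text embedding is computed once and reused by every stage.
emb = frozen_text_encoder("a corgi playing a flute")
img64 = sample((64, 64), emb)                     # base 64x64 model
img256 = sample((256, 256), emb, base=img64)      # 64 -> 256 super-resolution
img1024 = sample((1024, 1024), emb, base=img256)  # 256 -> 1024 super-resolution
print(img1024.shape)
```

Note the design point the sketch mirrors: the text encoder is frozen, so the same embedding conditions all three stages, and each super-resolution stage sees both the text and the previous stage's output.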