Since the beginning of 2021, the AI field has produced a large number of text-to-image models, such as DALL-E 2, Stable Diffusion, and Midjourney. Recently, Google also released a text-to-image generation model called “Muse”, which it claims achieves state-of-the-art image generation performance.
The images below were all generated by Muse from the following text prompts:
- A school of fish spells the word “MUSE” in the sea
- Welsh Corgi with the brand “MUSE” in its mouth
- Latte with “Muse”
- The flame in the fireplace presents the word “MUSE”
Muse is trained on a masked modeling task in discrete token space: given text embeddings extracted from a pretrained large language model (LLM), Muse is trained to predict randomly masked image tokens. Using a pretrained LLM enables fine-grained language understanding, which translates into high-fidelity image generation and an understanding of visual concepts such as objects, their spatial relationships, poses, and cardinality.
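The masked-prediction objective described above can be illustrated with a minimal sketch. It only shows how a training example might be built: some image tokens are hidden, and the hidden originals become the prediction targets. The names `MASK_ID` and `mask_image_tokens`, the codebook size of 8192, and the 50% mask ratio are all illustrative assumptions, not details taken from Google's implementation.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the special "masked" token


def mask_image_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of image tokens.

    Returns the masked sequence plus a dict mapping each masked
    position to the original token the model should predict there.
    """
    num_to_mask = max(1, round(len(tokens) * mask_ratio))
    masked_positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked_input = list(tokens)
    for pos in masked_positions:
        masked_input[pos] = MASK_ID
    targets = {pos: tokens[pos] for pos in masked_positions}
    return masked_input, targets


rng = random.Random(0)
# A toy "image": 16 discrete tokens drawn from an assumed codebook of 8192 entries.
image_tokens = [rng.randrange(8192) for _ in range(16)]
masked_input, targets = mask_image_tokens(image_tokens, mask_ratio=0.5, rng=rng)
```

In a real system, a transformer would receive `masked_input` together with the text embeddings and be trained to recover the values in `targets`.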
Overall, Muse has three advantages: it achieves better FID and CLIP scores, it generates images much faster than comparable models, and it supports mask-based editing out of the box (that is, a generated image can be edited further by masking part of it).
Better scores: Muse achieves excellent FID and CLIP scores, which quantitatively measure image generation quality, diversity, and alignment to text (a lower FID is better, while a higher CLIP score is better). The 900M-parameter Muse model sets a new SOTA on CC3M with an FID of 6.06. The 3B-parameter Muse model achieves an FID of 7.88 and a CLIP score of 0.32 on the zero-shot COCO evaluation.
Generation efficiency: Muse is much faster than comparable models because it uses a compressed, discrete latent space and parallel decoding. Compared with pixel-space diffusion models such as Imagen and DALL-E 2, Muse works on discrete tokens and requires fewer sampling iterations, so generation is significantly more efficient. Compared with autoregressive models such as Google’s own Parti, Muse decodes tokens in parallel rather than one at a time, which is also more efficient.
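To see why parallel decoding needs fewer model calls than autoregressive decoding, consider the rough sketch below. It follows the general idea of confidence-based iterative decoding with a cosine schedule: every step predicts all masked tokens at once and commits only the most confident ones. The function names, the random stand-in for the model, and the numbers (256 tokens, 12 steps) are assumptions chosen for illustration and should not be read as Muse's actual settings.

```python
import math
import random

MASK = -1  # hypothetical id for a still-masked token


def toy_predict(tokens, rng, vocab_size):
    """Stand-in for the transformer: for every masked position,
    return a (predicted_token, confidence) pair. Purely random here."""
    return {
        i: (rng.randrange(vocab_size), rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def parallel_decode(num_tokens, num_steps, vocab_size=8192, seed=0):
    """Fill in all tokens using at most num_steps model calls."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    calls = 0
    for step in range(1, num_steps + 1):
        preds = toy_predict(tokens, rng, vocab_size)
        if not preds:
            break  # everything is already decoded
        calls += 1
        # Cosine schedule: how many tokens stay masked after this step.
        keep_masked = math.floor(
            num_tokens * math.cos(math.pi / 2 * step / num_steps)
        )
        if step == num_steps:
            keep_masked = 0
        # Always commit at least one token per step.
        keep_masked = min(keep_masked, len(preds) - 1)
        num_commit = len(preds) - keep_masked
        most_confident = sorted(
            preds, key=lambda i: preds[i][1], reverse=True
        )[:num_commit]
        for i in most_confident:
            tokens[i] = preds[i][0]
    return tokens, calls


tokens, calls = parallel_decode(num_tokens=256, num_steps=12)
```

In this toy setup, 256 tokens are filled in with 12 model calls, whereas a token-by-token autoregressive decoder would need one call per token, that is, 256 calls.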
Editing: Muse supports mask-based editing. In the image below, for example, a mask is drawn on the left image and the prompt “hot air balloon” is entered, which produces the new image on the right.
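One way to picture mask-based editing on a token-based model: the pixel region covered by the user's mask is mapped to the image tokens that overlap it, and only those tokens are masked and regenerated under the new prompt. The sketch below covers only that mapping step. The helper name `tokens_to_remask`, the 256-pixel image size, and the 16-pixel patch per token are assumptions for illustration, not confirmed details of Muse.

```python
def tokens_to_remask(box, image_size=256, patch=16):
    """Return flat indices of the token-grid cells that overlap a pixel box.

    box is (left, top, right, bottom) in pixels, with right and bottom
    exclusive. Each token is assumed to cover a patch x patch pixel square.
    """
    left, top, right, bottom = box
    grid = image_size // patch  # tokens per row and per column
    # Ceiling division via -(-x // y) so partially covered cells are included.
    cols = range(left // patch, min(grid, -(-right // patch)))
    rows = range(top // patch, min(grid, -(-bottom // patch)))
    return [r * grid + c for r in rows for c in cols]


# A user-drawn rectangle covering pixels x in [32, 80) and y in [48, 96).
indices = tokens_to_remask((32, 48, 80, 96))
```

The tokens at `indices` would then be replaced by the mask token and predicted again, conditioned on the new text, while all other tokens stay fixed.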
Additionally, the Muse team pointed out that today’s language and image AI systems carry some “potential harms” in their use cases, such as social bias or the spread of misinformation. For this reason, the team has released neither the source code of Muse nor a public demo.
You can see more images generated by Muse on the Muse homepage; the following picture is a preview of some of them: