Author: Vishwas Agrawal
Last Updated: Sun, Nov 12, 2023Stable Diffusion XL (SDXL) is a deep learning text-to-image diffusion model developed by Stability AI. It can enhance existing images or generate new ones guided by a text description prompt. SDXL can generate high-resolution images that are more realistic and visually appealing. It can also generate images with more complex details, such as faces and objects.
This guide explains how you can use the Stable Diffusion XL (SDXL) model on a Vultr Cloud GPU instance. You are to install the required packages, generate an image using Stable Diffusion XL Model, increase the image quality using the Refiner model, combine the Refiner model and the base model and mask images by Inpainting.
Before you begin:
Deploy a fresh Ubuntu 22.04 A100 Vultr GPU Stack server with at least:
Securely access the server using SSH as a non-root user with sudo privileges
Open a new terminal session in the Jupyter Lab interface
Install diffusers
and other required packages
$ pip install diffusers transformers accelerate safetensors matplotlib ipywidgets
The above command installs the following packages:
transformers
: Consists of multiple pre-trained models used for Natural Language Processing (NLP), Named Entity Recognition (NER), machine translation, and sentiment analysis.
diffusers
: These are pre-trained diffusion models in the form of prepackaged pipelines. It provides tools for building and training diffusion models. Diffusers also include many different core neural network models, and are used as building blocks to create new pipelines.
accelerate
: Enables PyTorch to run across any distributed configuration. It uses accelerators like GPUs and TPUs to improve efficiency and scalability, speed up natural language processing (NLP) workflows, and enhance performance.
safetensor
: It is a package aimed at enhancing debugging and diagnostics for PyTorch tensors. It provides additional features and checks to help identify and prevent common tensor-related issues.
matplotlib
: It enables you to display the images in Jupyter Notebook.
Create a new directory to save generated images
$ mkdir /home/jupyter/notebooks/generated_images
The above command creates a new directory to save the generated images.
Stable Diffusion XL (SDXL) is a pre-trained text-to-image generation model with 3.5 billion parameters, capable of generating realistic images with resolutions of up to 1024 x 1024 pixels.
To generate images with Stable Diffusion XL, import the required modules such as StableDiffusionXLPipeline
from diffusers
, torch
, and matplotlib.pyplot
. Then, initialize the model using the provided model_id
and set it up for GPU acceleration by assigning it to the CUDA device.
Open a new Notebook and set it's name to Stable Diffusion XL Base
To use the model, import the following modules
import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionXLPipeline
Below is what each module does:
StableDiffusionXLPipeline
class provides an interface to the model for generating images.
torch
enables support for tensor computations. In this context, it's used for GPU acceleration.
matplotlib
library allows you to display the generated images.
Declare the model
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True
)
pipe.to("cuda")
By calling the from_pretrained
method, the pipeline takes care of the necessary setup to generate images from text.
The parameters passed to the from_pretrained()
method are:
The model_id
of a pipeline. The function call above loads the "stabilityai/stable-diffusion-xl-base-1.0"
model. The model ID can also be the path to a local directory containing model weights or a path (local or URL) to a checkpoint file.
torch_dtype
is the Torch datatype of the tensors used for pipeline computations. float16
is specified explicitly so that the model computations are done in 16-bit floating point numbers, to assist systems with less GPU RAM. It is possible to let the system choose the optimal data type using torch_dtype = "auto"
.
Generate an image by providing a prompt
as below
prompt = "Astronaut in a jungle"
image = pipe(prompt=prompt).images
Replace Astronaut in a jungle
with your desired text prompt
The above code declares and feeds the prompt to the previously declared pipeline and stores the image attribute. A different image is generated each time you run the module, you can enhance the output image by providing a more detailed prompt.
Render the generated image
plt.imshow(image[0])
The Astronaut in a jungle
generates an image like the one below:
Image-to-Image is a pre-trained diffusion pipeline from the Diffusers library that also includes a refiner checkpoint specialized in denoising low-noise stage images to generate images of improved high-frequency quality
The following section explains the steps to refining images using a pre-trained pipeline from the Diffusers library along with a refiner check point
Open a new Notebook and set it's name to Stable Diffusion XL Image to Image
.
To clear GPU memory and start running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shutdown Down All Kernels.
To use the model, import the following packages
import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image
Below is what each module does:
StableDiffusionXLImg2ImgPipeline
class provides an interface to the model for refining images.
load_image
function from the utils
module within the diffusers library loads image data from a given source.
Declare the model
model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True
)
pipe = pipe.to("cuda")
By calling the from_pretrained
method, the pipeline takes care of the necessary setup to refine the generated images.
The model_id
of a pipeline. The function call above loads the "stabilityai/stable-diffusion-xl-refiner-1.0"
model. The model ID can also be the path to a local directory containing model weights or a path (local or URL) to a checkpoint file.
Load the image
init_img_url = "https://i.imgur.com/UikG1MN.png"
init_image = load_image(init_img_url).convert("RGB")
url
contains the image's URL, which needs to be refined. And then, load_image(url).convert("RGB")
will load the image and convert the image into RGB
color space.
Generate an image by providing a prompt
as below
prompt = "A photo of an astronaut riding a horse on mars"
image = pipe(prompt, image=init_image).images
Use the same prompt
here that you will use to generate your original image or use a prompt
that describes your input image. For example, the above image's URL, image was generated by using this same A photo of an astronaut riding a horse on mars prompt
.
Render the generated image
plt.imshow(image[0])
The A photo of an astronaut riding a horse on mars
generates an image like the one below:
This refiner checkpoint can also be used as a second-stepâ pipeline after having run the base checkpoint to improve image quality.
You can also access the above code in the following Notebook
Diffusers also allow you to use two pipelines simultaneously from its library. With this, you can generate an image using the StableDiffusion-XL base checkpoint and improve the quality of the generated image using the refiner checkpoint
When you use the base and refiner models together to generate an image, this is known as an ensemble of expert denoisers. The ensemble of expert denoisers approach requires fewer denoising steps in total compared to passing the base model's output to the refiner model, making it significantly faster to run. However, it's important to note that you won't be able to inspect the base model's output because it still contains a significant amount of noise.
Open a new Notebook and set it's name to Stable Diffusion XL Base + Refiner
To clear GPU memory and start running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shutdown Down All Kernels
To use the model, import the following packages
import torch
import matplotlib.pyplot as plt
from diffusers import DiffusionPipeline
Declare the base model
base = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0",
torch_dtype=torch.float16, variant="fp16",
use_safetensors=True
)
base.to("cuda")
Declare the refiner model
refiner = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0",
text_encoder_2=base.text_encoder_2,
vae=base.vae,
torch_dtype=torch.float16,
use_safetensors=True,
variant="fp16",
)
refiner.to("cuda")
Because the text encoders and variational autoencoder are the same, you donât have to load those again for the refiner.
Define the number of inference steps and determine the point at which the model should be passed through the high-noise denoising stage (for example, the base model)
n_steps = 40
high_noise_frac = 0.8
The base model is trained on timesteps 0-999
, and the refiner is finetuned from the base model on low noise timesteps 0-199
inclusive, so use the base model for the first 800
timesteps (high noise) and the refiner for the last 200
timesteps (low noise). Hence, highnoisefrac is set to 0.8
so that all steps 200-999
(the first 80% of denoising timesteps) are performed by the base model and steps 0-199 (the last 20% of denoising timesteps) are performed by the refiner model.
Remember, the denoising process starts at high value (high noise) timesteps and ends at low value (low noise) timesteps.
Execute both the pipelines
prompt = "A majestic lion jumping from a big stone at night"
base_image = base(
prompt=prompt,
num_inference_steps=n_steps,
denoising_end=high_noise_frac,
output_type="latent",
).images
refined_image = refiner(
prompt=prompt,
num_inference_steps=n_steps,
denoising_start=high_noise_frac,
image=base_image,
).images
Make sure to set denoising_end
and denoising_start
to the same values and keep num_inference_steps
constant. Also remember that the output of the base model should be in latent
space.
Render the generated image
plt.imshow(refined_image[0])
The image generated by the prompt A majestic lion jumping from a big stone at night
is an example that illustrates the comparison between using a base model and a combined base and refiner model.
However, you wonât be able to inspect the base modelâs output because it still contains a large amount of noise.
The above base model image was generated using the Stable Diffusion XL base model with a manual seed. Subsequently, the same manual seed was employed to generate these images for demonstration purposes. If you wish to reproduce images using a manual seed, you can follow the steps below.
Create a new generator
Apply this section with the Stable Diffusion XL Base model, following the same steps used to generate the previous base model image.
prompt = "A majestic lion jumping from a big stone at night"
generator = torch.Generator("cuda").manual_seed(3078)
image = pipe(prompt=prompt, generator=generator).images
Above code block defines a new generator to the pipeline. Using manual_seed()
with a fixed number makes the model produce consistent output. You can pick any integer as the manual seed. By default, the generator generates a new seed for a unique image every time.
Execute both the pipelines with same manual seed
Apply this section with Stable Diffusion XL combine Base and refiner model, following the same steps used to generate the previous model image.
prompt = "A majestic lion jumping from a big stone at night"
generator = torch.Generator("cuda").manual_seed(3078)
base_image = base(
prompt=prompt,
generator=generator,
num_inference_steps=n_steps,
denoising_end=high_noise_frac,
output_type="latent",
).images
refined_image = refiner(
prompt=prompt,
num_inference_steps=n_steps,
denoising_start=high_noise_frac,
image=base_image,
).images
Above code block above generates an image identical to the base model image using the same manual seed that was used to create the base model image. It then refines the image to improve its quality, allowing you to make a comparison between the two images.
You can also access the above code in the following Notebook
Inpainting is a pre-trained diffusion pipeline from the Diffusers library, designed to fill in missing or damaged regions of an image by predicting these areas based on the surrounding pixels.
Inpainting utilizes the same Stable Diffusion XL base model. The help of a trained diffusion pipeline enables the model to learn how to restore missing or damaged portions of an image while preserving its original style and content.
To generate inpainting images, youâll need the original image and a mask of what you want to replace in the original image. Create a prompt to describe what you want to replace in the masked area.
Open a new Notebook and set it's name to Stable Diffusion XL Inpaint
To clear GPU memory and start running the model, navigate to the Kernel menu option in your Jupyter Notebook, and click Shutdown Down All Kernels
To use the model, import the following packages
import torch
import matplotlib.pyplot as plt
from diffusers import StableDiffusionXLInpaintPipeline
from diffusers.utils import load_image
Declare the model
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = StableDiffusionXLInpaintPipeline.from_pretrained(
model_id,
torch_dtype=torch.float16,
variant="fp16",
use_safetensors=True
)
pipe.to("cuda")
Upload the images
init_img_url = "https://i.imgur.com/AsJ1lPf.png"
mask_img_url = "https://i.imgur.com/liBhpAv.png"
init_image = load_image(init_img_url).convert("RGB")
mask_image = load_image(mask_img_url).convert("RGB")
Generate an image by providing a prompt
as below
prompt = "A majestic tiger sitting on a bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=50, strength=0.80).images
Replace A majestic tiger sitting on a bench
with your desired prompt
Render the generated image
plt.imshow(image[0])
Below is an example of the image showing how the model works:
You can also access the above code in the following Notebook
Specify the directory you created earlier to save the generated images
save_directory = "/home/jupyter/notebooks/generated_images"
The defined directory must exist on the disk. Verify that you created it in your user home directory.
Using a for loop
, save the images
for i, image in enumerate(image):
image.save(f"{save_directory}/image_{i}.png")
The above code saves all generated images to the predefined save_directory
path. It uses the save()
method to save each image. Saved images are added with filenames in the format image_{i}.png
, where {i}
refers to the index of the image in the list.
In your terminal session, verify that the images are successfully saved to the directory.
$ ls /home/jupyter/notebooks/generated_images
To download a copy of the generated images, you can use a secure transfer protocol like SFTP, FTP, or RSync to fetch the files to your computer.
Below is what each parameter used in the model pipelines does:
prompt
: Represents the input text prompt that guides the image generation process
negative_prompt
: It guides on what to ignore in image generation. If not defined, you need to pass negative_prompt_embeds
instead. It's ignored when you're not using guidance guidance_scale < 1
generator
: An instance of the torch.Generator
class that allows you to control the random number generation
seed
: Specifies the random seed used to initialize the model and the data loader. This helps to ensure that the results are reproducible
guidance_scale
: It improves adherence to text prompts and affects sample quality. Values between 7
and 8.5
work well, and the default value is 7.5
add_watermarker
: Whether to use the invisible_watermark
library to watermark output images. If not defined, it will default to True if the package is installed, otherwise no watermarker will be used
images
: A list of all generated image objects
height
: Sets the height in pixels of the generated image in the pipeline
width
: Sets the width in pixels of the generated image in the pipeline
numinferencesteps
: It defines the number of steps involved in the inference process. It's recommended to set it to 50
to balance generation speed and result quality. A smaller value leads to faster results and vice versa
In this guide, you generated images using the Stable Diffusion XL (SDXL) model on a Vultr A100 Cloud GPU server. Additionally, you refined the generated images using Refiner, inpainted the mask images using an inpainting pipeline.
For more information, please visit the following resources: