Author: Vishwas Agrawal
Last Updated: Mon, Aug 21, 2023
Diffusers is a Hugging Face library that provides access to pre-trained diffusion models in the form of prepackaged pipelines. It offers tools for building and training diffusion models, and includes many different core neural network models used as building blocks to create new pipelines.
This article explains how to use Hugging Face Diffusion models on a Vultr Cloud GPU server. You will use several models to generate image and audio results on the server.
Before you begin:
Deploy a fresh A100 Ubuntu 22.04 Cloud GPU Server on Vultr with at least:
Using SSH, access the server.
Create a non-root sudo user and switch to the account
Jupyter Notebook is an open-source application that offers a web-based development environment to create documents with live code, visualizations, and equations. To run models interactively on your Vultr Cloud GPU server, install Jupyter Notebook as described in the steps below.
Install the pip package manager
$ sudo apt install python3-pip
Using pip, install the Notebook package
$ sudo pip install notebook
Open the Jupyter Notebook port 8888 through the firewall to allow access to the web interface
$ sudo ufw allow 8888
Start Jupyter Notebook
$ jupyter notebook --ip=0.0.0.0
The above command starts Jupyter Notebook and allows connections from all server interfaces as declared by 0.0.0.0. When successful, copy the generated access token displayed in your output:
[I 2023-08-10 12:57:52.455 ServerApp] Jupyter Server 2.7.0 is running at:
[I 2023-08-10 12:57:52.455 ServerApp] http://HOSTNAME:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] http://127.0.0.1:8888/tree?token=73631c92ba278d265aedeb3b199bd4d48e5ef5b2eed0ae06
[I 2023-08-10 12:57:52.455 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
If the command fails to run, close your SSH session and start it again, then launch Jupyter Notebook
$ exit
In a web browser such as Chrome, access Jupyter Notebook using your access token. Replace the example IP address 192.0.2.100 with your actual server IP
http://192.0.2.100:8888/tree?token=YOUR_TOKEN
A pipeline is a high-level interface that packages the components required to perform different predefined tasks such as image-generation, image-to-image-generation, and audio-generation. You can run a pipeline by specifying a task and letting it use the default settings for everything else. It's also possible to custom-build a pipeline by specifying the model, tokenizer, and other parameters.
Examples in this article are based on image and audio generation models and cover both pipeline approaches. Before loading new models in a Notebook session, it's recommended to close and restart the IPython notebook kernel. This clears the old models from memory and frees up space for new models.
To run code in a Notebook session, add code in the code cell fields, and press CTRL + ENTER, or press the Run button on the main toolbar.
The Stable Diffusion v2.1 model is a fork of the stable-diffusion-2 checkpoint, trained with 55 thousand additional steps on the same dataset and then fine-tuned with 155 thousand extra steps on 768x768 images. In this section, use the model as described in the steps below.
Open a new Jupyter Notebook file. Rename it to stablediffusion
Install the required global packages
!pip install scipy safetensors matplotlib
To use the model, import the following packages
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
The StableDiffusionPipeline class provides an interface to the Stable Diffusion v2.1 model for generating images. DPMSolverMultistepScheduler provides a fast scheduler that generates good outputs with around 20 steps, and torch enables support for GPU tensor computations.
Declare the model
model_id = "stabilityai/stable-diffusion-2-1"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
The parameters passed to the from_pretrained() method are:
model_id: Loads the "stabilityai/stable-diffusion-2-1" model. The model ID can also be the path to a local directory containing model weights or a path to a checkpoint file (see the sketch after this list)
torch_dtype: The data type of the tensors used for pipeline computations. torch.float16 specifies that the model computations run in 16-bit floating point (instead of the default 32-bit). Set torch_dtype="auto" to let the system choose the optimal data type; when omitted, the pipeline loads in full precision (32-bit)
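The model ID accepts a local path as well. Below is a minimal sketch, assuming the pipe object declared above, that saves the pipeline weights to a local directory and reloads them by passing that path as the model ID. The directory name is an arbitrary example.
# Save the pipeline weights to a local directory (example path)
pipe.save_pretrained("./stable-diffusion-2-1-local")
# Reload the pipeline by passing the local path as the model ID
pipe = StableDiffusionPipeline.from_pretrained("./stable-diffusion-2-1-local", torch_dtype=torch.float16).to("cuda")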
In diffusion models, a Scheduler iteratively adds noise to samples during training and updates samples based on the model outputs to de-noise them during inference. It defines the update rule used to solve the underlying differential equation.
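The scheduler is a swappable component. The following is a minimal sketch, assuming the pipe object declared above, that replaces DPMSolverMultistepScheduler with EulerDiscreteScheduler through the same from_config() mechanism. Skip it to keep the scheduler declared in the previous step.
from diffusers import EulerDiscreteScheduler
# Reuse the existing scheduler configuration while changing the solver
pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)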
Generate an image by providing a prompt as below. Replace An astronaut landing on planet with your desired prompt
prompt = "An astronaut landing on planet"
image = pipe(prompt).images
image[0]
The above code declares the prompt, feeds it to the previously declared pipeline, and reads the images attribute. A different image is generated each time you run the cell. You can enhance the prompt with details like the camera lens and environment, and include any other relevant information to refine your desired outcome.
Below are the accepted image generation parameters:
prompt: Represents the input text prompt that guides the image generation process
generator: An instance of the torch.Generator class that allows you to control the random number generation. Specifying the seed value ensures that the generator produces consistent and deterministic outputs when used repeatedly with the same seed
guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. It improves adherence to text prompts and affects sample quality. Values between 7 and 8.5 work well, and the default value is 7.5
images: The images attribute of the pipeline output, a list of all generated image objects
num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50 and balances the generation speed and result quality. A smaller value leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time (see the sketch after this list)
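The following is a minimal sketch that passes several of these parameters to the pipeline declared earlier. The prompt text and seed value are arbitrary examples.
# Fix the seed so repeated runs produce the same image
generator = torch.Generator("cuda").manual_seed(42)
image = pipe(
    prompt="An astronaut landing on planet, 35mm photograph, cinematic lighting, highly detailed",
    generator=generator,
    guidance_scale=7.5,
    num_inference_steps=30,
).images
image[0]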
The An astronaut landing on planet prompt generates an image like the one below:
AudioLDM is a text-to-audio latent diffusion model (LDM) with 1.5 million training steps. The model incorporates over 700 CLAP audio dimensions and 400 million parameters. By taking a text prompt as input, it predicts the corresponding audio output, and generates realistic text-conditional sound effects, human speech, and music samples. Run the model to generate audio results as described in the steps below.
Open a new Jupyter Notebook file. Rename it to audioldm
In a new code cell, install the required packages
!pip install scipy
To use the model, import the necessary packages
from diffusers import AudioLDMPipeline
import torch
Declare the pipeline
model_id = "cvssp/audioldm-m-full"
pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
In the above command, the AudioLDMPipeline instance uses the pre-trained model specified by model_id. torch_dtype=torch.float16 sets the data type to 16-bit floating-point, which helps with memory efficiency and faster computations. The pipeline is then moved to the GPU using cuda for faster processing.
Generate audio by providing a prompt. Replace Piano and violin plays with your desired text prompt
prompt = "Piano and violin plays"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
In the above command, the num_inference_steps parameter specifies the number of diffusion steps (iterations) used in the generation process, and audio_length_in_s sets the desired duration of the generated audio in seconds. The resulting audio is stored in the audio variable.
Display the generated audio
from IPython.display import Audio
Audio(audio, rate=16000)
The above code block allows you to play and listen to the generated audio using the Audio function from the IPython library. The rate=16000 argument specifies the sampling rate of the audio, set to 16000 samples per second.
Save the audio to a file
import scipy
scipy.io.wavfile.write("file_name.wav", rate=16000, data=audio)
The above code saves the generated audio as a WAV file named file_name.wav using scipy.io.wavfile.write(). The rate=16000 argument ensures that the audio saves with the correct sampling rate.
When using the model, the following are the accepted parameters:
prompt: Represents the input text prompt that guides the audio generation process. If not defined, you need to pass prompt_embeds
audio_length_in_s: Sets the value of the audio_length_in_s parameter in the pipeline: the length of the generated audio sample in seconds, with a default value of 5.12 seconds
num_inference_steps: Sets the value of num_inference_steps in the pipeline that defines the number of steps involved in the inference process. By default, it's set to 10 to balance generation speed and result quality. A smaller number of de-noising steps leads to faster results, whereas a larger value enhances quality at the cost of a longer generation time
guidance_scale: Sets the value of the guidance_scale parameter in the pipeline. A higher value encourages the model to generate audio that is closely linked to the text prompt at the expense of lower sound quality. It's enabled when guidance_scale is greater than 1, and the default value is 2.5
negative_prompt: Sets the value of the negative_prompt parameter in the pipeline. It guides what to ignore in audio generation. If not defined, you need to pass negative_prompt_embeds instead. It's ignored when not using guidance (guidance_scale < 1)
num_waveforms_per_prompt: Sets the value of the num_waveforms_per_prompt parameter in the pipeline: the number of waveforms to generate per prompt, with a default value of 1
eta: Sets the value of the eta parameter in the pipeline. It corresponds to the parameter eta (η) from the DDIM paper. It only applies to the DDIMScheduler and is ignored in other schedulers, with the default value set to 0.0
return_dict: Sets the value of the return_dict parameter in the pipeline to return a StableDiffusionPipelineOutput instead of a plain tuple; the default value is True (see the sketch after this list)
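The following is a minimal sketch that passes several of these parameters to the AudioLDM pipeline declared earlier. The prompt, negative prompt, and seed are arbitrary examples.
# Fix the seed so repeated runs produce the same audio
generator = torch.Generator("cuda").manual_seed(0)
waveforms = pipe(
    prompt="Piano and violin plays",
    negative_prompt="low quality, distorted",
    num_inference_steps=20,
    audio_length_in_s=5.0,
    num_waveforms_per_prompt=2,
    generator=generator,
).audios
# Play the first of the generated waveforms
Audio(waveforms[0], rate=16000)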
Below are other AudioLDM variants with their respective training steps; a loading sketch follows the list:
audioldm-s-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim and 421M parameters
audioldm-s-full-v2: More than 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 128 UNet dim and 421M parameters
audioldm-m-full: 1.5M training steps, with audio conditioning, 1024 CLAP audio dim, 192 UNet dim and 652M parameters
audioldm-l-full: 1.5M training steps, no audio conditioning, 768 CLAP audio dim, 256 UNet dim and 975M parameters
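To try one of these variants, change the model_id in the pipeline declaration. The following is a minimal sketch that loads the audioldm-s-full-v2 checkpoint; the rest of the generation steps stay the same.
from diffusers import AudioLDMPipeline
import torch
# Load a different AudioLDM variant by changing the repository ID
model_id = "cvssp/audioldm-s-full-v2"
pipe = AudioLDMPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")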
ControlNet is a neural network structure that controls a pre-trained image diffusion model by adding extra conditions. This checkpoint corresponds to the ControlNet conditioned on human pose estimation. It allows you to supply a conditioning image that guides and manipulates the image generation process.
It accepts scribbles, edge maps, pose key points, depth maps, segmentation maps, and normal maps as the condition input to guide the content of the generated image. In this section, apply the ControlNet model as described in the steps below.
Open a new Jupyter Notebook file. Rename it to sd-controlnet
Install the necessary packages
!pip install controlnet_aux matplotlib mediapipe
To use the model, import the required packages
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image
Load an image. Replace https://example.com/image.png with your actual image source
openpose = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')
image = load_image("https://example.com/image.png")
image = openpose(image)
The above code block loads the pre-trained OpenposeDetector and processes the input image from the specified URL. The openpose object estimates the human pose in the image and returns the processed image with pose information. To read your image, make sure it has a target file extension such as .png or .jpg.
Specify the model parameters with 16-bit weights
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
The above code loads the ControlNetModel and the StableDiffusionControlNetPipeline. torch_dtype=torch.float16 sets the data type to 16-bit floating-point for improved memory efficiency and faster computations.
Input a text prompt to generate a new image using the model. Replace Chef in kitchen with your desired prompt
pipe.enable_model_cpu_offload()
image = pipe("Chef in kitchen", image, num_inference_steps=20).images
image[0]
The above code block uses pipe to generate a new image based on the prompt Chef in kitchen and the pose image processed earlier. The num_inference_steps parameter sets the number of diffusion steps used in the generation process. The generated result is then stored in the image variable.
The following are the accepted model parameters:
prompt: Represents the input text prompt that guides the image generation process. When not defined, you have to pass prompt_embeds instead
height: Defines the height in pixels of the generated image in the pipeline
width: Sets the width in pixels of the generated image in the pipeline
num_inference_steps: Sets the value of num_inference_steps in the pipeline. It defines the number of steps involved in the inference process. By default, it's set to 50 and balances generation speed and result quality. A smaller number of de-noising steps leads to faster results, and a larger value enhances quality at the cost of a longer generation time
guidance_scale: Sets the value of the guidance_scale parameter in the pipeline, called Classifier-Free Diffusion Guidance. It's enabled by setting guidance_scale > 1. A higher guidance scale generates images that are closely linked to the text prompt, usually at the expense of lower image quality. Values between 7 and 8.5 work well, and the default value is 7.5
negative_prompt: Sets the value of the negative_prompt parameter in the pipeline to specify the prompt that should not guide the image generation. When not defined, you have to pass negative_prompt_embeds instead. It's ignored when not using guidance (guidance_scale less than 1)
num_images_per_prompt: Sets the num_images_per_prompt parameter value in the pipeline to determine the number of images to generate per prompt
prompt_embeds: Sets the prompt_embeds value in the pipeline
negative_prompt_embeds: Sets the value of the negative_prompt_embeds parameter in the pipeline. You can apply pre-generated negative text embeddings to tweak text inputs, for example, prompt weighting. When not provided, negative_prompt_embeds generates from the negative_prompt input argument
output_type: Sets the output_type parameter value in the pipeline to define the output format of the generated image. You can choose between PIL.Image and np.array, with the default value set to PIL
callback: Sets the callback parameter value in the pipeline
callback_steps: Sets the callback_steps parameter value in the pipeline, the frequency at which the callback function is called. When not specified, the default value is 1
controlnet_conditioning_scale: Sets the controlnet_conditioning_scale parameter value in the pipeline. ControlNet outputs are multiplied by controlnet_conditioning_scale before being added to the residual in the original UNet. When multiple ControlNets are specified in init, you can set the corresponding scale as a list, with the default set to 1.0
guess_mode: Sets the guess_mode parameter value in the pipeline. In this mode, the ControlNet encoder tries to recognize the content of the input image even if you remove all prompts. The default value is False, but it's recommended to use a guidance_scale value between 3.0 and 5.0 (see the sketch after this list)
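The following is a minimal sketch that passes several of these parameters to the ControlNet pipeline declared earlier. The prompt, negative prompt, and seed are arbitrary examples, and image is the pose image produced by the OpenposeDetector step.
# Fix the seed so repeated runs produce the same image
generator = torch.Generator("cuda").manual_seed(7)
output = pipe(
    "Chef in kitchen",
    image,
    num_inference_steps=20,
    guidance_scale=7.5,
    negative_prompt="lowres, blurry, deformed hands",
    controlnet_conditioning_scale=1.0,
    generator=generator,
).images
output[0]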
The "Chef in kitchen" prompt generates an image like the one below:
Below are other variants available for the same model; a loading sketch follows the list:
lllyasviel/sd-controlnet-canny: Conditioned on Canny edges
lllyasviel/sd-controlnet-depth: Conditioned on Depth estimation
lllyasviel/sd-controlnet-hed: Conditioned on HED Boundary
lllyasviel/sd-controlnet-mlsd: Conditioned on M-LSD straight line detection
lllyasviel/sd-controlnet-normal: Conditioned on Normal Map Estimation
lllyasviel/sd-controlnet-openpose: Conditioned on Human Pose Estimation
lllyasviel/sd-controlnet-scribble: Conditioned on Scribble images
lllyasviel/sd-controlnet-seg: Conditioned on Image Segmentation
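To use one of these variants, load its checkpoint in place of the openpose one and supply a matching conditioning image. The following is a minimal sketch that swaps in the Canny edge variant; preparing a Canny edge image as the conditioning input is up to you.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
# Load a ControlNet conditioned on Canny edge maps instead of human pose
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None, torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()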
In this article, you implemented Hugging Face diffusion models and used the models to generate results. To use other diffusion models, visit the respective model card pages to learn how to use them. Additionally, studying the model's documentation provides valuable insights into its specific details and configuration options.
For more information, visit the following documentation resources: