Stable Diffusion XL LoRA training
2024-01-22, article by Oscar

  • Howto
  • Machine-Learning

This article will show you how to...

  • prepare a dataset for SDXL training
  • find good settings
  • train a LoRA with Kohya's GUI

Introduction

In this article, we will show you how to train a LoRA for SDXL with great results on characters and people. We will use Kohya's GUI for training, but you can also use the command line or other GUIs. We will also show you how to prepare your data and which settings we found to work best for our setup.

Motivation

Generative AI models are becoming increasingly popular, with notable examples such as DALL-E 3, Midjourney, Firefly, and Stable Diffusion XL leading the way in image generation. Most of these models, however, are available only as a service and cannot be used locally. The release of Stable Diffusion XL (SDXL) marks a significant milestone by enabling the local generation of high-resolution images up to 1024x1024. Running a model locally may initially seem unnecessary for those simply seeking to create beautiful images, but the true advantage lies in the ability to fine-tune it to generate styles or subjects it was not originally trained on. This can be achieved in three ways:

  • training a model from scratch, which is impractical for most due to the need for vast amounts of data and computing power
  • fine-tuning a pre-trained model like the SDXL base model from Stability AI, which is less resource-intensive but still demanding
  • training a LoRA, which adapts a pre-trained model efficiently without the need for extensive retraining

What is a LoRA?

LoRA, or Low-Rank Adaptation, is an efficient approach to customizing AI models, particularly in the realm of generative AI. Unlike traditional fine-tuning, which updates all weights of a model and therefore requires extensive resources, LoRA trains only a small low-rank update to an existing model while the original weights stay frozen. This makes the personalization and adaptation of AI models accessible to a much wider range of users and applications.
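To make this concrete, here is a minimal sketch of the idea in PyTorch. This is an illustration of the technique, not kohya's actual implementation: the base layer is frozen, and only two small matrices are trained whose product forms the weight update.

import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (up @ down)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)   # "A" matrix
        self.up = nn.Linear(rank, base.out_features, bias=False)    # "B" matrix
        nn.init.zeros_(self.up.weight)             # update starts as a no-op
        self.scale = alpha / rank                  # see Network Alpha below

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

For a 1024x1024 linear layer, full fine-tuning touches about a million weights, while a rank-8 LoRA trains only 2 * 8 * 1024 = 16,384.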

Install Kohya's GUI

For this tutorial, we decided to use Kohya's GUI, as it is quite popular in the Stable Diffusion community and easy to use. Get Kohya's GUI from GitHub and follow the install instructions.

Prepare your data

First, select images that show your subject. You can train multiple subjects, but there should be at least a couple of pictures per subject. When training a LoRA to generate images of real people, we found it can be beneficial to train multiple subjects with one LoRA if the images of a particular person are of low quality (e.g. Instagram photos). This way the model can use the data from other subjects, for example to improve the skin details of a person's face. The downside is, of course, that the representation is less accurate, and in some cases features of another person's face can spill over into your generated images. Keep that in mind during image selection. In this HowTo, we train a LoRA for SDXL, so the images should have a resolution of no less than 1024x1024. Higher resolutions are desirable, but you will potentially need more VRAM. We find that resolutions with 2000 to 3000 pixels on the long edge work great.
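If you want to check your images before training, a small script like the following can help. This is a sketch that assumes Pillow is installed (pip install Pillow); the folder path is a placeholder for wherever you keep your source images.

from pathlib import Path
from PIL import Image

dataset = Path("Your_Training_Folder/source")     # placeholder path
for path in sorted(dataset.iterdir()):
    if path.suffix.lower() not in {".jpg", ".jpeg", ".png", ".webp"}:
        continue
    with Image.open(path) as img:
        w, h = img.size
    if min(w, h) < 1024:                          # too small for SDXL training
        print(f"too small: {path.name} ({w}x{h})")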

Folder Structure

First, create four folders:

Your_Training_Folder
|-- images
|   |-- 20_yourFirstTriggerWord [class]
|   `-- ...
|-- log
|-- model
`-- source

images: In this folder, create a subfolder whose name starts with a number (20 works fine in most cases), followed by an underscore, followed by the name of your subject. The name of your subject should be the same as the trigger word you want to use to tell SDXL to create an image containing that subject. We found it is best to use a combination of characters that has no real meaning, so as not to confuse the model. For example, if you decide your trigger word should be "Chris", chances are high that the model already associates other people with the name Chris.

The number in front of the trigger word sets the number of repeats for the images inside the folder, i.e. how often each image is shown to the model per epoch. For each subject you want to train, create one such folder and place the images of that subject inside.
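If you prefer to script this step, a sketch like the following creates the layout shown above. The repeat count of 20 and the trigger word "xqz7a" are just example values.

from pathlib import Path

root = Path("Your_Training_Folder")
repeats, trigger = 20, "xqz7a"    # example values; pick your own trigger word
for sub in (f"images/{repeats}_{trigger}", "log", "model", "source"):
    (root / sub).mkdir(parents=True, exist_ok=True)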

log / model / source: These folders are used by Kohya's GUI during training. You can leave them empty for now. After training, the model folder will contain the final trained model as well as all intermediate models saved during training.

Captions

Now we need to create a description of each image, called a caption. In Kohya's GUI, navigate to Utilities, Caption, WD14 Captioning. Paste the path to the folder with the images you want to create captions for and type .txt as "Caption file extension". Now you need to specify your trigger word again: type it into the "Prefix to add to WD14 caption" field, followed by a comma and a space. It should look like this: "trigger, ". The trailing space is important; without a comma and space cleanly separating your trigger word from the rest of the caption, SDXL will likely not learn your trigger word.
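Since a missing prefix is easy to overlook, you can verify the caption files afterwards with a small script like this one. It is a sketch; the folder path and trigger word are placeholders matching the examples above.

from pathlib import Path

folder = Path("Your_Training_Folder/images/20_xqz7a")   # placeholder
prefix = "xqz7a, "                          # trigger word + comma + space
for caption in folder.glob("*.txt"):
    text = caption.read_text(encoding="utf-8")
    if not text.startswith(prefix):
        caption.write_text(prefix + text, encoding="utf-8")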

Train your LoRA

Now let's look at some settings in Kohya's GUI. At the top of the page, navigate to LoRA, then Training. We found that the following settings work well for SDXL LoRAs trained on characters/people:

Source model: Here you can pick the SDXL checkpoint you downloaded. Stability AI's SDXL base model can be found on Hugging Face. Don't forget to check the SDXL Model checkbox.

Now navigate to Folders and specify the folders you created earlier. The mappings are:

Image folder = images

Output folder = model

Regularization folder = source

Logging folder = log

Also, make sure to set a Model output name.

Then we can move over to the Parameters tab. The section is again separated into three tabs: Basic, Advanced and Samples.

Basic

The most important settings here are:

Parameter                  | Value     | Comment
Learning rate              | 0.001     |
Max resolution             | 1024,1024 |
Text Encoder learning rate | 0.00005   | The text encoder (CLIP) is used to transform the prompt into an embedding space.
Unet learning rate         | 0.0001    | The UNet predicts the noise that is present in given latents.
Network Rank (Dimension)   | 128       | Neurons in the hidden layer of the small neural net in a LoRA. Larger -> more "storage" for features.
Network Alpha              | 16        | The smaller alpha, the larger the stored weights -> less likely to be rounded to 0.
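To see why the ratio of the last two values matters: the LoRA update is multiplied by alpha divided by rank at runtime (this is how kohya's scripts, like most LoRA implementations, apply alpha).

rank, alpha = 128, 16     # the values from the table above
scale = alpha / rank      # 0.125 -- multiplied onto the low-rank update
# For the same effective update, a smaller alpha forces the optimizer to
# store proportionally larger raw weights, which keeps them further away
# from being rounded to 0 in low-precision (fp16) training.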

We found that an 8:1 ratio of Network Rank (Dimension) to Network Alpha works great. Additionally, we set the following settings for our setup (RTX 3090, 24 GB VRAM & 48 GB RAM):

Parameter           | Value     | Comment
No half VAE         | checked   | Disable half-precision (fp16) for the VAE.
Optimizer           | AdamW8bit |
Mixed precision     | fp16      | fp16 & bf16 save VRAM, while mixed precision ensures good results.
Save precision      | fp16      | Datatype (precision) of the saved model.
Batch size          | 1         |
Epochs              | 10        |
Save every N epochs | 1         | Saves the model every N epochs.
Cache latents       | checked   | Caches compressed representations of the training images in main memory.
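As a rough sanity check, the number of training steps implied by these settings is images x repeats x epochs / batch size (kohya's scripts may round slightly differently). With the 29 images of our dataset described under Results:

images, repeats, epochs, batch_size = 29, 20, 10, 1
total_steps = images * repeats * epochs // batch_size
print(total_steps)    # 5800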

Advanced

Here we changed only two settings: Don't upscale bucket resolution to checked (so smaller images are not upscaled into larger buckets) and Bucket resolution steps to 64.
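For intuition, aspect-ratio bucketing picks a training resolution per image by keeping roughly the 1024x1024 pixel budget and snapping both sides to a multiple of the bucket step. The sketch below mirrors that idea; it is an approximation, not kohya's exact algorithm.

import math

def bucket_resolution(w, h, target=1024, step=64):
    scale = math.sqrt((target * target) / (w * h))   # fit the pixel budget
    return int(w * scale) // step * step, int(h * scale) // step * step

print(bucket_resolution(2000, 3000))   # (832, 1216)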

Now you only have to hit the Start training button and wait for the training to finish. You can monitor the training in the console output.

If you don't want to configure all the settings yourself, you can also download our config file here. You only need to change the paths to your folders for training and the location of your SDXL base model.

Results

Our dataset consists of one subject in three outfits in similar poses. Since we had only a couple of photos that were shot close up, we decided to include some crops in the dataset; that improved the rendition of facial features significantly. For copyright reasons, we cannot show the entire training set or all images generated from the trained model, but to give you an idea of what we achieved, we have some examples below. These include the three outfits present in the training data. We used 23 unique images and 5 crops, totaling 29 images. Most images showed the first outfit (a grey dress).

Some of our training pictures

The images you get from your model depend heavily on your dataset, but also on your prompts. We kept our negative prompt consistent across all generated images.

negative prompt: worst quality, normal quality, low quality, low res, blurry, text, watermark, logo, banner, extra digits, cropped, jpeg artefacts, signature, username, error, sketch, duplicate, ugly, monochrome, horror, geometry, mutation, disgusting, bad anatomy, bad hands, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, worst face, three crus, extra crus, fused crus, worst feet, three feet, fused feet, fused thigh, three thigh, fused thigh, extra thigh, worst thigh, missing fingers, extra fingers, ugly fingers, long fingers, horn, extra eyes, huge eyes, 2girl, 2boy, amputation, disconnected limbs, cartoon, cg, 3d, unreal, animate

We added "teeth" to our negative prompt, whenever we wanted a smile with no teeth or no smile at all. For our positive prompt we used a variety of prompts. For each picture in the gallery, the title is the prompt we used to generate the image.

Some of our generated images

Conclusion

If your goal is to generate images of a specific subject that is not represented in the training set of a given model, a LoRA is a great way to achieve that. We found that the results depend heavily on the dataset, but also on the prompts: it is best to use prompts that describe both the subject and the style of the image you want to generate. Adding images from multiple angles and settings, with different outfits (for people), can also improve results, especially the model's ability to produce images in settings that differ from the training data. For example, many of our generated images have corn in the background, because in all our training images the subject was standing in front of a cornfield.
