On 12 Feb 2024, Stability.ai released Stable Cascade as a “research preview” (non-commercial license), and over the weekend ComfyUI was updated to support this new model! Time to give it a go!

About Stable Cascade

I am aware that Stable Cascade employs compressed latent spaces for faster inference.

This innovative text-to-image model introduces an interesting three-stage approach, setting new benchmarks for quality, flexibility, fine-tuning, and efficiency, with a focus on further eliminating hardware barriers.

Stable Cascade differs from Stability AI's Stable Diffusion lineup of models in that it is built on a pipeline comprising three distinct models: Stages A, B, and C. This architecture allows for hierarchical compression of images, achieving remarkable outputs while utilizing a highly compressed latent space.

However, to a total layman like me, the pipeline looks very similar to that of SDXL. The typical SDXL pipeline is a 2-step Base + Refiner followed by a VAE decode (refer to my post SDXL with ComfyUI), whereas, on the surface, the Stable Cascade pipeline just changes the terminology to a 2-step Stage C + Stage B, followed by Stage A (the VAE decode) instead (yes, I know this is totally inaccurate) :P

Stable Cascade comes in two flavours: a full model and a lite model, each available in full-precision (fp32) and BFloat16 (bf16) versions.

  • The full (3.6B + 1.5B parameters for Stages C and B respectively) and lite (1B + 700M parameters) models are altogether different models - they generate quite different outputs! Lite is 30-40% of the size of the full model.
  • On the other hand, there is almost no perceptible difference between fp32 and bf16 outputs, the latter being half the size of the former.

FYI, both Float (fp) and BFloat (bf) are floating-point representations, but they allocate a different number of bits to the exponent and mantissa.
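As a concrete illustration: bf16 keeps float32's full 8-bit exponent (so the same numeric range) but truncates the mantissa to 7 bits, i.e. it is simply the top half of a float32. A minimal Python sketch of that truncation (the function names are mine, for illustration only):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    # bfloat16 is just the top 16 bits of a float32:
    # 1 sign bit + 8 exponent bits + 7 mantissa bits.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float32(b: int) -> float:
    # Expand back to float32 by zero-filling the dropped mantissa bits.
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

pi32 = 3.14159265
pi_bf16 = bfloat16_bits_to_float32(float32_to_bfloat16_bits(pi32))
print(pi_bf16)  # 3.140625 - same magnitude, only ~2-3 significant decimal digits
```

Half the bits per weight is exactly why the bf16 model files are half the size of their full-precision counterparts, while the unchanged exponent keeps the dynamic range the weights need.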

Model Sizes

I ran a single speed test for each configuration to confirm that the smaller the model, the faster the generation (and, by extension, the lower the VRAM/RAM requirement). Here is the relative difference in generation time, including load time - ComfyUI was restarted before each test.

  • For comparison, SDXL Base + Refiner (20 + 10 steps in total, batch size 2) completed in about 1 minute 37 seconds (100%).
  • Stable Cascade B+C Full bf32 (workflow is shown below, also 20 + 10 steps, batch size 2) completed in 2 minutes 27 seconds,
  • But Stable Cascade B+C Lite bf16 completed in just 41 seconds.
| Model & Precision | Stage C File Size | Stage B File Size | Generation Time |
|---|---|---|---|
| SDXL (Base + Refiner) | 6.94 GB (Base) | 6.08 GB (Refiner) | 100% |
| SC Full Stage C + B, both fp32 | 14.4 GB | 6.25 GB | 173% |
| SC Full Stage C + B, both bf16 | 7.18 GB | 3.13 GB | 141% |
| SC Lite Stage C + B, both fp32 | 4.12 GB | 2.8 GB | 48% |
| SC Lite Stage C + B, both bf16 | 2.06 GB | 1.4 GB | 43% |
| SC Lite Stage C + Full Stage B, both bf16 | 4.12 GB | 3.13 GB | 54% |
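The file sizes line up neatly with the parameter counts: a weights file is roughly parameter count times bytes per parameter. A quick sanity check in Python (the helper name is mine; this ignores the small safetensors header overhead):

```python
def approx_size_gb(params_billions: float, bits_per_param: int) -> float:
    # Weight file size ~= parameter count x bytes per parameter.
    # 1e9 params at 4 bytes each is ~4 GB (ignoring file metadata).
    return params_billions * (bits_per_param / 8)

print(approx_size_gb(3.6, 32))  # 14.4 -> matches Stage C Full fp32
print(approx_size_gb(3.6, 16))  # 7.2  -> ~7.18 GB, Stage C Full bf16
print(approx_size_gb(0.7, 32))  # 2.8  -> matches Stage B Lite fp32
```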

I prefer using the Full model for both stages even though it takes longer. On my setup, Stable Cascade at 10 + 20 steps takes roughly the same amount of time as SDXL at 20 + 10 steps - but produces far better images. In fact, I went as low as 5 + 2 steps and the output was still reasonable, but a bit “flat”, with “plasticky” skin, since it lacks detail. If some parts look dithered or noisy, then Stage B needs more steps.

An alternative compromise would be to use the Lite Stage C with the Full Stage B, which incurs only a small increase in generation time. To my eye, this method makes the image a little sharper with more detail.

Installation

These instructions no longer apply: ComfyUI has released new checkpoints that merge everything needed into just 2 files! See my next post for the simpler workflow.

To download the model and update ComfyUI:

  1. Head over to Stable Cascade on HuggingFace.
  2. Download these model files under the Files and versions tab to \ComfyUI\models\unet:
    • stage_b_bf16.safetensors, stage_b.safetensors, stage_b_lite_bf16.safetensors, or stage_b_lite.safetensors
    • stage_c_bf16.safetensors, stage_c.safetensors, stage_c_lite_bf16.safetensors, or stage_c_lite.safetensors
  3. Download this file to \ComfyUI\models\vae:
    • stage_a.safetensors
  4. From the text_encoder folder, download to \ComfyUI\models\clip:
    • model.safetensors, which I renamed to clip_g_sdxl.fp16.safetensors
  5. To update ComfyUI Windows portable, run update_comfyui.bat in the ComfyUI\update\ folder
  6. Now, start ComfyUI as normal

Workflow

This workflow is based on the example workflow provided by comfyanonymous. I am using the same prompt and seed I used in my first SDXL for ComfyUI post:

ComfyUI running Stable Cascade locally

  • UNetLoader to load Stage C model,
  • CLIPLoader to load CLIP text encoder,
  • CLIPTextEncode to encode the positive and negative prompt,
  • The first KSampler node uses 20 steps with CFG set to 4.0
    • notice the new StableCascade_EmptyLatentImage node that generates the empty latent image with a compression factor of 42,
    • other samplers do work; I also used dpmpp_2m with the karras scheduler
    • I do not know what KSampler parameter values to use, so I did not change any!
  • Use the positive prompt conditioning as input to ConditioningZeroOut...
  • And then pass the zeroed conditioning and the previous KSampler latent output to StableCascade_StageB_Conditioning,
  • The second KSampler uses 10 steps with CFG at 1.1 per the example:
    • the output conditioning of the last node as the positive input,
    • the previous zeroed-out conditioning as the negative input,
    • and another UNetLoader to load Stage B model.
    • here, other samplers really do not work so well; it's best to leave it as euler_ancestral with the simple scheduler
  • Then VAELoader to load Stage A,
  • And finally, VAEDecode to generate the image.
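The node graph above can be summarised as pseudocode - names mirror the ComfyUI node titles and the filenames from the installation step, and this is just the dataflow as I understand it, not runnable code:

```python
# Stage C: text -> highly compressed latent
unet_c = UNetLoader("stage_c_bf16.safetensors")
clip = CLIPLoader("clip_g_sdxl.fp16.safetensors")
pos = CLIPTextEncode(clip, positive_prompt)
neg = CLIPTextEncode(clip, negative_prompt)
# EmptyLatentImage emits two latents: one for Stage C, one for Stage B
latent_c, latent_b = StableCascade_EmptyLatentImage(width=1024, height=1024,
                                                    compression=42)
out_c = KSampler(unet_c, positive=pos, negative=neg, latent=latent_c,
                 steps=20, cfg=4.0)

# Stage B: refine, conditioned on Stage C's sampled latent
zeroed = ConditioningZeroOut(pos)
cond_b = StableCascade_StageB_Conditioning(zeroed, out_c)
unet_b = UNetLoader("stage_b_bf16.safetensors")
out_b = KSampler(unet_b, positive=cond_b, negative=zeroed, latent=latent_b,
                 steps=10, cfg=1.1,
                 sampler="euler_ancestral", scheduler="simple")

# Stage A: decode the latent back to pixels
vae = VAELoader("stage_a.safetensors")
image = VAEDecode(out_b, vae)
```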

I do not know what sizes work best. Apparently, Stable Cascade works best at 1024 x 1024 with a compression factor of 42. 1536 x 1536 and 1024 x 1904 are okay. However, when I tried 640 x 640, 1920 x 1152 and 1344 x 768, I got pretty bad results (respectively: just bad, a stretched torso, and the head cropped off)
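One way to think about why small resolutions fail: a compression factor of 42 means the Stage C latent is roughly 42x smaller per side than the image, so at 640 x 640 there are very few latent "pixels" to work with. A rough Python sketch (the helper is my own approximation; I have not verified the exact rounding ComfyUI uses):

```python
def stage_c_latent_size(width: int, height: int,
                        compression: int = 42) -> tuple[int, int]:
    # Stage C operates in a latent ~compression-times smaller per side;
    # the exact rounding may differ from ComfyUI's implementation.
    return width // compression, height // compression

print(stage_c_latent_size(1024, 1024))  # (24, 24)
print(stage_c_latent_size(640, 640))    # (15, 15) - a tiny latent to sample in
```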

Experiments

I haven’t really tried too much at this stage, but I immediately noticed that generated images have fewer oddities compared to SDXL, especially when it comes to fingers and text! It’s still not perfect though; I still get the odd extra digit or misspelt word.

But generally Stable Cascade is a big improvement! Look at the example below - accurate fingers and text without any fancy positive prompt and no negative prompt!

ComfyUI running Stable Cascade locally - perfect text and hands

The pace of AI innovations and the constantly improving output never ceases to amaze! I continue to be impressed by Stability.ai’s open models; models that I can run locally on a PC without the latest CPU or GPU (though I am not sure about the new license).

Update 24 Feb 24: Please use the new Stable Cascade checkpoints and workflow