I've just broken down a really clever (and deployable!) technique from a recent paper that directly tackles this problem. It's a two-stage process you can basically bolt onto your existing image-generation pipeline without retraining your base model.
The core idea is this: you filter out the garbage early, then polish the best of what's left.
The Two-Stage Fix
This technique cleverly combines Rejection Sampling and Iterative Refinement to move your final output into a higher-quality region of the model's data space.
1. Quality Filter (Rejection Sampling)
- You generate a small batch of images from your prompt (e.g., 4-8 images).
- You use a simple quality score (like a tweaked CFG score) to pick the best one or two images from that batch and throw out the rest. This weeds out the obviously janky stuff from the start (see the sketch after this list).
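Here's a minimal Python sketch of stage 1, assuming the Hugging Face diffusers and transformers libraries. The paper's exact scoring function isn't reproduced here; I'm substituting a plain CLIP image-text similarity score as a stand-in, and `best_of_batch` is my own hypothetical helper:

```python
# Minimal sketch of stage 1 (rejection sampling), assuming Hugging Face
# diffusers + transformers. The CLIP score below is a stand-in for the
# paper's quality metric, not the paper's actual score.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_image(image, prompt):
    # CLIP image-text similarity as a simple quality proxy.
    inputs = clip_proc(text=[prompt], images=image, return_tensors="pt")
    with torch.no_grad():
        return clip(**inputs).logits_per_image.item()

def best_of_batch(prompt, batch_size=8, keep=1):
    # Generate the whole candidate batch in one call...
    images = pipe(prompt, num_images_per_prompt=batch_size).images
    # ...then keep only the top-scoring image(s) and discard the rest.
    ranked = sorted(images, key=lambda im: score_image(im, prompt), reverse=True)
    return ranked[:keep]
```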
2. Fidelity Boost (Iterative Refinement)
- You take that best-of-the-batch image.
- You lightly "re-noise" it (add a tiny bit of diffusion noise).
- You then run it through a few steps of your diffusion model's denoising process again. This forces the model to re-evaluate and "fix" tiny imperfections, sharpen details, and make the image even more coherent.
- You can loop this polishing step a few times (sketched in code below).
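A minimal sketch of stage 2: in diffusers, an img2img pass at low strength does exactly this re-noise-then-denoise step (strength controls how much noise gets injected), so looping it gives you the polishing loop. The `strength=0.25` and `rounds=3` values are illustrative guesses, not the paper's settings:

```python
# Minimal sketch of stage 2 (iterative refinement). An img2img pass at
# low strength adds a small amount of diffusion noise and then denoises
# again -- the "light re-noise + partial denoise" step described above.
from diffusers import StableDiffusionImg2ImgPipeline

refiner = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def refine(image, prompt, rounds=3, strength=0.25):
    # Low strength keeps the composition intact and only polishes
    # details; rounds and strength here are illustrative, not tuned.
    for _ in range(rounds):
        image = refiner(prompt=prompt, image=image, strength=strength).images[0]
    return image
```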
This smarter sampling is where the big quality boost comes from: the paper reports up to a 65% human preference rate for images generated this way.
Example: Fixing the Steampunk Robot's Hands
This technique is excellent for solving those frustrating compositional errors and detail inconsistencies that plague complex generations.
Prompt:
Steampunk robot serving a cup of tea, intricate brass and copper plating, leather apron, detailed oil painting style by Zdzisław Beksiński
Standard Output (Typical Single-Pass Result): A robot with a generally correct aesthetic, but the brass piping is inconsistent, and the hands are muddy and indistinct.
Expected Output (With Iterative Refinement Applied): The robot's brass plating is sharp and consistent, the gear mechanisms are clearly defined, and the hands holding the cup have coherent, detailed joints. The refinement loop cleans up the "muddy" parts of the oil painting texture.
This isn't about getting a different image, but about ensuring the image you do get is of significantly higher fidelity and free of those pesky low-probability errors.
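If you wire the two sketches above together, the whole pipeline for this prompt is just a few lines:

```python
# Both stages applied to the steampunk prompt, using the helpers above.
prompt = ("Steampunk robot serving a cup of tea, intricate brass and "
          "copper plating, leather apron, detailed oil painting style "
          "by Zdzisław Beksiński")

best = best_of_batch(prompt, batch_size=8, keep=1)[0]   # stage 1: filter
final = refine(best, prompt, rounds=3, strength=0.25)   # stage 2: polish
final.save("steampunk_robot_refined.png")
```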
I've written a detailed breakdown of the paper, how the two stages work, and why it's so effective, complete with more practical examples and benchmarks.
You can read the full breakdown here:
https://www.instruction.tips/post/iterative-image-refinement-diffusion-models
Let me know your thoughts or if you've implemented similar techniques!