FASHN VTON v1.5

A virtual try-on model that generates photorealistic images directly in pixel space without requiring segmentation masks.

FASHN VTON v1.5 examples

Model Description

FASHN VTON v1.5 is a state-of-the-art virtual try-on model based on the MMDiT (Multimodal Diffusion Transformer) architecture. Given a person image and a garment image, the model generates a photorealistic image of the person wearing the garment. It supports both model-worn garments and flat-lay product shots.

Key innovations:

  • Pixel-space generation: Operates directly on RGB pixels with a 12x12 patch embedding, eliminating information loss from VAE encoding and preserving fine details in textures and patterns.
  • Maskless inference: Runs in segmentation-free mode by default, allowing garments to take their natural form without shape constraints from the original clothing.
  • Body identity preservation: Maintains tattoos, body characteristics, and cultural garments (e.g., hijabs).

Architecture

Component             Specification
Base                  MMDiT (Multimodal Diffusion Transformer)
Parameters            972M
Hidden Size           1280
Attention Heads       10
Double-Stream Blocks  8 (cross-modal attention)
Single-Stream Blocks  16 (self-attention)
Patch Mixer Blocks    4 (preprocessing)
Patch Size            12x12
Output Resolution     576x864
Precision             bfloat16 (Ampere+ GPUs)
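
Because the model patchifies RGB pixels directly (no VAE downsampling), the token count follows from the table above: a 576x864 output with 12x12 patches gives a 48x72 grid, i.e. 3456 tokens. A quick arithmetic check (illustrative only, not part of the released code):

# Sequence length implied by pixel-space patchification.
width, height = 576, 864
patch = 12
num_tokens = (width // patch) * (height // patch)
print(num_tokens)  # 48 * 72 = 3456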

Inputs

  • Person image: RGB image of the person to dress
  • Garment image: RGB image of the garment (model photo or flat-lay)
  • Category: "tops", "bottoms", or "one-pieces"
  • Pose keypoints: Extracted via DWPose (handled automatically by the pipeline)

Outputs

  • Photorealistic RGB image of the person wearing the specified garment

Usage

Installation

git clone https://github.com/fashn-AI/fashn-vton-1.5.git
cd fashn-vton-1.5
pip install -e .

Download Weights

python scripts/download_weights.py --weights-dir ./weights

This downloads:

  • model.safetensors: TryOnModel weights (~2 GB)
  • dwpose/: DWPose ONNX models for pose detection

The human parser weights (~244 MB) are automatically downloaded on first use.

Quick Start

from fashn_vton import TryOnPipeline
from PIL import Image

# Initialize pipeline (auto-detects GPU)
pipeline = TryOnPipeline(weights_dir="./weights")

# Load images
person = Image.open("person.jpg").convert("RGB")
garment = Image.open("garment.jpg").convert("RGB")

# Run inference
result = pipeline(
    person_image=person,
    garment_image=garment,
    category="tops",  # "tops" | "bottoms" | "one-pieces"
)

# Save output
result.images[0].save("output.png")

CLI

python examples/basic_inference.py \
    --weights-dir ./weights \
    --person-image person.jpg \
    --garment-image garment.jpg \
    --category tops

Parameters

Parameter           Type   Default   Description
category            str    required  "tops", "bottoms", or "one-pieces"
garment_photo_type  str    "model"   "model" for worn garments, "flat-lay" for product shots
num_samples         int    1         Number of output images (1-4)
num_timesteps       int    30        Sampling steps (20 = fast, 30 = balanced, 50 = quality)
guidance_scale      float  1.5       Classifier-free guidance strength
seed                int    42        Random seed for reproducibility
segmentation_free   bool   True      Maskless mode; better body preservation and unconstrained garment volume (less biased by the original clothing shape)
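
A call that overrides the defaults might look like the sketch below. The parameter names come from the table above, and the pipeline and images are those from the Quick Start; check the repository for exact behavior.

result = pipeline(
    person_image=person,
    garment_image=garment,
    category="one-pieces",
    garment_photo_type="flat-lay",  # product shot instead of a worn garment
    num_samples=4,                  # up to 4 outputs per call
    num_timesteps=50,               # quality preset
    guidance_scale=1.5,
    seed=42,                        # fixed seed for reproducibility
    segmentation_free=True,         # default maskless mode
)
for i, image in enumerate(result.images):
    image.save(f"output_{i}.png")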

Categories

Category    Description          Examples
tops        Upper-body garments  T-shirts, blouses, jackets, sweaters
bottoms     Lower-body garments  Pants, skirts, shorts
one-pieces  Full-body garments   Dresses, jumpsuits, rompers

Training

FASHN VTON v1.5 was trained from scratch in pixel space using a two-phase approach:

  1. Phase 1: 18M masked try-on pairs
  2. Phase 2: a 50/50 mix of masked pairs and 4M synthetic triplets generated with the Phase 1 checkpoint

Training optimizations included dropping up to 75% of input tokens to reduce compute; a sketch of the general idea follows below.
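
The repository does not spell out the token-dropping implementation. The sketch below illustrates the general technique under the assumption that patch tokens arrive as a (batch, tokens, dim) tensor; all names here are hypothetical.

import torch

def drop_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim) patch embeddings.
    # Keeping 25% corresponds to the reported "up to 75%" drop rate.
    b, n, d = tokens.shape
    num_kept = max(1, int(n * keep_ratio))
    # Draw an independent random permutation per sample; keep the first num_kept indices.
    idx = torch.argsort(torch.rand(b, n, device=tokens.device), dim=1)[:, :num_kept]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

Processing only the kept tokens shortens the attention sequence, which is where the compute savings come from.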

Performance

  • Inference time: ~5 seconds on an NVIDIA H100
  • Memory: ~8 GB of VRAM for inference
  • Precision: Automatically uses bfloat16 on Ampere+ GPUs (RTX 30xx/40xx, A100, H100); see the detection sketch below
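
The dtype selection can be reproduced with a standard PyTorch capability check (Ampere corresponds to compute capability 8.0). This is a generic idiom, not necessarily the pipeline's internal logic:

import torch

def pick_dtype() -> torch.dtype:
    # bfloat16 has native support from Ampere (sm_80) onward; otherwise fall back to float32.
    if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8:
        return torch.bfloat16
    return torch.float32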

Limitations

  • Resolution: The 576x864 output is lower than that of some VAE-based architectures, which reach 1K+ resolutions
  • Body shape preservation: May be imperfect, a side effect of the synthetic triplet generation used in training
  • Garment transitions: Traces of the original garment may remain when swapping from long to short or from bulky to slim garments
  • Hardware requirements: A dedicated GPU is recommended for reasonable inference speeds

Citation

@article{bochman2026fashnvton,
  title={FASHN VTON v1.5: Efficient Maskless Virtual Try-On in Pixel Space},
  author={Bochman, Dan and Bochman, Aya},
  journal={arXiv preprint},
  year={2026},
  note={Paper coming soon}
}

License

This model is released under the Apache-2.0 License.

Third-party components:

  • DWPose: ONNX pose-estimation models used for keypoint extraction
  • Human parser: segmentation weights downloaded automatically on first use
