# FASHN VTON v1.5
A virtual try-on model that generates photorealistic images directly in pixel space without requiring segmentation masks.
## Model Description
FASHN VTON v1.5 is a state-of-the-art virtual try-on model based on the MMDiT (Multimodal Diffusion Transformer) architecture. Given a person image and a garment image, the model generates a photorealistic image of the person wearing the garment. It supports both model-worn garments and flat-lay product shots.
Key innovations:
- Pixel-space generation: Operates directly on RGB pixels with a 12x12 patch embedding, eliminating information loss from VAE encoding and preserving fine details in textures and patterns.
- Maskless inference: Runs in segmentation-free mode by default, allowing garments to take their natural form without shape constraints from the original clothing.
- Body identity preservation: Maintains tattoos, body characteristics, and cultural garments (e.g., hijabs).
## Architecture
| Component | Specification |
|---|---|
| Base | MMDiT (Multimodal Diffusion Transformer) |
| Parameters | 972M |
| Hidden Size | 1280 |
| Attention Heads | 10 |
| Double-Stream Blocks | 8 (cross-modal attention) |
| Single-Stream Blocks | 16 (self-attention) |
| Patch Mixer Blocks | 4 (preprocessing) |
| Patch Size | 12x12 |
| Output Resolution | 576x864 |
| Precision | bfloat16 (Ampere+ GPUs) |
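
As a quick sanity check on these numbers, the patch size and output resolution from the table determine the transformer's image sequence length. A minimal sketch (plain Python arithmetic, not part of the package):

```python
# Back-of-the-envelope math; only the patch size and output resolution
# come from the architecture table above.
PATCH_SIZE = 12            # 12x12 pixel patches
WIDTH, HEIGHT = 576, 864   # output resolution

patches_w = WIDTH // PATCH_SIZE    # 48 patch columns
patches_h = HEIGHT // PATCH_SIZE   # 72 patch rows
num_tokens = patches_w * patches_h

print(f"{patches_w} x {patches_h} = {num_tokens} image tokens")
# -> 48 x 72 = 3456 image tokens per image, with no VAE downsampling
```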
## Inputs
- Person image: RGB image of the person to dress
- Garment image: RGB image of the garment (model photo or flat-lay)
- Category: "tops", "bottoms", or "one-pieces"
- Pose keypoints: Extracted via DWPose (handled automatically by the pipeline)
## Outputs
- Photorealistic RGB image of the person wearing the specified garment
## Usage

### Installation

```bash
git clone https://github.com/fashn-AI/fashn-vton-1.5.git
cd fashn-vton-1.5
pip install -e .
```
### Download Weights

```bash
python scripts/download_weights.py --weights-dir ./weights
```
This downloads:
- `model.safetensors`: TryOnModel weights (~2 GB)
- `dwpose/`: DWPose ONNX models for pose detection
The human parser weights (~244 MB) are automatically downloaded on first use.
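
To verify the download before running inference, a minimal sketch that checks for the paths listed above (only `model.safetensors` and `dwpose/` are assumed; the human parser weights arrive later, on first use):

```python
from pathlib import Path

weights_dir = Path("./weights")

# Only the two paths named in the download step are checked here;
# the human parser weights are fetched automatically on first use.
for name in ["model.safetensors", "dwpose"]:
    path = weights_dir / name
    print(f"{path}: {'ok' if path.exists() else 'MISSING'}")
```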
### Quick Start

```python
from fashn_vton import TryOnPipeline
from PIL import Image

# Initialize pipeline (auto-detects GPU)
pipeline = TryOnPipeline(weights_dir="./weights")

# Load images
person = Image.open("person.jpg").convert("RGB")
garment = Image.open("garment.jpg").convert("RGB")

# Run inference
result = pipeline(
    person_image=person,
    garment_image=garment,
    category="tops",  # "tops" | "bottoms" | "one-pieces"
)

# Save output
result.images[0].save("output.png")
```
### CLI

```bash
python examples/basic_inference.py \
    --weights-dir ./weights \
    --person-image person.jpg \
    --garment-image garment.jpg \
    --category tops
```
## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `category` | str | required | "tops", "bottoms", or "one-pieces" |
| `garment_photo_type` | str | "model" | "model" for worn garments, "flat-lay" for product shots |
| `num_samples` | int | 1 | Number of output images (1-4) |
| `num_timesteps` | int | 30 | Sampling steps (20=fast, 30=balanced, 50=quality) |
| `guidance_scale` | float | 1.5 | Classifier-free guidance strength |
| `seed` | int | 42 | Random seed for reproducibility |
| `segmentation_free` | bool | True | Maskless mode for better body preservation and unconstrained garment volume (less biased by the original clothing's shape) |
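
Putting the table together, a call that overrides the defaults could look like the sketch below. The keyword arguments come straight from the table; `pipeline`, `person`, and `garment` are from the Quick Start example, and `result.images` is indexed the same way:

```python
# All keyword arguments below are taken from the parameters table.
result = pipeline(
    person_image=person,
    garment_image=garment,
    category="one-pieces",
    garment_photo_type="flat-lay",  # product shot instead of a worn garment
    num_samples=4,                  # up to 4 candidate images
    num_timesteps=50,               # quality-oriented sampling
    guidance_scale=1.5,
    seed=42,
    segmentation_free=True,         # maskless mode (the default)
)

# Save every candidate image
for i, image in enumerate(result.images):
    image.save(f"output_{i}.png")
```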
## Categories

| Category | Description | Examples |
|---|---|---|
| `tops` | Upper body garments | T-shirts, blouses, jackets, sweaters |
| `bottoms` | Lower body garments | Pants, skirts, shorts |
| `one-pieces` | Full body garments | Dresses, jumpsuits, rompers |
## Training
FASHN VTON v1.5 was trained from scratch in pixel space using a two-phase approach:
- Phase 1: 18M masked try-on pairs
- Phase 2: 50/50 mix of masked pairs plus 4M synthetic triplets generated from the Phase 1 checkpoint
Training optimizations included token dropping of up to 75% to reduce computational cost.
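
The training code is not published, so the following is only an illustrative sketch of the token-dropping idea (randomly discarding a fraction of patch tokens to shrink the training sequence), not FASHN's actual implementation:

```python
import torch

def drop_tokens(tokens: torch.Tensor, drop_ratio: float = 0.75) -> torch.Tensor:
    """Keep a random (1 - drop_ratio) subset of patch tokens.

    tokens: (batch, seq_len, hidden) patch embeddings.
    Illustrative only; the model's real training recipe is unpublished.
    """
    batch, seq_len, hidden = tokens.shape
    num_keep = max(1, int(seq_len * (1.0 - drop_ratio)))
    scores = torch.rand(batch, seq_len, device=tokens.device)
    keep_idx = scores.argsort(dim=1)[:, :num_keep]            # random subset per sample
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, hidden)  # broadcast over hidden dim
    return tokens.gather(1, keep_idx)

# 3456 tokens per image (see Architecture) at hidden size 1280
x = torch.randn(2, 3456, 1280)
print(drop_tokens(x).shape)  # torch.Size([2, 864, 1280]) -> 75% of tokens dropped
```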
## Performance
- Inference time: ~5 seconds on NVIDIA H100
- Memory: Requires ~8GB VRAM for inference
- Precision: Automatically uses bfloat16 on Ampere+ GPUs (RTX 30xx/40xx, A100, H100)
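
The precision choice is handled automatically by the pipeline, but a standalone check of whether your GPU qualifies (standard PyTorch calls, independent of this package):

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: {major}.{minor}")   # Ampere and newer report >= 8.0
    print(f"bfloat16 supported: {torch.cuda.is_bf16_supported()}")
else:
    print("No CUDA GPU detected; expect slow CPU inference.")
```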
## Limitations
- Resolution: Output resolution (576x864) is lower than some VAE-based architectures that support 1K+ resolution
- Body shape preservation: May be imperfect due to synthetic triplet generation during training
- Garment transitions: Original garment traces may remain when swapping from long-to-short or bulky-to-slim garments
- Hardware requirements: Dedicated GPU recommended for reasonable inference speeds
## Citation

```bibtex
@article{bochman2026fashnvton,
  title={FASHN VTON v1.5: Efficient Maskless Virtual Try-On in Pixel Space},
  author={Bochman, Dan and Bochman, Aya},
  journal={arXiv preprint},
  year={2026},
  note={Paper coming soon}
}
```
## License
This model is released under the Apache-2.0 License.
Third-party components:
- DWPose (Apache-2.0)
- YOLOX (Apache-2.0)
- FASHN Human Parser (License)