Training Design for Text-to-Image Models: Lessons from Ablations
Welcome back! This is the second part of our series on training efficient text-to-image models from scratch.
In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused primarily on architectural choices and motivated the core design decisions behind our model PRX. We also released an early, small (1.2B parameters) version of the model as a preview of what we are building (go try it if you haven't already 😉).
In this post, we shift our focus from architecture to training. The goal is to document what actually moved the needle for us when trying to make models train faster, converge more reliably, and learn better representations. The field is moving quickly and the list of “training tricks” keeps growing, so rather than attempting an exhaustive survey, we structured this as an experimental logbook: we reproduce (or adapt) a set of recent ideas, implement them in a consistent setup, and report how they affect optimization and convergence in practice. Finally, we do not only report these techniques in isolation; we also explore which ones remain useful when combined.
In the next post, we will publish the full training recipe as code, including the experiments in this post. We will also run and report on a public "speedrun" where we put the best pieces together into a single configuration and stress-test it end-to-end. This exercise will serve both as a validation of our current training pipeline and as a concrete demonstration of how far careful training design can go under tight constraints. If you haven’t already, we invite you to join our Discord to continue the discussion. A significant part of this project has been shaped by exchanges with community members, and we place a high value on external feedback, ablations, and alternative interpretations of the results.
The Baseline
Before introducing any training-efficiency techniques, we first establish a clean reference run. This baseline is intentionally simple. It uses standard components, avoids auxiliary objectives, and does not rely on architectural shortcuts or tricks to save compute resources. Its role is to serve as a stable point of comparison for all subsequent experiments.
Concretely, this is a pure Flow Matching (Lipman et al., 2022) training setup (as introduced in Part 1) with no extra objectives and no architectural speed hacks.
We will use the small PRX-1.2B model we presented in the first post of this series (single stream architecture with global attention for the image tokens and text tokens) as our baseline and train it in Flux VAE latent space, keeping the configuration fixed across all comparisons unless stated otherwise.
The baseline training setup is as follows:
| Setting | Value |
|---|---|
| Steps | 100k |
| Dataset | 1M public synthetic images generated with Midjourney v6 |
| Resolution | 256×256 |
| Global batch size | 256 |
| Optimizer | AdamW |
| lr | 1e-4 |
| weight_decay | 0.0 |
| eps | 1e-15 |
| betas | (0.9, 0.95) |
| Text encoder | GemmaT5 |
| Positional encoding | Rotary (RoPE) |
| Attention mask | Padding mask |
| EMA | Disabled |
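For reference, here is roughly how the optimizer side of this configuration maps onto PyTorch; the `model` stand-in is a placeholder for the PRX-1.2B denoiser, not our actual training code.

```python
import torch

# Placeholder for the PRX-1.2B denoiser; any nn.Module works for illustration.
model = torch.nn.Linear(64, 64)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # lr
    betas=(0.9, 0.95),  # betas
    eps=1e-15,          # eps
    weight_decay=0.0,   # weight_decay
)

TOTAL_STEPS = 100_000    # Steps
GLOBAL_BATCH_SIZE = 256  # Global batch size (across all devices)
```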
This baseline configuration provides a transparent and reproducible anchor. It allows us to attribute observed improvements and regressions to specific training interventions, rather than to shifting hyperparameters or hidden setup changes. Throughout the remainder of this post, every technique is evaluated against this reference with a single guiding question in mind:
Does this modification improve convergence or training efficiency relative to the baseline?
Examples of baseline model generations after 100K training steps.
Benchmarking Metrics
To keep this post grounded, we rely on a small set of metrics to monitor checkpoints over time. None of them is a perfect proxy for perceived image quality, but together they provide a practical scoreboard while we iterate.
Fréchet Inception Distance (FID): (Heusel et al., 2017) Measures how close the distributions of generated and real images are, using Inception-v3 feature statistics (mean and covariance). Lower values typically correlate with higher sample fidelity.
CLIP Maximum Mean Discrepancy (CMMD): (Jayasumana et al., 2024) Measures the distance between real and generated image distributions using CLIP image embeddings and Maximum Mean Discrepancy (MMD). Unlike FID, CMMD does not assume Gaussian feature distributions and can be more sample-efficient; in practice it often tracks perceptual quality better than FID, though it is still an imperfect proxy.
DINOv2 Maximum Mean Discrepancy (DINO-MMD): Same MMD-based distance as CMMD, but computed on DINOv2 (Oquab et al., 2023) image embeddings instead of CLIP. This provides a complementary view of distribution shift under a self-supervised vision backbone.
Network throughput: Average number of training batches processed per second (batches/s), as a measure of end-to-end training efficiency.
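Since both CMMD and DINO-MMD boil down to the same kernel-based distance on precomputed embeddings, here is a minimal sketch of that computation; the bandwidth value and the biased estimator are illustrative simplifications, not the exact settings of the CMMD paper.

```python
import torch

def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Squared MMD with a Gaussian RBF kernel (biased estimator, kept simple).

    x: [N, D] embeddings of reference images, y: [M, D] embeddings of generated
    images (CLIP for CMMD, DINOv2 for DINO-MMD). sigma is the kernel bandwidth.
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b, p=2.0).pow(2)       # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))   # RBF kernel matrix

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Usage with random stand-in embeddings:
real_emb, fake_emb = torch.randn(1024, 512), torch.randn(1024, 512)
print(mmd_rbf(real_emb, fake_emb).item())
```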
With the scoreboard defined, we can now dive into the methods we explored, grouped into four buckets: Representation Alignment, Training Objectives, Token Routing and Sparsification, and Data.
Representation Alignment
Diffusion and flow models are typically trained with a single objective: predict a noise-like target (or vector field) from a corrupted input. Early in training, that one objective is doing two jobs at once: it must build a useful internal representation and learn to denoise on top of it. Representation alignment makes this explicit by keeping the denoising objective and adding an auxiliary loss that directly supervises intermediate features using a strong, frozen vision encoder. This tends to speed up early learning and bring the model’s features closer to those of modern self-supervised encoders. As a result, you often need less compute to hit the same quality.
A useful way to view it is to decompose the denoiser into an implicit encoder that produces intermediate hidden states, and a decoder that maps those states to the denoising target. The claim is that representation learning is the bottleneck: diffusion and flow transformers do learn discriminative features, but they lag behind foundation vision encoders when training is compute-limited. Therefore, borrowing a powerful representation space can make the denoising problem easier.
REPA (Yu et al., 2024)
REPA adds a representation matching term on top of the base flow-matching objective.
Let $x$ be a clean sample and $\epsilon \sim \mathcal{N}(0, I)$ a noise sample. The model is trained on an interpolated state $x_t = (1 - t)\,x + t\,\epsilon$ (for $t \in [0, 1]$) and predicts a vector field $v_\theta(x_t, t)$. In REPA, a pretrained vision encoder processes the clean sample $x$ to produce patch embeddings $y \in \mathbb{R}^{N \times D}$, where $N$ is the number of patch tokens and $D$ is the teacher embedding dimension. In parallel, the denoiser processes $x_t$ and produces intermediate hidden tokens $h_t \in \mathbb{R}^{N \times d}$ (one token per patch). A small projection head $g_\phi$ maps these student hidden tokens into the teacher embedding space, and an auxiliary loss maximizes patch-wise similarity between corresponding teacher and student tokens:

$$\mathcal{L}_{\text{REPA}} = -\,\mathbb{E}\!\left[\frac{1}{N} \sum_{n=1}^{N} \operatorname{sim}\!\big(y^{(n)},\, g_\phi(h_t^{(n)})\big)\right]$$

Here $n$ indexes patch tokens, $y^{(n)}$ is the teacher embedding for patch $n$, $h_t^{(n)}$ is the corresponding student hidden token at time $t$, and $\operatorname{sim}$ is typically cosine similarity.

This term is combined with the main flow-matching loss:

$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda\, \mathcal{L}_{\text{REPA}}$$

with $\lambda$ controlling the trade-off.
In practice, the student is trained to produce noise-robust, data-consistent patch representations from , so later layers can focus on predicting the vector field and generating details rather than rediscovering a semantic scaffold from scratch.
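As a concrete reference, here is a minimal sketch of the auxiliary term, assuming the intermediate hidden tokens and the frozen teacher features are already available; `proj_head` plays the role of $g_\phi$ above and the loss weight $\lambda$ is applied outside.

```python
import torch
import torch.nn.functional as F

def repa_loss(student_tokens: torch.Tensor,
              teacher_tokens: torch.Tensor,
              proj_head: torch.nn.Module) -> torch.Tensor:
    """Patch-wise REPA alignment term (minimal sketch).

    student_tokens: [B, N, d] hidden states from an intermediate denoiser block,
                    computed on the noised input x_t.
    teacher_tokens: [B, N, D] frozen encoder features (DINOv2/DINOv3) of the clean x.
    proj_head:      small head mapping d -> D (the g_phi above).
    """
    projected = proj_head(student_tokens)                                  # [B, N, D]
    sim = F.cosine_similarity(projected, teacher_tokens.detach(), dim=-1)  # [B, N]
    return -sim.mean()  # maximizing similarity == minimizing its negative

# Total objective: loss = flow_matching_loss + repa_weight * repa_loss(...)
```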
What we observed
We ran REPA on top of our baseline PRX training, using two frozen teachers: DINOv2 and DINOv3 (Siméoni et al., 2025). The pattern was very consistent: adding alignment improves quality metrics, and the stronger teacher helps more, at the cost of a bit of speed.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| REPA-DINOv3 | 14.64 | 0.35 | 0.30 | 3.46 |
| REPA-DINOv2 | 16.60 | 0.39 | 0.31 | 3.66 |
On the quality metrics, both teachers improve over the baseline. The effect is strongest with DINOv3, which achieves the best overall numbers in this run.
REPA is not free: we pay for an extra frozen teacher forward and the patch-level similarity loss, which shows up as a throughput drop from 3.95 batches/s to 3.66 (DINOv2) or 3.46 (DINOv3). In other words, DINOv3 prioritizes maximum representation quality at the cost of slower training, while DINOv2 offers a more efficient tradeoff, still delivering substantial gains with a smaller slowdown.
Our practical takeaway is that REPA is a strong lever for text-to-image training. In our setup, the throughput trade-off is real and the net speedup (time required to reach a given level of image quality) felt a bit less dramatic than what the authors of the paper report on ImageNet-style, class-conditioned generation. That said, the quality gains are still clearly significant. Qualitatively, we also saw the difference early: after ~100K steps, samples trained with alignment tended to lock in cleaner global structure and more coherent layouts, which makes it easy to see why REPA (and alignment variants more broadly) have become a go-to ingredient in modern T2I training recipes.
iREPA (Singh et al., 2025)
A natural follow-up to REPA is: what exactly should we be aligning? iREPA argues that the answer is spatial structure, not global semantics. Across a large sweep of 27 vision encoders, the authors find that ImageNet-style “global” quality (e.g., linear-probe accuracy on patch tokens) is only weakly predictive of downstream generation quality under REPA, while simple measures of patch-token spatial self-similarity correlate much more strongly with FID. Based on that diagnosis, iREPA makes two tiny but targeted changes to the REPA recipe to better preserve and transfer spatial information:
- Replace the usual MLP projection head with a lightweight 3×3 convolutional projection operating on the patch grid.
- Apply a spatial normalization to teacher patch tokens that removes a global overlay (mean across spatial locations) to increase local contrast.
Despite representing “less than 4 lines of code”, these tweaks consistently speed up convergence and improve quality across encoders, model sizes, and even REPA-adjacent training recipes.
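For concreteness, this is roughly how we implemented the two tweaks; it reflects our reading of the paper (a square patch grid is assumed), not the authors' reference code.

```python
import torch
import torch.nn as nn

class ConvProjectionHead(nn.Module):
    """iREPA tweak 1: a lightweight 3x3 conv over the patch grid instead of a
    token-wise MLP projection (assumes a square grid of N = H*W patch tokens)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(student_dim, teacher_dim, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # [B, N, d]
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)       # [B, d, H, W]
        return self.conv(grid).flatten(2).transpose(1, 2)       # [B, N, D]

def spatially_normalize(teacher_tokens: torch.Tensor) -> torch.Tensor:
    """iREPA tweak 2: remove the global overlay by subtracting the mean over
    spatial locations, so local contrast dominates the alignment target."""
    return teacher_tokens - teacher_tokens.mean(dim=1, keepdim=True)
```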
What we observed
In our setup, we observed a similar kind of boost when applying the iREPA spatial tweaks on top of DINOv2: convergence was a bit smoother and the metrics improved more steadily over the first 100K steps. Interestingly, the same changes did not transfer as cleanly when applied on top of a DINOv3 teacher and they tended to degrade performance rather than help. We do not want to over-interpret that result: this could easily be an interaction with our specific architecture, resolution/patching, loss weighting, or even small implementation details. Still, given this inconsistency across teachers, we will likely not include these tweaks in our default recipe, even if they remain an interesting option to revisit when tuning for a specific setup.
About Using REPA During the Full Training:
The paper REPA Works Until It Doesn't: Early-Stopped, Holistic Alignment Supercharges Diffusion Training (Wang et al., 2025) highlights a key caveat: REPA is a powerful early accelerator, but it can plateau or even become a brake later in training. The authors describe a capacity mismatch. Once the generative model starts fitting the full data distribution (especially high-frequency details), forcing it to stay close to a frozen recognition encoder’s lower-dimensional embedding manifold becomes constraining. Their practical takeaway is simple: keep alignment for the “burn-in” phase, then turn it off with a stage-wise schedule.
We observed the same qualitative pattern in our own runs. When training our preview model, removing REPA after ~200K steps noticeably improved the overall feel of image quality: textures, micro-contrast, and fine detail continued to sharpen instead of looking slightly muted. For that reason, we also recommend treating representation alignment as a transient scaffold. Use it to get fast early progress, then drop it after a while once the model’s own generative features have caught up.
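In practice this "transient scaffold" can be as simple as a stage-wise weight on the alignment term; the 200K-step cut-off and base weight below are placeholders matching our preview run, not universal recommendations.

```python
def alignment_weight(step: int, burn_in_steps: int = 200_000, base_weight: float = 0.5) -> float:
    """Stage-wise REPA schedule: full alignment during burn-in, then off.
    Both values are placeholders from our runs, to be tuned per setup."""
    return base_weight if step < burn_in_steps else 0.0

# loss = flow_matching_loss + alignment_weight(global_step) * repa_loss(...)
```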
Alignment in the Token Latent Space
So far, “alignment” meant regularizing the generator’s internal features against a frozen teacher while treating the tokenizer / latent space as fixed. A more direct lever is to shape the latent space itself so the representation presented to the flow backbone is intrinsically easier to model, without sacrificing the reconstruction fidelity needed for editing and downstream workflows.
REPA-E (Leng et al., 2025) makes this concrete. Its starting point is a failure mode: if you simply backprop the diffusion / flow loss into the VAE, the tokenizer quickly learns a pathologically easy latent for the denoiser, which can even degrade final generation quality. REPA-E’s fix is a two-signal training recipe:
- keep the diffusion loss, but apply a stop-gradient so it only updates the latent diffusion model (not the VAE);
- update both the VAE and the diffusion model using an end-to-end REPA alignment loss.
Thanks to these two tricks, the tokenizer is explicitly optimized to produce latents that yield higher alignment and empirically better generations.
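To make the gradient routing explicit, here is a conceptual sketch of the two-signal update under our reading of the paper; the call signatures are placeholders and, for clarity, we run two denoiser forwards where a real implementation would fuse them.

```python
import torch
import torch.nn.functional as F

def repa_e_losses(vae, denoiser, proj_head, teacher, images, t, noise):
    """Two-signal REPA-E update (conceptual sketch, placeholder signatures).

    `denoiser(x_t, t)` is assumed to return (velocity_pred, intermediate_tokens);
    `teacher` is a frozen vision encoder returning patch embeddings of clean images.
    """
    latents = vae.encode(images)          # gradients from below can reach the VAE
    t_ = t.view(-1, 1, 1, 1)

    # (1) Flow loss on stop-gradded latents: updates the denoiser only, never the VAE.
    z_sg = latents.detach()
    v_pred, _ = denoiser((1 - t_) * z_sg + t_ * noise, t)
    flow_loss = F.mse_loss(v_pred, noise - z_sg)

    # (2) Alignment loss computed end-to-end: updates denoiser, projection head,
    #     and VAE, pushing the tokenizer toward more "alignable" latents.
    _, hidden = denoiser((1 - t_) * latents + t_ * noise, t)
    with torch.no_grad():
        teacher_tokens = teacher(images)
    align_loss = -F.cosine_similarity(proj_head(hidden), teacher_tokens, dim=-1).mean()

    return flow_loss, align_loss   # combine with your preferred weighting
```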
In parallel, Black Forest Labs’ FLUX.2 AE work frames latent design as a trade-off between learnability, quality, and compression. Their core argument is that improving learnability requires injecting semantic structure into the representation, rather than treating the tokenizer as a pure compression module. This motivates retraining the latent space to explicitly target “better learnability and higher image quality at the same time”. They do not share the full recipe, but they do clearly state the key idea: make the AE’s latent space more learnable by adding semantic or representation alignment, and explicitly point to REPA-style alignment with a frozen vision encoder as the mechanism they build on and integrate into the FLUX.2 AE.
What we observed
To probe alignment in the latent space, we compared two pretrained autoencoders as drop-in tokenizers for the same flow backbone: a REPA-E-VAE (where we do add the REPA alignment objective, as in the paper) and the Flux2-AE (where we do not add REPA, following their recommendation). The results were, honestly, extremely impressive, both quantitatively and qualitatively. In samples, the gap is immediately visible: generations show more coherent global structure and cleaner layouts, with far fewer “early training” artifacts.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Flux2-AE | 12.07 | 0.09 | 0.08 | 1.79 |
| REPA-E-VAE | 12.08 | 0.26 | 0.18 | 3.39 |
A first striking point is that both latent-space interventions lower the FID by ~6 points (18.20 to ~12.08), which is a much larger jump than what we typically get from “just” aligning intermediate features. This strongly supports the core idea: if the tokenizer produces a representation that is intrinsically more learnable, the flow model benefits everywhere.
The two AEs then behave quite differently in the details. Flux2-AE dominates most quality metrics (very low CMMD and DINO-MMD), but it comes with a huge throughput penalty: batches/sec drops from 3.95 to 1.79. In our case this slowdown is explained by practical factors the authors also emphasize: the model is simply heavier, and it produces a larger latent (32 channels), which increases the amount of work the diffusion backbone has to do per step.
REPA-E-VAE is the “balanced” option: it reaches essentially the same FID as Flux2-AE while keeping throughput much closer to the baseline (3.39 batches/sec).
Training Objectives: Beyond Vanilla Flow Matching
Architecture gets you capacity, but the training objective is what decides how that capacity is used. In practice, small changes to the loss often have outsized effects on convergence speed, conditional fidelity, and how quickly a model “locks in” global structure. In the sections below, we will go through the objectives we tested on top of our baseline rectified flow setup, starting with a simple but surprisingly effective modification: Contrastive Flow Matching.
Contrastive Flow Matching (Stoica et al., 2025)
Flow matching has a nice property in the unconditional case: trajectories are implicitly encouraged to be unique (flows should not intersect). But once we move to conditional generation (class- or text-conditioned), different conditions can still induce overlapping flows, which empirically shows up as “averaging” behavior: weaker conditional specificity, and muddier global structure. Contrastive flow matching addresses this directly by adding a contrastive term that pushes conditional flows away from other flows in the batch.
Contrastive flow matching makes class-conditional flows more distinct, reducing overlap seen in standard flow matching, and produces higher-quality images that better represent each class. Figure from arXiv:2506.05350.
For a given training triplet $(x, \epsilon, c)$ (clean sample, noise, condition), standard conditional flow matching trains the model velocity $v_\theta(x_t, t, c)$ to match the target transport direction $\epsilon - x$. Contrastive flow matching keeps that positive term, but additionally samples a negative pair $(\tilde{x}, \tilde{\epsilon})$ from the batch and penalizes the model if its predicted flow is also compatible with that other trajectory. In the notation above, this becomes:

$$\mathcal{L}_{\Delta\text{FM}} = \mathbb{E}\Big[\big\|v_\theta(x_t, t, c) - (\epsilon - x)\big\|^2 \;-\; \lambda\,\big\|v_\theta(x_t, t, c) - (\tilde{\epsilon} - \tilde{x})\big\|^2\Big]$$

where $\lambda$ controls the strength of the “push-away” term. Intuitively: match your own trajectory, and be incompatible with someone else’s.
The authors show that contrastive flow matching produces more discriminative trajectories and that this translates into both quality and efficiency gains: faster convergence (reported up to 9× fewer training iterations to reach similar FID) and fewer sampling steps (reported up to 5× fewer denoising steps) on ImageNet (Deng et al., 2009) and CC3M (Sharma et al., 2018) experiments.
A key advantage is that the objective is almost a drop-in replacement: you keep the usual flow-matching loss, then add a single contrastive “push-away” term using other samples in the same batch as negatives, which provides the extra supervision without introducing additional model passes.
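Because the negatives are just other samples in the batch, the whole thing fits in a few lines; here is a minimal sketch using a batch roll to form negative pairs (the weight value is a placeholder).

```python
import torch

def contrastive_fm_loss(v_pred: torch.Tensor, x: torch.Tensor, noise: torch.Tensor,
                        lam: float = 0.05) -> torch.Tensor:
    """Contrastive flow matching with in-batch negatives (minimal sketch).

    v_pred: [B, ...] predicted velocity for (x_t, t, c);
    x, noise: [B, ...] the clean samples and their paired noise.
    The negative pair is obtained by rolling the batch by one position.
    """
    pos_target = noise - x                                                # own trajectory
    neg_target = torch.roll(noise, 1, dims=0) - torch.roll(x, 1, dims=0)  # someone else's
    positive = (v_pred - pos_target).pow(2).mean()
    push_away = (v_pred - neg_target).pow(2).mean()
    return positive - lam * push_away
```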
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Contrastive-FM | 20.03 | 0.40 | 0.36 | 3.75 |
On this run, contrastive flow matching yields a small but measurable improvement on the representation-driven metrics: CMMD goes from 0.41 → 0.40 and DINO-MMD from 0.39 → 0.36. The magnitude of the gain is smaller than what the paper reports on ImageNet, which is not too surprising: text conditioning is much more complex than discrete classes, and the training data distribution is likely less “separable” than ImageNet, making the contrastive signal harder to exploit.
We do not see an improvement in FID in this specific experiment (it slightly worsens), but the throughput cost is negligible in practice (3.95 → 3.75 batches/sec). Given the simplicity of the change and the consistent movement in the right direction for the conditioning/representation metrics, we will likely still keep contrastive flow matching in our training pipeline as a low-cost regularizer.
JiT (Li and He, 2025)
Back to Basics: Let Denoising Generative Models Denoise is probably one of our favorite recent papers in the diffusion space because it is not a new trick but a reset: stop asking the network to predict off-manifold quantities (noise or velocity) and just let it denoise. Most modern diffusion and flow models train the network to predict the noise $\epsilon$ or a mixed quantity like the velocity $v$. Under the manifold assumption, natural images live on a low-dimensional manifold, while $\epsilon$ and $v$ are inherently off-manifold, so predicting them can be a harder learning problem than it looks.
Under the manifold assumption, clean images lie on the data manifold while noise and velocity do not. Thus training the model to predict clean images is fundamentally easier than training it to predict noise-like targets. Figure from arXiv:2511.13720.
The authors frame the problem with the standard linear interpolation between the clean image $x$ and the noise $\epsilon$:

$$x_t = (1 - t)\,x + t\,\epsilon,$$

and the corresponding flow velocity:

$$v = \epsilon - x.$$

Instead of outputting $v$ directly, the model predicts a clean image estimate $\hat{x}_\theta(x_t, t)$, and we convert it to a velocity prediction via:

$$\hat{v} = \frac{x_t - \hat{x}_\theta(x_t, t)}{t}.$$

Then we can keep the exact same flow-style objective in $v$-space:

$$\mathcal{L} = \mathbb{E}_{x, \epsilon, t}\Big[\big\|\hat{v} - (\epsilon - x)\big\|^2\Big].$$
This formulation makes the learning problem substantially easier in high dimensions: instead of predicting noise or velocity (which are essentially unconstrained in pixel space), the network predicts the clean image $x$, i.e., something that lies on the data manifold. In practice, this makes it feasible to train large-patch Transformers directly on pixels without a VAE or tokenizer while keeping optimization stable and the total number of tokens manageable.
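A minimal sketch of this objective is given below: the network outputs a clean-image estimate, which we convert back to a velocity so the loss stays in $v$-space (the small-$t$ clamp is our own numerical guard, not part of the paper's formulation).

```python
import torch

def x_prediction_loss(model, x: torch.Tensor, noise: torch.Tensor, t: torch.Tensor,
                      t_min: float = 1e-4) -> torch.Tensor:
    """x-prediction trained with a flow-style loss in v-space (minimal sketch).

    x, noise: [B, C, H, W]; t: [B] in (0, 1]. `model(x_t, t)` is assumed to output
    a clean-image estimate x_hat with the same shape as x.
    """
    t_ = t.view(-1, 1, 1, 1)
    x_t = (1 - t_) * x + t_ * noise                  # interpolated state
    v_target = noise - x                             # flow velocity

    x_hat = model(x_t, t)                            # predict the clean image
    v_hat = (x_t - x_hat) / t_.clamp_min(t_min)      # convert x-pred to v-pred
    return (v_hat - v_target).pow(2).mean()
```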
What we observed
We first evaluated x-prediction in the same setting as the rest of our objective experiments, namely training in the FLUX latent space at 256×256 resolution.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| X-Pred | 16.80 | 0.54 | 0.49 | 3.95 |
In this regime, the benefit of x-prediction is unclear. While FID improves slightly compared to the baseline, both CMMD and DINO-MMD degrade noticeably, and throughput is unchanged. This suggests that, when working in an already well-structured latent space, predicting clean images instead of velocity does not consistently dominate the baseline objective, and can even hurt representation-level alignment.
That said, this experiment is not where x-prediction really shines.
The exciting part is that x-prediction stabilizes high-dimensional training, making it feasible to use larger patches and denoise directly in pixel space, without a VAE, at much higher resolutions. Using JiT, we trained a model directly on 1024×1024 images with 32×32 patches, instead of operating in a compressed latent space. Despite the much higher resolution and the absence of a tokenizer, optimization remained stable and fast. We reached FID 17.42, DINO-MMD 0.56, and CMMD 0.71 with a throughput of 1.33 batches/sec.
These results are remarkable: training directly on 1024×1024 images is only about 3× slower than training in a 256×256 latent space, while operating on raw pixels. This strongly supports the core claim of Back to Basics: letting the model predict clean images makes the learning problem significantly easier, and opens the door to high-resolution, tokenizer-free text-to-image training without prohibitive compute costs.
As a result, we plan to use this formulation as the backbone of our upcoming speedrun experiments, to see how far we can push it when combined with the other efficiency and sparsification techniques discussed above. The main downside for now is that this approach does not let us benefit from the very nice properties of the FLUX.2 VAE; exploring whether some form of alignment or hybrid training could make these two worlds compatible is an open direction we plan to investigate further.
Token Routing and Sparsification to Reduce Compute Costs
So far, most of the techniques we discussed focus on making each training step more effective: improving the objective, shaping the representations, or accelerating convergence. The next lever is orthogonal: make each step cheaper.
For diffusion and flow transformers, the dominant cost is running deep transformer stacks over a large set of image/latent tokens where attention scales poorly with sequence length. Token sparsification methods target this directly by ensuring that only a subset of tokens pays the full compute price in the expensive parts of the network, while still preserving enough information flow to keep quality high.
Most masking approaches accelerate training by removing tokens from the forward pass, then asking the model to hallucinate the missing content from learned placeholders. That works surprisingly well, but it violates the spirit of iterative denoising. Instead of refining all the content in each step, we are reconstructing parts from scratch.
Two recent papers illustrate a cleaner alternative: instead of deleting information, they reorganize where compute is spent. TREAD and SPRINT share the same high-level objective of avoiding full-depth computation for every token at every layer, but they pursue it through complementary strategies.
TREAD's (Krause et al., 2025) core idea is to replace compute reduction through information loss, such as dropping or masking tokens, with compute reduction through information transport using token routing. It introduces a route: for each training sample, it randomly selects a fraction of tokens and temporarily bypasses a contiguous chunk of layers, then re-injects those tokens later. Tokens are not discarded. Instead, they avoid paying the cost of full depth. Concretely, for a denoiser with a stack of blocks $\{B_1, \dots, B_L\}$, TREAD defines a route (start layer $\ell_s$, end layer $\ell_e$). A subset of tokens follows the cheap path (identity) across $[\ell_s, \ell_e)$, while the rest follows the normal full computation. Then both streams merge again at $\ell_e$. In practice, the paper shows that routing up to 50% of tokens remains effective, while higher rates begin to degrade quality.
TREAD enhances training efficiency by routing tokens around certain layers. Figure from arXiv:2501.04765.
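The routing mechanism itself is compact; below is a minimal sketch of a TREAD-style forward pass under our reading of the paper (per-sample random routing, rectangular batches, and `end < len(blocks)` assumed).

```python
import torch
import torch.nn as nn

def tread_forward(blocks: nn.ModuleList, tokens: torch.Tensor,
                  start: int, end: int, route_ratio: float = 0.5) -> torch.Tensor:
    """Route a random subset of tokens around blocks[start:end] (minimal sketch).

    tokens: [B, N, d]. Routed tokens take the identity path across the route and
    are re-injected at layer `end` in their original positions; nothing is dropped.
    """
    b, n, d = tokens.shape
    routed = route_idx = keep_idx = None

    for i, block in enumerate(blocks):
        if i == start:
            # Split: per-sample random permutation; the first chunk takes the cheap path.
            perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)
            n_route = int(n * route_ratio)
            route_idx, keep_idx = perm[:, :n_route], perm[:, n_route:]
            routed = torch.gather(tokens, 1, route_idx.unsqueeze(-1).expand(-1, -1, d))
            tokens = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        if i == end and routed is not None:
            # Merge: scatter both streams back to their original token positions.
            merged = torch.empty(b, n, d, device=tokens.device, dtype=tokens.dtype)
            merged.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), tokens)
            merged.scatter_(1, route_idx.unsqueeze(-1).expand(-1, -1, d), routed)
            tokens, routed = merged, None
        tokens = block(tokens)
    return tokens
```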
SPRINT (Park et al., 2025) extends this approach by introducing sparsity in the most computationally expensive parts of the network, while preserving a dense information pathway. Its recipe is intentionally structured: run dense early layers over all tokens to build reliable low-level features, then keep only a subset of tokens through the sparse middle layers where compute is heaviest, and finally go dense again by re-expanding and fusing sparse deep features with a dense residual stream from the early layers, before producing the output. The key distinction from TREAD is where robustness comes from: TREAD keeps tokens “present” but shallower (routing), whereas SPRINT allows many tokens to be absent in the middle blocks, relying on the dense residual path to preserve full-resolution information. This is what enables more aggressive sparsification in practice. The paper explores drop ratios around 75%, versus ~50% for TREAD.
SPRINT goes beyond TREAD by dropping most tokens in the middle layers while keeping a dense residual path to preserve full-resolution information. Figure from arXiv:2510.21986.
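For comparison with the TREAD sketch above, here is the corresponding dense-sparse-dense pattern, again as a rough reading of the paper rather than the reference implementation; `fuse` stands in for whatever module merges the re-expanded sparse features with the dense residual.

```python
import torch
import torch.nn as nn

def sprint_forward(early: nn.ModuleList, middle: nn.ModuleList, late: nn.ModuleList,
                   fuse: nn.Module, tokens: torch.Tensor,
                   keep_ratio: float = 0.25) -> torch.Tensor:
    """Dense early layers, sparse middle layers, dense late layers (minimal sketch)."""
    b, n, d = tokens.shape
    for block in early:                       # dense: all tokens build low-level features
        tokens = block(tokens)
    dense_residual = tokens                   # full-resolution skip path

    # Sparse middle stage: only a random subset of tokens pays for the heavy blocks.
    n_keep = int(n * keep_ratio)
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    sparse = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    for block in middle:
        sparse = block(sparse)

    # Re-expand: scatter sparse deep features back and fuse with the dense residual
    # (here `fuse` is assumed to map the concatenation [B, N, 2d] back to [B, N, d]).
    expanded = torch.zeros_like(dense_residual)
    expanded.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, d), sparse)
    tokens = fuse(torch.cat([expanded, dense_residual], dim=-1))

    for block in late:                        # dense again before the output head
        tokens = block(tokens)
    return tokens
```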
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| TREAD | 21.61 | 0.55 | 0.41 | 4.11 |
| SPRINT | 22.56 | 0.72 | 0.42 | 4.20 |
Under our standard 256×256 latent setup, both methods deliver the primary benefit we were targeting. TREAD goes from 3.95 → 4.11 batches/sec, and SPRINT pushes it a bit further to 4.20 batches/sec. The cost is that under our evaluation protocol, this extra throughput comes with a clear loss in quality: FID rises from 18.20 to 21.61 (TREAD) and 22.56 (SPRINT), with the same pattern observed in CMMD and DINO-MMD.
Taken at face value, routing yields a modest ~7–9% throughput gain, but it comes with worse metrics in this benchmark, with SPRINT (the more aggressive scheme) degrading quality slightly more than TREAD.
One important caveat is that token-sparse / routed models tend to score worse under vanilla Classifier-Free Guidance (CFG), and this effect is likely amplified here because these runs are still relatively undertrained in our setting. The authors of Guiding Token-Sparse Diffusion Models (Krause et al., 2025) argue this is partly an evaluation mismatch: routing changes the model’s effective capacity, and plain “conditional vs. unconditional” CFG often becomes less effective, which can artificially reduce quality. We deliberately did not use specialized guidance schemes to keep our benchmark consistent across methods, and at this stage it would also not be very meaningful to treat the sparse model as a “bad version of itself” for guidance. As a result, we consider these numbers directionally useful, but still pessimistic and worth interpreting with caution.
At 256×256, routing only gave modest gains because the model processes relatively few tokens. At 1024×1024, the picture changes completely. With 1024 tokens, routing finally targets the dominant cost, and the results are striking.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 17.42 | 0.71 | 0.56 | 1.33 |
| TREAD | 14.10 | 0.46 | 0.37 | 1.64 |
| SPRINT | 16.90 | 0.51 | 0.41 | 1.89 |
Both TREAD and SPRINT deliver large throughput gains over the dense baseline, with SPRINT pushing speed the furthest. More importantly, this time the gains do not come at the expense of quality but quite the opposite. TREAD in particular stands out, with a dramatic drop in FID (17.42 → 14.10) alongside strong improvements in CMMD and DINO-MMD. SPRINT is slightly more aggressive and a bit noisier in quality, but still clearly improves over the baseline while being the fastest option.
In short, this is the regime where token routing really shines: high resolution, many tokens, and JiT-style pixel-space training. Here, routing is no longer a marginal optimization—it’s a major lever that improves both how fast and how well the model trains.
Data
After covering representation alignment, the core training objective, and token routing, we turned to the fourth axis that consistently mattered in practice: data. We found that the choice of training data, including how it is described through captions, can influence the trajectory of a training run as much as optimization techniques. Below are three concrete data experiments that clearly moved the needle in our setup.
Long vs. Short Captions
Captions are an essential part of the training set: for a text-to-image model, they are not just metadata, they are the supervision. The DALL·E 3 (Betker et al., 2023) research paper showed that richer captions can be one of the strongest levers for improving training signal and prompt-following. To isolate the effect in our setup, we kept everything else fixed and changed only the caption style to compare:
Long, descriptive captions (our baseline): multi-clause captions that mention composition, attributes, lighting, materials, and relationships.
Example
"A photograph depicts a fluffy lop-eared rabbit sitting on a weathered wooden surface outdoors. The rabbit is predominantly white with patches of light brown and tan fur, particularly on its head and ears. Its ears droop noticeably, and its fur appears soft and thick. The rabbit's eyes are dark and expressive. It is positioned slightly off-center, facing towards the left of the frame. Behind the rabbit, slightly out of focus, is a miniature dark red metal wheelbarrow. A partially visible orange apple sits to the left of the rabbit. Fallen autumn leaves, predominantly reddish-brown, are scattered around the rabbit and apple on the wooden surface. The background is a blurred but visible expanse of green grass, suggesting an outdoor setting. The lighting is soft and natural, likely diffused daylight, casting no harsh shadows. The overall atmosphere is calm, peaceful, and autumnal. The aesthetic is rustic and charming, with a focus on the rabbit as the main subject. The color palette is muted and natural, consisting mainly of whites, browns, oranges, and greens. The style is naturalistic and straightforward, without any overt artistic manipulation. The vibe is gentle and heartwarming."Short, one-line captions: minimal descriptions with much less structure.
Example
"A rabbit sitting on top of a wooden table."
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Short-Captions | 36.84 | 0.98 | 1.14 | 3.95 |
The outcome was unambiguous: switching to short captions severely hurt convergence across all metrics. Long captions provide a richer supervision signal: beyond prompt adherence, there is a very practical optimization reason. More tokens usually means more information, and therefore more learning signal for the denoiser. When the conditioning text specifies composition, attributes, lighting, materials, and relationships, the model gets a sharper “target” for what the denoising trajectory should preserve and refine, especially early in training.
The fun paradox is that this extra detail often makes the learning problem easier, not harder: intuitively, one might expect longer prompts, with more attributes, constraints, and relationships, to increase complexity and burden the model. In practice, the opposite happens. Short captions leave many degrees of freedom unspecified, forcing the model to learn under ambiguity and implicitly average across multiple plausible interpretations. Long captions collapse that uncertainty by turning implicit choices into explicit constraints, allowing the denoiser to focus its capacity on refining a well-posed solution instead of guessing what matters.
Long captions are a strong training-time accelerator, but we still want the model to behave well on short prompts because that is how people actually use these systems. A simple workaround is to end training with a short fine-tuning stage on a mixture of long and short captions. That keeps the benefits of rich supervision early, while teaching the model to stay robust when conditioning is sparse.
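One simple way to implement that mixture during the final fine-tuning stage is to sample the caption style per example; the short-caption probability below is a placeholder to tune, not a value we validated.

```python
import random

def pick_caption(long_caption: str, short_caption: str, p_short: float = 0.3) -> str:
    """Mix caption styles during the final fine-tuning stage so the model stays
    robust to sparse conditioning. p_short is a placeholder, not a tuned value."""
    return short_caption if random.random() < p_short else long_caption
```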
Bootstrapping With Synthetic Images
Another data-related question we explored is whether a low-cost synthetic corpus can accelerate early training compared to a real corpus of similar size. For this benchmark, we trained a model on a dataset of real images collected from Pexels and compared it with our baseline, which was trained on synthetic data generated with Midjourney v6; both datasets contain around 1M images.
We evaluated both runs against the same Unsplash reference set, composed exclusively of real images.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Synthetic images | 18.20 | 0.41 | 0.39 | 3.95 |
| Real images | 16.60 | 0.50 | 0.46 | 3.95 |
The synthetic-trained model scores better on CMMD and DINO-MMD, while the model trained on real images achieves a lower FID. Rather than a contradiction, this split mostly reflects what these metrics emphasize.
FID is particularly sensitive to low-level image statistics: fine textures, high-frequency detail, noise patterns, and the subtle irregularities of real photography. Since our evaluation reference is composed of real images, a model trained on real photos naturally matches those statistics more closely, which translates into a better FID. Synthetic images, by contrast, often exhibit slightly different high-frequency signatures (cleaner edges, smoother micro-textures, more uniform noise) that are barely noticeable qualitatively but still get penalized by distributional metrics like FID.
Qualitatively, this difference is easy to spot. Models trained on synthetic data tend to produce images with cleaner global structure and stronger compositional and object coherence, but also exhibit a more synthetic appearance, characterized by smoother textures and reduced photographic noise. In contrast, models trained on real images better capture the irregular, fine-grained textures typical of natural photographs, though they often require more training to achieve comparable global structure.
One plausible explanation for why synthetic data remains so effective early on is that it exposes the model to a wider range of compositional collisions: unusual pairings of objects, attributes, styles, and viewpoints that rarely co-occur in natural datasets. While this can hurt realism at the texture level, it forces the model to explain a broader space of combinations, which appears to help with early disentanglement and structure learning.
Considered jointly, this suggests a simple but practical strategy: synthetic data is an efficient way to bootstrap training and lock in global structure quickly, while real images remain important later on if matching photographic texture statistics is the priority.
SFT With Alchemist: Small Dataset, Real Impact
Finally, we experimented with a targeted Supervised Fine-Tuning (SFT) pass using Alchemist (Startsev et al., 2025), a compact dataset explicitly curated for high impact. Alchemist is small by design (3,350 image–text pairs), but it is constructed through a sophisticated curation pipeline that starts from a web-scale pool and progressively distills it down to visually exceptional samples.
In our setup, we fine-tuned our preview models for 20K steps on Alchemist. Despite the dataset’s small size, it had an outsized effect: it adds a distinct “style layer” with better composition, more photographic polish, and richer scenes, without a noticeable cost to generalization.
The samples below show a side-by-side comparison of generations from the same base model, before and after the Alchemist fine-tuning pass.
More Useful Tips for Training
Last but not least, we will briefly cover two practical training details that turned out to matter more than we expected. These factors are easily overlooked and in our case they had a clear impact on convergence and final image quality.
Muon Optimizer
We generally default to AdamW for our benchmarks because it’s predictable and easy to compare across runs. However, lately, we have seen a renewed interest in optimizers that try to behave more like a good preconditioner without the full overhead of second-order methods. One recent example is Muon (Jordan et al., 2024), which, at a high level, tries to improve optimization by applying a better-conditioned update step, often translating into faster convergence and cleaner progress early in training.
In our setup, Muon was one of the rare cases in which a change of optimizer produced an immediately observable effect on the metrics.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ |
|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 |
| Muon | 15.55 | 0.36 | 0.35 |
For this experiment, we used the official PyTorch implementation of Muon, which at the moment supports Distributed Data Parallel (DDP) training only. If you’re running Fully Sharded Data Parallel (FSDP), there are community variants available; for example here.
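For orientation, the parameter split typically looks like the sketch below: Muon handles the 2D weight matrices while everything else stays on AdamW. The Muon constructor arguments shown here are assumptions to check against the implementation you actually install.

```python
import torch
from muon import Muon  # official implementation (DDP-only at the time of writing)

# Stand-in for the denoiser; in our runs this is the PRX transformer.
model = torch.nn.TransformerEncoderLayer(d_model=256, nhead=4)

# Muon preconditions 2D weight matrices; embeddings, norms, biases and other
# <2D tensors stay on AdamW. Constructor arguments (momentum, distributed
# settings, ...) vary across Muon versions, so verify against its README.
matrix_params = [p for p in model.parameters() if p.ndim >= 2]
other_params = [p for p in model.parameters() if p.ndim < 2]

optimizers = [
    Muon(matrix_params, lr=0.02, momentum=0.95),
    torch.optim.AdamW(other_params, lr=1e-4, betas=(0.9, 0.95), eps=1e-15),
]

# In the training loop, step both:
# for opt in optimizers:
#     opt.step()
#     opt.zero_grad(set_to_none=True)
```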
While we refrain from broad conclusions based on a single benchmark, these results indicate that optimizer choice extends beyond stability considerations and can yield tangible gains in time-to-quality.
Precision Gotcha: Casting vs. Storing Weights in BF16
We eventually identified an error in our setup, where the denoiser weights were mistakenly stored in bfloat16 for a period of time.
To be clear, using the BF16 autocast is great. Running the forward and backward passes in BF16 or mixed precision is standard and usually what you want for speed and memory. The problem arises from keeping the parameters in BF16 precision, which negatively impacts numerically sensitive operations.
In practice, some layers and operations are much less tolerant to reduced parameter precision:
- normalization layers (e.g. LayerNorm / RMSNorm statistics),
- attention softmax/logits paths,
- RoPE,
- optimizers’ internal state / update dynamics.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ |
|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 |
| BF16 weights (bug) | 21.87 | 0.61 | 0.57 |
So the rule we now follow very strictly is: use BF16 autocast for compute, but keep weights (and optimizer state) in FP32 or at least ensure numerically sensitive modules stay FP32.
It is not a glamorous trick but it is exactly the kind of “silent” detail that can cost you multiple days of work if you do not notice it early.
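Concretely, the pattern we follow looks like this minimal sketch (assuming a CUDA device with BF16 support; the shapes and toy loss are placeholders):

```python
import torch

device = "cuda"  # assumes a CUDA device with BF16 support
model = torch.nn.Linear(1024, 1024).to(device)               # weights stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # optimizer state in FP32

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

optimizer.zero_grad(set_to_none=True)
# Compute in BF16 via autocast; do NOT call model.to(torch.bfloat16).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()    # gradients land in FP32 because the parameters are FP32
optimizer.step()
```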
Summary
We ran a systematic set of ablations on PRX training, comparing a range of optimization, representation, efficiency, and data choices against a clean flow-matching baseline using both quality metrics and throughput.
The biggest gains came from alignment: REPA boosts early convergence (best used as a burn-in, then turned off), and better latents/tokenizers (REPA-E/FLUX2-AE) give a large jump in quality with clear speed trade-offs. Objective tweaks were mixed—contrastive FM helped slightly, while x-prediction mattered most by enabling stable 1024² pixel training. Token routing (TREAD/SPRINT) is minor at 256² but becomes a major win at high resolution. Data and practical details also mattered: long captions are critical, synthetic vs. real data shifts texture vs. structure, small SFT adds polish, Muon helped, and BF16-stored weights quietly hurt training.
What's next?
That’s it for Part 2! If you want to play with an earlier public checkpoint from this series, the PRX-1024 T2I beta is still available here.
We're really excited about what’s next: in the coming weeks we will release the full source code of the PRX training framework, and we will do a public 24-hour “speedrun” where we combine the best ideas from this post into a single run and see how far the full recipe can go in one day.
If you made it this far, first of all thank you very much for your interest. Furthermore, we would love to have you join our Discord community where we discuss PRX progress and results, along with everything related to diffusion and text-to-image models.