256 GB
18,609 files
Updated 1 day ago

flux2distill — FLUX.2 [klein] 4B compression

Compress FLUX.2 [klein] distilled 4B (4-step, CFG-free MM-DiT) into a smaller, faster model. Dev/prototyping rig on 1× A100-80GB (SDPA, no FlashAttention); designed to lift onto a B200. See plan.md for the active design + decision log, RESULTS.md for all numbers.

ACTIVE TRACK — W4A8 SVDQuant (post-training quantization)

Our own fake-quant SVDQuant: per-Linear smooth → (whitened) SVD low-rank (16-bit) + iterative refine → 4-bit residual, 8-bit per-token activations. Quality measured on the same held-out velocity-matching loss as the surgery track (so they're comparable). Best: r64 plain+refine = 0.0446 @ 3.43× smaller — ~4–5× closer to the teacher than the entire block-surgery frontier. Full methodology + grid + per-cell montages: report/QUANT_REPORT.{md,pdf}.
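To make the W4A8 part of the recipe above concrete, here is a minimal pure-Python sketch of symmetric fake quantization: 8-bit activations with one scale per token, and 4-bit weights. It is an illustration under stated assumptions, not the code in this repo. It omits the smoothing step, the 16-bit low-rank SVD branch and the iterative refine, and the per-row weight scale is an assumption (this README does not say how weight scales are grouped). All function names are hypothetical.

```python
def fake_quant_symmetric(values, bits):
    """Round floats to a signed `bits`-bit grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def quantize_activations_per_token(x, bits=8):
    """x is a list of tokens (each a list of channels); one scale per token."""
    return [fake_quant_symmetric(token, bits) for token in x]


def quantize_weight_rows(w, bits=4):
    """w is a list of output rows; one scale per row (ASSUMED granularity)."""
    return [fake_quant_symmetric(row, bits) for row in w]


def linear(x, w):
    """Plain y = x @ w.T on nested lists."""
    return [[sum(a * b for a, b in zip(token, row)) for row in w] for token in x]


# Toy data: the second token has a much larger dynamic range than the first,
# which is the case per-token activation scales are meant to handle.
x = [[0.5, -1.0, 0.25], [8.0, 2.0, -4.0]]
w = [[0.1, -0.2, 0.3], [0.7, 0.0, -0.7]]

y_ref = linear(x, w)
y_q = linear(quantize_activations_per_token(x), quantize_weight_rows(w))
max_err = max(abs(a - b) for ra, rb in zip(y_ref, y_q) for a, b in zip(ra, rb))
print(f"max abs output error from fake quant: {max_err:.4f}")
```

In the full method the 4-bit grid would only have to represent the residual left after the low-rank branch is subtracted, which is why a higher rank can reduce the quantization error.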

source .venv/bin/activate; export PYTHONPATH=.        # torch 2.12+cu126; system python has no torch
# one grid cell (build + eval), its own logs:  args = RANK variant WHITEN REFINE
bash scripts/run_cell.sh 64 plain_refine 0 3          # -> outputs/abl_c300_r64_plain_refine/
python3 scripts/make_quant_report_assets.py           # analysis figures
python3 scripts/build_report_pdf.py                   # report/QUANT_REPORT.pdf (incl. all montages)

Run experiments ONE AT A TIME with per-run logs + a Monitor (no batched bg loops). Calibration uses the cached data/monet_cache latents (no image download for the 300-img grid). The 2000-img calib re-sweep (scripts/11data/monet_calib) is the queued next experiment — see TODO.md.

Backup / sync to the HF bucket

Work is archived to the HF bucket hf://buckets/Mercity/FluxDistill via hf sync. Set HF_TOKEN (do NOT commit a token — read it from the env), and exclude the regenerable / huge / secret paths. --no-delete is the default, so local deletions do not propagate to the bucket (additive backup). Preview with --dry-run first; each --exclude pattern needs its own flag.

export HF_TOKEN=hf_...                      # your token; rotate if it ever leaks
# Excluded paths:
#   .venv/   11 GB, regenerable
#   .cache/  HF/pip caches
#   tmp/     scratch logs
#   models/  23 GB, the PUBLIC teacher, already on HF (drop that --exclude to include it)
# Keep comments off the continued lines: a "\" followed by spaces and a comment
# no longer escapes the newline, so the command would be cut short.
hf sync ./ hf://buckets/Mercity/FluxDistill \
  --exclude "*.pyc" \
  --exclude "**/__pycache__/**" \
  --exclude ".venv/**" \
  --exclude ".cache/**" \
  --exclude "tmp/**" \
  --exclude "models/**"
# add --dry-run to preview the plan (uploads nothing); --no-delete is default (deletions don't propagate)
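If the sync is driven from a script instead of typed by hand, building the argument list in Python sidesteps shell quoting and line-continuation pitfalls. The sketch below is one possible wrapper, not part of this repo: `build_sync_command` is a hypothetical helper, and it uses only the `hf sync` flags already named in this README (`--exclude`, `--dry-run`). It defaults to a dry run.

```python
import shlex


def build_sync_command(local_dir, bucket_uri, excludes, dry_run=True):
    """Return the argv list for `hf sync`, one --exclude flag per pattern."""
    argv = ["hf", "sync", local_dir, bucket_uri]
    for pattern in excludes:
        argv += ["--exclude", pattern]      # each pattern needs its own flag
    if dry_run:
        argv.append("--dry-run")            # preview only, uploads nothing
    return argv


EXCLUDES = [
    "*.pyc",
    "**/__pycache__/**",
    ".venv/**",      # regenerable
    ".cache/**",     # HF/pip caches
    "tmp/**",        # scratch logs
    "models/**",     # public teacher, already on HF
]

cmd = build_sync_command("./", "hf://buckets/Mercity/FluxDistill", EXCLUDES)
print(shlex.join(cmd))   # shell-quoted form, handy for copy-paste or logging
```

Once the preview looks right, the same list (built with `dry_run=False`) can be handed to `subprocess.run(cmd, check=True)`; `HF_TOKEN` is still read from the environment.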

SHELVED TRACK — block surgery (depth-prune → surrogates → distill)

Topped out at ~1.15–1.26× and was quality-bounded (best 0.231 vs quant's ~0.045). Kept for record (block_surgery_plan.md, block_surgery_todo.md, scripts 01–10). The rest of this README documents that track. NOTE: its .pt model states were deleted to reclaim space (sample images / logs / selection.json kept). Original design + decision log below.

Status (2026-05-31)

Stage                                                         State
Env + klein-4B download + arch verification
Surgery: block selection + warm-started surrogates → student
Inference (teacher & student)                                 ✅ teacher 0.45s/img, student ~0.31s/img @512/4steps
Eval: 28-prompt set + multi-agent visual review               outputs/eval/baseline/REVIEW.md
Data: monet URL→VAE-latent cache                              data/monet_cache/
Basic distillation training loop                              ✅ velocity-match + FM grounding, Muon+AdamW

Key finding: a per-token low-rank+GELU surrogate cannot reproduce attention's token mixing, so dropping 12 of 20 single blocks (v1) collapses the model. v2 keeps most blocks intact and drops only the 6 least-important single blocks (ranked by leave-one-out ablation), giving a 3.16B model that is functional before any recovery training. The route back to ~2B is a token-mixing surrogate (local-window or linear attention); see the TODO in plan.md.
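The leave-one-out ranking used for v2 can be sketched as follows. This is a simplified illustration, not the logic in surgery.py: `eval_loss` stands in for whatever held-out metric the real ablation uses, and the toy importances are made-up numbers chosen only so the example runs.

```python
def rank_blocks_by_leave_one_out(block_ids, eval_loss):
    """Return (block_id, loss_increase) pairs, least important first.

    eval_loss(dropped) must return the held-out loss with the blocks in
    `dropped` removed; it is a hypothetical callback in this sketch.
    """
    baseline = eval_loss(frozenset())
    deltas = [(b, eval_loss(frozenset([b])) - baseline) for b in block_ids]
    return sorted(deltas, key=lambda pair: pair[1])


def select_blocks_to_drop(block_ids, eval_loss, drop_k):
    """Pick the drop_k blocks whose individual removal hurts the least."""
    ranked = rank_blocks_by_leave_one_out(block_ids, eval_loss)
    return [b for b, _ in ranked[:drop_k]]


# Toy stand-in for the real evaluation: each block adds a fixed penalty
# when dropped (made-up values, not measurements from this project).
TOY_IMPORTANCE = {0: 0.90, 1: 0.05, 2: 0.40, 3: 0.01, 4: 0.20}


def toy_eval_loss(dropped):
    return 0.1 + sum(TOY_IMPORTANCE[b] for b in dropped)


to_drop = select_blocks_to_drop(list(TOY_IMPORTANCE), toy_eval_loss, drop_k=2)
print(to_drop)
```

Leave-one-out scores each block in isolation, so it does not capture interactions between blocks that are dropped together; the combined student still has to be evaluated as a whole.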

Models produced

  • outputs/student/ — v1 (drop 12 by SVD-energy) — non-functional (reference).
  • outputs/student_v2/ — v2 (drop 6 by importance) — 3.16B, functional baseline.
  • outputs/train_v2/ — v2 after the basic recovery run (+ sample grids).

Layout

flux2distill/
  config.py        # all knobs (model / surgery / data / train / eval)
  surrogate.py     # LowRankResidualSurrogate (x + B·σ(A·x)) + lstsq/SVD init
  surgery.py       # importance ablation, SVD-energy selection, build/attach student
  calibration.py   # surrogate warm-start gradient fit
  losses.py        # velocity matching + flow-matching grounding
  data.py          # cached-latent dataset
  model_utils.py   # load teacher/student, Muon/AdamW param split, param counts
  eval_utils.py    # prompt parsing, student loader, comparison grids
  optim/muon.py    # Muon optimizer (2D weights)
scripts/
  01_inspect_model.py     # introspect transformer module tree / params
  02_teacher_smoke.py     # teacher 4-step generation sanity
  03_build_student.py     # v1 surgery (SVD-energy, drop 12)
  04_gen_eval.py [tag]    # teacher-vs-student images across prompt set
  05_build_student_v2.py [drop_k]  # v2 surgery (importance, drop 6)
  06_cache_data.py [N]    # monet URL → VAE latents cache
  07_train.py [steps]     # FLAWED baseline run (trained all weights → diverged); kept as record
  08_train_recover.py [steps] [adamw|muon] [lr]  # CORRECT: surrogate-only, frozen base, cosine+clip
prompts/eval_prompts.txt  # 28 prompts, tagged by capability
plan.md                   # design + decision log + findings
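The surrogate form noted for surrogate.py, x + B·σ(A·x) with σ = GELU, can be written out in a few lines of pure Python. This is a sketch of the formula only (no lstsq/SVD init, no batching, toy shapes and weights that are made up for the example); it also shows why such a surrogate cannot mix tokens: each token is transformed independently.

```python
import math


def gelu(v):
    """Exact GELU: 0.5 * v * (1 + erf(v / sqrt(2)))."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def surrogate_token(x, A, B):
    """x + B·gelu(A·x) for one token; A has shape (r, d), B has shape (d, r)."""
    hidden = [gelu(v) for v in matvec(A, x)]
    return [xi + yi for xi, yi in zip(x, matvec(B, hidden))]


def surrogate(tokens, A, B):
    """Apply the same low-rank residual map to every token separately."""
    return [surrogate_token(t, A, B) for t in tokens]


# Toy weights with d = 4, r = 1 (illustrative values only).
A = [[0.5, -0.25, 0.0, 1.0]]
B = [[0.1], [0.0], [-0.2], [0.3]]

tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.5, 2.0]]
out = surrogate(tokens, A, B)

# Changing token 1 leaves the output for token 0 untouched: no token mixing.
perturbed = [tokens[0], [9.0, 9.0, 9.0, 9.0]]
out_perturbed = surrogate(perturbed, A, B)
print(out[0] == out_perturbed[0])
```

With B set to zero the map reduces to the identity, so a replaced block starts out as a plain skip connection.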

Run order

export PYTHONPATH=.
python3 scripts/01_inspect_model.py            # (optional) verify architecture
python3 scripts/02_teacher_smoke.py            # teacher works
python3 scripts/05_build_student_v2.py 6       # build the v2 student (drop 6)
python3 scripts/04_gen_eval.py baseline        # teacher vs student image pairs
python3 scripts/06_cache_data.py 200           # cache training data
python3 -u scripts/08_train_recover.py 300 adamw 1e-4   # surrogate-only recovery (correct recipe)

Training recipe (research-led)

Only the 6 surrogate modules (~19M, 0.6%) are trained; the pretrained network is frozen (training the kept blocks at high LR was what diverged — see plan.md). Surrogates are adapter-like, so the diffusion/LoRA regime applies: AdamW @ 1e-4, cosine decay to a 15%-of-base floor (not 0), grad-clip 1.0, fp32 master on the trained params. Muon's lr~0.02 is a bulk-pretraining value (nanoGPT/Kimi) — reserved for the later full-recovery run (the §8 Muon-vs-AdamW A/B), not adapter training. The loop logs a fixed held-out eval velocity-loss (objective metric), per-step sample images, grad norm, and saves the best checkpoint; a divergence guard auto-stops if eval-loss exceeds 3× baseline.
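Two pieces of the recipe above are simple enough to sketch directly: the cosine decay to a 15%-of-base floor and the 3× divergence guard. The function names are hypothetical, no warmup is included because the README does not mention one, and "baseline" is assumed here to mean the eval loss measured before training starts.

```python
import math


def cosine_lr_with_floor(step, total_steps, base_lr=1e-4, floor_frac=0.15):
    """Cosine decay from base_lr down to floor_frac * base_lr (never to 0)."""
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    floor_lr = base_lr * floor_frac
    return floor_lr + 0.5 * (base_lr - floor_lr) * (1.0 + math.cos(math.pi * progress))


def should_stop(eval_loss, baseline_loss, factor=3.0):
    """Divergence guard: stop once eval loss exceeds factor x the baseline."""
    return eval_loss > factor * baseline_loss


total = 300
for step in (0, 150, 300):
    print(step, f"{cosine_lr_with_floor(step, total):.2e}")
```

Keeping the floor above zero means the surrogates keep receiving non-trivial updates through the end of a short run.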

Notes / upgrades for the big run (B200)

  • Surrogate v2 → token-mixing (the real lever to reach ~2B): local-window or linear attention.
  • FlashAttention-4 / FlexAttention; larger batch; torch.compile.
  • fp32 master weights + fp32 moments (current dev run trains in bf16).
  • Trajectory velocity matching on the 4 schedule sigmas (current run samples σ~U(0,1) on cached latents).
  • Feature matching on retained blocks (masked KD); offline latent shards at 300k scale.
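The difference between the current σ~U(0,1) sampling and trajectory matching on the 4 schedule sigmas can be sketched as below. Several things here are assumptions, not facts from this repo: the four sigma values are placeholders (the real ones come from the sampler schedule), the noising rule x_σ = (1 − σ)·x0 + σ·ε is the usual rectified-flow convention, and all names are hypothetical.

```python
import random


def sample_sigma_uniform(rng):
    """Current dev-run behaviour as described: sigma ~ U(0, 1)."""
    return rng.random()


def sample_sigma_from_schedule(rng, schedule):
    """Trajectory matching: only train on sigmas the sampler actually visits."""
    return rng.choice(schedule)


def noised_latent(x0, noise, sigma):
    """ASSUMED rectified-flow interpolation between a clean latent and noise."""
    return [(1.0 - sigma) * a + sigma * n for a, n in zip(x0, noise)]


def velocity_match_loss(v_student, v_teacher):
    """Mean squared error between student and teacher velocity predictions."""
    return sum((s - t) ** 2 for s, t in zip(v_student, v_teacher)) / len(v_teacher)


PLACEHOLDER_SCHEDULE = [1.0, 0.75, 0.5, 0.25]   # hypothetical, NOT the real sigmas

rng = random.Random(0)
x0 = [0.2, -0.4, 1.0]                            # stand-in for a cached latent
noise = [rng.gauss(0.0, 1.0) for _ in x0]
sigma = sample_sigma_from_schedule(rng, PLACEHOLDER_SCHEDULE)
x_sigma = noised_latent(x0, noise, sigma)
print(sigma, x_sigma)
```

In a training step, `x_sigma` would be fed to both teacher and student, and `velocity_match_loss` applied to their two outputs.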
Last updated
Jun 1