flux2distill — FLUX.2 [klein] 4B compression
Compress FLUX.2 [klein] distilled 4B (4-step, CFG-free MM-DiT) into a smaller, faster model.
Dev/prototyping rig on 1× A100-80GB (SDPA, no FlashAttention); designed to lift onto a B200.
See plan.md for the active design + decision log, RESULTS.md for all numbers.
ACTIVE TRACK — W4A8 SVDQuant (post-training quantization)
Our own fake-quant SVDQuant: per-Linear smooth → (whitened) SVD low-rank (16-bit) + iterative refine → 4-bit residual, 8-bit per-token activations. Quality is measured on the same held-out
velocity-matching loss as the surgery track, so the two tracks are directly comparable. Best cell: r64 plain+refine =
0.0446 @ 3.43× smaller, which is ~4–5× closer to the teacher than the entire block-surgery frontier.
Full methodology + grid + per-cell montages: report/QUANT_REPORT.{md,pdf}.
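The per-Linear pipeline above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the repo's implementation: the smoothing scale is a SmoothQuant-style ratio (assumption), the SVD is the plain (unwhitened) variant, the refine loop alternates SVD and residual quantization (assumption), and every function name here is hypothetical.

```python
import numpy as np

def fake_quant(x, bits, axis):
    """Symmetric round-to-nearest fake quantization, one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero slices
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def svdquant_linear(W, X_calib, rank=64, refine_iters=3, alpha=0.5):
    """W: (out, in); X_calib: (tokens, in). Returns (smooth, L1, L2, R_q)."""
    # 1) smooth: move activation outliers into the weights (assumed SmoothQuant-style scale)
    a_max = np.abs(X_calib).max(axis=0) + 1e-8
    w_max = np.abs(W).max(axis=0) + 1e-8
    smooth = a_max ** alpha / w_max ** (1 - alpha)
    W_s = W * smooth  # input columns scaled up; activations are divided by `smooth`
    # 2) low-rank branch from a truncated SVD, 3) 4-bit residual, refined alternately
    R_q = np.zeros_like(W_s)
    for _ in range(refine_iters + 1):
        U, S, Vt = np.linalg.svd(W_s - R_q, full_matrices=False)
        L1, L2 = U[:, :rank] * S[:rank], Vt[:rank]
        R_q = fake_quant(W_s - L1 @ L2, bits=4, axis=1)  # per-output-channel scales
    return smooth, L1, L2, R_q

def forward(x, smooth, L1, L2, R_q):
    """Low-rank branch sees unquantized activations; residual branch sees 8-bit per-token ones."""
    x_s = x / smooth
    x_q = fake_quant(x_s, bits=8, axis=1)
    return x_s @ (L1 @ L2).T + x_q @ R_q.T
```

Comparing `forward(X, ...)` against `X @ W.T` on held-out activations gives a per-layer error; the README's reported numbers come from the end-to-end velocity-matching loss instead.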
source .venv/bin/activate; export PYTHONPATH=. # torch 2.12+cu126; system python has no torch
# one grid cell (build + eval), its own logs: args = RANK variant WHITEN REFINE
bash scripts/run_cell.sh 64 plain_refine 0 3 # -> outputs/abl_c300_r64_plain_refine/
python3 scripts/make_quant_report_assets.py # analysis figures
python3 scripts/build_report_pdf.py # report/QUANT_REPORT.pdf (incl. all montages)
Run experiments ONE AT A TIME, each with its own log and a Monitor (no batched background loops). Calibration uses
the cached data/monet_cache latents (no image download for the 300-img grid). The 2000-img calib
re-sweep (scripts/11 → data/monet_calib) is the queued next experiment; see TODO.md.
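One way to keep a single foreground run with its own log is sketched below. `run_logged` and the log path are hypothetical, and `run_cell.sh` already writes its own logs, so treat this as an optional wrapper rather than part of the workflow.

```python
import subprocess
import sys
from pathlib import Path

def run_logged(cmd, log_path):
    """Run one command in the foreground, echoing its output and copying it to log_path."""
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("w") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:  # stream line by line so progress stays visible
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()  # exit code of the finished run

# Example call (args = RANK variant WHITEN REFINE; the log location is an assumption):
# run_logged(["bash", "scripts/run_cell.sh", "64", "plain_refine", "0", "3"],
#            "tmp/logs/r64_plain_refine.log")
```

The call blocks until the run finishes, so the next cell cannot start while one is still in flight.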
Backup / sync to the HF bucket
Work is archived to the HF bucket hf://buckets/Mercity/FluxDistill via hf sync. Set
HF_TOKEN (do NOT commit a token — read it from the env), and exclude the regenerable / huge /
secret paths. --no-delete is the default, so local deletions do not propagate to the bucket
(additive backup). Preview with --dry-run first; each --exclude pattern needs its own flag.
export HF_TOKEN=hf_... # your token; rotate if it ever leaks
hf sync ./ hf://buckets/Mercity/FluxDistill \
--exclude "*.pyc" \
--exclude "**/__pycache__/**" \
--exclude ".venv/**" \
--exclude ".cache/**" \
--exclude "tmp/**" \
--exclude "models/**"
# excluded: .venv (11 GB, regenerable), .cache (HF/pip caches), tmp (scratch logs),
# models (23 GB, the PUBLIC teacher, already on HF; drop its --exclude line to include it)
# add --dry-run to preview the plan (uploads nothing); --no-delete is default (deletions don't propagate)
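For a quick local sanity check of the --exclude patterns above, a rough preview can be sketched with the standard library. This assumes shell-style glob matching via `fnmatch`, which may not match exactly how `hf sync` interprets patterns; `--dry-run` remains the authoritative preview. The helper names are hypothetical.

```python
import os
from fnmatch import fnmatchcase

EXCLUDES = ["*.pyc", "**/__pycache__/**", ".venv/**", ".cache/**", "tmp/**", "models/**"]

def is_excluded(rel_path, patterns=EXCLUDES):
    """True if rel_path (POSIX-style, relative to the sync root) matches any pattern."""
    return any(fnmatchcase(rel_path, p) for p in patterns)

def preview(root="."):
    """Walk root and split files into (kept, skipped) under the approximate matcher."""
    kept, skipped = [], []
    for dirpath, _, files in os.walk(root):
        for name in files:
            rel = os.path.relpath(os.path.join(dirpath, name), root).replace(os.sep, "/")
            (skipped if is_excluded(rel) else kept).append(rel)
    return kept, skipped
```

Calling `preview(".")` from the repo root and inspecting the `kept` list is a cheap way to spot a large or secret file that none of the patterns cover.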
SHELVED TRACK — block surgery (depth-prune → surrogates → distill)
Topped out at ~1.15–1.26× and was quality-bounded (best 0.231 vs quant's ~0.045). Kept for record
(block_surgery_plan.md, block_surgery_todo.md, scripts 01–10). The rest of this README
documents that track. NOTE: its .pt model states were deleted to reclaim space (sample images /
logs / selection.json kept). Original design + decision log below.
Status (2026-05-31)
| Stage | State |
|---|---|
| Env + klein-4B download + arch verification | ✅ |
| Surgery: block selection + warm-started surrogates → student | ✅ |
| Inference (teacher & student) | ✅ teacher 0.45s/img, student ~0.31s/img @512/4steps |
| Eval: 28-prompt set + multi-agent visual review | ✅ outputs/eval/baseline/REVIEW.md |
| Data: monet URL→VAE-latent cache | ✅ data/monet_cache/ |
| Basic distillation training loop | ✅ velocity-match + FM grounding, Muon+AdamW |
Key finding: a per-token low-rank+GELU surrogate cannot reproduce attention's
token-mixing, so dropping 12 of 20 single blocks (v1) collapses the model. v2 keeps
most blocks full and drops only the 6 least-important single blocks (ranked by leave-one-out
ablation), giving a 3.16B student that is functional before any recovery training. The route back to ~2B is a token-mixing
surrogate (local-window / linear attention); see the plan.md TODO.
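The token-mixing limitation can be illustrated with a toy NumPy sketch. It assumes the surrogate form x + B·σ(A·x) with σ = GELU, as listed for surrogate.py; the shapes, the random weights, and the stripped-down single-head attention are illustrative only.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def surrogate(x, A, B):
    """x: (tokens, d). Per-token low-rank residual map x + B·gelu(A·x)."""
    return x + gelu(x @ A.T) @ B.T

def attention(x):
    """Single-head self-attention with identity projections, for contrast only."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return x + w @ x

rng = np.random.default_rng(0)
d, r, n = 16, 4, 8
A = rng.standard_normal((r, d)) * 0.1
B = rng.standard_normal((d, r)) * 0.1
x = rng.standard_normal((n, d))
x2 = x.copy()
x2[-1] += 1.0  # perturb only the last token

# How much does the output at token 0 move when only the last token changes?
sur_change = np.abs(surrogate(x2, A, B)[0] - surrogate(x, A, B)[0]).max()
att_change = np.abs(attention(x2)[0] - attention(x)[0]).max()
print(sur_change, att_change)
```

The surrogate's output at token 0 depends only on token 0, whereas attention's output at token 0 responds to the perturbed token. No choice of A and B changes that, which is why the surrogate has to gain a mixing operation rather than more rank.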
Models produced
- outputs/student/: v1 (drop 12 by SVD-energy), non-functional (reference).
- outputs/student_v2/: v2 (drop 6 by importance), 3.16B, functional baseline.
- outputs/train_v2/: v2 after the basic recovery run (+ sample grids).
Layout
flux2distill/
  config.py # all knobs (model / surgery / data / train / eval)
  surrogate.py # LowRankResidualSurrogate (x + B·σ(A·x)) + lstsq/SVD init
  surgery.py # importance ablation, SVD-energy selection, build/attach student
  calibration.py # surrogate warm-start gradient fit
  losses.py # velocity matching + flow-matching grounding
  data.py # cached-latent dataset
  model_utils.py # load teacher/student, Muon/AdamW param split, param counts
  eval_utils.py # prompt parsing, student loader, comparison grids
  optim/muon.py # Muon optimizer (2D weights)
scripts/
  01_inspect_model.py # introspect transformer module tree / params
  02_teacher_smoke.py # teacher 4-step generation sanity
  03_build_student.py # v1 surgery (SVD-energy, drop 12)
  04_gen_eval.py [tag] # teacher-vs-student images across prompt set
  05_build_student_v2.py [drop_k] # v2 surgery (importance, drop 6)
  06_cache_data.py [N] # monet URL → VAE latents cache
  07_train.py [steps] # FLAWED baseline run (trained all weights → diverged); kept as record
  08_train_recover.py [steps] [adamw|muon] [lr] # CORRECT: surrogate-only, frozen base, cosine+clip
prompts/eval_prompts.txt # 28 prompts, tagged by capability
plan.md # design + decision log + findings
Run order
export PYTHONPATH=.
python3 scripts/01_inspect_model.py # (optional) verify architecture
python3 scripts/02_teacher_smoke.py # teacher works
python3 scripts/05_build_student_v2.py 6 # build the v2 student (drop 6)
python3 scripts/04_gen_eval.py baseline # teacher vs student image pairs
python3 scripts/06_cache_data.py 200 # cache training data
python3 -u scripts/08_train_recover.py 300 adamw 1e-4 # surrogate-only recovery (correct recipe)
Training recipe (research-led)
Only the 6 surrogate modules (~19M, 0.6%) are trained; the pretrained network is frozen (training the kept blocks at high LR was what diverged — see plan.md). Surrogates are adapter-like, so the diffusion/LoRA regime applies: AdamW @ 1e-4, cosine decay to a 15%-of-base floor (not 0), grad-clip 1.0, fp32 master on the trained params. Muon's lr~0.02 is a bulk-pretraining value (nanoGPT/Kimi) — reserved for the later full-recovery run (the §8 Muon-vs-AdamW A/B), not adapter training. The loop logs a fixed held-out eval velocity-loss (objective metric), per-step sample images, grad norm, and saves the best checkpoint; a divergence guard auto-stops if eval-loss exceeds 3× baseline.
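The schedule and the divergence guard described above can be sketched in a few lines. This assumes no warmup and takes "baseline" to mean the held-out eval loss measured before training starts; both are assumptions, the NaN check is an addition for illustration, and the function names are hypothetical.

```python
import math

def cosine_lr(step, total_steps, base_lr=1e-4, floor_frac=0.15):
    """Cosine decay from base_lr down to floor_frac * base_lr over total_steps."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    floor = floor_frac * base_lr
    return floor + 0.5 * (base_lr - floor) * (1.0 + math.cos(math.pi * t))

def should_stop(eval_loss, baseline_loss, factor=3.0):
    """Divergence guard: stop when eval loss exceeds factor x baseline, or is NaN."""
    return math.isnan(eval_loss) or eval_loss > factor * baseline_loss
```

With the defaults, a 300-step run starts at 1e-4 and ends at 1.5e-5 rather than 0, so the surrogates keep receiving a usable update size through the last steps.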
Notes / upgrades for the big run (B200)
- Surrogate v2 → token-mixing (the real lever to reach ~2B): local-window or linear attention.
- FlashAttention-4 / FlexAttention; larger batch; torch.compile.
- fp32 master weights + fp32 moments (current dev run trains in bf16).
- Trajectory velocity matching on the 4 schedule sigmas (current run samples σ~U(0,1) on cached latents).
- Feature matching on retained blocks (masked KD); offline latent shards at 300k scale.
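The difference between the current σ~U(0,1) sampling and sampling on fixed schedule sigmas, together with the two loss terms, can be sketched as below. It assumes the rectified-flow convention x_σ = (1-σ)·x0 + σ·ε with target velocity ε - x0; the schedule values in the usage comment, `fm_weight`, and all function names are hypothetical placeholders. True trajectory matching would also need the states the teacher's sampler actually visits, which this sketch does not cover.

```python
import numpy as np

def sample_sigmas(batch, rng, schedule=None):
    """σ ~ U(0,1) when schedule is None, else uniform draws from the given sigma values."""
    if schedule is None:
        return rng.uniform(0.0, 1.0, size=batch)
    return rng.choice(np.asarray(schedule, dtype=float), size=batch)

def noisy_latents(x0, sigma, rng):
    """Return (x_sigma, v_fm) with x_sigma = (1-σ)·x0 + σ·ε and v_fm = ε - x0."""
    eps = rng.standard_normal(x0.shape)
    s = sigma.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast σ over latent dims
    return (1.0 - s) * x0 + s * eps, eps - x0

def distill_loss(v_student, v_teacher, v_fm, fm_weight=0.1):
    """Velocity matching against the teacher plus a flow-matching grounding term."""
    vm = np.mean((v_student - v_teacher) ** 2)
    fm = np.mean((v_student - v_fm) ** 2)
    return vm + fm_weight * fm

# Usage with placeholder sigmas (the real 4-step schedule values are not given here):
# sig = sample_sigmas(8, np.random.default_rng(0), schedule=[1.0, 0.75, 0.5, 0.25])
```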