# Familiarity-Flow OneBox 8-Layer
Flow-matching policy for stereo-image-conditioned 3D grasp-offset prediction, trained on the OneBox synthetic Isaac-Sim dataset. The full learning dynamics — value of the prediction, geometry of the flow, and Jacobian-of-conditioning OOD signal — are studied in the Familiarity-Flow repo.
Intended primarily as the conditioning-energy OOD-detection backend for robotic-policy gating, exposed through the familiarity-planner package.
This checkpoint comes from a 150,000-step extended-training study that explored flow / OOD-separation dynamics well past the conventional convergence point. See `docs/long_run_analysis.md` in the repo for the full write-up (multi-descent behaviour was observed, not the monotone plateau or terminal collapse initially hypothesised).
## Checkpoint summary
| Field | Value |
|---|---|
| Architecture | FlowMatchingPolicy, 8 cross-attention layers |
| Vision encoder | DINOv2-B (ViT-B/14, frozen) |
| Action space | ℝ³ (3-DoF grasp offset) |
| Time sampling | Beta(1.5, 1) (π₀ schedule) |
| Training data | OneBox (synthetic Isaac Sim, ZED-Mini stereo) |
| Training steps | 128,250 (best val_loss checkpoint of 150k-step run) |
| Best val_loss | 0.0639 |
| Best val L2 error | 0.1462 |
| Parameters | 244 M total, 35.6 M trainable (encoder frozen) |
| License | MIT |
## OOD-separation at this checkpoint (step 128,250)
| Metric | ID | OOD (clutter) | WILD (real) | OOD/ID | WILD/ID |
|---|---|---|---|---|---|
| CE | 0.642 | 3.341 | 2.077 | 5.20× | 3.23× |
| DCE | 0.062 | 0.303 | 0.186 | 4.87× | 2.99× |
AUROC(ID vs OOD) and AUROC(ID vs WILD) are both 1.000 (rank-based separation is perfect and has been since step ≈ 8k).
Numbers are reported directly from the training log at `outputs/csv/onebox/version_15` in the repo.
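The AUROC figure is a pure rank statistic, so it can be checked from raw score arrays alone. A minimal sketch (a generic implementation, not the repo's own evaluation code):

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Rank-based AUROC: the probability that a randomly drawn OOD score
    exceeds a randomly drawn ID score (ties count half)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # Exhaustive pairwise comparison; fine for evaluation-sized sets.
    greater = (ood_scores[:, None] > id_scores[None, :]).mean()
    ties = (ood_scores[:, None] == id_scores[None, :]).mean()
    return greater + 0.5 * ties

# Every OOD score above every ID score gives perfect separation:
print(auroc([0.5, 0.7, 0.6], [3.1, 3.4, 2.9]))  # → 1.0
```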
## Versus the previous checkpoint (step 21,850, val_loss 0.0726)
Strictly better or tied on every metric we measured:
| Metric | Previous | This checkpoint | Δ |
|---|---|---|---|
| val/loss | 0.0726 | 0.0639 | −12.0% |
| val/l2_error | 0.1755 | 0.1462 | −16.7% |
| ood/loss | 4.414 | 4.241 | −3.9% |
| ood/l2_error | 1.371 | 1.271 | −7.3% |
| CE WILD/ID | 2.79× | 3.23× | +15.8% |
| DCE OOD/ID | 4.32× | 4.87× | +12.7% |
| DCE WILD/ID | 2.41× | 2.99× | +24.1% |
(CE OOD/ID drifted −2.1%, well inside the run-to-run variance observed during the extended run.)
Threshold-shift note: absolute CE/DCE values in this checkpoint are ~3× larger than in the previous one (CE_ID 0.225 → 0.642). A downstream OOD detector using an absolute threshold needs to be re-calibrated — ratios are preserved but the raw scale is not.
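One way to survive such a scale shift is to derive the absolute threshold from a quantile of freshly computed ID scores instead of carrying a fixed number across checkpoints. A sketch under that assumption (`id_ce` is a hypothetical array of CE values on held-out ID data, not something the repo provides):

```python
import numpy as np

def calibrate_threshold(id_ce, fpr=0.05):
    """Pick the CE threshold that flags at most `fpr` of ID inputs as OOD.
    Re-run this whenever the checkpoint (and hence the CE scale) changes."""
    return float(np.quantile(np.asarray(id_ce, dtype=float), 1.0 - fpr))

# Same procedure on old-scale and new-scale ID scores yields different
# absolute thresholds, but both target the same ID false-positive rate.
old_thr = calibrate_threshold([0.20, 0.22, 0.25, 0.23, 0.21])
new_thr = calibrate_threshold([0.60, 0.66, 0.64, 0.62, 0.70])
```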
## Usage
### Download

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="TomNotch/familiarity-flow-onebox-8L",
    filename="onebox_8L.ckpt",
)
```
### Load directly (Familiarity-Flow must be installed)

```python
from familiarity_flow.lightning.module import FlowMatchingModule

module = FlowMatchingModule.load_from_checkpoint(ckpt_path, map_location="cuda")
module.eval()
policy = module.ema_policy  # EMA-averaged weights used for inference
```
### Score a batch for OOD-ness

```python
# images: list of stereo image tensors, each shaped (B, 3, 224, 224)
ce = policy.ood_score(images, num_steps=10)  # shape: (B,)
# Higher CE = more OOD
```
### Via familiarity-planner

```python
from familiarity_planner.familiarity import Familiarity

fam = Familiarity(
    "conditioning_energy",
    checkpoint_path="TomNotch/familiarity-flow-onebox-8L",  # auto-downloaded
)
score = fam(stereo_observation)  # smaller = more familiar
```
## Method
Conditional flow matching with linear interpolation and independent coupling (Lipman et al., ICLR 2023). The conditioning energy (CE) is measured along the deterministic Euler ODE trajectory from noise (x₁ ∼ N(0, I)) to the predicted action (x₀). Its endpoint-Jacobian cousin, DCE, measures the squared Frobenius norm of ∂φ/∂c, where φ is the full ODE map. Both grow as out-of-distribution inputs excite the learned velocity field's sensitivity to conditioning: a signal that falls out of the geometry of the flow without any auxiliary classifier.
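To make DCE concrete, here is a toy stand-in (none of this is the repo's code): a hand-made linear velocity field whose Euler map φ depends on conditioning through a fixed matrix `W`, with ‖∂φ/∂c‖²_F estimated by central differences.

```python
import numpy as np

W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy conditioning projection

def euler_map(x1, c, num_steps=10):
    """Toy Euler ODE map phi(x1, c): integrate a hand-made linear
    velocity field v(x, c) = -x + W @ c over unit time."""
    x, dt = x1.copy(), 1.0 / num_steps
    for _ in range(num_steps):
        x = x + dt * (-x + W @ c)
    return x

def dce(x1, c, eps=1e-5):
    """Squared Frobenius norm of d(phi)/dc, via central differences."""
    jac = np.zeros((x1.size, c.size))
    for j in range(c.size):
        e = np.zeros_like(c)
        e[j] = eps
        jac[:, j] = (euler_map(x1, c + e) - euler_map(x1, c - e)) / (2 * eps)
    return float((jac ** 2).sum())

x1 = np.array([0.3, -0.5, 0.8])  # noise endpoint x_1
c = np.array([0.1, 0.2])         # conditioning vector
print(dce(x1, c))
```

Because the toy field is linear, the finite-difference Jacobian is exact here; in the real model the same quantity reflects how strongly the learned flow's endpoint moves when the conditioning is perturbed.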
## Limitations
- Trained on a single synthetic domain (OneBox Isaac Sim renderings). Generalisation across robots, object sets, or camera rigs is not claimed.
- Action head predicts only a 3-DoF grasp offset, not a full pose or trajectory.
- OOD-detection quality (CE/DCE) is strong on the OneBox clutter and wild eval sets used during training; behaviour on arbitrary out-of-domain inputs is untested.
- Not for deployment on physical robots without independent validation. Intended as a research artefact and as a concrete backend for methodology study.
## Related work
- Lipman et al., Flow Matching for Generative Modeling, ICLR 2023 (arXiv:2210.02747)
- Black et al., π₀: A Vision-Language-Action Flow Model for General Robot Control (arXiv:2410.24164)
- Chen et al., Neural Ordinary Differential Equations, NeurIPS 2018 (arXiv:1806.07366)
- Liu et al., Simple and Principled Uncertainty Estimation (SNGP), NeurIPS 2020 (arXiv:2006.10108)
- Nakkiran et al., Deep Double Descent, ICLR 2020 (arXiv:1912.02292)
## Author
Mukai (Tom Notch) Yu — Carnegie Mellon University, Robotics Institute. Course project for 16-832 / 16-761 (Spring 2026).