Familiarity-Flow OneBox 8-Layer

Flow-matching policy for stereo-image-conditioned 3D grasp-offset prediction, trained on the OneBox synthetic Isaac-Sim dataset. The full learning dynamics — value of the prediction, geometry of the flow, and Jacobian-of-conditioning OOD signal — are studied in the Familiarity-Flow repo.

Intended primarily as the conditioning-energy OOD-detection backend for robotic-policy gating, exposed through the familiarity-planner package.

This checkpoint comes from a 150,000-step extended-training study that explored flow / OOD-separation dynamics well past the conventional convergence point. See docs/long_run_analysis.md in the repo for the full write-up (multi-descent behaviour observed, not the monotone-plateau or terminal-collapse initially hypothesised).


Checkpoint summary

| Field | Value |
| --- | --- |
| Architecture | FlowMatchingPolicy, 8 cross-attention layers |
| Vision encoder | DINOv2-B (ViT-B/14, frozen) |
| Action space | ℝ³ (3-DoF grasp offset) |
| Time sampling | Beta(1.5, 1) (π₀ schedule) |
| Training data | OneBox (synthetic Isaac Sim, ZED-Mini stereo) |
| Training steps | 128,250 (best val_loss checkpoint of 150k-step run) |
| Best val_loss | 0.0639 |
| Best val L2 error | 0.1462 |
| Parameters | 244 M total, 35.6 M trainable (encoder frozen) |
| License | MIT |
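
The Time sampling row can be illustrated with a minimal sketch that draws flow times from Beta(1.5, 1) and builds a linearly interpolated training pair. It uses only the Python standard library; the helper names are hypothetical, and the sign of the target velocity as well as the way the repo maps Beta samples onto flow time are assumptions, not details confirmed by this card.

```python
import random

def sample_flow_time(rng):
    """Draw t in [0, 1] from Beta(1.5, 1); its density rises like sqrt(t)."""
    return rng.betavariate(1.5, 1.0)

def make_training_pair(x0, rng):
    """Build (t, x_t, target) for one action x0 under linear interpolation."""
    t = sample_flow_time(rng)
    x1 = [rng.gauss(0.0, 1.0) for _ in x0]              # noise endpoint x_1 ~ N(0, I)
    x_t = [(1.0 - t) * a + t * n for a, n in zip(x0, x1)]
    target = [n - a for a, n in zip(x0, x1)]            # assumed sign: d x_t / d t
    return t, x_t, target

rng = random.Random(0)
times = [sample_flow_time(rng) for _ in range(20000)]
mean_t = sum(times) / len(times)   # Beta(1.5, 1) has mean 1.5 / 2.5 = 0.6

# One hypothetical 3-DoF grasp offset.
t, x_t, target = make_training_pair([0.1, -0.2, 0.3], rng)
```

With noise at t = 1, this distribution places more samples toward the noisy end of the path than uniform sampling would.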

OOD-separation at this checkpoint (step 128,250)

| Metric | ID | OOD (clutter) | WILD (real) | OOD/ID | WILD/ID |
| --- | --- | --- | --- | --- | --- |
| CE | 0.642 | 3.341 | 2.077 | 5.20× | 3.23× |
| DCE | 0.062 | 0.303 | 0.186 | 4.87× | 2.99× |

AUROC(ID vs OOD) and AUROC(ID vs WILD) are both 1.000 (rank-based separation is perfect and has been since step ≈ 8k).

Reported directly from the training log at outputs/csv/onebox/version_15 in the repo.
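
As a reading aid for the AUROC figures, here is a minimal sketch of a rank-based AUROC: the probability that a randomly chosen OOD score exceeds a randomly chosen ID score, with ties counted as half. The function name and the score lists are made up for illustration and are not taken from the training log or the repo.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score."""
    wins = 0.0
    for o in ood_scores:
        for i in id_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5   # ties count as half a win
    return wins / (len(id_scores) * len(ood_scores))

# Made-up scores for illustration only.
separated = auroc([0.5, 0.6, 0.7], [2.9, 3.3, 3.8])   # every OOD score is higher -> 1.0
overlapping = auroc([1.0, 3.0], [2.0, 4.0])            # 3 of 4 pairs are ordered -> 0.75
```

An AUROC of 1.000 only says that every OOD score ranks above every ID score; it says nothing about how wide the margin is, which is why the ratio columns are reported alongside it.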

vs the previous checkpoint (step 21,850, val_loss 0.0726)

Better on every metric in the table below; the one small regression, CE OOD/ID, is noted after the table:

| Metric | Previous | This checkpoint | Δ |
| --- | --- | --- | --- |
| val/loss | 0.0726 | 0.0639 | −12.0% |
| val/l2_error | 0.1755 | 0.1462 | −16.7% |
| ood/loss | 4.414 | 4.241 | −3.9% |
| ood/l2_error | 1.371 | 1.271 | −7.3% |
| CE WILD/ID | 2.79× | 3.23× | +15.8% |
| DCE OOD/ID | 4.32× | 4.87× | +12.7% |
| DCE WILD/ID | 2.41× | 2.99× | +24.1% |

(CE OOD/ID drifted −2.1%, well inside the run-to-run variance observed during the extended run.)
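
The ratio and Δ columns are plain arithmetic on the reported numbers. The short sketch below reproduces a few of them; the helper names are hypothetical and the inputs are copied from the tables on this card.

```python
def ratio(numerator, denominator):
    """Separation ratio, e.g. mean OOD score divided by mean ID score."""
    return numerator / denominator

def percent_change(previous, current):
    """Relative change from the previous checkpoint, in percent."""
    return 100.0 * (current - previous) / previous

ce_ood_over_id = ratio(3.341, 0.642)              # CE, OOD (clutter) vs ID
ce_wild_over_id = ratio(2.077, 0.642)             # CE, WILD (real) vs ID
val_loss_delta = percent_change(0.0726, 0.0639)   # val/loss row
dce_wild_delta = percent_change(2.41, 2.99)       # DCE WILD/ID row
```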

Threshold-shift note: absolute CE/DCE values in this checkpoint are roughly 3× larger than in the previous one (CE_ID 0.225 → 0.642). A downstream OOD detector that uses an absolute threshold must be re-calibrated: the OOD/ID and WILD/ID ratios stay in the same range, but the raw scale does not.
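
One way to re-calibrate, sketched below, is to set the threshold at a high quantile of scores from held-out in-distribution samples. This is a common recipe and an assumption here, not necessarily what familiarity-planner does; the function names and score lists are hypothetical, and the second list simply rescales the first by the CE_ID shift quoted above.

```python
def calibrate_threshold(id_scores, quantile=0.95):
    """Return the score below which roughly `quantile` of ID samples fall."""
    ordered = sorted(id_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def is_ood(score, threshold):
    """Flag an observation whose score exceeds the calibrated threshold."""
    return score > threshold

# Hypothetical held-out ID scores for the previous checkpoint.
old_id = [0.20, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29]
# Same samples after the reported scale shift (0.225 -> 0.642).
new_id = [s * (0.642 / 0.225) for s in old_id]

old_threshold = calibrate_threshold(old_id)
new_threshold = calibrate_threshold(new_id)

# A raw score of 0.5 is flagged by the stale threshold but not by the new one.
stale_flag = is_ood(0.5, old_threshold)
fresh_flag = is_ood(0.5, new_threshold)
```

Reusing the stale threshold with this checkpoint would flag nearly every in-distribution input, which is the failure mode the note warns about.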


Usage

Download

from huggingface_hub import hf_hub_download
ckpt_path = hf_hub_download(
    repo_id="TomNotch/familiarity-flow-onebox-8L",
    filename="onebox_8L.ckpt",
)

Load directly (Familiarity-Flow must be installed)

from familiarity_flow.lightning.module import FlowMatchingModule

module = FlowMatchingModule.load_from_checkpoint(ckpt_path, map_location="cuda")
module.eval()
policy = module.ema_policy   # EMA-averaged weights used for inference

Score a batch for OOD-ness

# images: list of stereo image tensors, each shaped (B, 3, 224, 224)
ce = policy.ood_score(images, num_steps=10)   # shape: (B,)
# Higher CE = more OOD

Via familiarity-planner

from familiarity_planner.familiarity import Familiarity

fam = Familiarity(
    "conditioning_energy",
    checkpoint_path="TomNotch/familiarity-flow-onebox-8L",   # auto-downloaded
)
score = fam(stereo_observation)   # smaller = more familiar

Method

Conditional flow matching with linear interpolation and independent coupling (Lipman et al., ICLR 2023). The conditioning energy

$$\mathrm{CE}(c) = \int_0^1 \left\lVert \frac{\partial v_\theta}{\partial c}(x_t, t, c) \right\rVert_F^2 \, \mathrm{d}t$$

is measured along the deterministic Euler ODE trajectory from noise (x_1 ∼ N(0, I)) to the predicted action (x_0). Its endpoint-Jacobian cousin, DCE, is the squared Frobenius norm of ∂φ/∂c, where φ is the full ODE map from x_1 to x_0. Both grow when out-of-distribution inputs excite the learned velocity field's sensitivity to conditioning, so the signal falls out of the geometry of the flow without any auxiliary classifier.
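
To make the two quantities concrete, the sketch below evaluates both on a toy velocity field, using central differences in place of automatic differentiation. Everything in it is an assumption made for illustration: the toy field v = x − tanh(Wc) (which ignores t), the step direction, and all function names. It is not the repo's implementation and has no notion of in- or out-of-distribution inputs.

```python
import math
import random

def velocity(x, c, W):
    """Toy stand-in for v_theta (hypothetical): v_i = x_i - tanh((W c)_i)."""
    return [x[i] - math.tanh(sum(W[i][j] * c[j] for j in range(len(c))))
            for i in range(len(x))]

def sq_frobenius_jacobian(f, c, eps=1e-5):
    """Squared Frobenius norm of df/dc, estimated by central differences."""
    total = 0.0
    for j in range(len(c)):
        hi = list(c); hi[j] += eps
        lo = list(c); lo[j] -= eps
        total += sum(((a - b) / (2 * eps)) ** 2 for a, b in zip(f(hi), f(lo)))
    return total

def euler_step(x, c, W, dt):
    """One Euler step from t toward t - dt (noise at t = 1, action at t = 0)."""
    return [xi - dt * vi for xi, vi in zip(x, velocity(x, c, W))]

def conditioning_energy(x1, c, W, num_steps=10):
    """Riemann-sum estimate of CE along the Euler trajectory started at x1."""
    dt, x, ce = 1.0 / num_steps, list(x1), 0.0
    for _ in range(num_steps):
        x_here = x  # hold the trajectory point fixed while perturbing c
        ce += sq_frobenius_jacobian(lambda cc: velocity(x_here, cc, W), c) * dt
        x = euler_step(x, c, W, dt)
    return ce

def endpoint_energy(x1, c, W, num_steps=10):
    """DCE: squared Frobenius norm of d(phi)/dc, phi = full Euler ODE map."""
    def phi(cc):
        x = list(x1)
        for _ in range(num_steps):
            x = euler_step(x, cc, W, 1.0 / num_steps)
        return x
    return sq_frobenius_jacobian(phi, c)

rng = random.Random(0)
W = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]        # 3-D action, 2-D conditioning
x1 = [rng.gauss(0.0, 1.0) for _ in range(3)]    # x_1 ~ N(0, I)
ce_small_c = conditioning_energy(x1, [0.0, 0.0], W)
ce_large_c = conditioning_energy(x1, [3.0, 3.0], W)
dce_small_c = endpoint_energy(x1, [0.0, 0.0], W)
```

CE accumulates the partial derivative of the velocity with x_t held fixed at each step, while DCE differentiates through the whole trajectory, so the two generally differ even on the same input. In the toy field the tanh saturates for large c, so the sensitivity (and with it both energies) depends on where c sits.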


Limitations

  • Trained on a single synthetic domain (OneBox Isaac Sim renderings). Generalisation across robots, object sets, or camera rigs is not claimed.
  • Action head predicts only a 3-DoF grasp offset; not a full pose or trajectory.
  • OOD-detection quality (CE/DCE) is strong on the OneBox clutter and wild eval sets used during training — behaviour on arbitrary out-of-domain inputs is untested.
  • Not for deployment on physical robots without independent validation. Intended as a research artefact and as a concrete backend for methodology study.

Related work

  • Lipman et al., Flow Matching for Generative Modeling, ICLR 2023 (arXiv:2210.02747)
  • Black et al., π₀: A Vision-Language-Action Flow Model for General Robot Control (arXiv:2410.24164)
  • Chen et al., Neural Ordinary Differential Equations, NeurIPS 2018 (arXiv:1806.07366)
  • Liu et al., Simple and Principled Uncertainty Estimation (SNGP), NeurIPS 2020 (arXiv:2006.10108)
  • Nakkiran et al., Deep Double Descent, ICLR 2020 (arXiv:1912.02292)

Author

Mukai (Tom Notch) Yu — Carnegie Mellon University, Robotics Institute. Course project for 16-832 / 16-761 (Spring 2026).
