# SAM3 Browser INT8-Quantized ONNX Models
INT8-quantized ONNX models for running SAM3 (Segment Anything Model 3) entirely in the browser via ONNX Runtime Web.
## Files
| File | Size | Description |
|---|---|---|
| `sam3_image_encoder.onnx` | 466 MB | ViT backbone: encodes the input image into feature maps |
| `sam3_language_encoder.onnx` | 387 MB | CLIP text encoder: converts text prompts into embeddings |
| `sam3_decoder.onnx` | 35 MB | DETR-style decoder: produces boxes, scores, and pixel masks |
| `clip_tokenizer.json` | 1.5 MB | CLIP BPE tokenizer vocabulary (encoder + merge table + byte encoder) |
| **Total** | **~889 MB** | |
## Tokenizer
`clip_tokenizer.json` contains the full CLIP BPE tokenizer data needed to tokenize text prompts for the language encoder. It includes:
- `encoder`: BPE token → integer ID mapping (~49,408 entries)
- `merges`: BPE merge rules (~48,894 pairs)
- `byte_encoder`: byte-to-unicode mapping for UTF-8 handling
This is extracted from OpenAI's CLIP `bpe_simple_vocab_16e6.txt.gz` and packaged as JSON for browser use. Load it at runtime (~500 KB gzipped by CDN) and use it to tokenize prompts into int64 token sequences of length 32, padded as `[START=49406, ...tokens..., END=49407, 0, 0, ...]`.
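The padding step above can be sketched as follows (a minimal illustration; the BPE encoding itself is omitted, and `padTokens` is a hypothetical helper name, not part of this repo):

```typescript
// Hypothetical helper: pad a list of CLIP BPE token IDs into the fixed
// int64[32] layout the language encoder expects:
// [START=49406, ...tokens..., END=49407, 0, 0, ...]
const START = 49406;
const END = 49407;
const CONTEXT_LEN = 32;

function padTokens(tokenIds: number[]): BigInt64Array {
  const kept = tokenIds.slice(0, CONTEXT_LEN - 2); // leave room for START/END
  const out = new BigInt64Array(CONTEXT_LEN);      // zero-filled padding
  out[0] = BigInt(START);
  kept.forEach((id, i) => { out[i + 1] = BigInt(id); });
  out[kept.length + 1] = BigInt(END);
  return out;
}
```

The resulting `BigInt64Array` maps directly onto an int64 input tensor of shape `[1, 32]`.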
## Quantization
Dynamic INT8 quantization via `onnxruntime.quantization.quantize_dynamic` with `QUInt8` weights. The original FP32 models totaled ~3.5 GB.
Quality is preserved: the quantized pipeline scores 0.9495 on a test image (vs 0.9471 for FP32).
## Usage
These models are designed for in-browser inference and can be loaded with ONNX Runtime Web.
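A loading sketch, assuming the `onnxruntime-web` npm package (the execution-provider order here is an assumption, not a requirement; WASM alone also works, per the Performance table below):

```typescript
import * as ort from 'onnxruntime-web';

// Prefer WebGPU when available; ONNX Runtime Web falls back through the list.
const opts: ort.InferenceSession.SessionOptions = {
  executionProviders: ['webgpu', 'wasm'],
};

// One session per model; URLs assume the files are served alongside the page.
const imageEncoder = await ort.InferenceSession.create('sam3_image_encoder.onnx', opts);
const languageEncoder = await ort.InferenceSession.create('sam3_language_encoder.onnx', opts);
const decoder = await ort.InferenceSession.create('sam3_decoder.onnx', opts);
```

Given the model sizes, consider caching the fetched bytes (e.g. in the Cache API) so the ~889 MB download happens only once.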
### Pipeline
- Tokenizer: text prompt → `clip_tokenizer.json` → `int64[1, 32]` (CLIP BPE tokens)
- Image encoder: input `image` (uint8, shape `[3, 1008, 1008]`) → 6 output tensors (vision pos encodings + backbone FPN features)
- Language encoder: input `tokens` (int64, shape `[1, 32]`) → `text_attention_mask`, `text_memory`, `text_embeds`
- Decoder: combines encoder outputs + prompt tensors → `boxes`, `scores`, `masks`
### Decoder inputs
| Input | Type | Shape | Source |
|---|---|---|---|
| `original_height` | int64 | scalar | Original image height |
| `original_width` | int64 | scalar | Original image width |
| `vision_pos_enc_2` | float32 | from encoder | Image encoder output |
| `backbone_fpn_0/1/2` | float32 | from encoder | Image encoder outputs |
| `language_mask` | float32 | from lang encoder | = `text_attention_mask` |
| `language_features` | float32 | from lang encoder | = `text_memory` |
| `box_coords` | float32 | `[1, 1, 4]` | Zeros for text-only prompting |
| `box_labels` | int64 | `[1, 1]` | Ones |
| `box_masks` | bool | `[1, 1]` | Ones |
## Performance
| Environment | Score | Total time |
|---|---|---|
| Python FP32 (CPU) | 0.9471 | 7.6s |
| Python INT8 (CPU) | 0.9495 | 4.5s |
| Browser WASM | 0.9402 | 94.7s |
| Browser WebGPU | ~0.94 (est.) | ~6-18s (est.) |
## Source
Original SAM3 ONNX models from `vietanhdev/segment-anything-3-onnx-models`, quantized for browser deployment.