sam3-bf16

facebook/sam3 converted to MLX (bfloat16, 1.72 GB).

The model (~860M parameters) performs open-vocabulary object detection, instance segmentation, and video tracking on Apple Silicon.

Quick Start

pip install mlx-vlm

from PIL import Image
from mlx_vlm.utils import load_model, get_model_path
from mlx_vlm.models.sam3.generate import Sam3Predictor
from mlx_vlm.models.sam3.processing_sam3 import Sam3Processor

model_path = get_model_path("mlx-community/sam3-bf16")
model = load_model(model_path)
processor = Sam3Processor.from_pretrained(str(model_path))
predictor = Sam3Predictor(model, processor, score_threshold=0.3)

Object Detection

image = Image.open("photo.jpg")
result = predictor.predict(image, text_prompt="a dog")

for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
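
When a prompt matches many objects, it can help to keep only the most confident detections. The sketch below uses plain NumPy arrays as stand-ins for `result.boxes` and `result.scores`; the helper name `top_detections` and the sample values are illustrative and not part of mlx-vlm.

```python
import numpy as np

def top_detections(boxes, scores, min_score=0.5, k=5):
    """Return up to k (box, score) pairs with score >= min_score, best first."""
    boxes = np.asarray(boxes, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    keep = np.flatnonzero(scores >= min_score)        # indices above the cutoff
    order = keep[np.argsort(-scores[keep])][:k]       # highest score first
    return boxes[order], scores[order]

# Stand-in values; with the predictor you would pass result.boxes and result.scores.
boxes = np.array([[10, 20, 110, 220], [50, 60, 90, 100], [0, 0, 30, 30]])
scores = np.array([0.42, 0.91, 0.67])

kept_boxes, kept_scores = top_detections(boxes, scores, min_score=0.5)
for (x1, y1, x2, y2), s in zip(kept_boxes, kept_scores):
    print(f"[{s:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

This filtering runs on top of the `score_threshold` passed to `Sam3Predictor`, so `min_score` only has an effect when it is higher than that threshold.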

Instance Segmentation

result = predictor.predict(image, text_prompt="a person")

# result.boxes   -> (N, 4) xyxy bounding boxes
# result.masks   -> (N, H, W) binary segmentation masks
# result.scores  -> (N,) confidence scores

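import numpy as np

# Blend every instance mask into a copy of the image as a 50% red overlay.
# Converting to RGB keeps the blend valid for grayscale or RGBA inputs.
overlay = np.array(image.convert("RGB")).copy()
W, H = image.size
for i in range(len(result.scores)):
    mask = result.masks[i]
    # Resize the mask if it does not already match the image resolution.
    if mask.shape != (H, W):
        mask = np.array(Image.fromarray(mask.astype(np.float32)).resize((W, H)))
    binary = mask > 0
    overlay[binary] = (overlay[binary] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)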
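
Per-instance masks are sometimes easier to inspect as a single label map. The following is a minimal sketch that uses a small synthetic `(N, H, W)` array in place of `result.masks`; the helper `masks_to_label_map` is hypothetical, and resolving overlaps by score is one possible policy rather than something the predictor does.

```python
import numpy as np

def masks_to_label_map(masks, scores):
    """Merge (N, H, W) masks into one (H, W) map: 0 = background, i + 1 = instance i.

    Where masks overlap, the instance with the higher score wins.
    """
    masks = np.asarray(masks) > 0
    label_map = np.zeros(masks.shape[1:], dtype=np.int32)
    # Paint the lowest-scoring instances first so higher scores overwrite them.
    for i in np.argsort(scores):
        label_map[masks[i]] = i + 1
    return label_map

# Two overlapping 4x4 stand-in masks.
masks = np.zeros((2, 4, 4), dtype=np.uint8)
masks[0, :2, :] = 1      # instance 0 covers the top two rows
masks[1, 1:3, 1:3] = 1   # instance 1 covers a 2x2 block in the middle
scores = np.array([0.6, 0.9])

label_map = masks_to_label_map(masks, scores)
areas = np.bincount(label_map.ravel(), minlength=3)  # pixel count per label
print(label_map)
print(areas)
```

The `areas` array gives a quick per-instance pixel count, with index 0 holding the background.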

Box-Guided Detection

import numpy as np
boxes = np.array([[100, 50, 400, 350]])  # xyxy pixel coords
result = predictor.predict(image, text_prompt="a cat", boxes=boxes)
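
One way to check how closely the returned detections follow a prompt box is intersection over union (IoU). The sketch below is plain NumPy with made-up boxes standing in for `result.boxes`; `iou_with_box` is a hypothetical helper, not an mlx-vlm function.

```python
import numpy as np

def iou_with_box(pred_boxes, prompt_box):
    """IoU between each (N, 4) xyxy box and a single xyxy prompt box."""
    pred_boxes = np.asarray(pred_boxes, dtype=np.float32).reshape(-1, 4)
    prompt_box = np.asarray(prompt_box, dtype=np.float32)
    # Intersection rectangle, clipped to zero size when the boxes do not overlap.
    ix1 = np.maximum(pred_boxes[:, 0], prompt_box[0])
    iy1 = np.maximum(pred_boxes[:, 1], prompt_box[1])
    ix2 = np.minimum(pred_boxes[:, 2], prompt_box[2])
    iy2 = np.minimum(pred_boxes[:, 3], prompt_box[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_prompt = (prompt_box[2] - prompt_box[0]) * (prompt_box[3] - prompt_box[1])
    return inter / (area_pred + area_prompt - inter)

prompt_box = np.array([100, 50, 400, 350])  # same xyxy pixel convention as above
pred_boxes = np.array([
    [100, 50, 400, 350],   # identical to the prompt box
    [250, 50, 550, 350],   # shifted right by half a box width
    [500, 400, 600, 500],  # no overlap
])
ious = iou_with_box(pred_boxes, prompt_box)
print(ious)
```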

Semantic Segmentation

import mlx.core as mx

inputs = processor.preprocess_image(image)
text_inputs = processor.preprocess_text("a cat")
outputs = model.detect(
    mx.array(inputs["pixel_values"]),
    mx.array(text_inputs["input_ids"]),
    mx.array(text_inputs["attention_mask"]),
)
mx.eval(outputs)

pred_masks = outputs["pred_masks"]      # (B, 200, 288, 288) instance masks
semantic_seg = outputs["semantic_seg"]  # (B, 1, 288, 288) semantic segmentation
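
The raw `semantic_seg` tensor usually needs to be thresholded and resized before it can be drawn over the image. The sketch below assumes the values are logits (so a sigmoid maps them to probabilities); that is an assumption, not something this card states, and if the values are already probabilities the sigmoid should be skipped. It uses a synthetic NumPy array as a stand-in, and `semantic_to_mask` is a hypothetical helper.

```python
import numpy as np

def semantic_to_mask(semantic_seg, out_hw, threshold=0.5):
    """Convert a (1, h, w) score map into a boolean (H, W) mask.

    ASSUMPTION: the values are logits, so a sigmoid maps them to probabilities.
    """
    logits = np.asarray(semantic_seg, dtype=np.float32)[0]
    probs = 1.0 / (1.0 + np.exp(-logits))
    h, w = probs.shape
    H, W = out_hw
    # Nearest-neighbour resize: map each output pixel to a source pixel.
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return probs[rows][:, cols] > threshold

# Stand-in for one batch item of outputs["semantic_seg"]:
# a 288x288 map with a positive square in the centre.
semantic_seg = np.full((1, 288, 288), -4.0, dtype=np.float32)
semantic_seg[0, 72:216, 72:216] = 4.0

mask = semantic_to_mask(semantic_seg, out_hw=(576, 576))
print(mask.shape, mask.mean())
```

With real outputs, the MLX array would first need converting to NumPy, for example `np.array(semantic_seg[0].astype(mx.float32))`; the cast is a precaution in case the output is bfloat16, which is also an assumption.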

Video Tracking (CLI)

python -m mlx_vlm.models.sam3.track_video --video input.mp4 --prompt "a car" --model mlx-community/sam3-bf16

| Flag | Default | Description |
|---|---|---|
| `--video` | (required) | Input video path |
| `--prompt` | (required) | Text prompt |
| `--output` | `<input>_tracked.mp4` | Output video path |
| `--model` | `facebook/sam3` | Model path or HF repo |
| `--threshold` | `0.15` | Score threshold |
| `--every` | `2` | Detect every N frames |
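
To drive the tracker from a script, the command line can be assembled in Python. This is a sketch that uses only the flags listed in the table; `build_track_command` is a hypothetical helper, and the threshold and interval values are arbitrary examples.

```python
import shlex
import sys

def build_track_command(video, prompt, output=None, model=None, threshold=None, every=None):
    """Assemble the track_video command line from the documented flags."""
    cmd = [sys.executable, "-m", "mlx_vlm.models.sam3.track_video",
           "--video", video, "--prompt", prompt]
    optional = {"--output": output, "--model": model,
                "--threshold": threshold, "--every": every}
    for flag, value in optional.items():
        if value is not None:  # omitted flags fall back to the CLI defaults
            cmd += [flag, str(value)]
    return cmd

cmd = build_track_command("input.mp4", "a car", model="mlx-community/sam3-bf16",
                          threshold=0.25, every=4)
print(shlex.join(cmd[1:]))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces.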

Original Model

facebook/sam3 · Paper · Code

License

The original SAM3 model weights are released by Meta under the SAM License, a custom permissive license that grants a non-exclusive, worldwide, royalty-free license to use, reproduce, distribute, and modify the SAM Materials. Key points:

Commercial and research use is permitted
Derivative works must include a copy of the SAM License and attribution to Meta
Provided "AS IS" without warranty
Subject to applicable trade controls

This MLX conversion is a derivative work. By using it, you agree to the terms of Meta's SAM License. See the full license text for details.
