sam3-bf16

facebook/sam3 converted to MLX (bfloat16, 1.72 GB).

Open-vocabulary object detection, instance segmentation, and video tracking on Apple Silicon. The model has ~860M parameters.

Quick Start

pip install mlx-vlm

from PIL import Image
from mlx_vlm.utils import load_model, get_model_path
from mlx_vlm.models.sam3.generate import Sam3Predictor
from mlx_vlm.models.sam3.processing_sam3 import Sam3Processor

model_path = get_model_path("mlx-community/sam3-bf16")
model = load_model(model_path)
processor = Sam3Processor.from_pretrained(str(model_path))
predictor = Sam3Predictor(model, processor, score_threshold=0.3)

Object Detection

image = Image.open("photo.jpg")
result = predictor.predict(image, text_prompt="a dog")

for i in range(len(result.scores)):
    x1, y1, x2, y2 = result.boxes[i]
    print(f"[{result.scores[i]:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
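The loop above prints every detection that passed the predictor's score threshold. A minimal sketch of stricter post-filtering follows, assuming boxes and scores can be iterated as plain sequences of numbers; `top_detections` is a hypothetical helper (not part of mlx-vlm), and the values below are synthetic stand-ins for result.boxes and result.scores.

```python
def top_detections(boxes, scores, min_score=0.5, k=None):
    """Return (score, box) pairs with score >= min_score, highest score first.

    Hypothetical helper for illustration; not part of mlx-vlm.
    """
    pairs = [
        (float(s), tuple(float(v) for v in b))
        for s, b in zip(scores, boxes)
        if float(s) >= min_score
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs if k is None else pairs[:k]


# Synthetic stand-ins for result.boxes / result.scores
boxes = [[10, 20, 110, 220], [5, 5, 50, 50], [200, 40, 380, 300]]
scores = [0.42, 0.91, 0.67]

for score, (x1, y1, x2, y2) in top_detections(boxes, scores, min_score=0.5):
    print(f"[{score:.2f}] box=({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
```

Filtering after prediction lets you keep a low score_threshold on the predictor and tighten it per use case without rerunning the model.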

Instance Segmentation

result = predictor.predict(image, text_prompt="a person")

# result.boxes   -> (N, 4) xyxy bounding boxes
# result.masks   -> (N, H, W) binary segmentation masks
# result.scores  -> (N,) confidence scores

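import numpy as np

# Blend every mask into a copy of the image as a 50% red overlay.
# convert("RGB") guarantees 3 channels so the red color broadcasts correctly.
overlay = np.array(image.convert("RGB")).copy()
W, H = image.size  # PIL reports (width, height)
for i in range(len(result.scores)):
    mask = result.masks[i]
    if mask.shape != (H, W):
        # Resize the mask to the image size; PIL expects (width, height)
        mask = np.array(Image.fromarray(mask.astype(np.float32)).resize((W, H)))
    binary = mask > 0
    overlay[binary] = (overlay[binary] * 0.5 + np.array([255, 0, 0]) * 0.5).astype(np.uint8)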
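Beyond drawing an overlay, a mask can be summarized numerically. The sketch below assumes a single mask is available as a 2-D NumPy array in which values greater than 0 mark the object, as in the overlay code above; `mask_stats` is a hypothetical helper, and the demo mask is synthetic.

```python
import numpy as np


def mask_stats(mask):
    """Return (pixel_area, (x1, y1, x2, y2)) for a 2-D mask, or (0, None) if empty.

    Hypothetical helper for illustration; not part of mlx-vlm.
    """
    binary = np.asarray(mask) > 0
    ys, xs = np.nonzero(binary)
    if ys.size == 0:
        return 0, None
    # The box includes the last foreground pixel on each axis.
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return int(binary.sum()), bbox


# Synthetic 6x8 mask with a 3x4 foreground rectangle
demo = np.zeros((6, 8), dtype=np.float32)
demo[2:5, 3:7] = 1.0
area, bbox = mask_stats(demo)
print(area, bbox)
```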

Box-Guided Detection

import numpy as np
boxes = np.array([[100, 50, 400, 350]])  # xyxy pixel coords
result = predictor.predict(image, text_prompt="a cat", boxes=boxes)
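The boxes argument above takes xyxy pixel coordinates, while many annotation tools export boxes as (x, y, width, height). A minimal conversion sketch follows, assuming top-left-origin pixel coordinates; `xywh_to_xyxy` is a hypothetical helper, not an mlx-vlm function.

```python
def xywh_to_xyxy(box, image_width, image_height):
    """Convert (x, y, w, h) to [x1, y1, x2, y2], clamped to the image bounds.

    Hypothetical helper for illustration; not part of mlx-vlm.
    """
    x, y, w, h = box
    x1 = min(max(x, 0), image_width)
    y1 = min(max(y, 0), image_height)
    x2 = min(max(x + w, 0), image_width)
    y2 = min(max(y + h, 0), image_height)
    return [x1, y1, x2, y2]


print(xywh_to_xyxy([100, 50, 300, 300], 640, 480))  # [100, 50, 400, 350]
```

The converted list can be wrapped as np.array([converted]) to obtain the (N, 4) shape used in the example above.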

Semantic Segmentation

import mlx.core as mx

inputs = processor.preprocess_image(image)
text_inputs = processor.preprocess_text("a cat")
outputs = model.detect(
    mx.array(inputs["pixel_values"]),
    mx.array(text_inputs["input_ids"]),
    mx.array(text_inputs["attention_mask"]),
)
mx.eval(outputs)

pred_masks = outputs["pred_masks"]      # (B, 200, 288, 288) instance masks
semantic_seg = outputs["semantic_seg"]  # (B, 1, 288, 288) semantic segmentation
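The maps above are 288x288, which is usually smaller than the input image. The sketch below shows one way to bring a single map back to image resolution with nearest-neighbor sampling and then threshold it. It assumes the map has already been converted to a 2-D NumPy array (for example from semantic_seg[0, 0]); `resize_nearest` is a hypothetical helper, and the threshold is a placeholder because the card does not state whether the values are logits or probabilities.

```python
import numpy as np


def resize_nearest(arr, out_h, out_w):
    """Nearest-neighbor resize of a 2-D array to (out_h, out_w).

    Hypothetical helper for illustration; not part of mlx-vlm.
    """
    arr = np.asarray(arr)
    in_h, in_w = arr.shape
    # Map each output row/column index back to a source index.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return arr[rows[:, None], cols[None, :]]


# Synthetic 2x2 "low-resolution map" upsampled to 4x6
low_res = np.array([[0.0, 1.0], [2.0, 3.0]], dtype=np.float32)
full = resize_nearest(low_res, 4, 6)
binary = full > 1.5  # placeholder threshold; choose one that fits the real value range
print(full.shape, int(binary.sum()))
```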

Video Tracking (CLI)

python -m mlx_vlm.models.sam3.track_video --video input.mp4 --prompt "a car" --model mlx-community/sam3-bf16

Flag         Default               Description
--video      (required)            Input video path
--prompt     (required)            Text prompt
--output     <input>_tracked.mp4   Output video path
--model      facebook/sam3         Model path or HF repo
--threshold  0.15                  Score threshold
--every      2                     Detect every N frames
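To track several videos in one go, the command can be assembled from Python. This is a sketch that uses only the flags listed above; `build_track_command` is a hypothetical helper, the file names are placeholders, and the actual subprocess call is left commented out so the snippet runs without mlx-vlm installed.

```python
import sys


def build_track_command(video, prompt, model="mlx-community/sam3-bf16",
                        threshold=None, every=None, output=None):
    """Build the argv list for the track_video CLI using the documented flags.

    Hypothetical helper for illustration; not part of mlx-vlm.
    """
    cmd = [sys.executable, "-m", "mlx_vlm.models.sam3.track_video",
           "--video", str(video), "--prompt", prompt, "--model", model]
    # Optional flags are added only when set, so the CLI defaults apply otherwise.
    if threshold is not None:
        cmd += ["--threshold", str(threshold)]
    if every is not None:
        cmd += ["--every", str(every)]
    if output is not None:
        cmd += ["--output", str(output)]
    return cmd


for video in ["a.mp4", "b.mp4"]:  # placeholder file names
    cmd = build_track_command(video, "a car", every=4)
    print(" ".join(cmd))
    # To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces.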

Original Model

facebook/sam3 · Paper · Code

License

The original SAM3 model weights are released by Meta under the SAM License, a custom permissive license that grants a non-exclusive, worldwide, royalty-free license to use, reproduce, distribute, and modify the SAM Materials. Key points:

  • Commercial and research use is permitted
  • Derivative works must include a copy of the SAM License and attribution to Meta
  • Provided "AS IS" without warranty
  • Subject to applicable trade controls

This MLX conversion is a derivative work. By using it, you agree to the terms of Meta's SAM License. See the full license text for details.
