Gemma3-1B-IT (ExecuTorch, XNNPACK, INT8-INT4)

This repository contains an ExecuTorch .pte export of google/gemma-3-1b-it for CPU inference on XNNPACK.

The export quantizes linear layers (INT8 dynamic activations, INT4 weights) and embedding layers (INT8 weights).

Contents

  • model.pte: ExecuTorch program
  • tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json, chat_template.jinja: tokenizer/chat artifacts
  • config.json, generation_config.json: upstream model metadata

Export Configuration

Command used:

optimum-cli export executorch \
  --model "google/gemma-3-1b-it" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qembedding "8w" \
  --max_seq_len 1024 \
  --dtype "float32" \
  --device "cpu" \
  --output_dir "<output_dir>"

Quantization settings:

  • --qlinear 8da4w: INT8 dynamic activations + INT4 weights for linear layers
  • --qembedding 8w: INT8 weights for embeddings
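The arithmetic behind the 8da4w scheme can be sketched in plain Python: weights are quantized ahead of time to INT4 with a per-group scale, activations are quantized at runtime ("dynamic") to INT8, the dot product runs on integers, and one multiply by both scales restores the float result. This is an illustrative sketch only; the actual export uses torchao quantizers and XNNPACK kernels.

```python
# Illustrative sketch of "8da4w": INT4 weights + INT8 dynamic activations.
# Plain-Python arithmetic for intuition; not the real ExecuTorch kernels.

def quantize_symmetric(values, num_bits):
    """Symmetric quantization: map floats to signed ints via a single scale."""
    qmax = 2 ** (num_bits - 1) - 1          # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

# Weights: quantized ahead of time to INT4 (here, one group = one row).
weight_row = [0.12, -0.53, 0.31, 0.07]
w_q, w_scale = quantize_symmetric(weight_row, num_bits=4)

# Activations: quantized per input at runtime to INT8.
activation = [1.5, -0.25, 0.8, 2.0]
a_q, a_scale = quantize_symmetric(activation, num_bits=8)

# Integer dot product, then rescale back to float.
acc = sum(w * a for w, a in zip(w_q, a_q))
approx = acc * w_scale * a_scale
exact = sum(w * a for w, a in zip(weight_row, activation))
print(approx, exact)  # close, but not identical: INT4 weights lose precision
```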

Tooling Versions

  • ExecuTorch: 1.2.0a0+c7c7c0a (source commit: c7c7c0a442)
  • Optimum ExecuTorch: 0.2.0.dev0 (commit: 5bf1aeb587e9b1f3572b0bd60265c5dafd007b73)

Run with ExecuTorch Llama Runner

Build (from ExecuTorch repo root):

cmake --workflow --preset llm-release
cd examples/models/llama
cmake --workflow --preset llama-release
cd ../../..

Run:

cmake-out/examples/models/llama/llama_main \
  --model_path "model.pte" \
  --tokenizer_path "tokenizer.json" \
  --prompt "Once upon a time" \
  --seq_len 128 \
  --num_bos 1

Python Validation

from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model = ExecuTorchModelForCausalLM.from_pretrained(".")
tokenizer = AutoTokenizer.from_pretrained(".")

text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(text)
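Since gemma-3-1b-it is instruction-tuned, chat-style prompts should follow the Gemma turn format, which is normally produced by `tokenizer.apply_chat_template` using the bundled chat_template.jinja. A minimal manual sketch, assuming the standard Gemma turn markers:

```python
# Hand-built Gemma chat prompt (sketch). In practice, prefer
# tokenizer.apply_chat_template, which uses chat_template.jinja.

def gemma_chat_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma's user/model turn markers."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = gemma_chat_prompt("Tell me a short story.")
print(prompt)
```

The resulting string can be passed as the `prompt` argument to `model.text_generation` above.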

License

This export inherits the license and usage terms of the upstream model, google/gemma-3-1b-it, which is distributed under the Gemma Terms of Use.
