# Gemma3-1B-IT (ExecuTorch, XNNPACK, INT8/INT4)

This repository contains an ExecuTorch `.pte` export of [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it) for CPU inference via the XNNPACK backend. The export quantizes the linear and embedding layers.
## Contents

- `model.pte`: ExecuTorch program
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json`, `chat_template.jinja`: tokenizer/chat artifacts
- `config.json`, `generation_config.json`: upstream model metadata
## Export Configuration

Command used:

```shell
optimum-cli export executorch \
  --model "google/gemma-3-1b-it" \
  --task "text-generation" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --qlinear "8da4w" \
  --qembedding "8w" \
  --max_seq_len 1024 \
  --dtype "float32" \
  --device "cpu" \
  --output_dir "<output_dir>"
```
Quantization settings:

- `--qlinear 8da4w`: INT8 dynamic activations + INT4 weights for linear layers
- `--qembedding 8w`: INT8 weights for embeddings
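To make the `8da4w` scheme concrete, here is a minimal NumPy sketch of the idea: weights are quantized to symmetric INT4 ahead of time, activations are quantized to INT8 dynamically at runtime, and the integer matmul result is rescaled back to float. This is only an illustration of the arithmetic; the actual export uses torchao/ExecuTorch kernels, and details such as group sizes differ.

```python
import numpy as np

def quant_symmetric(x, n_bits):
    """Symmetric per-row quantization: the max |value| in each row maps to qmax."""
    qmax = 2 ** (n_bits - 1) - 1          # 7 for int4, 127 for int8
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16)).astype(np.float32)   # linear layer weight
x = rng.standard_normal((4, 16)).astype(np.float32)   # activations

Wq, w_scale = quant_symmetric(W, n_bits=4)   # "4w": int4 weights, quantized once
xq, x_scale = quant_symmetric(x, n_bits=8)   # "8da": int8 activations, per forward pass

# Integer matmul, then rescale the accumulator back to float
y_int = xq.astype(np.int32) @ Wq.astype(np.int32).T
y = y_int.astype(np.float32) * x_scale * w_scale.T

y_ref = x @ W.T                               # float reference
err = np.abs(y - y_ref).max()
```

The INT4 weights dominate the error here; dynamic INT8 activation quantization keeps the activation path close to full precision while still allowing an all-integer matmul.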
## Tooling Versions

- ExecuTorch: `1.2.0a0+c7c7c0a` (source commit `c7c7c0a442`)
- Optimum ExecuTorch: `0.2.0.dev0` (commit `5bf1aeb587e9b1f3572b0bd60265c5dafd007b73`)
## Run with ExecuTorch Llama Runner

Build (from the ExecuTorch repo root):

```shell
cmake --workflow --preset llm-release
cd examples/models/llama
cmake --workflow --preset llama-release
cd ../../
```
Run:

```shell
cmake-out/examples/models/llama/llama_main \
  --model_path "model.pte" \
  --tokenizer_path "tokenizer.json" \
  --prompt "Once upon a time" \
  --seq_len 128 \
  --num_bos 1
```
## Python Validation

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model = ExecuTorchModelForCausalLM.from_pretrained(".")
tokenizer = AutoTokenizer.from_pretrained(".")

text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Once upon a time",
    max_seq_len=128,
)
print(text)
```
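For instruction-tuned use, prompts should follow the chat format; the bundled `chat_template.jinja` (via `tokenizer.apply_chat_template`) is the authoritative source. As a rough illustration only, Gemma-style chat turns are wrapped in `<start_of_turn>`/`<end_of_turn>` markers, which the hypothetical helper below reproduces by hand; prefer the template for real use.

```python
def build_gemma_prompt(messages):
    """Illustrative helper: format chat messages with Gemma-style turn markers.

    The authoritative format is this repo's chat_template.jinja, applied
    through tokenizer.apply_chat_template; this sketch mirrors the common
    Gemma layout and may differ in details (e.g. BOS handling).
    """
    parts = []
    for m in messages:
        # Gemma uses the role name "model" for assistant turns
        role = "model" if m["role"] == "assistant" else m["role"]
        parts.append(f"<start_of_turn>{role}\n{m['content']}<end_of_turn>\n")
    parts.append("<start_of_turn>model\n")  # cue the model to respond
    return "".join(parts)

prompt = build_gemma_prompt([{"role": "user", "content": "Once upon a time"}])
```

Note that the runner example above adds the BOS token itself (`--num_bos 1`), so the prompt text should not duplicate it.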
## License

The base model's license and usage terms apply; see the Gemma Terms of Use referenced by the upstream model card: [google/gemma-3-1b-it](https://huggingface.co/google/gemma-3-1b-it).