guava-05-22
Full fine-tune of Qwen/Qwen3.5-4B for closed-loop tool-calling robot
manipulation. Part of the guava
project.
This checkpoint: model weights at step 228 (epoch 3.0), final training loss = 0.2162.
⚠ Loading: use the multimodal auto-class
Qwen/Qwen3.5-4B is a vision-language model. Load with
AutoModelForImageTextToText (or Qwen3_5ForConditionalGeneration
directly), NOT AutoModelForCausalLM — the latter returns the
text-only variant without language_model and will fail at generation.
Training hyperparameters
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Dtype | bfloat16 |
| Tuner | Full fine-tune (LM trained, ViT + aligner frozen) |
| Epochs | 3.0 |
| LR / schedule | 1e-05 / cosine, 0.05 warmup ratio |
| Per-device batch / grad accum | 2 / 2 |
| Max length (tokens) | 10240 |
| Final loss @ step 228 | 0.2162 |
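For intuition about the schedule in the table, the sketch below computes the learning rate at a given optimizer step. It rests on assumptions the card does not state: warmup is linear, the cosine decays to zero, training ends at step 228, and the warmup length is the 0.05 ratio rounded up. The function name `lr_at_step` is hypothetical and is not taken from the training script.

```python
import math


def lr_at_step(step: int, total_steps: int = 228,
               base_lr: float = 1e-5, warmup_ratio: float = 0.05) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumed, not confirmed by the card: linear warmup, decay to zero,
    and warmup_steps = ceil(total_steps * warmup_ratio).
    """
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the peak of 1e-05 is reached after 12 steps and the rate falls to zero at step 228.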
System prompt
This model was trained against a specific prompt; see
system_prompt.txt. Use that exact content as the system message.
Any other prompt moves the input away from the training distribution
and is likely to degrade output quality.
Usage (transformers, no PEFT)
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the full fine-tuned checkpoint in bfloat16 on the GPU.
model = AutoModelForImageTextToText.from_pretrained(
    "AIcell/guava-05-22", torch_dtype=torch.bfloat16, device_map="cuda",
)
proc = AutoProcessor.from_pretrained("AIcell/guava-05-22")

# The system message must be the exact training-time prompt.
system_prompt = open("system_prompt.txt").read().strip()
scene_img = Image.open("scene.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
    {"role": "user", "content": [
        {"type": "image", "image": scene_img},
        {"type": "text", "text":
            "Task: <your task description>.\n\n"
            "Gripper is at [...] rotation [...] width X%."},
    ]},
]

# Tokenize text and image together through the chat template.
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to("cuda")

# Greedy decoding; print only the newly generated tokens.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(proc.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
Per-turn assistant output: a <think>…</think> block followed by
exactly one <tool_call>{"name": "<tool>", "arguments": {…}}</tool_call>
(or Task complete. / Task failed. to terminate).
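A controller consuming this output needs to split each turn into its reasoning, its tool call, or its terminal status. The following is a minimal sketch of such a parser, not the project's own harness. `ParsedTurn` and `parse_turn` are hypothetical names, and the code assumes the `<think>` and `<tool_call>` tags survive decoding as literal text. It also tolerates a missing opening `<think>` tag, in case the chat template emits it as part of the prompt.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Non-greedy so that two adjacent calls are seen as two matches.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
TERMINATORS = ("Task complete.", "Task failed.")


@dataclass
class ParsedTurn:
    thought: str
    tool_name: Optional[str] = None
    arguments: Optional[dict] = None
    terminal: Optional[str] = None  # one of TERMINATORS, or None


def parse_turn(text: str) -> ParsedTurn:
    """Split one assistant turn into thought plus tool call or terminator."""
    head, sep, tail = text.partition("</think>")
    if sep:
        # The opening tag may be absent if the template already emitted it.
        thought = head.replace("<think>", "", 1).strip()
        rest = tail
    else:
        thought, rest = "", text

    calls = TOOL_CALL_RE.findall(rest)
    if len(calls) > 1:
        raise ValueError("expected exactly one <tool_call> per turn")
    if calls:
        payload = json.loads(calls[0])
        return ParsedTurn(thought, payload["name"], payload.get("arguments", {}))

    stripped = rest.strip()
    for marker in TERMINATORS:
        if stripped.startswith(marker):
            return ParsedTurn(thought, terminal=marker)
    raise ValueError("turn contains neither a tool call nor a terminator")
```

Raising on malformed turns, rather than guessing, lets the control loop decide whether to retry the generation or abort the episode.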
vLLM serving (no LoRA flags needed)
vllm serve AIcell/guava-05-22 \
--port 8000 --max-model-len 24576 \
--reasoning-parser qwen3 --tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image": 20}'
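Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below builds such a request with the standard library only. The helper names `build_chat_request` and `send`, the localhost URL, and the PNG encoding of the scene image are assumptions for illustration. Setting `temperature` to 0 mirrors the greedy decoding used in the transformers example.

```python
import base64
import json
import urllib.request


def build_chat_request(system_prompt: str, image_bytes: bytes,
                       user_text: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat payload with one inline PNG image."""
    # Images are passed inline as a base64 data URL (PNG assumed here).
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "AIcell/guava-05-22",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": user_text},
            ]},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def send(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to the server started with `vllm serve` (port 8000)."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

A typical call would read `system_prompt.txt` and `scene.png` from disk, pass their contents to `build_chat_request`, and hand the result to `send`. How the tool call is surfaced in the response depends on the parser flags and on whether the request supplies tool definitions, so inspect the returned message before relying on a particular field.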
Source
Training script, eval harness, and upload tooling: https://github.com/hdacnw/guava