Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box

Community Article Published April 1, 2026

Intel® Xe GPUs, including the newly launched Intel Arc Pro B70/B65, are designed to meet the needs of modern AI inference and provide an all-in-one inference platform. With enhanced memory capacity, they aim to simplify adoption and improve ease of use.

Intel’s upstream-first strategy for open-source AI frameworks such as PyTorch, Hugging Face transformers, vLLM and SGLang builds a solid foundation for a day-0 experience on Intel® Xe GPUs. For years, Intel has been working closely with the open-source community on kernel optimization and feature enablement. Here are the key features of Gemma 4 and how they are supported on Intel hardware:

  • Attention: Gemma 4 uses two attention variants in different layers: sliding attention and full attention. On Intel® Xe GPUs, vLLM's Triton attention kernels work out of the box, and flash attention kernels optimized with Intel SYCL*TLA provide an additional performance boost. For Hugging Face transformers, both variants are supported through PyTorch kernels out of the box.

  • Gemma4MoE: The MoE path leverages a highly optimized FusedMoE backend. Intel upstreamed optimized FusedMoE kernels for Intel® Xe GPUs to vLLM and Hugging Face transformers, so MoE layers work out of the box.

  • Vision Tower and Audio Tower: These are transformer models that currently run on Hugging Face transformers. Thanks to solid Hugging Face transformers support, both towers are enabled on Intel® Xe GPUs.
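
To make the two attention variants concrete, here is a minimal pure-Python sketch of the masks they imply. The helper names and the window size of 3 are illustrative assumptions, not Gemma 4's actual configuration or any library's API.

```python
def full_causal_mask(seq_len):
    """Row i may attend to every position j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Row i may attend only to the last `window` positions, including i itself."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def render(mask):
    """Draw the mask as text: 'x' for an allowed position, '.' for a masked one."""
    return "\n".join("".join("x" if keep else "." for keep in row) for row in mask)


if __name__ == "__main__":
    # Window size 3 is an arbitrary value chosen for illustration.
    print(render(full_causal_mask(5)))
    print()
    print(render(sliding_window_mask(5, window=3)))
```

In the sliding variant each row keeps at most `window` entries, which is why those layers need less attention compute and cache per token than the full-attention layers.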

Getting started with vLLM

1. Environment Setup

Build a Docker image from the vLLM main branch (pinned here to commit 66e86f1)

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ git checkout 66e86f1
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .

Launch vLLM container

$ docker run -it \
             --rm \
             --network=host \
             --device /dev/dri:/dev/dri \
             -v /dev/dri/by-path:/dev/dri/by-path \
             --ipc=host \
             --privileged \
             --entrypoint bash \
             vllm-xpu-env

Install the latest transformers main branch in the container

$ uv pip uninstall transformers
$ uv pip install git+https://github.com/huggingface/transformers

2. Run

The following commands are for demonstration purposes. You can try different model parallelism configurations to suit your requirements and Intel GPU type. We validated the configurations below on Intel® Arc™ Pro B60:

Launch OpenAI-Compatible vLLM Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This functionality lets you serve models and interact with them using an HTTP client.

You can use the vllm serve command to launch the server with tensor parallelism.

$ vllm serve $<MODEL_PATH> --tensor-parallel-size $<TP_SIZE> --enforce-eager --attention-backend TRITON_ATTN

Text Generation

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "How are you?"
                    }
                ]
            }
        ]
    }'

Image Captioning

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "$<IMAGE_ADDRESS>"
                        }
                    }
                ]
            }
        ]
    }'

Audio Captioning

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this audio in one sentence."
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "$<AUDIO_ADDRESS>"
                        }
                    }
                ]
            }
        ]
    }'
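
The three curl requests above share one structure, so they can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, the default base URL assumes the server runs on localhost:8000 as in the curl examples, and the response parsing assumes the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, media_type=None, media_url=None):
    """Build a chat-completions body shaped like the curl examples.

    media_type is "image_url" or "audio_url"; leave it as None for text only.
    """
    content = [{"type": "text", "text": prompt}]
    if media_type is not None:
        content.append({"type": media_type, media_type: {"url": media_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the server and return the first choice's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For example, `chat(build_chat_payload(model_path, "Describe this image in one sentence.", "image_url", image_address))` mirrors the image captioning request, where `model_path` and `image_address` stand for the same placeholders used in the curl commands.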

Getting started with Hugging Face Transformers

1. Environment Setup

Install latest transformers main branch

$ uv venv .my-env
$ source .my-env/bin/activate
 
$ git clone https://github.com/huggingface/transformers.git
$ cd transformers
$ uv pip install '.[torch]'
 
# install XPU PyTorch
$ uv pip install torch torchvision torchaudio torchao --index-url https://download.pytorch.org/whl/xpu --no-cache-dir
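
After installing the XPU build of PyTorch, it can help to confirm that the GPUs are visible before loading a model. The following is a minimal sketch, not part of the original guide; the helper name `xpu_device_count` is hypothetical, and it reports 0 when PyTorch is missing or no XPU device is available.

```python
def xpu_device_count():
    """Return the number of visible Intel XPU devices, or 0 if none can be used."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return 0
    if not hasattr(torch, "xpu") or not torch.xpu.is_available():
        # This PyTorch build has no XPU support, or no device was found.
        return 0
    return torch.xpu.device_count()


if __name__ == "__main__":
    print(f"Visible XPU devices: {xpu_device_count()}")
```

The count is also a natural upper bound for the tensor-parallel degree you pass on the command line later.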

2. Run

The following commands are for demonstration purposes. You can try different model parallelism configurations to suit your requirements and Intel GPU type. We validated the configurations below on Intel® Arc™ Pro B60:

We use the test.py Python script below to run text generation, image captioning and audio captioning tasks.

import os
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoModelForImageTextToText
import torch.distributed as dist
 
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True, help="Model path or Hugging Face model ID")
    parser.add_argument("--task", choices=["text", "image", "audio"], default="text")
    parser.add_argument("--dtype", default="bfloat16")
    parser.add_argument("--device", default="xpu")
    parser.add_argument("--max-new-tokens", type=int, default=1024)
    parser.add_argument("--tp", action="store_true", help="Enable tensor parallel loading with tp_plan=auto")
    parser.add_argument("--tp-size", type=int, default=None, help="Optional TP degree, defaults to WORLD_SIZE")
    return parser.parse_args()
 
def get_dtype(dtype_name):
    return getattr(torch, dtype_name.removeprefix("torch."))
 
def get_rank():
    return int(os.environ.get("RANK", "0"))
 
def run_text_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    load_kwargs = {
        "dtype": dtype,
    }
 
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
 
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {"role": "user", "content": "hi, how is the weather today?"},
    ]
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)
 
def run_text_image_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    from PIL import Image
    import requests
 
    load_kwargs = {
        "dtype": dtype,
    }
 
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
 
    model = AutoModelForImageTextToText.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        },
    ]
    url = "http://images.cocodataset.org/val2017/000000077595.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, images=image, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)
 
def run_text_audio_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, tp_size=None):
    from io import BytesIO
    from urllib.request import urlopen
 
    import librosa
 
    load_kwargs = {
        "dtype": dtype,
    }
 
    if use_tp:
        load_kwargs["tp_plan"] = "auto"
        if tp_size is not None:
            load_kwargs["tp_size"] = tp_size
 
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/31a30e5cd27b5f87f2f5a9c2a9fae33d1ae1b29d/mary_had_lamb.mp3"},
                {"type": "text", "text": "Describe this audio in one sentence."},
            ],
        },
    ]
 
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
 
    audio_url = messages[0]["content"][0]["audio_url"]
    audio, sampling_rate = librosa.load(BytesIO(urlopen(audio_url).read()),  sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audio=audio, sampling_rate=sampling_rate, return_tensors="pt")
    inputs = inputs.to(model.device)
 
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_ids = outputs[:, inputs.input_ids.shape[1] :]
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print(generated_text)
 
def main():
    args = parse_args()
    dtype = get_dtype(args.dtype)
 
    runners = {
        "text": run_text_generation,
        "image": run_text_image_generation,
        "audio": run_text_audio_generation,
    }
    runners[args.task](
        args.model_id,
        dtype,
        args.device,
        max_new_tokens=args.max_new_tokens,
        use_tp=args.tp,
        tp_size=args.tp_size,
    )
 
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()
 
if __name__ == "__main__":
    main()

Small models such as gemma-4-E2B-it and gemma-4-E4B-it fit on a single card, so you can run them with:

$ python test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio>

For large models such as gemma-4-31B-it and gemma-4-26B-A4B-it, you can use tensor parallelism across multiple cards by passing --tp and an appropriate --tp-size. For example, we use --tp-size 2 in a 2-card configuration:

$ torchrun --nproc-per-node 2 test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio> --tp --tp-size 2
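
torchrun starts one worker process per card and exports RANK, LOCAL_RANK and WORLD_SIZE to each of them, which is how test.py knows to print only on rank 0. The sketch below shows how those variables can be read; the helper names `torchrun_env` and `is_main_process` are hypothetical, and the defaults describe a plain single-process `python test.py` run.

```python
import os


def torchrun_env(environ=None):
    """Read the rank information that torchrun exports to each worker process."""
    environ = os.environ if environ is None else environ
    return {
        "rank": int(environ.get("RANK", "0")),
        "local_rank": int(environ.get("LOCAL_RANK", "0")),
        "world_size": int(environ.get("WORLD_SIZE", "1")),
    }


def is_main_process(environ=None):
    """True only for global rank 0, the one process that should print results."""
    return torchrun_env(environ)["rank"] == 0


if __name__ == "__main__":
    env = torchrun_env()
    if is_main_process():
        print(f"Running with {env['world_size']} process(es)")
```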

Give it a try!

Community

I'm trying this out on my dual B60 system, but following the guide and running the vllm serve command I get the following error:

root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_xpu.so: undefined symbol: _ZN3ccl2v128reducti

Using Claude I managed to resolve it by adding an additional export to my path:

export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')

But then it errors out with a level_zero backend failure.

/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/opt/venv/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 15, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 19, in <module>
    from vllm.config.model import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/model.py", line 30, in <module>
    from vllm.transformers_utils.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 19, in <module>
    from transformers.models.auto.image_processing_auto import get_image_processor_config
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 24, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 34, in <module>
    from .processing_utils import ImagesKwargs, Unpack
  File "/opt/venv/lib/python3.12/site-packages/transformers/processing_utils.py", line 79, in <module>
    from .modeling_utils import PreTrainedAudioTokenizerBase
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 73, in <module>
    from .integrations.sdpa_attention import sdpa_attention_forward
  File "/opt/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 12, in <module>
    _is_torch_xpu_available = is_torch_xpu_available()
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 313, in is_torch_xpu_available
    return hasattr(torch, "xpu") and torch.xpu.is_available()
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 74, in is_available
    return device_count() > 0
           ^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 68, in device_count
    return torch._C._xpu_getDeviceCount()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: level_zero backend failed with error: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Tried debugging further with help of Claude, but didn't seem to make it much further. Any chance the guide could be revisited? Seems like there is something wrong with the way the docker container is currently built.

Article author
edited 1 day ago

Thanks for trying it out!
For the first issue:
Great to know that you figured it out. This is a temporary issue caused by the recent update to PyTorch 2.11, and we will resolve it soon. As a temporary workaround, you can also use any commit before #34644 - 2111997f96f33b118ec0c562cc2df5681862cff3.

For the second issue:
I noticed that you're using 2*B60, so the total device memory is <48GB, while google/gemma-4-26B-A4B-it has 51.6GB of weights in total. Please try TP=4.
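
The arithmetic behind that suggestion can be sketched as follows. This is a rough illustration only: the 24 GB per-card figure is an assumption inferred from the 48 GB total quoted for two cards, the helper name `smallest_tp_that_fits` is hypothetical, and the check deliberately ignores the KV cache and activation memory, so real headroom needs to be larger.

```python
def smallest_tp_that_fits(weights_gb, per_card_gb, candidates=(1, 2, 4, 8)):
    """Return the smallest TP degree whose combined memory exceeds the weights.

    Only the weights are counted here; KV cache and activations are ignored.
    Returns None if no candidate provides enough memory.
    """
    for tp in candidates:
        if tp * per_card_gb > weights_gb:
            return tp
    return None


if __name__ == "__main__":
    # 51.6 GB of weights comes from the reply above; 24 GB per card is assumed.
    print(smallest_tp_that_fits(51.6, 24))
```

With two cards the combined 48 GB falls short of the 51.6 GB of weights, while four cards give 96 GB.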
