Run Gemma 4 on Intel® Arc™ GPUs Out-Of-the-Box

Community Article Published April 1, 2026

Intel® Xe GPUs, including the newly launched Intel® Arc™ Pro B70/B65, are designed for modern AI inference and aim to provide an all-in-one inference platform. Their larger memory capacity is intended to simplify adoption and make them easier to use.

Intel’s upstream-first strategy for open-source AI frameworks such as PyTorch, Hugging Face Transformers, vLLM, and SGLang lays a solid foundation for a day-0 experience on Intel® Xe GPUs. For years, Intel has worked closely with the open-source community on kernel optimization and feature enablement. Here are the key features of Gemma 4 and how they are supported on Intel hardware:

Attention: Gemma 4 uses two variants of attention in different layers: sliding attention and full attention. On Intel® Xe GPUs, vLLM's Triton attention kernels work out of the box, and flash attention kernels optimized with Intel SYCL*TLA provide an additional performance boost. For Hugging Face Transformers, both variants are supported out of the box through PyTorch kernels.
Gemma4MoE: The MoE path uses a highly optimized FusedMoE backend. Intel has upstreamed optimized FusedMoE kernels for Intel® Xe GPUs to vLLM and Hugging Face Transformers, so MoE layers work out of the box.
Vision Tower and Audio Tower: These are transformer models that currently run on Hugging Face Transformers. Thanks to solid Hugging Face Transformers support, both towers are enabled on Intel® Xe GPUs.
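
To make the difference between the two attention variants concrete, here is a minimal pure-Python sketch of the causal masks they imply. It is an illustration only: the function names, the window size of 3, and the convention that the window includes the current position are assumptions, not Gemma 4's actual configuration or the kernels mentioned above.

```python
def full_causal_mask(seq_len):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Causal mask limited to the `window` most recent positions.

    Assumption for illustration: the window counts the current position itself.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def render(mask):
    """Draw a mask as text: 'x' = may attend, '.' = masked out."""
    return "\n".join(
        "".join("x" if allowed else "." for allowed in row) for row in mask
    )


if __name__ == "__main__":
    print(render(full_causal_mask(5)))
    print()
    print(render(sliding_window_mask(5, window=3)))
```

With full attention, each row grows by one visible position per token. With the sliding variant, each row sees at most `window` positions, which is why such layers need less attention state per token for long sequences.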

Getting started with vLLM
Getting started with Hugging Face Transformers

Getting started with vLLM

1. Environment Setup

Build Docker Images with latest vLLM main branch

$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ git checkout 3ca6ca2
$ docker build -f docker/Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .

Launch vLLM container

$ docker run -it \
             --rm \
             --network=host \
             --device /dev/dri:/dev/dri \
             -v /dev/dri/by-path:/dev/dri/by-path \
             --ipc=host \
             --privileged \
             --entrypoint bash \
             vllm-xpu-env

2. Run

The following commands are for demonstration purposes. You can try different model parallelism configurations depending on your requirements and Intel GPU type. We validated the configurations below on Intel® Arc™ Pro B60:

gemma-4-E2B-it on single card

gemma-4-E4B-it on single card

gemma-4-31B-it with tensor parallelism on 4 cards

gemma-4-26B-A4B-it with tensor parallelism on 4 cards
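
As a rough way to reason about why the larger models are spread over 4 cards, the sketch below estimates a lower bound on the card count from the weight size alone. Everything in it is an assumption for illustration: the helper names are hypothetical, 24 GB per B60 card is assumed, the 51.6 GB weight figure for gemma-4-26B-A4B-it is the one quoted in the community discussion further down, and rounding up to a power of two is a common convention rather than a vLLM rule. KV cache and activations need extra memory on top of this.

```python
import math


def min_cards_for_weights(weight_gb, card_memory_gb, headroom_fraction=0.0):
    """Smallest number of cards whose combined usable memory holds the weights.

    headroom_fraction reserves part of each card for KV cache and activations.
    """
    usable_gb = card_memory_gb * (1.0 - headroom_fraction)
    return math.ceil(weight_gb / usable_gb)


def next_power_of_two(n):
    """Round n (>= 1) up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


def suggest_tp_size(weight_gb, card_memory_gb, headroom_fraction=0.0):
    """Hypothetical helper: weight-only estimate rounded up to a power of two."""
    cards = min_cards_for_weights(weight_gb, card_memory_gb, headroom_fraction)
    return next_power_of_two(cards)


if __name__ == "__main__":
    # Assumed figures: 51.6 GB of weights, 24 GB per card.
    print(suggest_tp_size(51.6, 24.0))
```

Under these assumptions, two cards are not enough for the weights alone, and the estimate rounds up to 4, which matches the validated configuration above.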

Launch OpenAI-Compatible vLLM Server

vLLM provides an HTTP server that implements OpenAI's Completions API, Chat API, and more. This functionality lets you serve models and interact with them using an HTTP client.

You can use the vllm serve command to launch the server with tensor parallelism.

$ vllm serve $<MODEL_PATH> --tensor-parallel-size $<TP_SIZE> --enforce-eager --attention-backend TRITON_ATTN

Text Generation

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "How are you?"
                    }
                ]
            }
        ]
    }'

Image Captioning

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this image in one sentence."
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "$<IMAGE_ADDRESS>"
                        }
                    }
                ]
            }
        ]
    }'

Audio Captioning

$ curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "$<MODEL_PATH>",
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Describe this audio in one sentence."
                    },
                    {
                        "type": "audio_url",
                        "audio_url": {
                            "url": "$<AUDIO_ADDRESS>"
                        }
                    }
                ]
            }
        ]
    }'
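
The three curl requests above differ only in the content parts of the user message, so they can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical, and the default base URL assumes the server is listening on localhost:8000 as in the curl examples.

```python
import json
import urllib.request


def build_chat_payload(model, text, image_url=None, audio_url=None):
    """Build the same JSON body as the curl examples (hypothetical helper)."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_url is not None:
        content.append({"type": "audio_url", "audio_url": {"url": audio_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def chat(payload, base_url="http://localhost:8000"):
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # OpenAI-style chat responses carry the text in choices[0].message.content.
    return body["choices"][0]["message"]["content"]


# Example usage (requires a running server; replace the placeholders first):
#   payload = build_chat_payload("<MODEL_PATH>", "Describe this image in one sentence.",
#                                image_url="<IMAGE_ADDRESS>")
#   print(chat(payload))
```

Keeping payload construction separate from the network call makes it easy to inspect or log the request body before sending it.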

Getting started with Hugging Face Transformers

1. Environment Setup

Install latest transformers main branch

$ uv venv .my-env
$ source .my-env/bin/activate
 
$ git clone https://github.com/huggingface/transformers.git
$ cd transformers
$ uv pip install '.[torch]'
 
# install XPU PyTorch
$ uv pip install torch torchvision torchaudio torchao --index-url https://download.pytorch.org/whl/xpu --no-cache-dir

2. Run

The following commands are for demonstration purposes. You can try different model parallelism configurations depending on your requirements and Intel GPU type. We validated the configurations below on Intel® Arc™ Pro B60:

gemma-4-E2B-it on single card

gemma-4-E4B-it on single card

gemma-4-31B-it with tensor parallelism on 4 cards

gemma-4-26B-A4B-it with expert parallelism on 4 cards

We use the test.py Python script below to run text generation, image captioning, and audio captioning tasks.

import os
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, AutoModelForImageTextToText
from transformers.distributed import DistributedConfig
import torch.distributed as dist
 
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", required=True)
    parser.add_argument("--task", choices=["text", "image", "audio"], default="text")
    parser.add_argument("--dtype", default="bfloat16")
    parser.add_argument("--device", default="xpu")
    parser.add_argument("--max-new-tokens", type=int, default=1024)
    parser.add_argument("--tp", action="store_true", help="Enable tensor parallel loading with tp_plan=auto")
    parser.add_argument("--ep", action="store_true", help="Enable expert parallel loading")
    return parser.parse_args()
 
def get_dtype(dtype_name):
    return getattr(torch, dtype_name.removeprefix("torch."))
 
def get_rank():
    return int(os.environ.get("RANK", "0"))

def get_world_size():
    if dist.is_available() and dist.is_initialized():
        return dist.get_world_size()
    return int(os.environ.get("WORLD_SIZE", "1"))
 
def run_text_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, use_ep=False):
    load_kwargs = {
        "dtype": dtype,
    }

    if use_tp:
        load_kwargs["tp_plan"] = "auto"
    elif use_ep:
        load_kwargs["distributed_config"] = DistributedConfig(enable_expert_parallel=True)
 
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp and not use_ep:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {"role": "user", "content": "hi, how is the weather today?"},
    ]
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)
 
def run_text_image_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, use_ep=False):
    from PIL import Image
    import requests

    load_kwargs = {
        "dtype": dtype,
    }

    if use_tp:
        load_kwargs["tp_plan"] = "auto"
    elif use_ep:
        load_kwargs["distributed_config"] = DistributedConfig(enable_expert_parallel=True)
 
    model = AutoModelForImageTextToText.from_pretrained(model_id, **load_kwargs)
    if not use_tp and not use_ep:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        },
    ]
    url = "http://images.cocodataset.org/val2017/000000077595.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = processor(text=text, images=image, return_tensors='pt').to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        print(generated_text)
 
def run_text_audio_generation(model_id, dtype, device_str, max_new_tokens=1024, use_tp=False, use_ep=False):
    from io import BytesIO
    from urllib.request import urlopen

    import librosa

    load_kwargs = {
        "dtype": dtype,
    }

    if use_tp:
        load_kwargs["tp_plan"] = "auto"
    elif use_ep:
        load_kwargs["distributed_config"] = DistributedConfig(enable_expert_parallel=True)
 
    model = AutoModelForCausalLM.from_pretrained(model_id, **load_kwargs)
    if not use_tp and not use_ep:
        model = model.to(device_str)
    model = model.eval()
 
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/31a30e5cd27b5f87f2f5a9c2a9fae33d1ae1b29d/mary_had_lamb.mp3"},
                {"type": "text", "text": "Describe this audio in one sentence."},
            ],
        },
    ]
 
    processor = AutoProcessor.from_pretrained(model_id, use_fast=False)
    text = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
 
    audio_url = messages[0]["content"][0]["audio_url"]
    audio, sampling_rate = librosa.load(BytesIO(urlopen(audio_url).read()),  sr=processor.feature_extractor.sampling_rate)
    inputs = processor(text=text, audio=audio, sampling_rate=sampling_rate, return_tensors="pt")
    inputs = inputs.to(model.device)
 
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
 
    if get_rank() == 0:
        generated_ids = outputs[:, inputs.input_ids.shape[1] :]
        generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        print(generated_text)
 
def main():
    args = parse_args()
    dtype = get_dtype(args.dtype)
 
    runners = {
        "text": run_text_generation,
        "image": run_text_image_generation,
        "audio": run_text_audio_generation,
    }
    runners[args.task](
        args.model_id,
        dtype,
        args.device,
        max_new_tokens=args.max_new_tokens,
        use_tp=args.tp,
        use_ep=args.ep,
    )
 
    if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()
 
if __name__ == "__main__":
    main()

Small models such as gemma-4-E2B-it and gemma-4-E4B-it fit on a single card, so you can run them directly:

$ python test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio>

For large models such as gemma-4-31B-it and gemma-4-26B-A4B-it, you can use multiple cards by adding --tp for tensor parallelism or --ep for expert parallelism to your command.

For gemma-4-31B-it, we apply tensor parallelism on 4 cards:

$ torchrun --nproc-per-node 4 test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio> --tp

For gemma-4-26B-A4B-it, we apply expert parallelism on 4 cards:

$ torchrun --nproc-per-node 4 test.py --model-id <MODEL_PATH> --task <pick one from text, image, audio> --ep
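
The three launch variants above follow one pattern, which the sketch below captures as a small command builder. The function `build_launch_command` is a hypothetical convenience helper, not part of the article's script; it only assembles the same `python` and `torchrun` command lines shown above, using the flags that test.py defines.

```python
import shlex


def build_launch_command(model_path, task="text", parallelism=None,
                         num_cards=1, script="test.py"):
    """Return the argv list for a single-card or multi-card run.

    parallelism: None for a single card, "tp" or "ep" for a torchrun launch.
    """
    if task not in ("text", "image", "audio"):
        raise ValueError(f"unknown task: {task}")
    script_args = [script, "--model-id", model_path, "--task", task]
    if parallelism is None:
        return ["python", *script_args]
    if parallelism not in ("tp", "ep"):
        raise ValueError(f"unknown parallelism: {parallelism}")
    return ["torchrun", "--nproc-per-node", str(num_cards),
            *script_args, f"--{parallelism}"]


if __name__ == "__main__":
    # Print shell-ready commands for a multi-card and a single-card run.
    print(shlex.join(build_launch_command("<MODEL_PATH>", "image", "tp", 4)))
    print(shlex.join(build_launch_command("<MODEL_PATH>", "audio")))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with model paths that contain spaces.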

Give it a try!


Community

joostm8

Apr 9

•

edited Apr 14

edit: managed to get it all up and running, see my reply below.

I'm trying this out on my dual B60 system, but following the guide and running the vllm serve command I get the following error:

root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_xpu.so: undefined symbol: _ZN3ccl2v128reducti

Using Claude I managed to resolve it by adding an additional export to my path:

export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')

But then it errors out with a level_zero backend failure.

/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/__init__.py", line 14, in <module>
    import vllm.env_override  # noqa: F401
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/vllm/env_override.py", line 87, in <module>
    import torch
  File "/opt/venv/lib/python3.12/site-packages/torch/__init__.py", line 442, in <module>
    from torch._C import *  # noqa: F403
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# export LD_LIBRARY_PATH=/opt/intel/oneapi/ccl/2021.17/lib:$(echo $LD_LIBRARY_PATH | sed 's|/opt/intel/oneapi/ccl/2021.15/lib/||g')
root@2a02-1810-c3f-6500-637b-49e9-54be-21c2:/workspace/vllm# vllm serve google/gemma-4-26B-A4B-it --tensor-parallel-size 2 --enforce-eager --attention-backend TRITON_ATTN
Traceback (most recent call last):
  File "/opt/venv/bin/vllm", line 4, in <module>
    from vllm.entrypoints.cli.main import main
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/__init__.py", line 3, in <module>
    from vllm.entrypoints.cli.benchmark.latency import BenchmarkLatencySubcommand
  File "/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/cli/benchmark/latency.py", line 5, in <module>
    from vllm.benchmarks.latency import add_cli_args, main
  File "/opt/venv/lib/python3.12/site-packages/vllm/benchmarks/latency.py", line 15, in <module>
    from vllm.engine.arg_utils import EngineArgs
  File "/opt/venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 35, in <module>
    from vllm.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 19, in <module>
    from vllm.config.model import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/config/model.py", line 30, in <module>
    from vllm.transformers_utils.config import (
  File "/opt/venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 19, in <module>
    from transformers.models.auto.image_processing_auto import get_image_processor_config
  File "/opt/venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 24, in <module>
    from ...image_processing_utils import ImageProcessingMixin
  File "/opt/venv/lib/python3.12/site-packages/transformers/image_processing_utils.py", line 34, in <module>
    from .processing_utils import ImagesKwargs, Unpack
  File "/opt/venv/lib/python3.12/site-packages/transformers/processing_utils.py", line 79, in <module>
    from .modeling_utils import PreTrainedAudioTokenizerBase
  File "/opt/venv/lib/python3.12/site-packages/transformers/modeling_utils.py", line 73, in <module>
    from .integrations.sdpa_attention import sdpa_attention_forward
  File "/opt/venv/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 12, in <module>
    _is_torch_xpu_available = is_torch_xpu_available()
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/transformers/utils/import_utils.py", line 313, in is_torch_xpu_available
    return hasattr(torch, "xpu") and torch.xpu.is_available()
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 74, in is_available
    return device_count() > 0
           ^^^^^^^^^^^^^^
  File "/opt/venv/lib/python3.12/site-packages/torch/xpu/__init__.py", line 68, in device_count
    return torch._C._xpu_getDeviceCount()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: level_zero backend failed with error: 2147483646 (UR_RESULT_ERROR_UNKNOWN)

Tried debugging further with help of Claude, but didn't seem to make it much further. Any chance the guide could be revisited? Seems like there is something wrong with the way the docker container is currently built.

xuechendi

Article author Apr 9

•

edited Apr 9

Thanks for trying it out!
For the first issue:
Great to know that you figured it out. This is a temporary issue caused by the recent update to PyTorch 2.11, and we will resolve it soon. Alternatively, use any commit before #34644 (2111997f96f33b118ec0c562cc2df5681862cff3) as a temporary workaround.

For the second issue:
I noticed that you're using 2×B60, so the total device memory is <48GB, while google/gemma-4-26B-A4B-it has 51.6GB of weights in total. Please try TP=4.

miraculli

Apr 13

I cannot get E2B or E4B running on my Arc B60.
sycl-ls shows the GPU inside the Docker container, but running vllm serve throws
terminate called after throwing an instance of 'sycl::_V1::exception'
I tried with the exact 66e86f1 version and with main.

yintongl

Article author Apr 13

Hi,
Thanks for trying it out.
Could you please provide more details about your software stack and the error log?

stefan-it

25 days ago

•

edited 25 days ago

My B70 is going to arrive next week, can't wait to try it out 😍 Thanks for your hard work on getting it running 🤗

stefan-it

21 days ago

Hi @yintongl ,

do you have any recommendations for getting the B70 working locally on Ubuntu 26.04?

I tried to follow these steps, but there are no Ubuntu 26.04 packages in the ppa.

I installed some Intel packages:

sudo apt install -y libze-dev intel-ocloc
sudo apt install intel-opencl-icd

and PyTorch using

uv pip install torch torchvision torchaudio torchao --index-url https://download.pytorch.org/whl/xpu --no-cache-dir

However:

print(f"XPU available: {torch.xpu.is_available()}")

unfortunately outputs:

/home/stefan/Repositories/intel-test/.my-env/lib/python3.14/site-packages/torch/xpu/__init__.py:68: UserWarning: XPU device count is zero! (Triggered internally at /pytorch/c10/xpu/XPUFunctions.cpp:113.)
  return torch._C._xpu_getDeviceCount()
XPU available: False

Is my system missing some Intel packages?

stefan-it

21 days ago

I think the solution is to install the Compute Runtime:

https://github.com/intel/compute-runtime/releases

I downloaded all deb files from the latest release and installed them:

(.my-env) stefan@stefan-bench:~/Repositories/intel-test$ python
Python 3.14.4 (main, Apr  8 2026, 04:02:31) [GCC 15.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> 
>>> import torch
>>> 
>>> 
>>> print(f"XPU available: {torch.xpu.is_available()}")
XPU available: True
>>> 
>>>

It seems to work now 🥳
