Why is granite-docling-258M so slow?

#37
by hgarp-prozis - opened

Hi everyone,

I've been testing the ibm-granite/granite-docling-258M model (not the ONNX version) and I'm running into severe performance issues, even when using a powerful GPU (RTX 6000 Ada / 48 GB VRAM).

The model takes an unexpectedly long time to process even a single page, and resource usage doesn't seem to justify the delay. Given that this is a relatively small model (~258M parameters), I would expect it to be reasonably fast even on CPU, or at least near real-time on GPU.

My main questions:

Is there any internal throttling or hidden preprocessing (e.g. image segmentation, OCR fallback) that could explain the slowdown?

Is there any more detailed documentation on the model architecture, runtime flow, or inference pipeline?

Are there recommended settings or flags (like disabling unused components, OCR, or auto-device detection) to make it run faster, especially on CPU?

I've tested both direct loading and pipeline-based inference, and both exhibit the same latency pattern. If anyone from the development team or community has achieved near-real-time results, could you please share your configuration and runtime stats?

only vllm module works

hgarp-prozis, do you have any sample code on using vllm to serve granite?

For what it's worth, the bf16 gguf with llama.cpp is pretty fast on a 4090: https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/assets/new_arxiv.png gets OCRed at 506 t/s.

Yeah I tried the webGPU demo from Xenova and it was slower than expected on my 32GB M1 Pro.

Several billion-parameter models with LM studio output tokens wayyyyy faster than this 258M parameter model for some reason.

I think that when discussing performance it would be useful to be specific about what exactly is being evaluated and how. What kind of file? If a PDF, how many pages? There's a huge difference in speed between running the full pipeline (which itself can be run directly from bash, or from the Gradio UI in their Docker image) and just the model (what engine? what quant?).
In my case, the same file I mentioned above gets processed in ~35 seconds by the pipeline started from the Docker UI, and in 6 seconds with llama-server and a bf16 gguf.

Interestingly, the response I get using the "simple", non-pipeline model call is more accurate than the one from the pipeline. Example output:

pipeline (~35 seconds):

Figure 2. Estimated captured of the planet assuming the planet radiates blackbody The captured fux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over 0 all   wavelengths, = B(A,T) dX. The captured fux fraction 1s shown for [0.6-2.85 um] (red line); Hubble WFC3 [1.12-1.64 (dashed green line); NIRSpec  G395H [2.7-5.15 um] (dash dotted blue line) . The red-shaded region shows the temperature range on WASP-121 b based on our Tef estimates. Red dashed lines indicate the boundaries of the planet'8 temperature range within the NIRISS SOSS captured flux fraction From this we estimate that these observations capture beon orbital phase. the minimum temperature from the NAMELESS this estimate decreases to 50%. In either case; the wavelength coverage of NIRISS exceeds that of any other instrument . flux xrnax i.e., ing Using fit,

llama-server -m danchev/ibm-granite-docling-258M-GGUF (~5 seconds, 486 t/s)

Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λ$_{min}$ λ$_{min}$ B ( λ, T ) dλ/ ∫ ∞ 0 B ( λ, T ) dλ. The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 ¡m] (red line); Hubble WFC3 [1.12-1.64 ¡m] (dashed green line); NIRSpec G395H [2.7-5.15 ¡m] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.

While not perfect (it missed the / between NIRISS and SOSS), the direct model call seems more accurate to me while also being much faster. Of course, the output of the direct model call needs to be post-processed as it includes a bunch of loc_ tags, but overall, unless I'm missing something, the direct model call approach seems preferable.

IBM Granite org

Hi all! We'll definitely dig into performance concerns and keep everyone posted.

One important thing to keep in mind with VLMs versus LLMs is that each VLM translates images to tokens differently. Some simply scale the image to a fixed size and encode that; others break the image up into tiles and encode each separately, resulting in far more input tokens. Granite Docling does the latter, so a fair amount of the slowness will be caused by simply having a lot of prompt tokens to process in prefill. These image tiling strategies can be thought of as a form of test-time compute, where images that get tiled more aggressively essentially give the model more tokens to look at. Additionally, the image preprocessing itself can be slow (rescaling, resizing). This portion will depend a lot on which implementation you're using for these preprocessing steps, so you could easily see pretty large variations between inference engines.
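As a rough way to see this in practice (a small sketch, not from the original post, reusing the transformers snippet shared later in this thread; the pixel_values key name is the usual transformers convention), you can print how many prompt tokens the processor produces for the sample page:

from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Total prompt length the language model has to prefill (text tokens plus expanded image tokens)
print("prompt tokens:", inputs["input_ids"].shape[1])
# Preprocessed image tensor(s) produced by the rescaling/tiling step
print("pixel_values shape:", tuple(inputs["pixel_values"].shape))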

For folks experiencing slowness with the model, it would be great if you can share the following:

  • What inference engine are you using to run the model?
  • Are you using this with the docling library, or the model directly?
  • If using the model directly, what are the dimensions of the input image, and what prompt text are you sending with it?
  • If using the docling library, any reproducible code snippets and inputs would be extremely helpful.

Thanks for all the great interest in this model and project!

@gabegoodhart Thank you so much for your comments and insights, and especially for creating this framework and putting so much work into making it available in so many different ways!

This might not be directly related to the topic of this discussion (speed), and if needed I can start a new one, but I was wondering if you could share your thoughts on the differences in accuracy between using the model directly vs. the docling Gradio UI in Docker that I described above? Thanks a lot.

Hi @gabegoodhart ,

I ran granite-docling-258M with transformers on an RTX 4070 GPU, using the example code shared in the model card to process one image.

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from pathlib import Path

# Load model and processor
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path="ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print(f"DocTags: \n{doctags}\n")


# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")

## export as any format.
# Path("out/").mkdir(parents=True, exist_ok=True)
# HTML:
# output_path_html = Path("out/") / "example.html"
# doc.save_as_html(output_path_html)
# Markdown:
# output_path_md = Path("out/") / "example.md"
# doc.save_as_markdown(output_path_md)

The output is good, but processing an entire PDF file is going to take hours with the current generation speed that I am getting.

I switched to llama.cpp and downloaded the half-precision GGUF model from ggml-org. I was able to process the same image in 3.03 s with a generation speed of 394.17 t/s.

Here is how I ran the model with llama.cpp:

llama-mtmd-cli \
    --model ~/.cache/llama.cpp/granite-docling-258M-f16.gguf \
    --mmproj ~/.cache/llama.cpp/mmproj-granite-docling-258M-f16.gguf \
    --n-gpu-layers 999

I loaded the image:

> /image /path/to/image.png
/path/to/image.png image loaded

After that I gave the prompt:

> Convert this page to docling.

Hit Ctrl+C to see the generation speed:

llama_perf_context_print:        load time =     183.19 ms
llama_perf_context_print: prompt eval time =     427.23 ms /   877 tokens (    0.49 ms per token,  2052.75 tokens per second)
llama_perf_context_print:        eval time =    4401.00 ms /  1774 runs   (    2.48 ms per token,   403.09 tokens per second)
llama_perf_context_print:       total time =  127936.52 ms /  2651 tokens
llama_perf_context_print:    graphs reused =       1766

The generation speed is 403.09 t/s, which is much better than what I was getting with transformers.

From 5 minutes to just 3 seconds, that is a 100x speedup! I really wonder why the transformers implementation is so slow.
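If it helps when comparing engines, a rough tokens-per-second number can be computed around the generate call from the snippet above (a small sketch using the same variable names; generate() returns the prompt plus the new tokens, so the prompt length is subtracted):

import time

start = time.time()
generated_ids = model.generate(**inputs, max_new_tokens=8192)
elapsed = time.time() - start

# Count only the newly generated tokens, not the prompt that generate() echoes back
new_tokens = generated_ids.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.1f} s -> {new_tokens / elapsed:.1f} t/s")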

I have a question: the output from llama.cpp is different from the one from transformers.

llama.cpp:

<loc_114><loc_27><loc_385><loc_34>Energy Budget of WASP-121b from JWST/NIRISS Phase Curve
<loc_454><loc_28><loc_462><loc_34>9
<loc_41><loc_42><loc_241><loc_87>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.
<loc_41><loc_89><loc_241><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. NaΓ―vely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, Οƒ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameer. Following Coulombe et al. (2025) we set an upper prior limit of 3/2 on all albedo slices as a fully Lambertian sphere ( A$_{i}$ = 1) corresponds to a geometric albedo of A$_{g}$ = 2/3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.
<loc_41><loc_207><loc_241><loc_270>We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N$_{slice}$ = 4, 6, 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.
<loc_73><loc_277><loc_212><loc_283>2.5. Planetary Effective Temperature
<loc_41><loc_286><loc_241><loc_346>Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partimer & Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b (~ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.
<loc_41><loc_348><loc_241><loc_364>We convert the fitted F$_{p}$ / F$_{βˆ—}$ emission spectra to brightness temperature by wavelength,
<loc_60><loc_368><loc_240><loc_388>T _ { \text {bright} } = \frac { b c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , \text {planet} } } + 1 \right ) \right ] ^ { - 1 } \quad , \quad ( 1 6 )
<loc_41><loc_391><loc_178><loc_398>where the planet's thermal emission is
<loc_85><loc_404><loc_240><loc_418>B _ { \lambda , \, p l a n e t } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , \, s t a r } \, .
<loc_41><loc_425><loc_241><loc_455>There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz & Cowan 2015;
<loc_273><loc_50><loc_454><loc_134><line_chart><loc_261><loc_141><loc_462><loc_265>Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λmax λ$_{min}$ B ( λ,T) dλ/ ∫ ∞ 0 B ( λ,T ) dλ . The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 ¡m] (red line); Hubble WFC3 [1.12-1.64 ¡m] (dashed green line); NIRSpec G395H [2.7-5.15 ¡m] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.
<loc_261><loc_274><loc_462><loc_360>Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F$_{p}$ / F$_{s}$ emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T$_{eff}$ for a given orbital phase as
<loc_317><loc_363><loc_460><loc_382>T _ { e f f } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { b r i g h t , i } } { \sum _ { i = 1 } ^ { N } w _ { i } } ,
<loc_262><loc_385><loc_462><loc_415>where w$_{i}$ is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,
<loc_306><loc_417><loc_462><loc_437>w _ { i } = \frac { \int _ { \lambda _ { i } + 1 } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,
<loc_262><loc_441><loc_462><loc_455>with T$_{est}$ representing an estimated effective temperature at the orbital phase of interest. When computing

transformers:

<doctag><page_header><loc_115><loc_27><loc_385><loc_34>Energy Budget of WASP-121 b from JWST/NIRISS Phase Curve</page_header>
<page_header><loc_454><loc_28><loc_459><loc_34>9</page_header>
<text><loc_41><loc_42><loc_239><loc_88>while the kernel weights are structured as ( N$_{slice}$ , N$_{time}$ ). This precomputation significantly accelerates our calculations, which is essential since the longitudinal slices are at least partially degenerate with one another. Consequently, the fits require more steps and walkers to ensure proper convergence.</text>
<text><loc_41><loc_89><loc_239><loc_206>To address this, we follow a similar approach to our sinusoidal fits using emcee , but we increase the total number of steps to 100,000 and use 100 walkers. Na¨ıvely, the fit would include 2 N$_{slice}$ + 1 parameters: N$_{slice}$ for the albedo values, N$_{slice}$ for the emission parameters, and one additional scatter parameter, σ . However, since night-side slices do not contribute to the reflected light component, we exclude these albedo values from the fit. In any case, our choice of 100 walkers ensures a sufficient number of walkers per free parameter. Following Coulombe et al. (2025) we set an upper prior limit of 3 / 2 on all albedo slices as a fully Lambertian sphere ( A$_{i}$ = 1 ) corresponds to a geometric albedo of A$_{g}$ = 2 / 3. For thermal emission we impose a uniform prior between 0 and 500 ppm for each slice.</text>
<text><loc_41><loc_207><loc_239><loc_269>We choose to fit our detrended lightcurves considering 4, 6 and 8 longitudinal slices ( N$_{slice}$ = 4 , 6 , 8). However, we show the results of the simplest 4 slice model. As in our previous fits, we conduct an initial run with 25,000 steps (25% of the total run) and use the maximumprobability parameters from this preliminary fit as the starting positions for the final 75,000-step run. We then discard the first 60% of the final run as burn-in.</text>
<section_header_level_1><loc_73><loc_276><loc_207><loc_283>2.5. Planetary Effective Temperature</section_header_level_1>
<text><loc_41><loc_286><loc_239><loc_348>Phase curves are the only way to probe thermal emission from the day and nightside of an exoplanet and hence determine its global energy budget (Partier & Crossfield 2018). The wavelength range of NIRISS/SOSS covers a large portion of the emitted flux of WASP-121 b ( ∼ 50-83%; see Figure 2), enabling a precise and robust constraint of the planet's energy budget.</text>
<text><loc_41><loc_349><loc_239><loc_364>We convert the fitted F$_{p}$ / F$_{βˆ—}$ emission spectra to brightness temperature by wavelength,</text>
<formula><loc_60><loc_368><loc_238><loc_387>T _ { b r i g h t } = \frac { h c } { k \lambda } \cdot \left [ \ln \left ( \frac { 2 b c ^ { 2 } } { \lambda ^ { 5 } B _ { \lambda , p l a n e t } } + 1 \right ) \right ] ^ { - 1 } ,</formula>
<text><loc_41><loc_391><loc_178><loc_398>where the planet's thermal emission is</text>
<formula><loc_84><loc_403><loc_238><loc_419>B _ { \lambda , \text {planet} } = \frac { F _ { p } / F _ { * } } { ( R _ { p } / R _ { * } ) ^ { 2 } } \cdot B _ { \lambda , \text {star} } \, .</formula>
<text><loc_41><loc_425><loc_239><loc_455>There are many ways of converting brightness temperatures to effective temperature, including the ErrorWeighted Mean (EWM), Power-Weighted mean (PWM) and with a Gaussian Process (Schwartz & Cowan 2015;</text>
<chart><loc_273><loc_49><loc_454><loc_134><line_chart><caption><loc_261><loc_141><loc_459><loc_264>Figure 2. Estimated captured flux of the planet assuming the planet radiates as a blackbody. The captured flux is calculated as the ratio of the integrated blackbody emission within the instrument's band pass to the total emission over all wavelengths, i.e., γ = ∫ λ$_{max}$ λ$_{min}$ B ( λ, T ) dλ/ ∫ ∞ 0 B ( λ, T ) dλ . The captured flux fraction is shown for NIRISS SOSS [0.6-2.85 ¡ m] (red line); Hubble WFC3 [1.12-1.64 ¡ m] (dashed green line); NIRSpec G395H [2.7-5.15 ¡ m] (dash dotted blue line). The red-shaded region shows the temperature range on WASP-121 b based on our T$_{eff}$ estimates. Red dashed lines indicate the boundaries of the planet's temperature range within the NIRISS SOSS captured flux fraction. From this we estimate that these observations capture between 55% and 82% of the planet's bolometric flux, depending on orbital phase. Using the minimum temperature from the NAMELESS fit, this estimate decreases to 50%. In either case, the wavelength coverage of NIRISS exceeds that of any other instrument.</caption></chart>
<text><loc_261><loc_273><loc_459><loc_359>Pass et al. 2019). In this work, we elect to compute our effective temperature estimates with a novel method that is essentially a combination of the PWM and EWM. We create the effective temperature by using a simple Monte Carlo process. First, we perturb our F$_{p}$ / F$_{s}$ emission spectra at each point in the orbit by a Gaussian based on the measurement uncertainty. Our new emission spectrum is then used to create an estimate of the brightness temperature spectrum. This process is repeated at each orbital phase. We then estimate the effective temperature, T$_{eff}$ for a given orbital phase as</text>
<formula><loc_317><loc_362><loc_459><loc_382>T _ { \text {eff} } = \frac { \sum _ { i = 1 } ^ { N } w _ { i } T _ { \text {bright,} } , } { \sum _ { i = 1 } ^ { N } w _ { i } } ,</formula>
<text><loc_261><loc_384><loc_459><loc_414>where w$_{i}$ is the weight for the i -th wavelength given by the fraction of the planet's bolometric flux that falls within that wavelength bin scaled by the inverse variance of the measurement,</text>
<formula><loc_305><loc_417><loc_459><loc_437>w _ { i } = \frac { \int _ { \lambda _ { i } } ^ { \lambda _ { i } + 1 } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } { \int _ { 0 } ^ { \infty } B ( \lambda _ { i } , T _ { \text {est} } ) \, d \lambda } \cdot \frac { 1 } { \sigma _ { i } ^ { 2 } } ,</formula>
<text><loc_261><loc_440><loc_459><loc_454>with T$_{est}$ representing an estimated effective temperature at the orbital phase of interest. When computing</text>
</doctag><|end_of_text|>

It seems that the output from llama.cpp lacks the appropriate Docling tags (like text, formula, etc.).

When I try to pass the output from llama.cpp to the following code, I get an empty string, but it works fine for the output from transformers:

from IPython.display import Markdown, display
from docling_core.types.doc.document import DoclingDocument, DocTagsDocument

# doc_tags holds the raw model output; image is the page image loaded earlier
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doc_tags], [image])
document = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
extracted_text_markdown = document.export_to_markdown()
display(Markdown(extracted_text_markdown))

Do you know a solution to this issue?

Thanks in advance.

Edit:

I have noticed that the bounding boxes are separated by newline characters. I have used the following code to get the final text from the raw output.

import re


def extract_inner_text(text_chunk: str) -> str:
    return re.sub(r"<.*?>", "", text_chunk, flags=re.DOTALL).strip()


extracted_text_llama_cpp = ""
for line in doc_tags_llama_cpp.splitlines():
    extracted_text_llama_cpp += extract_inner_text(line) + "\n"

print(extracted_text_llama_cpp)
IBM Granite org

@InformaticsSolutions

and especially for creating this framework and putting so much work on making it available in so many different ways!

I definitely appreciate it! The real starrs (pun intended) of the show here are @PeterWJStaar @dolfim-ibm and the rest of the Docling team. I'm just the messenger (and llama.cpp interface).

but was wondering if you can share your thoughts about the differences in accuracy between using the model directly vs the docling gradio UI in docker that i described above?

This would definitely make a good discussion on its own since I expect there are a number of folks with similar questions. I'll let the Docling team comment on the details, but my rough answer is that the full package can be customized much more to the specific documents. The library itself supports multiple backends, both for format conversion and visual parsing. The default settings are tuned for a good quality/speed tradeoff, but you may find that with the default settings you get worse quality than the raw model if the raw model itself isn't part of the default pipeline (I got to here tracing the defaults, but it goes deeper than that to figure out what is part of the defaults).
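For reference, recent docling versions also let you route conversion through the VLM pipeline explicitly instead of the default layout pipeline (a sketch based on the converter API in the docling docs; exact option names can vary between releases, so treat it as an outline rather than a statement of the defaults):

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Route PDF conversion through the model-driven VLM pipeline instead of the default PDF pipeline
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_cls=VlmPipeline)}
)
result = converter.convert("document.pdf")  # placeholder path: any local PDF or page image
print(result.document.export_to_markdown())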

@ImadSaddik

I ran the granite-docling-258M with transformers on an RTX 4070 GPU

This is really great detail, thanks for sharing it all! I'm on a Mac, so I typically go through the mps backend and can't easily verify the slowness you're seeing on CUDA, but with a slightly modified version of your script, I'm definitely seeing it peg my GPU for a long time (still running as we speak).

docling-repro.py
import torch
import time
from datetime import timedelta
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
from pathlib import Path

# Load model and processor
DEVICE = "cpu"
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
print(f"USING DEVICE: {DEVICE}")
start_time = time.time()
model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    pretrained_model_name_or_path=model_path,
    torch_dtype=torch.bfloat16,
).to(DEVICE)
print("==> Done loading model: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
print("==> Done preparing inputs: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
print("==> Done generating: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print(f"DocTags: \n{doctags}\n")


# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")
print("==> Done: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

## export as any format.
# Path("out/").mkdir(parents=True, exist_ok=True)
# HTML:
# output_path_html = Path("out/") / "example.html"
# doc.save_as_html(output_path_html)
# Markdown:
# output_path_md = Path("out/") / "example.md"
# doc.save_as_markdown(output_path_md)
USING DEVICE: mps
`torch_dtype` is deprecated! Use `dtype` instead!
==> Done loading model: 0.582172s
==> Done preparing inputs: 1.439638s
# Ctrl-C after several minutes

I also see that the GPU VRAM steadily climbs. It started out around 30GB total (my baseline is around 12GB with other workloads), and was up to about 50GB when I stopped it. This definitely feels buggy, so we'll dig deeper and see if we can get to the bottom of it.
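In case it helps anyone reproduce the memory observation, the allocator stats can be printed around the generate call (a sketch; it assumes a recent PyTorch where torch.mps.current_allocated_memory() is available, with torch.cuda.memory_allocated() as the CUDA equivalent):

import torch

def print_mem(tag: str) -> None:
    # Rough snapshot of memory held by the PyTorch allocator on the active backend
    if torch.cuda.is_available():
        print(f"{tag}: {torch.cuda.memory_allocated() / 1e9:.2f} GB allocated (CUDA)")
    elif torch.backends.mps.is_available():
        print(f"{tag}: {torch.mps.current_allocated_memory() / 1e9:.2f} GB allocated (MPS)")

print_mem("before generate")
generated_ids = model.generate(**inputs, max_new_tokens=8192)
print_mem("after generate")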

@gabegoodhart

Thanks for the reply and for the work you and the team are doing.

I am happy with using llama.cpp for now, but I will keep an eye on this issue. Hopefully, it will get fixed soon.

IBM Granite org

One interesting piece of debugging: If I fully disable the image inputs, I still see much slower generation than I would expect, so this appears to be an issue in the language model and not the number of image tokens or the preprocessing stack.

Let me try that too.

I have tried keeping just the text input and I observed the same thing.

IBM Granite org

Interestingly, on CPU, it's 2x faster than on mps with just the llama model portion. Something is definitely working incorrectly here. If I monitor my GPU utilization with nvtop while running it, I see short bursts of GPU utilization, but not sustained usage like I would expect from a model fully allocated to the device.

In my case, the GPU is fully utilized.

(screenshot: GPU utilization)

IBM Granite org

I created a copy of the text model that acts purely as a LlamaForCausalLM model and I still see the same behavior.

config.json
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 100264,
  "dtype": "bfloat16",
  "eos_token_id": 100257,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 576,
  "initializer_range": 0.02,
  "intermediate_size": 1536,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 9,
  "num_hidden_layers": 30,
  "num_key_value_heads": 3,
  "pad_token_id": 100257,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 100000.0,
  "tie_word_embeddings": true,
  "use_cache": false,
  "vocab_size": 100352
}
docling-repro-text-only.py
import torch
import time
from datetime import timedelta
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.image_utils import load_image
from pathlib import Path

# Load model and processor
DEVICE = "cpu"
if torch.cuda.is_available():
    DEVICE = "cuda"
elif torch.backends.mps.is_available():
    DEVICE = "mps"
print(f"USING DEVICE: {DEVICE}")
start_time = time.time()
model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M-text-only"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoModelForCausalLM.from_pretrained(model_path).to(DEVICE)

print("==> Done loading model: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

# Prepare inputs
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")  # kept from the original script; unused here since the image content entry is commented out below
messages = [
    {
        "role": "user",
        "content": [
            # {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

inputs = tokenizer(prompt, return_tensors="pt")
inputs = inputs.to(DEVICE)
print("==> Done preparing inputs: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=512, use_cache=True)
print("==> Done generating: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
result = tokenizer.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print("==> Done: {}s".format(timedelta(seconds=time.time() - start_time).total_seconds()))

This clearly shows the same behavior (CPU is ~3x faster). My assumption is that there is not some glaring bug in modeling_llama.py, otherwise it would have surfaced a long time ago for Llama models. This leads me to believe that there's something about the shape of Granite Docling that is causing the overhead.
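For reference, the checkpoint stores the language-model weights under a model.text_model.* prefix (visible in the "unused weights" output further down), so a plain LlamaForCausalLM load cannot match them by name. A minimal sketch of how a text-only copy could be built by remapping those keys, assuming a single model.safetensors shard and relying on tie_word_embeddings from the config above for the lm_head:

from pathlib import Path
from safetensors.torch import load_file, save_file

src = Path("granite-docling-258M")            # original checkpoint directory (placeholder path)
dst = Path("granite-docling-258M-text-only")  # copy that holds the config.json shown above

state = load_file(src / "model.safetensors")  # assumes a single-shard checkpoint
prefix = "model.text_model."
remapped = {}
for name, tensor in state.items():
    if name.startswith(prefix):
        # e.g. model.text_model.layers.0.* -> model.layers.0.*, model.text_model.norm.* -> model.norm.*
        remapped["model." + name[len(prefix):]] = tensor

dst.mkdir(exist_ok=True)
save_file(remapped, dst / "model.safetensors", metadata={"format": "pt"})
# lm_head.weight is not stored separately: tie_word_embeddings=true reuses model.embed_tokens.weight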

Here is what I have done: I copied the folder that contains the docling model, renamed it to text-only like you did in the script, and replaced the config.json file.

I loaded that new model using AutoModelForCausalLM and got this output telling me that some weights were not used:

Unused weights
Some weights of the model checkpoint at /home/imad-saddik/.cache/huggingface/hub/models--ibm-granite--granite-docling-258M-text-only/snapshots/982fe3b40f2fa73c365bdb1bcacf6c81b7184bfe/ were not used when initializing LlamaForCausalLM: ['model.connector.modality_projection.proj.weight', 'model.text_model.embed_tokens.weight', 'model.text_model.layers.0.input_layernorm.weight', 'model.text_model.layers.0.mlp.down_proj.weight', 'model.text_model.layers.0.mlp.gate_proj.weight', 'model.text_model.layers.0.mlp.up_proj.weight', 'model.text_model.layers.0.post_attention_layernorm.weight', 'model.text_model.layers.0.self_attn.k_proj.weight', 'model.text_model.layers.0.self_attn.o_proj.weight', 'model.text_model.layers.0.self_attn.q_proj.weight', 'model.text_model.layers.0.self_attn.v_proj.weight', 'model.text_model.layers.1.input_layernorm.weight', 'model.text_model.layers.1.mlp.down_proj.weight', 'model.text_model.layers.1.mlp.gate_proj.weight', 'model.text_model.layers.1.mlp.up_proj.weight', 'model.text_model.layers.1.post_attention_layernorm.weight', 'model.text_model.layers.1.self_attn.k_proj.weight', 'model.text_model.layers.1.self_attn.o_proj.weight', 'model.text_model.layers.1.self_attn.q_proj.weight', 'model.text_model.layers.1.self_attn.v_proj.weight', 'model.text_model.layers.10.input_layernorm.weight', 'model.text_model.layers.10.mlp.down_proj.weight', 'model.text_model.layers.10.mlp.gate_proj.weight', 'model.text_model.layers.10.mlp.up_proj.weight', 'model.text_model.layers.10.post_attention_layernorm.weight', 'model.text_model.layers.10.self_attn.k_proj.weight', 'model.text_model.layers.10.self_attn.o_proj.weight', 'model.text_model.layers.10.self_attn.q_proj.weight', 'model.text_model.layers.10.self_attn.v_proj.weight', 'model.text_model.layers.11.input_layernorm.weight', 'model.text_model.layers.11.mlp.down_proj.weight', 'model.text_model.layers.11.mlp.gate_proj.weight', 'model.text_model.layers.11.mlp.up_proj.weight', 'model.text_model.layers.11.post_attention_layernorm.weight', 'model.text_model.layers.11.self_attn.k_proj.weight', 'model.text_model.layers.11.self_attn.o_proj.weight', 'model.text_model.layers.11.self_attn.q_proj.weight', 'model.text_model.layers.11.self_attn.v_proj.weight', 'model.text_model.layers.12.input_layernorm.weight', 'model.text_model.layers.12.mlp.down_proj.weight', 'model.text_model.layers.12.mlp.gate_proj.weight', 'model.text_model.layers.12.mlp.up_proj.weight', 'model.text_model.layers.12.post_attention_layernorm.weight', 'model.text_model.layers.12.self_attn.k_proj.weight', 'model.text_model.layers.12.self_attn.o_proj.weight', 'model.text_model.layers.12.self_attn.q_proj.weight', 'model.text_model.layers.12.self_attn.v_proj.weight', 'model.text_model.layers.13.input_layernorm.weight', 'model.text_model.layers.13.mlp.down_proj.weight', 'model.text_model.layers.13.mlp.gate_proj.weight', 'model.text_model.layers.13.mlp.up_proj.weight', 'model.text_model.layers.13.post_attention_layernorm.weight', 'model.text_model.layers.13.self_attn.k_proj.weight', 'model.text_model.layers.13.self_attn.o_proj.weight', 'model.text_model.layers.13.self_attn.q_proj.weight', 'model.text_model.layers.13.self_attn.v_proj.weight', 'model.text_model.layers.14.input_layernorm.weight', 'model.text_model.layers.14.mlp.down_proj.weight', 'model.text_model.layers.14.mlp.gate_proj.weight', 'model.text_model.layers.14.mlp.up_proj.weight', 'model.text_model.layers.14.post_attention_layernorm.weight', 'model.text_model.layers.14.self_attn.k_proj.weight', 
'model.text_model.layers.14.self_attn.o_proj.weight', 'model.text_model.layers.14.self_attn.q_proj.weight', 'model.text_model.layers.14.self_attn.v_proj.weight', 'model.text_model.layers.15.input_layernorm.weight', 'model.text_model.layers.15.mlp.down_proj.weight', 'model.text_model.layers.15.mlp.gate_proj.weight', 'model.text_model.layers.15.mlp.up_proj.weight', 'model.text_model.layers.15.post_attention_layernorm.weight', 'model.text_model.layers.15.self_attn.k_proj.weight', 'model.text_model.layers.15.self_attn.o_proj.weight', 'model.text_model.layers.15.self_attn.q_proj.weight', 'model.text_model.layers.15.self_attn.v_proj.weight', 'model.text_model.layers.16.input_layernorm.weight', 'model.text_model.layers.16.mlp.down_proj.weight', 'model.text_model.layers.16.mlp.gate_proj.weight', 'model.text_model.layers.16.mlp.up_proj.weight', 'model.text_model.layers.16.post_attention_layernorm.weight', 'model.text_model.layers.16.self_attn.k_proj.weight', 'model.text_model.layers.16.self_attn.o_proj.weight', 'model.text_model.layers.16.self_attn.q_proj.weight', 'model.text_model.layers.16.self_attn.v_proj.weight', 'model.text_model.layers.17.input_layernorm.weight', 'model.text_model.layers.17.mlp.down_proj.weight', 'model.text_model.layers.17.mlp.gate_proj.weight', 'model.text_model.layers.17.mlp.up_proj.weight', 'model.text_model.layers.17.post_attention_layernorm.weight', 'model.text_model.layers.17.self_attn.k_proj.weight', 'model.text_model.layers.17.self_attn.o_proj.weight', 'model.text_model.layers.17.self_attn.q_proj.weight', 'model.text_model.layers.17.self_attn.v_proj.weight', 'model.text_model.layers.18.input_layernorm.weight', 'model.text_model.layers.18.mlp.down_proj.weight', 'model.text_model.layers.18.mlp.gate_proj.weight', 'model.text_model.layers.18.mlp.up_proj.weight', 'model.text_model.layers.18.post_attention_layernorm.weight', 'model.text_model.layers.18.self_attn.k_proj.weight', 'model.text_model.layers.18.self_attn.o_proj.weight', 'model.text_model.layers.18.self_attn.q_proj.weight', 'model.text_model.layers.18.self_attn.v_proj.weight', 'model.text_model.layers.19.input_layernorm.weight', 'model.text_model.layers.19.mlp.down_proj.weight', 'model.text_model.layers.19.mlp.gate_proj.weight', 'model.text_model.layers.19.mlp.up_proj.weight', 'model.text_model.layers.19.post_attention_layernorm.weight', 'model.text_model.layers.19.self_attn.k_proj.weight', 'model.text_model.layers.19.self_attn.o_proj.weight', 'model.text_model.layers.19.self_attn.q_proj.weight', 'model.text_model.layers.19.self_attn.v_proj.weight', 'model.text_model.layers.2.input_layernorm.weight', 'model.text_model.layers.2.mlp.down_proj.weight', 'model.text_model.layers.2.mlp.gate_proj.weight', 'model.text_model.layers.2.mlp.up_proj.weight', 'model.text_model.layers.2.post_attention_layernorm.weight', 'model.text_model.layers.2.self_attn.k_proj.weight', 'model.text_model.layers.2.self_attn.o_proj.weight', 'model.text_model.layers.2.self_attn.q_proj.weight', 'model.text_model.layers.2.self_attn.v_proj.weight', 'model.text_model.layers.20.input_layernorm.weight', 'model.text_model.layers.20.mlp.down_proj.weight', 'model.text_model.layers.20.mlp.gate_proj.weight', 'model.text_model.layers.20.mlp.up_proj.weight', 'model.text_model.layers.20.post_attention_layernorm.weight', 'model.text_model.layers.20.self_attn.k_proj.weight', 'model.text_model.layers.20.self_attn.o_proj.weight', 'model.text_model.layers.20.self_attn.q_proj.weight', 'model.text_model.layers.20.self_attn.v_proj.weight', 
'model.text_model.layers.21.input_layernorm.weight', 'model.text_model.layers.21.mlp.down_proj.weight', 'model.text_model.layers.21.mlp.gate_proj.weight', 'model.text_model.layers.21.mlp.up_proj.weight', 'model.text_model.layers.21.post_attention_layernorm.weight', 'model.text_model.layers.21.self_attn.k_proj.weight', 'model.text_model.layers.21.self_attn.o_proj.weight', 'model.text_model.layers.21.self_attn.q_proj.weight', 'model.text_model.layers.21.self_attn.v_proj.weight', 'model.text_model.layers.22.input_layernorm.weight', 'model.text_model.layers.22.mlp.down_proj.weight', 'model.text_model.layers.22.mlp.gate_proj.weight', 'model.text_model.layers.22.mlp.up_proj.weight', 'model.text_model.layers.22.post_attention_layernorm.weight', 'model.text_model.layers.22.self_attn.k_proj.weight', 'model.text_model.layers.22.self_attn.o_proj.weight', 'model.text_model.layers.22.self_attn.q_proj.weight', 'model.text_model.layers.22.self_attn.v_proj.weight', 'model.text_model.layers.23.input_layernorm.weight', 'model.text_model.layers.23.mlp.down_proj.weight', 'model.text_model.layers.23.mlp.gate_proj.weight', 'model.text_model.layers.23.mlp.up_proj.weight', 'model.text_model.layers.23.post_attention_layernorm.weight', 'model.text_model.layers.23.self_attn.k_proj.weight', 'model.text_model.layers.23.self_attn.o_proj.weight', 'model.text_model.layers.23.self_attn.q_proj.weight', 'model.text_model.layers.23.self_attn.v_proj.weight', 'model.text_model.layers.24.input_layernorm.weight', 'model.text_model.layers.24.mlp.down_proj.weight', 'model.text_model.layers.24.mlp.gate_proj.weight', 'model.text_model.layers.24.mlp.up_proj.weight', 'model.text_model.layers.24.post_attention_layernorm.weight', 'model.text_model.layers.24.self_attn.k_proj.weight', 'model.text_model.layers.24.self_attn.o_proj.weight', 'model.text_model.layers.24.self_attn.q_proj.weight', 'model.text_model.layers.24.self_attn.v_proj.weight', 'model.text_model.layers.25.input_layernorm.weight', 'model.text_model.layers.25.mlp.down_proj.weight', 'model.text_model.layers.25.mlp.gate_proj.weight', 'model.text_model.layers.25.mlp.up_proj.weight', 'model.text_model.layers.25.post_attention_layernorm.weight', 'model.text_model.layers.25.self_attn.k_proj.weight', 'model.text_model.layers.25.self_attn.o_proj.weight', 'model.text_model.layers.25.self_attn.q_proj.weight', 'model.text_model.layers.25.self_attn.v_proj.weight', 'model.text_model.layers.26.input_layernorm.weight', 'model.text_model.layers.26.mlp.down_proj.weight', 'model.text_model.layers.26.mlp.gate_proj.weight', 'model.text_model.layers.26.mlp.up_proj.weight', 'model.text_model.layers.26.post_attention_layernorm.weight', 'model.text_model.layers.26.self_attn.k_proj.weight', 'model.text_model.layers.26.self_attn.o_proj.weight', 'model.text_model.layers.26.self_attn.q_proj.weight', 'model.text_model.layers.26.self_attn.v_proj.weight', 'model.text_model.layers.27.input_layernorm.weight', 'model.text_model.layers.27.mlp.down_proj.weight', 'model.text_model.layers.27.mlp.gate_proj.weight', 'model.text_model.layers.27.mlp.up_proj.weight', 'model.text_model.layers.27.post_attention_layernorm.weight', 'model.text_model.layers.27.self_attn.k_proj.weight', 'model.text_model.layers.27.self_attn.o_proj.weight', 'model.text_model.layers.27.self_attn.q_proj.weight', 'model.text_model.layers.27.self_attn.v_proj.weight', 'model.text_model.layers.28.input_layernorm.weight', 'model.text_model.layers.28.mlp.down_proj.weight', 'model.text_model.layers.28.mlp.gate_proj.weight', 
'model.text_model.layers.28.mlp.up_proj.weight', 'model.text_model.layers.28.post_attention_layernorm.weight', 'model.text_model.layers.28.self_attn.k_proj.weight', 'model.text_model.layers.28.self_attn.o_proj.weight', 'model.text_model.layers.28.self_attn.q_proj.weight', 'model.text_model.layers.28.self_attn.v_proj.weight', 'model.text_model.layers.29.input_layernorm.weight', 'model.text_model.layers.29.mlp.down_proj.weight', 'model.text_model.layers.29.mlp.gate_proj.weight', 'model.text_model.layers.29.mlp.up_proj.weight', 'model.text_model.layers.29.post_attention_layernorm.weight', 'model.text_model.layers.29.self_attn.k_proj.weight', 'model.text_model.layers.29.self_attn.o_proj.weight', 'model.text_model.layers.29.self_attn.q_proj.weight', 'model.text_model.layers.29.self_attn.v_proj.weight', 'model.text_model.layers.3.input_layernorm.weight', 'model.text_model.layers.3.mlp.down_proj.weight', 'model.text_model.layers.3.mlp.gate_proj.weight', 'model.text_model.layers.3.mlp.up_proj.weight', 'model.text_model.layers.3.post_attention_layernorm.weight', 'model.text_model.layers.3.self_attn.k_proj.weight', 'model.text_model.layers.3.self_attn.o_proj.weight', 'model.text_model.layers.3.self_attn.q_proj.weight', 'model.text_model.layers.3.self_attn.v_proj.weight', 'model.text_model.layers.4.input_layernorm.weight', 'model.text_model.layers.4.mlp.down_proj.weight', 'model.text_model.layers.4.mlp.gate_proj.weight', 'model.text_model.layers.4.mlp.up_proj.weight', 'model.text_model.layers.4.post_attention_layernorm.weight', 'model.text_model.layers.4.self_attn.k_proj.weight', 'model.text_model.layers.4.self_attn.o_proj.weight', 'model.text_model.layers.4.self_attn.q_proj.weight', 'model.text_model.layers.4.self_attn.v_proj.weight', 'model.text_model.layers.5.input_layernorm.weight', 'model.text_model.layers.5.mlp.down_proj.weight', 'model.text_model.layers.5.mlp.gate_proj.weight', 'model.text_model.layers.5.mlp.up_proj.weight', 'model.text_model.layers.5.post_attention_layernorm.weight', 'model.text_model.layers.5.self_attn.k_proj.weight', 'model.text_model.layers.5.self_attn.o_proj.weight', 'model.text_model.layers.5.self_attn.q_proj.weight', 'model.text_model.layers.5.self_attn.v_proj.weight', 'model.text_model.layers.6.input_layernorm.weight', 'model.text_model.layers.6.mlp.down_proj.weight', 'model.text_model.layers.6.mlp.gate_proj.weight', 'model.text_model.layers.6.mlp.up_proj.weight', 'model.text_model.layers.6.post_attention_layernorm.weight', 'model.text_model.layers.6.self_attn.k_proj.weight', 'model.text_model.layers.6.self_attn.o_proj.weight', 'model.text_model.layers.6.self_attn.q_proj.weight', 'model.text_model.layers.6.self_attn.v_proj.weight', 'model.text_model.layers.7.input_layernorm.weight', 'model.text_model.layers.7.mlp.down_proj.weight', 'model.text_model.layers.7.mlp.gate_proj.weight', 'model.text_model.layers.7.mlp.up_proj.weight', 'model.text_model.layers.7.post_attention_layernorm.weight', 'model.text_model.layers.7.self_attn.k_proj.weight', 'model.text_model.layers.7.self_attn.o_proj.weight', 'model.text_model.layers.7.self_attn.q_proj.weight', 'model.text_model.layers.7.self_attn.v_proj.weight', 'model.text_model.layers.8.input_layernorm.weight', 'model.text_model.layers.8.mlp.down_proj.weight', 'model.text_model.layers.8.mlp.gate_proj.weight', 'model.text_model.layers.8.mlp.up_proj.weight', 'model.text_model.layers.8.post_attention_layernorm.weight', 'model.text_model.layers.8.self_attn.k_proj.weight', 'model.text_model.layers.8.self_attn.o_proj.weight', 
'model.text_model.layers.8.self_attn.q_proj.weight', 'model.text_model.layers.8.self_attn.v_proj.weight', 'model.text_model.layers.9.input_layernorm.weight', 'model.text_model.layers.9.mlp.down_proj.weight', 'model.text_model.layers.9.mlp.gate_proj.weight', 'model.text_model.layers.9.mlp.up_proj.weight', 'model.text_model.layers.9.post_attention_layernorm.weight', 'model.text_model.layers.9.self_attn.k_proj.weight', 'model.text_model.layers.9.self_attn.o_proj.weight', 'model.text_model.layers.9.self_attn.q_proj.weight', 'model.text_model.layers.9.self_attn.v_proj.weight', 'model.text_model.norm.weight', 'model.vision_model.embeddings.patch_embedding.bias', 'model.vision_model.embeddings.patch_embedding.weight', 'model.vision_model.embeddings.position_embedding.weight', 'model.vision_model.encoder.layers.0.layer_norm1.bias', 'model.vision_model.encoder.layers.0.layer_norm1.weight', 'model.vision_model.encoder.layers.0.layer_norm2.bias', 'model.vision_model.encoder.layers.0.layer_norm2.weight', 'model.vision_model.encoder.layers.0.mlp.fc1.bias', 'model.vision_model.encoder.layers.0.mlp.fc1.weight', 'model.vision_model.encoder.layers.0.mlp.fc2.bias', 'model.vision_model.encoder.layers.0.mlp.fc2.weight', 'model.vision_model.encoder.layers.0.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.0.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.0.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.1.layer_norm1.bias', 'model.vision_model.encoder.layers.1.layer_norm1.weight', 'model.vision_model.encoder.layers.1.layer_norm2.bias', 'model.vision_model.encoder.layers.1.layer_norm2.weight', 'model.vision_model.encoder.layers.1.mlp.fc1.bias', 'model.vision_model.encoder.layers.1.mlp.fc1.weight', 'model.vision_model.encoder.layers.1.mlp.fc2.bias', 'model.vision_model.encoder.layers.1.mlp.fc2.weight', 'model.vision_model.encoder.layers.1.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.1.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.1.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.10.layer_norm1.bias', 'model.vision_model.encoder.layers.10.layer_norm1.weight', 'model.vision_model.encoder.layers.10.layer_norm2.bias', 'model.vision_model.encoder.layers.10.layer_norm2.weight', 'model.vision_model.encoder.layers.10.mlp.fc1.bias', 'model.vision_model.encoder.layers.10.mlp.fc1.weight', 'model.vision_model.encoder.layers.10.mlp.fc2.bias', 'model.vision_model.encoder.layers.10.mlp.fc2.weight', 'model.vision_model.encoder.layers.10.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.10.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.10.self_attn.v_proj.bias', 
'model.vision_model.encoder.layers.10.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.11.layer_norm1.bias', 'model.vision_model.encoder.layers.11.layer_norm1.weight', 'model.vision_model.encoder.layers.11.layer_norm2.bias', 'model.vision_model.encoder.layers.11.layer_norm2.weight', 'model.vision_model.encoder.layers.11.mlp.fc1.bias', 'model.vision_model.encoder.layers.11.mlp.fc1.weight', 'model.vision_model.encoder.layers.11.mlp.fc2.bias', 'model.vision_model.encoder.layers.11.mlp.fc2.weight', 'model.vision_model.encoder.layers.11.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.11.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.11.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.2.layer_norm1.bias', 'model.vision_model.encoder.layers.2.layer_norm1.weight', 'model.vision_model.encoder.layers.2.layer_norm2.bias', 'model.vision_model.encoder.layers.2.layer_norm2.weight', 'model.vision_model.encoder.layers.2.mlp.fc1.bias', 'model.vision_model.encoder.layers.2.mlp.fc1.weight', 'model.vision_model.encoder.layers.2.mlp.fc2.bias', 'model.vision_model.encoder.layers.2.mlp.fc2.weight', 'model.vision_model.encoder.layers.2.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.2.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.2.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.3.layer_norm1.bias', 'model.vision_model.encoder.layers.3.layer_norm1.weight', 'model.vision_model.encoder.layers.3.layer_norm2.bias', 'model.vision_model.encoder.layers.3.layer_norm2.weight', 'model.vision_model.encoder.layers.3.mlp.fc1.bias', 'model.vision_model.encoder.layers.3.mlp.fc1.weight', 'model.vision_model.encoder.layers.3.mlp.fc2.bias', 'model.vision_model.encoder.layers.3.mlp.fc2.weight', 'model.vision_model.encoder.layers.3.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.3.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.3.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.4.layer_norm1.bias', 'model.vision_model.encoder.layers.4.layer_norm1.weight', 'model.vision_model.encoder.layers.4.layer_norm2.bias', 'model.vision_model.encoder.layers.4.layer_norm2.weight', 'model.vision_model.encoder.layers.4.mlp.fc1.bias', 'model.vision_model.encoder.layers.4.mlp.fc1.weight', 'model.vision_model.encoder.layers.4.mlp.fc2.bias', 'model.vision_model.encoder.layers.4.mlp.fc2.weight', 'model.vision_model.encoder.layers.4.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.out_proj.bias', 
'model.vision_model.encoder.layers.4.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.4.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.4.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.5.layer_norm1.bias', 'model.vision_model.encoder.layers.5.layer_norm1.weight', 'model.vision_model.encoder.layers.5.layer_norm2.bias', 'model.vision_model.encoder.layers.5.layer_norm2.weight', 'model.vision_model.encoder.layers.5.mlp.fc1.bias', 'model.vision_model.encoder.layers.5.mlp.fc1.weight', 'model.vision_model.encoder.layers.5.mlp.fc2.bias', 'model.vision_model.encoder.layers.5.mlp.fc2.weight', 'model.vision_model.encoder.layers.5.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.5.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.5.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.6.layer_norm1.bias', 'model.vision_model.encoder.layers.6.layer_norm1.weight', 'model.vision_model.encoder.layers.6.layer_norm2.bias', 'model.vision_model.encoder.layers.6.layer_norm2.weight', 'model.vision_model.encoder.layers.6.mlp.fc1.bias', 'model.vision_model.encoder.layers.6.mlp.fc1.weight', 'model.vision_model.encoder.layers.6.mlp.fc2.bias', 'model.vision_model.encoder.layers.6.mlp.fc2.weight', 'model.vision_model.encoder.layers.6.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.6.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.6.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.7.layer_norm1.bias', 'model.vision_model.encoder.layers.7.layer_norm1.weight', 'model.vision_model.encoder.layers.7.layer_norm2.bias', 'model.vision_model.encoder.layers.7.layer_norm2.weight', 'model.vision_model.encoder.layers.7.mlp.fc1.bias', 'model.vision_model.encoder.layers.7.mlp.fc1.weight', 'model.vision_model.encoder.layers.7.mlp.fc2.bias', 'model.vision_model.encoder.layers.7.mlp.fc2.weight', 'model.vision_model.encoder.layers.7.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.7.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.7.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.8.layer_norm1.bias', 'model.vision_model.encoder.layers.8.layer_norm1.weight', 'model.vision_model.encoder.layers.8.layer_norm2.bias', 'model.vision_model.encoder.layers.8.layer_norm2.weight', 'model.vision_model.encoder.layers.8.mlp.fc1.bias', 'model.vision_model.encoder.layers.8.mlp.fc1.weight', 'model.vision_model.encoder.layers.8.mlp.fc2.bias', 
'model.vision_model.encoder.layers.8.mlp.fc2.weight', 'model.vision_model.encoder.layers.8.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.8.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.8.self_attn.v_proj.weight', 'model.vision_model.encoder.layers.9.layer_norm1.bias', 'model.vision_model.encoder.layers.9.layer_norm1.weight', 'model.vision_model.encoder.layers.9.layer_norm2.bias', 'model.vision_model.encoder.layers.9.layer_norm2.weight', 'model.vision_model.encoder.layers.9.mlp.fc1.bias', 'model.vision_model.encoder.layers.9.mlp.fc1.weight', 'model.vision_model.encoder.layers.9.mlp.fc2.bias', 'model.vision_model.encoder.layers.9.mlp.fc2.weight', 'model.vision_model.encoder.layers.9.self_attn.k_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.k_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.out_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.out_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.q_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.q_proj.weight', 'model.vision_model.encoder.layers.9.self_attn.v_proj.bias', 'model.vision_model.encoder.layers.9.self_attn.v_proj.weight', 'model.vision_model.post_layernorm.bias', 'model.vision_model.post_layernorm.weight']
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /home/imad-saddik/.cache/huggingface/hub/models--ibm-granite--granite-docling-258M-text-only/snapshots/982fe3b40f2fa73c365bdb1bcacf6c81b7184bfe/ and are newly initialized: ['lm_head.weight', 'model.embed_tokens.weight', 'model.layers.0.input_layernorm.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.0.post_attention_layernorm.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.input_layernorm.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.1.post_attention_layernorm.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.10.input_layernorm.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.10.post_attention_layernorm.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.11.input_layernorm.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.11.post_attention_layernorm.weight', 'model.layers.11.self_attn.k_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.12.input_layernorm.weight', 'model.layers.12.mlp.down_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.12.post_attention_layernorm.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.13.input_layernorm.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.13.post_attention_layernorm.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.14.input_layernorm.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.14.post_attention_layernorm.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.15.input_layernorm.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.15.post_attention_layernorm.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.16.input_layernorm.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.16.mlp.up_proj.weight', 
'model.layers.16.post_attention_layernorm.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.17.input_layernorm.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.17.mlp.gate_proj.weight', 'model.layers.17.mlp.up_proj.weight', 'model.layers.17.post_attention_layernorm.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.18.input_layernorm.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.18.post_attention_layernorm.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.19.input_layernorm.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.19.post_attention_layernorm.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.19.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.2.input_layernorm.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.2.post_attention_layernorm.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.20.input_layernorm.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.20.post_attention_layernorm.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.21.input_layernorm.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.21.post_attention_layernorm.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.22.input_layernorm.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.22.post_attention_layernorm.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.23.input_layernorm.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.23.post_attention_layernorm.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.24.input_layernorm.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.24.post_attention_layernorm.weight', 'model.layers.24.self_attn.k_proj.weight', 
'model.layers.24.self_attn.o_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.25.input_layernorm.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.25.post_attention_layernorm.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.26.input_layernorm.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.26.post_attention_layernorm.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.27.input_layernorm.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.27.post_attention_layernorm.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.self_attn.o_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.28.input_layernorm.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.28.post_attention_layernorm.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.29.input_layernorm.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.29.post_attention_layernorm.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.3.input_layernorm.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.3.post_attention_layernorm.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.4.input_layernorm.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.4.post_attention_layernorm.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.5.input_layernorm.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.5.post_attention_layernorm.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.6.input_layernorm.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.6.post_attention_layernorm.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 
'model.layers.7.input_layernorm.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.7.post_attention_layernorm.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.8.input_layernorm.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.8.post_attention_layernorm.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.9.input_layernorm.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.9.post_attention_layernorm.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.norm.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The module name  (originally ) is not a valid Python identifier. Please rename the original module to avoid import issues.
==> Done loading model: 1.447446s

After that I ran the generation on the GPU and it finished in 5.5 seconds.

Any progress on this issue? @gabegoodhart

Hi! I'm trying to make this work with a much smaller GPU here, so this is of great interest to me. I am only able to run the model with llama.cpp (this version is the only one that works for me: https://huggingface.co/ggml-org/granite-docling-258M-GGUF); even with Transformers on CPU it simply never finishes processing.

One data point which may or may not be interesting, I notice the same variability in the output as @ImadSaddik above, but between the plain llama-cli and llama-mtmd-cli. With llama-cli no DocTags are returned (so no OTSL tags either), only <loc_N> tags. I have to use llama-mtmd-cli to get the DocTags. Have you observed this in other situations? It seems to me that the problem stems from the fact that the initial <doctag> tag isn't being generated.

Replying to myself here: with llama-cli you need to add --special to get DocTags, which I believe is the equivalent of skip_special_tokens=False. This is mentioned here, where there is also a lot of other information which is probably pertinent to the slowness issues: https://github.com/ggml-org/llama.cpp/issues/16678

There is, I believe, some weirdness related to the bf16 precision used internally by the model, which may depend on what GPU you have. GGML handles this better on unsupported GPUs than PyTorch does.

I really don't know why <doctag> is a "special" token, but oh well...

Hi @dhdeco, thanks for sharing this. I didn't know you could add the --special flag to get the <doctag> token.

I checked my old messages and I see that I wasn't getting that tag when I used llama.cpp. It worked fine when I used the transformers library, though. I might try it again tomorrow.

You said:

I really don't know why <doctag> is a "special" token, but oh well...

Mainly, special tokens act like instructions: they tell the model "this is a command" rather than just "this is text to read." If it weren't special, the model might get confused and treat it like a normal word.

Edit: I retried, but the pipeline is still slow. It took ~5min to process a single image.

Sooo, is there any update on why it is so slow? I am running it with vLLM, serving it like this: "vllm serve ibm-granite/granite-docling-258M --revision untied"
I am using PNG input at 200 DPI, with 1654 px as the maximum dimension.
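For anyone reproducing this setup, a minimal client sketch against vLLM's OpenAI-compatible endpoint might look like the following (host/port, image path, and prompt text are illustrative placeholders, not taken from the post above):

```python
import base64
from openai import OpenAI

# Placeholder host/port for the "vllm serve" command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page.png", "rb") as f:  # placeholder 200 DPI PNG
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ibm-granite/granite-docling-258M",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    }],
    max_tokens=8192,
)
print(response.choices[0].message.content)
```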

IBM Granite org

Hi folks! Sorry, this one got lost in the holiday shuffle. I'll keep digging into it a bit. It seems like, where I left off, there was something core to the text model that was causing the slowness.

IBM Granite org

Hi all, I've got good news and bad news. The good news is that I think we're close to a root cause of the slowness. The bad news is that it's a pretty fundamental problem with the model shape when used with torch. I took Claude Code on a wild ride through performance analysis, and what we found together was the following:

For models with a small hidden_size (<600), kernel-dispatch overhead dominates compute, making GPUs extremely inefficient, especially on the mps backend. Critically, this is not unique to granite-docling: all models that are "deep and thin" will have this problem.

Here's the summary of all the analysis Claude did. I'll also put together a gist with all of the detailed scripts and reports that Claude generated.

NOTE: All of these tests were run without image inputs, which isolates the language-model portion.


granite-docling-258M (VLM, 258M params, 100k vocab)

| Rank | Framework | Hardware | Throughput (t/s) | Time/Token (ms) | vs Fastest | vs Slowest |
|------|-----------|----------|------------------|-----------------|------------|------------|
| 🥇 | MLX-VLM | M3 Max (Metal) | 390.96 | 2.6 | 1.00x | 23.78x faster |
| 🥈 | PyTorch | CUDA (GB10) | 72.15 | 13.9 | 5.42x slower | 4.39x faster |
| 🥉 | PyTorch | CPU (M3 Max) | 69.03 | 14.5 | 5.66x slower | 4.20x faster |
| 4️⃣ | PyTorch | MPS (M3 Max) | 16.44 | 60.8 | 23.78x slower | 1.00x |

SmolLM-135M (LLM, 135M params, 49k vocab)

| Rank | Framework | Hardware | Throughput (t/s) | Time/Token (ms) | vs Fastest | vs Slowest |
|------|-----------|----------|------------------|-----------------|------------|------------|
| 🥇 | MLX | M3 Max (Metal) | 300.68 | 3.3 | 1.00x | 17.97x faster |
| 🥈 | PyTorch | CUDA (GB10) | 98.82 | 10.1 | 3.04x slower | 5.91x faster |
| 🥉 | PyTorch | CPU (M3 Max) | 72.31 | 13.8 | 4.16x slower | 4.32x faster |
| 4️⃣ | PyTorch | MPS (M3 Max) | 16.73 | 59.8 | 17.97x slower | 1.00x |
IBM Granite org

A gist became impractical, so I put the intermediate files in a repo: https://github.com/gabe-l-hart/granite-docling-perf-analysis-claude

Hi @gabegoodhart, thanks for sharing your performance analysis.

I conducted my own experiments on my NVIDIA GPU and have some great news. Here are my findings.

Problem

Generation was taking ~5 minutes for a single page on my RTX 4070, despite the hardware being capable of much faster speeds.

1759 tokens were generated in ~5 minutes, which works out to a generation speed of only ~6 t/s.

Solution

Setting use_cache to True solved the issue. Here is where you need to set that:

generated_ids = granite_docling_model.generate(**inputs, max_new_tokens=8192, use_cache=True)
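For context, here is a minimal end-to-end sketch of where that flag sits in a full transformers call (variable names, prompt text, and image path below are illustrative placeholders):

```python
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ibm-granite/granite-docling-258M"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

# Build the chat-style prompt with one image placeholder plus the instruction text.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this page to docling."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[Image.open("page.png")], return_tensors="pt").to(device)

# Explicitly enable the KV cache; the nested text_config in this checkpoint ships with
# use_cache=False, which is what made default generation so slow.
generated_ids = model.generate(**inputs, max_new_tokens=8192, use_cache=True)

# Keep special tokens so the DocTags markup stays visible in the output.
doctags = processor.batch_decode(
    generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=False
)[0]
print(doctags)
```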

This change increased throughput from ~6 t/s to ~111 t/s. Generation now takes 16 s instead of ~5 min, an 18.5x speedup. I verified the output to ensure correctness, and it looks good.

Why did this work?

This specific model configuration was failing to enable KV Caching by default when model.generate() was called without explicitly stating use_cache.

  • With cache: Generation complexity is O(N) -> Fast (~111 tokens/sec)
  • Without cache: Generation complexity is O(NΒ²) -> Slow (~6 tokens/sec for long contexts)

The experiment Gabe conducted regarding this architecture being inefficient on PyTorch is correct (MLX on Mac is indeed ~3x faster than this GPU result). However, even with that inefficiency, the GPU should be delivering ~111 t/s, not ~6 t/s.

Detailed investigation

Here are the exact steps I took to debug this, using the repository that Gabe created.

Baseline benchmarking (Text only)

I wanted to find the theoretical maximum speed of the language model on this hardware. I generated 100 tokens using text-only prompts and got ~117 tokens/sec. This is good, though still 3x slower than MLX based on Gabe's results.

Vision-language benchmarking (Image)

To see if the vision encoder was the bottleneck, I used the same image from the official example and got ~96 tokens/sec. This shows that the vision encoder adds minimal overhead and total throughput remains high.

Long generation analysis

I wanted to determine if performance degraded over time. I tracked the throughput every 100 tokens during the slow 5-minute run (1700 tokens). The results showed that generation was consistently slow (6 tokens/sec) from the very first token. This confirmed it was a configuration problem, not a gradual slowdown.
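The tracking itself can be done with a lightweight streamer passed to generate(); the sketch below is a hypothetical stand-in for that kind of monitor (generate() only calls put() and end() on the streamer object, so a plain class works):

```python
import time

class SpeedMonitor:
    """Hypothetical per-token throughput logger: generate() calls put() once with the
    prompt ids, then once per generated token, and end() when finished."""

    def __init__(self, every=100):
        self.every, self.count, self.start = every, 0, None

    def put(self, value):
        if self.start is None:            # first call carries the prompt tokens; skip it
            self.start = time.perf_counter()
            return
        self.count += 1
        if self.count % self.every == 0:
            elapsed = time.perf_counter() - self.start
            print(f"{self.count} tokens, {self.count / elapsed:.1f} t/s")

    def end(self):
        pass

# usage: model.generate(**inputs, max_new_tokens=8192, use_cache=True, streamer=SpeedMonitor())
```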

Test dtypes and callbacks

I tested both dtype and torch_dtype and found that both loaded as bfloat16. I also checked if Python callbacks (SpeedMonitor) were blocking the GPU; they had zero impact on performance.

KV cache verification

Finally, I decided to verify if KV caching was actually working. I set use_cache=True and it solved the problem immediately.

If you look at the model's config at this line, you will see that use_cache is set to True. There may be an issue with the transformers library (I used v4.57.3) causing that default value to be ignored.


The problem is now solved on my end. I hope you can test this to see if it works on your system too!

I wrote this script to verify the value of use_cache:

from transformers import AutoModelForImageTextToText

model_name = "ibm-granite/granite-docling-258M"
print(f"Loading {model_name}")
model = AutoModelForImageTextToText.from_pretrained(model_name)

print(f"Top-level config.use_cache : {model.config.use_cache}")
print(f"Text-model config.use_cache: {model.model.text_model.config.use_cache}")

When I run it, I get the following output:

Loading ibm-granite/granite-docling-258M
Top-level config.use_cache : True
Text-model config.use_cache: False

This effectively confirms a "bug" in the ibm-granite/granite-docling-258M model repository on Hugging Face. The text_config is overriding the global default, silently ruining performance for anyone who calls generate() without explicitly passing use_cache=True.

Here is the content of model.model.config:

Idefics3Config {
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "bos_token_id": 100264,
  "dtype": "bfloat16",
  "eos_token_id": 100257,
  "image_token_id": 100270,
  "model_type": "idefics3",
  "pad_token_id": 100257,
  "scale_factor": 4,
  "text_config": {
    "_name_or_path": "models/granitev06_hf_ai4k_sft_data_v4",
    "architectures": [
      "LlamaForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 100264,
    "dtype": "bfloat16",
    "eos_token_id": 100257,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 576,
    "initializer_range": 0.02,
    "intermediate_size": 1536,
    "max_position_embeddings": 8192,
    "mlp_bias": false,
    "model_type": "llama",
    "num_attention_heads": 9,
    "num_hidden_layers": 30,
    "num_key_value_heads": 3,
    "pad_token_id": 100257,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 100000.0,
    "tie_word_embeddings": true,
    "use_cache": false,
    "vocab_size": 100352
  },
  "tie_word_embeddings": true,
  "transformers_version": "4.57.3",
  "use_cache": true,
  "vision_config": {
    "attention_dropout": 0.0,
    "dtype": "bfloat16",
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 768,
    "image_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-06,
    "max_image_size": {
      "longest_edge": 512
    },
    "model_type": "idefics3_vision",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "patch_size": 16,
    "size": {
      "longest_edge": 512
    }
  },
  "vocab_size": 100352
}
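Until the repository config is fixed, one possible workaround (a sketch only; depending on which config your transformers version consults during generate(), passing use_cache=True explicitly may still be the safest option) is to flip the nested flags right after loading:

```python
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("ibm-granite/granite-docling-258M")

# Flip the nested text config and the generation config so the KV cache is used even
# when generate() is called without an explicit use_cache=True.
model.model.text_model.config.use_cache = True
model.generation_config.use_cache = True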

Can't test it right now myself, but is this maybe the culprit? https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/generation_config.json

IBM Granite org

Whelp, I'll certainly feel silly if it's as simple as that! Thank you so much for digging into this. I'll try to validate on my end too.

IBM Granite org

Unfortunately, I still see the same behavior with the mps backend. Looking more closely at your numbers, I think what you found is necessary (enabling the cache for the text model), but it doesn't solve the whole problem. Your "good" result was ~100 t/s, which is about what I found on my GB10 as well. At the same time, MLX (and llama.cpp, though I don't have those numbers side by side) can achieve close to 3x that, so I think there is still a significant bottleneck in the torch tensor overhead.

I can confirm that llama.cpp is significantly faster than transformers. I achieved ~395 t/s with llama.cpp, which supports the theory that there is a bottleneck in how torch handles tensors.

IBM Granite org

On llama.cpp with my GB10 box, I see throughputs (with one concurrent user) of up to 2324 t/s:

 ./bin/llama-batched-bench -m ~/models/ibm-granite/granite-docling-258M/ggml-model-BF16.gguf -c 2048 -b 2048 -ub 512 -npp 128,256,512 -ntg 128,256 -npl 1
| PP | TG | B | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) | T (s) | S (t/s) |
|----|----|---|------|----------|------------|----------|------------|-------|---------|
| 128 | 128 | 1 | 256 | 0.216 | 592.45 | 0.265 | 483.62 | 0.481 | 532.53 |
| 128 | 256 | 1 | 384 | 0.005 | 23813.95 | 0.517 | 495.14 | 0.522 | 735.06 |
| 256 | 128 | 1 | 384 | 0.008 | 32765.90 | 0.263 | 487.40 | 0.270 | 1419.94 |
| 256 | 256 | 1 | 512 | 0.006 | 39850.56 | 0.524 | 488.62 | 0.530 | 965.39 |
| 512 | 128 | 1 | 640 | 0.011 | 45333.80 | 0.264 | 484.76 | 0.275 | 2324.40 |
| 512 | 256 | 1 | 768 | 0.009 | 55006.45 | 0.528 | 484.75 | 0.537 | 1429.07 |

This is not an apples-to-apples comparison of course because the numbers of prefill and generate are not identical, but the upper bound on throughput should be a LOT higher than 100 t/s.

I got similar results to yours with llama-batched-bench:

| PP | TG | B | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) | T (s) | S (t/s) |
|----|----|---|------|----------|------------|----------|------------|-------|---------|
| 128 | 128 | 1 | 256 | 0.098 | 1306.96 | 0.277 | 462.52 | 0.375 | 683.25 |
| 128 | 256 | 1 | 384 | 0.005 | 26402.64 | 0.532 | 481.20 | 0.537 | 715.28 |
| 256 | 128 | 1 | 384 | 0.006 | 42328.04 | 0.270 | 474.86 | 0.276 | 1393.32 |
| 256 | 256 | 1 | 512 | 0.006 | 42659.56 | 0.538 | 475.45 | 0.544 | 940.41 |
| 512 | 128 | 1 | 640 | 0.014 | 37388.64 | 0.275 | 465.21 | 0.289 | 2215.77 |
| 512 | 256 | 1 | 768 | 0.009 | 55640.08 | 0.549 | 466.03 | 0.559 | 1375.05 |

I noticed you mentioned seeing throughputs up to 2324 t/s, but it looks like you are citing the S t/s (Total speed) column rather than the S_TG (Text generation) column.

If you look at your S_TG column, your actual generation speed is steady at ~485 t/s, which matches almost exactly what I am seeing on my end (~460–480 t/s).

Compiling the text model increases the speed from 111 t/s to 231 t/s.
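For anyone who wants to try this, compiling just the language-model half might look like the following (a sketch under assumptions: the submodule-swap approach and mode="reduce-overhead" are choices illustrated here, not necessarily the exact configuration behind the numbers above):

```python
import torch
from transformers import AutoModelForImageTextToText

model_id = "ibm-granite/granite-docling-258M"
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Compile only the text model: the vision tower runs once per image, while the decode loop
# is where kernel-launch overhead dominates. torch.compile returns an nn.Module wrapper,
# so it can be swapped in for the original submodule.
model.model.text_model = torch.compile(model.model.text_model, mode="reduce-overhead")

# Note: the first generate() call triggers compilation and is slow; subsequent calls reuse
# the compiled graphs.
```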

I ran your profiler and found that the CPU is spending ~5x more time scheduling and dispatching kernels than the GPU spends executing them:

  • Total CPU time: ~1.84s
  • Total CUDA time: ~0.36s
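For reference, this split can be reproduced with torch.profiler around a generate() call (a sketch; `model` and `inputs` are assumed to be set up as in the earlier snippets):

```python
from torch.profiler import ProfilerActivity, profile

# Profile one generation pass and compare time spent on the CPU (scheduling/dispatch)
# versus time spent executing CUDA kernels.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=256, use_cache=True)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```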
