We often end up in situations where it would be useful to have a simple benchmark of how fast inference is with our models on different hardware. Questions we might want to answer:
How fast is an M3 Pro vs. an NVIDIA RTX 3060 vs. ...?
How fast is inference with granite3.2:8b vs. granite3.2:2b?
How much automatic context can we provide if we want to add only 3 seconds of latency to an initial question on an M3 Pro?
We might also want to validate our mental models of performance.
Is processing time strictly a function of input and output tokens, or does the content processed matter?
My vision here is a small Python program using the Ollama API that anybody can download and run:
granite-speedbench granite:3.2b
This would output the time-to-first-token and output tokens/second for input context sizes of, say, 0, 1000, 2000, and 4000 tokens. For each model, it probably should do something like the following (sketched in code below the list):
Repeat 3x
Stop the model
Do a dummy "generate" request to warm up the model.
Do the real generate request.
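A minimal sketch of that per-model loop, assuming the `ollama` Python package and the timing fields an Ollama generate call returns (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`; durations are nanoseconds). Helper names here are illustrative, not an existing API:

```python
import ollama

NS = 1e9  # Ollama reports durations in nanoseconds


def unload(model: str) -> None:
    # An empty-prompt generate with keep_alive=0 asks the server to unload
    # the model, i.e. "stop the model" before the next trial.
    ollama.generate(model=model, prompt="", keep_alive=0)


def warm_up(model: str) -> None:
    # Dummy request so model-load time doesn't pollute the real measurement.
    ollama.generate(model=model, prompt="Hi", options={"num_predict": 1})


def run_one_trial(model: str, prompt: str, max_tokens: int = 128) -> dict:
    resp = ollama.generate(
        model=model,
        prompt=prompt,
        options={"num_predict": max_tokens, "temperature": 0},
    )
    # Approximate time-to-first-token as prompt-processing time; output
    # speed is generated tokens over generation time.
    return {
        "prompt_tokens": resp["prompt_eval_count"],
        "ttft_s": resp["prompt_eval_duration"] / NS,
        "out_tok_per_s": resp["eval_count"] / (resp["eval_duration"] / NS),
    }


def bench(model: str, prompt: str, repeats: int = 3) -> list[dict]:
    results = []
    for _ in range(repeats):
        unload(model)   # stop the model
        warm_up(model)  # reload it so load time isn't measured
        results.append(run_one_trial(model, prompt))
    return results
```

Treating `prompt_eval_duration` as time-to-first-token is an approximation; a streaming request timed with a wall clock until the first chunk arrives would be a more direct measurement.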
For content, try a couple of different approaches (see the sketch after this list):
Random data
Consistent actual code: say, pull a large code file from GitHub at a particular commit hash and then take tokens from it.
A meaningful prompt constructed from the consistent actual code and using the Granite templates
Do these all give the same timings?
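A rough sketch of the first two content sources; the GitHub URL is a placeholder for whatever large file gets pinned, and the seeded random-word generator is just one way to make "random" data reproducible:

```python
import random
import urllib.request

# Placeholder: any large file pinned to a specific commit hash will do.
PINNED_FILE_URL = (
    "https://raw.githubusercontent.com/<org>/<repo>/<commit-sha>/path/to/large_file.py"
)


def fetch_pinned_code() -> str:
    # Same bytes every run, so timings are comparable across machines.
    with urllib.request.urlopen(PINNED_FILE_URL) as resp:
        return resp.read().decode("utf-8")


def random_text(n_words: int, seed: int = 0) -> str:
    # Seeded pseudo-random text, so the "random data" case is still repeatable.
    rng = random.Random(seed)
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
    return " ".join(rng.choice(vocab) for _ in range(n_words))
```

The third variant (a meaningful prompt built from the pinned code) could reuse `fetch_pinned_code()` and wrap the result in the Granite templates before tokenizing.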
To get the 0, 1000, 2000, and 4000 tokens, we need some token-counting code that matches Granite's tokenization. The token counts should be validated against the counts Ollama reports in its response after processing the prompt.
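One way to do the counting is to load a matching Granite tokenizer from Hugging Face via `transformers`; the model id below is an assumption and should be swapped for whichever Granite checkpoint corresponds to the Ollama model being benchmarked. Validation compares our count against `prompt_eval_count` in the Ollama response:

```python
from transformers import AutoTokenizer

# Assumed Hugging Face id; pick the checkpoint matching the Ollama model.
TOKENIZER_ID = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)


def take_n_tokens(text: str, n_tokens: int) -> str:
    # Truncate the source text to roughly n_tokens under Granite's tokenizer.
    ids = tokenizer.encode(text, add_special_tokens=False)[:n_tokens]
    return tokenizer.decode(ids)


def check_token_count(expected: int, prompt_eval_count: int, tolerance: int = 16) -> None:
    # Ollama reports how many prompt tokens it actually processed; it can
    # differ a little from our count because of template/special tokens and
    # because decode/re-encode doesn't always round-trip exactly.
    if abs(prompt_eval_count - expected) > tolerance:
        raise RuntimeError(
            f"Token-count mismatch: expected ~{expected}, Ollama saw {prompt_eval_count}"
        )
```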