We often end up in situations where it would be useful to have a simple benchmark of how fast inference is with our models on different hardware. Questions we might want to answer:
How fast is an M3 Pro vs. an NVIDIA RTX 3060 vs. ...?
How fast is inference with granite3.2:8b vs. granite3.2:2b?
How much automatic context can we provide if we want to add only 3 seconds of latency to an initial question on an M3 Pro?
We might also want to validate our mental models of performance.
Is processing time strictly a function of input and output tokens, or does the content processed matter?
My vision here is a small Python program using the Ollama API that anybody can download and run:
granite-speedbench granite:3.2b
This would output the time-to-first-token and output tokens/second for input context sizes of, say, 0, 1000, 2000, and 4000 tokens. For each model, it probably should do something like the following (sketched in code below the list):
Repeat 3x
Stop the model
Do a dummy "generate" request to warm up the model.
Do the real generate request.
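A minimal sketch of that per-model loop, assuming the `ollama` Python package and the timing fields an Ollama generate call returns (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`; durations are nanoseconds). Helper names here are illustrative, not an existing API:

```python
import ollama

NS = 1e9  # Ollama reports durations in nanoseconds


def unload(model: str) -> None:
    # An empty-prompt generate with keep_alive=0 asks the server to unload
    # the model, i.e. "stop the model" before the next trial.
    ollama.generate(model=model, prompt="", keep_alive=0)


def warm_up(model: str) -> None:
    # Dummy request so model-load time doesn't pollute the real measurement.
    ollama.generate(model=model, prompt="Hi", options={"num_predict": 1})


def run_one_trial(model: str, prompt: str, max_tokens: int = 128) -> dict:
    resp = ollama.generate(
        model=model,
        prompt=prompt,
        options={"num_predict": max_tokens, "temperature": 0},
    )
    # Approximate time-to-first-token as prompt-processing time; output
    # speed is generated tokens over generation time.
    return {
        "prompt_tokens": resp["prompt_eval_count"],
        "ttft_s": resp["prompt_eval_duration"] / NS,
        "out_tok_per_s": resp["eval_count"] / (resp["eval_duration"] / NS),
    }


def bench(model: str, prompt: str, repeats: int = 3) -> list[dict]:
    results = []
    for _ in range(repeats):
        unload(model)   # stop the model
        warm_up(model)  # reload it so load time isn't measured
        results.append(run_one_trial(model, prompt))
    return results
```

Treating `prompt_eval_duration` as time-to-first-token is an approximation; a streaming request timed with a wall clock until the first chunk arrives would be a more direct measurement.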
For content, try a couple of different approaches (see the sketch after this list):
Random data
Consistent actual code: say, pull a large code file from GitHub at a particular commit hash and then take tokens from it.
A meaningful prompt constructed from the consistent actual code and using the Granite templates
Do these all give the same timings?
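A rough sketch of the first two content sources; the GitHub URL is a placeholder for whatever large file gets pinned, and the seeded random-word generator is just one way to make "random" data reproducible:

```python
import random
import urllib.request

# Placeholder: any large file pinned to a specific commit hash will do.
PINNED_FILE_URL = (
    "https://raw.githubusercontent.com/<org>/<repo>/<commit-sha>/path/to/large_file.py"
)


def fetch_pinned_code() -> str:
    # Same bytes every run, so timings are comparable across machines.
    with urllib.request.urlopen(PINNED_FILE_URL) as resp:
        return resp.read().decode("utf-8")


def random_text(n_words: int, seed: int = 0) -> str:
    # Seeded pseudo-random text, so the "random data" case is still repeatable.
    rng = random.Random(seed)
    vocab = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
    return " ".join(rng.choice(vocab) for _ in range(n_words))
```

The third variant (a meaningful prompt built from the pinned code) could reuse `fetch_pinned_code()` and wrap the result in the Granite templates before tokenizing.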
To get the 0, 1000, 2000, and 4000 tokens, we need some token-counting code that matches Granite's tokenization. The token counts should be validated against the counts Ollama reports in its response after processing the prompt.
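One way to do the counting is to load a matching Granite tokenizer from Hugging Face via `transformers`; the model id below is an assumption and should be swapped for whichever Granite checkpoint corresponds to the Ollama model being benchmarked. Validation compares our count against `prompt_eval_count` in the Ollama response:

```python
from transformers import AutoTokenizer

# Assumed Hugging Face id; pick the checkpoint matching the Ollama model.
TOKENIZER_ID = "ibm-granite/granite-3.2-8b-instruct"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)


def take_n_tokens(text: str, n_tokens: int) -> str:
    # Truncate the source text to roughly n_tokens under Granite's tokenizer.
    ids = tokenizer.encode(text, add_special_tokens=False)[:n_tokens]
    return tokenizer.decode(ids)


def check_token_count(expected: int, prompt_eval_count: int, tolerance: int = 16) -> None:
    # Ollama reports how many prompt tokens it actually processed; it can
    # differ a little from our count because of template/special tokens and
    # because decode/re-encode doesn't always round-trip exactly.
    if abs(prompt_eval_count - expected) > tolerance:
        raise RuntimeError(
            f"Token-count mismatch: expected ~{expected}, Ollama saw {prompt_eval_count}"
        )
```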