
Mini Granite speed benchmark #98


Open
owtaylor opened this issue Apr 7, 2025 · 1 comment


owtaylor commented Apr 7, 2025

We often end up in situations where it would be useful to have a simple benchmark of how fast inference is with our models on different hardware. Questions that we might want to answer:

  • How fast is an M3 Pro vs. an NVIDIA RTX 3060 vs. ...?
  • How fast is inference with granite3.2:8b vs. granite3.2:2b?
  • How much automatic context can we provide if we want to add only 3 seconds of latency to an initial question on an M3 Pro?

We might also want to validate our mental models of performance.

  • Is the first equation in Context stability - taking advantage of prefix caching #96 right? Does the time to first token grow as the square of the input size? Is the output tokens/second inversely proportional to the input size?
  • Is processing time strictly a function of input and output tokens, or does the content processed matter?
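Written out, those hypotheses are roughly (a sketch of the expected scaling to check against measurements, not the exact equation from #96):

$$t_\text{first} \approx a + b\,n_\text{in}^2, \qquad r_\text{out} \approx \frac{c}{n_\text{in}}$$

where $n_\text{in}$ is the number of input (prompt) tokens, $t_\text{first}$ is the time to first token, $r_\text{out}$ is the output tokens/second, and $a$, $b$, $c$ are constants to fit from the benchmark data.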

My vision here is a small Python program using the Ollama API that anybody can download and run:

granite-speedbench granite3.2:2b

This would output the time to first token and output tokens/second for input context sizes of, say, 0, 1000, 2000 and 4000 tokens. For each model, it probably should do something like (see the sketch after this list):

  • Repeat 3x
    • Stop the model
    • Do a dummy "generate" request to warm up the model.
    • Do the real generate
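
A minimal sketch of that loop, assuming the ollama Python package (pip install ollama); the model tag, context sizes, and the make_prompt() helper are placeholders:

```python
import ollama

MODEL = "granite3.2:2b"            # placeholder model tag
CONTEXT_SIZES = [0, 1000, 2000, 4000]
REPEATS = 3

def run_once(model: str, prompt: str) -> dict:
    """One non-streaming generate; Ollama reports durations in nanoseconds."""
    resp = ollama.generate(model=model, prompt=prompt)
    return {
        "prompt_tokens": resp["prompt_eval_count"],
        "output_tokens": resp["eval_count"],
        # time spent evaluating the prompt ~ time to first token
        "ttft_s": resp["prompt_eval_duration"] / 1e9,
        "out_tok_per_s": resp["eval_count"] / (resp["eval_duration"] / 1e9),
    }

for n_ctx in CONTEXT_SIZES:
    prompt = make_prompt(n_ctx)  # hypothetical helper: builds an n_ctx-token prompt
    for _ in range(REPEATS):
        # Stop the model: an empty prompt with keep_alive=0 asks Ollama to unload it.
        ollama.generate(model=MODEL, prompt="", keep_alive=0)
        # Dummy request to warm the model back up, so load time doesn't skew the numbers.
        ollama.generate(model=MODEL, prompt="Hello")
        # The real, timed generate.
        print(n_ctx, run_once(MODEL, prompt))
```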

For content, try a few different approaches (sketched below):

  • Random data
  • Consistent actual code - say, pull a large code file from GitHub at a particular commit hash and take its tokens.
  • A meaningful prompt constructed from that same code, using the Granite templates.

Do these all give the same timings?
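
Something like the following could produce the three kinds of content (the word generator, the repo/sha/path arguments, and the Granite-template wrapper are placeholders; a real version should use the actual Granite chat template):

```python
import random
import string
import urllib.request

def random_words(n: int) -> str:
    """Approach 1: random data -- nonsense "words" of lowercase ASCII letters."""
    return " ".join(
        "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 10)))
        for _ in range(n)
    )

def code_from_github(repo: str, sha: str, path: str) -> str:
    """Approach 2: a real code file pinned to a commit hash, so every run sees the same bytes."""
    url = f"https://raw.githubusercontent.com/{repo}/{sha}/{path}"
    with urllib.request.urlopen(url) as f:
        return f.read().decode("utf-8")

def meaningful_prompt(code: str, question: str) -> str:
    """Approach 3: a meaningful prompt built from the same code (stub -- swap in the Granite templates)."""
    return f"Here is some code:\n\n{code}\n\n{question}"
```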

To get the 0, 1000, 2000 and 4000-token inputs, we'll need token-counting code that matches Granite's tokenization. The token counts should be validated against the counts Ollama reports in its response after processing the prompt.
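
One way to do this, assuming the Granite tokenizer can be loaded with Hugging Face transformers (the repository name below is an assumption and should match whatever Granite build Ollama is serving):

```python
import ollama
from transformers import AutoTokenizer

# Assumed tokenizer repo; must match the model Ollama serves.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-2b-instruct")

def truncate_to_tokens(text: str, n_tokens: int) -> str:
    """Trim text to at most n_tokens tokens of the Granite tokenizer."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(ids[:n_tokens])

def check_token_count(model: str, prompt: str, expected: int, tolerance: int = 16) -> None:
    """Validate our count against the prompt_eval_count Ollama reports."""
    resp = ollama.generate(model=model, prompt=prompt)
    actual = resp["prompt_eval_count"]
    if abs(actual - expected) > tolerance:
        raise RuntimeError(f"Token count mismatch: expected ~{expected}, Ollama saw {actual}")
```

Ollama applies the model's prompt template before evaluation, so the reported count will include some extra template and special tokens; hence the tolerance.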


Jazzcort commented Apr 8, 2025

Working on it!
