|[TokenizerBuilder](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TokenizerBuilder.cs#L14)|[Tokenizer.CreateTiktokenForModel/Async](https://github.com/dotnet/machinelearning/blob/70e191b3fae444f6625fdc001071de1e2bd1080b/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L298-L330) downloads<br> [Tokenizer.CreateTiktokenForModel(Async/Stream)](https://github.com/dotnet/machinelearning/blob/70e191b3fae444f6625fdc001071de1e2bd1080b/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L222-L296) user provided file stream |
### General Guidance
- Avoid hard-coding tiktoken regexes and special tokens. Instead, use the appropriate `Tiktoken.CreateByModelNameAsync` method to create the tokenizer from either a downloaded file or a provided stream.
- Avoid a full encode if all you need is the token count or the encoded IDs. Instead, use `Tokenizer.CountTokens` to get the token count and `Tokenizer.EncodeToIds` to get the encoded IDs.
- Avoid a full encode if all you need is to truncate text to a token budget. Instead, use `Tokenizer.IndexOfCount` or `Tokenizer.LastIndexOfCount` to find the index at which to truncate from the start or the end of a string, respectively.
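
To make the last two points concrete, here is a minimal, self-contained sketch of what "count without encoding" and "find the truncation index" mean. It deliberately does not call Microsoft.ML.Tokenizers: `ToyTokenizer` is a hypothetical stand-in that treats each whitespace-separated word as one token, and its method signatures are assumptions chosen for illustration, not the library's actual signatures. The real methods named above may take additional parameters.

```csharp
using System;

string text = "the quick brown fox jumps";

// Token count obtained by scanning, with no token list materialized.
Console.WriteLine(ToyTokenizer.CountTokens(text));   // prints 5

// Index at which to cut so that the prefix fits a budget of 2 tokens.
int cut = ToyTokenizer.IndexOfCount(text, 2);
Console.WriteLine(text.Substring(0, cut));           // prints "the quick"

// Hypothetical toy tokenizer: one token per whitespace-separated word.
// It only illustrates the shape of the operations, not the real API.
static class ToyTokenizer
{
    // Counts tokens in a single pass without allocating any tokens.
    public static int CountTokens(string text)
    {
        int count = 0;
        bool inToken = false;
        foreach (char c in text)
        {
            if (char.IsWhiteSpace(c)) inToken = false;
            else if (!inToken) { inToken = true; count++; }
        }
        return count;
    }

    // Returns an index such that text.Substring(0, index) contains
    // at most maxTokenCount tokens.
    public static int IndexOfCount(string text, int maxTokenCount)
    {
        int count = 0;
        int end = 0;            // index just past the last complete token
        bool inToken = false;
        for (int i = 0; i < text.Length; i++)
        {
            if (char.IsWhiteSpace(text[i]))
            {
                if (inToken) { inToken = false; end = i; }
            }
            else if (!inToken)
            {
                // A new token starts here; stop if the budget is already used.
                if (count == maxTokenCount) return end;
                inToken = true;
                count++;
            }
        }
        return text.Length;     // the whole text fits within the budget
    }
}
```

The point of the design is that neither operation builds the full list of tokens: the caller gets an integer back and slices the original string itself.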
0 commit comments