Commit cf23caf

Adding migration guide for deepdev (#7073)
* Adding migration guide for deepdev
* Update based on feedback
* Apply Tarek's suggestion

Co-authored-by: Eric StJohn <[email protected]>
1 parent 18da9fe commit cf23caf

1 file changed

+19
-0
lines changed

@@ -0,0 +1,19 @@
# Porting to Microsoft.ML.Tokenizers

This guide provides general guidance on migrating from other tokenizer libraries to the Tiktoken support in `Microsoft.ML.Tokenizers`.

## Microsoft.DeepDev.TokenizerLib

### API Guidance

| Microsoft.DeepDev.TokenizerLib | Microsoft.ML.Tokenizers |
| --- | --- |
| [TikTokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TikTokenizer.cs#L20) | [Tokenizer](https://github.com/dotnet/machinelearning/blob/acced974bea6ed484503a595d87a3e7016c8a558/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L28) |
| [ITokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/ITokenizer.cs#L7) | [Tokenizer](https://github.com/dotnet/machinelearning/blob/acced974bea6ed484503a595d87a3e7016c8a558/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L28) |
| [TokenizerBuilder](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TokenizerBuilder.cs#L14) | [Tokenizer.CreateTiktokenForModel/Async](https://github.com/dotnet/machinelearning/blob/70e191b3fae444f6625fdc001071de1e2bd1080b/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L298-L330) creates the tokenizer from a downloaded file<br> [Tokenizer.CreateTiktokenForModel(Async/Stream)](https://github.com/dotnet/machinelearning/blob/70e191b3fae444f6625fdc001071de1e2bd1080b/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L222-L296) creates the tokenizer from a user-provided file stream |

### General Guidance

- Avoid hard-coding tiktoken regexes and special tokens. Instead, use the appropriate `Tiktoken.CreateByModelNameAsync` method to create the tokenizer from either a downloaded file or a provided stream.
- Avoid a full encode when all you need is the token count or the encoded IDs. Instead, use `Tokenizer.CountTokens` to get the token count and `Tokenizer.EncodeToIds` to get the encoded IDs.
- Avoid a full encode when all you need is to truncate text to a token budget. Instead, use `Tokenizer.IndexOfCount` or `LastIndexOfCount` to find the index at which to truncate from the start or the end of a string, respectively.
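
As a rough illustration of the first two points, the sketch below creates a tokenizer by model name and then asks only for the token count and the IDs. It uses the `Tokenizer.CreateTiktokenForModel` factory named in the API table together with `CountTokens` and `EncodeToIds`. The model name `"gpt-4"`, the single-argument call shapes, and the `IReadOnlyList<int>` return type are assumptions rather than confirmed signatures, so check them against the package version you reference.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: "gpt-4" is a model name the library recognizes, and the
// factory can be called with the model name alone. No regexes or special
// tokens are hard-coded here; the library supplies them for the model.
Tokenizer tokenizer = Tokenizer.CreateTiktokenForModel("gpt-4");

string text = "Hello, world!";

// Only the count is needed, so ask for the count directly.
int tokenCount = tokenizer.CountTokens(text);

// Only the IDs are needed, so skip the full encoding result.
// Assumption: the result can be consumed as a read-only list of ints.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

Console.WriteLine($"Token count: {tokenCount}");
Console.WriteLine($"Token IDs: {string.Join(", ", ids)}");
```

The truncation helpers from the last point are left out on purpose, because their parameter lists are not shown in this guide.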
