for ad-hoc cases where the tokens are very different from natural language text, e.g. code
for languages that are not well represented in existing models (i.e. non-English), where the tokens and the vocab could be very different
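As a rough illustration of how one might train such a domain-specific tokenizer, here is a minimal sketch using the Hugging Face tokenizers library; the corpus path corpus.txt, the output filename, and the 32k vocab size are placeholders, not recommendations:

```python
# Minimal sketch: train a byte-level BPE tokenizer on domain-specific text
# (e.g. code or non-English data). Paths and vocab size are placeholders.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                       # placeholder; see the vocab-size notes below
    special_tokens=["<|endoftext|>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # your domain corpus
tokenizer.save("my_tokenizer.json")
```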
Cons
if your data doesn't represent what people will actually use the model for, the tokenizer may be less efficient and less effective at processing that kind of data
Advice
compare against a good general-purpose tokenizer (e.g. GPT-NeoX) to make sure the custom one actually works better (see the comparison sketch after this list)
not every tokenizer created at Mosaic turned out better than the generic one (source)
the size of the vocab has not been definitively studied and there's no good heuristic for choosing it (source)
vocab size will impact the efficiency of the model
a bigger vocab makes the model itself less efficient (larger embedding and output layers), but it is more token-efficient: a given piece of text fits into fewer tokens (see the trade-off sketch at the end of this section)
OpenAI tends to go with a very big vocab
vocab sizes vary by domain quite a bit
a standard ~50k tends to be popular
but in production, vocabs from as small as 25k to over 100k have been used
none of them made the model much better or much less efficient
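A quick way to sanity-check a custom tokenizer against a general-purpose one is to count how many tokens each needs on held-out text from your domain. This is a sketch only: it assumes the custom tokenizer was saved as my_tokenizer.json (as in the training sketch above) and uses EleutherAI's GPT-NeoX tokenizer as the baseline; the sample strings are placeholders.

```python
# Sketch: compare token counts of a custom tokenizer vs. GPT-NeoX on sample text.
# "my_tokenizer.json" and the sample strings are placeholders.
from tokenizers import Tokenizer
from transformers import AutoTokenizer

custom = Tokenizer.from_file("my_tokenizer.json")
baseline = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")

samples = [
    "def tokenize(text): return text.split()",          # replace with held-out domain text
    "SELECT user_id, COUNT(*) FROM events GROUP BY 1;",
]

for text in samples:
    n_custom = len(custom.encode(text).ids)
    n_baseline = len(baseline(text)["input_ids"])
    print(f"custom={n_custom:3d}  gpt-neox={n_baseline:3d}  | {text[:40]}")
```

Consistently fewer tokens from the custom tokenizer is a sign it covers the domain better; if the generic tokenizer wins, keep the generic one.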
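To make the vocab-size trade-off concrete, the sketch below estimates the extra embedding/unembedding parameters a larger vocab costs versus the token savings it buys. It assumes tokenizers were already trained at each size and saved under hypothetical filenames, and the hidden size d_model = 4096 is just an assumed number for the parameter estimate.

```python
# Sketch of the vocab-size trade-off: bigger vocab -> more embedding/unembedding
# parameters, but fewer tokens per document. Assumes tokenizers were already
# trained at each size (e.g. with the training sketch above) and saved as
# "tok_25k.json", "tok_50k.json", "tok_100k.json" (hypothetical filenames).
from tokenizers import Tokenizer

D_MODEL = 4096                             # assumed hidden size, for the estimate only
corpus = open("heldout.txt").read()        # placeholder held-out text

for vocab_size, path in [(25_000, "tok_25k.json"),
                         (50_000, "tok_50k.json"),
                         (100_000, "tok_100k.json")]:
    tok = Tokenizer.from_file(path)
    n_tokens = len(tok.encode(corpus).ids)
    embed_params = 2 * vocab_size * D_MODEL    # input embedding + output unembedding
    print(f"vocab={vocab_size:>7,}  tokens on held-out text={n_tokens:,}  "
          f"embedding params≈{embed_params / 1e6:.0f}M")
```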