Background
How it works
- Initializes the base vocab to a large set of symbols
  - this could be all the pre-tokenized words and the most common substrings
- Defines a loss (often the negative log-likelihood) over the training data, given the current vocab and a unigram language model
- For each symbol in the vocab, computes how much the loss would increase if that symbol were removed
- Removes the 10-20% of symbols whose removal increases the loss the least (a toy sketch of this loop follows the list)
- Repeats until the vocab reaches the desired size.
- Always keeps the base characters so that any word can be tokenized.
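A minimal toy sketch of this training loop, assuming normalized substring counts stand in for the EM-estimated unigram probabilities and the loss is evaluated with a best-segmentation dynamic program; all function names, the tiny corpus, and the hyperparameters (`max_len`, `prune_frac`) are made up for illustration, and real implementations are far more efficient:

```python
import math
from collections import Counter

def initial_vocab(words, max_len=8):
    """Seed vocab: every substring (up to max_len chars) of the pre-tokenized
    words, weighted by how often it occurs in the corpus."""
    counts = Counter()
    for word, freq in words.items():
        for i in range(len(word)):
            for j in range(i + 1, min(len(word), i + max_len) + 1):
                counts[word[i:j]] += freq
    return counts

def viterbi_logprob(word, logp):
    """Log-prob of the best segmentation of `word` under the unigram LM `logp`
    (dynamic programming over end positions)."""
    best = [0.0] + [-math.inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp:
                best[end] = max(best[end], best[start] + logp[piece])
    return best[-1]

def corpus_loss(words, logp):
    """Negative log-likelihood of the whole corpus under the current vocab."""
    return -sum(freq * viterbi_logprob(w, logp) for w, freq in words.items())

def train_unigram(words, target_size, prune_frac=0.2):
    counts = initial_vocab(words)
    chars = {c for w in words for c in w}   # base characters are never pruned
    while len(counts) > target_size:
        total = sum(counts.values())
        logp = {s: math.log(c / total) for s, c in counts.items()}
        base = corpus_loss(words, logp)
        # For every non-character symbol, measure how much the loss grows
        # when that symbol alone is dropped from the vocab.
        deltas = sorted(
            (corpus_loss(words, {k: v for k, v in logp.items() if k != s}) - base, s)
            for s in counts if s not in chars
        )
        if not deltas:          # only base characters left; cannot prune further
            break
        # Prune the fraction of symbols that hurt the loss the least,
        # without overshooting the target size.
        n_drop = min(max(1, int(prune_frac * len(deltas))),
                     len(counts) - target_size)
        for _, s in deltas[:n_drop]:
            del counts[s]
    return counts

# Tiny made-up corpus: word -> frequency.
corpus = {"hug": 10, "pug": 5, "hugs": 5, "pun": 12}
print(sorted(train_unigram(corpus, target_size=12)))
```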
Notes
- Since the algorithm is not based on merge rules, a trained model can tokenize a given text in several different ways
  - by default it picks the most likely tokenization
  - but it can also sample a tokenization according to its probability (see the example after this list)
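For example, with the SentencePiece library a trained unigram model can either return the single most likely segmentation or sample one per call; the file names, the example word, and the hyperparameter values below are placeholders:

```python
import sentencepiece as spm

# Assumes a unigram model was trained beforehand, e.g. with something like:
#   spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="uni",
#                                  vocab_size=8000, model_type="unigram")
sp = spm.SentencePieceProcessor(model_file="uni.model")

# Deterministic: the single most likely segmentation.
print(sp.encode("unbelievable", out_type=str))

# Stochastic: each call samples a segmentation according to its probability
# (subword regularization); alpha controls how peaked the sampling is.
for _ in range(3):
    print(sp.encode("unbelievable", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```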