- Used in BERT, DistilBERT, and ELECTRA
- Very similar to BPE, but instead of merging the most frequent symbol pair, it merges the pair that maximizes the likelihood of the training data once added to the vocabulary
- Equivalently, choose the pair $(a, b)$ for which $\frac{P(ab)}{P(a)\,P(b)}$ is the greatest among all pairs.
- Intuitively, it evaluates what it loses by merging two symbols to ensure the merge is worth it.
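
The selection rule above can be illustrated with a minimal sketch. It assumes the usual formulation of the score as the pair's frequency divided by the product of its two symbols' frequencies. Raw counts stand in for probabilities, which only rescales every score by the same constant and so does not change which pair wins. The function names and the toy corpus are hypothetical, and words are split into plain characters with no continuation prefix, so this is a simplification rather than any library's actual trainer.

```python
from collections import Counter


def count_symbols_and_pairs(word_freqs):
    """Count single symbols and adjacent symbol pairs, weighted by word frequency."""
    symbol_freqs = Counter()
    pair_freqs = Counter()
    for symbols, freq in word_freqs.items():
        for symbol in symbols:
            symbol_freqs[symbol] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freqs[(a, b)] += freq
    return symbol_freqs, pair_freqs


def most_frequent_pair(word_freqs):
    """BPE-style choice: the adjacent pair with the highest raw count."""
    _, pair_freqs = count_symbols_and_pairs(word_freqs)
    return max(pair_freqs, key=pair_freqs.get)


def best_likelihood_pair(word_freqs):
    """Likelihood-style choice: maximize count(ab) / (count(a) * count(b))."""
    symbol_freqs, pair_freqs = count_symbols_and_pairs(word_freqs)

    def score(pair):
        a, b = pair
        return pair_freqs[pair] / (symbol_freqs[a] * symbol_freqs[b])

    return max(pair_freqs, key=score)


# Hypothetical toy corpus: each word is a tuple of symbols mapped to its count.
corpus = {
    tuple("hug"): 10,
    tuple("pug"): 5,
    tuple("pun"): 12,
    tuple("bun"): 4,
    tuple("hugs"): 5,
}

print(most_frequent_pair(corpus))    # ('u', 'g')
print(best_likelihood_pair(corpus))  # ('g', 's')
```

In this toy corpus the two rules disagree: `('u', 'g')` is the most frequent pair, but `u` and `g` are each common on their own, so merging them gains little. `('g', 's')` is rare, yet `s` only ever appears after `g`, so its score is the highest.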