• Used in BERT, DistilBERT, and Electra
  • Very similar to BPE, but instead of merging the most frequent symbol pair, it merges the pair that most increases the likelihood of the training data
  • Choose the pair $(a, b)$ such that $\frac{P(ab)}{P(a)\,P(b)}$ is the greatest among all pairs.
    • Intuitively, it evaluates what it loses by merging 2 symbols to ensure it’s worth it.
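
The selection rule above can be sketched in a few lines. This is a minimal illustration, not the tokenizer code used by BERT: it assumes the commonly described WordPiece score, count(ab) / (count(a) × count(b)), and uses raw counts in place of probabilities. The function names and the toy corpus are hypothetical, and the `##` continuation prefix is left out for brevity.

```python
from collections import Counter


def count_stats(words):
    """Count symbol and adjacent-pair frequencies.

    `words` maps a tuple of symbols to how often that word occurs.
    """
    symbol_freq = Counter()
    pair_freq = Counter()
    for symbols, count in words.items():
        for s in symbols:
            symbol_freq[s] += count
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += count
    return symbol_freq, pair_freq


def best_bpe_pair(words):
    """BPE-style choice: the most frequent adjacent pair."""
    _, pair_freq = count_stats(words)
    return max(pair_freq, key=pair_freq.get)


def best_wordpiece_pair(words):
    """WordPiece-style choice (assumed score): pair count divided by
    the product of the counts of its two parts."""
    symbol_freq, pair_freq = count_stats(words)

    def score(pair):
        a, b = pair
        return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

    return max(pair_freq, key=score)


# Toy corpus (hypothetical): word split into characters -> count
corpus = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

print(best_bpe_pair(corpus))        # ('u', 'g'): the most frequent pair
print(best_wordpiece_pair(corpus))  # ('g', 's'): rare parts that almost always co-occur
```

On this corpus the two rules disagree: `u` is so common on its own that merging `u` with `g` gains little, while `s` only ever appears after `g`, so that merge scores highest.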