references
sub-word tokenization
- A hybrid between word-level and character-level tokenization
- Typically a smaller vocabulary than word-level tokenization, so less memory and computation (smaller embedding and output layers)
- Better at learning context-independent representations than character-level tokenization, since a sub-word unit carries more meaning on its own than a single character
- Most LLMs nowadays use sub-word tokenizers
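
To make the comparison above concrete, here is a minimal sketch that splits the same text three ways. The vocabulary `TOY_VOCAB` and the function names are made up for illustration, and the greedy longest-match rule is only one simple way to segment into sub-words; real tokenizers learn their vocabulary from data and may segment differently.

```python
# Toy comparison of word-level, character-level, and sub-word tokenization.
# TOY_VOCAB is a hypothetical, hand-picked vocabulary (an assumption for this
# sketch); real sub-word vocabularies are learned from a corpus.
TOY_VOCAB = {"un", "believ", "able", "token", "ization"}


def word_tokenize(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokenize(text):
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match-first segmentation of each word."""
    tokens = []
    for word in text.split():
        start = 0
        while start < len(word):
            # Try the longest remaining substring first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    tokens.append(piece)
                    start = end
                    break
            else:
                # Nothing in the vocabulary matches: fall back to one character.
                tokens.append(word[start])
                start += 1
    return tokens


text = "unbelievable tokenization"
word_tokens = word_tokenize(text)
char_tokens = char_tokenize(text)
subword_tokens = subword_tokenize(text)

print(len(word_tokens), word_tokens)
print(len(char_tokens), char_tokens)
print(len(subword_tokens), subword_tokens)
```

The sub-word split sits between the other two: more tokens than the word-level split, far fewer than the character-level one, and pieces such as `un` or `ization` can be reused across many words instead of each word needing its own vocabulary entry.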