A base vocabulary that includes every possible character is large, because it has to cover all Unicode characters. GPT-2 instead uses bytes as the base vocabulary, which fixes its size at 256.
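As a rough illustration of why bytes keep the base vocabulary small, the sketch below encodes text as UTF-8 and treats each byte value as a base token id. The names `base_vocab`, `to_byte_ids`, and `from_byte_ids` are hypothetical helpers made up for this example; this is not GPT-2's actual tokenizer code, which also applies learned merges on top of the byte-level base.

```python
# Minimal sketch, assuming UTF-8 as the text encoding.
# Base vocabulary: one entry per possible byte value (0..255).
base_vocab = [bytes([i]) for i in range(256)]


def to_byte_ids(text: str) -> list[int]:
    """Map text to base-token ids by taking its UTF-8 byte values."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids: list[int]) -> str:
    """Invert to_byte_ids by decoding the byte values as UTF-8."""
    return bytes(ids).decode("utf-8")


sample = "héllo 🙂"
ids = to_byte_ids(sample)

print(len(base_vocab))   # size of the base vocabulary
print(ids)               # every id falls in range(256)
print(from_byte_ids(ids) == sample)
```

Any Unicode string can be represented this way without an unknown token; the cost is that non-ASCII characters take several base tokens each.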