- just adds a small bias to the attention scores, based on how far apart two tokens are
- good at extrapolating
- automatically lets the model attend over longer sequences than it was trained on
- used at Mosaic, because:
  - it's faster
  - it doesn't require changing anything else about the network; you just plug a small bias into the attention
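
To make the "plug a bias into the attention" point concrete, here is a minimal NumPy sketch. The notes don't give the exact form of the bias, so the linear penalty (`-slope * distance`), the `slope` value, the causal mask, and the function names `distance_bias` and `attention_with_bias` are all assumptions chosen for illustration, not the actual implementation.

```python
import numpy as np

def distance_bias(seq_len, slope):
    """Hypothetical bias: 0 on the diagonal, more negative for more distant keys."""
    pos = np.arange(seq_len)
    dist = pos[:, None] - pos[None, :]   # query index minus key index
    bias = -slope * dist.astype(float)   # assumed linear penalty on distance
    bias[dist < 0] = -np.inf             # assumed causal mask: no attending to the future
    return bias

def attention_with_bias(q, k, v, slope=0.5):
    """Ordinary scaled dot-product attention; the bias is the only addition."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores + distance_bias(q.shape[0], slope)   # the one-line change
    scores = scores - scores.max(axis=-1, keepdims=True) # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# The bias depends only on relative distance, so nothing is tied to a fixed
# training length: the same function runs unchanged on a longer sequence.
rng = np.random.default_rng(0)
for n in (8, 32):
    x = rng.normal(size=(n, 16))
    out, w = attention_with_bias(x, x, x)
    print(n, out.shape)
```

Because the bias is computed from positions alone, there is no learned position table to resize, which is one way to read the "doesn't need to change anything about the network" point above.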