• jointly learn dynamic pooling (token segmentation) and language modelling
    • but the number of boundaries is also dynamic, so how do the transformer layers work with it?
    • upsampling is by duplication (see the pooling/upsampling sketch after this block)
      • Can we apply the ideas of U-Net or deconvolution here? (idea)
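A minimal sketch (my own illustration, not the paper's code) of segment pooling and duplication upsampling, assuming a per-token 0/1 `boundaries` vector that marks the first token of each segment. The layers between pooling and upsampling then just run on the shorter, per-example variable-length sequence (padded within a batch), which is one way to reconcile a standard transformer stack with a dynamic number of boundaries.

```python
import torch

def pool_by_boundaries(hidden: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Mean-pool token vectors within each segment marked by `boundaries`."""
    seg_id = torch.cumsum(boundaries, dim=0) - 1               # token -> segment index
    num_segments = int(seg_id.max()) + 1
    pooled = torch.zeros(num_segments, hidden.size(-1))
    counts = torch.zeros(num_segments, 1)
    pooled.index_add_(0, seg_id, hidden)
    counts.index_add_(0, seg_id, torch.ones(hidden.size(0), 1))
    return pooled / counts                                      # (num_segments, d_model)

def upsample_by_duplication(pooled: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Copy each segment vector back to every token position it covers."""
    seg_id = torch.cumsum(boundaries, dim=0) - 1
    return pooled[seg_id]                                       # (seq_len, d_model)

hidden = torch.randn(6, 8)                                      # (seq_len, d_model)
boundaries = torch.tensor([1, 0, 0, 1, 0, 1])                   # segments of length 3, 2, 1
short = pool_by_boundaries(hidden, boundaries)                  # (3, 8): input to the shortened layers
full = upsample_by_duplication(short, boundaries)               # (6, 8): back to token resolution
```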
  • learn dynamic pooling by predicting segment boundaries in the sequence dynamically (see the boundary-predictor sketch after this block)
    • normally pooling uses a fixed segment size, which is sub-optimal for language
    • helps preserve linguistic primitives during pooling
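A hedged sketch of what the neural boundary predictor could look like: a small MLP over token hidden states emitting per-token boundary logits (the class name, hidden size, and the BCE training note are my assumptions, not taken from the paper).

```python
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Per-token boundary logits from hidden states; sigmoid gives p(boundary)."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> logits: (batch, seq_len)
        return self.mlp(hidden).squeeze(-1)

predictor = BoundaryPredictor(d_model=512)
hidden = torch.randn(2, 16, 512)
probs = torch.sigmoid(predictor(hidden))   # per-token boundary probabilities
# With unigram-tokenizer or whitespace supervision, these logits can be trained
# with binary cross-entropy against the reference segmentation.
```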
  • the aim is to make the model perform
    • hierarchical computation
    • conditional computation by allocating resources to sub-sequences in proportion to the model uncertainty
  • learn the neural boundary predictor, using one of several training signals
    • unigram tokenizer
    • end-to-end through stochastic re-parameterisation (see the Gumbel-sigmoid sketch after this list)
    • spikes in the conditional entropy of the predictive distribution
      • ensure that the computation is adaptive to the level of uncertainty
      • this and Gumbel-sigmoid are inferior to alternatives for dynamic pooling
        • why? this is such a cool idea
    • natural data boundaries such as white spaces
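For the end-to-end option, a minimal sketch of a Gumbel-sigmoid-style stochastic re-parameterisation with a straight-through estimator: hard 0/1 boundary decisions in the forward pass, soft relaxed gradients in the backward pass (an illustration of the generic trick under my own naming, not the paper's exact formulation).

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Relaxed Bernoulli sample: add logistic noise to the logits, then apply a
    temperature-scaled sigmoid."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)            # logistic noise
    return torch.sigmoid((logits + noise) / temperature)

def hard_boundaries(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Straight-through estimator: hard 0/1 decisions in the forward pass,
    soft gumbel-sigmoid gradients in the backward pass."""
    soft = gumbel_sigmoid(logits, temperature)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()

logits = torch.randn(2, 16, requires_grad=True)        # e.g. from the boundary predictor
b = hard_boundaries(logits)                            # 0/1 boundaries, still differentiable
b.sum().backward()                                     # gradients reach the logits
```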