• jointly learn dynamic pooling (token segmentation) and language modelling
    • but the number of boundaries is also dynamic, so how do the transformer layers work with it?
    • upsampling is by duplication (see the pooling/upsampling sketch after this block)
      • Can we apply the ideas of U-Net or deconvolution here? (idea)
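A minimal sketch (my own illustration, not the paper's code) of segment pooling and duplication upsampling, assuming a per-token 0/1 `boundaries` vector that marks the first token of each segment. The layers between pooling and upsampling then just run on the shorter, per-example variable-length sequence (padded within a batch), which is one way to reconcile a standard transformer stack with a dynamic number of boundaries.

```python
import torch

def pool_by_boundaries(hidden: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Mean-pool token vectors within each segment marked by `boundaries`."""
    seg_id = torch.cumsum(boundaries, dim=0) - 1               # token -> segment index
    num_segments = int(seg_id.max()) + 1
    pooled = torch.zeros(num_segments, hidden.size(-1))
    counts = torch.zeros(num_segments, 1)
    pooled.index_add_(0, seg_id, hidden)
    counts.index_add_(0, seg_id, torch.ones(hidden.size(0), 1))
    return pooled / counts                                      # (num_segments, d_model)

def upsample_by_duplication(pooled: torch.Tensor, boundaries: torch.Tensor) -> torch.Tensor:
    """Copy each segment vector back to every token position it covers."""
    seg_id = torch.cumsum(boundaries, dim=0) - 1
    return pooled[seg_id]                                       # (seq_len, d_model)

hidden = torch.randn(6, 8)                                      # (seq_len, d_model)
boundaries = torch.tensor([1, 0, 0, 1, 0, 1])                   # segments of length 3, 2, 1
short = pool_by_boundaries(hidden, boundaries)                  # (3, 8): input to the shortened layers
full = upsample_by_duplication(short, boundaries)               # (6, 8): back to token resolution
```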
  • learn dynamic pooling by predicting segment boundaries in the sequence dynamically (see the boundary-predictor sketch after this block)
    • normally pooling uses a fixed segment size, which is sub-optimal for language
    • helps preserve linguistic primitives during pooling
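A hedged sketch of what the neural boundary predictor could look like: a small MLP over token hidden states emitting per-token boundary logits (the class name, hidden size, and the BCE training note are my assumptions, not taken from the paper).

```python
import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    """Per-token boundary logits from hidden states; sigmoid gives p(boundary)."""
    def __init__(self, d_model: int, d_hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> logits: (batch, seq_len)
        return self.mlp(hidden).squeeze(-1)

predictor = BoundaryPredictor(d_model=512)
hidden = torch.randn(2, 16, 512)
probs = torch.sigmoid(predictor(hidden))   # per-token boundary probabilities
# With unigram-tokenizer or whitespace supervision, these logits can be trained
# with binary cross-entropy against the reference segmentation.
```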
  • the aim is to make the model perform
    • hierarchical computation
    • conditional computation by allocating resources to sub-sequences in proportion to the model uncertainty
  • learn the neural boundary predictor, using one of several training signals
    • unigram tokenizer
    • end-to-end through stochastic re-parameterisation (see the Gumbel-sigmoid sketch after this list)
    • spikes in the conditional entropy of the predictive distribution
      • ensure that the computation is adaptive to the level of uncertainty
      • this and Gumbel-sigmoid are inferior to alternatives for dynamic pooling
        • why? this is such a cool idea
    • natural data boundaries such as white spaces
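For the end-to-end option, a minimal sketch of a Gumbel-sigmoid-style stochastic re-parameterisation with a straight-through estimator: hard 0/1 boundary decisions in the forward pass, soft relaxed gradients in the backward pass (an illustration of the generic trick under my own naming, not the paper's exact formulation).

```python
import torch

def gumbel_sigmoid(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Relaxed Bernoulli sample: add logistic noise to the logits, then apply a
    temperature-scaled sigmoid."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)            # logistic noise
    return torch.sigmoid((logits + noise) / temperature)

def hard_boundaries(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Straight-through estimator: hard 0/1 decisions in the forward pass,
    soft gumbel-sigmoid gradients in the backward pass."""
    soft = gumbel_sigmoid(logits, temperature)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()

logits = torch.randn(2, 16, requires_grad=True)        # e.g. from the boundary predictor
b = hard_boundaries(logits)                            # 0/1 boundaries, still differentiable
b.sum().backward()                                     # gradients reach the logits
```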