
  • do we always need full nxn attention ?
    • (2x2)D attention
    • can we randomly pick a pxp sub-block instead, with p < n ?
      • already done in DropAttention
        • they tried dropping columns (keys) and entries of the attention matrix
  • same idea as dropout
    • dropout is loosely inspired by how the brain works
    • connections in the brain are bidirectional; transformers mimic that in some way with self-attention
      • can we extend it further ?
      • given a large number of neurons, at each iteration take p random neurons and do pxp attention among them
      • the neurons don't need to take the same kind of input
        • as long as it turns the input into a vector of the same dimension
        • first layer will be specialized, because the input varies
          • does the brain have different neurons dedicated to different tasks ?
          • later layers can mix and swap inputs since now everything has the same shape
        • sounds like a graph net, but here the nodes are randomly connected
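
A minimal sketch of the random pxp idea from the notes above, in NumPy. Several things here are assumptions for illustration and not part of the notes: the input kinds (`image`, `audio`, `text`) and their sizes are made up, the specialized first layer is just one random linear map per kind, and the attention step uses no learned query/key/value projections. `random_subset_attention` is a hypothetical name.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def random_subset_attention(h, p, rng):
    """One iteration: sample p of the n neurons and attend only among them.

    h is an (n, d) array of neuron states. Returns (new_h, idx), where idx
    holds the sampled neuron indices. Simplification (assumption): queries,
    keys and values are the states themselves, with no learned projections.
    """
    n, d = h.shape
    idx = rng.choice(n, size=p, replace=False)  # p distinct neurons
    x = h[idx]                                  # (p, d)
    scores = x @ x.T / np.sqrt(d)               # (p, p) instead of (n, n)
    weights = softmax(scores, axis=-1)
    new_h = h.copy()
    new_h[idx] = weights @ x                    # only sampled neurons update
    return new_h, idx


rng = np.random.default_rng(0)
d = 8  # shared vector dimension after the first layer

# Specialized first layer: one linear map per (hypothetical) input kind,
# so inputs of different sizes all end up as d-dimensional vectors.
encoders = {
    "image": rng.normal(size=(32, d)),
    "audio": rng.normal(size=(20, d)),
    "text": rng.normal(size=(12, d)),
}
inputs = [
    ("image", rng.normal(size=32)),
    ("image", rng.normal(size=32)),
    ("audio", rng.normal(size=20)),
    ("audio", rng.normal(size=20)),
    ("text", rng.normal(size=12)),
    ("text", rng.normal(size=12)),
]
h = np.stack([x @ encoders[kind] for kind, x in inputs])  # (6, d)

# Later layers: everything has the same shape, so any neurons can be mixed.
state = h
for _ in range(4):
    state, _ = random_subset_attention(state, p=3, rng=rng)
```

Neurons outside the sampled subset keep their state for that iteration, so the attention matrix per step is pxp rather than nxn. Which neurons interact changes every iteration, which is the "randomly connected graph" flavor mentioned in the last bullet.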