idea: NLP

  • perform self-attention within a sliding window (windows overlap)
  • compute a representative K, V for each window
  • a new token can attend to past windows via their representative K, V (see the sketch after this list)
  • extend this hierarchically when the context is even longer: slide a window over the windows themselves
    • then perform attention from top-level windows down to bottom-level ones
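
A minimal NumPy sketch of the single-level version (all names are placeholders, and mean-pooling is just one possible choice of "representative" K, V; a learned pooling would also work): the query attends to one pooled (K, V) per overlapping past window plus its own local window at full resolution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_summaries(K, V, window, stride):
    """Mean-pool K and V over overlapping windows, giving one
    representative (K, V) pair per window."""
    reps_k, reps_v = [], []
    n = K.shape[0]
    for start in range(0, max(n - window, 0) + 1, stride):
        end = min(start + window, n)
        reps_k.append(K[start:end].mean(axis=0))
        reps_v.append(V[start:end].mean(axis=0))
    return np.stack(reps_k), np.stack(reps_v)

def attend(q, K, V):
    """Single-query scaled dot-product attention."""
    d = K.shape[-1]
    scores = (K @ q) / np.sqrt(d)   # (num_keys,)
    weights = softmax(scores)
    return weights @ V              # (d,)

def sliding_window_attention(q, K_past, V_past, K_local, V_local,
                             window=64, stride=32):
    """New token attends to its local window at full resolution
    plus one representative (K, V) per overlapping past window."""
    rep_k, rep_v = window_summaries(K_past, V_past, window, stride)
    keys = np.concatenate([rep_k, K_local], axis=0)
    values = np.concatenate([rep_v, V_local], axis=0)
    return attend(q, keys, values)

# toy usage
rng = np.random.default_rng(0)
d = 16
K_past = rng.normal(size=(512, d))   # long past context
V_past = rng.normal(size=(512, d))
K_local = rng.normal(size=(32, d))   # current local window
V_local = rng.normal(size=(32, d))
q = rng.normal(size=(d,))
out = sliding_window_attention(q, K_past, V_past, K_local, V_local)
print(out.shape)  # (16,)
```

The hierarchical variant would run `window_summaries` again over the window representatives themselves, so a query first attends to top-level summaries and only then (or selectively) to the bottom-level windows they cover.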