- perform self-attention within a sliding window over the sequence (windows may overlap)
- compute a representative K, V pair for each window (e.g., by pooling the window's keys and values)
- a new token attends to its own window directly and to past windows only through their representative K, V
- when the context is even longer, extend this hierarchically: run a sliding window over the window representatives themselves (a window over windows)
- then perform attention from the top-level windows down to the bottom level; a sketch of this scheme follows the list
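
Below is a minimal sketch of one way to read these notes: a single new query token attends to its local window of recent keys/values plus pooled representative K, V of earlier windows, with one extra pooling level when the context is very long. The window size, stride, mean pooling, and the names `window_representatives` / `hierarchical_window_attention` are illustrative assumptions, not a fixed algorithm from these notes.

```python
# Sketch only: single head, single query token, no learned projections.
# Mean pooling as the "representative" and the window/stride values are
# assumptions for illustration.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_representatives(K, V, window, stride):
    """Pool each (possibly overlapping) window of keys/values into one
    representative K, V pair. Mean pooling is an arbitrary choice here."""
    reps_k, reps_v = [], []
    for start in range(0, max(len(K) - window + 1, 1), stride):
        sl = slice(start, start + window)
        reps_k.append(K[sl].mean(axis=0))
        reps_v.append(V[sl].mean(axis=0))
    return np.stack(reps_k), np.stack(reps_v)

def hierarchical_window_attention(q, K, V, window=64, stride=32):
    """Attention for one new query token q over past keys K and values V:
    - attends directly to the most recent `window` tokens (its local window),
    - earlier tokens are summarized into per-window representative K, V,
    - if there are still many representatives, a second sliding window pools
      them again ("window over windows") and the token attends to that top
      level instead of every bottom-level representative."""
    d = q.shape[-1]
    local_K, local_V = K[-window:], V[-window:]
    past_K, past_V = K[:-window], V[:-window]

    keys, values = [local_K], [local_V]
    if len(past_K) > 0:
        rep_K, rep_V = window_representatives(past_K, past_V, window, stride)
        if len(rep_K) > window:  # context still long: add one more level
            rep_K, rep_V = window_representatives(rep_K, rep_V, window, stride)
        keys.append(rep_K)
        values.append(rep_V)

    K_all = np.concatenate(keys, axis=0)
    V_all = np.concatenate(values, axis=0)
    scores = (K_all @ q) / np.sqrt(d)  # scaled dot-product attention
    return softmax(scores) @ V_all

# Usage with random data: 5000 past tokens, 16-dim head
rng = np.random.default_rng(0)
d = 16
K = rng.normal(size=(5000, d))
V = rng.normal(size=(5000, d))
q = rng.normal(size=(d,))
out = hierarchical_window_attention(q, K, V)  # shape (d,)
```

In this sketch the per-token cost scales with the window size plus the number of window representatives rather than with the full context length, which is the point of summarizing distant windows.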