Idea for images: why not make the attention 4D? Is there a way to exploit spatial relations even better with this approach? For n-d data, could we use n×n-d attention, following the same idea as above?
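A rough sketch of what 4D attention on an image could look like, assuming plain scaled dot-product attention (the note does not say which scoring function is meant) and a hypothetical helper named `attention_4d`. Each query position (i, j) gets a weight for every key position (m, n), so the weights form an H×W×H×W tensor, with the softmax taken jointly over both key axes. Numerically this matches flattening the grid into H·W tokens; the only difference is that the four indices stay explicit, so the spatial offsets (i - m, j - n) remain directly available.

```python
import math


def attention_4d(q, k, v):
    """Attention over a 2D grid with an explicit 4D weight tensor.

    q, k, v are nested lists of shape (H, W, D). This is an illustrative
    helper (hypothetical name), assuming scaled dot-product scores.
    Returns (out, attn), where attn[i][j][m][n] is the weight that query
    position (i, j) places on key position (m, n).
    """
    H, W, D = len(q), len(q[0]), len(q[0][0])
    Dv = len(v[0][0])
    scale = 1.0 / math.sqrt(D)
    attn = [[None] * W for _ in range(H)]
    out = [[None] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            # Score query (i, j) against every key position (m, n).
            scores = [[scale * sum(a * b for a, b in zip(q[i][j], k[m][n]))
                       for n in range(W)] for m in range(H)]
            # Softmax jointly over both key axes (max subtracted for stability).
            top = max(max(row) for row in scores)
            exps = [[math.exp(s - top) for s in row] for row in scores]
            total = sum(sum(row) for row in exps)
            attn[i][j] = [[e / total for e in row] for row in exps]
            # Weighted sum of the values over the whole grid.
            out[i][j] = [sum(attn[i][j][m][n] * v[m][n][d]
                             for m in range(H) for n in range(W))
                         for d in range(Dv)]
    return out, attn


# Tiny self-attention example on a 2x3 grid with 4 features per position.
H, W, D = 2, 3, 4
x = [[[float(i + j + d) for d in range(D)] for j in range(W)] for i in range(H)]
out, attn = attention_4d(x, x, x)
```

Under the same assumptions, n-d data would follow the same pattern with a weight tensor that has 2n axes (n for the query position, n for the key position); whether that is what "n×n-d" is meant to denote here is an interpretation.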