Mikio Braun
@mikiobraun
Replying to @rasbt
I think there were some examples of replacing the multi-head attention with more lightweight variants (localized attention, or low-rank matrix factorizations). In all applications of ML I've ever seen, no matrices were ever truly full rank (OK, I'm exaggerating), so it seems plausible.
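(A minimal sketch of the low-rank idea, not from the thread itself: if a learned d x d projection has quickly decaying singular values, a rank-r factorization W ≈ U V reproduces it closely while using far fewer parameters. The decay rate and sizes below are made-up illustration values.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 64

# Simulate a "learned" weight matrix whose singular values decay quickly,
# i.e. its effective rank is far below the nominal rank d.
U_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
V_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
singular_values = np.exp(-np.arange(d) / 50.0)
W = U_full @ np.diag(singular_values) @ V_full.T

# Truncated SVD gives the best rank-r approximation in Frobenius norm.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_low_rank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rel_error = np.linalg.norm(W - W_low_rank) / np.linalg.norm(W)
print(f"relative error at rank {r}: {rel_error:.4f}")
print(f"parameters: {d * d} -> {2 * d * r}")
```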