Mikio Braun
@mikiobraun
Replying to @rasbt
I think there were some examples of replacing the multi-head attention with more lightweight variants (localized attention, or low-rank matrix factorizations). In all applications of ML I've ever seen, no matrices were ever truly full rank (OK, I'm exaggerating), so it seems plausible.
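(A minimal sketch of the low-rank idea, not from the thread itself: if a learned d x d projection has quickly decaying singular values, a rank-r factorization W ≈ U V reproduces it closely while using far fewer parameters. The decay rate and sizes below are made-up illustration values.)

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 64

# Simulate a "learned" weight matrix whose singular values decay quickly,
# i.e. its effective rank is far below the nominal rank d.
U_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
V_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
singular_values = np.exp(-np.arange(d) / 50.0)
W = U_full @ np.diag(singular_values) @ V_full.T

# Truncated SVD gives the best rank-r approximation in Frobenius norm.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_low_rank = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rel_error = np.linalg.norm(W - W_low_rank) / np.linalg.norm(W)
print(f"relative error at rank {r}: {rel_error:.4f}")
print(f"parameters: {d * d} -> {2 * d * r}")
```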