Comment on Nvidia unveils new GPU designed for long-context inference

brucethemoose@lemmy.world 2 days ago

Doubling down on flash attention (my interpretation of this) is quite risky, as more efficient attention mechanisms are seeping into bigger and bigger models.

DeepSeek’s MLA is a start. Jamba is already doing hybrid GQA/Mamba attention, and a Qwen3 update is rumored to be using something exotic as well.
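
To put rough numbers on why this matters: a minimal back-of-the-envelope sketch of per-token KV-cache size under standard multi-head attention, GQA, and MLA-style latent compression. All the dimensions below (head counts, head dim, latent size, layer count, context length) are illustrative assumptions, not any shipping model's real config; the point is only the relative scale, since the KV cache is what dominates memory at long context.

```python
# Rough KV-cache size per token, per layer, in bytes (fp16).
# All parameters are hypothetical, chosen just to show the relative scale.

def kv_bytes_per_token(n_kv_heads: int, head_dim: int, bytes_per_el: int = 2) -> int:
    """Bytes to cache one K and one V vector per KV head."""
    return 2 * n_kv_heads * head_dim * bytes_per_el

n_layers, ctx = 64, 128_000  # assumed model depth and context length

# Standard MHA: one K/V pair per query head (assume 64 heads of dim 128).
mha = kv_bytes_per_token(n_kv_heads=64, head_dim=128)

# GQA: query heads share a few KV heads (assume 8 KV groups).
gqa = kv_bytes_per_token(n_kv_heads=8, head_dim=128)

# MLA (DeepSeek-style): cache one compressed latent per token instead of
# full K/V (assume a 512-dim latent plus a small 64-dim decoupled RoPE key).
mla = (512 + 64) * 2

for name, b in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    total_gb = b * n_layers * ctx / 1e9
    print(f"{name}: {b:>6} B/token/layer -> {total_gb:6.1f} GB at {ctx:,} tokens")
```

Under these assumed numbers the cache shrinks from hundreds of GB (MHA) to tens (GQA) to single digits (MLA), which is exactly why hardware sized around giant flash-attention KV caches is a gamble.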

In plain English, they seem to be selling the idea that the software architecture won't change much, when that doesn't appear to be the case.
