Using supervised fine-tuning (SFT) to introduce even a small amount of relevant data to the training set can often lead to strong improvements in this kind of "out of domain" model performance. But the researchers say that this kind of "patch" for various logical tasks "should not be mistaken for achieving true generalization. ... Relying on SFT to fix every [out of domain] failure is an unsustainable and reactive strategy that fails to address the core issue: the model’s lack of abstract reasoning capability."
Rather than demonstrating a capacity for generalized logical inference, these chain-of-thought models are "a sophisticated form of structured pattern matching" that "degrades significantly" when pushed even slightly outside their training distribution, the researchers write. Further, the ability of these models to generate "fluent nonsense" creates "a false aura of dependability" that does not stand up to a careful audit.
As such, the researchers warn strongly against "equating [chain-of-thought]-style output with human thinking," especially in "high-stakes domains like medicine, finance, or legal analysis." Current tests and benchmarks should prioritize tasks that fall outside any training set to probe for these kinds of errors, while future models will need to move beyond "surface-level pattern recognition to exhibit deeper inferential competence," they write.
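To make that recommendation concrete, here is a minimal sketch of an out-of-distribution probe. It is not from the paper: the `query_model` function is a hypothetical stand-in for whatever inference API is being tested, and the task (multi-digit addition) is just one convenient example of evaluating the same skill at problem sizes a model has likely seen often versus sizes well outside that range.

```python
# Minimal sketch (not from the paper) of an out-of-distribution probe:
# evaluate the same task at operand lengths likely seen in training
# versus lengths far outside it. `query_model` is a hypothetical
# stand-in for whatever inference API you use.
import random

def make_addition_prompts(n_digits: int, count: int = 50):
    """Generate addition problems with operands of a fixed digit length."""
    problems = []
    for _ in range(count):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        prompt = f"What is {a} + {b}? Answer with only the number."
        problems.append((prompt, a + b))
    return problems

def accuracy(problems, query_model):
    """Fraction of problems the model answers exactly correctly."""
    correct = 0
    for prompt, expected in problems:
        reply = query_model(prompt).strip()
        correct += reply == str(expected)
    return correct / len(problems)

# Compare familiar lengths (e.g., 2-3 digits) against a length the model
# has rarely or never seen (e.g., 12 digits):
# for d in (2, 3, 12):
#     print(d, accuracy(make_addition_prompts(d), query_model))
```

A model that had genuinely internalized the rules of addition should degrade gracefully as the digit count grows; a pattern matcher tends to drop off sharply once prompts leave familiar territory, which is the failure mode the researchers describe.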
Technus@lemmy.zip 16 hours ago
I get scoffed at every time I call LLMs “glorified auto-correct” so it’s nice being validated.
Anyone who actually has a grasp of how Large Language Models work should not be surprised by this, but too many people, even engineers who should really know better, have drunk the Kool-aid.
mindbleach@sh.itjust.works 6 hours ago
Please don’t mistake vindication for a lack of ambiguity. When this took off, we had no goddamn idea what the limit was. The fact it works half this well is still absurd.
Simple examples like addition were routinely wrong, but they were wrong in a way that suggested the model might actually be inferring the rules of addition. That's a compact way to predict a lot of arbitrary symbols. Seeing that abstraction emerge would be huge, even if it was limited to cases with a zillion examples. And it was basically impossible to reason about whether that was pessimistic or optimistic.
A consensus that "that doesn't happen" required all of this scholarship. If we had not reached this point, the question would still be open. Remove all the hype from grifters insisting AGI is gonna happen now, oops I mean now, oops nnnow, and you're still left with a series of advances previously thought impossible. Backpropagation doesn't work… okay now it does. Training only plateaus… okay it gets better. Diffusion's cute, avocado chairs and all, but… okay that's photoreal video. It really took people asking weird questions on high-end models to distinguish actual reasoning capability from extremely similar sentence construction.
And if we’re there, can we please have models ask a question besides ‘what’s the next word?’