<p>It's the familiar failure mode: ask an LLM to multiply two small numbers and it "works," because someone has almost certainly written that exact product down somewhere in the training data, but it falls apart for larger numbers. "Reasoning" models can get around that by giving the model an escape hatch to eval, as in the original chain-of-thought paper, but then why not just call eval directly?</p><p>But regardless, the point stands: pick a task common enough that it has already been solved in the training corpus, and it "works," right?</p>
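<p>For what it's worth, here's a minimal sketch of that "escape hatch" idea, assuming a hypothetical <code>llm()</code> callable (prompt in, completion out): the model is told to wrap any arithmetic in a <code>CALC(...)</code> marker, and the harness evaluates the expression deterministically instead of trusting the model's token-by-token arithmetic. The <code>CALC</code> marker, the <code>llm</code> callable, and <code>safe_eval</code> are all made up for illustration, not any particular framework's API.</p>
<pre><code>import ast
import operator
import re

# The "eval" escape hatch, restricted to +, -, *, / so we are not
# handing the model a full Python interpreter.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str):
    """Evaluate a plain arithmetic expression deterministically."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))

def answer(question: str, llm) -> str:
    # `llm` is a hypothetical callable: prompt in, completion out.
    draft = llm("Answer the question. Wrap any arithmetic you need in "
                "CALC(...) instead of computing it yourself.\n\n" + question)
    # Swap every CALC(...) for the deterministically computed result.
    # (The regex is deliberately naive: no nested parentheses.)
    return re.sub(r"CALC\(([^)]*)\)",
                  lambda m: str(safe_eval(m.group(1))), draft)

if __name__ == "__main__":
    # Stand-in for a model that has learned to emit the marker.
    fake_llm = lambda prompt: "The product is CALC(48757 * 92183)."
    print(answer("What is 48757 * 92183?", fake_llm))
    # Every digit in the printed answer comes from safe_eval, not the model.
</code></pre>
<p>Which is the whole question, really: once the model's only job is deciding when to emit <code>CALC(...)</code>, the arithmetic itself never touches the weights, so the honest comparison is against just calling the evaluator yourself.</p>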