Why LLMs alone don't solve optimization problems

A frequent question, often delivered as an objection: if the model can clearly understand the problem, why can’t it also solve it? GPT-4o can read “schedule four nurses across five days with the following constraints” and respond with a perfectly plausible schedule. The schedule looks right. It is laid out in a markdown table. The names are spelled correctly. What more could a solver give you?

The answer is: a guarantee that the schedule actually satisfies the constraints. The schedule that came back from the prompt is plausible. Plausibility is not validity. The two are not the same property, and conflating them is the central failure mode of using a language model as a constraint solver.

The shape of the problem

Optimization problems — at least the ones Savanty addresses, which is to say discrete constraint satisfaction over finite domains — have two layers. The first is the modelling layer. You take an informal description (“nurses can’t work two evening shifts in a row”) and reduce it to a set of variables, domains, and predicates over those variables (“for all nurses N, for all days D, it is not the case that evening_shift(N, D) and evening_shift(N, D+1)”). The second is the search layer. You enumerate, prune, or otherwise traverse the space of variable assignments looking for one that violates none of the predicates and optimizes whatever objective you specified.

LLMs are demonstrably good at the first layer. They have seen enough natural-language descriptions of scheduling and assignment problems to extract the entities, recognise that “morning” and “evening” are values in a shift domain, and produce a structured representation. This is essentially a translation task — the same task that they handle competently when translating English to Python or English to SQL. The boundaries of competence here are well understood: ambiguity in the source text turns into ambiguity in the output, and there are well-known prompting techniques (give examples, ask clarifying questions, surface assumptions) for managing it.

The second layer is where things go wrong. Search is not translation. There is no fluent, statistically likely sequence of tokens that corresponds to “the unique assignment that satisfies all twelve of these constraints simultaneously”. The model can only sample plausible continuations. When the continuation it samples happens to violate constraint #7, there is no internal mechanism that says “wait, this is wrong, back up”. The model emits the schedule with the same confidence either way. It cannot tell the difference between a valid output and an invalid one, because plausibility — the only signal it has access to during generation — is correlated with neither.

What you actually get back

When you ask a state-of-the-art LLM to solve a non-trivial constraint problem, you get one of three things and you cannot tell which one until you check by hand.

The first thing you might get is a correct solution. Small problems often produce correct solutions because the search space is small enough that the most plausible token sequence is also the valid one. This is fine and it lulls people into a false sense of how well the technique is working. It works on the demo. The demo has four nurses and five days. The real problem has 240 personnel and a six-week roster.

The second thing you might get is a confidently incorrect solution. The schedule is laid out cleanly, the constraints are restated at the top of the answer, and somewhere on row 14 day 4 the same nurse is double-booked. This is the dangerous output. A human reviewer might catch it if they read carefully. A downstream system that takes the answer as ground truth will not. There is no signal in the model’s output that distinguishes this case from the first one.

The third thing you might get is the model declining to commit, often phrased as “here is a possible schedule, but please verify against your constraints”. This is the most honest output. It is also useless as an automated answer — you still need a verifier, which is exactly the missing piece.

Why scaling does not fix this

It is tempting to assume this is a capability ceiling that will be lifted by the next model release. There is a structural reason to doubt this. Constraint satisfaction is, in general, NP-complete. The hard instances cannot be solved by pattern matching against training data because there is no compact pattern: each instance’s solution depends on the precise interaction of constraints in ways that do not transfer between problems. A 3-SAT instance with 200 variables has 2^200 possible assignments and only a vanishing fraction are valid. No amount of parameter scaling teaches a transformer to search that space. Scaling teaches it to produce better-looking output, which is the opposite of what you need.

This is not a controversial claim. The same model that emits an invalid schedule with confidence will, if you ask it, explain to you that constraint satisfaction is NP-complete and that exhaustive search is the only known sound technique. The model knows. It just isn’t running that procedure when it generates an answer; it is producing the next likely token.

Where the hybrid architecture comes in

This is the gap Savanty is built to close, and the architecture follows directly from the diagnosis above. The LLM is used for the layer it is good at — translation — and a real solver, Clingo, is used for the layer the LLM is bad at — search.

In Savanty’s pipeline, the LLM never produces a solution. It produces a program: a set of facts, a set of rules including the choice rule that defines the decision space, and (optionally) an optimization directive. Every decision is encoded as one canonical relation, assign(Var, Value). Every requirement is encoded as an integrity constraint, a rule that begins with :-. The choice rule generates candidate assignments; the integrity constraints rule out invalid ones; Clingo grounds the whole thing and searches exhaustively. If a valid assignment exists, Clingo finds one. If none exists, Clingo proves UNSAT.

This is sound and complete. “Sound” means every answer Clingo returns is genuinely an answer — every integrity constraint in your program holds in the model. “Complete” means if an answer exists, Clingo will find it (within the search budget you give it). Neither property is true of an LLM generating a schedule directly. Both are true of Clingo solving an ASP program. The architecture lets you have both: the LLM does the part where soundness and completeness don’t apply (English doesn’t have those properties), and the solver does the part where they’re load-bearing.

What this leaves on the table

This split is not magic. The remaining failure mode — the one Savanty’s repair loop exists to handle — is that the LLM might translate badly. The encoding it produces might be syntactically broken (Clingo can’t parse it), unsatisfiable when the underlying problem is actually solvable (the LLM over-constrained it), or trivially satisfiable when it should not be (the LLM forgot a constraint). When any of these happens, Savanty hands typed diagnostics back to the LLM — a parse error, a minimal unsatisfiable core, or a “you produced no assign/2 atoms” message — and asks for a revision. That loop is the subject of the next post.

The point of this post is the structural one. There is a useful division of labour between language models and formal solvers. It is not the case that solvers are obsolete because LLMs got good. It is the case that LLMs are excellent at the part of the workflow that used to be expensive — getting the problem into formal shape — and solvers are still the right tool for actually solving the resulting problem. Pretending the LLM can do both ends up with confident, incorrect schedules. Letting each tool do its job ends up with mathematically guaranteed ones.