From English to model: where translation breaks

The LLM is a translator. Translators get things wrong. The interesting question is which kinds of wrong are recoverable from inside the loop — and how.

asprepair-loopunsat-core

If the LLM-to-solver architecture is the right division of labour (the previous post argues that it is), then the seam between them deserves a hard look. The seam is translation. The LLM takes English and produces an ASP encoding: facts, rules, and an optional optimization directive. Clingo then either accepts that encoding or rejects it. The rejection cases are where the work is.

In Savanty there are exactly three rejection cases, and the structure of the codebase makes them explicit. They are syntax_error, unsat, and empty. Each has a different remedy and each surfaces different evidence about what the LLM got wrong. The mistake people make when first wiring an LLM to a solver is to lump these together and send the solver’s raw error text back to the model in a free-form way. That is the Logic-LM-style baseline, and Savanty implements it as a comparison mode (ASPRepairGeneric) precisely to demonstrate that it underperforms a more disciplined approach. Walking through the three cases is how you see why.

Case 1: syntax_error — Clingo can’t parse what came back

This is the easiest case to handle and the easiest to diagnose, because the diagnostic comes from Clingo itself. The grounder produces a parse error pointing at a specific line: maybe the LLM emitted :- assing(N, c1). (typo) or forgot a terminating period, or wrote 1{ assign(V,a); assign(V,b) }1 :- var(V) with no whitespace and tripped the lexer. The feedback to the LLM is straightforward: the rule it produced doesn’t parse, here is the error, fix it.

The reason this case is interesting at all is that it is also the case where most repair loops fail unnecessarily. Clingo’s error messages are precise. They name a line. The LLM can almost always fix them in one round if the feedback is preserved verbatim and not paraphrased. Where it gets dropped is in pipelines that summarise solver output into prose (“there was a problem with your ASP code, please revise”) instead of passing through the parse error. Don’t do that. The literal error message is the strongest evidence the loop has.

Case 2: unsat — the encoding parses but has no solutions

This is the interesting case and it is the case that motivated the whole project. The LLM produced a well-formed ASP program. Clingo grounded it without complaint. But the integrity constraints are jointly contradictory: there is no assignment of variables to values that satisfies all of them simultaneously. Clingo reports UNSAT. Now what?

The naive thing to do is tell the LLM “the program is unsatisfiable, please fix it”. This is approximately useless. The LLM has no way to know which constraint is wrong. It might have produced thirty integrity constraints. Twenty-nine of them are correct translations of the English description and the thirtieth is a hallucinated extra constraint that contradicts something earlier. Or two of them are individually correct but jointly inconsistent with a fact the LLM emitted as a definition. Or the problem is genuinely infeasible — there really is no valid schedule given the requirements as stated — and the LLM should not change anything and let the infeasibility surface to the user.

The model has no signal to distinguish these. So Savanty gives it one. When Clingo returns UNSAT, the runtime computes a minimal unsatisfiable core over the integrity constraints. The algorithm is straightforward deletion filtering: start with all the constraints, repeatedly try removing one and re-solving, and keep removing only those whose removal still leaves the program unsatisfiable. What remains is a minimal subset of constraints that is still jointly unsatisfiable together with the base rules and facts — removing any one of them would make the program satisfiable.

That minimal core is the localised evidence the LLM needs. The repair feedback explicitly says: these constraints, exactly these, cannot all hold together. Decide whether one of them misformalizes the problem (correct or remove it) or whether the problem is genuinely infeasible (leave it unchanged so the user sees a faithful “no solution exists” result). The model now has something to reason about. A typical pathology — the LLM emitted both “max 4 shifts per person” and a hidden implicit “everyone works every day” via a misread choice rule — becomes obvious when those two specific constraints are the entire core.

There is a subtlety worth noting. Computing the core requires running Clingo multiple times, once per candidate removal, and each call has parse-error handling that treats a parse failure as “satisfiable” so we don’t wrongly blame a constraint for a parse failure elsewhere. The cost is small for realistic problem sizes (single-digit dozens of integrity constraints) and the diagnostic quality is qualitatively different from “your program is unsat”.

Case 3: empty — the program solved but emitted no decisions

This case is rarer and slightly stranger. Clingo found an answer set; the answer set just doesn’t contain any assign/2 atoms. What usually happened: the LLM forgot to emit a choice rule, or emitted one that doesn’t ground (the variable in the head is unbound, or the body never matches any fact), or emitted decisions under a different predicate name despite the explicit assign(Var, Value) contract.

This is a contract violation, not a logical error, and the feedback reflects that: “you produced no assign(Var, Value) decision atoms; add the choice rule and ensure decisions surface as assign/2.” Savanty’s canonical decision contract — every decision is one assign/2 atom — is what makes this case detectable at all. Without it, the runtime would have no way to distinguish “the LLM forgot to emit decisions” from “the LLM used different predicate names”. The contract is load-bearing for the repair loop, not just for parsing the eventual answer.

Why typed feedback beats raw error messages

The contrast that motivates the whole design is between Savanty’s typed-core mode and the generic repair mode, where the LLM receives only the raw Clingo error string with no taxonomy, no core, no contract reminder. The generic mode reproduces the Logic-LM family of techniques in the literature, and it is implemented as the ASPRepairGeneric DSPy signature specifically so that benchmark runs can compare them.

The intuition for why typed cores win is that LLM repair is fundamentally a re-translation problem, and re-translation needs a localised, specific signal. “The solver couldn’t solve it” is not localised. “These four constraints, exactly, are jointly inconsistent — pick one to revise or report infeasibility” is localised. The model can act on the second; the first leaves it guessing, and guessing tends to produce a sibling encoding that has a different problem and another round of guessing.

There is also a subtler point about faithful infeasibility. In the typed-core mode, when the model decides that the core represents a genuine real-world contradiction (you asked for ten nurses to cover fifteen daily shifts with a hard cap of one shift per nurse per week), the contract is that it should not change the encoding. The user wants to know that the problem as stated has no solution. In the generic mode, the model is given no framing for “leave it alone and report infeasibility”, and tends to weaken a constraint to make the program satisfiable — producing an answer that does not correspond to the user’s problem.

What the seam tells you

The point of staring this hard at the LLM-to-solver translation step is that it is the only place left where things can go wrong. Once a well-formed ASP encoding reaches Clingo, the rest is mechanical. The interesting engineering is between the English description and the encoding — and specifically in how Clingo’s failures are turned back into evidence the LLM can use. Three cases, three remedies, one canonical contract. That is the whole repair loop, and it is the part of Savanty that does the real work.