TrendHub note. This article is based on the internal GEL paper draft and the ablation result set archived in the HQ research workspace. The goal is not to restate the paper line by line, but to explain why the result matters for people trying to run useful code-audit workflows on smaller local models.
What the ablation actually tested
The internal study isolates three prompt conditions on the same 4B model: a plain baseline audit prompt, a coached prompt with direct issue hints, and a coached prompt that keeps the reflective structure but removes explicit bug-category leakage. That distinction matters because a strong result is only useful if it comes from better reasoning, not from accidentally feeding the answer key back into the model.
- Baseline mean detection rate: 35.0%
- Coached-General mean detection rate: 72.5%
- Coached-Specific mean detection rate: 85.0%
- Statistical significance for the generalized coaching arm: p = 0.0015
- Effect size for generalized coaching vs. baseline: Cohen's d = 2.21
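The effect sizes quoted above use Cohen's d, which compares the difference in means against the pooled standard deviation of the two arms. A minimal sketch of that computation is below; the per-run detection rates are hypothetical stand-ins, not the study's raw data, so the resulting d will not match the paper's 2.21.

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled_sd = (((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

# Hypothetical per-run detection rates (NOT the archived result set).
baseline = [0.30, 0.35, 0.40, 0.35]
coached_general = [0.70, 0.75, 0.75, 0.70]

d = cohens_d(coached_general, baseline)
print(f"Cohen's d = {d:.2f}")
```

The key design point is the pooled denominator: with small run counts, using either arm's standard deviation alone would make d swing wildly run to run.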
Why the generalized coaching result is the real headline
The most important number is not the 85.0% peak score under the hint-heavy condition. It is the jump from 35.0% to 72.5% when the prompt keeps the strategic frame but strips out answer leakage. That result suggests the model is not merely parroting categories it was told to inspect. It is doing meaningfully better work once the prompt forces it to think about robustness, maintainability, and edge-case pressure before scanning the code.
In practical terms, that more than doubles the yield of a 4B audit pass without moving to a much more expensive hosted model. For teams that want to keep review loops local, cheap and frequent, that is the kind of delta that changes whether a workflow feels viable or merely interesting.
What the leakage gap still tells us
The distance between Coached-General and Coached-Specific is still useful. It quantifies the remaining advantage of direct hinting and gives the team a way to separate metacognitive lift from prompt contamination. The internal result set reports a 0.74 effect-size gap for that leakage component, which is material, but it is much smaller than the generalized coaching effect itself. That is why the study can still defend the central claim.
Operator takeaway
If you are running a 3B to 7B class model for code review, this is evidence that a small amount of structured pre-audit coaching can be worth more than another layer of generic prompt ornamentation. The useful pattern is to prime the model with an analytical posture, not to bury it under a long checklist of named failure classes.
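The contrast between the baseline and coached-general conditions can be sketched as two prompt templates. These are illustrative reconstructions of the pattern described in this note, not the study's actual prompts: the coached version asks the model to reason about fragility and maintainability in general terms before it ever sees named bug categories.

```python
# Hypothetical templates illustrating the coaching pattern described above.
# Neither string is taken from the internal GEL paper draft.

BASELINE_AUDIT = """Review the following code and list any bugs you find.

{code}
"""

COACHED_GENERAL_AUDIT = """Before reading the code, briefly answer:
1. What would make code like this fragile under edge-case pressure?
2. What maintainability risks are typical for this kind of structure?
3. Which inputs or states would a careless author forget to handle?

Then audit the code below, checking it against your own answers.

{code}
"""

def build_audit_prompt(code: str, coached: bool = True) -> str:
    """Fill the chosen template with the code under review."""
    template = COACHED_GENERAL_AUDIT if coached else BASELINE_AUDIT
    return template.format(code=code)

print(build_audit_prompt("def div(a, b):\n    return a / b", coached=True))
```

Note what the coached template deliberately avoids: it never names a failure class like "division by zero" or "off-by-one", so any lift it produces comes from the reflective setup rather than from leaked answers.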