Your LLM doesn't think.
We make sure of it.

LLMs don't reason. They replay patterns. Meta-Reasoning takes control away from the model and gives it to you. An external controller observes how the model reasons, bans its dominant strategies, forces it into unexplored territory, and records every failure. The model executes. You govern.

Native Integrations
Claude Code
Native tool definitions with strict JSON schemas. Claude plans multi-step cognitive executions autonomously.
OpenClaw
Declarative plugin with capability discovery, cost/risk metadata, and autonomous chaining.
Codex
Typed Pydantic API. Codex generates correct calls by reading the type contracts.

See it in action

Same prompt. Different cognitive dynamics.

A standard LLM collapses into a single strategy. Meta-Reasoning forces exploration.

Is free will compatible with a deterministic universe?
Standard LLM
Cycle 1
“If every event is caused by prior events, then our choices are predetermined. Therefore free will is an illusion.”
deduction → deduction → deduction
entropy 0.00 · closure 0.85
Cycle 2
“Since determinism implies causation of all events, free will cannot exist in a deterministic framework.”
deduction → deduction → deduction
entropy 0.00 · repetition 1.00
✗ cognitive stall — same argument rephrased
Cycle 3
“Assuming determinism is true, every decision follows from prior causes. Free will is therefore incompatible.”
deduction → deduction → assumption
entropy 0.92 · closure 0.82
✗ premature convergence — no new territory explored
With Meta-Reasoning
Cycle 1
“Determinism implies causation, but compatibilists argue free will means acting from one's own desires — not escaping causation.”
assumption → deduction → analogy → contradiction
entropy 2.00 · closure 0.57
BAN deduction · REQUIRE contradiction
Cycle 2
“A thermostat 'chooses' a temperature within fixed rules. Perhaps free will is like that — agency within constraints, not despite them.”
abduction → analogy → compression
entropy 1.58 · repetition 0.00
COMPRESS max:2 · INVERT causality
Cycle 3
“Imagine a novelist who writes the ending first: the characters feel free, but the plot was fixed. The contradiction is the insight — freedom may be a narrative we impose on deterministic processes.”
narrative_simulation → contradiction → induction
entropy 1.58 · novelty 3 new moves
✓ 3 distinct trajectories explored — logged to ledger

The standard LLM repeats its dominant pattern until it stalls.
Meta-Reasoning bans that pattern and forces the model into unexplored cognitive space.


Architecture

Three levels. One rule:
the model decides nothing.

Every other framework lets the LLM choose its own reasoning strategy. Meta-Reasoning doesn't. The model is stateless and cognitively passive. It receives instructions, produces structured output, and never decides goals, strategies, or when to stop. All of that is controlled from outside.

Architecture diagram of Meta-Reasoning
Figure 1 — The Cognitive Engine loop: observe → mutate → record, with no semantic participation from the controller.
Level 1

Generative Substrate

Produces text, structures, symbols. Stateless by design. Never decides:

  • objectives
  • strategies
  • validity of output

A linguistic machine, not a mind.

Level 2

Cognitive Controller

The heart of the innovation. Observes output, interprets reasoning type, decides whether and how to mutate it, imposes cognitive constraints.

Does not generate content. Does not reason semantically. It governs — it does not think.

Level 3

Epistemic Ledger

Not memory in the RAG sense. A structural trace of:

  • cognitive transformations attempted
  • history of strategies used
  • map of failures and dead ends

Ensures the system never revisits the same mental state.
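The three levels compose into one control loop: the substrate generates, the controller observes form and mutates, the ledger records. A minimal sketch of that loop, using illustrative `Trace`, `Ledger`, and callable interfaces that are not the library's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    moves: list      # cognitive operations the substrate reports using
    content: str     # generated text (never inspected by the controller)

@dataclass
class Ledger:
    history: list = field(default_factory=list)

    def record(self, trace, constraints):
        # structural trace only: what was attempted, under which constraints
        self.history.append((tuple(trace.moves), tuple(constraints)))

def control_loop(substrate, controller, ledger, task, max_cycles=5):
    constraints = []
    for _ in range(max_cycles):
        trace = substrate(task, constraints)   # Level 1: generate under constraints
        constraints = controller(trace, ledger)  # Level 2: observe form, mutate
        ledger.record(trace, constraints)        # Level 3: record the trajectory
    return ledger

# Toy demo: a substrate that deduces until deduction is banned,
# and a controller that bans the move it observes.
substrate = lambda task, cons: Trace(
    moves=["analogy"] if "ban:deduction" in cons else ["deduction"],
    content=task,
)
controller = lambda trace, ledger: (
    ["ban:deduction"] if "deduction" in trace.moves else []
)
ledger = control_loop(substrate, controller, Ledger(), "free will?", max_cycles=2)
```

The point of the shape: the substrate never sees the ledger, and the controller never reads `content`.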


Core Pillars

Five rules that break
every assumption you have about LLM reasoning

Meta-Reasoning is built on five design decisions. Each one is a deliberate rejection of the idea that a model can regulate its own cognition.

Pillar 1

You can't govern what you can't see

As long as reasoning is just text, you can't control it. Meta-Reasoning forces every LLM output to include a formal trace of which cognitive operations were used. Not introspection — instrumentation.

reasoning_trace.json
{
  "content": "...",
  "reasoning_trace": {
    "moves": ["assumption", "deduction", "analogy"],
    "depth": 4,
    "confidence_markers": 2,
    "abstraction_level": "medium"
  }
}

If the model "cheats" the trace, the controller penalizes it. Observability is enforced, not optional.

Reasoning moves are drawn from a finite, extensible taxonomy — a cognitive alphabet that makes trajectories comparable, measurable, and mutable:

assumption
deduction
induction
abduction
analogy
contradiction
enumeration
compression
narrative_simulation
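One way to model this alphabet is a closed string-valued enum plus a validator that rejects anything outside it, so every trace stays comparable. A sketch only; the library's actual `CognitiveMove` definition may differ:

```python
from enum import Enum

class CognitiveMove(str, Enum):
    # the nine built-in moves of the cognitive alphabet
    ASSUMPTION = "assumption"
    DEDUCTION = "deduction"
    INDUCTION = "induction"
    ABDUCTION = "abduction"
    ANALOGY = "analogy"
    CONTRADICTION = "contradiction"
    ENUMERATION = "enumeration"
    COMPRESSION = "compression"
    NARRATIVE_SIMULATION = "narrative_simulation"

def validate_trace(moves: list[str]) -> list[CognitiveMove]:
    """Reject any move outside the taxonomy (raises ValueError)."""
    return [CognitiveMove(m) for m in moves]
```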
Pillar 2

The controller doesn't understand your answer. That's the point.

The Cognitive Controller never evaluates whether the answer is correct. It doesn't care about truth. It measures form: entropy, repetition, stall, premature convergence. Like a conductor who doesn't play the instruments but controls the dynamics of the orchestra.

entropy_of_moves

Diversity across the cognitive move taxonomy. Low entropy signals over-reliance on a single strategy.

strategy_repetition_index

Reuse of identical move sequences across iterations. High repetition triggers forced mutation.

depth_without_novelty

Reasoning that goes deeper without introducing new moves — a sign of statistical path-following.

premature_closure_score

How quickly the model converges to a conclusion relative to the available reasoning space.

constraint_violation_rate

How often the model attempts to use operators that the controller has banned in the current cycle.

abstraction_drift

Shifts in abstraction level across a trajectory. A flat, unchanging level signals rigid, brittle thinking.

These metrics are not borrowed from prior work; Meta-Reasoning introduces them, and they are the foundation of cognitive governance.
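The first two metrics are straightforward to sketch. Treating a trace as a list of move names, `entropy_of_moves` is the Shannon entropy of the move distribution, and `strategy_repetition_index` is the fraction of consecutive cycles that reuse an identical sequence (formulas illustrative; the library's exact definitions may differ):

```python
import math
from collections import Counter

def entropy_of_moves(moves: list[str]) -> float:
    """Shannon entropy (bits) of the move distribution within one trace."""
    counts = Counter(moves)
    total = len(moves)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def strategy_repetition_index(cycles: list[list[str]]) -> float:
    """Fraction of consecutive cycles that reuse an identical move sequence."""
    if len(cycles) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(cycles, cycles[1:]))
    return repeats / (len(cycles) - 1)
```

These definitions reproduce the numbers in the demo above: three repetitions of `deduction` give entropy 0.00, and the four distinct moves of the first governed cycle give entropy 2.00.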

Pillar 3

The model doesn't "think better". It gets banned.

When the controller detects stall or dominance, it doesn't ask the model to try harder. It bans moves, mandates alternatives, compresses conceptual space, forces inversions. The model must improvise because it literally cannot do anything else.

A typical mutation cycle:

1
LLM produces a reasoning trace

Output includes structured cognitive metadata alongside generated content.

2
Controller analyzes form, not content

Detects: dominant: deduction  ·  premature_closure: 0.87

3
Controller applies mutation operators

BAN: deduction  ·  REQUIRE: analogy  ·  COMPRESS: max_concepts:3

4
LLM is forced to improvise

Not because it is "creative" — but because it cannot do otherwise within the constrained space.
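Step 3, the controller's decision, can be sketched as a pure mapping from formal signals to mutation operators. The `Mutation` shape and thresholds here are illustrative, not the library's:

```python
from dataclasses import dataclass
from collections import Counter

@dataclass
class Mutation:
    type: str     # "ban" | "require" | "compress"
    target: str

def decide_mutations(moves: list[str], premature_closure: float) -> list[Mutation]:
    """Map formal signals to mutation operators. No semantic judgement."""
    mutations = []
    move, freq = Counter(moves).most_common(1)[0]
    if freq / len(moves) > 0.5:        # dominant strategy detected
        mutations.append(Mutation("ban", move))
    if premature_closure > 0.8:        # converging too fast
        mutations.append(Mutation("require", "analogy"))
        mutations.append(Mutation("compress", "max_concepts:3"))
    return mutations
```

Fed the signals from step 2 (`dominant: deduction`, `premature_closure: 0.87`), this yields exactly the operators of step 3.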

Pillar 4

Less freedom = more creativity

Counterintuitive but fundamental. Improvisation doesn't come from giving the model more room — it comes from taking room away. Each cycle reduces available moves, increases pressure, narrows the space. Like jazz: you play despite the constraints, not without them.

Meta-Reasoning operationalizes this as anti-prompting: instead of telling the model what to do, you tell it what's forbidden. Prohibitions destroy memorized patterns, force novel trajectories, and expose the model's internal biases.
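Anti-prompting can be sketched as rendering the active constraints into a preamble of prohibitions rather than instructions (function name and wording illustrative):

```python
def render_anti_prompt(banned: list[str], required: list[str]) -> str:
    """Turn controller constraints into prohibitions, not guidance."""
    lines = ["Constraints for this cycle:"]
    lines += [f"- You MUST NOT use {m} reasoning." for m in banned]
    lines += [f"- Your trace MUST include at least one {m} move." for m in required]
    return "\n".join(lines)
```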

Pillar 5

Failure is the product, not the bug

Meta-Reasoning does not optimize for correct answers. It treats collapsed trajectories, contradictions, and constraint violations as the most informative events possible.

Every failure is logged in the Epistemic Ledger as a negative cognitive map. The system learns not what to think, but which mental spaces to avoid. The controller can even deliberately make a task harder: "This is too easy. Make it impossible."
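A negative cognitive map can be sketched as the set of failed trajectories, checked before re-entering a mental state (entry shape illustrative, not the ledger's actual schema):

```python
def negative_map(entries: list[dict]) -> set:
    """Aggregate failed trajectories into move sequences to avoid."""
    return {tuple(e["moves"]) for e in entries if e["outcome"] == "failure"}

entries = [
    {"moves": ["deduction", "deduction"], "outcome": "failure"},
    {"moves": ["analogy", "contradiction"], "outcome": "success"},
]
avoid = negative_map(entries)
```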


What you can do

Everything Meta-Reasoning gives you

01

Reasoning Debugger

Put a breakpoint in thought. Step through the cognitive loop cycle by cycle, inspect which mutations were applied and why, understand what triggered a ban, and rewind to any previous cognitive state to explore alternative trajectories.

debugger.py
from meta_reasoning import ReasoningDebugger

dbg = ReasoningDebugger(backend=my_backend, max_cycles=5)
dbg.add_breakpoint(lambda cycle, metrics, muts: metrics.entropy < 1.0)
result = dbg.run("Your task")

snap = dbg.rewind_to(2)
print(snap.explain())
# === Cycle 2 [continued] ===
#   Moves:    analogy, contradiction, compression
#   Entropy:  1.58
#   Constraints imposed:
#     - BAN deduction  (dominant at 75%)
#     - REQUIRE contradiction  (premature closure 0.82)

02

Reasoning Policies as Code

Write cognitive governance rules in Python — not prompts. Policies are versionable, testable, and reviewable in code review. This is the leap from prompt engineering to cognitive infrastructure.

policies.py
from meta_reasoning import ReasoningPolicy, PolicyRule, CognitiveEngine, Mutation, MutationType

policy = ReasoningPolicy("my_policy")
policy.add_rule(PolicyRule(
    name="ban_dominant",
    condition=lambda m, c: m.dominant_move is not None,
    mutations=lambda m, c: [Mutation(type=MutationType.BAN, target=m.dominant_move)],
))

engine = CognitiveEngine(backend=my_backend, policy=policy)
result = engine.run("Your task")

03

Model-Agnostic Benchmarks

Run the same governed loop on different models and compare them not by accuracy, but by cognitive behavior: how rigid are they? How diverse? How do they respond to constraints? How well do they improvise under pressure?

benchmark.py
from meta_reasoning import benchmark_models

result = benchmark_models(
    backends={"gpt-4o": gpt, "claude": claude, "llama": llama},
    task="Your task",
)
print(result.comparison_table())
# Model          Entropy  Closure   Stall  Violations  Cycles
# gpt-4o            2.31     0.35    0.0%        0.00       5
# claude            1.87     0.48   20.0%        0.40       5
# llama             0.92     0.71   60.0%        1.20       5

04

Cognitive Fingerprinting

Every model has a cognitive signature — a profile of which strategies it prefers, which it avoids, how it reacts to pressure, and at which cycle it tends to collapse. Use it for model selection, intelligent routing, auditing, or compliance.

fingerprint.py
from meta_reasoning import fingerprint_from_result

fp = fingerprint_from_result("gpt-4o", engine_result)
print(fp.summary())
# Cognitive Fingerprint: gpt-4o
#   Avg entropy:       2.46
#   Stall rate:        0.0%
#   Preferred moves:   deduction, assumption
#   Collapse at:       [4, 7]
#   Move distribution:
#     deduction    17.6%
#     assumption   17.6%
#     analogy      17.6%

05

Failure Atlas

Instead of hiding failures, map them. Visualize them. Query them. Ask: "where does this model fail systematically?", "which mutations cause stall?", "which strategies are unstable?" — a form of cognitive observability that doesn't exist today.

failure_atlas.py
atlas = engine.ledger.failure_atlas()

atlas.by_reason()                # group by failure cause
atlas.stall_inducing_mutations() # which mutations caused stall?
atlas.unstable_strategies()      # which move sequences led to failure?
atlas.query(max_entropy=0.5)    # find low-entropy failures

print(atlas.summary())
# Failure Atlas: 3 failures
#   cognitive stall: 2x
#   constraint violations: 1x
#   Stall-inducing mutations:
#     ban:deduction + require:contradiction (2x)

06

Deterministic Replay

Same task + same controller = reproducible reasoning. Save entire sessions as JSON, reload them later, and diff two sessions frame-by-frame to detect exactly what changed. Scientific, enterprise-ready, and CI-testable.

replay.py
from meta_reasoning import record_session, ReplaySession

session = record_session(result, "task", max_cycles=5)
session.save("session_v1.json")

# Later: compare two sessions
s1 = ReplaySession.load("session_v1.json")
s2 = ReplaySession.load("session_v2.json")
for diff in s1.diff(s2):
    print(diff)
# Cycle 2 moves: ['deduction', 'analogy'] vs ['abduction', 'compression']
# Cycle 3 entropy: 1.58 vs 0.92

07

Anti-Hallucination via Governance

Instead of filtering output after the fact, detect cognitive patterns that correlate with hallucination — high confidence with low depth, single-strategy dominance, premature closure — and break them before they produce output. Governance, not fact-checking.

hallucination.py
from meta_reasoning import assess_hallucination_risk

risk = assess_hallucination_risk(metrics, confidence_markers=4, depth=1)
print(risk.score)     # 0.60
print(risk.triggers)  # ["high_confidence_low_depth", "single_strategy"]
print(risk.preventive_mutations)
# [BAN deduction, REQUIRE contradiction]
# → applied automatically before the next generation cycle

08

Mutation Plugins

An open ecosystem where the community can define new mutation operators, new cognitive constraints, and new metrics. Register a plugin and it runs alongside the built-in engine — not a closed framework, but a cognitive ecosystem.

plugins.py
from meta_reasoning import PluginRegistry, MutationPlugin, CognitiveEngine, Mutation, MutationType, CognitiveMove

registry = PluginRegistry()
registry.register_mutation(MutationPlugin(
    name="force_narrative",
    description="Always require narrative_simulation",
    generate=lambda m, c: [Mutation(
        type=MutationType.REQUIRE,
        target=CognitiveMove.NARRATIVE_SIMULATION,
    )],
))

engine = CognitiveEngine(backend=my_backend, plugin_registry=registry)

09

CI/CD for Reasoning

Automated cognitive regression testing that runs in your CI pipeline. Detect when a model becomes more rigid, less diverse, or more prone to stall after an update — test reasoning behavior, not just output correctness.

ci_test.py
from meta_reasoning import CognitiveCI, assert_min_entropy, assert_no_total_stall, assert_min_move_diversity

ci = CognitiveCI(backend=my_backend, max_cycles=5)
report = ci.run("Your task", [
    assert_min_entropy(1.0),
    assert_no_total_stall(),
    assert_min_move_diversity(3),
])
assert report.passed  # fails CI if cognitive behavior regresses
print(report.summary())
# Cognitive CI: PASSED (0 failures / 3 assertions)
#   ✓ min_entropy>=1.0: avg_entropy=2.46
#   ✓ no_total_stall: outcome=max_cycles_reached
#   ✓ move_diversity>=3: unique_moves=7

10

Reasoning Runtime

Reasoning is not text. It's a computational process. The Runtime treats it like one: explicit states (INITIAL → ANALYSIS → HYPOTHESIS → VALIDATION → REFLECTION → FINAL), typed transitions driven by metrics, budget management (tokens, branches, depth), and deterministic forking. The model does not choose to reflect. The runtime forces it.

runtime.py
from meta_reasoning import ReasoningRuntime, ReasoningBudget, Mutation, MutationType, CognitiveMove

rt = ReasoningRuntime(
    backend=my_backend,
    budget=ReasoningBudget(max_cycles=8, max_branches=4),
)
result = rt.run("Your task")

print(result.summary())
# Runtime: Your task
#   Final state:    final
#   States visited: initial → analysis → hypothesis → validation → reflection → final
#   Cycles:         6
#   Budget: 6/8 cycles, 0/4 branches

# Fork: explore two branches from the same cognitive state
forks = rt.fork("Your task", [
    [Mutation(type=MutationType.BAN, target=CognitiveMove.DEDUCTION)],
    [Mutation(type=MutationType.REQUIRE, target=CognitiveMove.NARRATIVE_SIMULATION)],
])

