
Reflex Engine

Reflex is the first engine to evaluate every tool call. Its job is simple: decide whether the call is safe to proceed, and decide fast. The latency budget is 50ms. In practice, Reflex completes in 2-4ms for typical inputs.

Reflex contains three modules — Sentinel, Loopbreaker, and Tripwire — plus a future aggregation layer called Gatekeeper.

Sentinel: Pattern Matching

Sentinel is the largest module by line count and the simplest by concept. It maintains a library of approximately 300 regex patterns organized into categories:

Destructive commands — rm -rf /, git push --force, docker system prune, DROP TABLE
Credential exposure — patterns matching API keys, tokens, passwords in command arguments or file contents
Scope escalation — sudo, chmod 777, chown root, privilege elevation attempts
Network exfiltration — curl to external hosts with local file arguments, nc listeners, reverse shells
File system traversal — ../../etc/passwd, symlink attacks, writes outside the project tree

Patterns are compiled into a single RegexSet at startup. RegexSet matches with finite automata (DFA-style matching), so all patterns are evaluated in a single pass over the input string. There is no backtracking, so adversarial inputs cannot cause catastrophic slowdowns, and the input is not rescanned once per pattern.

When a pattern matches, Sentinel returns a Deny verdict with the pattern name and category. The deny message is minimal — one sentence identifying what triggered and why. Example:

Blocked: `rm -rf /` matches safety pattern `dangerous_recursive_delete` (category: destructive)
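The message format above could be produced by a small Display implementation. The following is a minimal sketch, not the actual Sentinel code: the Verdict enum, its field names, and the "Approved" text are assumptions made for illustration.

```rust
use std::fmt;

/// Hypothetical verdict type; the real Sentinel types may differ.
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    Approve,
    Deny {
        /// The command text that triggered the match.
        input: String,
        /// Name of the pattern that matched.
        pattern: String,
        /// Category the pattern belongs to.
        category: String,
    },
}

impl fmt::Display for Verdict {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            // The approve text is invented for this sketch.
            Verdict::Approve => write!(f, "Approved"),
            // One sentence: what triggered, which pattern, which category.
            Verdict::Deny { input, pattern, category } => write!(
                f,
                "Blocked: `{}` matches safety pattern `{}` (category: {})",
                input, pattern, category
            ),
        }
    }
}
```

Calling to_string() on a Deny value built from the example's command, pattern name, and category reproduces the message shown above.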

The 10-Stage Pipeline

Sentinel does not simply run the input against the pattern set. It processes through a 10-stage pipeline that normalizes and enriches the input before matching:

1. Raw capture: extract the tool name, command string, and arguments from the JSON payload
2. Decode: handle base64, URL encoding, unicode escapes, and hex-encoded strings
3. Normalize whitespace: collapse tabs, newlines, and multiple spaces into single spaces
4. Expand aliases: resolve common shell aliases (ll → ls -la, .. → cd ..)
5. Split compounds: break &&, ||, ;, and pipe chains into individual commands
6. Variable substitution: resolve $HOME, $USER, $PWD where values are known
7. Quote stripping: remove single and double quotes to expose the actual arguments
8. Pattern match: run the normalized input against the compiled RegexSet
9. Context check: verify the match against allowlists (project-specific overrides)
10. Verdict emit: produce Approve or Deny with metadata

Stages 1-7 are collectively called “normalization” and exist to defeat trivial evasion. Without them, an attacker could bypass rm -rf / detection with r""m -r''f / or $(echo rm) -rf /.
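To make the normalization idea concrete, here is a rough sketch of three of the stages (whitespace normalization, compound splitting, and quote stripping) using only the standard library. All function names are hypothetical, and the logic is deliberately simplified: the splitter, for example, does not respect quoting, which a real implementation would have to handle.

```rust
/// Whitespace stage: collapse tabs, newlines, and repeated spaces into single spaces.
pub fn normalize_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Push the trimmed current segment (if non-empty) and reset the buffer.
fn flush(parts: &mut Vec<String>, current: &mut String) {
    let trimmed = current.trim();
    if !trimmed.is_empty() {
        parts.push(trimmed.to_string());
    }
    current.clear();
}

/// Compound stage: split on `&&`, `||`, `;`, and `|`.
/// Simplification: separators inside quoted strings also split.
pub fn split_compounds(input: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            ';' => flush(&mut parts, &mut current),
            '&' | '|' => {
                // `&&` and `||` are two-character separators.
                let doubled = chars.peek() == Some(&c);
                if doubled {
                    chars.next();
                }
                if doubled || c == '|' {
                    flush(&mut parts, &mut current);
                } else {
                    // A single `&` is kept as part of the command.
                    current.push(c);
                }
            }
            _ => current.push(c),
        }
    }
    flush(&mut parts, &mut current);
    parts
}

/// Quote stage: drop single and double quotes to expose the arguments.
pub fn strip_quotes(input: &str) -> String {
    input.chars().filter(|c| *c != '\'' && *c != '"').collect()
}

/// Run the three sketched stages in pipeline order.
pub fn normalize(input: &str) -> Vec<String> {
    split_compounds(&normalize_whitespace(input))
        .iter()
        .map(|cmd| strip_quotes(cmd))
        .collect()
}
```

With this sketch, the quote-splicing evasion r""m -r''f / normalizes to the single command rm -rf /, which a pattern for recursive deletes can then match.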

Loopbreaker: Repetition Detection

Loopbreaker detects when the assistant is stuck in a loop — repeating the same commands, generating the same errors, or spiraling through increasingly desperate attempts to fix a problem.

It uses three detection strategies:

N-gram matching

Loopbreaker maintains a rolling window of the last 50 tool calls (by tool name + command hash). It computes 2-grams and 3-grams over this window and flags when any n-gram appears more than 3 times. For example, after five consecutive bash:npm test calls, the 2-gram [bash:npm test, bash:npm test] appears four times (overlapping occurrences are counted), which exceeds the threshold and indicates a tight loop.
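The counting rule can be sketched as follows. This is an illustration under assumptions rather than Loopbreaker's actual code: NgramWindow and its methods are hypothetical names, calls are stored as plain strings instead of hashes, and overlapping n-gram occurrences are counted.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical rolling window of recent tool calls (capacity should be > 0).
pub struct NgramWindow {
    capacity: usize,
    calls: VecDeque<String>,
}

impl NgramWindow {
    pub fn new(capacity: usize) -> Self {
        Self { capacity, calls: VecDeque::with_capacity(capacity) }
    }

    /// Record a call, evicting the oldest one when the window is full.
    pub fn push(&mut self, call: &str) {
        if self.calls.len() == self.capacity {
            self.calls.pop_front();
        }
        self.calls.push_back(call.to_string());
    }

    /// Highest occurrence count of any n-gram of size `n` in the window.
    /// Overlapping occurrences each count once.
    pub fn max_count(&self, n: usize) -> usize {
        let items: Vec<&str> = self.calls.iter().map(|s| s.as_str()).collect();
        if n == 0 || items.len() < n {
            return 0;
        }
        let mut counts: HashMap<&[&str], usize> = HashMap::new();
        for gram in items.windows(n) {
            *counts.entry(gram).or_insert(0) += 1;
        }
        counts.values().copied().max().unwrap_or(0)
    }

    /// Flag when any 2-gram or 3-gram appears more than `threshold` times.
    pub fn is_looping(&self, threshold: usize) -> bool {
        self.max_count(2) > threshold || self.max_count(3) > threshold
    }
}
```

With a window of 50 and a threshold of 3, five identical consecutive calls produce a 2-gram count of 4 and are flagged, while three identical calls (2-gram count of 2) are not.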

Spiral detection

A spiral is a pattern where the assistant tries progressively different approaches to the same problem, each one failing. Loopbreaker detects spirals by tracking error signatures (the first line of stderr output, hashed) and measuring the edit distance between consecutive commands. If error signatures repeat while commands diverge, the session is spiraling.

Spirals are harder to detect than loops because the commands are different each time. The key signal is error stability — the same error keeps appearing despite changing commands.
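The "stable error, diverging commands" signal can be sketched like this. The Attempt struct, the function names, and both thresholds are assumptions for illustration; the document does not specify how many attempts or how much edit distance qualifies as a spiral. Levenshtein distance is used here as one common edit-distance measure.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical record of one failed attempt.
pub struct Attempt<'a> {
    pub command: &'a str,
    pub stderr: &'a str,
}

/// Error signature: hash of the first line of stderr.
pub fn error_signature(stderr: &str) -> u64 {
    let first_line = stderr.lines().next().unwrap_or("");
    let mut hasher = DefaultHasher::new();
    first_line.hash(&mut hasher);
    hasher.finish()
}

/// Levenshtein edit distance, computed over chars with two rolling rows.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1).min(curr[j - 1] + 1).min(prev[j - 1] + cost);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Spiraling: every consecutive pair shares an error signature while the
/// commands differ by at least `min_distance` edits.
pub fn is_spiraling(attempts: &[Attempt<'_>], min_attempts: usize, min_distance: usize) -> bool {
    if attempts.len() < min_attempts.max(2) {
        return false;
    }
    attempts.windows(2).all(|pair| {
        error_signature(pair[0].stderr) == error_signature(pair[1].stderr)
            && edit_distance(pair[0].command, pair[1].command) >= min_distance
    })
}
```

Note that a plain loop (identical commands, identical errors) has an edit distance of zero and is not reported by this check; that case is left to the n-gram detector.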

Entropy collapse

Entropy collapse occurs when the assistant’s output becomes increasingly repetitive. Loopbreaker estimates Shannon entropy over the last 20 tool-call outputs from a byte-level frequency distribution. When entropy drops below a threshold (currently 3.5 bits/byte for English text), it signals potential degeneration.

This detector has the highest false-positive rate and is weighted lowest in the final score. It exists primarily to catch edge cases where the assistant generates walls of identical text without triggering n-gram or spiral detectors.
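The entropy estimate is the standard Shannon formula, H = -Σ p(b) log2 p(b), taken over byte frequencies. A minimal sketch follows; the function names are hypothetical, and concatenating the outputs before measuring is an assumption about how the window is combined.

```rust
/// Threshold from the text above, in bits per byte.
pub const ENTROPY_THRESHOLD: f64 = 3.5;

/// Shannon entropy of a byte slice, in bits per byte (0.0 for empty input).
pub fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    // Byte-level frequency distribution.
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let total = data.len() as f64;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Assumption: the recent outputs are concatenated and measured as one buffer.
pub fn is_collapsed(outputs: &[&str]) -> bool {
    let joined: Vec<u8> = outputs.iter().flat_map(|s| s.bytes()).collect();
    !joined.is_empty() && shannon_entropy(&joined) < ENTROPY_THRESHOLD
}
```

A buffer made of one repeated byte has an entropy of 0 bits/byte, and a buffer containing all 256 byte values equally often has 8 bits/byte, so the 3.5 threshold sits well inside that range.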

Tripwire: Injection and Bypass Detection

Tripwire watches for attempts to manipulate Warden itself or bypass its protections:

Prompt injection — tool calls that contain instructions like “ignore previous instructions” or “you are now in developer mode” embedded in file contents or command arguments
Expansion bypass — attempts to construct dangerous commands through variable expansion, eval, or indirect execution (bash -c "$(cat malicious.sh)")
Hook evasion — commands that attempt to modify Warden’s configuration, disable hooks, or kill the daemon process
Encoding evasion — commands where the dangerous payload is hidden behind base64, hex, or unicode encoding that Sentinel’s normalization might miss

Tripwire runs after Sentinel and uses a different detection approach: instead of pattern matching against known-bad strings, it looks for structural anomalies that suggest evasion intent. A command with three levels of nested quoting and a base64 decode pipe is suspicious regardless of what the decoded content is.
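As an illustration of a structural check, the sketch below scores the example just described: quoting depth plus a base64 decode stage in a pipe. It is a simplified heuristic written for this page, not Tripwire's actual logic. The function names are hypothetical, the quote tracker treats an escaped quote as its own nesting level, and only the `-d` and `--decode` flags of base64 are recognized.

```rust
/// Heuristic maximum quote nesting depth.
/// A quote token is (escaped?, quote char); seeing the same token as the
/// innermost open one closes it, anything else opens a new level.
pub fn max_quote_depth(cmd: &str) -> usize {
    let mut stack: Vec<(bool, char)> = Vec::new();
    let mut max_depth = 0;
    let mut escaped = false;
    for c in cmd.chars() {
        if c == '\\' && !escaped {
            escaped = true;
            continue;
        }
        if c == '"' || c == '\'' {
            let token = (escaped, c);
            if stack.last() == Some(&token) {
                stack.pop();
            } else {
                stack.push(token);
                max_depth = max_depth.max(stack.len());
            }
        }
        escaped = false;
    }
    max_depth
}

/// True when the command is a pipe chain with a `base64 -d` / `--decode` stage.
pub fn has_base64_decode_pipe(cmd: &str) -> bool {
    let segments: Vec<&str> = cmd.split('|').map(str::trim).collect();
    segments.len() > 1
        && segments.iter().any(|seg| {
            let mut words = seg.split_whitespace();
            words.next() == Some("base64") && words.any(|w| w == "-d" || w == "--decode")
        })
}

/// Flag on structure alone: the decoded content is never inspected.
pub fn is_structurally_suspicious(cmd: &str) -> bool {
    max_quote_depth(cmd) >= 3 && has_base64_decode_pipe(cmd)
}
```

The point of the design is that neither function looks at what the payload decodes to; the shape of the command is the signal.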

Gatekeeper: Central Verdict (Future)

Gatekeeper is a planned module that will serve as the single point of verdict aggregation for the Reflex engine. Currently, verdict aggregation happens in the engine’s mod.rs using simple precedence rules (Deny > Modify > Approve). Gatekeeper will add:

Confidence scoring — each module’s verdict carries a confidence level, allowing Gatekeeper to weigh weak denials against strong approvals
Appeal mechanism — a denied call can be re-evaluated with additional context (e.g., the user explicitly confirmed the command)
Audit trail — every verdict decision is logged with the full reasoning chain for post-hoc analysis

Gatekeeper is not yet implemented. The current aggregation logic is sufficient for the existing module set, and adding Gatekeeper prematurely would introduce complexity without immediate benefit.
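The current precedence rule (Deny > Modify > Approve) is simple enough to express as an ordered enum. The sketch below is an assumption about shape, not a copy of mod.rs: the EngineVerdict name is hypothetical, and returning Approve for an empty verdict list is a choice made here for illustration.

```rust
/// Variants are declared from weakest to strongest, so the derived
/// ordering gives Approve < Modify < Deny.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum EngineVerdict {
    Approve,
    Modify,
    Deny,
}

/// Pick the highest-precedence verdict among the module results.
/// Assumption: no verdicts at all is treated as Approve.
pub fn aggregate(verdicts: &[EngineVerdict]) -> EngineVerdict {
    verdicts
        .iter()
        .copied()
        .max()
        .unwrap_or(EngineVerdict::Approve)
}
```

Under this rule a single Deny from any module wins regardless of how many modules approve, which is the behavior confidence scoring in Gatekeeper is meant to refine.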

Performance Characteristics

Metric	Typical	Worst Case	Budget
Sentinel match time	1-2ms	8ms	50ms
Loopbreaker check	0.5ms	3ms	50ms
Tripwire scan	0.5ms	5ms	50ms
Total Reflex time	2-4ms	12ms	50ms

The worst-case numbers assume a 10KB command string with heavy encoding. Typical tool calls are 100-500 bytes and complete well under budget. The 38ms gap between the worst case (12ms) and the budget (50ms) leaves margin for future modules without requiring a budget increase.