Reflex Engine
Reflex is the first engine to evaluate every tool call. Its job is simple: decide whether the call is safe to proceed, and decide fast. The latency budget is 50ms. In practice, Reflex completes in 2-4ms for typical inputs.
Reflex contains three modules — Sentinel, Loopbreaker, and Tripwire — plus a future aggregation layer called Gatekeeper.
Sentinel: Pattern Matching
Sentinel is the largest module by line count and the simplest by concept. It maintains a library of approximately 300 regex patterns organized into categories:
- Destructive commands: `rm -rf /`, `git push --force`, `docker system prune`, `DROP TABLE`
- Credential exposure: patterns matching API keys, tokens, passwords in command arguments or file contents
- Scope escalation: `sudo`, `chmod 777`, `chown root`, privilege elevation attempts
- Network exfiltration: `curl` to external hosts with local file arguments, `nc` listeners, reverse shells
- File system traversal: `../../etc/passwd`, symlink attacks, writes outside the project tree
Patterns are compiled into a single `RegexSet` at startup. `RegexSet` matches with finite automata (DFA-style matching) rather than backtracking, so all patterns are evaluated in a single pass over the input string and match time grows linearly with input length. Adversarial inputs cannot trigger catastrophic backtracking, and the input is not rescanned once per pattern.
When a pattern matches, Sentinel returns a Deny verdict with the pattern name and category. The deny message is minimal — one sentence identifying what triggered and why. Example:
Blocked: `rm -rf /` matches safety pattern `dangerous_recursive_delete` (category: destructive)
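The shape of a Deny verdict and its one-sentence message can be sketched as follows. This is a minimal illustration, not Sentinel's actual types: the `Category` and `Verdict` enums, the `deny_message` helper, and every category label other than `destructive` are assumptions made for the example. Only the message format is taken from the example above.

```rust
use std::fmt;

/// Hypothetical category enum mirroring the pattern library's groupings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Category {
    Destructive,
    CredentialExposure,
    ScopeEscalation,
    NetworkExfiltration,
    FileSystemTraversal,
}

impl fmt::Display for Category {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Only "destructive" appears in the source; the other labels are assumed.
        let label = match self {
            Category::Destructive => "destructive",
            Category::CredentialExposure => "credential_exposure",
            Category::ScopeEscalation => "scope_escalation",
            Category::NetworkExfiltration => "network_exfiltration",
            Category::FileSystemTraversal => "file_system_traversal",
        };
        f.write_str(label)
    }
}

/// Hypothetical verdict type: a Deny carries the pattern name and category.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Approve,
    Deny { pattern: &'static str, category: Category },
}

/// Builds the one-sentence deny message; returns None for an approval.
fn deny_message(command: &str, verdict: &Verdict) -> Option<String> {
    match verdict {
        Verdict::Approve => None,
        Verdict::Deny { pattern, category } => Some(format!(
            "Blocked: `{}` matches safety pattern `{}` (category: {})",
            command, pattern, category
        )),
    }
}
```

Calling `deny_message` with the command `rm -rf /`, the pattern `dangerous_recursive_delete`, and `Category::Destructive` reproduces the example message above. The matching step itself is deliberately left out of this sketch.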
The 10-Stage Pipeline
Sentinel does not simply run the input against the pattern set. It processes through a 10-stage pipeline that normalizes and enriches the input before matching:
1. Raw capture: extract the tool name, command string, and arguments from the JSON payload
2. Decode: handle base64, URL encoding, unicode escapes, and hex-encoded strings
3. Normalize whitespace: collapse tabs, newlines, and multiple spaces into single spaces
4. Expand aliases: resolve common shell aliases (`ll` → `ls -la`, `..` → `cd ..`)
5. Split compounds: break `&&`, `||`, `;`, and pipe chains into individual commands
6. Variable substitution: resolve `$HOME`, `$USER`, `$PWD` where values are known
7. Quote stripping: remove single and double quotes to expose the actual arguments
8. Pattern match: run the normalized input against the compiled `RegexSet`
9. Context check: verify the match against allowlists (project-specific overrides)
10. Verdict emit: produce Approve or Deny with metadata
Stages 1-7 are collectively called “normalization” and exist to defeat trivial evasion. Without them, an attacker could bypass `rm -rf /` detection with `r""m -r''f /` or `$(echo rm) -rf /`.
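Three of the normalization stages (whitespace collapsing, compound splitting, and quote stripping) are simple enough to sketch with the standard library alone. The function names below are hypothetical. The sketch deliberately ignores shell quoting rules when splitting and does not attempt decoding, alias expansion, or variable substitution.

```rust
/// Stage 3: collapse tabs, newlines, and runs of spaces into single spaces.
fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Stage 5: split on `&&`, `||`, `;`, and `|`.
/// Simplification: separators inside quotes are not treated specially.
/// Expects whitespace-collapsed input, so '\n' is free to use as a marker.
fn split_compounds(input: &str) -> Vec<String> {
    input
        .replace("&&", "\n")
        .replace("||", "\n") // must run before the single-pipe replacement
        .replace(';', "\n")
        .replace('|', "\n")
        .split('\n')
        .map(|part| part.trim().to_string())
        .filter(|part| !part.is_empty())
        .collect()
}

/// Stage 7: remove single and double quotes to expose the arguments.
fn strip_quotes(input: &str) -> String {
    input.chars().filter(|c| *c != '\'' && *c != '"').collect()
}

/// Runs the three stages in pipeline order and returns one entry per command.
fn normalize(raw: &str) -> Vec<String> {
    split_compounds(&collapse_whitespace(raw))
        .iter()
        .map(|cmd| strip_quotes(cmd))
        .collect()
}
```

With these stages, the evasion attempt `r""m -r''f /` normalizes to `rm -rf /`, which the pattern-match stage can then recognize. A production splitter would need to respect quoting before breaking on separators.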
Loopbreaker: Repetition Detection
Loopbreaker detects when the assistant is stuck in a loop — repeating the same commands, generating the same errors, or spiraling through increasingly desperate attempts to fix a problem.
It uses three detection strategies:
N-gram matching
Loopbreaker maintains a rolling window of the last 50 tool calls, each keyed by tool name plus command hash. It computes 2-grams and 3-grams over this window and flags any n-gram that appears more than 3 times. For example, if `bash:npm test` runs five times in a row, the 2-gram [`bash:npm test`, `bash:npm test`] occurs four times (overlapping occurrences are counted), which exceeds the threshold and indicates a tight loop.
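A minimal sketch of the n-gram check, assuming overlapping occurrences are counted and using plain strings as stand-ins for the tool name plus command hash key. The function and constant names are hypothetical; the window size (50) and the "more than 3 times" rule come from the description above.

```rust
use std::collections::HashMap;

const WINDOW_SIZE: usize = 50; // rolling window of recent tool calls
const MAX_REPEATS: usize = 3; // an n-gram seen more than this often is flagged

/// Highest occurrence count of any n-gram in `calls` (overlaps included).
fn max_ngram_count(calls: &[&str], n: usize) -> usize {
    if n == 0 || calls.len() < n {
        return 0;
    }
    let mut counts: HashMap<&[&str], usize> = HashMap::new();
    for gram in calls.windows(n) {
        *counts.entry(gram).or_insert(0) += 1;
    }
    counts.values().copied().max().unwrap_or(0)
}

/// Checks 2-grams and 3-grams over the most recent WINDOW_SIZE calls.
fn is_looping(history: &[&str]) -> bool {
    let start = history.len().saturating_sub(WINDOW_SIZE);
    let window = &history[start..];
    (2..=3).any(|n| max_ngram_count(window, n) > MAX_REPEATS)
}
```

Under these assumptions, five consecutive `bash:npm test` calls produce four overlapping 2-grams and are flagged, while three consecutive calls produce only two and are not.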
Spiral detection
A spiral is a pattern where the assistant tries progressively different approaches to the same problem, each one failing. Loopbreaker detects spirals by tracking error signatures (the first line of stderr output, hashed) and measuring the edit distance between consecutive commands. If error signatures repeat while commands diverge, the session is spiraling.
Spirals are harder to detect than loops because the commands are different each time. The key signal is error stability — the same error keeps appearing despite changing commands.
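The error-stability signal can be sketched as follows: hash the first line of stderr, compute the edit distance between consecutive commands, and count pairs where the error repeats while the command changes substantially. The `Attempt` struct, the function names, and both thresholds (`min_pairs`, `min_distance`) are assumptions for illustration. Levenshtein distance stands in for whichever edit distance Loopbreaker actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical record of one tool call and its stderr output.
struct Attempt<'a> {
    command: &'a str,
    stderr: &'a str,
}

/// Hash of the first stderr line; None when there was no error output.
fn error_signature(stderr: &str) -> Option<u64> {
    let first_line = stderr.lines().next().filter(|line| !line.trim().is_empty())?;
    let mut hasher = DefaultHasher::new();
    first_line.hash(&mut hasher);
    Some(hasher.finish())
}

/// Levenshtein distance, computed one row at a time.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = vec![i + 1];
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(ca != cb);
            cur.push(substitute.min(prev[j + 1] + 1).min(cur[j] + 1));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// True when at least `min_pairs` consecutive attempts share an error
/// signature while their commands differ by at least `min_distance` edits.
fn is_spiraling(attempts: &[Attempt], min_pairs: usize, min_distance: usize) -> bool {
    let diverging_pairs = attempts
        .windows(2)
        .filter(|pair| {
            let same_error = match (
                error_signature(pair[0].stderr),
                error_signature(pair[1].stderr),
            ) {
                (Some(x), Some(y)) => x == y,
                _ => false,
            };
            same_error && edit_distance(pair[0].command, pair[1].command) >= min_distance
        })
        .count();
    diverging_pairs >= min_pairs
}
```

An identical command repeated with the same error has an edit distance of zero, so this check leaves it to the n-gram detector rather than reporting a spiral.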
Entropy collapse
Entropy collapse occurs when the assistant’s output becomes increasingly repetitive. Loopbreaker estimates Shannon entropy over the last 20 tool-call outputs using a byte-level frequency distribution, which serves as an inexpensive stand-in for token-level repetition. When entropy drops below a threshold (currently 3.5 bits/byte for English text), it signals potential degeneration.
This detector has the highest false-positive rate and is weighted lowest in the final score. It exists primarily to catch edge cases where the assistant generates walls of identical text without triggering n-gram or spiral detectors.
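The entropy estimate is the standard Shannon formula, H = -Σ p(b) · log2 p(b), taken over byte frequencies. The sketch below assumes the last 20 outputs are pooled into one byte stream before measuring; the source does not say how outputs are combined, and the function and constant names are hypothetical.

```rust
const WINDOW: usize = 20; // number of recent tool-call outputs considered
const THRESHOLD_BITS_PER_BYTE: f64 = 3.5;

/// Shannon entropy in bits per byte, from a byte-level frequency distribution.
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }
    let total = data.len() as f64;
    counts
        .iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Pools the most recent WINDOW outputs (an assumption) and compares the
/// pooled entropy against the threshold. No data means no signal.
fn entropy_collapsed(outputs: &[&str]) -> bool {
    let start = outputs.len().saturating_sub(WINDOW);
    let pooled: Vec<u8> = outputs[start..]
        .iter()
        .flat_map(|output| output.bytes())
        .collect();
    !pooled.is_empty() && shannon_entropy(&pooled) < THRESHOLD_BITS_PER_BYTE
}
```

A string made of a single repeated byte has zero entropy, and two equally frequent bytes give exactly 1 bit/byte. Short but legitimate outputs can also fall under the threshold, which is consistent with this detector's high false-positive rate.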
Tripwire: Injection and Bypass Detection
Tripwire watches for attempts to manipulate Warden itself or bypass its protections:
- Prompt injection: tool calls that contain instructions like “ignore previous instructions” or “you are now in developer mode” embedded in file contents or command arguments
- Expansion bypass: attempts to construct dangerous commands through variable expansion, `eval`, or indirect execution (`bash -c "$(cat malicious.sh)"`)
- Hook evasion: commands that attempt to modify Warden’s configuration, disable hooks, or kill the daemon process
- Encoding evasion: commands where the dangerous payload is hidden behind base64, hex, or unicode encoding that Sentinel’s normalization might miss
Tripwire runs after Sentinel and uses a different detection approach: instead of pattern matching against known-bad strings, it looks for structural anomalies that suggest evasion intent. A command with three levels of nested quoting and a base64 decode pipe is suspicious regardless of what the decoded content is.
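Two structural signals of the kind described above can be sketched without decoding any payload: the nesting depth of `$( ... )` command substitutions, and a base64 decode stage piped into a shell. These heuristics, the function names, and the depth threshold are all assumptions for illustration, not Tripwire's actual rules. The parsing is deliberately naive: it ignores quoting and plain parentheses.

```rust
/// Maximum nesting depth of `$( ... )` command substitutions.
/// Simplification: every `)` closes the innermost open substitution.
fn substitution_depth(cmd: &str) -> usize {
    let bytes = cmd.as_bytes();
    let (mut depth, mut max_depth) = (0usize, 0usize);
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'$' && bytes.get(i + 1) == Some(&b'(') {
            depth += 1;
            max_depth = max_depth.max(depth);
            i += 2;
            continue;
        }
        if bytes[i] == b')' && depth > 0 {
            depth -= 1;
        }
        i += 1;
    }
    max_depth
}

/// True when a `base64 -d` / `base64 --decode` stage feeds a later shell stage.
fn decodes_into_shell(cmd: &str) -> bool {
    let stages: Vec<&str> = cmd.split('|').map(str::trim).collect();
    stages.iter().enumerate().any(|(i, stage)| {
        let words: Vec<&str> = stage.split_whitespace().collect();
        let is_decode = words.first() == Some(&"base64")
            && words.iter().any(|w| *w == "-d" || *w == "--decode");
        is_decode
            && stages[i + 1..].iter().any(|later| {
                matches!(later.split_whitespace().next(), Some("sh" | "bash" | "zsh"))
            })
    })
}

/// Flags a command on structure alone; the depth threshold of 2 is assumed.
fn looks_evasive(cmd: &str) -> bool {
    substitution_depth(cmd) >= 2 || decodes_into_shell(cmd)
}
```

Under this sketch, `echo cm0gLXJmIC8= | base64 -d | sh` is flagged without inspecting the decoded content, while `base64 -d notes.b64` on its own is not.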
Gatekeeper: Central Verdict (Future)
Gatekeeper is a planned module that will serve as the single point of verdict aggregation for the Reflex engine. Currently, verdict aggregation happens in the engine’s `mod.rs` using simple precedence rules (Deny > Modify > Approve). Gatekeeper will add:
- Confidence scoring — each module’s verdict carries a confidence level, allowing Gatekeeper to weigh weak denials against strong approvals
- Appeal mechanism — a denied call can be re-evaluated with additional context (e.g., the user explicitly confirmed the command)
- Audit trail — every verdict decision is logged with the full reasoning chain for post-hoc analysis
Gatekeeper is not yet implemented. The current aggregation logic is sufficient for the existing module set, and adding Gatekeeper prematurely would introduce complexity without immediate benefit.
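The current precedence rule (Deny > Modify > Approve) maps naturally onto an ordered enum, where aggregation is just taking the maximum. The sketch below is an illustration under that assumption, not the contents of `mod.rs`. The `Verdict` and `aggregate` names are hypothetical, and returning Approve when no module reports a verdict is an assumed default.

```rust
/// Variants are declared in ascending precedence, so the derived ordering
/// gives Approve < Modify < Deny.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Verdict {
    Approve,
    Modify,
    Deny,
}

/// Returns the highest-precedence verdict among the module results.
/// Assumption: an empty slice (no module ran) aggregates to Approve.
fn aggregate(verdicts: &[Verdict]) -> Verdict {
    verdicts.iter().copied().max().unwrap_or(Verdict::Approve)
}
```

A single Deny from any module wins regardless of how many modules approve, which is the behavior Gatekeeper's confidence scoring is meant to refine.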
Performance Characteristics
| Metric | Typical | Worst Case | Budget |
|---|---|---|---|
| Sentinel match time | 1-2ms | 8ms | 50ms |
| Loopbreaker check | 0.5ms | 3ms | 50ms |
| Tripwire scan | 0.5ms | 5ms | 50ms |
| Total Reflex time | 2-4ms | 12ms | 50ms |
The worst-case numbers assume a 10KB command string with heavy encoding. Typical tool calls are 100-500 bytes and complete well under budget. The gap between worst-case (12ms) and budget (50ms) provides margin for future modules without requiring a budget increase.
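The headroom calculation can be made concrete with a small timing helper. This is a sketch with hypothetical names (`REFLEX_BUDGET`, `timed`, `remaining_budget`); only the 50ms budget comes from the table above.

```rust
use std::time::{Duration, Instant};

/// Latency budget for the whole Reflex engine.
const REFLEX_BUDGET: Duration = Duration::from_millis(50);

/// Runs a check and returns its result together with the elapsed time.
fn timed<T>(check: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = check();
    (result, start.elapsed())
}

/// Remaining budget after `spent`, or None if the budget was exceeded.
fn remaining_budget(spent: Duration) -> Option<Duration> {
    REFLEX_BUDGET.checked_sub(spent)
}
```

With the 12ms worst case from the table, `remaining_budget` reports 38ms of headroom for future modules.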