Bug Hunt
A developer runs the test suite. Everything passes. They open a pull request. The reviewer approves it. The code ships to production. Three days later, a user in Germany discovers that entering an umlaut in the search field crashes the application. The test suite tested ASCII inputs. The reviewer checked logic flow. Nobody traced untrusted input from the search box through the URL encoder, into the database query, and back to the response renderer. Nobody checked all fourteen places where bugs hide. /draft:bughunt does.
Why Fourteen Dimensions
Bugs do not confine themselves to a single category. A race condition in state management creates a security vulnerability when stale auth tokens are used for requests. A performance issue in an algorithm becomes a reliability issue under load. An accessibility gap becomes a legal liability in regulated industries.
Most code review catches bugs in one or two dimensions — usually correctness and style. /draft:bughunt systematically analyzes code across fourteen dimensions because bugs cluster at the intersections between concerns, in the places where no single reviewer has expertise.
The bug report is the primary deliverable. Every verified bug appears in the final report regardless of whether a regression test can be written. Tests are supplementary output. Bughunt does not fix code — it finds defects, verifies them with evidence, and reports them with severity rankings and actionable fix descriptions.
The Fourteen Dimensions
Each dimension targets a distinct class of defect. Before analysis, /draft:bughunt determines which dimensions apply to the codebase — a CLI tool skips UI Responsiveness and Accessibility, a frontend-only repo skips API Contracts and Configuration. Skipped dimensions are documented with reasons, not silently omitted.
| # | Dimension | What It Catches |
|---|---|---|
| 1 | Correctness | Logic errors, off-by-one, boundary conditions, invalid state transitions, silent failures |
| 2 | Reliability | Crash paths, unhandled exceptions, broken recovery after errors, resource leaks, timeout handling |
| 3 | Security | XSS, injection, CSRF, auth bypass, secrets exposure, path traversal, insecure deserialization |
| 4 | Performance | O(n^2) in hot paths, memory leaks, blocking main thread, unnecessary allocations, unbounded growth |
| 5 | UI Responsiveness | Blocking operations, janky animations, layout shifts, forced reflows, poor loading states |
| 6 | Concurrency | Race conditions, deadlocks, lost updates, stale responses overwriting newer state, event ordering |
| 7 | State Management | Stale state, inconsistent state across components, source-of-truth violations, subscription leaks |
| 8 | API Contracts | Breaking changes, missing validation, schema drift, undocumented behavior dependencies |
| 9 | Accessibility | Missing ARIA labels, keyboard navigation gaps, broken tab order, color contrast, screen reader issues |
| 10 | Configuration | Missing defaults, environment-specific bugs, dev-only code in production, missing env var validation |
| 11 | Tests | Flaky tests, snapshot misuse, assertion density problems, tests that pass for wrong reasons |
| 12 | Dependencies | Known CVEs, unpinned versions, deprecated packages, license conflicts, typosquatting risk |
| 13 | Algorithmic Complexity | Exponential blowup, unbounded recursion, cache stampede, regex catastrophic backtracking |
| 14 | i18n/l10n | Hardcoded strings, locale-sensitive operations without locale, RTL issues, Unicode handling bugs |
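To make one of these classes concrete, here is a minimal sketch of a Dimension 6 bug and its guard: a stale async response overwriting newer state. All names (search, nextRequestId, the fetcher signature) are illustrative, not part of bughunt itself.

```typescript
// Each search keystroke starts a request. Without a sequence guard, a slow
// early response can arrive after a later one and clobber its results.
let latestRequestId = 0;

function nextRequestId(): number {
  return ++latestRequestId; // tag this request with a monotonic id
}

function isStale(requestId: number): boolean {
  return requestId !== latestRequestId; // a newer request has superseded us
}

async function search(
  query: string,
  fetcher: (q: string) => Promise<string[]>,
  setResults: (results: string[]) => void,
): Promise<void> {
  const id = nextRequestId();
  const results = await fetcher(query);
  if (isStale(id)) return; // drop the stale response
  setResults(results);     // only the latest response wins
}
```

Without the isStale check, the code is correct under sequential execution and wrong under interleaving — exactly the kind of defect that lives between the Concurrency and State Management dimensions.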
Taint Tracking
Dimension 3 (Security) includes end-to-end taint tracking — following untrusted input from its entry point through the entire codebase to every dangerous sink. This is not a surface-level check for innerHTML usage. It is a systematic trace of data flow.
Bughunt identifies all entry points: HTTP parameters, form data, file uploads, environment variables, CLI arguments, message queue payloads, webhook bodies. For each entry point, it traces the data through every function call, transformation, and storage operation until it reaches a dangerous sink — SQL queries, shell execution, eval, innerHTML, file path construction, URL construction, deserialization, or template rendering.
```
Entry point: req.query.search (HTTP GET parameter)
  → passed to buildQuery(search) at src/api/handler.ts:34
  → buildQuery concatenates into SQL string at src/db/queries.ts:78
  → NO sanitization, NO parameterized query
  → Sink: raw SQL execution at src/db/queries.ts:82
Verdict: SQL injection — user input reaches query without parameterization
```
For each sink, bughunt verifies whether sanitization or validation exists on every path from source to sink. A single unsanitized path is sufficient for exploitation, even if nine other paths are properly guarded.
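The "one bad path is enough" point can be sketched in a few lines. The function names, the query shape, and the inputReachesSink heuristic below are all illustrative — a real taint tracker works on data flow, not string matching.

```typescript
// Two code paths reach the same SQL sink. Path A parameterizes; path B
// concatenates. The sink is exploitable because ONE path is unguarded,
// even though the other is safe.

interface Query { sql: string; params: string[] }

function buildQuerySafe(search: string): Query {
  // Path A: parameterized — user input never enters the SQL text
  return { sql: "SELECT * FROM items WHERE name = ?", params: [search] };
}

function buildQueryUnsafe(search: string): Query {
  // Path B: concatenation — taint flows straight into the sink
  return { sql: `SELECT * FROM items WHERE name = '${search}'`, params: [] };
}

// A crude stand-in for a sink check: does the user's raw input appear
// verbatim in the SQL text handed to the database?
function inputReachesSink(query: Query, userInput: string): boolean {
  return query.sql.includes(userInput);
}
```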
The Verification Protocol
The difference between bughunt and a static analysis tool is the verification protocol. Static analysis tools produce hundreds of findings, most of them false positives. Bughunt applies a multi-step verification process to every candidate finding before it enters the report.
Six Verification Steps
- Code path verification — Read the actual code, trace the data flow, check for upstream guards and validators, verify the path is reachable in production
- Context cross-reference — Check .ai-context.md (is this behavior intentional?), tech-stack.md (does the framework handle it?), product.md (is this a requirement violation?), and existing tests (is this expected behavior?)
- Framework verification — Read the official documentation for the specific method or pattern, quote the relevant section, and check the framework version for behavior differences
- Codebase pattern check — Search for the same pattern elsewhere. If it appears consistently and works, investigate what makes this instance different
- False positive elimination — Is this dead code? Test-only code? Intentionally disabled? Explained by a comment?
- Pattern prevalence check — If the pattern appears 5+ times, sample three instances. If they all work correctly, do not report. If all are buggy, report the total count
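The pattern-prevalence rule in the last step can be expressed as a small decision function. The types, thresholds, and return labels here are an illustrative sketch of the rule as described, not bughunt internals.

```typescript
// When a suspicious pattern occurs 5+ times, sample three instances.
// All clean → likely an intentional convention, skip. All buggy →
// systemic defect, report the total count. Mixed → judge individually.

interface Occurrence { file: string; line: number; buggy: boolean }

type Decision = "report-all" | "skip" | "report-individually";

function prevalenceDecision(occurrences: Occurrence[]): Decision {
  if (occurrences.length < 5) return "report-individually"; // too few to infer a convention
  const sample = occurrences.slice(0, 3);                   // sample three instances
  if (sample.every(o => !o.buggy)) return "skip";
  if (sample.every(o => o.buggy)) return "report-all";
  return "report-individually";
}
```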
Every reported bug must include: the actual problematic code snippet, the trace showing how data reaches the bug, which verification checks were completed, and an explicit statement of why this is not a false positive. A finding without evidence is not a finding — it is speculation.
Confidence Filtering
Bughunt uses a strict confidence threshold. Only HIGH and CONFIRMED findings are included in the report. This is a deliberate design choice — a report with 50 findings where half are false positives teaches the team to ignore the report. A report with 8 verified findings, all actionable, teaches the team to trust it.
| Confidence | Criteria | Action |
|---|---|---|
| CONFIRMED | Verified through code trace, no mitigating factors, optionally confirmed by a failing test | Report |
| HIGH | Strong evidence, checked context, no obvious mitigation found | Report |
| MEDIUM | Suspicious but not fully verified | Ask user to confirm before including |
| LOW | Possible issue, likely handled elsewhere | Do not report |
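The table above maps directly onto a filter. This is a minimal sketch of that triage — the type names and the shape of the finding object are illustrative.

```typescript
// Only CONFIRMED and HIGH findings reach the report; MEDIUM findings
// are queued for user confirmation; LOW findings are dropped.

type Confidence = "CONFIRMED" | "HIGH" | "MEDIUM" | "LOW";

interface Finding { title: string; confidence: Confidence }

function triage(findings: Finding[]) {
  return {
    report: findings.filter(f => f.confidence === "CONFIRMED" || f.confidence === "HIGH"),
    askUser: findings.filter(f => f.confidence === "MEDIUM"),
    dropped: findings.filter(f => f.confidence === "LOW"),
  };
}
```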
Context-Driven Analysis
What separates /draft:bughunt from generic static analysis is its use of Draft context. When draft/.ai-context.md exists, bughunt leverages every documented architectural decision to find bugs that tools without context cannot detect:
- Critical invariants — The architecture documents that "user IDs are always UUIDs" or "all monetary values use integer cents." Bughunt checks for violations
- Concurrency model — The architecture specifies the threading model. Bughunt uses this to identify race conditions specific to that model
- Data state machines — If the architecture defines valid state transitions (e.g., Order: pending → confirmed → shipped), bughunt checks for code that allows invalid transitions
- Failure recovery matrix — If the architecture claims operations are idempotent, bughunt verifies those claims by tracing retry paths
- Consistency boundaries — Where eventual consistency is documented, bughunt looks for stale reads, lost events, and missing reconciliation at those seams
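The state-machine bullet is the easiest to picture in code. Assuming the architecture documents Order: pending → confirmed → shipped, any call site performing a transition outside this table is a candidate bug. The transition table below is the hypothetical documented one, not a real schema.

```typescript
// Documented transitions for the hypothetical Order state machine.
const allowedTransitions: Record<string, string[]> = {
  pending: ["confirmed"],
  confirmed: ["shipped"],
  shipped: [], // terminal state
};

// A checker in the spirit of bughunt's invariant verification: is this
// transition permitted by the documented state machine?
function isValidTransition(from: string, to: string): boolean {
  return (allowedTransitions[from] ?? []).includes(to);
}
```

Code that writes status = "shipped" on a pending order would fail this check even though it type-checks and passes ASCII-path tests.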
Regression Test Generation
For each verified bug, bughunt generates a regression test in the project's native test framework. The test is designed to fail against the current buggy code and pass after the fix — serving as both proof of the bug and protection against regression.
Before generating any test, bughunt discovers existing test coverage for the buggy code path. Each bug is classified as COVERED (existing test catches it), PARTIAL (test exists but misses this case), WRONG_ASSERTION (test asserts buggy behavior as correct), NO_COVERAGE (no test exists), or N/A (untestable code).
```
Bug: [HIGH] Security: Unsanitized input in comment renderer
File: src/components/Comment.tsx:88
Status: PARTIAL — Comment.test.tsx exists but only tests ASCII input
Existing test: Comment.test.tsx:23 — "renders comment text"
  → Tests basic string rendering, does not test HTML injection
```

New test case (add to Comment.test.tsx):

```tsx
describe('Comment XSS prevention', () => {
  it('should sanitize HTML in user-submitted comments', () => {
    const malicious = '<img src=x onerror=alert(1)>';
    render(<Comment text={malicious} />);
    expect(screen.queryByRole('img')).toBeNull();
  });
});
```
If no test framework is detected, bugs are still reported in full — the test section is marked N/A. The bug report is the primary deliverable; tests are supplementary.
Bughunt vs. Review
These commands serve different purposes and are designed to work together:
| Aspect | /draft:review | /draft:bughunt |
|---|---|---|
| Question | Does this code match the spec and follow conventions? | Does this code contain defects? |
| Scope | Changed files in a track | Entire repo, specific paths, or track files |
| Focus | Compliance, style, spec accuracy | Bugs across 14 dimensions |
| Output | Pass/fail with review comments | Severity-ranked bug report with evidence |
| Modifies code | No | No (report and regression tests only) |
The with-bughunt modifier on /draft:review runs both in sequence — first the review checks spec compliance and conventions, then bughunt sweeps for defects across all fourteen dimensions. This combined run inherits the scope from the review command, so there is no redundant scope confirmation.
The most dangerous bugs live at the intersection of two dimensions. A performance issue (Dimension 4) in an algorithm becomes a denial-of-service vulnerability (Dimension 3) when the input is user-controlled. A state management bug (Dimension 7) becomes a data loss issue (Dimension 2) when the stale state is persisted. Fourteen dimensions is not about being exhaustive for its own sake — it is about covering the cross-cutting spaces where single-dimension reviews have blind spots.
Dimension Deep Dives
Three of the fourteen dimensions deserve special attention because they catch bug classes that are commonly missed:
Dimension 11: Tests
Bughunt analyzes tests themselves for defects. This includes assertion density problems — tests with zero or weak assertions like expect(result).toBeDefined() that pass without actually verifying behavior. It catches test isolation violations where shared mutable state between test cases creates ordering dependencies. And it identifies test double misuse where mocks diverge from real implementation behavior, giving false confidence.
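A toy heuristic in the spirit of the assertion-density check might scan test source for assertions that can never fail meaningfully. The patterns and the regex below are illustrative, not bughunt's actual rules.

```typescript
// Flag test bodies whose only assertions are existence checks like
// expect(x).toBeDefined() — they pass without verifying behavior.

const weakAssertionPatterns = [/toBeDefined\(\)/, /toBeTruthy\(\)/, /not\.toThrow\(\)/];

function hasOnlyWeakAssertions(testSource: string): boolean {
  // Crude extraction of expect(...) chains from the test body.
  const asserts = testSource.match(/expect\([^)]*\)[^;]*/g) ?? [];
  if (asserts.length === 0) return true; // zero assertions: passes vacuously
  return asserts.every(a => weakAssertionPatterns.some(p => p.test(a)));
}
```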
Dimension 12: Dependencies
Beyond checking for known CVEs, bughunt examines typosquatting risk (packages with names suspiciously similar to popular ones), transitive dependency depth (deeply nested chains that increase supply chain attack surface), and license conflicts (GPL dependencies in MIT projects, AGPL in proprietary code).
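One possible typosquatting signal: a package name within edit distance one of a popular package, without being that package. The popular-package list and the distance-1 threshold are illustrative assumptions.

```typescript
// Classic Levenshtein edit distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
  return dp[a.length][b.length];
}

// Illustrative watch list of popular packages.
const popular = ["lodash", "express", "react"];

function looksTyposquatted(name: string): boolean {
  return popular.some(p => p !== name && editDistance(p, name) === 1);
}
```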
Dimension 13: Algorithmic Complexity
This dimension goes beyond obvious O(n^2) loops. Bughunt identifies regex catastrophic backtracking — nested quantifiers like (a+)+ applied to user-controlled input that can lock a CPU for minutes. It finds cache invalidation storms where a cache miss triggers recomputation that itself invalidates caches, creating a thundering herd. And it catches hot path inefficiency where linear scans are used where hash maps would suffice, or the same collection is sorted repeatedly.
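The backtracking case fits in four lines. The two regexes below match the same language, but behave very differently on near-miss input — the demo inputs are kept deliberately tiny.

```typescript
// Nested quantifier: on non-matching input like "aaa…aX" the engine
// explores roughly 2^n ways to split the a's, which can pin a CPU
// for minutes once n reaches ~30.
const vulnerable = /^(a+)+$/;

// Equivalent single quantifier: same language, linear time.
const safe = /^a+$/;

const shortGood = "aaaa"; // matches both
const shortBad = "aaaX";  // rejected by both; small enough to stay fast
```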