Bug Hunt
A developer runs the test suite. Everything passes. They open a pull request. The reviewer approves it. The code ships to production. Three days later, a user in Germany discovers that entering an umlaut in the search field crashes the application. The test suite tested ASCII inputs. The reviewer checked logic flow. Nobody traced untrusted input from the search box through the URL encoder, into the database query, and back to the response renderer. Nobody checked all fourteen places where bugs hide. /draft:bughunt does.
Why Fourteen Dimensions
Bugs do not confine themselves to a single category. A race condition in state management creates a security vulnerability when stale auth tokens are used for requests. A performance issue in an algorithm becomes a reliability issue under load. An accessibility gap becomes a legal liability in regulated industries.
Most code review catches bugs in one or two dimensions — usually correctness and style. /draft:bughunt systematically analyzes code across fourteen dimensions because bugs cluster at the intersections between concerns, in the places where no single reviewer has expertise.
The bug report is the primary deliverable. Every verified bug appears in the final report regardless of whether a regression test can be written. Tests are supplementary output. Bughunt does not fix code — it finds defects, verifies them with evidence, and reports them with severity rankings and actionable fix descriptions.
The Fourteen Dimensions
Each dimension targets a distinct class of defect. Before analysis, /draft:bughunt determines which dimensions apply to the codebase — a CLI tool skips UI Responsiveness and Accessibility, a frontend-only repo skips API Contracts and Configuration. Skipped dimensions are documented with reasons, not silently omitted.
| # | Dimension | What It Catches |
|---|---|---|
| 1 | Correctness | Logic errors, off-by-one, boundary conditions, invalid state transitions, silent failures |
| 2 | Reliability | Crash paths, unhandled exceptions, broken recovery after errors, resource leaks, timeout handling |
| 3 | Security | XSS, injection, CSRF, auth bypass, secrets exposure, path traversal, insecure deserialization |
| 4 | Performance | O(n^2) in hot paths, memory leaks, blocking main thread, unnecessary allocations, unbounded growth |
| 5 | UI Responsiveness | Blocking operations, janky animations, layout shifts, forced reflows, poor loading states |
| 6 | Concurrency | Race conditions, deadlocks, lost updates, stale responses overwriting newer state, event ordering |
| 7 | State Management | Stale state, inconsistent state across components, source-of-truth violations, subscription leaks |
| 8 | API Contracts | Breaking changes, missing validation, schema drift, undocumented behavior dependencies |
| 9 | Accessibility | Missing ARIA labels, keyboard navigation gaps, broken tab order, color contrast, screen reader issues |
| 10 | Configuration | Missing defaults, environment-specific bugs, dev-only code in production, missing env var validation |
| 11 | Tests | Flaky tests, snapshot misuse, assertion density problems, tests that pass for wrong reasons |
| 12 | Dependencies | Known CVEs, unpinned versions, deprecated packages, license conflicts, typosquatting risk |
| 13 | Algorithmic Complexity | Exponential blowup, unbounded recursion, cache stampede, regex catastrophic backtracking |
| 14 | i18n/l10n | Hardcoded strings, locale-sensitive operations without locale, RTL issues, Unicode handling bugs |
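To make one of these classes concrete, here is a minimal sketch of a Dimension 6 bug and its guard: a stale async response overwriting newer state. All names (search, nextRequestId, the fetcher signature) are illustrative, not part of bughunt itself.

```typescript
// Each search keystroke starts a request. Without a sequence guard, a slow
// early response can arrive after a later one and clobber its results.
let latestRequestId = 0;

function nextRequestId(): number {
  return ++latestRequestId; // tag this request with a monotonic id
}

function isStale(requestId: number): boolean {
  return requestId !== latestRequestId; // a newer request has superseded us
}

async function search(
  query: string,
  fetcher: (q: string) => Promise<string[]>,
  setResults: (results: string[]) => void,
): Promise<void> {
  const id = nextRequestId();
  const results = await fetcher(query);
  if (isStale(id)) return; // drop the stale response
  setResults(results);     // only the latest response wins
}
```

Without the isStale check, the code is correct under sequential execution and wrong under interleaving — exactly the kind of defect that lives between the Concurrency and State Management dimensions.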
Taint Tracking
Dimension 3 (Security) includes end-to-end taint tracking — following untrusted input from its entry point through the entire codebase to every dangerous sink. This is not a surface-level check for innerHTML usage. It is a systematic trace of data flow.
Bughunt identifies all entry points: HTTP parameters, form data, file uploads, environment variables, CLI arguments, message queue payloads, webhook bodies. For each entry point, it traces the data through every function call, transformation, and storage operation until it reaches a dangerous sink — SQL queries, shell execution, eval, innerHTML, file path construction, URL construction, deserialization, or template rendering.
```
Entry point: req.query.search (HTTP GET parameter)
  → passed to buildQuery(search) at src/api/handler.ts:34
  → buildQuery concatenates into SQL string at src/db/queries.ts:78
  → NO sanitization, NO parameterized query
  → Sink: raw SQL execution at src/db/queries.ts:82
Verdict: SQL injection — user input reaches query without parameterization
```
For each sink, bughunt verifies whether sanitization or validation exists on every path from source to sink. A single unsanitized path is sufficient for exploitation, even if nine other paths are properly guarded.
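The "one bad path is enough" point can be sketched in a few lines. The function names, the query shape, and the inputReachesSink heuristic below are all illustrative — a real taint tracker works on data flow, not string matching.

```typescript
// Two code paths reach the same SQL sink. Path A parameterizes; path B
// concatenates. The sink is exploitable because ONE path is unguarded,
// even though the other is safe.

interface Query { sql: string; params: string[] }

function buildQuerySafe(search: string): Query {
  // Path A: parameterized — user input never enters the SQL text
  return { sql: "SELECT * FROM items WHERE name = ?", params: [search] };
}

function buildQueryUnsafe(search: string): Query {
  // Path B: concatenation — taint flows straight into the sink
  return { sql: `SELECT * FROM items WHERE name = '${search}'`, params: [] };
}

// A crude stand-in for a sink check: does the user's raw input appear
// verbatim in the SQL text handed to the database?
function inputReachesSink(query: Query, userInput: string): boolean {
  return query.sql.includes(userInput);
}
```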
The Verification Protocol
The difference between bughunt and a static analysis tool is the verification protocol. Static analysis tools produce hundreds of findings, most of them false positives. Bughunt applies a multi-step verification process to every candidate finding before it enters the report.
Six Verification Steps
- Code path verification — Read the actual code, trace the data flow, check for upstream guards and validators, verify the path is reachable in production
- Context cross-reference — Check .ai-context.md (is this behavior intentional?), tech-stack.md (does the framework handle it?), product.md (is this a requirement violation?), and existing tests (is this expected behavior?)
- Framework verification — Read the official documentation for the specific method or pattern, quote the relevant section, and check the framework version for behavior differences
- Codebase pattern check — Search for the same pattern elsewhere. If it appears consistently and works, investigate what makes this instance different
- False positive elimination — Is this dead code? Test-only code? Intentionally disabled? Explained by a comment?
- Pattern prevalence check — If the pattern appears 5+ times, sample three instances. If they all work correctly, do not report. If all are buggy, report the total count
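The pattern-prevalence rule in the last step can be expressed as a small decision function. The types, thresholds, and return labels here are an illustrative sketch of the rule as described, not bughunt internals.

```typescript
// When a suspicious pattern occurs 5+ times, sample three instances.
// All clean → likely an intentional convention, skip. All buggy →
// systemic defect, report the total count. Mixed → judge individually.

interface Occurrence { file: string; line: number; buggy: boolean }

type Decision = "report-all" | "skip" | "report-individually";

function prevalenceDecision(occurrences: Occurrence[]): Decision {
  if (occurrences.length < 5) return "report-individually"; // too few to infer a convention
  const sample = occurrences.slice(0, 3);                   // sample three instances
  if (sample.every(o => !o.buggy)) return "skip";
  if (sample.every(o => o.buggy)) return "report-all";
  return "report-individually";
}
```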
Every reported bug must include: the actual problematic code snippet, the trace showing how data reaches the bug, which verification checks were completed, and an explicit statement of why this is not a false positive. A finding without evidence is not a finding — it is speculation.
Confidence Filtering
Bughunt uses a strict confidence threshold. Only HIGH and CONFIRMED findings are included in the report. This is a deliberate design choice — a report with 50 findings where half are false positives teaches the team to ignore the report. A report with 8 verified findings, all actionable, teaches the team to trust it.
| Confidence | Criteria | Action |
|---|---|---|
| CONFIRMED | Verified through code trace, no mitigating factors, optionally confirmed by a failing test | Report |
| HIGH | Strong evidence, checked context, no obvious mitigation found | Report |
| MEDIUM | Suspicious but not fully verified | Ask user to confirm before including |
| LOW | Possible issue, likely handled elsewhere | Do not report |
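The table above maps directly onto a filter. This is a minimal sketch of that triage — the type names and the shape of the finding object are illustrative.

```typescript
// Only CONFIRMED and HIGH findings reach the report; MEDIUM findings
// are queued for user confirmation; LOW findings are dropped.

type Confidence = "CONFIRMED" | "HIGH" | "MEDIUM" | "LOW";

interface Finding { title: string; confidence: Confidence }

function triage(findings: Finding[]) {
  return {
    report: findings.filter(f => f.confidence === "CONFIRMED" || f.confidence === "HIGH"),
    askUser: findings.filter(f => f.confidence === "MEDIUM"),
    dropped: findings.filter(f => f.confidence === "LOW"),
  };
}
```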
Context-Driven Analysis
What separates /draft:bughunt from generic static analysis is its use of Draft context. When draft/.ai-context.md exists, bughunt leverages every documented architectural decision to find bugs that tools without context cannot detect:
- Critical invariants — The architecture documents that "user IDs are always UUIDs" or "all monetary values use integer cents." Bughunt checks for violations
- Concurrency model — The architecture specifies the threading model. Bughunt uses this to identify race conditions specific to that model
- Data state machines — If the architecture defines valid state transitions (e.g., Order: pending → confirmed → shipped), bughunt checks for code that allows invalid transitions
- Failure recovery matrix — If the architecture claims operations are idempotent, bughunt verifies those claims by tracing retry paths
- Consistency boundaries — Where eventual consistency is documented, bughunt looks for stale reads, lost events, and missing reconciliation at those seams
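The state-machine bullet is the easiest to picture in code. Assuming the architecture documents Order: pending → confirmed → shipped, any call site performing a transition outside this table is a candidate bug. The transition table below is the hypothetical documented one, not a real schema.

```typescript
// Documented transitions for the hypothetical Order state machine.
const allowedTransitions: Record<string, string[]> = {
  pending: ["confirmed"],
  confirmed: ["shipped"],
  shipped: [], // terminal state
};

// A checker in the spirit of bughunt's invariant verification: is this
// transition permitted by the documented state machine?
function isValidTransition(from: string, to: string): boolean {
  return (allowedTransitions[from] ?? []).includes(to);
}
```

Code that writes status = "shipped" on a pending order would fail this check even though it type-checks and passes ASCII-path tests.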
Regression Test Generation
For each verified bug, bughunt generates a regression test in the project's native test framework. The test is designed to fail against the current buggy code and pass after the fix — serving as both proof of the bug and protection against regression.
Before generating any test, bughunt discovers existing test coverage for the buggy code path. Each bug is classified as COVERED (existing test catches it), PARTIAL (test exists but misses this case), WRONG_ASSERTION (test asserts buggy behavior as correct), NO_COVERAGE (no test exists), or N/A (untestable code).
```
Bug: [HIGH] Security: Unsanitized input in comment renderer
File: src/components/Comment.tsx:88
Status: PARTIAL — Comment.test.tsx exists but only tests ASCII input
Existing test: Comment.test.tsx:23 — "renders comment text"
  → Tests basic string rendering, does not test HTML injection
```

New test case (add to Comment.test.tsx):

```tsx
describe('Comment XSS prevention', () => {
  it('should sanitize HTML in user-submitted comments', () => {
    const malicious = '<img src=x onerror=alert(1)>';
    render(<Comment text={malicious} />);
    expect(screen.queryByRole('img')).toBeNull();
  });
});
```
If no test framework is detected, bugs are still reported in full — the test section is marked N/A. The bug report is the primary deliverable; tests are supplementary.
Bughunt vs. Review
These commands serve different purposes and are designed to work together:
| Aspect | /draft:review | /draft:bughunt |
|---|---|---|
| Question | Does this code match the spec and follow conventions? | Does this code contain defects? |
| Scope | Changed files in a track | Entire repo, specific paths, or track files |
| Focus | Compliance, style, spec accuracy | Bugs across 14 dimensions |
| Output | Pass/fail with review comments | Severity-ranked bug report with evidence |
| Modifies code | No | No (report and regression tests only) |
The with-bughunt modifier on /draft:review runs both in sequence — first the review checks spec compliance and conventions, then bughunt sweeps for defects across all fourteen dimensions. This combined run inherits the scope from the review command, so there is no redundant scope confirmation.
The most dangerous bugs live at the intersection of two dimensions. A performance issue (Dimension 4) in an algorithm becomes a denial-of-service vulnerability (Dimension 3) when the input is user-controlled. A state management bug (Dimension 7) becomes a data loss issue (Dimension 2) when the stale state is persisted. Fourteen dimensions is not about being exhaustive for its own sake — it is about covering the cross-cutting spaces where single-dimension reviews have blind spots.
Dimension Deep Dives
Three of the fourteen dimensions deserve special attention because they catch bug classes that are commonly missed:
Dimension 11: Tests
Bughunt analyzes tests themselves for defects. This includes assertion density problems — tests with zero or weak assertions like expect(result).toBeDefined() that pass without actually verifying behavior. It catches test isolation violations where shared mutable state between test cases creates ordering dependencies. And it identifies test double misuse where mocks diverge from real implementation behavior, giving false confidence.
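A toy heuristic in the spirit of the assertion-density check might scan test source for assertions that can never fail meaningfully. The patterns and the regex below are illustrative, not bughunt's actual rules.

```typescript
// Flag test bodies whose only assertions are existence checks like
// expect(x).toBeDefined() — they pass without verifying behavior.

const weakAssertionPatterns = [/toBeDefined\(\)/, /toBeTruthy\(\)/, /not\.toThrow\(\)/];

function hasOnlyWeakAssertions(testSource: string): boolean {
  // Crude extraction of expect(...) chains from the test body.
  const asserts = testSource.match(/expect\([^)]*\)[^;]*/g) ?? [];
  if (asserts.length === 0) return true; // zero assertions: passes vacuously
  return asserts.every(a => weakAssertionPatterns.some(p => p.test(a)));
}
```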
Dimension 12: Dependencies
Beyond checking for known CVEs, bughunt examines typosquatting risk (packages with names suspiciously similar to popular ones), transitive dependency depth (deeply nested chains that increase supply chain attack surface), and license conflicts (GPL dependencies in MIT projects, AGPL in proprietary code).
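One possible typosquatting signal: a package name within edit distance one of a popular package, without being that package. The popular-package list and the distance-1 threshold are illustrative assumptions.

```typescript
// Classic Levenshtein edit distance via dynamic programming.
function editDistance(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
  return dp[a.length][b.length];
}

// Illustrative watch list of popular packages.
const popular = ["lodash", "express", "react"];

function looksTyposquatted(name: string): boolean {
  return popular.some(p => p !== name && editDistance(p, name) === 1);
}
```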
Dimension 13: Algorithmic Complexity
This dimension goes beyond obvious O(n^2) loops. Bughunt identifies regex catastrophic backtracking — nested quantifiers like (a+)+ applied to user-controlled input that can lock a CPU for minutes. It finds cache invalidation storms where a cache miss triggers recomputation that itself invalidates caches, creating a thundering herd. And it catches hot path inefficiency where linear scans are used where hash maps would suffice, or the same collection is sorted repeatedly.
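The backtracking case fits in four lines. The two regexes below match the same language, but behave very differently on near-miss input — the demo inputs are kept deliberately tiny.

```typescript
// Nested quantifier: on non-matching input like "aaa…aX" the engine
// explores roughly 2^n ways to split the a's, which can pin a CPU
// for minutes once n reaches ~30.
const vulnerable = /^(a+)+$/;

// Equivalent single quantifier: same language, linear time.
const safe = /^a+$/;

const shortGood = "aaaa"; // matches both
const shortBad = "aaaX";  // rejected by both; small enough to stay fast
```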