Example: Bug Fix with Hypothesis Workflow

Using the hypothesis workflow for systematic debugging

This example demonstrates the hypothesis workflow - a scientific approach to debugging complex issues where the root cause isn't obvious.

The Bug

Users report intermittent 500 errors during checkout. It happens randomly, affects about 2% of checkouts, and has been occurring for two weeks.

Why Hypothesis Workflow?

This bug is a good candidate for hypothesis workflow because:

  • Intermittent - Not reproducible on demand
  • Unknown cause - Stack traces show a database timeout, but not why
  • Production impact - Losing revenue, need systematic fix
  • Complex system - Checkout touches many services

A "shotgun debugging" approach (trying random fixes) would be inefficient and risky.

Phase 1: Elaboration

User: /elaborate

I want to fix intermittent 500 errors on checkout. About 2% of
checkouts fail with database timeout errors. Been happening for
2 weeks. I want to use the hypothesis workflow.

AI (Elaborator): "I'll set this up for hypothesis-driven debugging.

# .ai-dlc/fix-checkout-500/INTENT.md

---
status: active
workflow: hypothesis
started_at: 2026-02-02T10:00:00Z
---

# Fix Intermittent Checkout 500 Errors

## Problem Statement
~2% of checkout attempts fail with HTTP 500 errors. Stack traces
indicate database connection timeout. Issue started approximately
2 weeks ago.

## Completion Criteria
- [ ] Root cause identified with evidence
- [ ] Fix implemented and verified
- [ ] Regression test added
- [ ] Error rate returns to baseline (<0.1%)
- [ ] Monitoring added to detect recurrence
- [ ] All tests pass

Since this is a single investigation, I'll create one unit for the full debug cycle."

Phase 2: Construction

Observer Hat (OHOTL)

AI (Observer): "Gathering data about the checkout failures...

Error Logs Analysis:

Querying logs for checkout 500 errors, last 7 days...

Findings:
- 847 errors out of 42,350 checkouts (2.0%)
- Error: "SequelizeConnectionAcquireTimeoutError"
- Timeout after 10s waiting for connection
- No correlation with specific users
- No correlation with cart size or products

Timing Analysis:

Errors by hour:
  00:00-06:00: 12 errors (0.8%)
  06:00-12:00: 187 errors (1.5%)
  12:00-18:00: 423 errors (2.8%)
  18:00-24:00: 225 errors (2.1%)

Peak error rate correlates with peak traffic.

Recent Changes (last 3 weeks):

- Week -3: No checkout changes
- Week -2: Added inventory check before payment (PR #1234)
- Week -1: Various unrelated changes

Database Metrics:

Current pool configuration:
  - Max connections: 25
  - Idle timeout: 10s
  - Acquire timeout: 10s

Connection pool stats (peak hour):
  - Active: 23-25 (often at max)
  - Waiting: 0-8
  - Idle: 0-2

Summary of Observations:

  1. Errors are connection pool exhaustion (not query failures)
  2. Correlates with traffic peaks
  3. Started around when inventory check was added
  4. Pool frequently at maximum connections
  5. No single slow query identified"
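
The timing analysis above comes down to grouping checkout log entries into time buckets and dividing errors by volume. A minimal sketch of that aggregation, assuming the entries cover all checkout attempts and have already been parsed into objects with a timestamp (a Date) and an HTTP status; the entry shape and any loader around it are illustrative, not part of the actual codebase:

// Group checkout log entries into 6-hour buckets and report error rates.
// `entries` is assumed to look like [{ timestamp: Date, status: number }, ...].
function errorRateByBucket(entries) {
  const buckets = {
    '00:00-06:00': { total: 0, errors: 0 },
    '06:00-12:00': { total: 0, errors: 0 },
    '12:00-18:00': { total: 0, errors: 0 },
    '18:00-24:00': { total: 0, errors: 0 },
  };
  for (const entry of entries) {
    const hour = entry.timestamp.getHours();
    const key =
      hour < 6 ? '00:00-06:00' :
      hour < 12 ? '06:00-12:00' :
      hour < 18 ? '12:00-18:00' : '18:00-24:00';
    buckets[key].total += 1;
    if (entry.status === 500) buckets[key].errors += 1;
  }
  for (const [key, b] of Object.entries(buckets)) {
    const rate = b.total ? ((b.errors / b.total) * 100).toFixed(1) : '0.0';
    console.log(`${key}: ${b.errors} errors (${rate}%)`);
  }
}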

Hypothesizer Hat (HITL)

AI (Hypothesizer): "Based on observations, I've formed these hypotheses:

Hypothesis 1: Connection Pool Too Small (High probability)

  • Evidence: Pool at 25 max, often exhausted
  • Theory: Traffic growth outpaced pool size
  • Test: Increase pool size, measure error rate

Hypothesis 2: New Inventory Check Holds Connections (Medium probability)

  • Evidence: Timing aligns with PR #1234
  • Theory: Inventory check doesn't release connections properly
  • Test: Review code, check for connection leaks

Hypothesis 3: Slow Query Blocking Pool (Medium probability)

  • Evidence: Timeouts during high traffic
  • Theory: Some query blocks connections under load
  • Test: Enable slow query logging, analyze during peak

Hypothesis 4: Database Performance Degraded (Low probability)

  • Evidence: Issue is recent
  • Theory: Database server resource constraints
  • Test: Check DB server metrics, disk I/O, CPU

Ranking:

  1. Connection pool too small (simplest, matches data)
  2. Inventory check connection leak (timing matches)
  3. Slow query (possible contributing factor)
  4. Database degradation (least likely, would affect other operations)

Which hypothesis should we test first?"

User: Start with #1, but also investigate #2 since it matches the timing.

Experimenter Hat (OHOTL)

AI (Experimenter): "Testing hypotheses 1 and 2...

Experiment 1: Connection Pool Analysis

// Adding pool monitoring: sample connection pool stats once per second.
// numUsed/numPendingAcquires/numFree are tarn.js-style accessors; adjust
// them to whatever your pool library actually exposes.
const poolStats = [];
setInterval(() => {
  poolStats.push({
    timestamp: Date.now(),
    active: pool.numUsed(),             // connections currently checked out
    waiting: pool.numPendingAcquires(), // requests queued for a connection
    available: pool.numFree(),          // idle connections ready for reuse
  });
}, 1000);

Deployed monitoring to staging, ran load test simulating peak traffic.

Results:

Load: 100 concurrent checkouts

With pool size 25:
- Average wait time: 4.2s
- Timeouts: 8 requests
- Peak waiting: 12

With pool size 50:
- Average wait time: 0.3s
- Timeouts: 0 requests
- Peak waiting: 2

With pool size 75:
- Average wait time: 0.1s
- Timeouts: 0 requests
- Peak waiting: 0

Conclusion: Pool size of 25 is insufficient for current load. Increasing to 50 eliminates timeouts in testing.
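
The load test itself can be as simple as firing a batch of concurrent checkout requests at staging and counting failures. A rough sketch, assuming Node 18+ for the built-in fetch; the staging URL and request payload are placeholders, and a dedicated tool such as k6 or artillery would do the same job:

// Fire `concurrency` checkout requests at once and report how many failed.
async function loadTest(concurrency = 100) {
  const requests = Array.from({ length: concurrency }, () =>
    fetch('https://staging.example.com/api/checkout', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ items: [{ productId: 'test-sku', quantity: 1 }] }),
    })
  );
  const results = await Promise.allSettled(requests);
  const failed = results.filter(
    (r) => r.status === 'rejected' || r.value.status >= 500
  );
  console.log(`${failed.length}/${concurrency} checkouts failed`);
}

loadTest().catch(console.error);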


Experiment 2: Inventory Check Code Review

// PR #1234 - The inventory check
async function checkInventory(cartItems) {
  const connection = await getConnection();  // Acquires from pool
  try {
    for (const item of cartItems) {
      // One round trip per cart item, all on the same connection
      // (assumes a mysql2-style [rows, fields] return shape)
      const [rows] = await connection.query(
        'SELECT quantity FROM inventory WHERE product_id = ?',
        [item.productId]
      );
      if (!rows.length || rows[0].quantity < item.quantity) {
        throw new InsufficientStockError(item);
      }
    }
  } finally {
    connection.release();  // Properly released
  }
}

Code analysis:

  • Connection is properly acquired and released
  • Uses try/finally pattern correctly
  • BUT: Sequential queries for each cart item

Issue Found: While not a leak, the sequential queries hold a connection longer than necessary. A cart with 10 items makes 10 sequential queries on one connection.

Optimized approach:

async function checkInventory(cartItems) {
  const productIds = cartItems.map(i => i.productId);
  // Single batched query, so the connection is held only briefly
  // (db.query is assumed to return rows directly)
  const stocks = await db.query(
    'SELECT product_id, quantity FROM inventory WHERE product_id IN (?)',
    [productIds]
  );
  const quantityById = new Map(stocks.map(s => [s.product_id, s.quantity]));
  for (const item of cartItems) {
    if ((quantityById.get(item.productId) || 0) < item.quantity) {
      throw new InsufficientStockError(item);
    }
  }
}

Test Results:

  • Original: 340ms connection hold time (10 items)
  • Optimized: 45ms connection hold time (10 items)
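
Hold times like these can be measured by timestamping the acquire/release pair. A small sketch of that instrumentation; withTimedConnection is an illustrative name, and getConnection is the same pool helper used in the code above:

// Wrap a unit of work so it logs how long it kept a pooled connection checked out.
async function withTimedConnection(label, fn) {
  const connection = await getConnection();
  const start = Date.now();
  try {
    return await fn(connection);
  } finally {
    connection.release();
    console.log(`${label}: connection held for ${Date.now() - start}ms`);
  }
}

// Usage: time any block that checks out a connection.
// const rows = await withTimedConnection('inventory-check', (conn) =>
//   conn.query('SELECT quantity FROM inventory WHERE product_id = ?', ['test-sku'])
// );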

Combined Impact: Both issues contribute to the problem:

  1. Pool too small for traffic
  2. Inventory check holds connections 7x longer than needed"

Analyst Hat (HITL)

AI (Analyst): "Analyzing results and proposing fix...

Root Cause Confirmed: Two contributing factors:

  1. Primary: Connection pool size (25) insufficient for current traffic levels
  2. Secondary: Inventory check holds connections longer than necessary due to sequential queries

Evidence:

  • Pool exhaustion directly correlates with errors
  • Increasing pool eliminates timeout errors in testing
  • Inventory code change timing matches issue start
  • Optimizing inventory query reduces connection hold time by 87%

Proposed Fix:

# Phase 1: Immediate relief
database:
  pool:
    max: 50  # Was 25
    min: 10  # Was 5
    acquireTimeout: 30000  # Was 10000

# Phase 2: Code optimization
# Batch inventory queries (PR attached)
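
Assuming the service uses Sequelize (as the error name in the logs suggests), those settings map onto the constructor's pool options roughly as follows; the config loader and connection URL are placeholders:

const { Sequelize } = require('sequelize');

const config = loadDatabaseConfig();  // hypothetical loader for the YAML above

const sequelize = new Sequelize(config.url, {
  pool: {
    max: config.pool.max,                 // 50, up from 25
    min: config.pool.min,                 // 10, up from 5
    acquire: config.pool.acquireTimeout,  // 30000 ms, up from 10000
  },
});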

Verification Plan:

  1. Deploy pool increase to staging
  2. Run load test confirming no timeouts
  3. Deploy to production (canary)
  4. Monitor error rate for 24 hours
  5. Deploy inventory optimization
  6. Monitor for additional improvement

Monitoring Added:

// Alert if pool utilization > 80% for 5 minutes
alert('db_pool_high_utilization', {
  condition: 'pool_used / pool_max > 0.8',
  duration: '5m',
  severity: 'warning'
});

// Alert if acquire wait time > 5s
alert('db_pool_slow_acquire', {
  condition: 'acquire_wait_p95 > 5000',
  duration: '1m',
  severity: 'critical'
});
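
Those alerts assume the pool metrics are exported somewhere the alerting system can read. One possible way to publish them, sketched with prom-client and the same tarn-style pool accessors used in the monitoring snippet earlier; metric names are illustrative:

const client = require('prom-client');

const POOL_MAX = 50;  // matches the new database config

const poolUsed = new client.Gauge({ name: 'db_pool_used', help: 'Connections currently in use' });
const poolLimit = new client.Gauge({ name: 'db_pool_max', help: 'Configured pool size limit' });
const poolWaiting = new client.Gauge({ name: 'db_pool_waiting', help: 'Requests waiting for a connection' });

// Sample the pool every few seconds; the utilization alert divides used by max.
setInterval(() => {
  poolUsed.set(pool.numUsed());
  poolLimit.set(POOL_MAX);
  poolWaiting.set(pool.numPendingAcquires());
}, 5000);

// Serve client.register.metrics() from a /metrics endpoint for the scraper to collect.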

Regression Test:

describe('Checkout under load', () => {
  it('completes 100 concurrent checkouts without timeout', async () => {
    const checkouts = Array(100).fill(null).map(() =>
      checkout({ items: [{ productId: 'test', quantity: 1 }] })
    );
    const results = await Promise.allSettled(checkouts);
    const failures = results.filter(r => r.status === 'rejected');
    expect(failures.length).toBe(0);
  });
});

Shall I implement this fix?"

User: Yes, proceed.

AI: "Implementing fix...

Changes made:

  1. Updated database config: pool max 25 -> 50
  2. Optimized inventory check: sequential -> batch query
  3. Added pool monitoring metrics
  4. Added regression test

Deployment:

  • Deployed to staging
  • Load test: 0 timeouts at 150 concurrent users
  • Deploying to production (canary 10%)

Results after 2 hours:

Canary (new config):
- Checkouts: 1,247
- Errors: 0
- Error rate: 0%

Control (old config):
- Checkouts: 11,223
- Errors: 198
- Error rate: 1.8%

Fix confirmed effective. Proceeding with full rollout."

Phase 3: Completion

AI: "Fix verified. Updating intent...

# .ai-dlc/fix-checkout-500/INTENT.md

---
status: completed
workflow: hypothesis
started_at: 2026-02-02T10:00:00Z
completed_at: 2026-02-02T14:30:00Z
---

## Completion Criteria
- [x] Root cause identified: Pool exhaustion + inefficient inventory queries
- [x] Fix implemented: Pool size increase + query optimization
- [x] Regression test added: Load test for concurrent checkouts
- [x] Error rate: 0% post-deployment (was 2%)
- [x] Monitoring added: Pool utilization + acquire time alerts
- [x] All tests pass: 156/156

Intent COMPLETE.

Summary:

  • Root cause: Connection pool undersized + inefficient queries
  • Fix: Increased pool, optimized queries
  • Result: Error rate 2% -> 0%
  • Prevention: Added monitoring alerts"

Key Takeaways

Why Hypothesis Workflow Worked

  1. Systematic approach - Didn't jump to random fixes
  2. Data-driven - Observations guided hypotheses
  3. Multiple hypotheses - Discovered two contributing factors
  4. Verified fix - Canary deployment confirmed effectiveness

Time Investment

Phase          Time     Activities
Observer       30 min   Log analysis, metrics review, code history
Hypothesizer   15 min   Form and rank theories
Experimenter   60 min   Pool testing, code review, optimization
Analyst        30 min   Confirm cause, implement fix, verify

Total: ~2.5 hours for a complex production bug.

What Could Go Wrong

Without hypothesis workflow:

  • Might have only increased pool size (missing the query optimization)
  • Might have tried random fixes (wasting time)
  • Might have deployed fix without verification
  • Might not have added monitoring (bug could recur)

Human Decisions Made

  1. Prioritized which hypotheses to test
  2. Approved the verification plan
  3. Approved production deployment

AI Contributions

  1. Gathered and analyzed data systematically
  2. Generated multiple hypotheses
  3. Designed and executed experiments
  4. Proposed comprehensive fix with monitoring

Next Steps