# Understanding Findings
Every Caliper review produces findings — individual observations about your code that range from critical security issues to minor style nits. This page explains how to read them, what each field means, and how to make good approve/skip decisions.
## Severity levels
Every finding has exactly one severity level. Severity determines whether a finding blocks merging and how urgently it should be addressed.
| Level | Meaning | Action required |
|---|---|---|
| `blocking` | Must be fixed before merging. Security vulnerabilities, logic bugs, broken API contracts, data loss risks. Does not lock the PR via GitHub — use `--fail-on-blocking` in CI for enforcement. | Fix the issue. In `caliper gate`, blocking findings cause a non-zero exit code and prevent the commit. |
| `recommendation` | A concrete improvement worth making. Design improvements, missing error handling, readability concerns. | Review and fix where you agree. These improve code quality but are safe to defer if you have good reason. |
| `nit` | Minor style or preference issue. Naming choices, import ordering, formatting inconsistencies. | Address if convenient. These are low-effort improvements that keep the codebase clean over time. |
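To make the gating rule concrete, here is a minimal sketch of how the severity levels translate into an exit code, the way `caliper gate` is described as behaving. The function and field names are illustrative, not Caliper's internal API:

```python
# Illustrative sketch of severity-based gating, mirroring the table above.
# Function and field names are hypothetical, not Caliper's internal API.

def gate_exit_code(findings):
    """Return a non-zero exit code if any finding is blocking;
    recommendations and nits never block on their own."""
    has_blocking = any(f["severity"] == "blocking" for f in findings)
    return 1 if has_blocking else 0

findings = [
    {"severity": "nit", "title": "Inconsistent import order"},
    {"severity": "blocking", "title": "SQL injection in query builder"},
]
```

Only the `blocking` level affects the exit code; the other levels are advisory.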
You can filter findings by severity on the command line:

```bash
npx caliper check --severity blocking        # only blocking issues
npx caliper check --severity recommendation  # blocking + recommendation (skip nits)
```

## Finding categories
Within each severity level, findings are sorted by category priority. Categories appear in this order, from highest to lowest priority:
### security
Auth guards, injection vulnerabilities, credential exposure, input validation gaps, and unsafe data handling. Security findings are almost always severity blocking.
### logic
Off-by-one errors, race conditions, null/undefined handling, incorrect boolean logic, and unhandled edge cases. Logic bugs that cause incorrect behavior are typically blocking; subtle edge cases may be recommendation.
### error-handling
Missing error recovery, swallowed errors, error messages that lack actionable context, and unhandled promise rejections. Most error-handling findings are recommendation unless the missing handling could cause data loss or silent failures.
### design
Single-responsibility violations, inappropriate abstraction levels, tight coupling, code duplication, and naming that obscures intent. Design findings are typically recommendation and focus on maintainability.
### performance
N+1 queries, missing pagination, unbounded loops or allocations, and unnecessary re-renders. Performance findings are recommendation unless the issue affects production reliability (e.g., unbounded memory growth), in which case they may be blocking.
### boy-scout
Dead code, stale comments, unnecessary complexity in files that were already modified by the PR. These are almost always nit — small cleanups that follow the boy-scout rule of leaving code better than you found it.
### testability
Missing tests for behavior changes, untestable structure (e.g., hardcoded dependencies that prevent mocking), and gaps in test coverage for new code paths. Typically recommendation.
### convention
Project-specific rules derived from your deterministic checks (generated by `caliper refresh` from your `CLAUDE.md`). Convention findings always come from the deterministic check engine, not from AI review. See Convention check findings vs AI review findings below.
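The ordering above can be sketched as a two-level sort: severity first, then category priority within each severity level. This is a hypothetical illustration of the described behavior, not Caliper's code:

```python
# Hypothetical sketch of the sort order described above:
# severity first, then category priority within each severity level.
SEVERITY_ORDER = ["blocking", "recommendation", "nit"]
CATEGORY_ORDER = [
    "security", "logic", "error-handling", "design",
    "performance", "boy-scout", "testability", "convention",
]

def sort_findings(findings):
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_ORDER.index(f["severity"]),
            CATEGORY_ORDER.index(f["category"]),
        ),
    )
```

For example, a blocking security finding sorts ahead of a blocking logic finding, which in turn sorts ahead of any nit.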
## Confidence levels
AI-generated findings include a confidence rating that reflects how certain the AI is about the issue.
| Level | What it means | How to use it |
|---|---|---|
| `high` | Strong, specific evidence in the code. The finding is almost certainly correct. | Treat as reliable. Fix blocking/recommendation findings without second-guessing. |
| `medium` | Likely correct based on the diff and surrounding context, but may require human judgment to confirm. | Read the explanation and evidence before deciding. The AI's reasoning is usually sound but context you have (and it does not) may change the picture. |
| `low` | A possible issue flagged for your review. The AI saw a pattern that could be problematic but is not confident. | Evaluate the evidence and explanation carefully. Low-confidence findings have a higher false-positive rate, but they occasionally surface real issues that are easy to miss. |
::: tip
Convention check findings (category `convention`) do not have a confidence level — they are deterministic pass/fail checks, so confidence does not apply.
:::
## Finding structure
Each finding contains the following fields:
### Core fields
| Field | Type | Description |
|---|---|---|
| `title` | string | A short, descriptive summary of the issue. This is what appears in PR comments and terminal output. |
| `body` | string | The full description of the problem, written as a standalone comment that makes sense to any developer reading it. |
| `file` | string | The file path where the issue was found, relative to the repo root. |
| `line` | number | The line number in the file where the issue is located. |
| `severity` | `blocking` \| `recommendation` \| `nit` | How urgent the finding is. See Severity levels. |
| `category` | string | The finding category. See Finding categories. |
### AI-specific fields
These fields are present on AI-generated findings but not on convention check findings:
| Field | Type | Description |
|---|---|---|
| `confidence` | `high` \| `medium` \| `low` | How certain the AI is about this finding. See Confidence levels. |
| `explanation` | string | The AI's reasoning for flagging the issue. Shown in the Details tab during interactive review. Not posted to PRs. |
| `evidence` | string[] | Specific code patterns, file paths, or line references that support the finding. Shown in the Details tab. Not posted to PRs. |
### Optional fields
| Field | Type | Description |
|---|---|---|
| `suggested_fix` | object | A concrete code change with `old_code` and `new_code` fields. When present, `caliper review --fix` can apply it automatically. |
| `references` | array | Cross-file code references that provide additional context, each with a file path, line range, and label. |
::: info
The `explanation` and `evidence` fields are for your eyes only during interactive review. When findings are posted as PR comments, only the `title`, `body`, and `suggested_fix` are included.
:::
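Putting the fields together, an AI-generated finding might look like the following. All values here are invented for illustration, and the serialization shown is not a claim about Caliper's on-disk or wire format:

```python
# Illustrative finding combining the documented fields.
# Every value is made up; this is not output from a real Caliper run.
finding = {
    # Core fields (present on every finding):
    "title": "Unvalidated user input passed to shell command",
    "body": "The branch name is interpolated into a shell string, "
            "allowing command injection via a crafted branch name.",
    "file": "src/git/checkout.ts",
    "line": 42,
    "severity": "blocking",
    "category": "security",
    # AI-specific fields (absent on convention check findings):
    "confidence": "high",
    "explanation": "User-controlled input reaches a shell without escaping.",
    "evidence": ["src/git/checkout.ts:42"],
    # Optional fields:
    "suggested_fix": {
        "old_code": "exec(`git checkout ${branch}`)",
        "new_code": "execFile('git', ['checkout', branch])",
    },
}
```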
## Convention check findings vs AI review findings
Caliper produces findings from two distinct engines. Understanding the difference helps you evaluate them appropriately.
### Convention checks (deterministic)

Convention checks are generated by `caliper refresh` from your `CLAUDE.md` and stored in `.caliper/checks.js`. They run as shell commands or pattern matches — no AI involved.
- Category: Always `convention`
- Confidence: Not applicable (deterministic pass/fail)
- Speed: Milliseconds — they run as part of `caliper check` with no AI calls
- False positive rate: Very low, since they are pattern-based. If one fires incorrectly, the check definition needs adjustment rather than feedback
- Examples: File naming conventions, required imports in specific directories, maximum file size limits, banned patterns
When violations are found, the stop hook formats them as flat, file-sorted instructions. Each violation shows `file:line — check-name (Section)`, a `Found:` line with the matched text, and an `Action:` line with the fix hint. The output closes with `After fixing, continue the original task.` so Claude immediately knows what to do next.
### AI review findings
AI findings come from Claude analyzing your diff, surrounding code, and project context. They cover all categories except convention.
- Category: Any category except `convention`
- Confidence: Always present (`high`, `medium`, or `low`)
- Speed: Seconds to minutes, depending on diff size and pipeline depth
- False positive rate: Varies by confidence level. High-confidence findings are rarely wrong; low-confidence findings require more scrutiny
- Examples: Logic bugs, missing error handling, security vulnerabilities, design concerns
AI findings include `explanation` and `evidence` fields that show the AI's reasoning.
## When to approve vs skip

During interactive review (`caliper review`), you approve or skip each finding. Here is a decision framework:
### Approve the finding when
- You agree the issue is real and plan to fix it
- The finding catches something you missed during development
- The suggested fix looks correct (verify before applying with `--fix`)
- Even if the severity feels too high, the underlying observation is valid
### Skip the finding when
- The finding misunderstands the intent of your code (e.g., a deliberate trade-off the AI does not see)
- The issue is already handled elsewhere in the codebase and the AI lacks that context
- The finding duplicates another finding you already approved
- The convention check is too broad and needs refinement (fix the check, not the code)
### Severity adjustment
If you agree with the finding but disagree with its severity, you can adjust it during interactive review. For example, you might downgrade a blocking finding to recommendation if the issue is real but not urgent enough to block a merge.
::: tip
Your approve/skip decisions are not throwaway — they feed into Caliper's feedback loop. Consistent patterns in your decisions improve future reviews. See False positives and the feedback loop below.
:::
## False positives and the feedback loop
No review tool is perfect. Caliper includes a feedback loop that learns from your decisions over time.
### How feedback is collected

Every time you approve or skip a finding during interactive review, Caliper records the decision to `.caliper/history.jsonl`. Each entry captures:
- The finding's category, severity, and confidence
- Whether you approved or skipped it
- Any notes you added
- The source phase that generated the finding
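A history entry might therefore look like the following. The key names here are assumptions for illustration (based only on what this page says each entry captures); the file itself is JSON Lines, one entry per line:

```python
import json

# Hypothetical shape of one .caliper/history.jsonl entry.
# Field names are illustrative, inferred from the list above.
entry = {
    "category": "error-handling",
    "severity": "recommendation",
    "confidence": "medium",
    "decision": "skipped",          # whether you approved or skipped it
    "note": "Handled by the global error boundary",
    "phase": "ai-review",           # source phase that generated the finding
}

line = json.dumps(entry)  # one JSON object per line in the .jsonl file
```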
### How the feedback loop works
As your history grows, Caliper detects patterns:
- False positive detection — Categories or patterns you consistently skip are flagged as likely false positives. With `autoCalibrate` enabled, these patterns are injected into the AI review prompt to reduce recurrence.
- Emerging conventions — Patterns you consistently approve may indicate conventions worth codifying. `caliper stats` surfaces these so you can add them as deterministic checks.
- Per-category calibration — Approval rates by category help Caliper focus AI attention on categories your team finds most valuable.
### Reporting a false positive
To report a false positive, simply skip the finding during interactive review. No additional action is needed — the skip is recorded and contributes to pattern detection. If you want to add context, you can attach a note when skipping.
For convention check false positives, the right fix is to update the check definition in `.caliper/checks.js` (or re-run `caliper refresh` to regenerate checks).
### Configuration

The feedback loop is configurable in `.caliper/config.yaml`:
```yaml
feedback:
  enabled: true        # collect finding disposition data (default: true)
  autoCalibrate: false # inject historical insights into AI prompts (default: false)
  maxHistory: 10000    # max entries before truncation (default: 10000)
  minEvidence: 5       # minimum pattern occurrences before detection (default: 5)
```

## Finding statistics
Use `npx caliper stats` to analyze your review history and identify trends:

```bash
npx caliper stats
```

The stats command shows:
- Approval rates by category — which categories produce the most accepted findings and which are skipped most often
- False positive patterns — recurring patterns that are consistently skipped, suggesting areas where the AI needs calibration
- Emerging conventions — patterns you consistently approve that could become deterministic checks
- Phase effectiveness — which review phases produce the most valuable findings
Stats require at least 5 review history entries in `.caliper/history.jsonl` to show meaningful patterns. History is recorded automatically after each interactive review session.
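The per-category approval rate is straightforward to compute from history entries. A minimal sketch, with the entry field names assumed for illustration:

```python
from collections import defaultdict

# Hypothetical sketch of per-category approval rates from history entries.
def approval_rates(history):
    """Return the fraction of findings approved, per category."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for entry in history:
        totals[entry["category"]] += 1
        if entry["decision"] == "approved":
            approved[entry["category"]] += 1
    return {cat: approved[cat] / totals[cat] for cat in totals}
```

A category hovering near 0% approval is the "needs better context or should be disabled" signal described in the tip below; one near 100% suggests the AI is well calibrated for that category.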
::: tip
Review your stats periodically. If a category has a low approval rate, consider whether the AI needs better context (add more detail to your `CLAUDE.md`) or whether those checks should be disabled.
:::