# Understanding Findings
Every Caliper review produces findings — individual observations about your code that range from critical security issues to minor style nits. This page explains how to read them, what each field means, and how to make good approve/skip decisions.
## Severity levels
Every finding has exactly one severity level. Severity determines whether a finding blocks merging and how urgently it should be addressed.
| Level | Meaning | Action required |
|---|---|---|
| `blocking` | Must be fixed before merging. Security vulnerabilities, logic bugs, broken API contracts, data loss risks. Does not lock the PR via GitHub — use `--fail-on-blocking` in CI for enforcement. | Fix the issue. In `caliper gate`, blocking findings cause a non-zero exit code and prevent the commit. |
| `recommendation` | A concrete improvement worth making. Design improvements, missing error handling, readability concerns. | Review and fix where you agree. These improve code quality but are safe to defer if you have good reason. |
| `nit` | Minor style or preference issue. Naming choices, import ordering, formatting inconsistencies. | Address if convenient. These are low-effort improvements that keep the codebase clean over time. |
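To make the gating rule concrete, here is a minimal sketch of how the severity levels translate into an exit code, the way `caliper gate` is described as behaving. The function and field names are illustrative, not Caliper's internal API:

```python
# Illustrative sketch of severity-based gating, mirroring the table above.
# Function and field names are hypothetical, not Caliper's internal API.

def gate_exit_code(findings):
    """Return a non-zero exit code if any finding is blocking;
    recommendations and nits never block on their own."""
    has_blocking = any(f["severity"] == "blocking" for f in findings)
    return 1 if has_blocking else 0

findings = [
    {"severity": "nit", "title": "Inconsistent import order"},
    {"severity": "blocking", "title": "SQL injection in query builder"},
]
```

Only the `blocking` level affects the exit code; the other levels are advisory.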
You can filter findings by severity on the command line:

```bash
npx caliper check --severity blocking        # only blocking issues
npx caliper check --severity recommendation  # blocking + recommendation (skip nits)
```

## Finding categories
Within each severity level, findings are sorted by category priority. Categories appear in this order, from highest to lowest priority:
### security
Auth guards, injection vulnerabilities, credential exposure, input validation gaps, and unsafe data handling. Security findings are almost always severity blocking.
### logic
Off-by-one errors, race conditions, null/undefined handling, incorrect boolean logic, and unhandled edge cases. Logic bugs that cause incorrect behavior are typically blocking; subtle edge cases may be recommendation.
### error-handling
Missing error recovery, swallowed errors, error messages that lack actionable context, and unhandled promise rejections. Most error-handling findings are recommendation unless the missing handling could cause data loss or silent failures.
### design
Single-responsibility violations, inappropriate abstraction levels, tight coupling, code duplication, and naming that obscures intent. Design findings are typically recommendation and focus on maintainability.
### performance
N+1 queries, missing pagination, unbounded loops or allocations, and unnecessary re-renders. Performance findings are recommendation unless the issue affects production reliability (e.g., unbounded memory growth), in which case they may be blocking.
### boy-scout
Dead code, stale comments, unnecessary complexity in files that were already modified by the PR. These are almost always nit — small cleanups that follow the boy-scout rule of leaving code better than you found it.
### testability
Missing tests for behavior changes, untestable structure (e.g., hardcoded dependencies that prevent mocking), and gaps in test coverage for new code paths. Typically recommendation.
### convention
Project-specific rules derived from your deterministic checks (generated by `caliper refresh` from your `CLAUDE.md`). Convention findings always come from the deterministic check engine, not from AI review. See Convention check findings vs AI review findings below.
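The ordering above can be sketched as a two-level sort: severity first, then category priority within each severity level. This is a hypothetical illustration of the described behavior, not Caliper's code:

```python
# Hypothetical sketch of the sort order described above:
# severity first, then category priority within each severity level.
SEVERITY_ORDER = ["blocking", "recommendation", "nit"]
CATEGORY_ORDER = [
    "security", "logic", "error-handling", "design",
    "performance", "boy-scout", "testability", "convention",
]

def sort_findings(findings):
    return sorted(
        findings,
        key=lambda f: (
            SEVERITY_ORDER.index(f["severity"]),
            CATEGORY_ORDER.index(f["category"]),
        ),
    )
```

For example, a blocking security finding sorts ahead of a blocking logic finding, which in turn sorts ahead of any nit.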
## Confidence levels
AI-generated findings include a confidence rating that reflects how certain the AI is about the issue.
| Level | What it means | How to use it |
|---|---|---|
| `high` | Strong, specific evidence in the code. The finding is almost certainly correct. | Treat as reliable. Fix blocking/recommendation findings without second-guessing. |
| `medium` | Likely correct based on the diff and surrounding context, but may require human judgment to confirm. | Read the explanation and evidence before deciding. The AI's reasoning is usually sound but context you have (and it does not) may change the picture. |
| `low` | A possible issue flagged for your review. The AI saw a pattern that could be problematic but is not confident. | Evaluate the evidence and explanation carefully. Low-confidence findings have a higher false-positive rate, but they occasionally surface real issues that are easy to miss. |
::: tip
Convention check findings (category `convention`) do not have a confidence level — they are deterministic pass/fail checks, so confidence does not apply.
:::
## Finding structure
Each finding contains the following fields:
### Core fields
| Field | Type | Description |
|---|---|---|
| `title` | string | A short, descriptive summary of the issue. This is what appears in PR comments and terminal output. |
| `body` | string | The full description of the problem, written as a standalone comment that makes sense to any developer reading it. |
| `file` | string | The file path where the issue was found, relative to the repo root. |
| `line` | number | The line number in the file where the issue is located. |
| `severity` | `blocking` \| `recommendation` \| `nit` | How urgent the finding is. See Severity levels. |
| `category` | string | The finding category. See Finding categories. |
### AI-specific fields
These fields are present on AI-generated findings but not on convention check findings:
| Field | Type | Description |
|---|---|---|
| `confidence` | `high` \| `medium` \| `low` | How certain the AI is about this finding. See Confidence levels. |
| `explanation` | string | The AI's reasoning for flagging the issue. Shown in the Details tab during interactive review. Not posted to PRs. |
| `evidence` | string[] | Specific code patterns, file paths, or line references that support the finding. Shown in the Details tab. Not posted to PRs. |
### Optional fields
| Field | Type | Description |
|---|---|---|
| `suggested_fix` | object | A concrete code change with `old_code` and `new_code` fields. When present, `caliper review --fix` can apply it automatically. |
| `references` | array | Cross-file code references that provide additional context, each with a file path, line range, and label. |
::: info
The `explanation` and `evidence` fields are for your eyes only during interactive review. When findings are posted as PR comments, only the `title`, `body`, and `suggested_fix` are included.
:::
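Putting the fields together, an AI-generated finding might look like the following. All values here are invented for illustration, and the serialization shown is not a claim about Caliper's on-disk or wire format:

```python
# Illustrative finding combining the documented fields.
# Every value is made up; this is not output from a real Caliper run.
finding = {
    # Core fields (present on every finding):
    "title": "Unvalidated user input passed to shell command",
    "body": "The branch name is interpolated into a shell string, "
            "allowing command injection via a crafted branch name.",
    "file": "src/git/checkout.ts",
    "line": 42,
    "severity": "blocking",
    "category": "security",
    # AI-specific fields (absent on convention check findings):
    "confidence": "high",
    "explanation": "User-controlled input reaches a shell without escaping.",
    "evidence": ["src/git/checkout.ts:42"],
    # Optional fields:
    "suggested_fix": {
        "old_code": "exec(`git checkout ${branch}`)",
        "new_code": "execFile('git', ['checkout', branch])",
    },
}
```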
## Convention check findings vs AI review findings
Caliper produces findings from two distinct engines. Understanding the difference helps you evaluate them appropriately.
### Convention checks (deterministic)

Convention checks are generated by `caliper refresh` from your `CLAUDE.md` and stored in `.caliper/checks.js`. They run as shell commands or pattern matches — no AI involved.
- Category: Always `convention`
- Confidence: Not applicable (deterministic pass/fail)
- Speed: Milliseconds — they run as part of `caliper check` with no AI calls
- False positive rate: Very low, since they are pattern-based. If one fires incorrectly, the check definition needs adjustment rather than feedback
- Examples: File naming conventions, required imports in specific directories, maximum file size limits, banned patterns
When violations are found, the stop hook formats them as flat, file-sorted instructions. Each violation shows `file:line — check-name (Section)`, a `Found:` line with the matched text, and an `Action:` line with the fix hint. The output closes with `After fixing, continue the original task.` so Claude immediately knows what to do next.
### AI review findings
AI findings come from Claude analyzing your diff, surrounding code, and project context. They cover all categories except convention.
- Category: Any category except `convention`
- Confidence: Always present (`high`, `medium`, or `low`)
- Speed: Seconds to minutes, depending on diff size and pipeline depth
- False positive rate: Varies by confidence level. High-confidence findings are rarely wrong; low-confidence findings require more scrutiny
- Examples: Logic bugs, missing error handling, security vulnerabilities, design concerns
AI findings include `explanation` and `evidence` fields that show the AI's reasoning.
## When to approve vs skip

During interactive review (`caliper review`), you approve or skip each finding. Here is a decision framework:
### Approve the finding when
- You agree the issue is real and plan to fix it
- The finding catches something you missed during development
- The suggested fix looks correct (verify before applying with `--fix`)
- Even if the severity feels too high, the underlying observation is valid
### Skip the finding when
- The finding misunderstands the intent of your code (e.g., a deliberate trade-off the AI does not see)
- The issue is already handled elsewhere in the codebase and the AI lacks that context
- The finding duplicates another finding you already approved
- The convention check is too broad and needs refinement (fix the check, not the code)
### Severity adjustment
If you agree with the finding but disagree with its severity, you can adjust it during interactive review. For example, you might downgrade a blocking finding to recommendation if the issue is real but not urgent enough to block a merge.
::: tip
Your approve/skip decisions are not throwaway — they feed into Caliper's feedback loop. Consistent patterns in your decisions improve future reviews. See False positives and the feedback loop below.
:::
## False positives and the feedback loop
No review tool is perfect. Caliper includes a feedback loop that learns from your decisions over time.
### How feedback is collected

Every time you approve or skip a finding during interactive review, Caliper records the decision to `.caliper/history.jsonl`. Each entry captures:
- The finding's category, severity, and confidence
- Whether you approved or skipped it
- Any notes you added
- The source phase that generated the finding
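A history entry might therefore look like the following. The key names here are assumptions for illustration (based only on what this page says each entry captures); the file itself is JSON Lines, one entry per line:

```python
import json

# Hypothetical shape of one .caliper/history.jsonl entry.
# Field names are illustrative, inferred from the list above.
entry = {
    "category": "error-handling",
    "severity": "recommendation",
    "confidence": "medium",
    "decision": "skipped",          # whether you approved or skipped it
    "note": "Handled by the global error boundary",
    "phase": "ai-review",           # source phase that generated the finding
}

line = json.dumps(entry)  # one JSON object per line in the .jsonl file
```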
### How the feedback loop works
As your history grows, Caliper detects patterns:
- False positive detection — Categories or patterns you consistently skip are flagged as likely false positives. With `autoCalibrate` enabled, these patterns are injected into the AI review prompt to reduce recurrence.
- Emerging conventions — Patterns you consistently approve may indicate conventions worth codifying. `caliper stats` surfaces these so you can add them as deterministic checks.
- Per-category calibration — Approval rates by category help Caliper focus AI attention on categories your team finds most valuable.
### Reporting a false positive
To report a false positive, simply skip the finding during interactive review. No additional action is needed — the skip is recorded and contributes to pattern detection. If you want to add context, you can attach a note when skipping.
For convention check false positives, the right fix is to update the check definition in `.caliper/checks.js` (or re-run `caliper refresh` to regenerate checks).
### Configuration

The feedback loop is configurable in `.caliper/config.yaml`:
```yaml
feedback:
  enabled: true        # collect finding disposition data (default: true)
  autoCalibrate: false # inject historical insights into AI prompts (default: false)
  maxHistory: 10000    # max entries before truncation (default: 10000)
  minEvidence: 5       # minimum pattern occurrences before detection (default: 5)
```

## Finding statistics
Use `npx caliper stats` to analyze your review history and identify trends:

```bash
npx caliper stats
```

The stats command shows:
- Approval rates by category — which categories produce the most accepted findings and which are skipped most often
- False positive patterns — recurring patterns that are consistently skipped, suggesting areas where the AI needs calibration
- Emerging conventions — patterns you consistently approve that could become deterministic checks
- Phase effectiveness — which review phases produce the most valuable findings
Stats require at least 5 review history entries in `.caliper/history.jsonl` to show meaningful patterns. History is recorded automatically after each interactive review session.
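The per-category approval rate is straightforward to compute from history entries. A minimal sketch, with the entry field names assumed for illustration:

```python
from collections import defaultdict

# Hypothetical sketch of per-category approval rates from history entries.
def approval_rates(history):
    """Return the fraction of findings approved, per category."""
    totals = defaultdict(int)
    approved = defaultdict(int)
    for entry in history:
        totals[entry["category"]] += 1
        if entry["decision"] == "approved":
            approved[entry["category"]] += 1
    return {cat: approved[cat] / totals[cat] for cat in totals}
```

A category hovering near 0% approval is the "needs better context or should be disabled" signal described in the tip below; one near 100% suggests the AI is well calibrated for that category.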
::: tip
Review your stats periodically. If a category has a low approval rate, consider whether the AI needs better context (add more detail to your `CLAUDE.md`) or whether those checks should be disabled.
:::