Verifying the Effect of AI Cross-Review Using a React Benchmark
A comparative experiment in React code review using Claude Opus 4.8 and GPT-5.5. The differences between same-session, separate-session, and cross-review were small, suggesting that adding a review step at all matters more than how the review is run.
Whether cross-review by a different model is actually effective has long been debated among users of AI coding assistants. An article published on Zenn attempts to answer the question with quantitative data: using a React proficiency benchmark, it compares four review processes with Claude Opus 4.8 and GPT-5.5 (Codex).
The experiment was carried out from June 11 to 12, 2026. Claude Opus 4.8 (claude CLI, effort=default/high) and GPT-5.5 (codex CLI, reasoning=default/medium) were used as implementation models, while Claude Sonnet 4.6 was uniformly used as the evaluator model. The evaluation metric was the React proficiency benchmark (13 specs) published by uhyo, which scores from perspectives such as accessibility, component design, state management, and TypeScript quality.
Characteristics of the Experimental Design
Four review patterns were compared: evaluation of the base implementation only, self-polishing within the same session, review in a separate session (without memory), and cross-review by a different model. A key design point is that the reviewer outputs only comments, and the fix is always performed by the original implementation model. This measures the “quality of suggestions” in a manner close to actual pull request reviews.
The same-session condition used session resume functionality (claude --resume / codex exec resume) to simulate polishing code immediately after writing it, with the implementation context still in memory. In contrast, cross-review had a different model generate review comments with no prior exposure to the code.
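The four patterns described above can be sketched as plain data plus a step planner. This is a minimal illustration, not the author's harness: the class `ReviewPattern`, the function `plan_steps`, and the model labels are hypothetical names, and splitting same-session polishing into separate "review" and "fix" steps is an assumption made to keep the four patterns comparable.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ReviewPattern:
    """One review condition (hypothetical model of the experiment's design)."""
    name: str
    reviewer: Optional[str]  # "self", "other", or None when there is no review
    keeps_memory: bool       # True if the reviewer resumes the implementation session


PATTERNS = [
    ReviewPattern("base", reviewer=None, keeps_memory=False),
    ReviewPattern("same-session", reviewer="self", keeps_memory=True),
    ReviewPattern("separate-session", reviewer="self", keeps_memory=False),
    ReviewPattern("cross-review", reviewer="other", keeps_memory=False),
]


def plan_steps(pattern: ReviewPattern, implementer: str,
               other_model: str) -> List[Tuple[str, str]]:
    """Return the (model, action) steps for one pattern.

    The reviewer only comments; the fix always goes back to the implementer.
    """
    steps = [(implementer, "implement")]
    if pattern.reviewer is None:
        return steps
    reviewer = implementer if pattern.reviewer == "self" else other_model
    steps.append((reviewer, "review"))
    steps.append((implementer, "fix"))
    return steps


if __name__ == "__main__":
    for p in PATTERNS:
        print(p.name, plan_steps(p, "claude-opus-4.8", "gpt-5.5"))
```

In this sketch only the cross-review pattern ever hands the "review" step to a second model; every pattern that includes a review ends with the implementer applying the fix.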
Summary of Score Results
The average scores across the 13 specs were as follows (Claude implementation / GPT implementation):
- Base: 79.0 / 77.7
- Same-session polishing: 83.8 / 84.2
- Separate-session review: 84.4 / 82.6
- Cross-review: 84.0 (Claude implementation with GPT review) / 82.0 (GPT implementation with Claude review)
Average improvement (Δ) from the base: Claude implementation: same-session +4.8, separate-session +5.4, cross-review +5.0. GPT implementation: same-session +6.5, separate-session +4.9, cross-review +4.3. All patterns showed improvements of +4 to +7, and the differences between patterns were within the range of variance.
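The improvement figures follow directly from the reported averages, and the arithmetic can be reproduced in a few lines. The sketch below uses only the scores quoted in this article; the names `BASE`, `REVIEWED`, `deltas`, and `spread` are illustrative and do not come from the published repository.

```python
# Average scores across the 13 specs, as reported in the article.
BASE = {"Claude": 79.0, "GPT": 77.7}
REVIEWED = {
    "Claude": {"same-session": 83.8, "separate-session": 84.4, "cross-review": 84.0},
    "GPT": {"same-session": 84.2, "separate-session": 82.6, "cross-review": 82.0},
}


def deltas(implementer: str) -> dict:
    """Improvement over the base score for each review pattern, to one decimal."""
    base = BASE[implementer]
    return {pattern: round(score - base, 1)
            for pattern, score in REVIEWED[implementer].items()}


def spread(implementer: str) -> float:
    """Gap between the best and worst review pattern for one implementer."""
    values = deltas(implementer).values()
    return round(max(values) - min(values), 1)


if __name__ == "__main__":
    for model in BASE:
        print(model, deltas(model), "spread:", spread(model))
```

The gap between the best and worst pattern is 0.6 points for the Claude implementation and 2.2 for the GPT implementation, both smaller than the smallest gain from adding any review at all (+4.3).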
No Clear Difference Between Patterns
The most important finding is that the presence of a review step accounts for most of the score improvement, while the way the review is conducted (same-session vs. separate-session, own model vs. other model) has only a small effect. In terms of cost, the author points out that same-session polishing saves one round trip of agent calls and therefore offers the best cost-effectiveness.
Looking at the breakdown by category, accessibility (ARIA attributes, focus management) and component design scores improved significantly. State design and TypeScript quality were already high at the base level and had reached a ceiling, leaving little room for improvement. Reviews can be said to function as a process to catch cross-cutting concerns that are easily missed in a single-shot implementation.
Systematic Failure Cases in Cross-Review
Interestingly, cross-review was confirmed to worsen scores systematically under specific conditions. In Spec 004 (user profile viewing), only the pattern in which Claude Opus 4.8 generated the review comments and GPT-5.5 applied the fixes produced scores below the baseline. The drop reproduced consistently across multiple trials. No such deterioration occurred with own-model combinations or with GPT’s own reviews.
The author checked the review comments but found no obvious errors in them. One possible explanation is that a “perception gap” between the models degrades the quality of the fixes, but the cause could not be identified. The phenomenon suggests that cross-review between models with different architectures or training data may carry unexpected risks.
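A failure case like the one above only shows up when scores are compared per spec instead of as averages. The following is a minimal sketch of such a check; `find_regressions` is a hypothetical helper, and the numbers in the example are made-up placeholders, not figures from the experiment (the article does not report per-spec scores).

```python
from typing import Dict, List, Tuple


def find_regressions(
    base: Dict[str, float],
    reviewed: Dict[str, Dict[str, float]],
) -> List[Tuple[str, str, float]]:
    """Return (spec, pattern, delta) for every reviewed score below its base score."""
    flagged = []
    for pattern, scores in reviewed.items():
        for spec, score in scores.items():
            delta = score - base[spec]
            if delta < 0:
                flagged.append((spec, pattern, delta))
    return sorted(flagged)


# Placeholder values for illustration only; NOT data from the experiment.
example_base = {"spec-a": 80.0, "spec-b": 75.0}
example_reviewed = {
    "same-session": {"spec-a": 85.0, "spec-b": 80.0},
    "cross-review": {"spec-a": 78.0, "spec-b": 81.0},
}

if __name__ == "__main__":
    for spec, pattern, delta in find_regressions(example_base, example_reviewed):
        print(f"{spec}: {pattern} scored {delta:+.1f} relative to base")
```

Running a check like this over repeated trials would separate a one-off dip from the reproducible degradation the author describes.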
Limitations and Future Tasks
Each case was executed only once, so statistical stability is not guaranteed. The published repository allows reproduction and extension. Additionally, among the three Claude Code models (Haiku, Sonnet, Opus), only Opus was used for implementation and review. Note that no prompt engineering or linter assistance was applied, so the results measure “raw ability.”
Furthermore, Claude Sonnet 4.6 was uniformly used as the evaluator for all evaluations. How the results would change with a different evaluator model remains a question for future verification.
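The single-run limitation noted above could be addressed by repeating each case and comparing the gap between patterns against the run-to-run spread. The sketch below is a crude heuristic under stated assumptions, not a formal significance test: `summarize` and `difference_exceeds_noise` are hypothetical helpers, and the run scores are placeholders, not experimental data.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of repeated runs (needs at least 2 runs)."""
    return mean(scores), stdev(scores)


def difference_exceeds_noise(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if the gap between the two means is larger than either run-to-run spread."""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) > max(sd_a, sd_b)


# Placeholder run scores for illustration only; NOT data from the experiment.
pattern_x = [84.0, 82.0, 86.0]
pattern_y = [83.0, 85.0, 84.0]
pattern_z = [70.0, 71.0, 72.0]

if __name__ == "__main__":
    print(summarize(pattern_x))
    print(difference_exceeds_noise(pattern_x, pattern_y))
    print(difference_exceeds_noise(pattern_x, pattern_z))
```

With only one run per case, as in the reported experiment, the standard deviation cannot be computed at all, which is why the pattern-to-pattern differences should be read cautiously.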
Editorial Opinion
Short-term Impact
This experiment provides a practical guideline for companies and teams introducing AI code review: establishing a review process at all should take priority over choosing which model performs the review. Over the next 3–6 months, similar benchmark-based verification may expand to other frameworks and languages. In particular, from a cost-efficiency perspective, the adoption of same-session polishing as a standard workflow is expected to accelerate.
Long-term Perspective
The systematic failure cases in cross-review pose a new challenge for collaborative AI agent design. Once multiple AI models routinely coordinate to manage codebases, workflow design that accounts for each model’s characteristics and “perception gaps” will become essential. Research is needed to quantitatively define the boundary between areas where single-model self-polishing is sufficient and areas where deliberately involving a different model is beneficial.
Questions from the Editorial Desk
This experiment is limited to the React framework and two model families. Will similar trends be observed with other programming languages or frameworks, or with different model architectures (e.g., Mixture of Experts vs. Dense models)? Furthermore, elucidating the mechanism behind systematic failures is directly linked to improving the reliability of future AI code assistants. Accumulating data on these questions is expected to refine practical guidelines for AI review.
Frequently Asked Questions
- What is cross-review?
- Cross-review is a method where code is reviewed by a different model (or human) than the one that implemented it. In this experiment, code implemented by Claude Opus 4.8 was reviewed by GPT-5.5 and vice versa, and the original implementation model then used the comments to make fixes.
- Which review pattern was the most effective?
- All patterns showed average improvements of +4 to +7, with no clear difference between patterns. Cost-wise, same-session polishing is the most efficient. Cross-review was the only pattern in which a specific spec scored below the baseline, but no pattern can be declared the overall winner.
- Can these experimental results be directly applied to real-world practice?
- This experiment measured "raw ability" without prompt engineering or linter assistance. When applying to practice, adjustments based on each team's codebase and quality standards are necessary. Also, since each case was run only once, caution is needed regarding statistical certainty.