Aotokitsuruya
Senior Software Developer

Maybe Your Claude Code Should Check Its Own Implementation

This article was translated by AI; if you spot any errors, please let me know.

I was originally planning to release a tool that applied techniques from my COSCUP talk in August. However, I found that development involves many more considerations than expected, and practical usage still has its barriers.

I decided to start from a simpler, more immediately accessible angle. Recently, I finally completed the basic functionality of ccharness, which can improve Claude Code’s output quality from a different perspective.

ReAct

In my talk, I based my approach on ReAct (Reasoning and Acting), replacing the Thought, Action, and Observation steps with an Evaluation, Feedback, and Optimization cycle that automatically rewrites the System Prompt to improve the quality of subsequent generations.

The inspiration came from the TextGrad paper and its TextLoss mechanism: by describing the evaluation criteria in text, an LLM (Large Language Model) can understand the gap between its output and the goal. That gap description then serves as reference information the LLM can use to rewrite its inputs, such as the System Prompt.

This approach is also a Meta Prompting technique. The process is very similar to what ReAct does, though ReAct places more emphasis on Reasoning, since outputting more relevant tokens raises the probability of generating the content we want.
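To make the cycle concrete, here is a minimal sketch of the Evaluation → Feedback → Optimization loop in Ruby. The `generate`, `evaluate`, and `rewrite_prompt` calls are hypothetical helpers standing in for whatever LLM client you use; they are not part of any real library.

```ruby
# A minimal sketch of the Evaluation -> Feedback -> Optimization cycle.
# `generate`, `evaluate`, and `rewrite_prompt` are hypothetical helpers that
# wrap an LLM client; replace them with your own implementation.

MAX_ITERATIONS = 3

def optimize(system_prompt, task)
  MAX_ITERATIONS.times do
    output   = generate(system_prompt, task)   # Action: produce an implementation
    feedback = evaluate(output, task)          # Evaluation: TextLoss-style critique described in text
    return output if feedback[:score] >= 0.9   # good enough, stop early

    # Optimization: rewrite the System Prompt using the textual feedback,
    # steering the next generation closer to the goal.
    system_prompt = rewrite_prompt(system_prompt, feedback[:comments])
  end

  generate(system_prompt, task)
end
```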

Currently, most Reasoning-type techniques trade more cost (token expense) for better quality.

Applied to Agent scenarios: after using a tool, the LLM inspects the tool output to decide its next action. Repeating this loop should, in theory, maintain good output quality, and most current Coding Agents are designed around this principle.

Context Engineering

If ReAct itself can produce stable, high-quality output, why does Claude Code sometimes still not perform well? The main hypothesis of this article is that we still lack sufficient context throughout the development process, and Context Engineering can be used to improve this issue.

From what I have seen, although we are beginning to grasp Context Engineering concepts, most adjustments and improvements to Claude Code still focus on modifying the System Prompt, attempting to provide complete, detailed task instructions at the “beginning” of the session.

This approach can indeed improve quality to some extent. AWS’s SDD (Spec-Driven Development) is a good example, which I also based my thinking on when preparing my COSCUP talk. However, this doesn’t apply the core idea of Context Engineering: “the right information in the right place.”

We can consider a question: Before Coding Agents appeared, what was the human process for implementing a feature?

  • Understand requirements
  • Implement features
  • Commit code

But do we really “never check our code” after implementation? I believe most people still verify before committing. This means Code Review doesn’t just happen once during development but involves multiple implicit “iterative reflections.”

To add this step, one approach is to require “after editing files, check that the implementation meets project requirements” in the System Prompt. However, this is limited by the characteristics of the attention mechanism - the most recent User Prompt usually gets the most attention, so this isn’t guaranteed to be executed every time.

Another approach is to use Claude Code’s Hook mechanism. By triggering on PostToolUse, we can add a new System Prompt before the next action, forcing a Code Review of the implementation to further improve Claude Code’s output quality.
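As a rough illustration, such a hook can be registered in `.claude/settings.json`. The shape below follows my understanding of the current Hooks documentation (check the docs for the exact schema), and the review script path is a placeholder:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write|MultiEdit",
        "hooks": [
          {
            "type": "command",
            "command": "ruby .claude/hooks/code_review.rb"
          }
        ]
      }
    ]
  }
}
```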

Therefore, thinking from the perspectives of ReAct and Context Engineering, a Coding Agent development flow becomes:

  • Understand requirements
  • Implement features
    • Tool Response (Observation - 1)
    • Code Review (Observation - 2)
  • Fix implementation
  • Commit code

In the original ReAct loop, the Observation only confirmed that “the modification succeeded,” not that “the implementation is correct,” so problematic spots could only be fixed after the agent had already stopped. The Hook mechanism, however, lets us “Block” subsequent actions, greatly increasing the success rate of this “forced checking.”
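To illustrate the “Block” behaviour, here is a minimal Ruby hook sketch. It assumes the documented hook contract as I understand it: the event arrives as JSON on stdin, and for PostToolUse an exit status of 2 blocks and feeds stderr back to Claude as feedback. The review instruction itself is only an example.

```ruby
#!/usr/bin/env ruby
# Minimal PostToolUse hook sketch: force a code review after each edit.
# Assumes the event arrives as JSON on stdin and that exiting with status 2
# blocks and returns stderr to Claude as feedback.
require "json"

event     = JSON.parse($stdin.read)
file_path = event.dig("tool_input", "file_path")

exit 0 unless file_path # nothing to review for this tool call

warn <<~FEEDBACK
  Before continuing, review your change to #{file_path} against the project's
  coding standards. Score it against the rubric and fix anything that fails.
FEEDBACK

# A real hook would add conditions so it does not block every single edit in a loop.
exit 2
```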

In my design, I provide different standards based on the files being modified, allowing more refined handling of checking standards for different scenarios.
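The idea of matching standards to files could look something like the sketch below. This is a hypothetical illustration, not how ccharness actually implements it; the directory layout and rubric file names are made up.

```ruby
# Hypothetical mapping from edited file paths to rubric files.
# Not ccharness's actual implementation; paths and names are illustrative.
RUBRICS = {
  %r{\Aapp/controllers/} => ".claude/rubrics/rails_controller.md",
  %r{\Aapp/models/}      => ".claude/rubrics/rails_model.md",
  %r{\Aspec/}            => ".claude/rubrics/rspec.md"
}.freeze

def rubric_for(file_path)
  RUBRICS.find { |pattern, _| file_path.match?(pattern) }&.last
end

# Inside the hook, the matching rubric would be emitted as review feedback:
#   rubric = rubric_for(file_path)
#   if rubric
#     warn File.read(rubric)
#     exit 2
#   end
```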

Evaluation

Adding Hooks isn’t difficult for most people; the challenge is designing good evaluation methods.

There are two evaluation approaches. One is the example in Claude Code’s official documentation: adding Linters and Formatters like Rubocop, Prettier, and Ruff. The other requires writing Evaluation Prompts, which is relatively more difficult as it depends heavily on Prompt Engineering and understanding of projects and software engineering.
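A Linter Hook of the first kind is straightforward. As a sketch (assuming the same stdin/exit-code hook contract as above), this runs RuboCop on the file that was just edited and blocks with the offence report when it fails:

```ruby
#!/usr/bin/env ruby
# Sketch of a Linter Hook: run RuboCop on the edited file and block on offences.
require "json"
require "open3"

event     = JSON.parse($stdin.read)
file_path = event.dig("tool_input", "file_path")
exit 0 unless file_path&.end_with?(".rb")

stdout, _stderr, status = Open3.capture3(
  "bundle", "exec", "rubocop", "--force-exclusion", file_path
)

unless status.success?
  warn stdout # feed the offence report back to Claude
  exit 2      # block until the offences are fixed
end
```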

Relying solely on Linter Hooks can hardly guarantee quality. I believe most people have already configured them in Claude Code, yet something always feels slightly off. This is why we also need Evaluation Prompt Hooks to shore up these weaknesses.

I recommend designing them with the G-Eval approach. It was the first evaluation method I encountered, and it lets LLMs and humans score in a similar way, less influenced by the LLM’s inherent preferences.

If an LLM generates an implementation we don’t want, letting it self-evaluate might not identify issues. This is why designing Evaluation Prompts is difficult - being too strict or too lenient can both lead to deviation.

G-Eval can be understood as creating a rubric. Here’s a simple example:

````markdown
# Rails Controller

This document outlines the criteria for evaluating the quality of a Rails controller.

## Criteria

The following criteria are used to evaluate the design and implementation of the Rails controller. Review step by step and give reasoning to explain why the implementation earns the score.

### Before Actions over Inline Code (1 point)

When defining an action in a Rails controller, ensure the common logic is extracted into before actions.

```ruby
class ReviewsController < ApplicationController
  before_action :set_review, only: [:show, :edit, :update, :destroy]

  def show
    # @review is already set by the before_action
  end

  def edit
    # @review is already set by the before_action
  end

  # ...

  private

  def set_review
    @review = Review.find(params[:id])
  end
end
```

## Scoring

Each criterion only gets its point when it is fully satisfied; otherwise it gets 0 points.
````

When a team encourages using before_action, we check for it in the rubric. Since Claude Code gives example code high priority, the example keeps implementations “very similar” to it, while avoiding scoring standards that textual ambiguity would otherwise make too vague.

If you think about it, a rubric documents the baseline standards for the team’s Code Reviews. In the past, without such documentation, differences in personal understanding easily led to inconsistent judgments during Code Review.

Once documented, the standards become relatively clear: what is written down is a hard “must fix when seen,” while undocumented items remain “personal preferences” - soft requirements. For a Coding Agent, this is the difference between “knowing” and “not knowing,” like constantly reminding a new team member.

In summary, our actual development process looks like this:

  • Understand requirements
  • Implement features
  • Check implementation
  • Commit code

Most senior engineers naturally “check their own code” to ensure stable quality before committing. Isn’t it rather strange that we haven’t considered this when using Coding Agents?

While we can’t know the original intent behind Claude Code’s Hook feature, at least the Linter Hook application scenario is useful for solving submission quality problems. This might be part of what makes Claude Code difficult for other Coding Agents to replace.

Due to space limitations, I can’t cover the many other considerations and techniques behind this, but this article offers one direction to try. I hope to see more applications that improve output quality from different angles in the future.