Aotokitsuruya
Senior Software Developer
Kobako: How Much Protection Can Tests Provide with a Coding Agent?

This article was translated by AI. If you have any corrections, please let me know.

Developing Kobako has surfaced quite a few interesting cases. Continuing from the previous post, Building Kobako with AI: Will It Eventually Crash?, I ran into yet another new problem afterward—and this time it was the Segmentation Fault error I had always dreaded seeing during development. That signals a high chance something went wrong in the non-Ruby territory between Rust and WebAssembly.

Flaky Tests

Around version 0.5.0, Kobako started narrowing its scope and slowing down feature expansion. With the core features in good shape, it was time to head toward a stable release—but instead, I ran into flaky tests: tests that would intermittently fail without any code changes at all.

Since I was wrapping up features, I just kept developing, and pretty soon the tests stopped showing the Segmentation Fault message. That actually made me more nervous, because if I couldn’t reproduce it, catching the problem would become much harder.

This is exactly the risk of developing with a coding agent. Even though we consider Rust a memory-safe language, and even after spending considerable effort minimizing unsafe usage by wrapping most of the mruby C API, we still ended up in this situation.

Still, at least we had tests. That is far better than having no tests at all and only catching the problem once it is already in real use or released.

Garbage Collection

I was not particularly optimistic about tracking down the problem—at least not based on past experience. To my surprise, it was caught faster than expected. After having Opus 4.7 run the tests a few times, it quickly confirmed the issue was reproducible and offered the judgment that “this might be caused by garbage collection.”

Luck was on my side, too: the problem did not occur in 0.4.x, the previously planned release. I only had to narrow down the commits in the 0.4.x ~ HEAD range, and the offending commit turned out to be the RPC -> Transport refactor, which had introduced a mistake in object marking right at the boundary between Ruby and the Ruby extension.

In Rust, we designed an #on_dispatch handler on the Runtime to handle method calls from the Guest to the Host, like the scenario below:

Utils::Time.now # sandbox.define(:Utils).bind(:Time, Time)
# Kobako::Transport::Proxy -> #method_missing -> __kobako__dispatch

In short, the call punches through from WebAssembly via the C API and asks the host linker to find a method matching something like Utils::Time.now, at which point #on_dispatch is responsible for handling how it gets called.
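
The call path described above can be illustrated with a minimal plain-Ruby sketch. Everything in it is hypothetical: `DispatchSketch`, `BINDINGS`, and the `DISPATCH` lambda are stand-ins invented for illustration, and the real Kobako path crosses the WebAssembly boundary and passes serialized request bytes rather than Ruby objects.

```ruby
# Hypothetical sketch; none of these names are Kobako's real API.
module DispatchSketch
  # Host-side registry, standing in for sandbox.define(:Utils).bind(:Time, Time).
  BINDINGS = { "Utils::Time" => Time }.freeze

  # Stand-in for the host dispatch handler: find the bound object, call the method.
  DISPATCH = lambda do |path, method_name, args|
    target = BINDINGS.fetch(path) { raise NameError, "nothing bound at #{path}" }
    target.public_send(method_name, *args)
  end

  # Guest-side stand-in: it defines no business methods of its own, so every
  # call falls through to method_missing and is forwarded to the dispatcher.
  class Proxy
    def initialize(path, dispatcher)
      @path = path
      @dispatcher = dispatcher
    end

    def method_missing(method_name, *args)
      @dispatcher.call(@path, method_name, args)
    end

    def respond_to_missing?(_method_name, _include_private = false)
      true
    end
  end
end

time_proxy = DispatchSketch::Proxy.new("Utils::Time", DispatchSketch::DISPATCH)
time_proxy.now # forwarded as ("Utils::Time", :now, [])
```

The point of the sketch is only the shape of the flow: the guest sees a proxy, and the host decides what the name actually resolves to.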

However, the initialization order goes Sandbox -> Runtime + Transport + Catalog, so the Runtime cannot know in advance what the Dispatcher method on the Transport is. It therefore needs the Sandbox to help install it, roughly like this:

def install_dispatch_proc!
  @runtime.on_dispatch = lambda do |request_bytes|
    ...
  end
end
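
To make the ordering constraint concrete, here is a small self-contained sketch of the wiring. The class bodies are assumptions made up for illustration (in particular `Transport#handle` and the string it returns); only the idea that the Sandbox installs the handler after both objects exist comes from the text above.

```ruby
# Hypothetical sketch of the wiring order; class bodies are illustrative only.
module InstallSketch
  class Runtime
    attr_writer :on_dispatch

    # Called when the guest issues a request; fails loudly if nothing is installed.
    def dispatch(request_bytes)
      raise "no dispatch handler installed" if @on_dispatch.nil?

      @on_dispatch.call(request_bytes)
    end
  end

  class Transport
    def handle(request_bytes)
      "handled:#{request_bytes}"
    end
  end

  class Sandbox
    attr_reader :runtime

    def initialize
      # Runtime and Transport are built independently of each other...
      @runtime = Runtime.new
      @transport = Transport.new
      # ...so the Sandbox, which owns both, connects them afterwards.
      install_dispatch_proc!
    end

    private

    def install_dispatch_proc!
      @runtime.on_dispatch = lambda do |request_bytes|
        @transport.handle(request_bytes)
      end
    end
  end
end

sandbox = InstallSketch::Sandbox.new
sandbox.runtime.dispatch("ping")
```

Note that in this sketch the lambda is referenced only by the Runtime once `install_dispatch_proc!` returns, which is the situation that matters in the rest of the story.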

At the time, the Rust implementation handled the #on_dispatch= behavior like this:

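pub(crate) fn set_on_dispatch(&self, proc_value: Value) -> Result<(), MagnusError> {
    let mut store_ref = self.store.borrow_mut();
    // Opaque::from only wraps the Value so it can be stored here; it does not
    // mark the proc for Ruby's garbage collector, so nothing keeps it alive.
    store_ref
        .data_mut()
        .bind_on_dispatch(Opaque::from(proc_value));
    Ok(())
}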

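In short, on the Rust side we convert the lambda do ... end into a pointer and store it on a particular Rust object. Ruby’s garbage collector cannot see that stored pointer, and since no Ruby variable holds onto this lambda object either, it gets judged as “collectible” and cleaned up. So when mruby actually calls it, the call goes through a dangling pointer to an object that has already been freed, which triggers the Segmentation Fault. Because this only happens when a collection runs between installing the handler and dispatching to it, the failure shows up intermittently.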

This usage is common with Magnus, and most of the time it does not go wrong: as long as you are only interacting with Ruby, there is rarely a problem. But we happened to need a boundary that integrates with WebAssembly, a setup “uncommon” enough in LLM (Large Language Model) training data that the model did not reach for an alternative approach and ended up using the wrong design here.

Once Opus 4.7 had pinned down the problem and used WebSearch to check it against the documentation, it quickly confirmed that the Opaque::from provided by Magnus does not mark the value for garbage collection, which means that, as far as Ruby is concerned, the object is collectible.

The fix was not complicated either. We tied this object’s lifetime to the Runtime: for as long as the Runtime is alive, the Ruby object set via #on_dispatch= is marked so the garbage collector will not collect it, and when the Runtime is dropped, that object is released automatically. With that, the problem was eliminated.
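
The ownership idea behind the bug and the fix can be shown with a loose plain-Ruby analogy. This is not the Rust change itself: `UnmarkedHolder` and `OwningHolder` are hypothetical names, and a `WeakRef` fails with a catchable `WeakRef::RefError` where the real extension crashed with a Segmentation Fault. The sketch only illustrates the difference between remembering a handler and keeping it alive.

```ruby
require "weakref"

# Hypothetical analogy in plain Ruby; the real bug lived in the Rust extension.
module LifetimeSketch
  # Like the buggy version: remembers the handler without keeping it alive.
  class UnmarkedHolder
    def initialize(handler)
      @handler = WeakRef.new(handler)
    end

    def call(request_bytes)
      # Raises WeakRef::RefError if the handler has already been collected.
      @handler.call(request_bytes)
    end
  end

  # Like the fixed version: the owner holds a strong reference, so the
  # handler lives exactly as long as the owner does.
  class OwningHolder
    def initialize(handler)
      @handler = handler
    end

    def call(request_bytes)
      @handler.call(request_bytes)
    end
  end
end

owner = LifetimeSketch::OwningHolder.new(->(bytes) { bytes.upcase })
GC.start
owner.call("ping") # still valid: the owner keeps the lambda reachable
```

Whether an `UnmarkedHolder` fails on a given run depends on when the collector happens to run, which mirrors why the original failure was flaky.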

The Protective Power of Tests

This case happened to be a fairly rare one, yet it was genuinely something tests could defend against. But having tests does not guarantee perfect protection: it depends heavily on how the test cases are designed and whether they cover the critical points.

In traditional software development workflows, tests were mostly seen as a cost, so we would prioritize testing the happy path to make sure expected behavior rarely broke, and only cover edge cases for extra reinforcement if we had the bandwidth.

Once you are using a coding agent, the cost of writing code itself is very low, which makes it well suited to covering things with a large number of tests. But if that coverage is fragmented and incomplete, it is not necessarily useful. For example, with only unit tests—if you only test the objects and methods of Ruby and Rust separately—this case of Rust holding onto a Ruby variable would not have been caught.

That is why choosing “which path to cover” is such an important judgment call. Both unit tests and integration tests can push coverage above 90%, but they mean very different things. If even one path is not run end to end, there is a chance you will miss a “problem in the integration process” and introduce new risk.
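
One way to make a timing-dependent lifetime bug more likely to surface in an end-to-end test on CRuby is to run the whole path with `GC.stress` enabled, which forces collection at every opportunity. The sketch below is an assumption about how such a test could look, not a description of Kobako's test suite; `GcStressSketch`, `with_gc_stress`, and `FakeRuntime` are hypothetical names, and a real test would drive the actual extension-backed object instead of the fake.

```ruby
# Hypothetical sketch; FakeRuntime stands in for a real extension-backed object.
module GcStressSketch
  # Runs the block with GC.stress enabled, restoring the previous setting after.
  def self.with_gc_stress
    previous = GC.stress
    GC.stress = true
    yield
  ensure
    GC.stress = previous
  end

  class FakeRuntime
    attr_writer :on_dispatch

    def dispatch(request_bytes)
      @on_dispatch.call(request_bytes)
    end
  end
end

runtime = GcStressSketch::FakeRuntime.new
runtime.on_dispatch = ->(bytes) { "echo:#{bytes}" }

# Exercise the whole install-then-dispatch path while the GC runs constantly.
replies = GcStressSketch.with_gc_stress do
  Array.new(3) { |i| runtime.dispatch("req#{i}") }
end
```

`GC.stress` slows execution considerably, so it is usually worth limiting it to the few end-to-end paths that cross the extension boundary.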

If your environment allows it, I would ideally recommend End-to-End Testing as well, and Behavior-Driven Development is also a good approach. Even if you do not go that far, I would still recommend defining the behavior of the happy path. For instance, Kobako’s docs/behavior.md provides specifications for each usage scenario, and that is exactly how this problem was caught.

All in all, do not let the perceived hassle stop you from doing it. The cost of handing this kind of documentation to AI is already very low, and whether you have test coverage, and whether it reaches the fragile spots, makes a big difference to the stability of AI-driven software like this and to the trustworthiness of what it produces.