A CAD Kernel in a Weekend

The premise

Karpathy called it Software x.0. The idea: we’re building machine learning systems that deliver functional software using fleets of agents. Not code generation as autocomplete. Agents as the primary authors of production systems, operating in a loop against an environment that tells them whether they’re converging on something correct.

This post is about applying that idea to build Knot, a B-rep CAD kernel written in Rust. The goal: a complement to Rhino3D, built for the web, designed to scale like Google’s Manifold but without dropping down to meshes, maintaining geometric precision end to end.

Why build a kernel

The idea came from working with Rhino Compute. Rhino is an incredible piece of software, but hosting it is painful. It depends on Windows APIs, which means you’re running Windows VMs in the cloud to serve geometry operations. Scaling it is expensive and awkward. There’s no path to running it at the edge or embedding it in a web application.

Manifold from Google is the modern alternative. It’s fast, it’s open source, it runs anywhere. But it operates on meshes. Meshes are approximations. For many applications that’s fine, but if you need the precision that NURBS-based B-rep gives you (exact tangency, exact offsets, exact fillets), meshes are fundamentally the wrong representation. You’re working with a discretized approximation of the geometry rather than the geometry itself.

Knot aims to sit in the gap: a kernel with the precision of Rhino’s NURBS-based approach, but built from scratch in Rust for the web, with no Windows dependencies, designed to be hosted anywhere and scaled horizontally.

The ABC dataset as environment

The ABC dataset is a large corpus of real CAD models. We used it not as a target to optimize against directly, but as a stress probe. The agent’s job was to implement a production boolean operations pipeline, and ABC was how we knew whether it worked on real geometry, not just textbook primitives.

We ran two tiers of validation:

  1. Synthetic primitives (300 boolean ops on parametrically generated boxes, spheres, cylinders, cones, and tori with a fixed seed). Fast, deterministic, near 100%. The regression smoke test.
  2. ABC chunk 0000 (30 pairs x 3 operations = 90 booleans on real CAD models). Slow, exposes real-world degeneracies. Started around 90%, drove it to 100%/96.7%. The stress probe.

The two tiers serve different purposes. Synthetic catches “did I break something basic.” ABC catches “does this work on inputs that weren’t designed to be easy.” Most of the real algorithmic work iterated against ABC because it surfaced things synthetic never would: multi-solid STEP files, NURBS-vs-NURBS face pairs, classify-stage performance cliffs.
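To make the synthetic tier concrete, here’s a minimal sketch of what seeded generation could look like, using the rand crate. The Primitive and BoolOp types are illustrative stand-ins, not Knot’s actual API:

```rust
use rand::rngs::StdRng;
use rand::{RngCore, SeedableRng};

// Illustrative stand-ins for the harness's primitive and operation types.
#[derive(Debug, Clone, Copy)]
enum Primitive { Box, Sphere, Cylinder, Cone, Torus }

#[derive(Debug, Clone, Copy)]
enum BoolOp { Union, Difference, Intersection }

/// Build the synthetic test plan from a fixed seed so the smoke test
/// runs the exact same cases on every invocation.
fn synthetic_cases(seed: u64, n: usize) -> Vec<(Primitive, Primitive, BoolOp)> {
    let prims = [Primitive::Box, Primitive::Sphere, Primitive::Cylinder,
                 Primitive::Cone, Primitive::Torus];
    let ops = [BoolOp::Union, BoolOp::Difference, BoolOp::Intersection];
    let mut rng = StdRng::seed_from_u64(seed);
    let mut pick = |len: usize| (rng.next_u64() as usize) % len;
    (0..n)
        .map(|_| (prims[pick(prims.len())], prims[pick(prims.len())], ops[pick(ops.len())]))
        .collect()
}
```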

Honest measurement first

The first several percentage points of improvement weren’t algorithmic. They were getting the measurement honest.

We added a watchdog timeout so an infinite loop in the boolean couldn’t silently wedge the harness. Our earlier “94.4% peak” turned out to be partly an artifact of hung pairs being counted into different buckets across runs. With a real watchdog we got a noisy 88-94%, then a stable 93.3%, then eventually 100%. But each number was real for the first time.
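The watchdog itself can be small. A sketch of the shape, assuming a hypothetical Outcome type; note that a production harness would run each pair in a subprocess it can kill, rather than leaking the hung thread as this does:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical per-pair outcome; the real harness's type will differ.
enum Outcome {
    Valid,
    Timeout,
}

/// Run one boolean attempt under a wall-clock watchdog so a hang is
/// recorded as a Timeout instead of wedging the whole harness.
fn with_watchdog<F>(budget: Duration, op: F) -> Outcome
where
    F: FnOnce() -> Outcome + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the receiver already timed out, this send just fails quietly.
        let _ = tx.send(op());
    });
    rx.recv_timeout(budget).unwrap_or(Outcome::Timeout)
}
```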

We made file ordering deterministic (sorted recursive walk) so the same corpus produced the same test sequence. We’d been chasing 5 percentage points of HashMap-iteration noise.
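Deterministic ordering is a few lines: a sorted recursive walk, sketched here with std only (the actual harness code may differ):

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Sorted recursive walk: the same corpus always yields the same test
/// sequence, independent of filesystem or HashMap iteration order.
fn collect_step_files(root: &Path, out: &mut Vec<PathBuf>) -> std::io::Result<()> {
    let mut entries: Vec<PathBuf> = fs::read_dir(root)?
        .filter_map(Result::ok)
        .map(|e| e.path())
        .collect();
    entries.sort(); // lexicographic order makes the walk deterministic
    for path in entries {
        if path.is_dir() {
            collect_step_files(&path, out)?;
        } else if path.extension().map_or(false, |e| e == "step" || e == "stp") {
            out.push(path);
        }
    }
    Ok(())
}
```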

We categorized failures per-pair: valid, empty, bad_input, topo_fail, tess_fail, crash, timeout. This made it obvious which class of failure to attack and whether fixes genuinely closed problems or just moved them between buckets.
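In code, the buckets can be as simple as an enum. The variant names mirror the categories above; the enum itself is an illustrative sketch, not Knot’s actual type:

```rust
/// Per-pair failure buckets for the validation harness.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum PairOutcome {
    Valid,    // boolean produced a validated solid
    Empty,    // result was legitimately empty
    BadInput, // input failed validation before the boolean ran
    TopoFail, // topology stitching failed
    TessFail, // result could not be tessellated
    Crash,    // panic or abort inside the pipeline
    Timeout,  // watchdog fired
}
```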

The discipline: don’t optimize until your measurement is reproducible. Everything after this was built on that foundation.

Diagnose before committing

This was the load-bearing discipline of the entire project. We built about six short diagnostic harnesses, each answering one specific structural question before we committed to any algorithmic work.

One asked: which validation rule trips on imports? Answer: 14 Euler violations, 8 non-manifold edges, many dangling references. This drove the line-edge reconciliation fix.

One asked: what’s the bounding-box filter survival rate per pair? Answer: 4.6% on one pair, 0.3% on another. This falsified the assumption that surface-aware bounding boxes would help the hard cases.

One asked: where does the 8-second budget actually go per pair? Answer: 23.9 seconds in classify_face_exact on the hardest pair. This was the single biggest find of the project. It was hiding in plain sight while we’d been queuing weeks of fat-plane Bezier clipping work that would have addressed a different bottleneck entirely.
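The stage-trace diagnostic needs nothing fancy. Something like this sketch (not the project’s actual harness) is enough to show where the seconds go:

```rust
use std::time::{Duration, Instant};

/// Minimal stage tracer: wrap each pipeline stage, record wall time,
/// and print a per-pair breakdown of where the budget actually goes.
struct StageTrace {
    timings: Vec<(&'static str, Duration)>,
}

impl StageTrace {
    fn new() -> Self {
        Self { timings: Vec::new() }
    }

    /// Run a stage and record its elapsed wall time under `name`.
    fn time<T>(&mut self, name: &'static str, stage: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = stage();
        self.timings.push((name, start.elapsed()));
        out
    }

    fn report(&self) {
        for (name, dt) in &self.timings {
            println!("{name:>24}: {:.3}s", dt.as_secs_f64());
        }
    }
}
```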

Each diagnostic cost about an hour to write and consistently saved days of mistargeted work. Twice we had substantial algorithmic projects queued based on intuition that a diagnostic immediately falsified:

  1. A “NURBS-vs-NURBS fat-plane Bezier clipping” project (1-2 weeks of work) got cancelled when the filter diagnostic showed the model was mostly planar faces (206 of 253), not NURBS.
  2. A “BVH spatial culling” project (3-5 days) got cancelled when the stage trace showed candidate filtering was already at 1ms and the actual bottleneck was the 23.9s in classify.

The actual fix in both cases was a SolidClassifier: about 50 lines of acceleration structure plus a dispatcher refactor. Hours, not weeks. Every time we slowed down to verify, the queued plan got smaller or pivoted entirely.
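I won’t claim this is faithful to Knot’s internals, but the shape of a point-in-solid classifier with cached per-face acceleration looks roughly like the sketch below; FaceGeom and everything under it are assumptions:

```rust
/// Minimal interface a face must expose for classification. Everything
/// here is an assumption about the shape of the problem, not Knot's API.
trait FaceGeom {
    /// Axis-aligned bounding box as (min, max) corners.
    fn aabb(&self) -> ([f64; 3], [f64; 3]);
    /// Number of times a +X ray starting at `p` crosses this face.
    fn ray_crossings_x(&self, p: [f64; 3]) -> usize;
}

/// Point-in-solid classifier that builds per-face acceleration data once
/// and reuses it for every query, instead of re-deriving it per call.
struct SolidClassifier<F: FaceGeom> {
    faces: Vec<F>,
    boxes: Vec<([f64; 3], [f64; 3])>, // cached, slightly padded AABBs
}

impl<F: FaceGeom> SolidClassifier<F> {
    fn new(faces: Vec<F>) -> Self {
        let boxes = faces
            .iter()
            .map(|f| {
                let (mut lo, mut hi) = f.aabb();
                for i in 0..3 {
                    lo[i] -= 1e-9; // pad against exact-touch misses
                    hi[i] += 1e-9;
                }
                (lo, hi)
            })
            .collect();
        Self { faces, boxes }
    }

    /// Parity ray cast along +X, skipping faces whose cached box the
    /// ray cannot possibly hit.
    fn contains(&self, p: [f64; 3]) -> bool {
        let mut crossings = 0;
        for (face, (lo, hi)) in self.faces.iter().zip(&self.boxes) {
            let reachable = hi[0] >= p[0]
                && lo[1] <= p[1] && p[1] <= hi[1]
                && lo[2] <= p[2] && p[2] <= hi[2];
            if reachable {
                crossings += face.ray_crossings_x(p);
            }
        }
        crossings % 2 == 1
    }
}
```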

Wall-clock budgets as correctness

A boolean that takes 30 seconds is not a successful boolean for a CAD user. So time budgets aren’t a tunable you loosen to hit a higher percentage. They are the correctness criterion.

We layered three, from the harness-level watchdog down to the 8-second pipeline budget.

The pipeline budget shaped the engineering. The question for any optimization wasn’t “can we make this fit eventually” but “can we make this fit in 8s on real CAD.” This constraint eliminated a whole class of approaches that would technically produce correct output but not at interactive speed.
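In practice a budget like this is just a deadline threaded through the pipeline, checked at stage boundaries. A sketch; the usage comment with BooleanError and classify_face is hypothetical:

```rust
use std::time::{Duration, Instant};

/// A wall-clock deadline threaded through the pipeline. Stages check it
/// at natural boundaries and bail out instead of blowing the budget.
struct Budget {
    deadline: Instant,
}

impl Budget {
    fn new(total: Duration) -> Self {
        Self { deadline: Instant::now() + total }
    }

    fn expired(&self) -> bool {
        Instant::now() >= self.deadline
    }
}

// Hypothetical usage inside a classify stage:
//
//     for face in candidate_faces {
//         if budget.expired() {
//             return Err(BooleanError::Timeout);
//         }
//         classify_face(face)?;
//     }
```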

The unsexy work won

The reliability journey table in our pipeline docs is intentionally chronological. It shows which optimization actually moved the headline at each step. The pattern is clear: the instinctively prioritized work (algebraic SSI subsystems, BVH spatial culling, Bezier clipping) didn’t move the number nearly as much as the unsexy work (instrumentation, deterministic ordering, soft-accept calibration, classify-stage caching).

That ordering is itself a result of the diagnose-first discipline. It would have been the opposite ranking if we’d guessed instead of measured. For a reliability-bound system, the binding constraint is almost never where intuition places it. Cheap diagnostic infrastructure that points at the actual bottleneck is the highest-leverage thing you can build.

Continuous integration as agent feedback

The whole approach is really just real-time CI applied to agentic authorship. The agent makes a change. The harness runs. The agent sees whether reliability went up, down, or sideways, and in which failure category. It iterates.

You can extend this to anything with a verifiable environment. The agent calls a local API to check values. It hits a live test deployment to validate integration. It runs against a dataset to check correctness. In each case, the environment provides the signal and the agent iterates until it converges.

The key insight: you need the signal to be fast, honest, and categorized. Fast so the agent can iterate many times. Honest so it’s not chasing measurement noise. Categorized so it knows which kind of failure it’s looking at. Get those three properties right and the agent converges. Miss any of them and it thrashes.
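Concretely, the signal can be as small as a categorized report. A sketch, with field names that are illustrative rather than Knot’s actual format; a BTreeMap keeps category ordering deterministic, which is part of keeping the signal honest:

```rust
use std::collections::BTreeMap;

/// The per-run signal an agent iterates against: one headline number
/// plus a categorized failure breakdown.
struct HarnessReport {
    total: usize,
    valid: usize,
    failures: BTreeMap<String, usize>, // e.g. "timeout" -> 2
}

impl HarnessReport {
    fn pass_rate(&self) -> f64 {
        self.valid as f64 / self.total as f64
    }
}
```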

Not a free lunch, but a silver bullet

This approach is not free. Compute costs are real. Agent loops that run for hours burn tokens and CPU time. You need good data or a good environment to optimize against. If your verification signal is noisy or incomplete, the agent will converge on something that passes your checks but doesn’t actually work. All the usual discipline around testing and specification still applies.

But I’d argue this is genuinely a silver bullet in the Brooks sense. The constraint on what you can build is no longer human engineering time. It’s what you can compute. If you can generate the test cases, if you can construct the dataset, if you can define the environment that tells the agent whether it’s right or wrong, then you can build the system. The human role shifts from writing the software to defining the problem and providing the verification signal.

The limiting factor becomes: can you compute the feedback? Can you generate the data? Can you define what “correct” looks like in a way the agent can check? If yes, you can build essentially anything. The search space hasn’t shrunk. But the cost of searching it has dropped to the price of compute, and compute gets cheaper every year.

What it actually cost

I’m not totally sure what this means yet in terms of economics. When I had Claude report its session cost using the default pricing from Anthropic’s docs, each agent session came out to around $150. I’m on the 5x Claude Max plan at $100/month, so I’m heavily subsidized. Without that subsidy, running a fleet of agents like this on a frontier model could get expensive fast.

Altogether, based on the status-line cost-tracking script, it appears I used somewhere around $900 in credits to build the kernel. For a weekend of work that produced a functional B-rep kernel, that’s remarkable. But it’s not nothing, and it scales with the complexity of what you’re building and the model you choose. Smaller models would be cheaper but likely require more iterations. Frontier models converge faster but cost more per token.

The cost question is real, but it’s also a moving target. Inference costs drop steadily. What cost $900 today might cost $90 in a year. The pattern itself (agent fleet optimizing against a dataset) is model-agnostic. As cost per token drops, the set of problems where this approach makes economic sense expands.

What I’d do differently