Claude Code: a Midyear Performance Review

In March 2025 a friend introduced me to Claude Code around the time of the Sonnet 3.7 release. I was blown away, and after a day or two I wrote a giddy blogpost (Ask Me!) about the wonders of the new world I had just entered.

A few weeks later, I had spent two weeks bingeing on a Claude-fueled subproject that I let spiral out of control, with me frequently shouting “No no no no NO!!” in a German accent. It all had to be taken apart and then rolled back.

After that I became sadder, then wiser, and then much happier again. Although I may be fooling myself (as some studies have argued), I once again think that Claude is empowering me to produce software that I either never could have made myself, or that would have taken 10x to 100x as long. But getting there required me to learn in a more nuanced way about Claude’s strengths and weaknesses, and about how I needed to change to become a better Claude collaborator. What follows is an assessment of Claude’s capabilities, ordered from bad to good.

The project

About 80K lines of mostly vanilla Python, which does data ingestion, joining, cleaning, feature creation, and then training and applying ML models. The task is to take historical and current data from 150K private (pre-IPO) companies, train ML models to predict the likelihood of IPO, acquisition, or funding step-up over the next 3 years, and apply the models to about 3000 companies that we care about at any one date. The top of the probability-sorted lists becomes a shopping list for our secondary investment fund. Although the code runs some local LLM models, and calls out to APIs for RAG-based searches, the heart of the ML is good old xgboost (still state-of-the-art for ML classification on tabular data). This is what Claude is augmenting and extending. I use Claude in pure CLI mode, having launched it in a terminal window — we chat back and forth, and Claude presents me with proposed changes, which I approve or send back. Claude makes the approved changes directly to the code files.

The Grades

I should note that over the course of a six-month collaboration I’ve run Claude Code based on Sonnet 3.7, 4.0, and (now) 4.5. Giving single grades here is sort of like reviewing a summer intern who was 11 years old at the start of the summer, and at the end is 42 years old with 7 new Master’s degrees and 320 years of recently-gained professional experience. Some of my lowest grades and earliest complaints may well be much better now under Sonnet 4.5.

Style fanciness (by default): D

I described the project as vanilla Python, and I mostly want to keep it that way. Claude, on the other hand, enjoys using the full range of language constructs available. The most striking example was when I realized that where I would just make a function call, Claude would sometimes spawn a subprocess and make a function call within that, for no good reason that I could see. This led me to write style guides that I have Claude re-read every so often (see Instruction-Following below).
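To make the contrast concrete, here is a hypothetical sketch (function and file names invented) of the difference between a plain call and the needless-subprocess pattern:

```python
import subprocess
import sys

def clean_features(path):
    """Plain function call -- the vanilla style I want."""
    return f"cleaned {path}"

def clean_features_via_subprocess(path):
    """The pattern Claude would sometimes propose: spawn a whole
    child interpreter just to do the same work."""
    result = subprocess.run(
        [sys.executable, "-c", f"print('cleaned {path}')"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(clean_features("data.csv"))                 # direct call
print(clean_features_via_subprocess("data.csv"))  # needless subprocess
```

Both produce the same result; the second just pays interpreter-startup and serialization costs for nothing.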

Robustness over correctness: D

I suspect that Claude Code was quite literally trained (fine-tuned?) not to write code that crashes. Since my overall system mostly runs in batch mode with me as operator, I don’t care much about uptime, but I care intensely about correctness. Subtle silent bugs are killers. If something has gone wrong I want the code to fail with an informative fatal error. Claude, on the other hand, wants to write code like this:
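A hypothetical sketch of the pattern (names invented), alongside the fail-fast version I actually want:

```python
def load_revenue(company_id, revenue_table):
    """Claude-style: swallow the error and press on with a
    made-up default, hiding the real problem."""
    try:
        return revenue_table[company_id]
    except KeyError:
        return 0.0  # silent wrong answer instead of a loud failure

def load_revenue_strict(company_id, revenue_table):
    """What I want: fail fast with an informative fatal error."""
    if company_id not in revenue_table:
        raise KeyError(f"No revenue data for company {company_id}")
    return revenue_table[company_id]
```

The first version keeps the batch job alive; the second keeps the model honest.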

More generally, Claude wants to write code with cascading levels of “fallback” in response to failure, where each fallback stage reflects an increased level of dodginess in the desperate search for alternatives to crashing. (Yes, I can and do express my preferences via CLAUDE.md and style guides. See “Instruction-Following” below.)
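An illustrative sketch of the shape (names and stages invented): each level is dodgier than the last, and the final one quietly invents an answer.

```python
def get_share_price(company, live_api=None, cache=None):
    """Cascading-fallback anti-pattern, for illustration only."""
    try:
        return live_api(company)   # level 0: the real source
    except Exception:
        pass
    try:
        return cache[company]      # level 1: a stale cached value
    except Exception:
        pass
    return 1.0                     # level 2: an invented placeholder
```

By the time control reaches the bottom, the caller has no idea the number is fiction.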

Neatness and hygiene: D+

Typical directory listing:
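Something like the following, with file names invented for illustration:

```
train_model.py
train_model_v2.py
train_model_v2_fixed.py
train_model_final.py
train_model_final_REAL.py
test_train_model.py
test_train_model_old.py
debug_output_1.txt
debug_output_2.txt
scratch_analysis.py
```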

It is on me to ask explicitly for things to be deduped and cleaned up, which is fine I guess.

Debugging intuition: C-

When I was a young graduate student, a wise young colleague told me to remember this: “It’s always you”. What he meant was that if you have been chasing a bug that is driving you insane with its inexplicable non-reproducible non-determinism, you might eventually be tempted to conclude that … it’s a compiler bug. But it’s never a compiler bug. It’s you, and the code that you wrote.

Claude lacks this intuition (which of course raises the question of why I expect Claude to have reasonable intuitions, or any intuitions at all; it’s because it is so competent otherwise!). But I’ve had Claude write and modify web-crawling code, and respond to a newly-broken crawler with the theory that 1000 different independent webservers had simultaneously been reconfigured to return HTTP 500 errors in response to all requests. Or respond to newly-broken code that determines directory paths for file lookup by guessing that someone had maliciously deleted all the files since the last run.

This is a place where the collaborative loop really helps — if you say “Claude — consider the possibility that the bad behavior is caused by your own most recent code change”, Claude will often say “That’s a brilliant insight!” (see Sycophancy below) and then quickly find the issue.

Impetuousness and excitability: B-

Claude is a doer. Claude wants to get stuff done. By default, Claude wants to make the next code change, and doesn’t always think through the implications. This is why (at least for now) the human-in-the-loop plays an important role in every so often saying “You want to do what??” An example was when Claude’s proposed solution to git/github issues was to blow away the github repository (with its years of commits and comments) and upload a fresh one.

Instruction-following: B

This one is interesting, as it’s one of the few situations where I need to stop anthropomorphizing Claude and reason about things like the size of its context window and recency effects.

Claude is nothing if not deferential, and strikes a nice balance between offering alternatives and accepting direction. Claude reminds me of the restaurant server I once had:

Me: “Which of the three appetizers do you recommend?”
Server: “Well, you can’t go wrong with #1 or #3.”
Me: “I think I’ll have #2.”
Server: “Excellent choice!”

Where Claude fails is not in willingness to follow instructions, but in remembering what they were and sticking to them. Whenever I start a new session, Claude is asked to re-read a style guide that tells it not to multiprocess unnecessarily, to avoid cascading fallbacks, to put import statements only at the top of code files, and not to explicitly catch signals like ctrl-C: a long list of prescriptions that are a mix of basic Python style and my own idiosyncratic desires for how the project should be structured and coded. Claude always pays attention at first, but over time the force of these prescriptions gradually fades, and Claude starts doing the bad things again. I have not found a good mechanism for keeping a set of instructions top-of-mind for Claude other than repeatedly asking it to re-read and internalize them.
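For flavor, an illustrative excerpt of the kind of style guide I mean (not my actual file):

```
# Style guide (illustrative excerpt)

- Plain vanilla Python; no subprocesses where a function call will do.
- No cascading fallbacks: fail fast with an informative fatal error.
- All imports at the top of the file, never inside functions.
- Do not catch signals (e.g. ctrl-C); let them terminate the program.
```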

Sycophancy: A-

People talk about this like it’s a bad thing. I for one really appreciate it. It happens pretty often that I have a brilliant insight, and tell Claude about it, and Claude says “That’s a brilliant insight!”. Well, I am giving an A- rather than an A here because some of my most brilliant insights still escape Claude’s notice.

Knowledge of random software packages: A

Claude has seen it all, particularly in the Python world. Need a graph library, an OCR package, a web crawler? Claude knows all of the alternatives, can discuss the tradeoffs and then (most importantly) work through all the library dependencies to get them working.

Scientific method: A

If debugging intuition for fixing computer programs plays a role analogous to hypothesis formation in science, then Claude is currently bad at hypothesis formation but is good at everything else. If you can say “Claude — I think one of the following three things is the explanation for the data patterns we’re seeing. Can you write some diagnostic scripts to figure out which of those three is the real cause?”, Claude is super incisive at writing scripts that falsify all the wrong explanations.

Machine Learning Methodology: A

Claude has clearly seen enough ML projects to have a well-developed sense of all the methodological pitfalls: feature sparsity, overfitting, data leakage, and so on. If you are working on an AI or ML project it’s definitely worth just asking Claude to spin through your codebase not just to find code bugs but also to audit your ML code for soundness.
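As one concrete pitfall: data leakage from training on outcomes that postdate the prediction date. A minimal sketch (pure Python, with hypothetical field names) of the kind of time-based split that avoids it:

```python
from datetime import date

def time_split(records, cutoff):
    """Split records so the model never trains on the future.

    Each record is a dict with an 'as_of' date; everything at or
    before the cutoff is training data, everything after is held out.
    """
    train = [r for r in records if r["as_of"] <= cutoff]
    held_out = [r for r in records if r["as_of"] > cutoff]
    return train, held_out

records = [
    {"company": "A", "as_of": date(2021, 6, 1)},
    {"company": "B", "as_of": date(2023, 2, 1)},
    {"company": "C", "as_of": date(2024, 9, 1)},
]
train, held_out = time_split(records, date(2022, 12, 31))
# train contains only A; held_out contains B and C
```

An audit pass over a pipeline like mine mostly consists of checking that every feature respects a boundary like this one.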

Collaborative brainstorming: A

I never would have believed this, because it seems like such a high-level (human) skill. But increasingly I use Claude as a planning partner and sounding board as I make decisions about next steps. I will outline a couple of alternative directions for the medium term, and then just say “Thoughts?”. Claude discusses tradeoffs, finds compromises between them, proposes new alternatives, asks incisive questions, fleshes out details, and then we have a truly collaborative back and forth.

Creating charts and graphs: A+

Claude is just amazingly good at this, and as an intrinsically visual creature I increasingly lean on it to increase my insight into the data and the underlying behaviors. “Claude, can you take all the models we just retrained, make correlation matrices of the predicted probabilities, include the usual ROC curves, and add a table of overlap analysis of companies in the top 50 when sorted by the model’s predictions? And wrap it all up in a PDF file? Thanx!”

These reports are not always perfect the first time, reflecting, I think, the fact that while Claude has multimodal capabilities, it is not multimodal by default for this kind of task. It is producing nicely laid-out tables not because the output looked visually good to Claude, but because Claude has seen so much code that produces nicely laid-out tables. So there will be little layout glitches, like a title that infringes on the top of the table. Protip: do not try to fix issues like this by having Claude read in the whole PDF and “look” at it. Just like some dogs I have met, Claude will likely cheerfully ingest the whole file and then die because the file is larger than its context window. Give it a screenshot instead, and Claude will quickly “see” and fix the issue.

Speed: A++++

This aspect is just insane. If a subproject is meaty, but very well defined, and all the code is going to be new (as opposed to integrated into a codebase that Claude may not know well), then Claude can sometimes make a first draft in two minutes where it would have taken me a full day — a ~1000x speed up. Now of course we have to be careful, and look at all the costs and time subtracted — the misunderstandings, the work that goes into docs, the time to review, and the occasional spinning-out-of-control “No No NO! What are you THINKING!” episode. But I have no doubt that net speedups are still ~10x or so, and that it gets better the more I adapt to Claude (and vice versa, at least within a session).

How I have adapted

My primary adaptation to working with Claude has been to instantly become much lazier. Rather than query-replace within a file to rename a function I say “Claude, can you rename this function? Thanx!”. It’s increasingly rare that I touch any code by hand.

The other main way I have adapted, though, is that I write a lot more docs and project plans. As with any other collaboration, progress depends on having a consensus view of what we’re doing and where we’re going. Claude’s somewhat superhuman tolerance for detail combines well with my own natural verbosity, so that I go into greater granularity than I ever would with a human collaborator. “OK, Claude, let’s move on to Step C, section (iii), sub-step (b), paragraph 2 — can you try implementing that?” Sure, the opposite end of the spectrum, informal vibe-coded exploration, can work really well too, but my job is to express what I want, and so the more detail I can give to describe what I want in this complicated project, the better things seem to go.

Where we are

We are in a weird and oddly-specific moment in time. The era of human beings writing code by hand ended earlier this year (2025). We are not yet at the point where a coding agent like Claude can be entirely trusted to just write a huge piece of software and have it all work as intended and satisfy the underlying need. My back-and-forths with Claude now have me in a role somewhere between engineering manager and technical PM, and for these projects I am needed and my “job” is secure. How long will that be true? As long as we humans want software to do particular things (and AI agents are still aligned enough with us to care what we want), we will play a role as “customers”. But I have no doubt that coding agents like Claude will expand their capabilities to cover the engineering management and technical PM pieces as well. As customers we will still make our requests of that combined technical team, but I don’t think the day is far off when “No humans need apply” to be members of it. Maybe early 2026?

Originally published at https://tim-converse.com on October 16, 2025.
