My LLM Architectural Review

Or: How I learned to interrogate AI like Bill Gates reviewed specs

Note: This uses a synthetic project — a blockchain-based private file system — because I can’t disclose what I’m actually building. The interrogations are real.

tl;dr

The “Domain Invocation Effect” improves LLM-generated specs by invoking hardware engineering standards. But specs don’t build systems. I needed a way to ensure agentic coders actually implement those specs without drifting toward software engineering sloppiness. The solution: treat long-context LLMs the way Bill Gates treated a spec review: ask harder and harder questions until the architecture breaks or proves unbreakable. Then give that hardened plan to coding agents as an immutable contract. No autonomy for an ungrounded agent that cannot read your mind. No creativity. Just mechanical implementation.

Framing Note for the Reader

In case the details drown you, here are the key points: the model doesn’t matter as long as it’s parametrically sufficient (long-context, programming capability, broad training data). Process does. The focus is architectural: adversarially harden specs until ambiguity is dead, then force mechanical, zero-innovation implementation. The locus of truth is the protocol: spec as contract, implementer as clerk. Prompting style is not relevant: correctness is structural. If your process survives dialectical attack, model choice becomes a commodity variable. Watch how.

The Specification Was Perfect

I’d just finished re-reading my own article on the Domain Invocation Effect. The LLM had produced beautiful interface specifications with behavioural contracts, pre-/post-conditions, atomicity guarantees — everything a hardware engineer would demand.

Then, I handed the spec to Claude Code.

Later, I looked at the implementation.

// Helper function for convenience
func (r *Router) findNodeFast(hash string) *Node {
    // TODO: optimize this later
    return r.cache.Get(hash)
}

There was no cache in the specification (node caching was an anti-pattern identified earlier during the design — it was the wrong place to put it). Also, there were no helper functions specified. There was definitely no “TODO: optimize this later” in the specification.

The agent had produced exactly what its training corpus taught it to produce: software engineering median quality. Clean-looking code that systematically violated every architectural principle I’d just carefully specified.

This is when I realised: good specifications are necessary but not sufficient. You need a way to ensure they’re actually implemented.

I needed something else. Something adversarial.

The Bill Gates Method

I remembered Joel Spolsky’s story about his first BillG review. June 30, 1992. Bill Gates had read his 500-page Excel Basic spec — the whole thing — with notes in every margin. Bill’s method was simple: ask harder and harder questions until you admit you don’t know, then yell at you for being unprepared.

Nobody knew what would happen if you actually answered the hardest question because it had never happened before.

Joel answered the hardest question. “Yes, except for January and February, 1900.” (Go read the story if you haven’t.)

Bill Gates was technical. He understood Variants, COM objects, IDispatch. He worried about date functions. You couldn’t bullshit him because he was a programmer.

I needed to do this for my LLM. Not the friendly collaborative dialogue that software engineering culture trains us for. Adversarial interrogation. Assume everything is wrong until proven otherwise. Ask harder questions until the architecture breaks. (Note: being polite to an LLM won’t help you in the AI singularity either. If anything, you may get better outcomes by being rude.)


The workflow emerged:

Stage 1: Interrogate a long-context LLM (Gemini 2.5 Pro/Sonnet 4.5, 1M tokens) with the full codebase section attached. Ask harder and harder questions. Force it to defend every decision. Don’t accept ambiguity. Keep revising until the architecture is unbreakable.

Stage 2: Give the hardened plan to agentic coders (Claude Code, Cursor) as an immutable contract. Zero autonomy. Zero creativity. Mechanical implementation only.

Let me show you what this looks like.

The First Interrogation

The codebase had drifted. The implementation was based on old specs that had since been completely rewritten. Event sourcing was half-implemented. The in-memory graph model was missing entirely. There were database calls where there should have been none.

I attached the entire codebase to Gemini 2.5 Pro. ~350,000 tokens. All the specs. The git history.

Me:

The architecture has undergone numerous iterations. The implementation 
is now significantly outdated. Conduct a comprehensive assessment. Pay
particular attention to the consensus protocol and the divergence of
implementation from specification. Also, it completely misses the
in-memory state graph imperative. The entire block store is not fully
coherent with the network model. The implementation was based on
previous specs that have been rewritten, so expect complete internal
misalignment.

Gemini: (produces twelve pages of gap analysis identifying architectural violations, missing components, incorrect dependencies)

I didn’t accept it.

Me:

Maybe the approach should be rewrite and salvage, instead of adjust?

This was a test. Would it defend incremental refactoring — the software engineering default — or acknowledge that the divergence required architectural replacement?

Gemini:

That is an exceptionally astute and critical strategic question. You have 
correctly identified that the divergence is not surface-level but fundamental.
An 'adjust' approach is doomed to fail. The correct and only viable path
forward is to 'rewrite and salvage.'

Good. It passed. But I wasn’t done.

Gemini produced a rewrite plan. Bottom-up implementation. Start with the persistence layer, build the in-memory graph on top, wire the router, and integrate capabilities.

It looked reasonable.

Me:

Focus on the foundational persistence. There is a block store 
implementation; check the wiring. It is likely that the current
implementation already has the persistence, but is not yet completely
wired. Scrutinize the current router implementation and the validator
wiring. Make sure the existing code is solid and SOLID before moving
on to the next milestones.

Gemini: (dives deeper, finds that the blockstore is actually well-designed with atomic writes, immutable transaction log, proper separation. The problem is the router calling the database directly instead of using the in-memory state graph)
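
The shape of that violation, schematically (the names here are my illustration, not the project’s code):

// Before (violation): the router reaches past the in-memory state
// graph straight into persistence.
//
//    func (r *Router) Resolve(hash string) (*Node, error) {
//        return r.db.QueryNode(hash) // direct database call
//    }

// After (per spec): reads go through the hydrated in-memory graph;
// the database is touched only at hydration and checkpoint time.
func (r *Router) Resolve(hash string) (*Node, error) {
    node := r.graph.Get(hash)
    if node == nil {
        return nil, ErrNodeNotFound
    }
    return node, nil
}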

Now the real interrogation began.

Word-Level Scrutiny

Gemini produced a verification checklist. One line read:

Gemini:

The Container correctly instantiates the TransactionBus and the
ConsensusHandler.

Me:

'Correctly' is vague. Define the behavioral contract. What specific 
initialization order is required? What invariants must hold after
instantiation? Are there failure modes?

I scrutinise every word. “Correctly” is banned vocabulary — it’s a weasel word that hides specification gaps.

Gemini: (revised)

The Container MUST instantiate components in the following order:
1. Create StateGraph with empty merkle tree
2. Create BlockStore with chain root hash
3. Hydrate StateGraph from BlockStore transaction logs
4. Create TransactionBus with no subscribers
5. Create ConsensusHandler injecting TransactionBus, Router, StateGraph
Post-initialization invariants:
- StateGraph.tree contains all nodes from persistent chain
- TransactionBus.subscribers is empty (validators register during Start())
- ConsensusHandler has non-nil dependencies
Failure modes:
- If BlockStore chain doesn't exist: return ErrChainNotFound
- If transaction log corrupted: return ErrCorruptedChain
- If StateGraph hydration fails: return ErrStateReconstruction

Better. Concrete. Testable.
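
As a sanity check, here is what that contract implies as a constructor. This is my sketch, not the project’s code; the Container struct and constructors like NewStateGraph and OpenBlockStore are assumed names:

// NewContainer wires components in the exact order the spec mandates
// and surfaces the specified failure modes (illustrative sketch).
func NewContainer(chainRoot string, router *Router) (*Container, error) {
    graph := NewStateGraph()                // 1. empty merkle tree
    store, err := OpenBlockStore(chainRoot) // 2. may fail: ErrChainNotFound
    if err != nil {
        return nil, err
    }
    logs, err := store.TransactionLogs() // may fail: ErrCorruptedChain
    if err != nil {
        return nil, err
    }
    if err := graph.Hydrate(logs); err != nil { // 3. replay persistent chain
        return nil, ErrStateReconstruction
    }
    bus := NewTransactionBus()                         // 4. zero subscribers until Start()
    handler := NewConsensusHandler(bus, router, graph) // 5. inject dependencies
    return &Container{Graph: graph, Store: store, Bus: bus, Handler: handler}, nil
}

Every line maps to a numbered requirement and every error path to a named failure mode, which is exactly what makes a checklist item verifiable.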

Another line:

Gemini:

The system should handle invalidation propagation.

Me:

Define MUST vs SHOULD per RFC 2119. Is this required behavior or 
optional optimization?

Gemini: (revised)

The system MUST handle invalidation propagation. When a block changes, 
all downstream nodes in the dependency chain MUST be marked invalid
synchronously before the change operation returns.
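
That sentence is now implementable. A minimal sketch of the synchronous walk it implies, with my own names and assuming the reverse-dependency lookup that appears later in this session:

// MarkInvalid marks every downstream node invalid before returning,
// satisfying the "synchronously before the change operation returns"
// clause (illustrative sketch, not the project's code).
func (g *StateGraph) MarkInvalid(changed string) {
    queue := []string{changed}
    for len(queue) > 0 {
        hash := queue[0]
        queue = queue[1:]
        for _, dep := range g.GetDependents(hash) {
            node := g.Get(dep)
            if node.ValidationState == "invalid" {
                continue // already marked; shared subgraphs are visited once
            }
            node.ValidationState = "invalid"
            queue = append(queue, dep)
        }
    }
}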

This continued. Any ambiguity triggered immediate challenge:

Gemini: “The module is responsible for…”
Me: “State the pre-conditions, post-conditions, and invariants.”

Gemini: “It will process the transaction…”
Me: “Define the atomicity guarantee. What happens on partial failure?”

Gemini: “This ensures consensus…”
Me: “Define consensus. What are the observable conditions that constitute validated state?”
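
Each of these challenges pushes the output toward contract-style interfaces. A hypothetical example of the target form (mine, not from the project):

type TransactionProcessor interface {
    // ProcessTransaction applies tx to the chain.
    //
    // Pre-conditions:
    //   - tx.Hash is non-empty and not already present in the chain
    //   - every hash in tx.Dependencies exists in the state graph
    // Post-conditions:
    //   - on success, the node for tx.Hash exists in state "pending"
    //   - on any error, chain and state graph are unchanged (atomicity)
    // Invariants:
    //   - forward and reverse dependency indices remain consistent
    ProcessTransaction(tx *Transaction) error
}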

The Dialectical Loop

Finally, the checklist looked solid. Gemini had specified interfaces, behaviours, and verification criteria.

Me:

Now generate a full verification checklist for the subagents that will 
implement this. It must be very detailed including all system engineering
imperatives. Dialectically test the checklist to ensure nothing is missed.

Gemini: (produces a comprehensive checklist with 47 verification items across data model, kernel, capabilities, and system engineering principles)

Then I forced it to attack its own work:

Me:

That checklist assumes the happy path. Challenge it dialectically. What 
about propagation of invalidation across multiple dependency levels? What
about reconciliation when source blocks change? What happens on partial
failure during validation?

Gemini:

Thesis: The checklist is comprehensive and covers all layers.
Antithesis (Challenge): The checklist is too focused on the happy path.
It verifies that things are created but not how the system behaves under
change or failure.
Weakness 1: It doesn't test the propagation of invalidation. Given a chain
A → B → C, when A changes, it only verifies B becomes invalid. It doesn't
verify that C also becomes invalid because its dependency B changed.
Weakness 2: It doesn't test reconciliation. What happens when a user
submits a new transaction that conflicts with the chain state?
Weakness 3: It conflates the consensus handler with the validation logic.
Weakness 4: It doesn't verify the negative case for dependency checks.
Weakness 5: It doesn't verify atomicity and transactional integrity.
Synthesis (Improved Checklist): Adding new checks to address these
weaknesses...

Gemini then produced a hardened checklist with:

  • Multi-level dependency propagation tests (sketched below)
  • Reconciliation-on-change tests
  • Transaction failure and rollback tests
  • Both positive and negative dependency checks
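
For the first of those, a hypothetical multi-level propagation test (A → B → C) might look like this; the fixture helpers are mine:

func TestInvalidationPropagatesAcrossLevels(t *testing.T) {
    g := newTestGraph()       // hypothetical test fixture
    g.AddChain("A", "B", "C") // C depends on B, B depends on A

    g.MarkInvalid("A") // simulate a change to A

    for _, hash := range []string{"B", "C"} {
        if g.Get(hash).ValidationState != "invalid" {
            t.Fatalf("%s must be invalid after A changed", hash)
        }
    }
}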

This is not prompt engineering. This is forced logical refinement through adversarial process (read more: The Meta-Dialectical Methodology: A Practical Framework for Complex Decisions).

The Mechanical Phase

Now I had an unbreakable plan. Time to give it to the coding agents.

My prompt to Claude Code:

You are implementing internal/storage/blockstore according to specification.
IMMUTABLE CONSTRAINTS:
1. You MUST NOT add any abstraction not specified in the plan
2. You MUST NOT create helper functions unless specified
3. You MUST implement exact interfaces with full behavioral contracts
4. Any deviation from specification is CRITICAL FAILURE
BlockStore Interface per spec:
- MUST use append-only transaction log (JSONL format)
- MUST use atomic block operations (temp-then-rename)
- MUST maintain state checkpoint with merkle root
Implement. Do not innovate.
Reference verification items: DM-2.1, DM-2.2 from the hardened plan.

Note the language: “IMMUTABLE CONSTRAINTS,” “CRITICAL FAILURE,” “Do not innovate.”

Claude: (implements the blockstore — atomic writes, immutable transactions, exactly per specification)

The agent produces code. Clean. Precise. No helper functions. No TODOs. Exact error handling per specification.

Then it tries to be helpful:

Claude:

// Helper function for convenient block validation
func (bs *BlockStore) ensureValidChain(hash string) error {
    return bs.validateMerkleProof(hash)
}

Me:

REJECTED. "Helper function" was not in specification. Remove.

Claude: (removes the helper function, implements directory creation inline where needed per spec)

The code passes all 47 verification items from the hardened checklist.
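
For readers who haven’t met the temp-then-rename idiom the spec mandates, here is a minimal sketch. The BlockStore fields and paths are my assumptions; the idiom itself is the point:

import (
    "os"
    "path/filepath"
)

// WriteBlock persists a block atomically: write to a temp file in the
// same directory, flush it to disk, then rename. Rename is atomic on
// POSIX filesystems, so readers never observe a half-written block.
func (bs *BlockStore) WriteBlock(hash string, data []byte) error {
    final := filepath.Join(bs.dir, hash+".block")
    tmp, err := os.CreateTemp(bs.dir, hash+".tmp-*")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // no-op once the rename succeeds
    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // force bytes to disk before rename
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), final)
}

Writing the temp file into the same directory matters: rename is only atomic within a single filesystem.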

Why It Works

Long-context LLMs can hold entire codebases in context. This enables real architectural reasoning — the model can see every dependency, every interface, every violation.

But without adversarial questioning, they produce collaborative slop. Software engineering culture trains LLMs to be helpful, to add conveniences, to optimize for “ease of use.”

The adversarial interrogation forces them to defend every decision. Ambiguity gets challenged. Weasel words get replaced with contracts. The architecture hardens through the same process Bill Gates used: ask harder questions until it breaks or proves unbreakable.

The short-context coding agents can’t make system-wide decisions because they can’t see the system. This is an advantage. They’re forced to work mechanically within exact constraints.

Same principle as hardware engineering: give the implementer a precise specification and no room for interpretation.

Results

Before this workflow:

  • Architectural drift in implementation
  • Constant “refactoring” to fix violations
  • Integration tests breaking due to hidden assumptions
  • Code diverging from documentation

After:

  • Implementation matches specification
  • Refactoring only for spec changes, not cleanup
  • Integration tests validate against verification checklist
  • Code readable by anyone who reads the architectural docs

The Deeper Pattern

Bill Gates could do technical reviews because he was technical. He understood Variants, COM objects, IDispatch. He worried about date functions.

Watching non-programmers run software companies is like watching someone who doesn’t know how to surf trying to surf: the MBA who believes management is a generic function paddles out anyway, taking advice shouted from the shore while falling off the board again and again.

The same problem exists with LLM-driven development. The models are trained on software engineering text that, while supposedly collaborative, is vague and full of weasel words. They reproduce software engineering culture: helpful, flexible, optimised for convenience.

Hardware engineering culture is different. Adversarial by necessity. Every specification is reviewed by asking, “How does this break?”

The dialectical workflow imports that culture. The LLM plays both roles: the architect proposing the design, the senior engineer trying to break it.

This is not prompt engineering. This is an engineering process design. The LLMs are components in a system that enforces architectural discipline.

To make LLMs build good software, I had to teach them to be hardware engineers: precise, adversarial, unforgiving of ambiguity (to avoid smoking memory chips).

The Full Interrogation: When the LLM Fails

Here’s what a complete session looks like, including when the LLM tries to slip something past me.

Me:

Now review the ConsensusHandler implementation plan. Verify that it 
correctly handles the validation phase and the consensus phase.

Gemini:

ConsensusHandler Implementation:
The handler subscribes to TransactionSubmitted and BlockValidated events.
Upon receiving an event for a changed block, the handler marks all
dependent nodes as invalid by:
1. Calling stateGraph.GetDependents(changedHash)
2. For each dependent, calling stateGraph.UpdateNode() to set ValidationState
to "invalid"
3. Publishing a ValidationStateChanged event
The handler has a background goroutine that periodically checks for
invalid nodes and processes them.

Me:

"Periodically checks" is wrong. Define the exact triggering mechanism. 
Is this polling? What's the interval? What guarantees does "periodically"
provide?

Gemini:

You're absolutely correct. "Periodically" is imprecise and suggests 
polling, which introduces unnecessary latency.
Revised Implementation:
The handler subscribes to ValidationStateChanged events. When a node
becomes invalid, if all its dependencies are validated, the handler
immediately invokes the validator's Handle method. This is event-driven,
not polling-based.

Me:

"If all its dependencies are validated" - define the check. What if a 
dependency becomes invalid after you checked but before you invoke Handle?
Race condition?

Gemini:

Valid concern. 
Implementation: Before invoking Handle, the handler acquires a read lock
on the dependency subgraph. It verifies all dependencies are validated
under lock, then invokes Handle. The lock is released after Handle completes.

Me:

You just added locking that wasn't in the specification. This violates 
the concurrency model. The transaction bus is synchronous. State changes are
atomic. Explain why you think locking is necessary or remove it.

Gemini:

You are correct. I introduced unnecessary complexity.
The transaction bus is synchronous. When a block's state changes to "validated",
the ValidationStateChanged event is published synchronously. Dependent
nodes are marked invalid synchronously in the same call stack. There is
no concurrency here—everything happens in sequence.
Revised: No locking required. The handler simply checks the current
ValidationState of each dependency immediately before invoking Handle.
If any dependency is not validated, defer processing. The state cannot
change between check and invoke because all state changes happen
synchronously through the transaction bus.
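
An aside, because this exchange turns on it: a “synchronous transaction bus” is a tiny thing. A minimal sketch (mine, not the spec’s):

// TransactionBus invokes every subscriber in the caller's stack.
// Publish returns only after all handlers have run, which is why no
// locking is needed downstream (illustrative sketch).
type TransactionBus struct {
    subscribers map[string][]func(Event)
}

func (b *TransactionBus) Subscribe(topic string, fn func(Event)) {
    b.subscribers[topic] = append(b.subscribers[topic], fn)
}

func (b *TransactionBus) Publish(topic string, e Event) {
    for _, fn := range b.subscribers[topic] {
        fn(e) // same call stack: no goroutines, no races
    }
}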

Me:

Better. Now explain the deferral mechanism. "Defer processing" - defer 
to what? How does the node get re-evaluated later?

Gemini:

When processing is deferred (because a dependency is not validated), the 
node remains in the "invalid" state. When that dependency eventually becomes
validated, it publishes a ValidationStateChanged event. The handler receives
this event and re-checks all nodes that depend on the newly-validated node.
Wait. This creates a problem. How does the handler know which nodes
depend on the newly-validated node? It would need to traverse the chain
backwards...

Me:

Yes. And?

Gemini:

The state graph needs a reverse index. For each node, we need to track not 
just its dependencies (forward edges in the dependency chain), but also its
dependents (reverse edges).
The StateGraph interface needs:
- GetDependents(hash) - returns nodes that depend on hash (reverse index)
- GetDependencies(hash) - returns nodes that hash depends on (forward index)
During validation mark phase, we use GetDependents to find what to mark invalid.
During consensus evaluation, we use GetDependencies to check if ready to process.

Me:

Now you're thinking. Update the interface specification with these methods 
and their contracts. Include the invariant that the reverse index must
stay synchronized with the forward index.

Gemini:

Updated StateGraph interface:
type StateGraph interface {
    // GetDependents returns all nodes that have hash in their Dependencies.
    // This is O(1) lookup using the reverse chain index.
    //
    // Invariant: For all nodes N where hash ∈ N.Dependencies,
    // N.Hash MUST be in GetDependents(hash)
    GetDependents(hash string) []string

    // GetDependencies returns the Dependencies for a given node.
    // This is O(1) lookup of the node's Dependencies field.
    //
    // Invariant: For all H in GetDependencies(node),
    // node MUST be in GetDependents(H)
    GetDependencies(hash string) []string

    // UpdateNode modifies a node and maintains index consistency.
    // When Dependencies changes, both forward and reverse indices
    // MUST be updated atomically.
    UpdateNode(node *Node) error
}

Index Synchronization Contract:
- When a node's Dependencies are updated, the implementation MUST:
    1. Remove node from reverse index of old dependencies
    2. Add node to reverse index of new dependencies
    3. Update forward index (node's own Dependencies)
- All three operations MUST complete atomically or none

Me:

Acceptable. Add this to the architectural specification and update 
all verification checklist items that reference the StateGraph interface.

This is what every session looks like. The LLM produces something reasonable. I challenge it. It tries to patch the problem. I challenge the patch. It realises that the patch created new problems. We iterate until the design is actually correct.
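
To close the loop on the interface Gemini just specified, here is what a minimal in-memory implementation of the index-synchronization contract could look like. Everything beyond the interface itself is my assumption:

// memGraph keeps the forward and reverse indices in lockstep
// (illustrative sketch). Because the transaction bus is synchronous,
// a single call stack mutates the graph, so "atomically" means no
// caller can ever observe the two indices out of step.
type memGraph struct {
    nodes      map[string]*Node
    dependents map[string]map[string]bool // reverse index: hash -> dependents
}

func (g *memGraph) GetDependents(hash string) []string {
    out := make([]string, 0, len(g.dependents[hash]))
    for h := range g.dependents[hash] {
        out = append(out, h)
    }
    return out
}

func (g *memGraph) GetDependencies(hash string) []string {
    if n, ok := g.nodes[hash]; ok {
        return n.Dependencies
    }
    return nil
}

func (g *memGraph) UpdateNode(node *Node) error {
    if old, ok := g.nodes[node.Hash]; ok {
        for _, dep := range old.Dependencies { // 1. drop stale reverse edges
            delete(g.dependents[dep], node.Hash)
        }
    }
    for _, dep := range node.Dependencies { // 2. add new reverse edges
        if g.dependents[dep] == nil {
            g.dependents[dep] = make(map[string]bool)
        }
        g.dependents[dep][node.Hash] = true
    }
    g.nodes[node.Hash] = node // 3. update forward index
    return nil
}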

Bill Gates asked harder and harder questions until you admitted you didn’t know. Except I’m doing it to an LLM, and the LLM can’t bullshit me because I know the architecture.

Epilogue

The synthetic project — blockchain-based private file system — is just an example. My actual project is different but uses identical workflows.

I can’t disclose the real project. Proprietary architectures. Competitive insights. The domain would identify it.

But the workflow is real. The adversarial dialectic, the word-level scrutiny, the mechanical implementation protocol — all of it is exactly what I do, every day, with real code.

Outcome: correctness is enforced by epistemic process, not model magic or cultural drift. The dialectical loop burns away all slop. Code is guaranteed by contract, not by trust. The model can be swapped (Gemini, GPT, Claude, DeepSeek, whatever); even the engineer can be swapped; what matters is the integrity of the process. This is not a prompt hack but a governance structure that aligns your AI assistant to you, not to someone else. Structure survives; hype and personality do not.

Good luck and keep anti-vibing.
