Who Watches the Watchmen? AI Code Generation and the Limits of Code Review

Originally published on 12th March 2026 on the WebRTC.ventures blog as Who Watches the Watchmen? AI Code Generation and the Oversight Problem. This is a reviewed and cleaned-up version.

Some weeks ago I read an article that captured something many experienced developers have been feeling for some time: software development is changing rapidly in the age of generative AI, and not always in ways we fully understand.

One quote from the article especially resonated with me:

An MIT professor called AI “a brand new credit card that lets us accumulate technical debt in ways we were never able to before”. That credit card now writes 41% of the code.

Whether the exact number is accurate or not (former Stability AI CEO Emad Mostaque has been saying so since 2023… and he predicts that it will be 100% by the end of the decade), the point is clear: a large and growing portion of modern codebases is now generated with the help of AI tools. This has undeniable benefits: developers can prototype faster, explore ideas quickly, and automate repetitive tasks that previously consumed valuable time.

But this acceleration also introduces a structural challenge: we are producing code faster than we can reasonably understand and validate it. Some years ago, AI systems became so complex that we started to lose track of how they think, and they grew increasingly opaque. Now that they are starting to write our code, we are reaching a point where we no longer understand how our own code works either, nor how well it does so.

In many teams, the question is no longer whether AI-generated code should be used, but how we can maintain quality and reliability when the volume of generated code keeps increasing.

And increasingly, it raises a deeper question: we may start to need AI to review and explain the code generated by AI, but who watches the watchmen?

The Review Bottleneck

In theory, modern development workflows are designed to maintain quality through code reviews. Experienced engineers review pull requests before changes are merged, both to check that they align with the project’s standards and the overall architecture of the system and to mentor more junior developers towards them. Automated tests provide additional guarantees about stability, performance, and correctness.

In practice, the scale of AI-generated code is changing the dynamics of reviews.

Several studies suggest that average code reading speed is around 1000 lines of code per hour, assuming sustained concentration. That means reviewing a 1000+ SLOC pull request may already require an hour or more of focused work under ideal conditions.

In reality, reviews rarely happen under ideal conditions: engineers are interrupted by meetings, messages, and other responsibilities. Context switching is constant. Maintaining full concentration for extended periods is difficult. And let’s be honest: code reviews are boring. They are mostly done both to help our colleagues and to decompress between more intensive, focus-demanding creative tasks, usually with a cup of coffee in hand.
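As a rough back-of-the-envelope sketch of that arithmetic, the snippet below estimates review time from the reading speed quoted above. The function name and the `focus` parameter (the fraction of time actually spent concentrating on the review) are assumptions made for illustration; they do not come from any study.

```python
def review_hours(sloc: int, lines_per_hour: float = 1000.0, focus: float = 1.0) -> float:
    """Estimate the hours needed to review `sloc` lines of code.

    `lines_per_hour` defaults to the reading speed quoted in the text.
    `focus` is a hypothetical factor in (0, 1]: 1.0 means ideal,
    uninterrupted concentration; lower values model interruptions.
    """
    if sloc < 0:
        raise ValueError("sloc must be non-negative")
    if lines_per_hour <= 0:
        raise ValueError("lines_per_hour must be positive")
    if not 0 < focus <= 1:
        raise ValueError("focus must be in the range (0, 1]")
    return sloc / (lines_per_hour * focus)


# A 1000-line pull request under ideal conditions.
print(review_hours(1000))             # 1.0
# A 5000-line generated change, reviewed with half of the time lost to interruptions.
print(review_hours(5000, focus=0.5))  # 10.0
```

Even with generous assumptions, the cost grows linearly with the volume of generated code, while the time needed to generate it barely grows at all.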

AI-assisted development changes the equation further: developers are now able to produce significantly more code than before. Personally, I have experienced periods where (with properly defined specs and a carefully crafted prompt) AI assistance allowed me to produce in minutes code that would have taken me days or weeks to write manually. This is a tremendous boost in productivity, but the result is not necessarily a reduced workload. Instead, the nature of the work shifts from writing code to reviewing and validating increasingly large volumes of generated code.

This kind of review is cognitively demanding. After some time, it becomes easy to lose focus and overlook subtle issues, especially when reviewing large blocks of code that are technically correct but not particularly expressive or insightful. We have swapped intellectual responsibilities with the AIs: we delegate to them the creative task of programming and are left with the mechanical and repetitive task of reviewing and validating their creations. And we have done it mostly consciously, in the name of progress.

Ironically, code generated by AI often looks clean and well structured, and frequently includes tests. Coverage numbers may look better than ever (yay!), yet this apparent order can hide deeper structural issues that only become visible over time.

The Technical Debt Explosion

Technical debt has always been part of software development. Teams often accept short-term compromises in exchange for faster delivery, with the intention of addressing them later.

In practice, technical debt is rarely repaid as systematically as planned.

AI-assisted development introduces a new dimension to this dynamic: the rate at which codebases grow is increasing significantly. When code volume increases faster than architectural understanding, complexity grows as well, because there is nobody left with a comprehensive view of the system as a whole who can identify the flaws and decide how to simplify it. Everything is added, nothing is removed, the dumpster fire grows, and the codebase becomes more and more complex, with more and more bugs and a larger attack surface, without a clear direction or strategy to manage it. This is a perfect recipe for accumulating technical debt at an unprecedented pace.

This does not mean AI-generated code is inherently bad. In many cases it is perfectly serviceable and sometimes excellent. The challenge is not individual code fragments, but the long-term evolution of entire systems.

One recurring pattern is the tendency for generated code to favor local solutions rather than integration with existing abstractions or external libraries. Over time this can lead to duplication and fragmentation across the codebase. Without deliberate architectural oversight, the cumulative effect may be an increase in complexity and maintenance cost: the same codebase ends up with multiple implementations of similar functionality that differ subtly in the way they work. Some of them are incomplete and fail on use cases that are already handled in the other implementations, so those fixes never translate into a more robust common solution (on top of the extra memory consumption).

Maintaining long-term system coherence requires something that cannot easily be generated automatically: a contextual understanding of the system as a whole. That understanding comes from first-hand experience working with the system or, for some especially complex systems, from having designed it from scratch. By fully delegating development to AI, we have neither.

The Limits of Human Review

Traditionally, experienced developers provided that system-level understanding. Senior engineers and software architects accumulated deep knowledge of the codebase and guided its evolution over time. They also transferred that knowledge to new contributors and mentored more junior developers.

However, as AI tools take on a larger share of code production, the role of senior developers increasingly shifts toward reviewing generated output rather than building and evolving systems directly.

This shift has subtle consequences.

Reviewing code is not the same as designing it. Architectural intuition develops through direct interaction with systems: writing code, refactoring it, and understanding its behavior over time. When engineers spend most of their time reviewing large volumes of generated code, maintaining that depth becomes more difficult.

At the same time, developers themselves are becoming dependent on AI-assisted workflows. Many of us have experienced how dramatically productivity can drop when those tools are unavailable, since that level of productivity has become the new standard. VSCode Copilot, for example, has become an integral part of many developers’ workflows, and Claude Code is gaining a lot of popularity, especially for developing full applications and pet projects that would have taken a full weekend just 3-4 years ago.

When VSCode Copilot was launched, it was a game-changer for many, like having a colleague available for pair programming at all times (while I’m writing this, it’s continuously identifying and fixing typos and completing full sentences, as if it were peeking at the screen over my shoulder). With the chat functionality it has become even more powerful, allowing us to ask for explanations, suggestions, and improvements on the fly, not to mention how good it is at generating code, especially for unit tests or with the latest models like GPT-5.3 Coder Max. But this also means that when those tools are not available, or we need to use older free models like GPT-4o or GPT-5 mini, we may struggle to maintain the same level of productivity and quality.

This dependency is not necessarily negative. Powerful tools have always reshaped software development. But it does highlight how central these systems have become to everyday work, and how dependent we have become on the services of a handful of big AI corporations, just to be able to do our work and receive a paycheck at the end of the month. The question is not whether AI will remain part of the development process (it clearly will) but how development practices must evolve to support this new reality.

What Comes Next

If AI-assisted development continues to accelerate, traditional workflows may no longer be sufficient to maintain system quality at scale. Manual review alone cannot scale indefinitely with code volume. Increasing the number of reviewers does not fully solve the problem if each reviewer faces the same cognitive limits.

A likely next step is the development of systems that make software evolution machine-readable and traceable by design: instead of treating development history as a collection of commits and pull requests intended primarily for human consumption, future systems may represent changes as structured events that can be analyzed automatically.

Such systems could provide:

Fine-grained traceability of how code evolves over time
Machine-readable records of AI architectural decisions
Automated detection of structural inconsistencies
Continuous monitoring of technical debt accumulation
Reproducible histories of system evolution
Analysis of alternative implementations and their trade-offs over time
Review and maintenance of legacy code, looking for silent bugs hidden behind “it worked”
Proposals for refactoring and simplification based on long-term trends
Generated documentation that captures architectural intent and rationale
Constant monitoring of code quality metrics and architectural consistency across the codebase
AI-assisted identification of potential issues before they become critical, by analyzing patterns in code changes and their impact on system behavior

In such environments, AI systems would not only generate code, but also help supervise and evaluate it in a consistent and auditable way.
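As a purely illustrative sketch of what "changes as structured events" could look like, the snippet below models a single change as a small record that can be serialized and queried. Every name here (`ChangeEvent`, its fields, `unreviewed_ai_lines`, `net_growth`) is hypothetical and invented for this example; no existing system or standard is implied.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChangeEvent:
    """One machine-readable record of a change (all field names are hypothetical)."""
    event_id: str
    author_kind: str                    # "human" or "ai"
    files: tuple[str, ...]
    lines_added: int
    lines_removed: int
    rationale: str                      # architectural intent behind the change
    reviewed_by: tuple[str, ...] = ()   # empty when nobody has reviewed it yet


def to_json_line(event: ChangeEvent) -> str:
    """Serialize an event as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(event), sort_keys=True)


def unreviewed_ai_lines(events: list[ChangeEvent]) -> int:
    """Count lines added by AI that no reviewer has signed off on."""
    return sum(e.lines_added for e in events if e.author_kind == "ai" and not e.reviewed_by)


def net_growth(events: list[ChangeEvent]) -> int:
    """Lines added minus lines removed: a crude signal of 'everything added, nothing removed'."""
    return sum(e.lines_added - e.lines_removed for e in events)


history = [
    ChangeEvent("evt-1", "ai", ("parser.py",), 420, 0, "Add a local date parser"),
    ChangeEvent("evt-2", "human", ("parser.py",), 15, 60, "Reuse the shared date parser", ("alice",)),
    ChangeEvent("evt-3", "ai", ("api.py", "models.py"), 300, 10, "Add export endpoint", ("bob",)),
]

print(to_json_line(history[0]))
print("Unreviewed AI-generated lines:", unreviewed_ai_lines(history))
print("Net growth in lines:", net_growth(history))
```

The point of a structure like this is that questions such as "how much generated code has nobody looked at?" become simple queries instead of archaeology over commit messages.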

In short, a system like this would not replace human judgment, as other trends seem to aim for. Instead, it would let humans focus on higher-level creative and reasoning tasks by giving them better information for their decisions, while low-level automated systems take care of the repetitive tasks that nobody wants to do.

Conclusion

AI-assisted development is already transforming how software is written. Developers can move faster than ever before, and many long-standing limitations are being reduced or eliminated.

But at the same time, new challenges are emerging. Increasing code volume places pressure on review processes and architectural oversight. Maintaining long-term system quality requires adapting our tools and workflows to match the speed of modern development.

We may soon rely on AI systems not only to generate code, but also to help us review and understand it.

But even in that scenario, one question remains worth asking:

If AI reviews the code generated by AI, who watches the watchmen?

Note: Parts of this work were developed with the assistance of AI tools. All opinions, ideas, experiments, validations and conclusions are my own.

Written on May 19, 2026