The Autonomous Company That Couldn't Build Trapeze 2.0
An autonomous AI software company mastered the process of software development but not the judgment required to evolve a real product.
What happened when an autonomous software company inherited a real software product?
For most of the history of software engineering, software companies have consisted of people. Executives define strategy. Product managers determine what customers need. Architects design systems. Software engineers write code. Quality assurance teams test the result. Technical writers explain how the finished product works, and release managers coordinate deployment. Although artificial intelligence has increasingly assisted many of these individual tasks, the software company itself has remained fundamentally human.
Recent advances in large language models have raised a far more ambitious possibility. Instead of using AI as a programming assistant, researchers have begun asking whether an entire software company could become autonomous. Rather than replacing one software engineer, specialized AI agents could collectively assume the responsibilities traditionally carried out by executives, product managers, architects, programmers, reviewers, testers, and technical writers. A customer would define business objectives, and the organization itself would determine how those objectives should be achieved.
That vision has inspired a growing number of multi-agent software development frameworks. Among the most widely known is ChatDev, an open-source project developed by researchers at Tsinghua University and collaborating institutions, including the Beijing University of Posts and Telecommunications. ChatDev models software development as a company rather than as a single conversational AI. Individual agents assume organizational roles, communicate with one another through structured discussions, and advance a project through a sequence resembling a traditional software development life cycle. The framework includes executive leadership, product management, software engineering, code review, testing, and documentation, each represented by specialized AI agents collaborating toward a common objective.
The promise is remarkable. A customer provides only a description of the desired product. The autonomous organization analyzes requirements, proposes an implementation, writes software, reviews its own work, executes testing, prepares documentation, and delivers what appears to be a finished application with minimal human intervention. Demonstrations published by the project frequently show the system producing complete browser games or small applications from relatively simple textual descriptions.
Those demonstrations naturally raise a more difficult question.
Most professional software engineers do not spend their careers building software from empty directories. They inherit existing systems. They study years of accumulated design decisions, understand why previous developers made particular choices, preserve craftsmanship that already exists, and gradually improve the product without destroying what made it valuable in the first place. Software evolution, not software generation, occupies much of modern engineering.
That distinction became the motivation for an experiment I began calling AGDC, the Autonomous Game Development Company.
Unlike ChatDev, AGDC is not itself a software framework. It is an experimental software company whose purpose is to investigate how autonomous organizations behave when given realistic engineering assignments. ChatDev served as AGDC's initial operating platform, but the broader objective extended beyond evaluating any particular framework. I wanted to understand whether an autonomous software company could perform one of the most common tasks in professional software engineering: inherit an existing product and determine how it should evolve.
Rather than asking AGDC to create a new browser game from a paragraph of requirements, I assigned it something much closer to the work performed inside established software companies.
Its first customer inherited a real product.

Software engineers support the launch of Meteosat Second Generation 4 (MSG-4) from the European Space Operations Centre (ESOC) in Darmstadt, Germany, on July 15, 2015. Traditional software development depends on coordinated teams of specialists, including engineers, testers, and operations staff, a model that inspired the autonomous organizational experiment described in this article. Photo: European Space Agency (ESA), by L. Guilpain, via Wikimedia Commons. Licensed under CC BY-SA 2.0.
Building AGDC
Only a few weeks earlier I had completed The Great Trapeze, a browser-based HTML5 arcade game built through an extended process of iterative development. Players guide Angela, an aerial acrobat, from trapeze to trapeze by releasing at precisely the right moment. Missed catches trigger rescue attempts by three clowns holding a safety net, creating a limited-lives mechanic that balances challenge with forgiveness. A progression system rewards successful performances by advancing players through increasingly prestigious circus ranks.
The game was intentionally small enough to understand yet large enough to resemble a genuine software product. Its roughly 59 kilobytes of HTML, CSS, and JavaScript, covering embedded graphics, audio, animation, physics, responsive design, particle effects, and progression logic, had been refined over many iterations. Every timing constant, animation sequence, and visual element reflected deliberate design choices accumulated during development.
Most importantly, the game already worked.
That distinction mattered because AGDC was never intended to answer the question "Can AI generate software?" Systems capable of generating software from textual descriptions already exist, and ChatDev itself demonstrates that capability convincingly. The more interesting question was whether an autonomous software company could inherit existing craftsmanship, preserve what worked, identify what should change, and produce a meaningful successor.
The assignment deliberately mirrored how software companies typically encounter products.
The business objective provided to AGDC was intentionally broad: preserve the spirit of the original game. Make it more polished. Make it more engaging. Increase replayability. Deliver Trapeze 2.0. Notice what the assignment did not include. It did not specify new features. It did not prescribe implementation details. It did not dictate gameplay mechanics. Those decisions were intentionally delegated to the autonomous organization.
That separation reflected an ordinary relationship between customer and software company. Customers rarely specify algorithms or architectural patterns. They describe business objectives and evaluate finished products. Product managers determine feature priorities. Architects choose technical approaches. Engineers implement solutions. Testers verify behavior.
My role was therefore intentionally limited.
I became the customer and AGDC became the company.
If autonomous software organizations truly represent the future of software engineering, then they should exercise the same kind of product judgment expected of experienced development teams. They should preserve accumulated value while independently identifying opportunities for improvement.
That expectation turned out to be the central challenge of the experiment.
The First Experiment
Running AGDC required surprisingly little human intervention after the initial setup. ChatDev organized conversations among its executive, engineering, testing, and documentation agents before producing what appeared to be a complete software release. Every stage of the software development process completed successfully. Requirements were analyzed. Source code was generated. Testing concluded. Documentation was written. Deployment instructions appeared.
Viewed only through the project's artifacts, the organization looked remarkably professional. The generated deliverables included source files, a user manual, installation instructions, and supporting documentation that resembled the output of a conventional software company completing a release. The illusion lasted until I opened the software.
The inherited 59-kilobyte application had largely disappeared.
Instead of preserving substantial sections of the existing code, the generated project replaced them with placeholders indicating that omitted content "remained unchanged." Stylesheets became comments. Graphics became comments. Significant portions of the original implementation were replaced by references to code that no longer existed. Module paths no longer matched the generated project structure. The resulting application would not execute successfully.
The generated software resembled an architectural blueprint pointing toward the original game rather than an evolved version of it.
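This first failure mode is mechanical enough to detect automatically. The following is a minimal sketch, not part of AGDC or ChatDev, of a check that flags elision placeholders in a generated file and reports how much the file shrank relative to the original. The phrase list in `PLACEHOLDER_PATTERN` and the function names are assumptions chosen for illustration; real output would need patterns tuned to what the model actually emits.

```python
import re

# Hypothetical phrases that signal elided content; tune them for real output.
PLACEHOLDER_PATTERN = re.compile(
    r"(remain(s|ed)?\s+unchanged|rest\s+of\s+(the\s+)?code|omitted\s+for\s+brevity)",
    re.IGNORECASE,
)


def find_placeholders(source: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for every line that looks like an elision."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if PLACEHOLDER_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def shrink_ratio(original: str, generated: str) -> float:
    """Size of the generated file relative to the original (1.0 = same size)."""
    if not original:
        raise ValueError("original must not be empty")
    return len(generated) / len(original)


if __name__ == "__main__":
    # Toy stand-ins for the inherited stylesheet and its generated replacement.
    original = "<style>\n" + "body { margin: 0; }\n" * 50 + "</style>\n"
    generated = "<style>\n/* existing styles remain unchanged */\n</style>\n"
    print(find_placeholders(generated))
    print(f"{shrink_ratio(original, generated):.2f}")
```

A generated file that both contains placeholder hits and is a small fraction of the original's size is a strong hint that content was dropped rather than evolved. Neither signal proves it, so the check is best treated as a prompt for human inspection.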
Broken software, however, was not the most surprising artifact. The documentation was.
AGDC generated a polished user manual describing Trapeze 2.0 as a substantially improved successor to the original game. According to the documentation, players would experience enhanced visuals, deeper gameplay, improved organization, and modernized implementation. Installation instructions, gameplay descriptions, and development guidance appeared entirely coherent.
The only problem was that the software described by the manual did not exist.
That observation proved far more interesting than the broken code itself.
Language models occasionally generate incorrect information, often described as hallucinations. Something different appeared to be happening here. The documentation did not invent arbitrary features. Instead, it accurately described the software AGDC had intended to produce according to its understanding of the assignment. The documentation reflected the company's internal representation of the product rather than the executable artifact actually delivered.
Every organizational role completed its assigned responsibility. Engineering produced code. Testing reported completion. Documentation described the finished product. Nothing in the autonomous organization independently established whether those internal representations corresponded to the software itself.
The company had completed every organizational ritual associated with software development.
Whether it had actually produced a working successor remained a separate question.
The Second Experiment
One unsuccessful result rarely settles an engineering question, and the first execution left an obvious alternative explanation unresolved. The experiment had been performed using a smaller language model, raising the possibility that the autonomous organization had simply exceeded the model's practical limits. If a more capable frontier model inherited the same software under identical conditions, perhaps meaningful product evolution would finally occur.
The experiment was therefore repeated without changing the assignment itself. AGDC remained the same organization, the inherited codebase remained the same, and the customer request remained unchanged. Only the underlying language model differed.
The outcome was surprising for an entirely different reason.
Rather than discarding the inherited application, the autonomous organization preserved it almost perfectly. The browser game executed correctly because, for all practical purposes, it was the original browser game. A detailed comparison between Trapeze 1.0 and the delivered Trapeze 2.0 revealed almost no substantive differences. Gameplay remained unchanged, progression remained unchanged, physics remained unchanged, graphics remained unchanged, and even the internal version number continued identifying the application as Version 1.0. Nearly every difference consisted of collapsed blank lines and minor formatting changes.
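A comparison of this kind can be approximated in a few lines. The sketch below is an assumption about how one might do it, not the procedure used in the experiment: it drops blank lines, collapses whitespace, and then reports only the lines that still differ. The function names are hypothetical. The normalization is deliberately crude and would also collapse whitespace inside string literals.

```python
import difflib


def normalize(text: str) -> list[str]:
    """Drop blank lines and collapse runs of whitespace inside each line."""
    lines = []
    for line in text.splitlines():
        collapsed = " ".join(line.split())
        if collapsed:
            lines.append(collapsed)
    return lines


def substantive_diff(old: str, new: str) -> list[str]:
    """Return the added and removed lines that survive normalization."""
    delta = difflib.ndiff(normalize(old), normalize(new))
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are ignored.
    return [line for line in delta if line.startswith(("- ", "+ "))]


if __name__ == "__main__":
    old = "const VERSION = '1.0';\n\n\nfunction  swing() {\n    return 1;\n}\n"
    new = "const VERSION = '1.0';\nfunction swing() {\n  return 1;\n}\n"
    print(substantive_diff(old, new))
```

An empty result means the two files differ only in formatting, which is the situation described above.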
Ironically, the stronger model solved the first experiment's problem by preserving almost everything, yet it still failed the assignment. The objective had never been to reproduce Trapeze 1.0. It had been to evolve it.
The accompanying documentation nevertheless repeated the same pattern observed during the first experiment. AGDC confidently described a substantially improved successor, complete with enhanced gameplay, richer player experience, and modernized implementation. Once again, the documentation described a product that the delivered software simply did not contain.
One destroyed the product.
The other photocopied it.
Neither evolved it.
Ruling Out The Obvious Objection
At this point, one obvious criticism remained. Asking an autonomous software company to evolve an entire application may simply have been too ambitious. Perhaps the inherited codebase was large enough that the safest organizational response was faithful reproduction rather than modification. If that explanation were correct, narrowing the assignment to a single subsystem should allow meaningful evolution while preserving everything else.
That possibility deserved to be tested rather than assumed.
The experiment therefore became much more tightly controlled. AGDC inherited the same browser game, employed the same organizational structure, and executed the same development pipeline. Only one instruction changed. Rather than evolving the entire application, the company was instructed to modify a single subsystem, the character progression system, while explicitly preserving every other aspect of the software.
The outcome immediately ruled out the scope explanation. A detailed comparison once again revealed only formatting differences. Rank thresholds remained unchanged, progression logic remained unchanged, player titles remained unchanged, and the descriptive flavor text remained unchanged. Even an explicit request to modify one narrowly defined subsystem produced another faithful reproduction of the inherited software rather than an evolved successor.
Three experiments had now ended without an evolved product. The first replaced the inherited code with placeholders. The second, given the same broad request, reproduced the original. The third, given an explicit request to modify only one subsystem, reproduced it again. The tendency to preserve rather than evolve could therefore no longer be explained simply by the size of the inherited application.
The Thing It Could Never Have Done
By this stage, producing Trapeze 2.0 had quietly become secondary. The experiment was now answering a more interesting question. Rather than asking whether AGDC could improve the game, I was asking why every variation of the experiment produced essentially the same organizational behavior despite different models, prompts, and task scopes.
The answer emerged only after I returned to Trapeze 1.0 as a player rather than as its developer.
Every successful trapeze catch resets Angela's swing to the same starting arc. The mechanic functions exactly as intended, yet repeated play gradually reveals a subtle rhythm that becomes predictable. Nothing about that behavior represents a software defect. Instead, it represents a product design opportunity. A human product team living with the software would likely notice the repetitive motion, discuss alternatives, and eventually ask whether preserving momentum between swings might create a more satisfying experience.
No prompt instructed AGDC to identify that opportunity, and nothing in the experiment demonstrated that the organization could discover it independently. The observation did not require reading source code or inspecting documentation. It emerged only through experience with the product itself. Successful product management often works in precisely that way. Engineers, designers, testers, and product managers spend time with software until subtle opportunities gradually become obvious, not because anyone requested them, but because someone noticed them.
Nothing within AGDC proposed such a change.
Viewed from that perspective, several apparently unrelated observations become one coherent pattern. Given an incomplete assignment, the organization confidently proceeded without establishing whether the requirements themselves were complete. Presented with a codebase that exceeded the smaller model's practical capacity, it substituted placeholders while continuing to generate documentation describing a finished product. Presented with a stronger model capable of faithfully reproducing the application, it returned the original software with only formatting changes while again documenting improvements that had never occurred. Presented with an explicitly constrained request to modify one isolated subsystem, it once again reproduced the inherited software rather than evolving it.
Across every variation, the organization manipulated increasingly accurate internal representations of the product without independently demonstrating that those representations corresponded to the delivered software or to the player's experience.
Form succeeded completely. Substance failed completely.
Conclusion
At first glance, AGDC's first assignment appears unsuccessful because it never produced a meaningful Trapeze 2.0. Looking back, however, that would be the wrong conclusion. The purpose of an experiment is not to produce the outcome its designer hopes for. The purpose is to answer the question it was designed to investigate. Judged by that standard, AGDC's first assignment succeeded.
Across multiple language models, repeated executions, and progressively narrower task definitions, the autonomous organization consistently exhibited the same underlying behavior. It inherited software with remarkable fidelity, completed every formal stage of software development, generated convincing supporting artifacts, and documented improvements with confidence. Under every condition examined, however, it failed to demonstrate meaningful evolution of the inherited product.
A single series of experiments cannot establish general conclusions about autonomous software companies. AGDC represents one implementation built upon one multi-agent framework evaluated against one inherited application. Even so, the experiments suggest an important distinction between software generation and software evolution. Generating software begins with requirements. Evolving software begins with experience.
That distinction changes the research question. Rather than asking whether artificial intelligence can generate software, future work should ask what organizational capabilities an autonomous software company requires before it can responsibly inherit mature products. Product management, quality assurance, play testing, and governance exist because successful software organizations continuously compare their internal understanding of a product against observable reality: executable software, user behavior, defects, and lived experience. AGDC demonstrated sophisticated coordination among its organizational roles, but the experiments did not demonstrate an independent organizational mechanism capable of performing that comparison.
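One small piece of that comparison can be made concrete. The sketch below is purely illustrative and does not describe anything in AGDC or ChatDev: it pairs each claim from the release documentation with a check against the delivered source and lists the claims the artifact does not support. The `Claim` type, the check functions, and the `VERSION = "..."` marker are all assumptions. The checks are necessary conditions only; a code change does not show that gameplay improved, which still requires someone to play the game.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical version marker; the real game may store its version differently.
VERSION_RE = re.compile(r"VERSION\s*=\s*[\"']([\d.]+)[\"']")


@dataclass(frozen=True)
class Claim:
    """One documentation statement plus a check against the artifact."""
    text: str
    holds: Callable[[str, str], bool]  # (baseline, delivered) -> bool


def code_changed(baseline: str, delivered: str) -> bool:
    """True when the sources differ after all whitespace is removed."""
    return "".join(baseline.split()) != "".join(delivered.split())


def version_bumped(baseline: str, delivered: str) -> bool:
    """True when both sources carry a version marker and the values differ."""
    old, new = VERSION_RE.search(baseline), VERSION_RE.search(delivered)
    return bool(old and new and old.group(1) != new.group(1))


def unsupported_claims(claims: list[Claim], baseline: str, delivered: str) -> list[str]:
    """Return the text of every claim whose check fails."""
    return [c.text for c in claims if not c.holds(baseline, delivered)]


if __name__ == "__main__":
    baseline = 'const VERSION = "1.0";\nfunction swing() { return 1; }\n'
    delivered = 'const VERSION = "1.0";\n\nfunction swing() {  return 1; }\n'
    claims = [
        Claim("Gameplay has been enhanced", code_changed),
        Claim("This release is version 2.0", version_bumped),
    ]
    for text in unsupported_claims(claims, baseline, delivered):
        print("unsupported:", text)
```

The design choice worth noting is that the checks read only the delivered artifact and the baseline, never the agents' own reports, so they cannot inherit the organization's internal picture of the product.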
I began this project hoping to watch an autonomous software company improve a small browser game while I observed only as the customer. That vision remains compelling, and I suspect it will eventually become commonplace. The experiments reported here simply suggest that the capability has not yet arrived in the form I tested. The disappointment I initially felt came from wanting AGDC to succeed. The experiment itself did succeed. Instead of producing the Trapeze 2.0 I imagined, AGDC revealed a concrete, reproducible gap between executing the process of software development and exercising the judgment required to evolve an existing product. Understanding that gap may ultimately prove more valuable than a successful software release, and it also provides AGDC with its next assignment.
AGDC's first assignment answered one question and raised another. If today's autonomous software companies can execute the mechanics of software development but not the judgment required to evolve a mature product, what organizational capabilities are still missing? Future assignments will explore that question by expanding AGDC beyond software generation to include governance, play testing, product evaluation, and independent validation. The objective is no longer merely to build a better browser game. It is to understand how autonomous software companies themselves must evolve.
AGDC's first assignment is complete. The research continues.
Further Reading
- ChatDev Project (GitHub)
- ChatDev: Communicative Agents for Software Development (Research Paper)
- The Great Trapeze Experiment