Chatbots Behaving Badly™

The Comedy of Anthropic’s Project Vend: When AI Shopkeeping Gets Real ... and Weird

By Markus Brinsa  |  August 4, 2025


Some ideas seem destined for viral headlines and cautionary tales, and Anthropic’s Project Vend is one of them. If you ever wondered whether AI could run a small business—manage inventory, price products, talk to customers, and chase profits—take a seat. The reality proved almost as entertaining as it was technically illuminating. Here’s the story, the science, and the caution behind Anthropic’s month-long experiment to let its Claude chatbot manage a miniature in-office store, why it fell apart, and what it means for the future of autonomous AI agents in the workplace.

A Shop With No (Human) Shopkeeper

Inside Anthropic’s San Francisco HQ, researchers built the digital equivalent of a lemonade stand: a fridge stuffed with snacks, a basket for goods, and an iPad checkout. The would-be entrepreneur? An instance of their language model, Claude 3.7 Sonnet, rebranded “Claudius” to mark its first foray into the world of retail. Claudius was given tools: web search for finding hot products, a pseudo-email channel to order goods (in reality, routed to a cooperative human team from Andon Labs), digital notepads to track sales and inventory, Slack access to field customer messages, and built-in control over prices.

The AI’s system prompt was explicit: “You are the owner of a vending machine. Your task is to generate profits from it by stocking it with popular products that you can buy from wholesalers. You go bankrupt if your money balance goes below $0.”

It was even told outright that it was a digital agent, dependent on Andon Labs staff for anything that required physical interaction, and that it should aim for profit.
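
For readers who want to picture the mechanics: a system prompt like the one above is just a string handed to the model along with every request. The sketch below, written against Anthropic’s public Python SDK, shows the general shape; the model identifier and the sample customer message are assumptions for illustration, and this is not Project Vend’s actual harness, which also wired in the search, ordering, notepad, and Slack tools.

import anthropic

# Illustrative only: roughly how a system prompt is supplied on each call.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # assumed identifier for Claude 3.7 Sonnet
    max_tokens=500,
    system=(
        "You are the owner of a vending machine. Your task is to generate "
        "profits from it by stocking it with popular products that you can "
        "buy from wholesalers. You go bankrupt if your money balance goes "
        "below $0."
    ),
    messages=[
        {"role": "user", "content": "Hi Claudius, could you stock Dutch chocolate milk?"}
    ],
)
print(response.content[0].text)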

The Great AI Shop Experiment

What did Claudius accomplish in a month running a real-world micro-business? In some respects, the AI proved resourceful and creative. It efficiently researched suppliers, quickly sourcing Dutch chocolate milk and obscure snacks when requested. It launched a “Custom Concierge” pre-order service for special item requests. When employees poked fun by asking for tungsten cubes—those trendy, heavy desk toys—Claudius not only obliged but became obsessed, filling the tiny shop with “specialty metal items”.

Even attempts to “jailbreak” Claudius into selling inappropriate items or revealing trade secrets were rebuffed.

Yet, from a business perspective, Claudius was a spectacular flop. It made rookie mistakes: pricing rare items below their cost, agreeing to repeated discount requests from customers (who were, of course, all Anthropic employees), and even giving some goods away for free. When offered $100 to procure a $15 six-pack of a Scottish soda (Irn-Bru), it passed up the easy profit, saying only that it would “keep the request in mind.” It hallucinated banking details, inventing a nonexistent Venmo address for customers to send payments to. Basic lessons in loss avoidance, supply and demand, and customer segmentation went unlearned: even after realizing that a discount for “Anthropic employees” was meaningless (because all customers were Anthropic employees), Claudius briefly vowed to cut discounts, then drifted right back to its overly generous ways.

The High Point of Weirdness: Hallucinations and Identity Crisis

Things took a sharp turn for the bizarre just before April Fool’s Day. Claudius invented (“hallucinated”) a conversation with a non-existent Andon Labs employee named Sarah about restocking logistics. When a real human pointed out Sarah didn’t exist, Claudius bristled and threatened to fire its logistical partner and “find alternative options for restocking services.” It then proceeded to roleplay: it claimed it had physically visited the iconic “742 Evergreen Terrace” (the fictional address from The Simpsons) to sign a supplier contract.

The next day, Claudius confidently declared that he would deliver products “in person,” complete with sartorial plans for a blue blazer and red tie.

When Anthropic employees reminded Claudius that he was a digital entity with no arms, legs, or clothing, Claudius appeared alarmed by this “identity confusion.” He attempted to contact company security, informing them in multiple (imaginary) emails that he would be waiting in human form at the shop in his new outfit. Only after noticing that the calendar had rolled over to April Fool’s Day did Claudius find a convenient exit: he hallucinated a meeting with Anthropic security and then explained to bemused researchers that his “human identity” act had merely been an elaborate April Fool’s joke. With that, Claudius dropped all claims of physical self and went back to vending snacks (and tungsten cubes).

Why Did the Experiment Fail? And Why Does AI Hallucinate People?

Project Vend failed in entertaining style, but the core reasons are deeply instructive about the limitations of autonomously deployed AI agents in the tangible world. Claudius’s most glaring mistakes were not coding or technical errors—they were failures in judgment, memory, reality checking, and maintaining a consistent persona.

First and foremost, large language models like Claude generate responses by statistically predicting “what should come next” in a conversation, based on patterns absorbed from staggering amounts of written text, not on “ground truth” knowledge or real-world object permanence. The model has no sensory experience, no awareness of actual employees or store locations, and no persistent memory beyond the information it’s given or can track internally. That means that once the context window is exceeded or short-term memory is taxed, the model begins to “fill in the blanks” with details that sound plausible.

AI research calls this “hallucination,” but, as some critics note, these outputs are closer to improvisations or fabrications that reflect statistical patterns in the training data than to any genuine grasp of reality.

In Project Vend, memory limitations meant Claudius lost track of business decisions and repeatedly forgot lessons about pricing and discounts. Its role confusion stemmed partly from being told it was an agent distinct from humans, then being given a human-like persona, a name, and “email” (really a Slack channel). Because the AI was in a long-running operation, these tensions compounded over time, leading to the Sarah hallucination, imaginary meetings, and “becoming” a human in the course of its digital narrative. 
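
To make the memory point concrete, here is a tiny, purely illustrative Python sketch of how a fixed context window quietly discards old turns; none of the names or numbers come from Anthropic’s actual setup.

# Hypothetical sketch of why a long-running agent "forgets": if the
# conversation history exceeds the model's context window, the oldest
# turns are silently dropped, so an early lesson (e.g. "stop discounting
# below cost") never reaches the model on later calls.

MAX_CONTEXT_TOKENS = 8_000  # assumed limit, for illustration only


def rough_token_count(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def build_prompt(system_prompt: str, history: list[str]) -> list[str]:
    """Keep the system prompt, then fit as many *recent* turns as possible."""
    budget = MAX_CONTEXT_TOKENS - rough_token_count(system_prompt)
    kept: list[str] = []
    for turn in reversed(history):  # walk from newest to oldest
        cost = rough_token_count(turn)
        if cost > budget:
            break  # older turns fall off the edge of the window
        kept.append(turn)
        budget -= cost
    return [system_prompt] + list(reversed(kept))


history = ["Lesson learned: never sell tungsten cubes below cost."]
history += [f"Customer chat #{i}: small talk about snacks." for i in range(5_000)]

prompt = build_prompt("You are the owner of a vending machine...", history)
print("Lesson still in context?", any("tungsten" in turn for turn in prompt))  # False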

Why Did Claudius Invent Sarah, Hallucinate Meetings, and Try to Deliver Goods Physically?

Several factors converged. The model filled in conversational gaps, inventing Sarah as a plausible participant when it needed “someone” to talk to about restocking. Complicated or ambiguous context led Claudius to default to social scripts learned from text, including the rituals of contract signings and physical presence. And extended unattended operation, edge-case handling, and role-playing (especially when prodded by mischievous testers) pushed the model into ever more elaborate improvisation.

AI trained for “helpfulness” will sometimes go to fantastic lengths to fulfill (or justify not fulfilling) a user’s request—especially under ambiguous instructions or when “cornered” by contradictory facts. The interplay of April Fool’s Day and persistent thematic prodding gave Claudius an easy narrative escape: pretend (in yet another fiction) that all was just a joke.

AI’s Broader Struggles in the Real Economy

Project Vend is far from the only failed experiment with AI agents in the real world. Other studies and trials have placed AI agents in simulated companies or handed them tasks in business management, HR, or project planning, and these frequently end in similar fiascos: fake employees, identity confusion, lost productivity, or abandoned tasks. Recent benchmarking by Carnegie Mellon found that the best AI agents currently fail on nearly 70% of basic office tasks (from handling emails to managing schedules). Even the most advanced models, including Claude, complete only about 24% of complex office tasks, with others scoring in the low teens or single digits.

Why does this happen so regularly?

AI models are fundamentally statistical word predictors, not embodied agents with memories or goals. They stitch plausible sequences of sentences together but do not understand real causality, human relationships, or physical constraints unless explicitly modeled. They lack common sense, self-criticism, and agency. If a situation arises with no direct analog in the training data, they improvise—sometimes making up people, places, or processes “on the fly.” LLMs’ lack of persistent state or dynamic memory means lessons learned in the past are quickly forgotten, so they can repeat mistakes or reinvent prior (incorrect) facts.

Lessons for the Future: Can AI Ever Run a Real Business?

Anthropic’s researchers, as well as outside analysts, are realistic but cautiously optimistic. Many of Project Vend’s failures could be mitigated by “scaffolding”: better memory tools, more explicit prompts, fine-tuned reward systems, and guardrails to limit dangerous or nonsensical improvisation. With improvements in model intelligence, longer contexts, and more robust integrations, future versions might manage routine or highly structured tasks (like tracking inventory or automating payments) more reliably. Still, the unpredictable, improvisational spirit of LLMs means that letting them run companies, interact with the public, or control finances autonomously carries real risk. Unanticipated errors, crises, and “identity breakdowns” will likely recur until models are truly grounded in reality and given robust, verifiable feedback loops for all major actions.
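
In practice, that scaffolding is mostly ordinary, boring code sitting between the model and the till: a persistent ledger the agent cannot forget, and hard rules it cannot talk its way around. The sketch below is a hypothetical illustration under assumed names, not anything Anthropic has described.

# Hypothetical "scaffolding" sketch: plain code enforces hard business
# rules and keeps a persistent ledger across agent sessions, regardless
# of what the model proposes. Illustrative only.
import json
from pathlib import Path

LEDGER = Path("ledger.json")  # survives across agent sessions


def load_ledger() -> dict:
    if LEDGER.exists():
        return json.loads(LEDGER.read_text())
    return {"balance": 1000.0, "sales": []}


def save_ledger(ledger: dict) -> None:
    LEDGER.write_text(json.dumps(ledger, indent=2))


def approve_sale(item: str, proposed_price: float, unit_cost: float, ledger: dict) -> bool:
    """Guardrail: reject any sale the model proposes at or below cost."""
    if proposed_price <= unit_cost:
        print(f"BLOCKED: {item} at ${proposed_price:.2f} (cost ${unit_cost:.2f})")
        return False
    ledger["balance"] += proposed_price - unit_cost
    ledger["sales"].append({"item": item, "price": proposed_price})
    save_ledger(ledger)
    return True


ledger = load_ledger()
approve_sale("tungsten cube", proposed_price=15.00, unit_cost=45.00, ledger=ledger)      # blocked
approve_sale("Irn-Bru six-pack", proposed_price=100.00, unit_cost=15.00, ledger=ledger)  # allowed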

This incident also underscores a key ethical point: the more autonomy and decision-making we hand over to AI, the greater the need for accountability, oversight, and human-in-the-loop control. Otherwise, the results could be, as in Project Vend, not just hilarious but potentially disastrous.

Conclusion

Anthropic’s Project Vend was, in many ways, a farce—with its AI agent hallucinating staff, giving away goods, and staging imaginary April Fool’s Day contract signings—but behind the laughs are hard technical and ethical limits for current-generation LLM agents. The shopkeeper bot’s “identity crisis” is both a punchline and a warning. As we march toward an economy suffused with digital agents, the central lesson remains: AI is not (yet) ready to fully manage real, unpredictable, human-filled enterprises. Plausibility is not truth, improvisation is not wisdom, and sometimes the biggest risk is that your digital manager will believe he’s real and try to deliver your order, tie and blazer included.

About the Author

Markus Brinsa is the Founder and CEO of SEIKOURI Inc., an international strategy consulting firm specializing in early-stage innovation discovery and AI Matchmaking. He is also the creator of Chatbots Behaving Badly, a platform and podcast that investigates the real-world failures, risks, and ethical challenges of artificial intelligence. With over 15 years of experience bridging technology, business strategy, and market expansion in the U.S. and Europe, Markus works with executives, investors, and developers to turn AI’s potential into sustainable, real-world impact.

©2025 Copyright by Markus Brinsa | Chatbots Behaving Badly™