Chatbots Behaving Badly™

Hi, I’m Claude, the All-Powerful Chatbot. A Third Grader Just Beat Me.

By Markus Brinsa  |  August 20, 2025


The Setup: A Test That Shouldn’t Fail

Sometimes the simplest challenges expose the largest gaps. I decided to test Claude—Anthropic’s much-lauded AI assistant—with the kind of task even a novice could execute: parse a sitemap.xml and return its 52 URLs. No clever logic, no novel reasoning—just matching <loc> tags and outputting the contents.

Sitemaps are among the most basic XML structures: rigid, predictable, and trivial for both humans and standard XML parsers. A casual Notepad session or a tiny Python script using xml.etree.ElementTree would complete the task in seconds. It felt almost unfair to subject a multimillion-dollar AI to such a mundanely simple job.
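For the record, that tiny script is only a few lines. Here is a minimal sketch, assuming the standard sitemap namespace and a local copy of the file saved as sitemap.xml:

    import xml.etree.ElementTree as ET

    # Sitemaps declare this namespace; findall() needs it to match <url>/<loc>.
    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def extract_urls(path):
        root = ET.parse(path).getroot()
        return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

    urls = extract_urls("sitemap.xml")
    print(len(urls))          # 52, if the file is the one described above
    print("\n".join(urls))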

What Actually Happened

First, I asked Claude to fetch the sitemap via its URL. Instead of the links, I received an essay on sitemap strategy and SEO importance. It wasn’t directly helpful. Pushed again, Claude finally “confessed” it couldn’t read the file from a link. An honest admission would have sufficed—but it took too long to arrive.

So I opted for the more direct route: paste the entire XML into the chat. That should guarantee success, right? Wrong. Claude proceeded to “analyze,” then “think,” and again “analyze”… endlessly, with no output. Just swirling algorithms stuck in loops rather than concrete extraction. The URLs never materialized.

It was like watching a student stare at the simplest question on an exam, overcomplicate it, and never put pencil to paper.

Parsing vs. Generating: A Fundamental Mismatch

What made such a simple assignment trip up a sophisticated AI? The answer lies in the fundamental difference between structured parsing and generative text modeling.

Language models like Claude aren't trained to execute tasks; they're trained to simulate human-like responses based on patterns learned from vast text corpora. Extracting exact data from structured formats is less a strength than a stumbling block. As one expert puts it:

“LLMs are trained to interpret language, not data.”

Reading XML and returning its URLs verbatim requires deterministic fidelity, not free-form reasoning. LLMs inherently favor plausible phrasing over literal accuracy, which leads them to wander rather than extract. The more rigid and constrained the format, the more likely the model is to falter.

Benchmarks Confirm the Gap

My Claude experiment isn’t unique. Researchers have shown that LLMs frequently underperform on tasks involving structured outputs—especially when complexity increases. Consider the StrucText-Eval benchmark: it measures LLMs’ ability to reason with structure-rich text. On its harder subset, LLMs scored only ~45.8%, while humans hit ~92.6%.

Similarly, extracting structured data—like JSON fields or invoice details—is a known weak point:

 “LLMs struggle with structured data extraction from unstructured documents … consistency and accuracy trump creativity.”

Another study found that forcing LLMs into strict format constraints like XML or JSON can actually degrade their reasoning performance.

GPT vs. Claude: A Comparison

Naturally, I had to see how Claude’s main rival, GPT, would handle the same challenge. GPT at least attempted extraction, producing a partial list of URLs before formatting errors crept in. It didn’t freeze; it simply garbled the task. Which is arguably even worse—because now you have an output that looks correct but isn’t.

That raises a serious issue for professionals: do you prefer a chatbot that admits it can’t perform, or one that confidently hands you an answer sprinkled with mistakes? In high-stakes use cases like financial reporting or medical record parsing, a half-right answer is far more dangerous than none at all.

Google’s Gemini models add another flavor. They’ve been touted for their multimodality and “reasoning” strengths, yet early testing has shown the same weakness when it comes to rigid extraction. Gemini too tends to output commentary alongside data, or reformat it creatively—precisely what you don’t want from a parser.

So, Claude’s failure isn’t an isolated embarrassment. It’s part of a broader pattern: LLMs excel at fluid, high-context conversation, but wobble when precision, structure, and reliability are non-negotiable.

Sidebar: Other AI Stumbles in the Wild

Claude’s sitemap meltdown reminded me of several real-world failures where structured data tasks humbled the supposed brilliance of AI.

Each of those cases echoes the same lesson: chatbots stumble most when the task shifts from linguistic fluency to structural accuracy.

Why Claude Isn’t a Parser

Claude, like many LLMs, can be nudged into producing structured output, especially with prompt-engineering strategies such as tagging, templates, or schema definitions. Anthropic itself recommends using XML tags (or an explicit JSON schema) to guide Claude's output.
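In practice, that guidance looks something like the sketch below; the tag names and instructions are illustrative choices of mine, not an official Anthropic template:

    # Wrap the input in XML tags and pin the expected output format.
    sitemap_xml = open("sitemap.xml").read()

    prompt = f"""Extract every URL from the sitemap below.

    <sitemap>
    {sitemap_xml}
    </sitemap>

    Return only the URLs, one per line, inside a single <urls> tag. No commentary."""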

Nevertheless, these techniques aren't foolproof. Even when given clear XML tags, Claude can drift from the format, misplace tags, or stall when prompts grow large.

The real fix lies in tool integration—letting Claude delegate structured tasks to parsers or databases. For example, Anthropic’s API supports connecting Claude to external tools that can handle structured workflows. More broadly, methods like Retrieval-Augmented Generation (RAG) can ground chat outputs in real data. Yet these augmentations mostly address hallucinations—they don’t guarantee structural fidelity.
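To make that concrete, here is a sketch of what delegation could look like with Anthropic's tool-use API; the tool name, schema, and model string are my own illustrative assumptions, not a reference implementation:

    import anthropic
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    def parse_sitemap(sitemap_xml):
        # Deterministic extraction: no language model involved.
        root = ET.fromstring(sitemap_xml)
        return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

    # Hypothetical tool definition following the tool-use schema.
    sitemap_tool = {
        "name": "parse_sitemap",
        "description": "Extract every <loc> URL from a sitemap.xml document.",
        "input_schema": {
            "type": "object",
            "properties": {"sitemap_xml": {"type": "string"}},
            "required": ["sitemap_xml"],
        },
    }

    client = anthropic.Anthropic()  # API key from the ANTHROPIC_API_KEY environment variable
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # model name is illustrative
        max_tokens=2048,
        tools=[sitemap_tool],
        messages=[{
            "role": "user",
            "content": "List every URL in this sitemap:\n" + open("sitemap.xml").read(),
        }],
    )

    # If the model chooses to delegate, run the real parser on its behalf.
    for block in response.content:
        if block.type == "tool_use" and block.name == "parse_sitemap":
            print("\n".join(parse_sitemap(block.input["sitemap_xml"])))

The point of that design is that the URLs come out of ElementTree, not out of the model's token stream; Claude only decides that the tool should be called.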

The Human Advantage

Again, the irony: the task Claude failed is something any reasonably tech-savvy human would handle in under a minute. Copy-paste. Write a quick regex. Load it in a browser or use a CLI tool like xmllint. Done.
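For completeness, here is what that under-a-minute route can look like; the regex assumes the usual one-URL-per-<loc> layout, and the xmllint line is just one shell alternative:

    import re

    xml_text = open("sitemap.xml").read()
    # Quick and dirty: fine for a well-formed sitemap, not a general XML parser.
    urls = re.findall(r"<loc>\s*(.*?)\s*</loc>", xml_text)
    print("\n".join(urls))

    # Shell alternative (output formatting varies with the libxml2 version):
    #   xmllint --xpath "//*[local-name()='loc']/text()" sitemap.xml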

Language models generate content based on learned patterns. They excel at tone, synthesis, and conversation—but not data accuracy. They’re more “stochastic parrots” than analytic engines.

This is a reminder: LLMs should augment human workflows—not replace fundamental logic tools. For structured tasks, humans (or a proper parser) are still the safest hands.

What This Means Going Forward

My Claude experiment became a microcosm of a broader truth: LLMs shine at storytelling, summarizing, translating—but trip on anything that looks like code, table rows, or XML blocks.

Going deeper, these limitations have systemic consequences. In enterprise settings—extracting dates from contracts, parsing invoices, mapping logs—the margin for error is zero. Consistency over creativity matters. Without integration into deterministic systems, generative models fall short.

Solutions do exist: hybrid systems, RAG pipelines, schema-based prompting, API tool hooks, human-in-the-loop checks. But they require intentional architecture, not the expectation that a chat window will parse with precision.
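Even a lightweight guardrail helps. Here is a sketch of the kind of check a hybrid pipeline might run before trusting a model's extraction; the expected count and the validation rules are assumptions specific to this particular sitemap:

    from urllib.parse import urlparse

    EXPECTED_COUNT = 52  # known from the sitemap itself, not taken on the model's word

    def validate_extraction(urls):
        """Return a list of problems; an empty list means the output can be used."""
        problems = []
        if len(urls) != EXPECTED_COUNT:
            problems.append(f"expected {EXPECTED_COUNT} URLs, got {len(urls)}")
        if len(set(urls)) != len(urls):
            problems.append("duplicate URLs in output")
        malformed = [u for u in urls if urlparse(u).scheme not in ("http", "https")]
        if malformed:
            problems.append(f"{len(malformed)} entries are not well-formed http(s) URLs")
        return problems

    # Anything the chatbot returns passes through this gate before it is used;
    # a non-empty result routes the job to a human or to the deterministic parser.
    print(validate_extraction(["https://example.com/"]))  # -> ['expected 52 URLs, got 1']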

Conclusion: Humans Still Hold the Edge

My humble test wasn’t about disproving Claude’s intelligence. It was about reality-checking what “AI can do.” When the task shifts from plausible prose to rigid extraction, LLMs can reveal their brittle underside.

So, ironically, a human with Notepad, or a tiny XML parser, could have finished the job in minutes. Meanwhile, the state-of-the-art AI assistant looped, stalled, and was ultimately defeated by a task that was never meant to be hard.

And that’s the lesson here: even the brightest AI needs the right job. When the test is structural, humans still reign supreme.

About the Author

Markus Brinsa is the Founder and CEO of SEIKOURI Inc., an international strategy consulting firm specializing in early-stage innovation discovery and AI Matchmaking. He is also the creator of Chatbots Behaving Badly, a platform and podcast that investigates the real-world failures, risks, and ethical challenges of artificial intelligence. With over 15 years of experience bridging technology, business strategy, and market expansion in the U.S. and Europe, Markus works with executives, investors, and developers to turn AI’s potential into sustainable, real-world impact.

©2025 Copyright by Markus Brinsa | Chatbots Behaving Badly™