Sometimes the simplest challenges expose the largest gaps. I decided to test Claude—Anthropic’s much-lauded AI assistant—with the kind of task even a novice could execute: parse a sitemap.xml and return its 52 URLs. No clever logic, no novel reasoning—just matching <loc> tags and outputting the contents.
Sitemaps are among the most basic XML structures: rigid, predictable, and trivial for both humans and standard XML parsers. A casual Notepad session or a tiny Python script using xml.etree.ElementTree would complete the task in seconds. It felt almost unfair to subject a multimillion-dollar AI to such a mundanely simple job.
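For reference, here is roughly what that tiny script looks like. The filename and the standard sitemap namespace are assumptions on my part, since the exact file from my test isn't reproduced here:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace used by virtually every sitemap.xml
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("sitemap.xml")  # assumed local filename
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

for url in urls:
    print(url)
print(f"{len(urls)} URLs found")
```

A dozen lines, zero reasoning required.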
First, I asked Claude to fetch the sitemap via its URL. Instead of the links, I received an essay on sitemap strategy and SEO importance. It wasn’t directly helpful. Pushed again, Claude finally “confessed” it couldn’t read the file from a link. An honest admission would have sufficed—but it took too long to arrive.
So I opted for the more direct route: paste the entire XML into the chat. That should guarantee success, right? Wrong. Claude proceeded to “analyze,” then “think,” then “analyze” again… endlessly, with no output. Just status indicators cycling in a loop rather than any concrete extraction. The URLs never materialized.
It was like watching a student stare at the simplest question on an exam, overcomplicate it, and never put pencil to paper.
What made such a simple assignment trip up a sophisticated AI? The answer lies in the fundamental difference between structured parsing and generative text modeling.
Language models like Claude are not trained to execute tasks; they are trained to simulate human-like responses through pattern recognition over vast text corpora. Extracting exact data from structured formats is less a strength than a stumbling block. As one expert puts it:
“LLMs are trained to interpret language, not data.”
That means reading XML and reproducing its URLs verbatim requires deterministic fidelity, not free-form reasoning. LLMs inherently favor plausible phrasing over literal accuracy, which leads them to wander rather than extract. The more rigid and constrained the format, the more likely the AI is to falter.
My Claude experiment isn’t unique. Researchers have shown that LLMs frequently underperform on tasks involving structured outputs—especially when complexity increases. Consider the StrucText-Eval benchmark: it measures LLMs’ ability to reason with structure-rich text. On its harder subset, LLMs scored only ~45.8%, while humans hit ~92.6%.
Similarly, extracting structured data—like JSON fields or invoice details—is a known weak point:
“LLMs struggle with structured data extraction from unstructured documents … consistency and accuracy trump creativity.”
Another study found that forcing LLMs into strict format constraints like XML or JSON can actually degrade their reasoning performance.
Naturally, I had to see how Claude’s main rival, GPT, would handle the same challenge. GPT at least attempted extraction, producing a partial list of URLs before formatting errors crept in. It didn’t freeze; it simply garbled the task. Which is arguably even worse—because now you have an output that looks correct but isn’t.
That raises a serious issue for professionals: do you prefer a chatbot that admits it can’t perform, or one that confidently hands you an answer sprinkled with mistakes? In high-stakes use cases like financial reporting or medical record parsing, a half-right answer is far more dangerous than none at all.
Google’s Gemini models add another flavor. They’ve been touted for their multimodality and “reasoning” strengths, yet early testing has shown the same weakness when it comes to rigid extraction. Gemini too tends to output commentary alongside data, or reformat it creatively—precisely what you don’t want from a parser.
So, Claude’s failure isn’t an isolated embarrassment. It’s part of a broader pattern: LLMs excel at fluid, high-context conversation, but wobble when precision, structure, and reliability are non-negotiable.
Claude’s sitemap meltdown reminded me of several real-world failures where structured data tasks humbled the supposed brilliance of AI.
New York City’s MyCity Chatbot was supposed to guide small businesses through compliance rules. Instead, it contradicted city regulations and, in some cases, advised businesses that unlawful practices were allowed. Its failure wasn’t because it lacked data; it couldn’t reliably parse and apply structured legal frameworks.
Virgin Money’s AI Moderation Tool flagged thousands of harmless customer messages as “offensive” because it treated structured trigger words out of context. Again, rigid parsing was the weak point.
Even Amazon’s Alexa, a household name in AI, frequently fails when asked for precise information outside its narrow structured database. Users report bizarre, verbose responses where a simple list or yes/no would do.
Each case echoes the same lesson: chatbots stumble most when tasks shift from language fluidity to structural accuracy.
Claude, like many LLMs, can be nudged into producing structured output, especially with prompt-engineering strategies such as tagging, templates, or schema definitions. Anthropic’s own documentation even recommends using XML tags or explicit output schemas to guide Claude’s responses.
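To make that concrete, here is a minimal sketch of a schema-guided prompt; the tag names and the instruction wording are my own illustrative choices, not an official template:

```python
# Hypothetical prompt template; the <sitemap> and <urls> tag names are
# illustrative choices, not an official Anthropic schema.
with open("sitemap.xml", encoding="utf-8") as f:  # assumed local copy
    sitemap_xml = f.read()

prompt = f"""Extract every URL from the sitemap below.

<sitemap>
{sitemap_xml}
</sitemap>

Respond with a single <urls> block containing one <url> element per location,
in document order, and nothing else: no commentary, no summary."""
```

The idea is to give the model an unambiguous container for the input and an equally unambiguous shape for the output, leaving less room for commentary to creep in.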
Nevertheless, these techniques aren’t foolproof. Even when given clear XML tags, Claude can echo the scaffolding instead of filling it, misplace tags, or lose the thread entirely when prompts grow large.
The real fix lies in tool integration—letting Claude delegate structured tasks to parsers or databases. For example, Anthropic’s API supports connecting Claude to external tools that can handle structured workflows. More broadly, methods like Retrieval-Augmented Generation (RAG) can ground chat outputs in real data. Yet these augmentations mostly address hallucinations—they don’t guarantee structural fidelity.
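As a sketch of what that delegation looks like in practice: the deterministic work lives in ordinary code, and the model merely decides when to invoke it. The function and the tool definition below are illustrative assumptions that follow the general shape of tool-calling APIs, not a drop-in integration:

```python
import xml.etree.ElementTree as ET

def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Deterministic parser a chatbot could delegate to instead of 'reasoning' over raw XML."""
    root = ET.fromstring(sitemap_xml)
    # Accept <loc> elements with or without a namespace prefix.
    return [el.text.strip() for el in root.iter() if el.tag == "loc" or el.tag.endswith("}loc")]

# Illustrative tool definition in the general shape of tool-calling APIs;
# the actual registration and message plumbing would follow the vendor's docs.
SITEMAP_TOOL = {
    "name": "extract_sitemap_urls",
    "description": "Parse a sitemap.xml document and return the URLs in its <loc> elements.",
    "input_schema": {
        "type": "object",
        "properties": {"sitemap_xml": {"type": "string"}},
        "required": ["sitemap_xml"],
    },
}
```

In that setup the model never parses the XML itself; it simply routes the document to a parser that cannot hallucinate.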
Again, the irony: the task Claude failed is something any reasonably tech-savvy human would handle in under a minute. Copy-paste. Write a quick regex. Load it in a browser or use a CLI tool like xmllint. Done.
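For completeness, the quick-regex version. Regexes are not a real XML parser, but for a flat, well-formed sitemap like this one the shortcut is good enough (the filename is again assumed):

```python
import re

with open("sitemap.xml", encoding="utf-8") as f:  # assumed local copy
    sitemap = f.read()

# Crude but effective for a flat sitemap: capture whatever sits between <loc> tags.
urls = re.findall(r"<loc>\s*(.*?)\s*</loc>", sitemap)

print("\n".join(urls))
print(f"{len(urls)} URLs")
```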
Language models generate content based on learned patterns. They excel at tone, synthesis, and conversation—but not data accuracy. They’re more “stochastic parrots” than analytic engines.
This is a reminder: LLMs should augment human workflows—not replace fundamental logic tools. For structured tasks, humans (or a proper parser) are still the safest hands.
My Claude experiment became a microcosm of a broader truth: LLMs shine at storytelling, summarizing, translating—but trip on anything that looks like code, table rows, or XML blocks.
Going deeper, these limitations have systemic consequences. In enterprise settings—extracting dates from contracts, parsing invoices, mapping logs—the margin for error is zero. Consistency over creativity matters. Without integration into deterministic systems, generative models fall short.
Solutions do exist: hybrid systems, RAG pipelines, schema-based prompting, API tool hooks, human-in-the-loop checks. But they require intentional architecture—not just expecting a chat at your keyboard to parse with precision.
My humble test wasn’t about disproving Claude’s intelligence. It was about reality-checking what “AI can do.” When the task shifts from plausible prose to rigid extraction, LLMs can reveal their brittle underside.
So, ironically, a human with Notepad—or a tiny XML parser—could have solved the job in minutes. Meanwhile, the state-of-the-art AI assistant got looped, stalled, and ultimately defeated by a task that was never meant to be hard.
And that’s the lesson here: even the brightest AI needs the right job. When the test is structural, humans still reign supreme.