We’ve been on a decade-long cognitive diet. Maps remember the city so we don’t have to. Search remembers the facts so we don’t have to. Now large language models draft our thoughts so we don’t have to. Efficiency feels fantastic—until you notice the muscles you stopped using. A clutch of new studies says the quiet cost of all this “help” isn’t just wrong answers. It’s the part of our minds that likes to wrestle with ambiguity, make judgment calls, and stick with hard problems when the path isn’t obvious. That loss doesn’t show up as a red squiggly line. It shows up as us, a little less curious and a lot more certain.
Start with an inconvenient finding for the cult of quick fixes. A massive multiyear analysis of records from more than 600,000 U.S. undergrads reports that philosophy majors—supposedly the most “impractical” tribe on campus—end up outperforming every other major on verbal and logical-reasoning tests and on measured “habits of mind” like curiosity, open-mindedness, and intellectual rigor. Crucially, after the authors adjusted for who these students already were when they arrived, philosophy still delivered a lift. It’s not just selection bias; it’s training for ambiguity. The core activity—arguing carefully about questions that refuse to resolve—appears to sharpen the very muscles we’re now tempted to outsource to machines.
You’ve probably seen this result filtered through pop-science outlets; the Psychology Today piece you sent gets the study right, main claims and citations included. The headline is punchy, but the evidence underneath it is solid.
Meanwhile, Apple’s research group took a microscope to “reasoning” models and found something unnerving. As problem complexity rises, these models scale up their thinking and hold their accuracy, until they don’t. Past a sharp threshold, accuracy collapses, and the models paradoxically shorten their own chains of thought even when they have token budget left to think more. Humans, stubborn creatures that we are, tend to try harder before we quit; the models save face by giving up elegantly. This isn’t a vibes-based take; it’s a controlled analysis across puzzle families designed to isolate complexity. If your workflow leans on machine “reasoning,” you need to know where the cliff is.
To be fair, the paper sparked rebuttals arguing that design artifacts can exaggerate the cliff. But even the counterpoints concede a pattern of brittleness at higher complexity. Translation: use these tools as accelerators on the easy-to-medium stuff, and add friction when the stakes or the complexity spike.
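If you want to locate that cliff for your own stack rather than take anyone’s word for it, the probe is simple in spirit: pick a puzzle family whose difficulty scales with a single knob, sweep the knob, and score exact answers. Below is a minimal sketch of that idea, assuming a hypothetical `ask_model` wrapper around whatever model call you actually use; it borrows Tower of Hanoi, one of the puzzle families in the Apple paper, but it is an illustration, not their evaluation harness.

```python
# A rough complexity sweep: where does the model's accuracy fall off a cliff?
# `ask_model` is a stand-in for whatever LLM call you actually use; the Tower
# of Hanoi generator mirrors the single-knob puzzle families in the Apple study.
from statistics import mean

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Ground-truth optimal move list for n-disk Tower of Hanoi."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
            + [(src, dst)]                       # move the largest disk
            + hanoi_moves(n - 1, aux, src, dst)) # stack the n-1 disks back on top

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your real model/API call here."""
    raise NotImplementedError

def model_solves(n_disks: int) -> bool:
    prompt = (f"Solve Tower of Hanoi with {n_disks} disks on pegs A, B, C. "
              f"Output only the optimal moves, one 'X->Y' per line.")
    produced = [ln.strip() for ln in ask_model(prompt).splitlines() if "->" in ln]
    expected = [f"{s}->{d}" for s, d in hanoi_moves(n_disks)]
    return produced == expected                  # exact match keeps scoring honest

def find_cliff(max_disks=10, trials=5, floor=0.5):
    """Smallest disk count where accuracy drops below `floor`, else None."""
    for n in range(1, max_disks + 1):
        acc = mean(model_solves(n) for _ in range(trials))
        print(f"{n} disks: accuracy {acc:.2f}")
        if acc < floor:
            return n   # this is where your workflow needs human friction
    return None
```

Wherever `find_cliff` lands for your tasks and your model is where “add friction” stops being a slogan and becomes a line in your process.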
If you’ve followed a turn-by-turn GPS for months and then struggled to cross town without it, you already understand cognitive offloading. We’ve known for years that when people expect information to be available externally, they remember where to find it, not the thing itself—the famous Google-effect series. With generative AI, offloading graduates from memory to analysis and voice. A 2025 mixed-methods study across 666 participants links heavier AI tool use to lower critical-thinking scores, with offloading mediating the decline. The relationship isn’t linear; past a point, more “help” just flattens you.
That viral “68.9% laziness” stat you’ve seen floating around? It’s a real figure in a 2023 Humanities & Social Sciences Communications paper that surveyed students in Pakistan and China about AI in education. Useful as a signal, limited as a universal claim. It’s self-report in a specific population; treat it like a directional warning light, not a global speed limit.
There’s a nastier cousin to offloading: automation bias, the tendency to trust a system because it looks authoritative. In radiology experiments, readers shown bad AI suggestions became more likely to see aneurysms that weren’t there. In a randomized clinical trial, physicians given LLM assistance showed measurable susceptibility to bad recommendations, even when use was voluntary and the interface looked responsible. The average picture is nuanced: other randomized trials show LLMs can match or beat physicians on vignette accuracy when they fly solo, but the human-AI combo doesn’t magically exceed human judgment without careful guardrails. The bias is real, sticky, and it scales.
Education researchers saw the same pattern play out in programming labs. Give novices an assistant that writes plausible code, and many will finish faster while learning less about why the code works. In a 2024 study observing students with eye-tracking and think-aloud protocols, stronger students treated GenAI like an accelerator, rejecting unhelpful hints and debugging strategically, while struggling students leaned harder on the suggestions, piled up metacognitive missteps, and walked away with an illusion of competence. It’s not a reason to ban the tools; it’s a reason to design assignments that surface reasoning, not just results.
Can today’s models originate scientific breakthroughs? A 2025 Scientific Reports study put GenAI into a hypothesis-driven discovery task and found what many practitioners feel in their bones: the systems are capable, yes, but mostly incremental. They polish language and explore the nearby neighborhood; they don’t have the “what if?” reflex that pushes people into weird corners and paradigm shifts. The broader creativity literature is more balanced—meta-analyses suggest humans collaborating with GenAI often outperform humans alone—but there’s a catch: idea diversity tends to shrink. Homogenization is the tax you pay for convenience. In brand work, that tax shows up as a suspiciously familiar voice across a dozen rivals.
Zoom out beyond the lab. Human-factors research has warned for years that decision support can sap vigilance; now we’re piping not just facts but judgments into our work streams. Pair that with models that sound confident near their competence cliffs and you get a culture that agrees with smooth, wrong answers at scale. If you’re a newsroom, a hospital, or a media agency, the risk isn’t only factual error; it’s conceptual sameness and atrophied skepticism. The more upstream thinking you delegate—framing the question, picking the angle, choosing the metric—the faster your distinct voice converges on the mean of the training data. At that point you’re not “data-driven.” You’re model-shaped.
The way out isn’t abstinence. It’s deliberate friction. Start projects with a human paragraph that states the problem as you see it and what would count as a good answer. Use the model to interrogate the paragraph, not replace it. When the model proposes a solution, force a round where you argue the opposite position. Treat sources as claims to be traced, not decorations to be glued on. In team settings, make the review meeting about how the conclusion was reached, not just whether the prose sings. In classrooms and codebases, grade the path, not only the output. None of this is Luddism. It’s the practice of owning judgment.
If you need a north star, steal one from the philosophers: measure your success by the habits of mind you strengthen. Curiosity is a habit. Rigor is a habit. Open-mindedness is a habit. They don’t survive on a diet of autocomplete.
AI is a ruthless editor, a patient explainer, and a sometimes-brilliant devil’s advocate. It is not a curiosity engine. It will not love your problem enough to sit with it when the path goes dark. That’s your job. Use the machine to amplify effortful thought, not replace it. Make it your spotter, not your lifter. The reward isn’t just better work; it’s keeping the part of your mind that enjoys the heavy lifting.