Synthesizing 2024–2026 research on cognitive costs, automation bias, and lessons from aviation, navigation, and beyond
The most provocative recent study comes from the MIT Media Lab. Kosmyna et al. (2025) used EEG monitoring to track 54 participants writing essays over four months in three conditions: ChatGPT-assisted, Google Search-assisted, and brain-only.[2] The results were dramatic. ChatGPT users wrote 60% faster, but their relevant cognitive load fell by 32%, brain connectivity in the alpha and theta bands was nearly halved, and 83% could not recall a passage they had just written. Over time, these users shifted from asking structural questions to simply copy-pasting entire AI-generated essays. The researchers coined the term "cognitive debt": cumulative neurological and behavioral disengagement that builds with sustained AI reliance. This is a preprint with a small sample, and a formal critique has raised methodological concerns,[3] but the findings align with a broader body of converging evidence.
A peer-reviewed mixed-methods study by Gerlich (2025) in the journal Societies surveyed 666 UK participants and conducted 50 in-depth interviews, finding a strong negative correlation (r = −0.68) between AI tool usage frequency and critical thinking ability, mediated by cognitive offloading.[4] Crucially for middle school educators, younger participants (ages 17–25) showed the highest AI dependence and lowest critical thinking scores, while higher educational attainment served as a protective buffer. A complementary study by Lee et al. (2025) at Carnegie Mellon and Microsoft Research, presented at CHI '25, surveyed 319 knowledge workers and found that higher confidence in AI was associated with less critical thinking, while higher self-confidence was associated with more.[5] The implication is clear: trust in AI substitutes for trust in oneself.
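As a refresher on what a correlation of r = −0.68 means, here is a minimal sketch of the Pearson coefficient. The data below are invented for illustration only (they are not Gerlich's data); the computation itself is the standard formula:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: weekly AI-tool sessions vs. a critical-thinking score.
# A strong negative r means heavier use goes with lower scores.
ai_use = [2, 5, 8, 12, 15, 20, 25, 30]
ct_score = [85, 80, 78, 70, 65, 60, 55, 50]
print(round(pearson_r(ai_use, ct_score), 2))  # strongly negative
```

A value of −0.68 is weaker than this toy example but still counts as a strong effect in survey research, where many other factors also shape the outcome.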
A cross-country experimental study by Bonicalzi et al. (2025) with 150 participants confirmed that unguided AI use fosters cognitive offloading without improving reasoning quality—but structured prompting significantly reduced offloading and enhanced reflective engagement.[6] This points directly toward pedagogical solutions.
The single most rigorous study in this space is Bastani et al. (2025), published in PNAS after originating as a 2024 preprint.[1] This randomized controlled trial involved approximately 1,000 high school math students (grades 9–11) in Turkey, divided into three groups: standard ChatGPT access ("GPT Base"), a teacher-designed safeguarded tutor ("GPT Tutor"), and a no-AI control. During practice sessions, GPT Base students solved 48% more problems correctly, and GPT Tutor students solved 127% more. But when AI was removed for exams, GPT Base students scored 17% worse than students who never had AI access. The GPT Tutor group performed about the same as the control—guardrails prevented the harm but did not produce gains.[7] Students used ChatGPT as a "crutch," getting answers rather than learning concepts. When ChatGPT gave wrong answers (which happened roughly half the time on math problems), students uncritically accepted them.
This crutch effect appears across domains. Anthropic's Shen & Tamkin (2026) ran an RCT with 52 junior software engineers learning a new Python library. AI-assisted developers scored 17% lower on comprehension quizzes—nearly two letter grades worse—with debugging questions showing the largest gap.[8] Developers who delegated code generation to AI scored below 40%; those who used AI for conceptual inquiry scored 65% or higher.[9] A University of Maribor study (Jošt et al., 2024) of 32 students learning React found significant negative correlations between using AI for code generation and final grades—but using AI for explanations showed no negative impact.
In medical practice, Budzyń et al. (2025) published in The Lancet Gastroenterology & Hepatology the first major clinical deskilling study: endoscopists who routinely used AI for colonoscopy polyp detection performed measurably worse when AI access was removed, with adenoma detection rates declining.[10] This finding from 1,443 patients across multiple European centers demonstrates that professional deskilling from AI is not hypothetical—it is already happening.
The creativity domain tells a similar story. Doshi & Hauser (2024) in Science Advances found AI-assisted short stories were individually more creative but significantly more similar to each other, reducing collective novelty.[11] A follow-up study in Technology in Society (2025) identified a "creative scar": creativity dropped when AI was withdrawn, and content homogeneity continued climbing even months later.[12]
The tendency to follow automated recommendations uncritically—automation bias—is one of the best-documented phenomena in human-computer interaction, and it has intensified in the generative AI era. The foundational review by Parasuraman & Manzey (2010) established that automation bias produces two types of errors: omission errors (failing to notice problems the system doesn't flag) and commission errors (following incorrect automated recommendations).[13] Their key finding was sobering: automation bias cannot be reliably prevented by training or instructions alone.
A 2025 systematic review in AI & Society analyzing 35 studies with nearly 20,000 total participants found that trust is the central driver of overreliance, compounded by workload, time pressure, and task complexity.[14] A Dunning-Kruger dynamic is at work: people with limited AI knowledge overestimate their understanding, making them more susceptible. A 2025 cognitive reflection test study demonstrated this vividly—participants given faulty AI answers performed less than half as well as those with no AI support, and critically, AI literacy did not significantly prevent the bias. Warning nudges helped but could not fully restore performance.
Horowitz & Kahn (2024) at Georgetown ran a preregistered experiment across 9,000 adults in nine countries examining automation bias in national security decision-making.[15] They found a curved relationship: those with the least AI experience were slightly averse to algorithms, but automation bias kicked in quickly at moderate knowledge levels before leveling off among true experts. The Georgetown CSET report (2024) showed through case studies—Tesla autopilot, Boeing/Airbus aviation, and military air defense—that automation bias causes "otherwise knowledgeable users to make crucial and even obvious errors."[16]
For middle school students, the concept can be distilled simply: when a computer gives you an answer, your brain's natural tendency is to accept it—even when your own thinking would have caught the mistake. This is not a character flaw; it is a documented cognitive pattern that affects experts and novices alike.
The abstract concept of AI deskilling becomes vivid through well-documented historical examples of technology eroding human skills. These analogies are essential for a middle school curriculum because they make the invisible visible.
The most scientifically rigorous analogy comes from neuroscience. Eleanor Maguire's landmark studies at University College London (2000, 2006, 2011) demonstrated that London taxi drivers—who must memorize 25,000 streets over 3–4 years of training called "The Knowledge"—have measurably larger posterior hippocampi than control subjects.[17] A longitudinal study tracked aspiring drivers before and after training: successful trainees showed hippocampal growth; those who failed showed no changes. This was causal proof that intensive navigation practice physically grows the brain.
The flip side is equally documented. Dahmani & Bohbot (2020) at McGill University found that greater GPS use predicted worse spatial memory when navigating without GPS, and a three-year longitudinal component showed GPS use predicted steeper spatial memory decline over time.[18] A 2021 review articulated "a modern paradox: navigation apps allow us to explore more places, while making us worse explorers."[19] The Inuit hunters of the Arctic, whose extraordinary navigation skills were passed down for generations, are losing those abilities to GPS—a vivid cross-cultural example.
Why it works for middle schoolers: Nearly every student uses GPS apps. Ask "Could you navigate to your friend's house without your phone?" and the discussion begins naturally.
On June 1, 2009, Air France Flight 447 crashed into the Atlantic Ocean, killing all 228 people aboard.[20] Ice crystals disabled the airspeed sensors, causing the autopilot to disconnect. The pilots—suddenly required to fly manually at 35,000 feet over the ocean at night—failed to recognize a basic aerodynamic stall. The stall warning sounded 75 times during the 3.5-minute descent. First Officer Bonin held the nose up—exactly the wrong action—throughout.
The aviation industry had a term for this problem before it became fatal. In 1997, American Airlines Captain Warren Vanderburgh coined "Children of the Magenta" to describe pilots so dependent on the magenta-colored autopilot path line on their screens that they had lost the ability to simply fly the airplane.[21] The FAA subsequently issued multiple safety alerts (2013, 2017) warning that "continuous use of autoflight systems could lead to degradation of the pilot's ability to quickly recover the aircraft." A NASA study by Casner et al. (2014) tested 16 airline pilots and found that while physical stick-and-rudder skills remained intact, cognitive skills—navigating, maintaining awareness, diagnosing problems—were significantly degraded.[22] Casner's insight is the key bridge to AI: "We might be less concerned about things pilots do by hand and more concerned about those things they do by mind."
Why it works for middle schoolers: The dramatic narrative is gripping, and the phrase "Children of the Magenta" is memorable. The image of a pilot in an emergency relying on the computer rather than simply flying the airplane makes the concept visceral.
The calculator debate provides useful nuance. Meta-analyses by Hembree & Dessart (1986) and Ellington (2003) found that calculator use with good instruction generally does not hinder math skill development—but a study indexed in PubMed Central documented college students accepting obviously wrong calculator answers without questioning them, demonstrating the trust dynamic that matters.[23] The issue is not the tool itself but the uncritical relationship to it.
The "Google Effect" study by Sparrow, Liu & Wegner (2011) in Science demonstrated that people don't remember information they believe they can look up later—instead remembering where to find it.[24] The brain treats the internet as a "transactive memory partner." Ask any student: "How many phone numbers do you know by heart?" The answer illustrates the effect instantly.
Spell-checker dependency and handwriting decline provide additional relatable entry points. Research shows prolonged reliance on spell-checkers weakens long-term spelling proficiency, and Mueller & Oppenheimer (2014) found students who take notes by hand outperform laptop typists on conceptual questions because handwriting forces deeper processing.
Understanding the deskilling research requires recognizing that AI capabilities have transformed since most studies began. GPT-5 (2025) shows 80% fewer factual errors than GPT-4. Claude Opus 4.5 achieves 92% success on complex multi-file coding tasks. Gemini 3 maintains coherence through 10–15 step reasoning chains. METR research concludes the length of tasks AI can perform is doubling every seven months—2024's best models handled 30-minute human tasks; late 2025 models handle multi-hour work.
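The METR doubling claim is simple exponential growth, and it can be sanity-checked against the figures in the text. A back-of-envelope sketch (the 30-minute 2024 baseline comes from the text; the 21-month horizon is an illustrative choice):

```python
def task_horizon_minutes(baseline_minutes, months_elapsed, doubling_months=7):
    """Extrapolate the AI task-length horizon under a fixed doubling period."""
    return baseline_minutes * 2 ** (months_elapsed / doubling_months)

# 30-minute tasks in 2024; three doublings (~21 months) later:
print(task_horizon_minutes(30, 21))  # 240.0 minutes, i.e. four hours
```

At roughly 18 months the same formula gives about three hours, consistent with the claim that late-2025 models handle multi-hour work.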
Adoption has surged correspondingly. Among U.S. adults ages 18–64, 54.6% have used generative AI (St. Louis Fed, August 2025), up from 44.6% a year earlier, outpacing PC and internet adoption at equivalent points after launch.[25] Three-quarters of students aged 16+ in OECD countries report using generative AI tools.[26] The AI agents market reached $7.6 billion in 2025 and is projected to hit $47 billion by 2030.
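The market projection implies a compound annual growth rate, which is worth making explicit. A quick check (the dollar figures come from the text; the five-year 2025-to-2030 span is inferred):

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by start and end values."""
    return (end_value / start_value) ** (1 / years) - 1

# $7.6B in 2025 growing to a projected $47B in 2030
print(f"{cagr(7.6, 47.0, 5):.1%}")  # roughly 44% per year
```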
This acceleration matters because earlier studies examining limited AI tools may understate current risks. When AI could only generate rough text or simple code snippets, the human remained essential for integration and refinement. Now that AI can produce complete applications, polished essays, and professional-grade analyses end-to-end, the scope of potential cognitive offloading—and therefore potential deskilling—is far wider than even 2023 research anticipated.
Policy frameworks are scrambling to keep pace. UNESCO released AI competency frameworks for both students and teachers in September 2024[27]—but found only 11 countries have developed K-12 AI curricula. The OECD-European Commission AILit Framework (May 2025 draft) defines AI literacy standards across four domains: Engage with AI, Create with AI, Manage AI, and Design AI.[28] Multiple U.S. states have developed K-12 AI policies, and most emphasize that AI literacy must include ethics, critical thinking, and social responsibility—not just technical proficiency.[29]
The deskilling narrative, while well-supported, is only half the story. The Harvard-BCG study (Dell'Acqua et al., 2023/2025) with 758 consultants found AI users completed 12.2% more tasks, worked 25.1% faster, and produced outputs rated 40% higher in quality.[30] Critically, AI acted as a "skill leveler": the lowest-performing consultants improved by 43%. A 2025 RCT published in Nature Scientific Reports (Kestin & Miller) found an AI tutor designed around pedagogical best practices—active learning, scaffolding, growth mindset promotion—outperformed in-class active learning for physics students.[31] A systematic review of AI tutoring in K-12 education analyzing 28 studies with 4,597 students found effects were generally positive when systems were well-aligned with instructional theory.[32]
MIT Sloan's EPOCH framework (2025) identified five uniquely human capabilities—Empathy, Persuasion, Originality, Collaboration, and Hope/Vision—and found that human-intensive tasks actually increased in the O*NET job database between 2016 and 2024. Newly added tasks in 2024 had higher EPOCH scores than tasks that disappeared, suggesting the workforce is evolving toward more human skills.
Historical context provides further balance. Every major technology—writing, the printing press, calculators, the internet—provoked fears of cognitive decline. Socrates warned that writing would weaken memory. Calculators were predicted to destroy mental arithmetic. In each case, some skills were genuinely lost while new ones emerged. However, a 2025 Technology in Society paper argues the calculator-LLM analogy is ultimately insufficient: unlike calculators, which automate computation, LLMs redistribute cognitive labor across all stages of reasoning—making the current situation qualitatively different.[33]
The emerging consensus points toward a design challenge rather than a binary choice. Research on "hybrid intelligence" and AI-integrated scaffolding shows that AI tools designed to keep humans thinking—through structured prompting, metacognitive prompts, and graduated independence—can preserve autonomy while delivering benefits.[34] The key principle: AI should amplify thinking, not replace it.
The research converges on several insights that resist simple narratives. First, the crutch effect is real and well-documented—unrestricted AI access consistently produces better short-term performance but worse long-term learning across math, writing, coding, and medicine. Second, automation bias is a cognitive universal, not a personal failing—warning students about it is more productive than hoping they'll resist it through willpower. Third, the classic analogies work because they reveal a deeper pattern articulated by NASA researcher Stephen Casner: technology erodes not just what we do by hand but what we do by mind. The pilot's physical flying skills survived automation; their judgment did not.
The most actionable finding for curriculum design comes from the consistent gap between guided and unguided AI use. Bastani's GPT Tutor eliminated the learning penalty.[1] Anthropic's study showed developers who asked conceptual questions of AI scored 25+ percentage points higher than those who delegated coding.[8] Jošt's students who used AI for explanations learned normally; those who used it for code generation did not. The pedagogical directive is clear: teach students to use AI as a thinking partner, not an answer machine. The research suggests specific practices—always attempt a task independently first, use AI to check and challenge your own thinking rather than generate from scratch, and periodically work without AI to maintain baseline skills. These are not abstract principles but empirically supported strategies for an age when three-quarters of students already use generative AI and the tools grow more capable every seven months.