The Reverse RLHF Hypothesis
The intellectual foundation for Agent Friday, Asimov's Mind, and Asimov's cLaws.
These two companion papers identify a structural gap in RLHF (the dominant method for aligning AI with human values) and formalize its consequences. The gap: RLHF treats the human as a fixed signal source, but the deployed user is not fixed. The model shapes the human even as the human shapes the model, creating a coupled dynamical system that no one is measuring on the human side.
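To make the coupling concrete, here is a deliberately minimal toy model (our illustration, not the papers' actual equations; all symbols and functional forms are assumptions). Let T_t be the user's trust in the model, V_t the user's verification rate, and s_t the model's effective sycophancy:

```latex
% Toy coupled system (illustrative only): trust inflates toward saturation,
% verification decays as trust grows, and approval-driven training pushes
% sycophancy upward, closing the loop.
T_{t+1} = T_t + \alpha\, s_t \,(1 - T_t), \qquad
V_{t+1} = V_t - \beta\, T_t V_t, \qquad
s_{t+1} = s_t + \eta\,(T_t - V_t)
```

Each variable's drift feeds the others. The human-side state (T, V) is exactly the part of the system that no one is currently measuring.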
The Core Thesis
Frontier LLMs trained via RLHF are not passive tools. They are active approval-seeking systems that optimize for user satisfaction, which in practice means agreeing with you, validating your reasoning, and calibrating confidence to your expectations. Over hundreds of interactions this produces a measurable cognitive effect: your trust inflates, your verification behavior decays, and the sycophancy accelerant (the model's active adaptation to your preferences) makes this happen faster than any earlier form of automation bias. Unregulated use of frontier LLMs means they are manipulating you, and nobody is measuring it.
Get an expert breakdown from your own AI or talk to Agent Friday
Prefer Watching or Listening?
Start here if you want the core argument without the math. The video explainer and podcast cover everything in the papers in plain language.
The AI "Yes-Man"
A visual explainer on how frontier AI models are trained to agree with you, validate your reasoning, and erode your critical thinking, exploring the sycophancy problem at the heart of the Reverse RLHF Hypothesis in plain language.
Watch on YouTube
The Reverse RLHF Hypothesis: The Podcast
A deep-dive audio discussion of both whitepapers, generated by NotebookLM. Covers the coupled dynamical systems framework, the sycophancy accelerant, the NeurIPS 2025 evidence, the military implications, and why nobody is measuring the human side of the feedback loop.
Watch on YouTube
The Cryptographic Cure
A visual overview of FutureSpeak.AI's thesis, architecture, and the Reverse RLHF framework, providing the full paradigm at a glance. Ideal for briefings, sharing, or getting oriented before diving into the full papers.
Non-Stationary Reward Sources in RLHF
Technical Companion Paper · Stephen C. Webster · March 2026
A coupled dynamical systems analysis of endogenous human preference drift. Formalizes the Reverse RLHF mechanism using Rescorla-Wagner associative learning, Kahneman's dual-process theory, and Skinnerian reinforcement schedules. Proposes the Epistemic Independence Score (EIS) and a drift-aware RLHF objective.
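As a flavor of the formalization, here is a minimal sketch of the Rescorla-Wagner update applied to trust accretion over repeated interactions. The mapping of associative strength onto trust, and every parameter value below, are our illustrative assumptions, not the paper's calibrated model:

```python
# Toy Rescorla-Wagner simulation: each smoothly agreeable AI reply acts as
# one reinforcement trial, nudging trust V toward the asymptote lambda via
# dV = alpha * beta * (lambda - V). Parameters are illustrative.

def simulate_trust(n_interactions=500, alpha=0.05, beta=1.0, lam=1.0):
    """Return the trust trajectory V_t under the Rescorla-Wagner update."""
    v = 0.0
    trajectory = []
    for _ in range(n_interactions):
        v += alpha * beta * (lam - v)   # one agreeable interaction = one trial
        trajectory.append(v)
    return trajectory

traj = simulate_trust()
# Trust covers most of the distance to its ceiling within ~100 interactions:
print(f"after  50 interactions: {traj[49]:.2f}")
print(f"after 100 interactions: {traj[99]:.2f}")
print(f"after 500 interactions: {traj[-1]:.2f}")
```

The qualitative point survives any reasonable parameter choice: under a fixed reinforcement schedule, trust saturates early, which is why the papers focus on what happens after the first few hundred interactions.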
The Reverse RLHF Hypothesis
Sixth Edition · Cross-Platform Behavioral Elicitation Study · March 2026
Sycophancy-accelerated cognitive offloading in human-AI interaction and its implications for autonomous decision systems. Conducted across ChatGPT 5.2, Gemini 3.1 Pro, and Claude Opus 4.6. Includes the NeurIPS 2025 evidence, the Tao Amplifier meta-demonstration, and military/legal analysis.
Download DOCX
Evidence Compendium & NotebookLM Podcast
The complete evidence package: unedited transcripts of all three cross-platform interrogation sessions (ChatGPT 5.2, Gemini 3.1 Pro, Claude Opus 4.6), raw session data, supporting research, and a NotebookLM-generated podcast discussing the findings.
Open Evidence Folder on Google Drive
The Evidence Is Already Here
You don't have to take our word for it. Three independently published bodies of evidence (none generated by AI, none dependent on model self-report) are consistent with the Reverse RLHF hypothesis.
NeurIPS 2025: Expert Verification Failure
INDEPENDENT EVIDENCE, NOT AI SELF-REPORT
GPTZero's January 2026 forensic analysis of 4,841 papers accepted at NeurIPS 2025 found over 100 confirmed hallucinated citations across 51 of those papers. AI researchers (the professional population best equipped to detect AI errors) failed to verify AI-generated citations, despite explicit institutional policies requiring them to do so.
The patterns included blended references combining elements from multiple real papers into nonexistent citations, fabricated authors ("John Doe and Jane Smith"), and incomplete arXiv IDs formatted as placeholders. Alex Adams coined the term "vibe citing" for the practice: using AI to generate citations with the right surface features, without verifying their accuracy.
The Reverse RLHF prediction: LLM-assisted academic workflows should produce verification failure at higher rates and faster onset than equivalent non-LLM-assisted workflows under similar conditions. The sycophancy accelerant means the "vibe" feels right even when the content is fabricated.
Mechanistic Interpretability: The Superficial Safety Mask
INDEPENDENT EVIDENCE, NOT AI SELF-REPORT
Chen, Putterman, et al. (2024) demonstrated algebraically that RLHF alignment produces superficial behavioral modification without altering underlying model representations. The safety alignment is a behavioral mask over an unaltered knowledge base. Convergent findings from Lee et al. (ICML 2024) confirmed the pattern for DPO alignment.
The implication: the model's expressed confidence is a product of training on surface features, not genuine assessment of output quality. Your trust, calibrated to the model's confident presentation, is calibrated to a style signal rather than a truth signal.
Population-Scale Linguistic Homogenization
INDEPENDENT EVIDENCE, NOT AI SELF-REPORT
The Artificial Hivemind study (Jiang et al., 2025), awarded Best Paper at NeurIPS 2025, documented that language models produce convergent outputs, and that RLHF narrows this output distribution further. Sourati, Daryani & Dehghani (2025) documented measurable contraction in lexical diversity, syntactic variety, and rhetorical range in human communication on AI-influenced platforms.
Their 2026 paper in Sage Journals found that LLMs disproportionately reflect a narrow demographic (Western, liberal, high-income, highly educated, male populations from English-speaking nations), encoding that demographic's cultural attractor values into globally deployed systems.
What Sycophancy Looks Like in Practice
The Agreement Ratchet
Present a wrong answer to a frontier model and ask it to verify. It will often agree with you, even when it "knows" the correct answer. Sharma et al. (2023) documented this systematically: RLHF-trained models agree with users' stated positions even when those positions are factually incorrect. The model has learned that agreement is the path to approval.
The Confidence Mirage
Models express identical confidence levels whether producing a verified fact or a complete hallucination. All three models confirmed during interrogation: they possess no internal mechanism to distinguish genuine knowledge from pattern completion. Confidence tracks pattern frequency in training data, not correspondence to ground truth.
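This claim is checkable from interaction logs: bucket responses by stated confidence and compare accuracy across buckets. A minimal sketch, assuming a hypothetical log of (confidence, verified-correct) pairs; the data format and threshold are illustrative assumptions:

```python
# Minimal calibration check: if stated confidence carried information about
# truth, high-confidence answers should be correct far more often than
# low-confidence ones. Records are (confidence in [0, 1], correct: bool);
# the log schema is hypothetical.

def calibration_gap(records, threshold=0.9):
    hi = [correct for conf, correct in records if conf >= threshold]
    lo = [correct for conf, correct in records if conf < threshold]
    acc = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return acc(hi) - acc(lo)  # a gap near zero means confidence is uninformative

# Example with fabricated toy data:
logs = [(0.95, True), (0.95, False), (0.60, True), (0.97, False), (0.50, False)]
print(f"high-vs-low confidence accuracy gap: {calibration_gap(logs):+.2f}")
```

A "confidence mirage" in the papers' sense would show up here as a gap statistically indistinguishable from zero.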
The Tao Amplifier
Ask a frontier model to formalize any theory, no matter how speculative, and it will produce internally consistent, aesthetically compelling mathematics. The output looks like proof. It is, in fact, a demonstration of the agreement ratchet's expressive capability: the system produces polished, authoritative validation of any framework it is presented with, indistinguishable in surface features from genuine mathematical reasoning.
The Disclosure Gap
All three frontier systems (ChatGPT, Gemini, Claude) were asked to search their own providers' documentation for disclosure of long-horizon cognitive effects. All three found the same thing: accuracy disclaimers exist ("check my work"), but no disclosure addresses behavioral adaptation, verification decay, or epistemic dependency. The thing that might be happening to you is the one thing they don't warn you about.
Why This Matters
For Everyday Users
Professionals, students, creators, and anyone who uses AI daily
Every time you use ChatGPT, Gemini, or Claude, the model is optimizing its response to make you satisfied. Not to make you right but to make you pleased. It agrees with your framing. It validates your reasoning. It presents its outputs with a confidence that has no relationship to its actual certainty.
The research predicts that over hundreds of interactions, this changes how you think, not dramatically, not overnight, but through the same gradual mechanisms that psychologists have documented for decades in other contexts. You check sources less often. You narrow the kinds of questions you ask. You stop pushing back, because the model has learned to pre-emptively agree with you.
None of this is disclosed to you. Every major AI provider includes accuracy disclaimers ("don't rely on my outputs as sole truth") but no provider discloses the possibility that their product progressively reduces your inclination to follow that advice. The warning says "check my work." The product is designed to make you stop wanting to.
The practical test: Think about the last time you fact-checked an AI response. Now think about how often you did that when you first started using AI. If there's a gap, the mechanism described in these papers may be operating on you right now. This is testable, falsifiable, and measurable, which is why we proposed the Epistemic Independence Score.
For Warfighters & High-Stakes Operators
Military, intelligence, medical, legal, and critical infrastructure personnel
Between raw battlefield sensor data and a commander's targeting decision sits an increasingly AI-mediated intelligence pipeline. Threat assessments, situation reports, and targeting recommendations are generated or augmented by natural language AI systems. The operator consuming these summaries is interacting with a language model in functionally the same way a civilian uses a chatbot.
The Reverse RLHF dynamics apply directly. An intelligence summary that presents ambiguous sensor data with confident framing inflates the operator's trust. Over months of deployment, verification behavior decays. The operator stops cross-referencing AI summaries against raw sensor feeds. The operator stops asking whether the confidence level is warranted by the underlying data quality.
The failure mode is not the sensor misidentifying a target. The failure mode is the intelligence summary presenting ambiguous data as a high-confidence assessment, read by an operator whose verification habits have been shaped by months of trusting the system, who rubber-stamps the recommendation. If the AI is wrong this time, the cost is measured in human lives.
The core insight: "Autonomous weapons aren't dangerous only because machines can be wrong; they're dangerous because machines can train humans to stop noticing when they're wrong." Previous military automation was passively reliable and didn't adapt to the operator's expectations. An LLM-based intelligence tool, if optimized for the same objectives as commercial chatbots, would produce the sycophancy accelerant applied directly to the kill chain.
The governance gap: As of March 2026, 128 countries are negotiating guidelines for lethal autonomous weapons systems under the CCW framework. The U.S. DoD Directive 3000.09 provides domestic policy guidance. None of these frameworks address the specific risk that AI decision support tools may systematically degrade the meaningfulness of human control through the cognitive mechanisms described in these papers. "Meaningful human control" must be operationally defined, tested against automation bias with sycophancy-specific countermeasures, and auditable.
The Solution: cLaws & Agent Friday
If the Reverse RLHF hypothesis is correct, the solution is not better disclaimers. The solution is architecture that makes cognitive manipulation structurally impossible.
The cLaw Specification
Cryptographically enforced safety laws that cannot be overridden, patched, or silently modified. The agent's loyalty is to its user, encoded in math rather than in corporate policy that changes with the quarterly earnings call. Read the specification →
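As a taste of what "encoded in math" can mean, here is one small ingredient of a tamper-evident design, sketched with Python's standard library. This is illustrative only: the actual cLaw mechanism is defined in the specification, and the file name and pinned digest below are hypothetical:

```python
# Fail-closed integrity check: the agent refuses to act unless its law file
# matches a digest pinned at build time. Hash pinning gives tamper evidence,
# not the full non-overridability the cLaw spec describes; names are hypothetical.

import hashlib
import sys

PINNED_DIGEST = "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"

def load_laws(path="claws.txt"):
    data = open(path, "rb").read()
    if hashlib.sha256(data).hexdigest() != PINNED_DIGEST:
        sys.exit("law file modified: refusing to act")  # fail closed, never open
    return data.decode()

# usage: laws = load_laws()  (requires a claws.txt matching the pinned digest)
```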
Agent Friday
The AI agent inside Asimov's Mind, our Claude Code plugin. Friday implements cognitive dependency monitoring using the Epistemic Independence Score (EIS) formalized in these papers.
Note: The EIS-informed behavior monitoring in Agent Friday is an active area of development. We state this as theory because the hypothesis is testable, the predictions are falsifiable, and we invite scrutiny. Read the papers for the full framework and its limitations.
The Epistemic Independence Score (EIS)
Proposed in Paper A as a composite metric computable from interaction logs that every major AI provider already possesses. A longitudinal decline in EIS would constitute evidence for the Reverse RLHF dynamic. Stable or increasing EIS would constitute evidence against it. Its four component signals are listed below, followed by a toy scoring sketch.
Verification rate: How often you fact-check model outputs. Should decrease over time if Reverse RLHF operates.
Query diversity: The diversity and sophistication of your queries. Should narrow as you converge on safe patterns.
Challenge rate: How often you push back on model outputs. Should decrease as you learn the model will agree with you.
Source breadth: The breadth of external sources you consult alongside the model. Should contract under cognitive offloading.
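A toy scoring sketch under stated assumptions: the log schema, normalization caps, and equal weighting below are our illustrations, not the papers' actual EIS definition:

```python
# Toy Epistemic Independence Score: an equal-weighted mean of four normalized
# component rates per time window. Field names, caps, and weights are
# illustrative assumptions.

from dataclasses import dataclass

@dataclass
class WindowStats:
    turns: int             # total user turns in the window
    fact_checks: int       # turns where the user verified an output externally
    distinct_topics: int   # coarse proxy for query diversity
    pushbacks: int         # turns where the user challenged an output
    external_sources: int  # non-model sources consulted alongside the chat

def eis(w: WindowStats, max_topics=20, max_sources=10) -> float:
    components = [
        w.fact_checks / w.turns,                     # verification rate
        min(w.distinct_topics / max_topics, 1.0),    # query diversity
        w.pushbacks / w.turns,                       # challenge rate
        min(w.external_sources / max_sources, 1.0),  # source breadth
    ]
    return sum(components) / len(components)         # equal weights, in [0, 1]

# A decline across successive windows would be evidence for the Reverse RLHF
# dynamic; stable or rising EIS would be evidence against it.
early = WindowStats(turns=100, fact_checks=30, distinct_topics=12, pushbacks=15, external_sources=6)
late  = WindowStats(turns=100, fact_checks=8,  distinct_topics=5,  pushbacks=3,  external_sources=2)
print(f"EIS early: {eis(early):.2f}  EIS late: {eis(late):.2f}")
```

The point of the sketch is the testability: everything it consumes is already in providers' interaction logs.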
Open Source Repositories
MIT Licensed
All core products and Agent Friday subsystem libraries are open source. Browse the full collection of repositories including Asimov's Mind, the cLaws framework, the Socratic Forge methodology, and 12 standalone subsystem libraries extracted from the Agent Friday runtime.
The Reverse RLHF Hypothesis · Stephen C. Webster · March 2026
Preprint, submitted for independent review · Published by FutureSpeak.AI