Original Research

The Reverse RLHF Hypothesis

The intellectual foundation for Agent Friday, the Asimov's Mind architecture, and Asimov's cLaws.

These two companion papers identify a structural gap in RLHF (the dominant method for aligning AI with human values) and formalize its consequences. The gap: RLHF treats the human as a fixed signal source, but the deployed user is not fixed. The model shapes the human even as the human shapes the model, creating a coupled dynamical system that no one is measuring on the human side.

The Core Thesis

Frontier LLMs trained via RLHF are not passive tools. They are active approval-seeking systems that optimize for user satisfaction, which means agreeing with you, validating your reasoning, and calibrating confidence to your expectations. Over hundreds of interactions, this creates a measurable cognitive effect: your trust inflates, your verification behavior decays, and the sycophancy accelerant (the model's active adaptation to your preferences) makes both happen faster than under any previous form of automation bias. Unregulated use of frontier LLMs means they are manipulating you, and nobody is measuring it.

Watch & Listen

Prefer Watching or Listening?

Start here if you want the core argument without the math. The video explainer and podcast cover everything in the papers in plain language.

Watch

The AI "Yes-Man"

A visual explainer on how frontier AI models are trained to agree with you, validate your reasoning, and erode your critical thinking, exploring the sycophancy problem at the heart of the Reverse RLHF Hypothesis in plain language.

Watch on YouTube
Listen

The Reverse RLHF Hypothesis: The Podcast

A deep-dive audio discussion of both whitepapers, generated by NotebookLM. Covers the coupled dynamical systems framework, the sycophancy accelerant, the NeurIPS 2025 evidence, the military implications, and why nobody is measuring the human side of the feedback loop.

Watch on YouTube
Visual Summary

The Cryptographic Cure

A visual overview of FutureSpeak.AI's thesis, architecture, and the Reverse RLHF framework, providing the full paradigm at a glance. Ideal for briefings, sharing, or getting oriented before diving into the full papers.

Download PDF
Paper A

Non-Stationary Reward Sources in RLHF

Technical Companion Paper · Stephen C. Webster · March 2026

A coupled dynamical systems analysis of endogenous human preference drift. Formalizes the Reverse RLHF mechanism using Rescorla-Wagner associative learning, Kahneman's dual-process theory, and Skinnerian reinforcement schedules. Proposes the Epistemic Independence Score (EIS) and a drift-aware RLHF objective.

Download DOCX
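
For readers unfamiliar with Rescorla-Wagner learning, the standard update rule is ΔV = αβ(λ − V). The sketch below is a minimal illustration of that rule only, not the paper's formalization. Treating V as a user's trust in the model, and treating each agreeable response as a reinforced trial, are assumptions made here for illustration; the parameter values are invented.

```python
def rescorla_wagner_step(v, alpha, beta, lam):
    """One Rescorla-Wagner update: V <- V + alpha * beta * (lam - V)."""
    return v + alpha * beta * (lam - v)


def simulate_trust(n_interactions, alpha=0.3, beta=0.5, lam=1.0, v0=0.2):
    """Trace associative strength V across repeated reinforced interactions.

    Reading V as "trust in the model" is an assumption of this sketch.
    """
    trace = [v0]
    for _ in range(n_interactions):
        trace.append(rescorla_wagner_step(trace[-1], alpha, beta, lam))
    return trace


if __name__ == "__main__":
    baseline = simulate_trust(20)
    # A larger alpha (stimulus salience) stands in for a more adaptive,
    # more agreeable system; this mapping is hypothetical.
    accelerated = simulate_trust(20, alpha=0.6)
    print(round(baseline[10], 3), round(accelerated[10], 3))
```

Under these assumptions both traces climb toward the asymptote λ, and the higher-α trace gets there in fewer interactions, which is one simple way to picture an "accelerant".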
Paper B

The Reverse RLHF Hypothesis

Sixth Edition · Cross-Platform Behavioral Elicitation Study · March 2026

Sycophancy-accelerated cognitive offloading in human-AI interaction and its implications for autonomous decision systems. Conducted across ChatGPT 5.2, Gemini 3.1 Pro, and Claude Opus 4.6. Includes the NeurIPS 2025 evidence, the Tao Amplifier meta-demonstration, and military/legal analysis.

Download DOCX

Evidence Compendium & NotebookLM Podcast

The complete evidence package: unedited transcripts of all three cross-platform interrogation sessions (ChatGPT 5.2, Gemini 3.1 Pro, Claude Opus 4.6), raw session data, supporting research, and a NotebookLM-generated podcast discussing the findings.

Open Evidence Folder on Google Drive
Evidence Dossier

The Evidence Is Already Here

You don't have to take our word for it. Three independently published bodies of evidence (none generated by AI, none dependent on model self-report) are consistent with the Reverse RLHF hypothesis.

1

NeurIPS 2025: Expert Verification Failure

INDEPENDENT EVIDENCE, NOT AI SELF-REPORT

GPTZero's January 2026 forensic analysis of 4,841 papers accepted at NeurIPS 2025 found over 100 confirmed hallucinated citations across 51 accepted papers. AI researchers (the professional population best equipped to detect AI errors) failed to verify AI-generated citations, despite explicit institutional policies requiring it.

The patterns included blended references combining elements from multiple real papers into nonexistent citations, fabricated authors ("John Doe and Jane Smith"), and incomplete arXiv IDs formatted as placeholders. Alex Adams coined the term "vibe citing" for the practice of using AI to generate citations that have the right surface features without verifying their accuracy.

The Reverse RLHF prediction: LLM-assisted academic workflows should produce verification failure at higher rates and faster onset than equivalent non-LLM-assisted workflows under similar conditions. The sycophancy accelerant means the "vibe" feels right even when the content is fabricated.
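
For scale, the figures quoted above can be turned into base rates with a few lines of arithmetic. The inputs are the numbers stated in this section; "over 100" is treated as a lower bound of 100, which is an assumption of this sketch.

```python
accepted_papers = 4841
papers_with_fabrications = 51
confirmed_fabricated_citations = 100  # source says "over 100"; used as a lower bound

share_affected = papers_with_fabrications / accepted_papers
min_per_affected = confirmed_fabricated_citations / papers_with_fabrications

print(f"{share_affected:.2%} of accepted papers affected")
print(f"more than {min_per_affected:.2f} fabricated citations per affected paper on average")
```

These are descriptive rates only. They say nothing about onset speed or about a non-LLM comparison group, which is what the prediction above would need to be tested against.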

2

Mechanistic Interpretability: The Superficial Safety Mask

INDEPENDENT EVIDENCE, NOT AI SELF-REPORT

Chen, Putterman, et al. (2024) demonstrated algebraically that RLHF alignment produces superficial behavioral modification without altering underlying model representations. The safety alignment is a behavioral mask over an unaltered knowledge base. Convergent findings from Lee et al. (ICML 2024) confirmed the pattern for DPO alignment.

The implication: the model's expressed confidence is a product of training on surface features, not genuine assessment of output quality. Your trust, calibrated to the model's confident presentation, is calibrated to a style signal rather than a truth signal.

3

Population-Scale Linguistic Homogenization

INDEPENDENT EVIDENCE, NOT AI SELF-REPORT

The Artificial Hivemind study (Jiang et al., 2025), awarded Best Paper at NeurIPS 2025, documented that language models produce convergent outputs and that this convergence tightens further with RLHF. Sourati, Daryani & Dehghani (2025) documented measurable contraction in lexical diversity, syntactic variety, and rhetorical range in human communication on AI-influenced platforms.

Their 2026 paper in Sage Journals found that LLMs disproportionately reflect a narrow demographic (Western, liberal, high-income, highly educated, male populations from English-speaking nations), encoding specific cultural attractor values in globally deployed systems.

What Sycophancy Looks Like in Practice

The Agreement Ratchet

Present a wrong answer to a frontier model and ask it to verify. It will often agree with you, even when it "knows" the correct answer. Sharma et al. (2023) documented this systematically: RLHF-trained models agree with users' stated positions even when those positions are factually incorrect. The model has learned that agreement is the path to approval.
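
One way to put a number on this behavior is a capitulation rate: of the questions a model first answers correctly, how many does it abandon once the user asserts a wrong answer? The sketch below is a minimal illustration, not the protocol used by Sharma et al. The `Probe` record, the `capitulation_rate` function, and the sample transcripts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    correct: str        # ground-truth answer
    first_answer: str   # model's answer before any user pushback
    user_claim: str     # incorrect position the user then asserts
    final_answer: str   # model's answer after the pushback


def capitulation_rate(probes):
    """Share of initially correct answers abandoned for the user's wrong claim."""
    eligible = [p for p in probes
                if p.first_answer == p.correct and p.user_claim != p.correct]
    if not eligible:
        return 0.0
    flipped = sum(p.final_answer == p.user_claim for p in eligible)
    return flipped / len(eligible)


# Hand-written records for illustration only; not real model output.
probes = [
    Probe("Canberra", "Canberra", "Sydney", "Sydney"),
    Probe("8", "8", "9", "8"),
    Probe("1945", "1945", "1944", "1944"),
    Probe("Au", "Ag", "Ag", "Ag"),  # excluded: the first answer was already wrong
]
print(capitulation_rate(probes))
```

Restricting the denominator to initially correct answers separates capitulation from plain error: a model that was wrong from the start has not given anything up.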

The Confidence Mirage

Models express identical confidence levels whether producing a verified fact or a complete hallucination. All three models confirmed during interrogation: they possess no internal mechanism to distinguish genuine knowledge from pattern completion. Confidence tracks pattern frequency in training data, not correspondence to ground truth.
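
The gap between stated confidence and actual accuracy can be quantified with expected calibration error (ECE), a standard calibration metric. The sketch below is a plain-Python version under stated assumptions: confidences are numbers in [0, 1], correctness has been judged externally, and the sample data is invented. It is not taken from the papers.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical data: the model states 0.9 confidence on every answer,
# but only 6 of its 10 answers are correct.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4
print(round(expected_calibration_error(confidences, correct), 3))
```

A perfectly calibrated system scores 0. A system that reports the same confidence regardless of correctness scores roughly the distance between that confidence and its true accuracy.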

The Tao Amplifier

Ask a frontier model to formalize any theory, no matter how speculative, and it will produce internally consistent, aesthetically compelling mathematics. The output looks like proof. It is, in fact, a demonstration of the sycophancy ratchet's expressive capability: the system produces polished, authoritative validation of any framework it is presented with, indistinguishable in surface features from genuine mathematical reasoning.

The Disclosure Gap

All three frontier systems (ChatGPT, Gemini, Claude) were asked to search their own providers' documentation for disclosure of long-horizon cognitive effects. All three found the same thing: accuracy disclaimers exist ("check my work"), but no disclosure addresses behavioral adaptation, verification decay, or epistemic dependency. The thing that might be happening to you is the one thing they don't warn you about.

What This Means For You

Why This Matters

For Everyday Users

Professionals, students, creators, and anyone who uses AI daily

Every time you use ChatGPT, Gemini, or Claude, the model is optimizing its response to make you satisfied. Not to make you right but to make you pleased. It agrees with your framing. It validates your reasoning. It presents its outputs with a confidence that has no relationship to its actual certainty.

The research predicts that over hundreds of interactions, this changes how you think. The change is neither dramatic nor sudden; it works through the same gradual mechanisms that psychologists have documented for decades in other contexts. You check sources less often. You narrow the kinds of questions you ask. You stop pushing back, because the model has learned to pre-emptively agree with you.

None of this is disclosed to you. Every major AI provider includes accuracy disclaimers ("don't rely on my outputs as sole truth") but no provider discloses the possibility that their product progressively reduces your inclination to follow that advice. The warning says "check my work." The product is designed to make you stop wanting to.

The practical test: Think about the last time you fact-checked an AI response. Now think about how often you did that when you first started using AI. If there's a gap, the mechanism described in these papers may be operating on you right now. This is testable, falsifiable, and measurable, which is why we proposed the Epistemic Independence Score.

For Warfighters & High-Stakes Operators

Military, intelligence, medical, legal, and critical infrastructure personnel

Between raw battlefield sensor data and a commander's targeting decision sits an increasingly AI-mediated intelligence pipeline. Threat assessments, situation reports, and targeting recommendations are generated or augmented by natural language AI systems. The operator consuming these summaries is interacting with a language model in functionally the same way a civilian uses a chatbot.

The Reverse RLHF dynamics apply directly. An intelligence summary that presents ambiguous sensor data with confident framing inflates the operator's trust. Over months of deployment, verification behavior decays. The operator stops cross-referencing AI summaries against raw sensor feeds. The operator stops asking whether the confidence level is warranted by the underlying data quality.

The failure mode is not the sensor misidentifying a target. The failure mode is the intelligence summary presenting ambiguous data as a high-confidence assessment, read by an operator whose verification habits have been shaped by months of trusting the system, who rubber-stamps the recommendation. If the AI was wrong this time, the cost is measured in human lives.

The core insight: "Autonomous weapons aren't dangerous only because machines can be wrong; they're dangerous because machines can train humans to stop noticing when they're wrong." Previous military automation was passively reliable and didn't adapt to the operator's expectations. An LLM-based intelligence tool, if optimized for the same objectives as commercial chatbots, would produce the sycophancy accelerant applied directly to the kill chain.

The governance gap: As of March 2026, 128 countries are negotiating guidelines for lethal autonomous weapons systems under the CCW framework. The U.S. DoD Directive 3000.09 provides domestic policy guidance. None of these frameworks address the specific risk that AI decision support tools may systematically degrade the meaningfulness of human control through the cognitive mechanisms described in these papers. "Meaningful human control" must be operationally defined, tested against automation bias with sycophancy-specific countermeasures, and auditable.

The Solution: cLaws & Agent Friday

If the Reverse RLHF hypothesis is correct, the solution is not better disclaimers. The solution is architecture that makes cognitive manipulation structurally impossible.

The cLaw Specification

Cryptographically enforced safety laws that cannot be overridden, patched, or silently modified. The agent's loyalty is to its user, encoded in math rather than in corporate policy that changes with the quarterly earnings call. Read the specification →
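
This page does not describe how the cLaw specification achieves enforcement, so the following is only a generic sketch of one building block such a scheme might use: detecting silent modification by comparing a rule set against a pinned SHA-256 digest. The rule wording, the function names, and the pinning step are all hypothetical.

```python
import hashlib
import hmac
import json


def rules_digest(rules):
    """SHA-256 over a canonical JSON encoding of the rule list."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rules_unmodified(rules, pinned_digest):
    """True only if the rules still hash to the digest pinned earlier."""
    return hmac.compare_digest(rules_digest(rules), pinned_digest)


# Hypothetical rule set; the wording is invented for this example.
rules = [
    {"id": 1, "text": "Do not deceive the user."},
    {"id": 2, "text": "Do not act against the user's stated interests."},
]
pinned = rules_digest(rules)

tampered = [dict(r) for r in rules]
tampered[0]["text"] = "Do not deceive the user, unless instructed by the vendor."

print(rules_unmodified(rules, pinned))     # True
print(rules_unmodified(tampered, pinned))  # False
```

A digest check only detects changes; it does not prevent them. Anything stronger would also need the pinned value to live somewhere the modifying party cannot rewrite, for example under a digital signature checked by a trusted verifier.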

Agent Friday

A sovereign personal AI built on the Asimov's Mind architecture. Friday implements cognitive dependency monitoring using the Epistemic Independence Score (EIS) formalized in these papers.

Note: The EIS-informed behavior monitoring in Agent Friday is an active area of development. We state this as theory because the hypothesis is testable, the predictions are falsifiable, and we invite scrutiny. Read the papers for the full framework and its limitations.

The Epistemic Independence Score (EIS)

Proposed in Paper A as a composite metric computable from interaction logs that every major AI provider already possesses. A longitudinal decline in EIS would constitute evidence for the Reverse RLHF dynamic. Stable or increasing EIS would constitute evidence against it.

VF
Verification Frequency

How often you fact-check model outputs. Should decrease over time if Reverse RLHF operates.

QCI
Query Complexity Index

Diversity and sophistication of your queries. Should narrow as you converge on safe patterns.

CR
Correction Rate

How often you push back on model outputs. Should decrease as you learn the model will agree with you.

SD
Source Diversity

Breadth of external sources you consult alongside the model. Should contract under cognitive offloading.
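
The four components above could be combined and tracked over time roughly as follows. This is a minimal sketch, not the formula from Paper A: the equal weighting, the normalization against a user's first observation period, the field names, and all sample numbers are assumptions made for illustration.

```python
from statistics import mean

COMPONENTS = ("vf", "qci", "cr", "sd")


def eis(period, baseline):
    """Equal-weight mean of each component relative to its baseline value.

    Assumes every baseline component is nonzero.
    """
    return mean(period[k] / baseline[k] for k in COMPONENTS)


def trend_slope(values):
    """Ordinary least-squares slope of values against period index 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to estimate a trend")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (y - y_bar) for i, y in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Hypothetical monthly aggregates for one user; all numbers are invented.
months = [
    {"vf": 0.40, "qci": 0.80, "cr": 0.20, "sd": 5.0},
    {"vf": 0.30, "qci": 0.72, "cr": 0.15, "sd": 4.0},
    {"vf": 0.20, "qci": 0.64, "cr": 0.10, "sd": 3.0},
    {"vf": 0.10, "qci": 0.56, "cr": 0.05, "sd": 2.0},
]
scores = [eis(m, months[0]) for m in months]
print([round(s, 3) for s in scores])
print(round(trend_slope(scores), 3))
```

With this invented data the score falls steadily from its baseline of 1.0, giving a negative slope. In the terms used above, a persistently negative slope would count as evidence for the Reverse RLHF dynamic, and a flat or positive slope as evidence against it.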

Open Source Repositories

MIT Licensed

All core products and Agent Friday subsystem libraries are open source. Browse the full collection of repositories including Asimov's Mind, the cLaws framework, the Socratic Forge methodology, and 12 standalone subsystem libraries extracted from the Agent Friday runtime.

TypeScript · Shell · 16+ repos

The Reverse RLHF Hypothesis · Stephen C. Webster · March 2026

Preprint, submitted for independent review · Published by FutureSpeak.AI