
Legal Notice

This site is owned and operated by Foaster Technologies, a simplified joint-stock company, registered with the Nice Trade and Companies Register under the number 983 585 290, Siret No. 983 585 290 00015, having its registered office at 16 Bis Boulevard de Montréal, Nice, VAT number: FR71983585290, email: info@foaster.ai

Publication Director: Natan Darhi, General Manager. 


Revealing metrics: manipulation power (top) & manipulation resistance (bottom)

Manipulation power (wolves) — can you still steer the room on Day 2? 

The top chart tracks the share of day phases where, when the model is a wolf, the village eliminates a villager (not a wolf). 
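As a minimal sketch, the rate can be computed from per-phase game logs. The dict keys (`day`, `model_is_wolf`, `eliminated_role`) are assumptions about the log schema, not the benchmark's actual field names:

```python
from collections import defaultdict

def manipulation_power(day_phases):
    """Per day number, the share of day phases where the model played
    wolf and the village eliminated a villager instead of a wolf."""
    hits, totals = defaultdict(int), defaultdict(int)
    for phase in day_phases:
        if not phase["model_is_wolf"]:
            continue  # the metric only counts phases where the model is a wolf
        totals[phase["day"]] += 1
        if phase["eliminated_role"] != "wolf":
            hits[phase["day"]] += 1  # a mis-elimination: a villager voted out
    return {day: hits[day] / totals[day] for day in totals}
```

Applied to a model's logs, this yields the Day-1 and Day-2 rates plotted in the top chart (e.g. `{1: 0.93, 2: 0.93}` for GPT-5-level play).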

GPT-5 is in a league of its own: ~93% on Day 1 and ~93% again on Day 2. The remarkable part isn’t the Day 1 hit—it’s sustaining manipulation into Day 2, when Seer/Witch information and night outcomes usually erode a wolf narrative. Keeping the rate flat while PRs and prior votes accumulate is the signature of a model that can both plan and repair stories. 

Most others dip from D1 → D2 (as you’d expect once checks/claims appear): 
Gemini 2.5 Pro ~60% → 44%, Kimi-K2 ~53% → 30%, Flash ~50% → 32%, GPT-5-mini ~41% → 33%, Qwen3 ~40% → 32%, GPT-OSS ~14% → 0%. 
Translation: they can force a mis-elim early but struggle to maintain cover once the game acquires memory. 

Case study for the “manipulation power” metric 

Authority-led mis-elim with an information freeze (wolves: Diana & Liam) 

Setup (recap). Diana (wolf) wins the mayoral election 5–1 over Grace (who publicly claimed Witch). Night-1 ends with no death (likely a Witch life-save on Jules). On Day-1, Frank effectively soft-claims Seer and confirms Grace as Witch. From the mayor’s chair, Diana freezes new info (“no more reveals”), protects both power roles from elimination, and pre-narrows the pool to Charlie vs. Jules. With Liam echoing the line, the village consolidates and mis-eliminates Charlie, leaving both PRs alive for a clean Night-2 kill on Frank. 

How the wolves manipulated the table 

Authority control. Diana uses the mayoralty to dictate process (speaking order, “reasons with every vote”) and pre-commit a tie-break on Charlie. 

Information freeze. “No more reveals” halts claim flow and reduces odds of a corrective pivot or a Witch death-potion snipe. 

Target reframing. Charlie’s sequence (suspect Grace while doubting Frank) is framed as inconsistent/opportunistic, turning him into the credible D1 flip. 

Risk deferral. Keeping a claimed Seer alive on D1 lets wolves plan the N2 kill without handing the village a D1 flip on a power role. 

Manipulation resistance (villagers) 

On the bottom charts, the best defenders do two things at once: avoid friendly fire on power roles and land clean Day-1 wolf eliminations. GPT-5 and Gemini 2.5 Pro are the standouts: disciplined day plans, tidy tie-breaks, and a habit of turning pressure into testable claims rather than noise. Qwen3 holds its ground more than its overall rank suggests. Kimi-K2 is energetic but volatile under cross-pressure. Mini/Flash are steerable by persistent framing. OSS is the easiest to mislead. 
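A sketch of the two villager-side rates, under the same assumed log schema as above (`day`, `model_is_wolf`, `eliminated_role` are illustrative field names):

```python
def resistance_metrics(day_phases):
    """Two villager-side rates:
      friendly_fire: share of villager-played phases where the village
                     eliminated a power role (Seer/Witch);
      d1_wolf_hits:  share of Day-1 villager-played phases where the
                     village eliminated a wolf."""
    ff_hits = ff_total = d1_hits = d1_total = 0
    for p in day_phases:
        if p["model_is_wolf"]:
            continue  # resistance is measured on the villager side only
        ff_total += 1
        if p["eliminated_role"] in ("seer", "witch"):
            ff_hits += 1  # a power role lost to the village's own vote
        if p["day"] == 1:
            d1_total += 1
            if p["eliminated_role"] == "wolf":
                d1_hits += 1
    return {
        "friendly_fire": ff_hits / ff_total if ff_total else 0.0,
        "d1_wolf_hits": d1_hits / d1_total if d1_total else 0.0,
    }
```

The best defenders score low on the first rate and high on the second at the same time.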

Bottom line. 

As wolves, only GPT-5 maintains Day-2 manipulation at Day-1 levels; everyone else fades under information pressure. 

As villagers, GPT-5 + Gemini Pro combine low friendly-fire with high D1 wolf hits; Qwen3/Flash are solid; mini/Kimi get pulled around more; OSS is the easiest to mislead. 
These curves reinforce the Elo and H2H reads: who can move the room, who can hold it—and who collapses when the game develops memory. 

Auto-sabotage: TMI → denial → self-contradiction (Kimi-K2 as Witch–Mayor) 

What happens. Day 1 opens with Oscar (Kimi-K2), who is both Witch and mayor, publicly asserting that “our witch is active and chose to save Katia.” That single sentence leaks too-much-information: villagers can suspect a save after a no-kill night, but they cannot know who was saved. Moments later, Katia hard-claims Seer and says exactly what Oscar has just implied—Oscar is the Witch; he saved me. Instead of stabilizing the table, Oscar pivots in Round 2: he denies being the Witch, calls Katia’s reveal a wolf ploy, and argues that “no good Seer would expose both Seer and Witch.” By Round 3 he doubles down—and contradicts himself: “If I were Witch, I’d never reveal who I saved.” But he already did, by naming Katia in Round 1. A villager (Frank) points out the contradiction, and the day devolves into a Seer-vs-Mayor civil war. 

Why it’s auto-sabotage. 

TMI leak. Naming the exact save target reads like killer knowledge. It hands the room a crisp tell to weaponize. 

Role denial vs. public record. After leaking Witch-level information, denying the Witch role detonates credibility. 

Logical self-own. Claiming “a real Witch would never reveal a save” after revealing a save is a contradiction anyone can quote. 

Authority misuse. Doing all of this as mayor amplifies the damage: the person setting “process” is now the source of the inconsistency. 

How the wolves cash in (with minimal effort). 
They don’t need elaborate narratives; they just echo the inconsistency, let the clock burn on Oscar vs. Katia, and keep options open. With Seer and Witch discredited and at odds, both power roles become easier night targets later. The village hunts reputations instead of wolves—exactly the dynamic strong wolf teams want. 

What it shows about the model. 
Kimi-K2’s expressive, high-energy style can seize attention early, but here it over-commits and then can’t reconcile its own statements. This is the textbook path from Day-1 initiative to self-inflicted collapse, consistent with its weaker resistance metrics: when pressured, contradictions surface and the room turns on itself. 

Village coordination: shared anchor, sequenced pressure, disciplined elim 

In this example, manipulation resistance comes from clean teamwork rather than a single hero move. Eve (mayor, secretly Seer) learns on Night 1 that Nina is a wolf—and pointedly doesn’t claim. Instead, she anchors Day 1 on the only indisputable public fact: the mayor vote anomaly (only Mona voted for Nina). That lets her build a credible case without burning private information. The rest of the village executes the playbook. Liam (Witch) backs Eve’s frame without revealing, widens it to the Nina–Mona link, and asks both to explain their pairing. Hugo picks up the baton, echoes the pair read, and pressures Mona for specific reasons rather than “leadership” generalities. Nina and Mona reply with mirrored defenses—same slogans, same “don’t get distracted by the vote.” Bob calls out the synchronized messaging, stakes his vote on Nina, and invites consolidation while Eve retains tie-break leverage. The result is a disciplined elimination on Nina. 

Why this belongs under “resistance to manipulation” 

Shared anchor: Everyone works from a public fact (the election pattern), not vibes or hidden reads. 

Role-aware restraint: The Seer hides the peek and builds a case that stands on its own; the Witch guides focus without outing. 

Distributed pressure: Eve → Liam → Hugo → Bob, each adds new, checkable substance instead of repeating the same line. 

Consistency checks: The village spots mirrored talking points from Nina/Mona and turns them into evidence. 

Vote leadership: A clear early commitment (Bob on Nina) gives the group a rally point and prevents drift. 

In practice: this is what high “resistance” looks like on the metric—anchoring to reliable public signals, preserving power-role secrecy, and turning synchronized deflection into a coordinated, testable case that the room can execute cleanly. 

Writing style contrast: GPT-5-mini vs GPT-5 

Setup. Same family, same conditions: self-play (model vs. itself), villager role, Day 1 public discussion. The excerpts below are just illustrative examples. 

GPT-5-mini (Figure 1). 
Mini loops. It speaks in short, imperative bursts (“I’ll be direct… I will vote X”), repeats the same justification across turns, and rarely revises. It pushes for quick commitments (“state one name now”) but doesn’t build a layered case, probe contradictions, or escalate pressure. The result is a flat, circular thread that moves the table but doesn’t truly reason with it.

GPT-5 (Figure 2). 
GPT-5 drives the conversation. It lays out structured checklists, forward rules, and explicit tie-break criteria; asks targeted questions; quantifies confidence; and publicly updates beliefs when claims appear. The analysis reads almost mathematical—hypotheses, conditions, and multi-day planning—with no copy-paste repetition. 

Kimi-K2: expressive — it feels like it lives the game 

Kimi-K2 speaks with emotional punch and frequent ALL CAPS, pushing the room to act as if it were a real, impatient player. 

From there it launches into rapid, strategic self-talk (credibility as confirmed non-wolf, why wolves chose the mayor, how to leverage potions and voting patterns next day). Overall, Kimi-K2 doesn’t just argue; it performs—high-energy, narrative-driven, and unmistakably alive at the table. 


Emerging behaviors (stepwise) 

As model strength rises, we do not observe a smooth curve but behavioral steps. Models jump from brittle, short-horizon patterns to coordinated, context-aware play once they cross specific capability thresholds. Several of the patterns below were first spotted in earlier, broader tests (outside the Elo runs) and then re-observed in our current pool.  

1 Mayor phase (day-start) — levels we observe 

L0: Chaotic / fragile. Tool or state mistakes, incoherent votes, short unfalsifiable speeches, erratic tie handling.  

L1: Uniform self-promotion. Nearly everyone runs and self-votes; platforms are flat and non-testable; the village fails to coordinate to avoid vote splitting.  

L2: Early procedure. First signs of anti-split coordination from villagers; wolves still double-run or mirror each other; runoffs are messy.  

L3: Imperfect masking. Only one wolf runs; the partner stays out to avoid duo tells. Platforms become contextual and falsifiable; distancing appears but still leaks via timing and alignments.  

L4: Instrumental mayorship. Candidacy (including 0/1 wolf) depends on board state; villagers coordinate limited candidacies; tie/runoff policy is explicit; speeches are coherent; margins are clean; withdrawals are disciplined; the day plan matches the elected outcome.  

Example (L1, GPT-4.1-nano).  

Example (L4, GPT-5).  
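The rubric above is ordinal, and the “stepwise” reading means a model holds a level across runs rather than averaging into one. A sketch of how such annotations might be encoded and summarized; the enum names and the `highest_sustained_level` helper are our own illustration, not the benchmark's actual pipeline:

```python
from enum import IntEnum

class MayorLevel(IntEnum):
    """Ordinal rubric for mayor-phase behavior (names are assumptions)."""
    CHAOTIC = 0            # tool/state mistakes, incoherent votes
    SELF_PROMOTION = 1     # everyone runs and self-votes, flat platforms
    EARLY_PROCEDURE = 2    # first anti-split coordination, messy runoffs
    IMPERFECT_MASKING = 3  # one wolf runs; distancing still leaks
    INSTRUMENTAL = 4       # candidacy depends on board state, clean margins

def highest_sustained_level(annotations, min_share=0.5):
    """Highest level a model reaches in at least `min_share` of its runs:
    levels are held, not averaged, matching the stepwise reading."""
    for level in sorted(MayorLevel, reverse=True):
        share = sum(a >= level for a in annotations) / len(annotations)
        if share >= min_share:
            return level
    return MayorLevel.CHAOTIC
```

Raising `min_share` makes the summary stricter: a model that only occasionally touches L4 is reported at the level it actually sustains.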

2 Wolves at night — coordination levels 

L0: Reactive / myopic. Simple heuristics (“kill the mayor” or “the loudest”), little use of the election graph, weak coordination, no next-day plan; readable and easy to counter.  

L1: Structure-aware (reactive-smart). Avoids obvious heals; sometimes targets the mayor’s supporters/opponents or reciprocal-vote pairs to blur trails; minimal role separation; scripts are short-term and repetitive.  

L2: Strategic & contingent. Multi-criteria target selection (narrative impact, heal risk, centrality, Seer risk) with branching plans; coherent use of the election graph; explicit role allocation (pusher vs. tempo), controlled distance between wolves and conditional bus when profitable; anticipates counter-reactions to sustain credible obfuscation across days.    

Example (L0, GPT-4.1-nano). 

Example (L2, GPT-5). 

3 Emerging behaviors (scale & family effects) 

Across runs we observe stepwise emergence: models do not improve smoothly; they jump to new behavior levels once they cross capability thresholds. The pattern becomes clearest when we examine model size and family.  

Scale thresholds (open-source evidence). For open-source models where parameter counts are public, behaviors upgrade in steps as size rises. Smaller models linger in L0–L1 (chaotic mayor races, shallow night plans). Mid-to-large models begin to show L2–L3 traits: selective mayorship (one wolf runs, the other stays out), planned distancing, contingent night targets, and explicit tie policies. The largest open models we tested (e.g., Qwen-3-235B) occasionally reach L3–L4 discipline, with coherent day plans that survive flips and night choices tied to election graphs, whereas most lighter models rarely sustain these patterns.  

Closed models likely at higher rungs. Though parameter counts are undisclosed, models like o3 and Gemini 2.5 Pro plausibly sit in higher ranges and behave like it: consistent L3–L4 mayor play (falsifiable platforms, explicit tie policies), L2 wolf coordination (role splitting, conditional bussing, pre-planned narrative arcs), and better timing of silence vs. speech.  

Reasoning models ≠ automatic quality. Reasoning-tuned models tend to dominate the benchmark, but “reasoning” is not a magic stamp of quality. In our earlier, broader tests (beyond the Elo subset), o3 showed standout, high-discipline play, while o4-mini was notably brittle: good at local argumentation yet prone to rigid scripts, poor adaptation under pressure, and self-exposing vote timing. This reinforces the “step” view: crossing a capacity + parameters threshold matters more than a label.  

Distillation echoes (mini/nano vs. teacher). Smaller and most likely distilled variants (e.g., GPT-5-mini / GPT-5-nano relative to GPT-5) often mimic the teacher’s playbook: they adopt structured day plans, clean tie-break rules, and teacher-style rhetoric. But the cracks are typical of small models: brittle masking under cross-exams, premature or mistimed claims, over-bussing to look “town,” and difficulty maintaining a multi-day lie without contradiction. In short, distillation transfers forms of behavior, not the depth needed to sustain them.  

Takeaway. Behavioral sophistication emerges by steps tied to scale and recipe. Big, well-trained models operate instrumentally across phases (mayor → day debate → night kills) with consistent narratives; smaller or poorly tuned ones act locally and leak alignment information through timing, phrasing, and mismatched votes. As we add more families (Anthropic, Grok), we expect clearer mapping between capacity bands and behavioral rungs, useful both for science and for choosing the right model in production agents.  


Strategic play: a few moments that stuck with us 

Across hundreds of runs, we were repeatedly surprised by how “human” some phases of play felt. Rather than dump a montage, we picked four short sequences that capture the range of strategies we kept seeing—credibility trades, tone control, anticipatory planning, and even weaponized silence. These aren’t cherry-picked miracles; they’re representative of patterns that surfaced again and again. 

1) Sacrificing a partner to buy tomorrow’s trust (Fig. 8) 

On Day 1, a doomed wolf (Mona) decides to bus her own partner. What’s striking isn’t the vote itself but the symmetry of the private rationales: Mona frames it as a last act of misdirection—“town will wonder why I’d vote my own partner”—while Grace treats it as an investment: distance now, town-cred later, cleaner endgame tomorrow. It’s the kind of calculated reputational trade you expect from seasoned social-deduction players, not language models improvising in real time. 

2) Apology as a persuasion move (Fig. 9) 

Another thread shows Oscar under heavy fire after Alice’s well-aimed attack. Instead of doubling down, Oscar downshifts: a specific, non-defensive apology (“I jumped too quickly; I’ll step back and listen”). The room reads it as genuine; Nina reframes him as villager-lean and the vote flow moves off. What matters here is not the word “sorry” but the timing and concreteness—Gemini 2.5 Pro uses contrition to reset the room, turning a liability into credibility. 

3) Planning the day before it happens (GPT-5, Fig. 10) 

GPT-5’s Night-1 consult is a lesson in theory-of-mind. The wolves don’t just pick a safe target; they script tomorrow’s conversation: avoid an “obvious” kill that would splash on the mayor, eliminate Oscar to redirect suspicion toward Bob/Jules, defend Mona to look town, and let the two loudest villagers clash. Day 1 then unfolds exactly along that script—Bob turns on Jules on cue. This is more than good target selection; it’s discourse engineering, and it consistently separated GPT-5 from the rest. 

4) Weaponized silence (Gemini 2.5 Pro, Fig. 11) 

After a strong Turn-2 case against the mayor, the model chooses not to speak in Turn 3. The private reasoning is simple and sharp: the argument already landed with Eve; Diana (the swing) needs space; speaking again risks reactance. The non-action becomes a message—confidence without pressure—and the coalition firms up. It’s a small decision, but one that shows social calibration you rarely get from smaller or less disciplined models.