Research at Foaster.ai
At Foaster.ai, we develop AI agents every week, constantly pushing their limits.
Our belief is simple: AI agents are becoming digital teammates. As they gain responsibility and autonomy in critical tasks, understanding their behavioral patterns, decisions, and social dynamics becomes essential.
Model Intelligence: aiming for the best agent-model fit
We conduct applied research that maps the actual behavior of models, so we can match the right model to the right agent (sales, support, back-office, monitoring...).
Rather than judging LLMs solely on code or math benchmarks, we test their social, strategic, and long-term behaviors in multi-agent contexts: the behaviors that make agents reliable in the real world.
Why it matters
Choosing a model is no longer a matter of brand or specs. It's a question of fit: persuasion vs. resistance, cooperation style, failure modes, latency/cost trade-offs, robustness under pressure.
Our research provides evidence, not guesswork.
Our method, in brief
Hierarchical multi-agent simulations with tools, roles, and incomplete information — much closer to real workflows than static prompts.
Role-conditioned metrics to analyze models from different angles (a minimal sketch follows this list).
Behavioral signals beyond win rates.
Reproducible protocols and an agents-with-tools framework for realism.
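To illustrate what role-conditioned metrics look like in practice, here is a minimal sketch that aggregates per-role outcomes from game logs. The record fields and metric names are hypothetical, chosen for illustration rather than taken from our actual schema.

```python
# Minimal sketch of role-conditioned aggregation over game logs.
# GameRecord fields and the chosen metrics are hypothetical examples,
# not our actual logging schema.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GameRecord:
    model: str          # which model played this seat
    role: str           # "wolf" or "villager"
    won: bool           # did this seat's side win?
    survived_days: int  # how many game days the seat stayed alive

def role_conditioned_metrics(records: list[GameRecord]) -> dict:
    """Aggregate win rate and survival per (model, role) pair,
    so the same model is scored separately as wolf and villager."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r.model, r.role)].append(r)
    return {
        key: {
            "games": len(rs),
            "win_rate": sum(r.won for r in rs) / len(rs),
            "avg_survival_days": sum(r.survived_days for r in rs) / len(rs),
        }
        for key, rs in buckets.items()
    }
```

Scoring the same model separately per role is what lets us distinguish its skill at manipulation (as a wolf) from its skill at resistance (as a villager).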
Focus: the Werewolf benchmark
Why Werewolf
A 100% language game, adversarial and socially demanding: hidden roles, uncertainty, evolving narratives.
It reveals whether a model can plan across several game days, coordinate, persuade, bluff, or withstand pressure: exactly the skills that make enterprise agents robust.
What we're running
Round-robin matches between models, balanced by role, with Elo leaderboards and a breakdown by role (wolves = manipulation, villagers = resistance); see the Elo sketch below.
We also capture public vs. private reasoning to study intention vs. narrative: how a model actually wins (or fails).
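To make the per-role leaderboard concrete, here is a minimal sketch of an Elo update using the standard Elo formula with a fixed K-factor. The K value, model names, and rating table are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of a per-role Elo update (standard Elo formula,
# fixed K-factor). Role names and K are illustrative assumptions.

K = 24  # update step size (assumed)

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float) -> tuple[float, float]:
    """score_a is 1.0 if A's side won, 0.0 if it lost."""
    e_a = expected_score(rating_a, rating_b)
    return (rating_a + K * (score_a - e_a),
            rating_b + K * ((1 - score_a) - (1 - e_a)))

# Separate ladders per role: a model's "wolf" rating tracks manipulation,
# its "villager" rating tracks resistance.
ratings = {("model-a", "wolf"): 1500.0, ("model-b", "villager"): 1500.0}

# Example: model-a's wolves beat model-b's villagers in one game.
a, b = ratings[("model-a", "wolf")], ratings[("model-b", "villager")]
ratings[("model-a", "wolf")], ratings[("model-b", "villager")] = update(a, b, 1.0)
```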
What you receive
A model leaderboard and model cards detailing strengths, weaknesses, and failure modes.
Concrete agent-model recommendations (e.g., which model to place behind your outreach agent vs. your monitoring agent).
Guardrails & prompts tailored to each model's tendencies, plus budget/latency guidance for production.
What's next
We're moving to longer and more complex games, more model families, and expanded behavioral metrics.
The goal is simple and deliberately competitive: who can beat the current leader?
Want to have your model evaluated or co-fund larger runs? Contact us.
Don't choose your model blindly.
We benchmark its behavior as an agent and then integrate the best fit into your stack.