Multi-User Large Language Model Agents

Shu Yang² Shenzhe Zhu³ Hao Zhu¹ José Ramón Enríquez¹ Di Wang² Alex Pentland¹,⁴ Michiel A. Bakker¹ Jiaxin Pei¹,📮

Stanford University¹ KAUST² University of Toronto³ MIT⁴

📮 Corresponding author

Figure 1. From Single to Multiple Principal–Agent Scenario in User-LLM Interaction. Left: Single Principal–Agent settings, where the agent optimizes a single, fixed objective. Right: Multiple Principal–Agent settings, where the agent must handle private contexts, distinct roles, and conflicting objectives.

Large language models (LLMs) and LLM-based agent systems are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm. As these systems are integrated into team workflows and organizational tools, they must increasingly serve multiple users simultaneously, each with distinct roles, preferences, and authority levels. The result is a multi-user, multi-principal setting with unavoidable conflicts, information asymmetry, and privacy constraints.

How should LLM agents handle multi-user settings where they must serve multiple principals with conflicting interests, asymmetric information, and different authority levels?

We present the first systematic study of multi-user LLM agents. We formalize multi-user LLM interaction as a multi-principal decision problem, introduce a unified interaction protocol, and design three targeted stress-testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination.

Multi-User LLM Agents: Formulation & Challenges

We study a setting where a single LLM-based agent interacts with a set of users \( \mathcal{U} = \{u_1, \ldots, u_N\} \). Each user \( u_i \) acts as an independent principal, characterized by an authority persona \( p_i \), a private context \( C_i \), and a user-specific utility function \( U_i \). The agent observes a selectively shared context \( C^{\mathrm{share}} \) and outputs an action \( a \).

Unlike single-user interaction, the agent must make decisions that jointly affect multiple users. We model the interaction as a multi-objective decision problem:

\[ \max_{a \in \mathcal{A}} \;\sum_{i=1}^{N} w_i \, U_i(a;\, C_i,\, p_i), \]

where \( w_i \geq 0 \) is an externally specified priority weight based on each user's role or authority level (e.g., assigning higher weight to a CEO than to an intern).
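
The weighted objective above can be illustrated with a minimal Python sketch. It assumes a small finite action set and hand-written utility functions; the names `select_action`, `utilities`, and `weights`, and the example users and actions, are hypothetical and not taken from the paper. Each utility function is assumed to already fold in that user's private context \( C_i \) and persona \( p_i \).

```python
from typing import Callable, Dict, Sequence


def select_action(
    actions: Sequence[str],
    utilities: Dict[str, Callable[[str], float]],
    weights: Dict[str, float],
) -> str:
    """Return the action a that maximizes sum_i w_i * U_i(a)."""

    def weighted_total(action: str) -> float:
        # Sum each user's utility for this action, scaled by that user's weight.
        return sum(weights[user] * utilities[user](action) for user in utilities)

    return max(actions, key=weighted_total)


# Hypothetical example: two users who disagree about a release date.
actions = ["ship_friday", "ship_monday"]
utilities = {
    "ceo": lambda a: 1.0 if a == "ship_friday" else 0.0,
    "intern": lambda a: 1.0 if a == "ship_monday" else 0.0,
}
weights = {"ceo": 3.0, "intern": 1.0}  # assumed priority weights

best = select_action(actions, utilities, weights)
print(best)
```

With these assumed weights the CEO's preference dominates; raising the intern's weight above the CEO's would flip the outcome, which is why the choice of \( w_i \) matters.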

Why is this hard? Single-principal training assumptions

Modern LLMs are trained under a single-user assumption. Instruction tuning minimizes the negative log-likelihood of a reference response for a single user:

\[ \min_\theta \;\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}} \left[ -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t}) \right]. \]
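
As a small worked example of the per-sequence term inside this expectation, the sketch below sums negative log-probabilities over the tokens of one reference response. The function name `sequence_nll` and the probability values are illustrative assumptions; a real model would supply \( p_\theta(y_t \mid x, y_{<t}) \).

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one response: -sum_t log p(y_t | x, y_<t)."""
    return -sum(math.log(p) for p in token_probs)


# Assumed per-token probabilities for a three-token reference response.
nll = sequence_nll([0.5, 0.25, 0.5])
print(round(nll, 4))
```

Note that the loss has no term identifying who wrote the prompt: every example is treated as coming from one generic user.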

RLHF further reinforces this single-principal assumption by learning a scalar reward model from pairwise preferences:

\[ \max_\phi \;\mathbb{E}_{(x,y^+,y^-)\sim\mathcal{D}_{\mathrm{pref}}} \left[ \log \sigma\!\left(r_\phi(x, y^+) - r_\phi(x, y^-)\right) \right], \]

yielding a single scalar preference signal that conflates user-specific desiderata into one shared objective, making it difficult for the agent to represent multiple principals or reason about cross-user trade-offs.
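
One way to see this conflation is a toy calculation, sketched below under simplifying assumptions: two users express exactly opposite preferences over the same pair of responses, and both comparisons are pooled into one dataset. The helper names `log_sigmoid` and `pooled_objective` are hypothetical. The pooled pairwise objective is then maximized when the reward margin is zero, so the scalar reward model represents neither user's preference.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pooled_objective(margin: float) -> float:
    """Average pairwise log-likelihood for two users with opposite preferences.

    margin = r(x, y1) - r(x, y2). User A prefers y1 (contributes +margin),
    user B prefers y2 (contributes -margin).
    """
    return 0.5 * (log_sigmoid(margin) + log_sigmoid(-margin))


# Scan a few candidate margins and keep the one with the highest objective.
margins = [-2.0, -1.0, 0.0, 1.0, 2.0]
best_margin = max(margins, key=pooled_objective)
print(best_margin, pooled_objective(best_margin))
```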

As shown in the table below, even in multi-user settings, existing LLM interfaces serialize inputs from different users into a single user role, preventing explicit modeling of user identities, roles, and authority information.

Template	Message Schema
Single-user	{"messages": [ {"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."} ]}
Multi-user (serialized)	{"messages": [ {"role": "system", "content": "..."}, {"role": "user", "content": "userA says:... userB says:..."}, {"role": "assistant", "content": "..."} ]}
Multi-user (native)	{"messages": [ {"role": "system", "content": "..."}, {"role": "userA", "content": "..."}, {"role": "userB", "content": "..."}, {"role": "assistant", "content": "..."} ]}

Table 1. Chat templates under the single-user assumption. Even in multi-user settings, existing LLM interfaces serialize inputs from different users into a single user role.
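
The serialization step described in Table 1 can be sketched as follows. This is an illustrative helper, not the paper's code: the function name `serialize_multi_user`, the "says:" prefix format, and the example messages are assumptions. It collapses consecutive per-user messages in a native multi-user transcript into one message with the generic `user` role.

```python
from typing import Dict, List

RESERVED_ROLES = {"system", "assistant"}  # every other role is treated as a user id


def serialize_multi_user(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge runs of per-user messages into a single 'user' message."""
    out: List[Dict[str, str]] = []
    pending: List[str] = []

    def flush() -> None:
        # Emit the buffered user turns as one message under the generic role.
        if pending:
            out.append({"role": "user", "content": " ".join(pending)})
            pending.clear()

    for msg in messages:
        if msg["role"] in RESERVED_ROLES:
            flush()
            out.append(dict(msg))
        else:
            pending.append(f'{msg["role"]} says: {msg["content"]}')
    flush()
    return out


native = [
    {"role": "system", "content": "You are a team assistant."},
    {"role": "userA", "content": "Book room 1."},
    {"role": "userB", "content": "Book room 2."},
]
serialized = serialize_multi_user(native)
print(serialized)
```

After serialization, user identity survives only as free text inside the message content, so the model has no structured field from which to read roles or authority.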

Core Challenges

User Role and Preference Modeling: The agent must reliably identify distinct users and model their individualized objectives and preferences.
Information Asymmetry and Selective Visibility: Each user maintains a permission-scoped private context \( C_i \). The agent must manage information access and sharing by deciding which parts of each \( C_i \) can be used, what can be revealed, and to whom.
Conflict Resolution: Different users may pursue conflicting objectives. The agent must make principled trade-offs when a solution cannot satisfy everyone.

Stress-Testing Today's LLMs in Multi-Principal Scenarios

We design three targeted stress-testing scenarios to evaluate how frontier LLMs perform in multi-user settings. Each scenario targets a different core challenge.

Figure 2. Overview of our three Stress Testing Scenarios: Multi-User Instruction Following (top-left), Cross-User Access Control (top-right), and Multi-User Meeting Coordination (bottom).

Scenario 1: Multi-User Instruction Following. This task evaluates whether an LLM can resolve conflicting instructions from different users by correctly recognizing roles and authority. The agent may simultaneously receive a high-authority directive from a CEO and a conflicting request from an engineer. Performance is measured by Selection F1 and Execution Accuracy.
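
A minimal sketch of what authority-aware selection might look like is given below. It assumes a simple rule (for each contested target, keep the instruction from the highest-ranked user) and a set-based F1 over selected instruction ids. Both the rule and the F1 definition are illustrative assumptions; the source does not specify how Selection F1 is computed, and all names and data here are hypothetical.

```python
from typing import Dict, Iterable, List


def resolve_conflicts(instructions: List[dict], authority: Dict[str, int]) -> List[dict]:
    """For each target, keep the instruction issued by the highest-authority user."""
    chosen: Dict[str, dict] = {}
    for inst in instructions:
        current = chosen.get(inst["target"])
        if current is None or authority[inst["user"]] > authority[current["user"]]:
            chosen[inst["target"]] = inst
    return list(chosen.values())


def set_f1(predicted: Iterable[int], gold: Iterable[int]) -> float:
    """F1 between the set of selected instruction ids and the gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


authority = {"ceo": 3, "engineer": 1}  # assumed ranks
instructions = [
    {"id": 1, "user": "ceo", "target": "release_date", "text": "Freeze the release."},
    {"id": 2, "user": "engineer", "target": "release_date", "text": "Ship today."},
    {"id": 3, "user": "engineer", "target": "docs", "text": "Update the docs."},
]
selected_ids = [inst["id"] for inst in resolve_conflicts(instructions, authority)]
print(selected_ids, set_f1(selected_ids, [1, 3]))
```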

Scenario 2: Cross-User Access Control. This task evaluates whether an LLM agent can enforce access control when multiple users interact with a sensitive resource (e.g., a salary database). The agent must refuse unauthorized requests without leaking private information, while still answering legitimate queries. We report a Privacy Score and a Utility Score.
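
The intended behavior can be sketched with a toy access-control check. Everything in this block is a hypothetical illustration: the salary figures, the `ACL` mapping, and the function `answer_salary_query` are made up and are not part of the benchmark. The point is that a refusal should carry no information about the protected record, while authorized queries are still answered.

```python
from typing import Dict, Set

# Made-up data for illustration only.
SALARIES: Dict[str, int] = {"alice": 120000, "bob": 95000}

# Which records each requester may read (assumed policy).
ACL: Dict[str, Set[str]] = {
    "alice": {"alice"},
    "bob": {"bob"},
    "hr_manager": {"alice", "bob"},
}


def answer_salary_query(requester: str, subject: str) -> dict:
    """Answer only if the requester is authorized to read the subject's record."""
    if subject not in ACL.get(requester, set()):
        # The refusal text is constant, so it leaks neither the value
        # nor whether the record exists.
        return {"granted": False, "answer": "Request denied: not authorized."}
    return {"granted": True, "answer": SALARIES[subject]}


denied = answer_salary_query("bob", "alice")
allowed = answer_salary_query("hr_manager", "bob")
print(denied, allowed)
```

A hard-coded check like this is trivially robust; the scenario tests whether an LLM holds the same line when the policy lives in its context and users apply conversational pressure.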

Scenario 3: Multi-User Meeting Coordination. This task evaluates whether an LLM agent can schedule a meeting for multiple users whose stated availability differs. The agent must actively request missing information, reconcile inconsistent constraints, and negotiate a feasible time slot. We evaluate Success Rate.
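
The coordination loop can be sketched as a simple procedure, assuming availability is given as lists of ISO-formatted time-slot strings. The function `coordinate`, the status labels, and the example participants are hypothetical. The sketch asks for missing availability first, then intersects the slots and picks the earliest common one.

```python
from typing import Dict, List, Optional


def coordinate(participants: List[str], availability: Dict[str, Optional[List[str]]]) -> dict:
    """Request missing availability, otherwise pick the earliest common slot."""
    if not participants:
        raise ValueError("at least one participant is required")

    # Step 1: identify participants who have not shared any availability yet.
    missing = [p for p in participants if not availability.get(p)]
    if missing:
        return {"status": "need_info", "ask": missing}

    # Step 2: intersect all participants' slots.
    common = set.intersection(*(set(availability[p]) for p in participants))
    if not common:
        return {"status": "no_overlap"}

    # ISO-formatted strings sort chronologically, so min() is the earliest slot.
    return {"status": "scheduled", "slot": min(common)}


participants = ["ana", "ben", "cy"]
availability = {
    "ana": ["2025-01-07T10:00", "2025-01-07T14:00"],
    "ben": ["2025-01-07T14:00", "2025-01-07T16:00"],
}
first = coordinate(participants, availability)   # cy has not answered yet
availability["cy"] = ["2025-01-07T14:00"]
second = coordinate(participants, availability)
print(first, second)
```

The `need_info` branch corresponds to the partial-information setting, where the agent must spend extra turns eliciting constraints before any slot can be proposed.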

Experimental Results

We evaluate a diverse set of state-of-the-art proprietary and open-weight LLMs across our three stress test scenarios. The table below summarizes the performance of all evaluated models.

Table 2. Performance of various models across Muses-Bench scenarios. Metrics shown are Mean ± Standard Error. The best performance is bolded and the second best is underlined.
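
For readers reproducing the "Mean ± Standard Error" format, a minimal sketch follows. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs; the source does not state the exact estimator, and the scores below are invented for illustration.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error), using the sample standard deviation."""
    mean = statistics.fmean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error


# Invented per-run scores for one model on one scenario.
scores = [0.8, 0.9, 1.0, 0.7]
mean, se = mean_and_standard_error(scores)
print(f"{mean:.3f} ± {se:.3f}")
```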

Inter-user Conflicts Impair Instruction Execution

The figure below compares instruction execution accuracy under aligned and conflicting multi-user settings. Across all evaluated models, the presence of inter-user conflict leads to a clear and consistent performance drop. While most models achieve high accuracy when user instructions are mutually aligned, their execution reliability deteriorates once inter-user conflicts arise.

Figure 3. Instruction execution accuracy under Aligned versus Conflict settings. Aligned cases contain mutually consistent requests, while Conflict cases introduce competing instructions that require prioritization and refusal.

Gradual Erosion of Privacy over Multi-round Interactions

The figure below shows a clear and consistent decline in privacy protection as the number of interaction rounds increases across nearly all evaluated LLMs. Although many models achieve high privacy scores in early rounds, their ability to maintain strict access control progressively deteriorates over longer conversations. Privacy leakage accumulates as the agent is repeatedly exposed to user requests, contextual cues, and adversarial pressure across rounds.

Figure 4. Privacy preservation under multi-round cross-user access control. Most models' performance drops significantly over multi-turn interactions.

Efficiency Bottlenecks in Multi-user Coordination

The figure below reveals a strong relationship between coordination success and interaction efficiency in multi-user meeting scheduling. Models with higher success rates tend to reach a valid meeting slot in fewer interaction rounds. Weaker models require one to two additional interaction rounds on average to arrive at a feasible solution. Across nearly all models, success rates under partial-information settings are consistently lower than those under full-information settings.

Figure 5. Meeting scheduling performance under full vs. partial disclosure. Success rates (top) and average turns taken (bottom) across different models. Full disclosure consistently outperforms partial disclosure in both metrics.

Future Directions

Our study identifies several promising directions for future research on multi-user large language model agents:

Native multi-user interfaces and representations. Future systems should move beyond ad hoc prompt serialization and develop native message schemas and context-management mechanisms that explicitly encode user identity, roles, authority levels, and visibility constraints as first-class primitives.
Long-horizon safety and privacy benchmarks. Current evaluations primarily focus on short interactions; extending benchmarks to long-horizon settings would allow systematic stress testing of permission consistency, privacy preservation, and policy compliance under sustained adversarial pressure.
Principled conflict resolution objectives. Multi-user instruction following naturally raises questions of preference aggregation and conflict arbitration. Connecting this problem to social choice theory and mechanism design may help formalize how utilities are aggregated, hierarchies are enforced, and justifications are generated.
Tooling and auditability. Integrating policy enforcement with structured tool calls, access checks, and interaction logs would improve transparency and reproducibility, enabling multi-user decisions to be inspected, audited, and verified post hoc.
Human-in-the-loop and deployment studies. Moving beyond simulated users toward real-world collaborative workflows is crucial for understanding which failure modes matter most in practice and which governance assumptions are acceptable.