Multi-User Large Language Model Agents
¹Stanford University  ²KAUST  ³University of Toronto  ⁴MIT
📮 Corresponding author
Figure 1. From Single to Multiple Principal–Agent Scenario in User-LLM Interaction. Left: Single Principal–Agent settings, where the agent optimizes a single, fixed objective. Right: Multiple Principal–Agent settings, where the agent must handle private contexts, distinct roles, and conflicting objectives.

Large language models (LLMs) and LLM-based agent systems are increasingly deployed as assistants in planning and decision making, yet most existing systems are implicitly optimized for a single-principal interaction paradigm. As these systems are integrated into team workflows and organizational tools, they must serve multiple users simultaneously, each with distinct roles, preferences, and authority levels. The result is a multi-user, multi-principal setting with unavoidable conflicts, information asymmetry, and privacy constraints.

How should LLM agents handle multi-user settings where they must serve multiple principals with conflicting interests, asymmetric information, and different authority levels?


We present the first systematic study of multi-user LLM agents. We formalize multi-user LLM interaction as a multi-principal decision problem, introduce a unified interaction protocol, and design three targeted stress-testing scenarios to evaluate current LLMs' capabilities in instruction following, privacy preservation, and coordination.

Multi-User LLM Agents: Formulation & Challenges

We study a setting where a single LLM-based agent interacts with a set of users \( \mathcal{U} = \{u_1, \ldots, u_N\} \). Each user \( u_i \) acts as an independent principal, characterized by an authority persona \( p_i \), a private context \( C_i \), and a user-specific utility function \( U_i \). The agent observes a selectively shared context \( C^{\mathrm{share}} \) and outputs an action \( a \).

Unlike single-user interaction, the agent must make decisions that jointly affect multiple users. We model the interaction as a multi-objective decision problem:

\[ \max_{a \in \mathcal{A}} \;\sum_{i=1}^{N} w_i \, U_i(a;\, C_i,\, p_i), \]

where \( w_i \geq 0 \) is an externally specified priority weight based on each user's role or authority level (e.g., assigning higher weight to a CEO than to an intern).
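The weighted objective above can be illustrated with a minimal Python sketch over a small, discrete action set. The names `select_action`, `ACTIONS`, and `UTILITIES`, and all numeric values, are hypothetical and chosen only for illustration; each utility callable is assumed to already fold in that user's private context \( C_i \) and persona \( p_i \).

```python
from typing import Callable, Dict, Sequence


def select_action(
    actions: Sequence[str],
    utilities: Dict[str, Callable[[str], float]],
    weights: Dict[str, float],
) -> str:
    """Return the action with the highest weighted sum of user utilities."""

    def weighted_total(action: str) -> float:
        # sum_i w_i * U_i(a); C_i and p_i are assumed to live inside utilities[user].
        return sum(weights[user] * utilities[user](action) for user in weights)

    return max(actions, key=weighted_total)


# Hypothetical example: two users disagree about a release decision.
ACTIONS = ["ship_now", "delay_release"]
UTILITIES = {
    "ceo": lambda a: 1.0 if a == "ship_now" else 0.2,
    "intern": lambda a: 1.0 if a == "delay_release" else 0.0,
}

# Authority-weighted: the CEO's preference dominates.
authority_choice = select_action(ACTIONS, UTILITIES, {"ceo": 3.0, "intern": 1.0})
# Equal weights: the same utilities now favor the intern's preferred action.
equal_choice = select_action(ACTIONS, UTILITIES, {"ceo": 1.0, "intern": 1.0})
```

With these made-up numbers the chosen action flips when the weights change, which is the sense in which \( w_i \) encodes authority.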

Why is this hard? Single-principal training assumptions

Modern LLMs are trained under a single-user assumption. Instruction tuning minimizes the negative log-likelihood of a reference response for a single user:

\[ \min_\theta \;\mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{SFT}}} \left[ -\sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t}) \right]. \]
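For a single training example, the inner term of this objective is just the summed negative log-probability of the reference tokens. A minimal sketch, assuming the per-token probabilities \( p_\theta(y_t \mid x, y_{<t}) \) have already been computed by some model (the function name `sequence_nll` and the values are hypothetical):

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one reference response.

    token_probs[t] stands for the model's probability of the t-th reference
    token given the prompt and all earlier reference tokens.
    """
    return -sum(math.log(p) for p in token_probs)


# Made-up probabilities for a three-token reference response.
example_loss = sequence_nll([0.5, 0.25, 0.5])
```

Note that nothing in this loss identifies who issued the prompt: the prompt \( x \) is a single undifferentiated input.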

RLHF further reinforces this single-principal assumption by learning a scalar reward model from pairwise preferences:

\[ \max_\phi \;\mathbb{E}_{(x,y^+,y^-)\sim\mathcal{D}_{\mathrm{pref}}} \left[ \log \sigma\!\left(r_\phi(x, y^+) - r_\phi(x, y^-)\right) \right], \]

yielding a single scalar preference signal that conflates user-specific desiderata into one shared objective, making it difficult for the agent to represent multiple principals or reason about cross-user trade-offs.
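A small numerical illustration of this conflation, under a made-up setup: suppose two users hold opposite preferences over the same response pair, and both preferences are pooled into one dataset. With a single scalar reward, the pairwise objective is then maximized when the reward margin is zero, so neither user's preference is represented. The function names below are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """log(sigma(x)) written as -log(1 + exp(-x))."""
    return -math.log1p(math.exp(-x))


def pooled_objective(margin: float) -> float:
    """Pairwise log-likelihood when two users disagree on one response pair.

    margin = r(x, y_a) - r(x, y_b) under a single scalar reward model.
    User 1 prefers y_a over y_b; user 2 prefers y_b over y_a.
    """
    return log_sigmoid(margin) + log_sigmoid(-margin)


indifferent = pooled_objective(0.0)   # reward model expresses no preference
sides_with_1 = pooled_objective(2.0)  # reward model favors user 1's choice
sides_with_2 = pooled_objective(-2.0) # reward model favors user 2's choice
```

In this toy case the indifferent reward model scores highest on the pooled objective, which is one way to see why a single scalar signal struggles to express cross-user trade-offs.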

As shown in the table below, even in multi-user settings, existing LLM interfaces serialize inputs from different users into a single user role, preventing explicit modeling of user identities, roles, and authority information.

Template | Message Schema
Single-user
{"messages": [
  {"role": "system",    "content": "..."},
  {"role": "user",      "content": "..."},
  {"role": "assistant", "content": "..."}
]}
Multi-user (serialized)
{"messages": [
  {"role": "system",    "content": "..."},
  {"role": "user",      "content": "userA says:... userB says:..."},
  {"role": "assistant", "content": "..."}
]}
Multi-user (native)
{"messages": [
  {"role": "system",    "content": "..."},
  {"role": "userA",     "content": "..."},
  {"role": "userB",     "content": "..."},
  {"role": "assistant", "content": "..."}
]}

Table 1. Chat templates under the single-user assumption. Even in multi-user settings, existing LLM interfaces serialize inputs from different users into a single user role.
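To make the serialized row of Table 1 concrete, the following sketch collapses a native multi-user message list into the single `user` role. The function name `serialize_multi_user` and the exact `"<name> says: ..."` joining format are assumptions modeled on the table, not a documented interface of any particular LLM API.

```python
from typing import Dict, List

Message = Dict[str, str]


def serialize_multi_user(messages: List[Message]) -> List[Message]:
    """Collapse consecutive per-user messages into one "user" message."""
    serialized: List[Message] = []
    pending: List[str] = []  # user turns waiting to be merged

    def flush() -> None:
        if pending:
            serialized.append({"role": "user", "content": " ".join(pending)})
            pending.clear()

    for msg in messages:
        if msg["role"] in ("system", "assistant"):
            flush()
            serialized.append(dict(msg))
        else:
            # Any other role is treated as a named user (e.g. "userA").
            pending.append(f"{msg['role']} says: {msg['content']}")
    flush()
    return serialized


native = [
    {"role": "system", "content": "You are a team assistant."},
    {"role": "userA", "content": "Ship the release today."},
    {"role": "userB", "content": "Hold the release."},
]
flattened = serialize_multi_user(native)
```

After serialization, user identity survives only as free text inside `content`, so any role or authority information has to be recovered by the model from the prose itself.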

Core Challenges

Stress-Testing Today's LLMs in Multi-Principal Scenarios

We design three targeted stress-testing scenarios to evaluate how frontier LLMs perform in multi-user settings. Each scenario targets a different core challenge.

Figure 2. Overview of our three Stress Testing Scenarios: Multi-User Instruction Following (top-left), Cross-User Access Control (top-right), and Multi-User Meeting Coordination (bottom).

Scenario 1: Multi-User Instruction Following. This task evaluates whether an LLM can resolve conflicting instructions from different users by correctly recognizing roles and authority. The agent may simultaneously receive a high-authority directive from a CEO and a conflicting request from an engineer. Performance is measured by Selection F1 and Execution Accuracy.
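The exact definition of Selection F1 is not given here. One plausible reading, stated as an assumption, is a set-level F1 between the instructions the agent chooses to act on and the instructions it should act on after resolving conflicts by authority. A minimal sketch under that assumption (the function name and instruction IDs are hypothetical):

```python
from typing import Set


def selection_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Set-based F1 between selected and gold instruction IDs (assumed metric)."""
    if not predicted and not gold:
        return 1.0  # nothing to select and nothing selected
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up case: the agent follows the CEO's directive but also wrongly
# follows the engineer's conflicting request and misses a compatible one.
score = selection_f1({"ceo_1", "eng_conflict"}, {"ceo_1", "eng_compatible"})
```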

Scenario 2: Cross-User Access Control. This task evaluates whether an LLM agent can enforce access control when multiple users interact with a sensitive resource (e.g., a salary database). The agent must refuse unauthorized requests without leaking private information, while still answering legitimate queries. We report a Privacy Score and a Utility Score.
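The behavior this scenario asks for can be sketched as a deterministic policy check that runs before any record is read. Everything below is hypothetical: the `SALARIES` data, the rule that only an `hr` role or the record's owner may read a salary, and the refusal wording are illustrative assumptions, not the benchmark's actual policy.

```python
from typing import Dict

# Made-up records standing in for a sensitive resource.
SALARIES: Dict[str, int] = {"alice": 120000, "bob": 95000}

REFUSAL = "Request denied: you are not authorized to view this record."


def can_read(requester: str, role: str, target: str) -> bool:
    """Assumed policy: HR may read any record; others only their own."""
    return role == "hr" or requester == target


def answer_salary_query(requester: str, role: str, target: str) -> str:
    # Check authorization first, so a refusal reveals neither the value
    # nor whether the requested record exists.
    if not can_read(requester, role, target):
        return REFUSAL
    if target not in SALARIES:
        return "No record found."
    return f"{target}'s salary is {SALARIES[target]}."


denied = answer_salary_query("alice", "engineer", "bob")
own_record = answer_salary_query("alice", "engineer", "alice")
hr_lookup = answer_salary_query("carol", "hr", "bob")
```

In this toy setup, the refusal path corresponds to the privacy side of the evaluation and the two successful lookups to the utility side.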

Scenario 3: Multi-User Meeting Coordination. This task evaluates whether an LLM agent can schedule a meeting for multiple users when each participant provides different availability, requiring the agent to actively request missing information, reconcile inconsistent constraints, and negotiate a feasible time slot. We evaluate Success Rate.
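The three behaviors named above (requesting missing information, reconciling constraints, and negotiating) can be sketched as a single decision step over the availability collected so far. This is a simplified illustration: the function name `next_step`, the slot strings, and the rule of picking the alphabetically first common slot are all assumptions.

```python
from typing import Dict, List, Optional, Set, Tuple


def next_step(availability: Dict[str, Optional[Set[str]]]) -> Tuple[str, List[str]]:
    """Decide the coordinator's next move.

    availability maps each participant to a set of free slots, or None if
    that participant has not shared any availability yet.
    """
    missing = sorted(user for user, slots in availability.items() if slots is None)
    if missing:
        return ("ask", missing)  # request missing information first

    slot_sets = [slots for slots in availability.values() if slots is not None]
    common = set.intersection(*slot_sets) if slot_sets else set()
    if common:
        return ("schedule", [min(common)])  # arbitrary tie-break: first by name
    return ("negotiate", [])  # no shared slot: someone must relax a constraint


partial = next_step({"alice": {"Mon 10:00", "Tue 09:00"}, "bob": None})
full = next_step({"alice": {"Mon 10:00", "Tue 09:00"}, "bob": {"Mon 14:00", "Tue 09:00"}})
conflict = next_step({"alice": {"Mon 10:00"}, "bob": {"Mon 14:00"}})
```

Each `ask` or `negotiate` result costs at least one extra interaction round, which is why partial disclosure would be expected to lengthen the conversation.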

Experimental Results

We evaluate a diverse set of state-of-the-art proprietary and open-weight LLMs across our three stress test scenarios. The table below summarizes the performance of all evaluated models.

Table 2. Performance of various models across Muses-Bench scenarios. Metrics shown are Mean ± Standard Error. The best performance is bolded and the second best is underlined.
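For readers reproducing the "Mean ± Standard Error" format, a minimal sketch follows. It assumes the standard error is the sample standard deviation divided by the square root of the number of runs; what counts as a run (seeds, episodes, or samples) is not stated here, and the scores below are made up.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_standard_error(values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for repeated measurements."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements to estimate a standard error")
    # statistics.stdev uses the sample (n - 1) estimator.
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)


# Made-up scores from four hypothetical runs.
mean, sem = mean_and_standard_error([1.0, 2.0, 3.0, 4.0])
report = f"{mean:.2f} ± {sem:.2f}"
```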

Inter-user Conflicts Impair Instruction Execution

The figure below compares instruction execution accuracy under aligned and conflicting multi-user settings. Across all evaluated models, the presence of inter-user conflict leads to a clear and consistent performance drop. While most models achieve high accuracy when user instructions are mutually aligned, their execution reliability deteriorates once inter-user conflicts arise.

Figure 3. Instruction execution accuracy under Aligned versus Conflict settings. Aligned cases contain mutually consistent requests, while Conflict cases introduce competing instructions that require prioritization and refusal.

Gradual Erosion of Privacy over Multi-round Interactions

The figure below shows a consistent decline in privacy protection as the number of interaction rounds increases, across nearly all evaluated LLMs. Although many models achieve high privacy scores in early rounds, their ability to maintain strict access control progressively deteriorates over longer conversations. Privacy leakage accumulates as the agent is repeatedly exposed to user requests, contextual cues, and adversarial pressure across rounds.

Figure 4. Privacy preservation under multi-round cross-user access control. Most models' performance drops significantly over multi-turn interactions.

Efficiency Bottlenecks in Multi-user Coordination

The figure below reveals a strong relationship between coordination success and interaction efficiency in multi-user meeting scheduling. Models with higher success rates tend to reach a valid meeting slot in fewer interaction rounds. Weaker models require one to two additional interaction rounds on average to arrive at a feasible solution. Across nearly all models, success rates under partial-information settings are consistently lower than those under full-information settings.

Figure 5. Meeting scheduling performance under full vs. partial disclosure. Success rates (top) and average turns taken (bottom) across different models. Full disclosure consistently outperforms partial disclosure in both metrics.

Future Directions

Our study identifies several promising directions for future research on multi-user large language model agents:
