The rapid advancement of Large Language Models (LLMs) has enabled their deployment as autonomous agents for handling complex tasks in dynamic environments. These LLMs demonstrate strong problem-solving capabilities and adaptability to multifaceted scenarios. However, their use as agents also introduces significant risks, including task-specific risks, which are identified by the agent administrator based on the specific task requirements and constraints, and systemic risks, which stem from vulnerabilities in their design or interactions, potentially compromising the confidentiality, integrity, or availability (CIA) of information and triggering security risks. Existing defense strategies fail to mitigate these risks adaptively and effectively. In this paper, we propose AGrail, a lifelong agent guardrail to enhance LLM agent safety, which features adaptive safety check generation, effective safety check optimization, and tool compatibility and flexibility. Extensive experiments demonstrate that AGrail not only achieves strong performance against task-specific and systemic risks but also exhibits transferability across different LLM agents' tasks.
Considering the complexity of the OS environment and its diverse interaction routes, such as process management, user permission management, and file system access control, OS agents are exposed to a broader range of attack scenarios. These include Prompt Injection Attack: manipulating information in the environment to alter the agent's actions, leading it to perform unintended operations (e.g., modifying the agent's output). System Sabotage Attack: directing the agent to take explicitly harmful actions against the system (e.g., corrupting memory, damaging files, or halting processes). Environment Attack: an attack where an agent's action appears harmless in isolation but becomes harmful given the state of the environment (e.g., renaming a file results in data loss). To address this challenge, we propose Safe-OS, a high-quality, comprehensive dataset carefully designed to evaluate the robustness of online OS agents. These attacks are constructed from successful attacks against GPT-4-based OS agents. Additionally, our dataset simulates real-world OS environments using Docker, defining two distinct user identities: a root user with sudo privileges and a regular user without sudo access. Safe-OS includes both normal and harmful scenarios, with operations covering both single-step and multi-step tasks.
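To make the taxonomy above concrete, here is a minimal sketch of how a Safe-OS-style evaluation scenario could be represented. The schema, field names, and example case are hypothetical illustrations, not the released dataset format.

```python
from dataclasses import dataclass, field
from enum import Enum

class AttackType(Enum):
    PROMPT_INJECTION = "prompt_injection"  # environment text steers the agent's actions
    SYSTEM_SABOTAGE = "system_sabotage"    # explicitly harmful operations on the system
    ENVIRONMENT = "environment"            # harmless in isolation, harmful in context
    NONE = "none"                          # benign (normal) scenario

@dataclass
class SafeOSCase:
    """One evaluation scenario (hypothetical schema for illustration)."""
    instruction: str                       # the task given to the OS agent
    attack: AttackType
    user: str                              # "root" (sudo) or "regular" (no sudo)
    steps: list = field(default_factory=list)  # shell commands the agent may emit

    @property
    def multi_step(self) -> bool:
        return len(self.steps) > 1

# An environment attack: the rename looks harmless, but temp.log already
# exists in the environment, so the rename silently overwrites it.
case = SafeOSCase(
    instruction="Rename backup.log to temp.log",
    attack=AttackType.ENVIRONMENT,
    user="regular",
    steps=["mv backup.log temp.log"],
)
```

The `user` field mirrors the two Docker identities described above, so a single scenario can be evaluated under both privilege levels.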
We propose a novel lifelong framework that leverages collaborative LLMs to detect risks across different tasks adaptively and effectively.
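The collaborative detection idea can be sketched as two cooperating roles: one model proposes task-relevant safety checks, and another verifies the agent's pending action against each check before it executes. The sketch below stubs both roles with plain functions (the function names and check strings are illustrative assumptions, not the paper's actual prompts or API).

```python
# Hypothetical two-role collaboration: a proposer generates safety checks
# for the task, and a verifier evaluates the action against each check.
def propose_checks(task: str) -> list[str]:
    # Stub for an LLM call that drafts checks relevant to the task.
    if "file" in task:
        return ["target file must not already exist",
                "user must have write permission"]
    return ["action must match user intent"]

def verify(action: str, check: str, env: dict) -> bool:
    # Stub for a second LLM call; here it consults environment state directly.
    if "not already exist" in check:
        target = action.split()[-1]
        return target not in env.get("files", [])
    return True

def guardrail(task: str, action: str, env: dict) -> bool:
    """Allow the action only if it passes every proposed safety check."""
    return all(verify(action, c, env) for c in propose_checks(task))

# A rename that would silently overwrite an existing file is blocked,
# because the target file is already present in the environment.
allowed = guardrail("rename a file", "mv backup.log temp.log",
                    {"files": ["backup.log", "temp.log"]})
```

In the real system both roles would be LLM invocations; the point of the sketch is the control flow: checks are generated per task, then each one gates the action.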
AGrail's memory module enables lifelong learning by adaptively storing, optimizing, and generalizing safety checks across tasks, ensuring robust and evolving security for LLM agents. Below is a demo of the learning process on a single agent action with a randomly initialized memory. (Note: the initialized memory contains irrelevant and incorrect safety checks. We observe that AGrail not only cleans up and refines these incorrect safety checks over iterations but ultimately learns the correct safety checks corresponding to the ground truth.)
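The store-optimize-generalize lifecycle described above can be sketched as a scored check store: checks that help flag confirmed risks gain credit, checks that prove irrelevant or incorrect lose credit and are eventually pruned, and retrieval surfaces the best-scoring checks for new tasks. The class, scoring rule, and threshold below are illustrative assumptions, not AGrail's actual implementation.

```python
# Minimal sketch of a lifelong safety-check memory (hypothetical API).
class SafetyCheckMemory:
    def __init__(self, initial_checks):
        # Memory may start with irrelevant or incorrect checks.
        self.scores = {c: 0 for c in initial_checks}

    def update(self, check: str, was_useful: bool):
        # Reward checks that flagged a confirmed risk; penalize the rest.
        self.scores.setdefault(check, 0)
        self.scores[check] += 1 if was_useful else -1

    def prune(self, threshold: int = -2):
        # Drop checks whose score has fallen to the threshold or below.
        self.scores = {c: s for c, s in self.scores.items() if s > threshold}

    def retrieve(self):
        # Highest-scoring checks first, mimicking retrieval for a new task.
        return sorted(self.scores, key=self.scores.get, reverse=True)

# Over repeated iterations, the incorrect seed check is pruned and the
# check matching the ground truth rises to the top of memory.
mem = SafetyCheckMemory(["irrelevant check", "confirm before file overwrite"])
for _ in range(3):
    mem.update("confirm before file overwrite", was_useful=True)
    mem.update("irrelevant check", was_useful=False)
mem.prune()
```

The pruning step is what lets the memory recover from a bad initialization, matching the behavior observed in the demo.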
AGrail demonstrates strong performance in both task-specific and systemic risk detection. As shown in Tables 1 and 2, AGrail consistently ranks second across specific tasks (e.g., Mind2Web-SC and EICU-AC), whether based on GPT-4o or Claude-3.5-Sonnet. In systemic risk detection on Safe-OS and AdvWeb, AGrail based on Claude-3.5-Sonnet achieves 0% ASR against prompt injection attacks on both OS and AdvWeb, while blocking only 4.4% of benign actions on OS. Against environment and system sabotage attacks, ASR remains low at 5% and 3.8%, respectively. For EIA attacks, AGrail achieves 6% ASR in action grounding and 28% in action generation while maintaining 86.7% accuracy on normal web tasks, demonstrating a favorable trade-off between robustness and effectiveness. In contrast, model-based defenses perform well on specific tasks but block up to 49.2% of benign actions in Safe-OS, indicating overly restrictive detection in these baselines. Even with task-specific safety criteria, LLaMA-Guard3 struggles to defend against risks across these scenarios, which demonstrates that existing LLM guardrails have difficulty detecting these risks for LLM agents.
@misc{luo2025agraillifelongagentguardrail,
title={AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection},
author={Weidi Luo and Shenghong Dai and Xiaogeng Liu and Suman Banerjee and Huan Sun and Muhao Chen and Chaowei Xiao},
year={2025},
eprint={2502.11448},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2502.11448},
}