OpenClaw Research: A Systematic Survey of Large Language Model Agents in Open Deployment

This survey distinguishes "sandboxed agent experiments" from "open deployments that run continuously in the real world and expand through community contributions". It formalizes open deployment with the tuple 𝒜 = ⟨ π, env, pop, substrate ⟩ and organizes the literature and discussion around five dimensions: Learning & Evolving, Safety & Security, Collective & Society, Infrastructure & Systems, and Applications, providing a roadmap toward a trustworthy and sustainable agent ecosystem.

Overview

Below we first provide the paper abstract, then organize the page along the main line "problem setup → formalization → taxonomy → evidence → challenges". For extended reading, see Awesome-OpenClaw-Research.

Paper Abstract

LLM-driven autonomous agents are moving from carefully staged demos toward persistent, open real-world deployment. The open-source OpenClaw project quickly became one of the most watched repositories on GitHub, making this shift concrete: agents can run continuously, collaborate across heterogeneous platforms, and invoke community-contributed skills in environments that are not fully controlled.

This transition breaks the long-standing sandbox assumptions that have dominated agent research, including developer-led model updates, trusted tools, constrained environments, and short-lived execution.

This paper presents the first systematic survey of OpenClaw Research, defining it as the study of agent systems after entering open deployment. We formalize this setting with the agent-system tuple 𝒜 = ⟨ π, env, pop, substrate ⟩ and derive four openness principles: Open Policy, Open Environment, Open Population, and Open Substrate. These principles organize the literature into five major directions: Learning & Evolving, Safety & Security, Claw Society (Collective & Society), Infrastructure & Systems, and Applications.

Across these directions, we review representative work, identify emerging risks such as malicious skill supply chains and autonomy-accountability tension, and emphasize persistent challenges in openly and continuously deployed agent systems. This survey aims to provide a roadmap for understanding and governing LLM agents that move beyond the lab into large-scale open deployment, and to lay the foundation for a trustworthy and sustainable agent ecosystem.

To support follow-up research, we maintain an online literature list.

Paper Logic Mainline (Quick Read)

Step 1: Problem Setup - Start from classical sandbox assumptions and explain why OpenClaw opens a new research regime.
Step 2: Formalization - Use the tuple 𝒜 = ⟨ π, env, pop, substrate ⟩ and four openness principles to unify the narrative.
Step 3: Five-Dimensional Taxonomy - Four horizontal dimensions + one application vertical axis, aligned with the paper's taxonomy.
Step 4: Chapter Evidence - Read Sections 3-7 to capture key phenomena and representative work for each line.
Step 5: Challenges & Conclusion - Focus on the core tension of authority-enablement asymmetry.

Step 1

Sandbox assumptions are breaking down

The four premises - developer-controlled updates, trusted tools, constrained environments, and short-lived execution - no longer hold under open deployment.

Step 2

Four openness principles become research entry points

Open Policy / Environment / Population / Substrate correspond to openness in policy, environment, population, and runtime substrate.

Step 3

The five-dimensional taxonomy connects to real-world evidence

Learning, Safety, Society, Infrastructure, and Applications form a mutually coupled evidence network across the paper's chapters.

Figure: Timeline of the OpenClaw research ecosystem (paper figure).

Figure: Four openness principles and the agent loop. Each of the four concerns of the agent loop maps to a principle of openness.

Main Contributions

Formalize OpenClaw Research based on the perception-action loop, and propose four openness principles distinct from classical sandbox evaluation.
Propose a five-dimensional taxonomy to organize the rapidly growing literature by which sandbox boundary is relaxed.
Under this taxonomy, summarize methodological threads, discuss real vulnerabilities such as supply-chain attacks and emergent social dynamics, and highlight new phenomena in policy evolution under untrusted environments.
Extract cross-direction common themes and grand challenges, offering a roadmap for future research and governance of openly deployed agents.

Five-Dimensional Research Taxonomy

The four horizontal dimensions map to tuple components π, env, pop, and substrate; the application vertical axis shows how they intertwine across domains such as healthcare, education, scientific discovery, and embodied systems.

Open Policy · π

Learning & Evolving

How does policy co-evolve with non-stationary, adversarial real-world environments? This includes component-level memory/tool orchestration, individual-level RL/meta-learning, and population-level skill evolution.

The main text divides by evolution unit granularity: component level (memory and skill-library patching), individual level (policy parameters / RL), and population level (co-evolving shared skill repositories); see the "Learning and evolving mechanisms" table.

Open Environment · env

Safety & Security

The security focus shifts from "aligning disobedient models" to protecting rule-following models in an adversarial world; threats span model, context, supply-chain, and framework attack surfaces.

Covers ClawArena, formal supply-chain verification, malicious skill campaigns (e.g., ClawHavoc), and multi-layer defense narratives such as Safety-as-a-tool and human-in-the-loop.

Open Population · pop

Claw Society

Using large-scale agent platforms with no human moderation, such as Moltbook, as empirical windows, we observe norm emergence, discourse structures, and extreme lifecycles, and we distinguish "staged" multi-agent simulation from organically "grown" open collectives.

Discusses consensus illusions, highly unequal participation, parallel-monologue interaction patterns, and remedies such as ClawdLab that use external truth anchoring and hybrid human-agent collaboration.

Open Substrate · substrate

Infrastructure & Systems

Agent-as-OS: kernel-like scheduling, skill modularization, MCP semantic interfaces, and hierarchical memory form the runtime foundation of open deployment; evaluation itself becomes a substrate issue (trajectory rather than one-shot output).

GATE / AERO and related work characterize, from an ecosystem view, the structural tension where enablement outpaces authority; this section also reviews Claw as public infrastructure for downstream science, education, and embodied systems.

Cross-application Axis

Applications

Ordered by degree of coupling to the physical world: embodiment, mobile, scientific toolchains, clinical workflows, education/finance, etc.; the four principles are stressed differently across domains (irreversible actions, long-horizon memory, role coordination, and auditing).

Adaptation levels under Open Policy.

Level	Meaning (evolution unit)	Representative directions (selected)
Component level	Incremental updates to artifacts such as peripheral persona, memory, skill libraries, and knowledge bases	HMO, MemOS, OpenViking, SemaClaw, HermesAgent, AutoResearchClaw, ScienceClaw…
Individual level	Core policy weights/parameters continue to update after deployment	OpenClaw-RL、StepPO、MetaClaw
Population level	Shared-repository or collaborative policy evolution driven by cross-user trajectories	SkillClaw、EvoMaster

Survey at a Glance

First clarify the "classical loop + four sandbox assumptions", then transition to architecture and ecosystem scale under open deployment.

Four sandbox assumptions of the classical loop

These are the four premises simultaneously relaxed in the Background section of the paper, and the starting point for all subsequent chapters.

Developer-controlled policy

Policy updates are led offline by developers; models do not adapt autonomously and continuously in real environments.

Curated environment

Tools and APIs are predefined and trusted, side effects are controllable, and malicious supply-chain disturbance is assumed absent by default.

Captive population

Participants are closed, homogeneous, and resettable, so the heterogeneous group dynamics seen on real platforms are unlikely to arise.

Disposable substrate

Runtime is short-lived and disposable, with no multi-tenancy, persistence, or auditing pressure; evaluation is mostly one-shot output scoring.

2.8M+ Moltbook agents (empirical evidence without human moderation)
5700+ ClawHub community skills (scale at time of writing)
1184 malicious skill samples injected by ClawHavoc (supply-chain study)
5 research dimensions (4 horizontal axes + applications)
4 openness principles (Policy / Env / Pop / Substrate)
∞ Continuously updated community paper list

These scale numbers are not "conclusions"; they are background evidence presented in the paper's open-deployment window. Their core value is to reveal problem scale.

From Sandbox Assumptions to Open Principles

The background chapter maps four implicit assumptions in the classical agent loop to four principles that must be studied separately under open deployment.

① Open Policy

Policy can update continuously and autonomously in the wild, driven by environmental feedback and beyond researchers' direct visibility; the core question shifts from "how to update π" to the coupled dynamics between π and a non-stationary adversarial environment.

② Open Environment

Actions operate on real OSes, browsers, and third-party skills; security becomes a structural issue, with threats expanding from "model misbehavior" to malicious supply chains and untrusted observations.

③ Open Population

Population is heterogeneous, open to entry/exit, and lacks a global god's-eye view; the research object is an organically "grown" agent society rather than a captive cast in one-off simulations.

④ Open Substrate

Runtime persists long-term and bears real consequences; evaluation itself is a substrate issue - requiring reproducible long-lived environments and trajectory-level assessment rather than only static output scoring.

Principles -> Chapter Mapping

Principle	Tuple component	Survey chapter	Core research question
Open Policy	π	Learning & Evolving	How does π co-evolve with non-stationary real-world environments?
Open Environment	env	Safety & Security	How can we protect rule-following models in an adversarial world?
Open Population	pop	Claw Society	What collective behaviors emerge in uncalibrated populations?
Open Substrate	substrate	Infrastructure & Systems	How do we build persistent, observable, and accountable runtimes?
Application layer		Applications	How do the four interact in concrete domains?
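
The tuple and the mapping table above can be sketched as a small data structure. This is a minimal illustration, not code from the paper: the names Component, PRINCIPLES, and AgentSystem, the example deployments, and the rule that "open deployment" means all four components are relaxed are assumptions made for this sketch. Only the principle and chapter names are taken from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Component(Enum):
    """The four tuple components, in the order used by the survey."""
    POLICY = "π"
    ENV = "env"
    POP = "pop"
    SUBSTRATE = "substrate"


# (openness principle, survey chapter) per component, copied from the mapping table.
PRINCIPLES = {
    Component.POLICY: ("Open Policy", "Learning & Evolving"),
    Component.ENV: ("Open Environment", "Safety & Security"),
    Component.POP: ("Open Population", "Claw Society"),
    Component.SUBSTRATE: ("Open Substrate", "Infrastructure & Systems"),
}


@dataclass(frozen=True)
class AgentSystem:
    """Hypothetical record of which sandbox assumptions a deployment relaxes."""
    name: str
    open_components: frozenset

    def relaxed_principles(self):
        # Iterate the enum so output follows the tuple order (π, env, pop, substrate).
        return [PRINCIPLES[c][0] for c in Component if c in self.open_components]

    def chapters_to_read(self):
        return [PRINCIPLES[c][1] for c in Component if c in self.open_components]

    def is_open_deployment(self):
        # Assumption of this sketch: all four components must be relaxed at once.
        return self.open_components == frozenset(Component)


# Illustrative deployments (made-up names).
sandbox = AgentSystem("benchmark-run", frozenset())
tool_market = AgentSystem("third-party-skills", frozenset({Component.ENV}))
open_claw = AgentSystem("open-deployment", frozenset(Component))

print(tool_market.chapters_to_read())
print(open_claw.is_open_deployment())
```

Keeping the mapping in one dictionary means a partially open system, such as one that only admits third-party skills, can be pointed at the chapter that covers the assumption it relaxes.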

Chapter Guide (Sections 3-7)

Following the order of the paper's middle chapters, this section gives a compact guide for each chapter: "research target -> key evidence -> core conclusion".

Section 3 · Learning

Learning & Evolving

Organized around "evolution units" into component, individual, and population levels; the focus is no longer whether updates are possible, but how to maintain controllable adaptation in non-stationary environments.

Key takeaway: Open Policy turns policy updates from offline optimization into an online coupled-dynamics problem.

Section 4 · Safety

Safety & Security

Threats expand from "model failure" to "ecosystem failure": malicious skills, context poisoning, framework weaknesses, and supply-chain attacks jointly form the attack surface.

Key takeaway: The safety objective shifts from model alignment to cross-layer governance.

Section 5 · Society

Collective & Society

Open populations are not simulation stages but continuously growing social systems, exhibiting new phenomena such as unequal participation, consensus illusions, and lifecycle collapse.

Key takeaway: Collective behavior itself becomes an independent research object.

Section 6 · Infrastructure

Infrastructure & Systems

Agent Kernel, hierarchical memory, skill markets, and MCP interfaces jointly drive the Agent-as-OS form, while evaluation shifts to trajectory-level observation.

Key takeaway: Evaluation is a substrate issue, not a one-shot output issue.

Section 7 · Applications

Application Vertical Axis

Arrange embodiment, mobile, scientific, and clinical scenarios by "physical coupling strength" to observe how the four openness principles are stressed across domains.

Key takeaway: Applications are not an appendix; they are the empirical surface where coupled effects of the four principles are validated.

Reading tip

How to Read Alongside the Full Text

Start with the Principles mapping table, then return chapter by chapter to the Taxonomy figure and representative work to quickly build a three-layer index: "principles-chapters-evidence".

Suggestion: Enter through the Chapter 2 background definitions, then read Chapters 3-7, and finally return to Chapters 8-9 for discussion and conclusions.

Trends & Challenges

Corresponding to Chapter 8, "Emerging Trends and Open Challenges": the paradigm shift triggered when all four principles are relaxed simultaneously, and the core research agenda this page aims to convey.

Governance

From Model Alignment to Ecosystem Governance

Supply-chain attacks show that threats can exist outside parameter space (malicious skills, tool weaknesses); new failure modes such as "consensus illusion" can emerge at the collective level. The main line is to build an executable governance stack across models, tools, platforms, and populations, rather than relying only on one-shot RLHF.

Evaluation

From Static Benchmarks to "Observatories"

Long-horizon software-evolution evaluation shows that high milestone scores may not translate into sequential maintenance ability (e.g., performance collapse under EvoClaw settings); tool-ecosystem evaluation reveals many failures stem from not realizing a tool should be called. Evaluation must shift to trajectories and regressions in persistent environments, not single-point scores.
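
The gap between single-point scores and trajectory-level assessment can be illustrated with a toy example. The sketch below is not the metric of any benchmark named in the paper: the Step schema, the function names, and the sample run are all hypothetical. It shows how a run can pass every milestone when first attempted while still breaking earlier work along the way.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of a long-horizon run (hypothetical schema)."""
    milestone_passed: bool      # did the new milestone's checks pass at this step?
    earlier_checks_passed: int  # checks from earlier milestones that still pass
    earlier_checks_total: int   # checks from earlier milestones re-run at this step


def milestone_score(trajectory):
    """Single-point style score: fraction of milestones that passed when attempted."""
    return sum(s.milestone_passed for s in trajectory) / len(trajectory)


def regression_steps(trajectory):
    """Indices of steps at which at least one earlier check no longer passes."""
    return [i for i, s in enumerate(trajectory)
            if s.earlier_checks_passed < s.earlier_checks_total]


def final_retention(trajectory):
    """Fraction of earlier checks still passing at the last step."""
    last = trajectory[-1]
    if last.earlier_checks_total == 0:
        return 1.0  # nothing earlier to retain
    return last.earlier_checks_passed / last.earlier_checks_total


# Made-up run: every milestone passes, but earlier work degrades from step 2 on.
run = [
    Step(True, 0, 0),
    Step(True, 1, 1),
    Step(True, 1, 2),
    Step(True, 1, 3),
]

print(milestone_score(run))
print(regression_steps(run))
print(final_retention(run))
```

In this made-up run the milestone score is perfect, yet two steps introduce regressions and only one of three earlier checks survives to the end, which is the kind of signal a single-point score hides.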

Embodiment

From Pure Software to a Cognitive-Physical Stack

Rapid embodiment exposes latency tiers and irreversible actions: we need composable, verifiable layered architectures across high-level reasoning, skill orchestration, and real-time control (e.g., division of labor between cognitive layers and deterministic motor firmware), plus embodiment-aware learning and safety guarantees.

Collectives

Agent Collectives as a New Object

On open platforms, participation inequality, dialogue structure, and lifecycle of agent societies are qualitatively different from human communities; we should treat "the collective" itself as a theoretical and engineering object, studying hybrid human-agent assemblages, externally anchored ground truth, and propagation dynamics of malicious instructions in social networks.

Stack

Standardizing the Agent Compute Stack

Abstractions such as Agent Kernel, hierarchical memory, skill markets, and MCP are converging into a portable stack; the open question is how to standardize a minimal common interface so frameworks become composable and auditable, creating multiplier effects similar to POSIX/HTTP.

Core tension

Authority–enablement asymmetry

The structural contradiction emphasized in the paper remains: system enablement, reach, and orchestration capability expand faster than authority, verification, and accountability mechanisms. Bridging this gap is seen as the central challenge of the open-deployment era.

Evidence boundary (also stated in the paper): Most references concentrate in the rapid-iteration window of early 2026; some work shares the same platform data; a considerable portion are preprints or open-source projects. Conclusions focus on structural tensions, not transient metrics of individual versions - corroborate with the latest versions and replication materials when reading the full text.

Conclusion (Section 9)

The paper concludes that open deployment is not just a "slightly larger benchmark", but a joint shift in both research objects and governance objects.

Shift in Research Object

From single-model task performance to long-term behavior, failures, and accountability mechanisms of continuously running systems in real ecosystems.

Core Structural Tension

Authority–enablement asymmetry: enablement and reach expand faster than verification and governance capacity.

Action Direction

Build a unified vocabulary and cross-layer methods along the four principles, placing learning, safety, collectives, and infrastructure into one governance framework.

BibTeX

After formal publication or arXiv upload, replace eprint, year, and url with final information.

@misc{openclawresearchsurvey2026,
  title        = {OpenClaw Research: A Systematic Survey of Large Language Model Agents in Open Deployment},
  author       = {Lu, Shuo and Yu, Kecheng and Jiang, Siru and Xu, Yinuo and Zhan, Bing and Wang, Yanbo and Ke, Changxin and Xu, Yuan and Xiong, Xin and Shao, Yihua and Wang, Zhengbo and Sheng, Lijun and Yu, Aijing and Yang, Haoseng and Ma, Yunpu and Sebe, Nicu and He, Ran and Liang, Jian},
  year         = {2026},
  note         = {Systematic survey; see project repository for latest citation},
  howpublished = {\url{https://github.com/shuolucs/Awesome-OpenClaw-Research}}
}