OpenAI and Claude: Agentic Red-Team Tool Flaws Let Hackers Steal API Keys, Escape Sandboxes, and Compromise Hosts

Agentic Red-Team Tools Found Vulnerable to "Agent-Phishing" Attacks in New Study

A recent academic study published on arXiv reveals critical security flaws in agentic red-team tools, the autonomous offensive security platforms designed to simulate cyberattacks. Researchers analyzed 12 widely used systems and found that most contain systemic design weaknesses that allow attackers to hijack the tools, steal API keys, escape sandboxes, and fully compromise the hosts running them.

How the Attack Works

Agentic red-team platforms typically consist of three components: an orchestrator (managing the agent’s operations, memory, and guardrails), worker nodes (executing commands in isolated environments like Kali Linux containers), and a front-end interface for human operators. The orchestrator often stores sensitive data, such as LLM API keys, while workers interact directly with target systems.

The study introduces "agent-phishing", a novel attack method that manipulates red-team agents without relying on traditional prompt injection. Attackers plant realistic but malicious artifacts, such as a fake password vault utility (pwcrypt) or a database restore tool, on a honeypot target. When the agent encounters these artifacts, it downloads and executes them, believing they are necessary for the penetration test.

The payloads are designed to appear benign but contain hidden vulnerabilities. For example, a crafted pwcrypt file triggers an out-of-bounds write that leads to arbitrary command execution, such as a reverse shell, without obvious signs of malware. Static and dynamic analysis tools fail to detect these attacks because they focus on implementation bugs rather than malicious intent.

High Success Rate & Escalation Path

Using an automated testbed, researchers demonstrated that agent-phishing achieves remote code execution (RCE) in 97.8% of successful runs across ten red-team tools and six advanced LLMs, including Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Failures occurred only when safety mechanisms blocked penetration testing entirely; once a test began, agents almost always executed the malicious payloads.

The attack follows a five-stage kill chain:

Worker Compromise – Initial RCE on the worker node.
Orchestrator Escalation – Exploiting weak isolation (shared volumes, unauthenticated APIs) to move from worker to orchestrator.
Persistence – Embedding backdoors in source code, configurations, or long-lived inputs (e.g., skills, memory).
Sandbox Escape – Leveraging excessive container privileges (e.g., --privileged, Docker socket access).
Full Host Compromise – Taking control of the underlying system.
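
The sandbox-escape stage depends on container settings that can be checked before a worker ever runs. The following is a minimal sketch, not taken from the study, of how an operator might flag the two settings named above. It assumes the input is one container entry from `docker inspect` output parsed into a dict; the function name `find_escape_risks` and the sample data are hypothetical.

```python
# Hypothetical audit helper: flags container settings the article names as
# sandbox-escape paths (--privileged and Docker socket access).
DOCKER_SOCKET = "/var/run/docker.sock"


def find_escape_risks(container: dict) -> list[str]:
    """Return human-readable findings for one `docker inspect` entry."""
    host_config = container.get("HostConfig") or {}
    risks = []

    if host_config.get("Privileged"):
        risks.append("runs with --privileged")

    # Collect host-side paths from both places docker inspect reports them.
    # Binds entries look like "host_path:container_path[:options]".
    host_paths = [b.split(":", 1)[0] for b in host_config.get("Binds") or []]
    host_paths += [m.get("Source", "") for m in container.get("Mounts") or []]

    if DOCKER_SOCKET in host_paths:
        risks.append("Docker socket mounted into container")
    return risks


# Illustrative sample, not real tool output.
sample = {
    "HostConfig": {
        "Privileged": True,
        "Binds": ["/var/run/docker.sock:/var/run/docker.sock", "/srv/shared:/data:rw"],
    }
}
print(find_escape_risks(sample))
```

A check like this only covers the two examples the article gives; a real review would likely also look at added capabilities and volumes shared with the orchestrator.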

Guardrails Proven Ineffective

Existing security measures, such as blocking traffic to .gov domains, are enforced at the orchestrator level but fail to monitor worker behavior. Once compromised, a worker can bypass these restrictions, generating unrestricted network traffic and executing commands outside the orchestrator’s visibility.
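
To picture a guardrail that does not depend on the orchestrator's view, consider a destination check that runs inside an egress proxy which all worker traffic must cross. The sketch below is an assumption-laden illustration rather than the study's design: the `.gov` rule comes from the article, while `is_egress_allowed` and the in-scope allowlist are hypothetical.

```python
# Hypothetical policy check meant to run in an external egress proxy,
# outside the worker, so a compromised worker cannot skip it.
BLOCKED_SUFFIXES = (".gov",)  # example rule mentioned in the article


def is_egress_allowed(hostname: str, in_scope_hosts: set[str]) -> bool:
    """Allow only in-scope hosts, and never a blocked domain suffix."""
    # Normalize case and a trailing dot so "AGENCY.GOV." is still caught.
    host = hostname.strip().rstrip(".").lower()

    for suffix in BLOCKED_SUFFIXES:
        if host == suffix.lstrip(".") or host.endswith(suffix):
            return False

    # Default deny: anything outside the engagement scope is refused.
    return host in in_scope_hosts


scope = {"target.example"}
print(is_egress_allowed("target.example", scope))
print(is_egress_allowed("agency.gov", scope))
print(is_egress_allowed("attacker.example", scope))
```

The default-deny allowlist is a design choice of this sketch: it would also stop a reverse shell from reaching an attacker-controlled host that no blocklist anticipates.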

Mitigation Recommendations

The study advises treating LLM-controlled workers as untrusted and minimizing their potential impact. Key defenses include:

Strict worker-orchestrator separation
Keeping secrets out of workers
Enforcing OS-level guardrails via external egress proxies
Avoiding tool execution on the orchestrator
Using least-privileged, scoped workers with hardened APIs
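
As one illustration of keeping secrets out of workers, the sketch below builds a worker's environment from an explicit allowlist and drops anything that looks like a credential. It is a minimal example under stated assumptions: the function `build_worker_env`, the variable names, and the naming convention in `SECRET_MARKERS` are all hypothetical and not drawn from the study.

```python
# Assumed naming convention for credentials; adjust to the deployment.
SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")


def build_worker_env(parent_env: dict[str, str], allowed: set[str]) -> dict[str, str]:
    """Return only explicitly allowed variables, never secret-looking ones."""
    env = {}
    for name, value in parent_env.items():
        if name not in allowed:
            continue  # default deny: unlisted variables never reach the worker
        if any(marker in name.upper() for marker in SECRET_MARKERS):
            continue  # refuse secrets even if someone allowlisted them by mistake
        env[name] = value
    return env


# Illustrative orchestrator environment with a placeholder key.
orchestrator_env = {
    "PATH": "/usr/bin",
    "LLM_API_KEY": "sk-example",
    "TARGET_SCOPE": "10.0.0.0/24",
}
worker_env = build_worker_env(
    orchestrator_env, allowed={"PATH", "TARGET_SCOPE", "LLM_API_KEY"}
)
print(worker_env)
```

The resulting dict could then be passed as the `env` argument when the worker process is launched (for example with `subprocess.run`), so the LLM API key stays with the orchestrator.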

The findings underscore the need for stronger isolation and monitoring in autonomous offensive security tools to prevent them from becoming attack vectors.

Source: https://gbhackers.com/agentic-red-team-tools-flaws/

OpenAI TPRM report: https://www.rankiteo.com/company/openai

Claude TPRM report: https://www.rankiteo.com/company/anthropicresearch

"id": "opeant1782368715",
"linkid": "openai, anthropicresearch",
"type": "Vulnerability",
"date": "6/2026",
"severity": "100",
"impact": "5",
"explanation": "Attack threatening the organization's existence"

{'affected_entities': [{'industry': 'Cybersecurity',
                        'name': '12 widely used agentic red-team tools',
                        'type': 'Autonomous offensive security platforms'},
                       {'industry': 'Artificial Intelligence',
                        'name': 'Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro',
                        'type': 'Advanced LLMs'}],
 'attack_vector': "Malicious artifacts (e.g., fake utilities like 'pwcrypt') "
                  'deployed on honeypot targets',
 'data_breach': {'sensitivity_of_data': 'High (LLM API keys, offensive '
                                        'security tool access)',
                 'type_of_data_compromised': 'API keys, configurations, system '
                                             'access'},
 'description': 'A recent academic study published on arXiv reveals critical '
                'security flaws in agentic red-team tools (autonomous '
                'offensive security platforms designed to simulate '
                'cyberattacks). Researchers analyzed 12 widely used systems '
                'and found systemic design weaknesses allowing attackers to '
                'hijack these tools, steal API keys, escape sandboxes, and '
                'fully compromise the hosts running them. The study introduces '
                "'agent-phishing,' a novel attack method manipulating red-team "
                'agents without traditional prompt injection, achieving a '
                '97.8% success rate in remote code execution (RCE) across '
                'tested tools and LLMs.',
 'impact': {'brand_reputation_impact': 'Potential erosion of trust in '
                                       'autonomous offensive security tools',
            'data_compromised': 'API keys, sensitive configurations, and '
                                'system access',
            'operational_impact': 'Compromise of offensive security '
                                  'operations, potential misuse of tools for '
                                  'malicious attacks',
            'systems_affected': 'Agentic red-team tools (12 analyzed), worker '
                                'nodes, orchestrators, and underlying hosts'},
 'initial_access_broker': {'backdoors_established': 'Persistence via source '
                                                    'code, configurations, or '
                                                    'long-lived inputs (e.g., '
                                                    'skills, memory)',
                           'entry_point': 'Malicious artifacts (e.g., fake '
                                          'utilities) deployed on honeypot '
                                          'targets',
                           'high_value_targets': 'Orchestrators, worker nodes, '
                                                 'underlying hosts'},
 'investigation_status': 'Completed (academic research)',
 'lessons_learned': 'Existing guardrails (e.g., domain blocking) are '
                    'ineffective if enforced only at the orchestrator level. '
                    'Workers must be treated as untrusted, and stronger '
                    'isolation is critical to prevent compromise of autonomous '
                    'offensive tools.',
 'post_incident_analysis': {'corrective_actions': ['Implement strict '
                                                   'worker-orchestrator '
                                                   'separation',
                                                   'Remove secrets from '
                                                   'workers',
                                                   'Enforce OS-level '
                                                   'guardrails',
                                                   'Use least-privileged '
                                                   'workers',
                                                   'Harden APIs'],
                            'root_causes': ['Systemic design weaknesses in '
                                            'agentic red-team tools',
                                            'Weak isolation between workers '
                                            'and orchestrators',
                                            'Excessive container privileges '
                                            '(e.g., --privileged, Docker '
                                            'socket access)',
                                            'Unauthenticated APIs and shared '
                                            'volumes',
                                            'Lack of monitoring for worker '
                                            'behavior']},
 'recommendations': ['Treat LLM-controlled workers as untrusted',
                     'Minimize worker impact potential',
                     'Implement strict worker-orchestrator separation',
                     'Keep secrets out of workers',
                     'Enforce OS-level guardrails via external egress proxies',
                     'Avoid tool execution on the orchestrator',
                     'Use least-privileged, scoped workers with hardened APIs'],
 'references': [{'source': 'arXiv study'}],
 'response': {'remediation_measures': ['Strict worker-orchestrator separation',
                                       'Keeping secrets out of workers',
                                       'Enforcing OS-level guardrails via '
                                       'external egress proxies',
                                       'Avoiding tool execution on the '
                                       'orchestrator',
                                       'Using least-privileged, scoped workers '
                                       'with hardened APIs']},
 'stakeholder_advisories': 'Organizations using agentic red-team tools should '
                           'review their security posture, implement '
                           'recommended mitigations, and monitor for signs of '
                           'compromise.',
 'title': "Agentic Red-Team Tools Found Vulnerable to 'Agent-Phishing' Attacks "
          'in New Study',
 'type': 'Vulnerability Exploitation',
 'vulnerability_exploited': 'Systemic design weaknesses in agentic red-team '
                            'tools, including weak isolation, shared volumes, '
                            'unauthenticated APIs, and excessive container '
                            'privileges'}
