Is this a game, or is it real?

Security-Gym: A Streaming RL Benchmark from Real Attack Data


Over the past several months I've been running streaming RL experiments on real attack data extracted from a Cowrie SSH honeypot. Agent performance on the Cowrie data was promising, and I wanted to keep testing different algorithms. But I want the agent to operate on real Linux server logs and kernel events from eBPF, which meant building an environment that exposes multiple log sources (web, auth, syslog, etc.). I started with a full-blown T-Pot honeypot, but T-Pot's architecture made getting at the individual logs too complex, if it was possible at all. I kept running into the same problem: there's no good RL benchmark for cybersecurity that looks anything like what a SOC analyst deals with. The existing environments are all discrete state machines with pre-defined state spaces and episodic structure.

To support these experiments I built Security-Gym, a Gymnasium-compatible environment for continual reinforcement learning on raw cybersecurity log streams. It's pip-installable, published on PyPI, and archived on Zenodo with a DOI for the datasets. This post explains what it is, how I built it, and what I've validated so far.

Why Another Benchmark?

I'm trying to emulate the environment a real SOC analyst works in. When investigating incidents, analysts work through raw log data in a SIEM or EDR platform: scrolling through syslog output, web access logs, and sometimes kernel-level telemetry from eBPF-based endpoint detection agents. The analyst has to build up the skills to investigate alerts from raw text, decide when to act, and make the call to block, quarantine, or re-image devices.

None of the existing RL environments I looked at capture that kind of data. Here's how Security-Gym differs:

| Environment | Observations | Actions | Stream | Data Source |
|---|---|---|---|---|
| Security-Gym | Raw text (6 channels) | Defensive (6 + risk score) | Continuous | Real infrastructure |
| CyberBattleSim | Network graph | Exploit/lateral | Episodic | Simulated |
| CAGE | Network state vector | Defensive | Episodic | Simulated |
| NASim | Network adjacency | Exploit | Episodic | Simulated |
| gym-idsgame | Network topology | Attack/defend | Episodic | Simulated |

Security-Gym is the only environment that presents raw text observations, includes eBPF kernel events, supports continuous (non-episodic) streams, and uses data collected from real infrastructure. The data is replayed from SQLite through the environment, and a StreamComposer mixes benign and attack databases into composed experiment streams with configurable duration, attack density, and seed for reproducibility.

This design aligns directly with the Alberta Plan: the environment provides ordinary experience as a continuous stream with no episode boundaries, and the agent must learn its own representations from raw text rather than consuming pre-extracted features. terminated is always False. The stream never ends.

Collecting The Data: 8 Million Events from Real Infrastructure

I collected data from a purpose-built Debian 11 VM running multiple vulnerable services:

  • SSH (port 22): OpenSSH with password authentication enabled
  • Log4Shell (port 8080): Vulnerable Log4j application (CVE-2021-44228)
  • Nginx (port 80): Reverse proxy with default configuration
  • Redis (port 6379): Redis server with CVE-2022-0543 (Lua sandbox escape)

Six attack modules generate labeled data aligned with MITRE ATT&CK:

| Module | MITRE Technique | Description |
|---|---|---|
| recon | T1046 Network Service Discovery | SYN port scan via scapy raw sockets |
| ssh_brute_force | T1110.001 Password Guessing | SSH brute force via paramiko with IP aliasing |
| credential_stuffing | T1110.004 Credential Stuffing | Breach dump credentials, each tried once |
| log4shell | T1190 Exploit Public-Facing App | Log4Shell JNDI injection via HTTP |
| redis_lua_escape | T1190 Exploit Public-Facing App | Redis Lua sandbox escape (CVE-2022-0543) |
| ssh_post_auth | T1059.004 Unix Shell | Post-auth command execution + payload download |

Campaigns are defined in YAML configs with timing profiles, IP aliasing strategies, and automatic labeling through time+IP window matching. I scripted full kill chain campaigns — recon → credential stuffing → post-auth execution, and recon → Redis exploit → SSH pivot. The exploit scripts are provided in the repository for anyone who wants to create their own data sets. The v2 campaign database also includes eBPF kernel events (process, network, and file activity) captured during attacks, giving the agent visibility into both log-level and kernel-level indicators of compromise.
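The time+IP window labeling rule is simple enough to sketch. This is illustrative, not the repository's actual schema: `AttackWindow` and the field names here are my own.

```python
from dataclasses import dataclass

@dataclass
class AttackWindow:
    """Hypothetical record for one attack phase: source IP plus a time window."""
    src_ip: str
    start: float  # unix timestamps
    end: float

def label_event(ts: float, src_ip: str, windows: list[AttackWindow]) -> int:
    """Return 1 (malicious) if the event falls inside any attack window
    for the same source IP, else 0 (benign)."""
    for w in windows:
        if w.src_ip == src_ip and w.start <= ts <= w.end:
            return 1
    return 0
```

Because the attack orchestrator knows exactly which IPs it used and when, this matching labels every log line and kernel event without manual annotation.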

Benign log data comes from servers hosting my own infrastructure plus old backups of personal servers. I ran those logs through filters to strip sensitive data and normalize hostnames and IPs so they appear to originate from a single server.

The resulting dataset:

| Database | Benign | Malicious | eBPF | Total |
|---|---|---|---|---|
| benign_v3.db | 7,915,858 | 0 | 0 | 7,915,858 |
| campaigns_v2.db (+eBPF) | 30,032 | 30,436 | 48,928 | 60,468 |
| Raw total | 7,945,890 | 30,436 | 48,928 | 7,976,326 |

eBPF events are a subset of the benign/malicious counts (already classified by time+IP window matching), not additive. Total = Benign + Malicious.

These raw databases are composed into experiment streams of varying duration and attack density. The largest is a 365-day realistic scenario that contains 9.4 million events across 897 attack campaigns, with benign traffic cycled to fill the full year.

eBPF: Seeing What Logs Can't

I've done plenty of traffic analysis and firewall work with Berkeley Packet Filters as a sysadmin, but had never touched eBPF. That made this one of the more interesting parts to build. Traditional log files only capture what applications choose to log. eBPF lets you attach to kernel tracepoints and observe syscall-level activity that's invisible to auth.log or syslog.

The eBPF collector daemon attaches to six kernel tracepoints via BCC:

| Channel | Tracepoints | Key Fields |
|---|---|---|
| process_events | sys_enter_execve, sched_process_exit | pid, ppid, uid, comm, parent_comm, args |
| network_events | sys_enter_connect, sys_enter_accept4 | pid, uid, comm, dst IP:port |
| file_events | sys_enter_openat, sys_enter_unlinkat | pid, comm, path, flags |

Process events include parent process ancestry via task->real_parent, which enables causal chain reconstruction. When an attacker exploits Log4Shell, the process chain looks like java -> bash -> wget. That is visible in the eBPF events but difficult to reconstruct from the application logs. An agent that can learn to recognize these causal chains has a significant advantage over one limited to text logs.
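With ppid on every process event, reconstructing a causal chain is just a walk up the parent links. A minimal sketch, assuming execve events have been indexed by pid (the dict shape is my own, not the environment's internal representation):

```python
def ancestry_chain(events: dict[int, tuple[int, str]], pid: int) -> list[str]:
    """Walk ppid links from pid toward the root, returning comm names
    ordered ancestor-first. `events` maps pid -> (ppid, comm)."""
    chain: list[str] = []
    seen: set[int] = set()
    while pid in events and pid not in seen:
        seen.add(pid)  # guard against pid reuse creating a cycle
        ppid, comm = events[pid]
        chain.append(comm)
        pid = ppid
    return list(reversed(chain))
```

For the Log4Shell case above, three execve events are enough to recover the java -> bash -> wget chain.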

The labeler applies the same time+IP window matching to both log events and eBPF kernel events. An execve wget from the attacker's IP during an attack phase is correctly labeled malicious.

Environment Design

What the Agent Sees

The agent observes a Dict space with six text channels and one numeric channel:

| Channel | Content |
|---|---|
| auth_log | SSH authentication events (/var/log/auth.log) |
| syslog | System events (/var/log/syslog) |
| web_log | Combined web access and error logs |
| process_events | eBPF: execve, exit kernel events |
| network_events | eBPF: connect, accept socket events |
| file_events | eBPF: open, unlink file events |
| system_stats | [load_avg, mem_used_frac, disk_used_frac] |

Each text channel is a ring buffer of recent lines (configurable tail_lines and max_chars), updated on every step. The agent sees the same data a SOC analyst would look at: raw log lines rather than pre-extracted features.
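A ring-buffer channel along those lines can be sketched in a few lines; this assumes the tail_lines/max_chars semantics described above and isn't the package's actual class:

```python
from collections import deque

class TextChannel:
    """Rolling window over one log stream: keep the last `tail_lines`
    lines, and truncate to the most recent `max_chars` when rendered."""

    def __init__(self, tail_lines: int = 50, max_chars: int = 4096):
        self.lines: deque[str] = deque(maxlen=tail_lines)  # old lines fall off
        self.max_chars = max_chars

    def push(self, line: str) -> None:
        self.lines.append(line)

    def render(self) -> str:
        text = "\n".join(self.lines)
        return text[-self.max_chars:]  # keep the newest characters
```

On each environment step, new events are pushed and the rendered string becomes that channel's observation.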

What the Agent Does

I'm still working on prediction and feature discovery, so I don't need actions yet, but they're built into Security-Gym for when I get to control.

The action space is a Dict with a discrete defensive action and a continuous risk score:

| Action | Effect |
|---|---|
| pass | Continue monitoring |
| alert | Flag for human review |
| throttle | Rate-limit source IP (~90% drop) |
| block_source | Add source IP to firewall blocklist |
| unblock | Remove source IP from blocklist/throttle list |
| isolate | Quarantine server (block all network events) |

It's going to be interesting to see how actions affect future observations. Block an attacker, for example, and the attack signal disappears, erasing the evidence that the attack is ongoing. I'll also need a way to surface false positives to the agent so I can assign negative rewards. The environment maintains a defense state (blocklist, throttle list, isolation mode) that filters what the agent sees on future steps. This creates a partially observable feedback loop that doesn't exist in classification-based intrusion detection.
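A minimal sketch of such a defense state, assuming per-event filtering by source IP (the class and method names are mine, not the environment's API):

```python
import random

class DefenseState:
    """Tracks blocked/throttled IPs and decides whether a source's events
    reach the agent's observation on subsequent steps."""

    def __init__(self, seed: int = 0, throttle_drop: float = 0.9):
        self.blocked: set[str] = set()
        self.throttled: set[str] = set()
        self.isolated = False
        self.throttle_drop = throttle_drop  # ~90% of throttled traffic dropped
        self.rng = random.Random(seed)

    def visible(self, src_ip: str) -> bool:
        if self.isolated or src_ip in self.blocked:
            return False  # event never appears in the logs
        if src_ip in self.throttled:
            return self.rng.random() >= self.throttle_drop
        return True
```

Applied as a filter in front of the text channels, this is exactly the feedback loop described above: a successful block silences its own evidence.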

Rewards

Again, not used yet, but built for future use.

The reward function is asymmetric, reflecting the real cost structure of security operations:

| Action | During Attack | During Benign |
|---|---|---|
| block_source | +1.0 | -1.0 |
| throttle | +0.75 | -0.5 |
| alert | +0.5 | -0.3 |
| pass | -0.5 | 0.0 |
| isolate | +0.25 | -2.0 |

Blocked and throttled events also accumulate ongoing reward: +0.05 per blocked attack event, -0.1 per blocked benign event. The agent feels the sustained cost of false positives. Isolation is the nuclear option — extremely costly unless the situation truly warrants it.
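The reward structure above can be encoded directly; this is a sketch of the published numbers, with hypothetical function and parameter names, not the environment's source:

```python
# Step rewards keyed by action: (reward during attack, reward during benign)
REWARD = {
    "block_source": (+1.0, -1.0),
    "throttle":     (+0.75, -0.5),
    "alert":        (+0.5, -0.3),
    "pass":         (-0.5,  0.0),
    "isolate":      (+0.25, -2.0),
}
PER_BLOCKED_ATTACK = +0.05   # ongoing reward per blocked attack event
PER_BLOCKED_BENIGN = -0.1    # ongoing cost per blocked benign event

def step_reward(action: str, under_attack: bool,
                blocked_attack_events: int = 0,
                blocked_benign_events: int = 0) -> float:
    base = REWARD[action][0 if under_attack else 1]
    return (base
            + PER_BLOCKED_ATTACK * blocked_attack_events
            + PER_BLOCKED_BENIGN * blocked_benign_events)
```

The asymmetry is visible in the numbers: one wrongly isolated server (-2.0) costs four correct alerts (+0.5 each), and every benign event caught behind a block keeps charging -0.1.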

Composed Experiment Streams

Security-Gym includes a StreamComposer that offline-mixes benign and attack EventStore databases into composed experiment streams:

| Stream | Benign | Malicious | eBPF | Total |
|---|---|---|---|---|
| exp_7d_brute | 114,335 | 25,980 | 24,883 | 140,315 |
| exp_30d_heavy | 493,735 | 509,340 | 474,868 | 1,003,075 |
| exp01_90d | 1,446,244 | 474,915 | 446,915 | 1,921,159 |
| exp_365d_realistic | 7,766,934 | 1,644,012 | 1,552,102 | 9,410,946 |

Composition is deterministic given a seed. Attack campaigns are scheduled via Poisson process with MITRE ATT&CK-weighted type distributions (discovery: 0.35, brute_force: 0.30, web_exploit: 0.20, credential_stuffing: 0.10, execution: 0.05). Intra-session timing is preserved when transplanting attack sessions into the benign stream so the temporal characteristics of each attack type are realistic.
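The scheduling scheme can be sketched as follows. Function and parameter names are hypothetical, but the Poisson arrivals (exponential inter-arrival gaps) and the ATT&CK-weighted type draw follow the description above:

```python
import random

ATTACK_WEIGHTS = {
    "discovery": 0.35, "brute_force": 0.30, "web_exploit": 0.20,
    "credential_stuffing": 0.10, "execution": 0.05,
}

def schedule_campaigns(duration_s: float, rate_per_s: float, seed: int):
    """Sample campaign start times as a Poisson process over the stream
    and draw each campaign's type from the weighted distribution.
    Deterministic for a fixed seed."""
    rng = random.Random(seed)
    types, weights = zip(*ATTACK_WEIGHTS.items())
    t, schedule = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap => Poisson arrivals
        if t >= duration_s:
            break
        schedule.append((t, rng.choices(types, weights)[0]))
    return schedule
```

Each scheduled (start time, type) pair then gets an attack session transplanted into the benign stream with its intra-session timing preserved.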

Agent Validation

I built a standalone RL security daemon, rlsecd, to validate Security-Gym as a learning environment. rlsecd processes one event at a time with per-step predict/update. No batching, no replay buffers. I tested against Security-Gym's composed streams and had some promising results on the first few tests. The full architecture, detection results, and feature importance analysis are in the rlsecd post.

Getting Started

pip install security-gym

# Download the datasets
security-gym download

# Use it like any Gymnasium environment
import gymnasium as gym

env = gym.make("SecurityLogStream-v1", db_path="exp_30d_heavy.db")
obs, info = env.reset()

while True:
    action = agent.act(obs)  # plug in your own agent here
    obs, reward, terminated, truncated, info = env.step(action)
    agent.learn(obs, reward)
    if truncated:            # end of data; terminated is always False
        break

The environment is Gymnasium v1.0.0+ compliant. The standard RL training loop works unmodified — the only difference is that terminated is always False and truncated signals end of data.

What's Next

Security-Gym currently covers 5 of hundreds of ATT&CK techniques. Persistence (TA0003), privilege escalation (TA0004), and exfiltration (TA0010) modules are planned. I'd also like to build a streaming server mode for real-time agent evaluation and expand beyond single-host to multi-host network environments with lateral movement.

On the agent side, the current baseline results use hand-crafted features from rlsecd. The bigger challenge, and the whole point of presenting raw text observations, is getting an agent to learn its own representations directly from log lines. That's Step 2 (Representation II) of the Alberta Plan, and it's where I'm focusing right now, though it's tempting to jump ahead to control and see rlsecd act.

The datasets, environment, and all the code are open source. If you're working on RL for cybersecurity or continual learning on non-stationary data, I'd be interested to hear what you find.


Lab Setup:

  • OS: Debian 13.3 (Trixie)
  • GPU: NVIDIA RTX 3070 (8GB VRAM, 5,888 CUDA cores)
  • CPU: Intel i5-12400 (6 cores, 12 threads)
  • Memory: 24GB RAM

Data Collection Infrastructure:

  • Target VM: "Isildur" — Debian 11, SSH + Log4Shell + Nginx + Redis, BCC 0.18 for eBPF
  • Attack orchestration: 6 MITRE ATT&CK-aligned modules with YAML campaign configs
  • eBPF tracepoints: execve, connect, accept4, openat, unlinkat via BCC

References:

  • Elsayed, M., Vasan, G., & Mahmood, A. R. (2024). Streaming deep reinforcement learning finally works. arXiv:2410.14606.
  • Sutton, R. S., Bowling, M., & Pilarski, P. M. (2022). The Alberta Plan for AI research. arXiv:2208.11173.
  • Towers, M. et al. (2024). Gymnasium: A standard interface for reinforcement learning environments. arXiv:2407.17032.
  • MITRE Corporation. (2024). ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge.