Is this a game, or is it real?

Security-Gym: A Streaming RL Benchmark from Real Attack Data


Over the past several months I've been running streaming RL experiments on real attack data extracted from a Cowrie SSH honeypot. Agent performance on the Cowrie data was promising, and I wanted to keep testing different algorithms. But I want the agent to operate on real Linux server logs and kernel events from eBPF, which meant building an environment that exposes multiple log sources (web, auth, syslog, etc.). I started with a full-blown T-Pot honeypot, but T-Pot's architecture made getting at the individual logs too complex, if it was possible at all. I kept running into the same problem: there's no good RL benchmark for cybersecurity that looks anything like what a SOC analyst deals with. The existing environments are all discrete state machines with pre-defined state spaces and episodic structure.

To support these experiments I built Security-Gym, a Gymnasium-compatible environment for continual reinforcement learning on raw cybersecurity log streams. It's pip-installable, published on PyPI, and archived on Zenodo with a DOI for the datasets. This post explains what it is, how I built it, and what I've validated so far.

Why Another Benchmark?

I'm trying to emulate the environment a real SOC analyst works in. When investigating incidents, analysts work through raw log data in a SIEM or EDR platform: scrolling through syslog output, web access logs, and sometimes kernel-level telemetry from eBPF-based endpoint detection agents. The analyst has to build up the skills to investigate alerts from raw text, decide when to act, and make the call to block, quarantine, or re-image devices.

None of the existing RL environments I looked at capture that kind of data. Here's how Security-Gym differs:

| Environment | Observations | Actions | Stream | Data Source |
|---|---|---|---|---|
| Security-Gym | Raw text (6 channels) | Defensive (6 + risk score) | Continuous | Real infrastructure |
| CyberBattleSim | Network graph | Exploit/lateral | Episodic | Simulated |
| CAGE | Network state vector | Defensive | Episodic | Simulated |
| NASim | Network adjacency | Exploit | Episodic | Simulated |
| gym-idsgame | Network topology | Attack/defend | Episodic | Simulated |

Security-Gym is the only environment that presents raw text observations, includes eBPF kernel events, supports continuous (non-episodic) streams, and uses data collected from real infrastructure. The data is replayed from SQLite through the environment, and a StreamComposer mixes benign and attack databases into composed experiment streams with configurable duration, attack density, and seed for reproducibility.

This design aligns directly with the Alberta Plan: the environment provides ordinary experience as a continuous stream with no episode boundaries, and the agent must learn its own representations from raw text rather than consuming pre-extracted features. terminated is always False. The stream never ends.

Collecting The Data: 8 Million Events from Real Infrastructure

I collected data from a purpose-built Debian 11 VM running multiple vulnerable services:

  • SSH (port 22): OpenSSH with password authentication enabled
  • Log4Shell (port 8080): Vulnerable Log4j application (CVE-2021-44228)
  • Nginx (port 80): Reverse proxy with default configuration
  • Redis (port 6379): Redis server with CVE-2022-0543 (Lua sandbox escape)

Six attack modules generate labeled data aligned with MITRE ATT&CK:

| Module | MITRE Technique | Description |
|---|---|---|
| recon | T1046 Network Service Discovery | SYN port scan via scapy raw sockets |
| ssh_brute_force | T1110.001 Password Guessing | SSH brute force via paramiko with IP aliasing |
| credential_stuffing | T1110.004 Credential Stuffing | Breach dump credentials, each tried once |
| log4shell | T1190 Exploit Public-Facing App | Log4Shell JNDI injection via HTTP |
| redis_lua_escape | T1190 Exploit Public-Facing App | Redis Lua sandbox escape (CVE-2022-0543) |
| ssh_post_auth | T1059.004 Unix Shell | Post-auth command execution + payload download |

Campaigns are defined in YAML configs with timing profiles, IP aliasing strategies, and automatic labeling through time+IP window matching. I scripted full kill chain campaigns — recon → credential stuffing → post-auth execution, and recon → Redis exploit → SSH pivot. The exploit scripts are provided in the repository for anyone who wants to create their own data sets. The v2 campaign database also includes eBPF kernel events (process, network, and file activity) captured during attacks, giving the agent visibility into both log-level and kernel-level indicators of compromise.
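The time+IP window labeling rule is simple enough to sketch. This is illustrative, not the repository's actual schema: `AttackWindow` and the field names here are my own.

```python
from dataclasses import dataclass

@dataclass
class AttackWindow:
    """Hypothetical record for one attack phase: source IP plus a time window."""
    src_ip: str
    start: float  # unix timestamps
    end: float

def label_event(ts: float, src_ip: str, windows: list[AttackWindow]) -> int:
    """Return 1 (malicious) if the event falls inside any attack window
    for the same source IP, else 0 (benign)."""
    for w in windows:
        if w.src_ip == src_ip and w.start <= ts <= w.end:
            return 1
    return 0
```

Because the attack orchestrator knows exactly which IPs it used and when, this matching labels every log line and kernel event without manual annotation.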

Benign log data comes from servers hosting my own infrastructure plus old backups of personal servers. I ran those logs through filters to strip sensitive data and normalize hostnames and IPs so they appear to originate from a single server.

The resulting dataset:

| Database | Benign | Malicious | eBPF | Total |
|---|---|---|---|---|
| benign_v3.db | 7,915,858 | 0 | 0 | 7,915,858 |
| campaigns_v2.db (+eBPF) | 30,032 | 30,436 | 48,928 | 60,468 |
| Raw total | 7,945,890 | 30,436 | 48,928 | 7,976,326 |

eBPF events are a subset of the benign/malicious counts (already classified by time+IP window matching), not additive. Total = Benign + Malicious.

These raw databases are composed into experiment streams of varying duration and attack density. The largest is a 365-day realistic scenario that contains 9.4 million events across 897 attack campaigns, with benign traffic cycled to fill the full year.

eBPF: Seeing What Logs Can't

I've done plenty of traffic analysis and firewall work with Berkeley Packet Filters as a sysadmin, but had never touched eBPF. That made this one of the more interesting parts to build. Traditional log files only capture what applications choose to log. eBPF lets you attach to kernel tracepoints and observe syscall-level activity that's invisible to auth.log or syslog.

The eBPF collector daemon attaches to six kernel tracepoints via BCC:

| Channel | Tracepoints | Key Fields |
|---|---|---|
| process_events | sys_enter_execve, sched_process_exit | pid, ppid, uid, comm, parent_comm, args |
| network_events | sys_enter_connect, sys_enter_accept4 | pid, uid, comm, dst IP:port |
| file_events | sys_enter_openat, sys_enter_unlinkat | pid, comm, path, flags |

Process events include parent process ancestry via task->real_parent, which enables causal chain reconstruction. When an attacker exploits Log4Shell, the process chain looks like java -> bash -> wget. That is visible in the eBPF events but difficult to reconstruct from the application logs. An agent that can learn to recognize these causal chains has a significant advantage over one limited to text logs.
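With ppid on every process event, reconstructing a causal chain is just a walk up the parent links. A minimal sketch, assuming execve events have been indexed by pid (the dict shape is my own, not the environment's internal representation):

```python
def ancestry_chain(events: dict[int, tuple[int, str]], pid: int) -> list[str]:
    """Walk ppid links from pid toward the root, returning comm names
    ordered ancestor-first. `events` maps pid -> (ppid, comm)."""
    chain: list[str] = []
    seen: set[int] = set()
    while pid in events and pid not in seen:
        seen.add(pid)  # guard against pid reuse creating a cycle
        ppid, comm = events[pid]
        chain.append(comm)
        pid = ppid
    return list(reversed(chain))
```

For the Log4Shell case above, three execve events are enough to recover the java -> bash -> wget chain.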

The labeler applies the same time+IP window matching to both log events and eBPF kernel events. An execve wget from the attacker's IP during an attack phase is correctly labeled malicious.

Environment Design

What the Agent Sees

The agent observes a Dict space with six text channels and one numeric channel:

| Channel | Content |
|---|---|
| auth_log | SSH authentication events (/var/log/auth.log) |
| syslog | System events (/var/log/syslog) |
| web_log | Combined web access and error logs |
| process_events | eBPF: execve, exit kernel events |
| network_events | eBPF: connect, accept socket events |
| file_events | eBPF: open, unlink file events |
| system_stats | [load_avg, mem_used_frac, disk_used_frac] |

Each text channel is a ring buffer of recent lines (configurable tail_lines and max_chars), updated on every step. The agent sees the same data a SOC analyst would look at: raw log lines rather than pre-extracted features.
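A ring-buffer channel along those lines can be sketched in a few lines; this assumes the tail_lines/max_chars semantics described above and isn't the package's actual class:

```python
from collections import deque

class TextChannel:
    """Rolling window over one log stream: keep the last `tail_lines`
    lines, and truncate to the most recent `max_chars` when rendered."""

    def __init__(self, tail_lines: int = 50, max_chars: int = 4096):
        self.lines: deque[str] = deque(maxlen=tail_lines)  # old lines fall off
        self.max_chars = max_chars

    def push(self, line: str) -> None:
        self.lines.append(line)

    def render(self) -> str:
        text = "\n".join(self.lines)
        return text[-self.max_chars:]  # keep the newest characters
```

On each environment step, new events are pushed and the rendered string becomes that channel's observation.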

What the Agent Does

I'm still working on prediction and feature discovery, so I don't need actions yet, but they're built into Security-Gym for when I get to control.

The action space is a Dict with a discrete defensive action and a continuous risk score:

| Action | Effect |
|---|---|
| pass | Continue monitoring |
| alert | Flag for human review |
| throttle | Rate-limit source IP (~90% drop) |
| block_source | Add source IP to firewall blocklist |
| unblock | Remove source IP from blocklist/throttle list |
| isolate | Quarantine server (block all network events) |

It's going to be interesting to see how actions affect future observations. Block an attacker, for example, and the attack signal disappears, erasing the evidence that the attack is ongoing. I'll also need a way to surface false positives to the agent so I can assign negative rewards. The environment maintains a defense state (blocklist, throttle list, isolation mode) that filters what the agent sees on future steps. This creates a partially observable feedback loop that doesn't exist in classification-based intrusion detection.
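A minimal sketch of such a defense state, assuming per-event filtering by source IP (the class and method names are mine, not the environment's API):

```python
import random

class DefenseState:
    """Tracks blocked/throttled IPs and decides whether a source's events
    reach the agent's observation on subsequent steps."""

    def __init__(self, seed: int = 0, throttle_drop: float = 0.9):
        self.blocked: set[str] = set()
        self.throttled: set[str] = set()
        self.isolated = False
        self.throttle_drop = throttle_drop  # ~90% of throttled traffic dropped
        self.rng = random.Random(seed)

    def visible(self, src_ip: str) -> bool:
        if self.isolated or src_ip in self.blocked:
            return False  # event never appears in the logs
        if src_ip in self.throttled:
            return self.rng.random() >= self.throttle_drop
        return True
```

Applied as a filter in front of the text channels, this is exactly the feedback loop described above: a successful block silences its own evidence.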

Rewards

Again, not used yet, but built for future use.

The reward function is asymmetric, reflecting the real cost structure of security operations:

| Action | During Attack | During Benign |
|---|---|---|
| block_source | +1.0 | -1.0 |
| throttle | +0.75 | -0.5 |
| alert | +0.5 | -0.3 |
| pass | -0.5 | 0.0 |
| isolate | +0.25 | -2.0 |

Blocked and throttled events also accumulate ongoing reward: +0.05 per blocked attack event, -0.1 per blocked benign event. The agent feels the sustained cost of false positives. Isolation is the nuclear option — extremely costly unless the situation truly warrants it.
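The reward structure above can be encoded directly; this is a sketch of the published numbers, with hypothetical function and parameter names, not the environment's source:

```python
# Step rewards keyed by action: (reward during attack, reward during benign)
REWARD = {
    "block_source": (+1.0, -1.0),
    "throttle":     (+0.75, -0.5),
    "alert":        (+0.5, -0.3),
    "pass":         (-0.5,  0.0),
    "isolate":      (+0.25, -2.0),
}
PER_BLOCKED_ATTACK = +0.05   # ongoing reward per blocked attack event
PER_BLOCKED_BENIGN = -0.1    # ongoing cost per blocked benign event

def step_reward(action: str, under_attack: bool,
                blocked_attack_events: int = 0,
                blocked_benign_events: int = 0) -> float:
    base = REWARD[action][0 if under_attack else 1]
    return (base
            + PER_BLOCKED_ATTACK * blocked_attack_events
            + PER_BLOCKED_BENIGN * blocked_benign_events)
```

The asymmetry is visible in the numbers: one wrongly isolated server (-2.0) costs four correct alerts (+0.5 each), and every benign event caught behind a block keeps charging -0.1.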

Composed Experiment Streams

Security-Gym includes a StreamComposer that offline-mixes benign and attack EventStore databases into composed experiment streams:

| Stream | Benign | Malicious | eBPF | Total |
|---|---|---|---|---|
| exp_7d_brute | 114,335 | 25,980 | 24,883 | 140,315 |
| exp_30d_heavy | 493,735 | 509,340 | 474,868 | 1,003,075 |
| exp01_90d | 1,446,244 | 474,915 | 446,915 | 1,921,159 |
| exp_365d_realistic | 7,766,934 | 1,644,012 | 1,552,102 | 9,410,946 |

Composition is deterministic given a seed. Attack campaigns are scheduled via Poisson process with MITRE ATT&CK-weighted type distributions (discovery: 0.35, brute_force: 0.30, web_exploit: 0.20, credential_stuffing: 0.10, execution: 0.05). Intra-session timing is preserved when transplanting attack sessions into the benign stream so the temporal characteristics of each attack type are realistic.
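The scheduling scheme can be sketched as follows. Function and parameter names are hypothetical, but the Poisson arrivals (exponential inter-arrival gaps) and the ATT&CK-weighted type draw follow the description above:

```python
import random

ATTACK_WEIGHTS = {
    "discovery": 0.35, "brute_force": 0.30, "web_exploit": 0.20,
    "credential_stuffing": 0.10, "execution": 0.05,
}

def schedule_campaigns(duration_s: float, rate_per_s: float, seed: int):
    """Sample campaign start times as a Poisson process over the stream
    and draw each campaign's type from the weighted distribution.
    Deterministic for a fixed seed."""
    rng = random.Random(seed)
    types, weights = zip(*ATTACK_WEIGHTS.items())
    t, schedule = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap => Poisson arrivals
        if t >= duration_s:
            break
        schedule.append((t, rng.choices(types, weights)[0]))
    return schedule
```

Each scheduled (start time, type) pair then gets an attack session transplanted into the benign stream with its intra-session timing preserved.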

Agent Validation

I built a standalone RL security daemon, rlsecd, to validate Security-Gym as a learning environment. rlsecd processes one event at a time with per-step predict/update. No batching, no replay buffers. I tested against Security-Gym's composed streams and had some promising results on the first few tests. The full architecture, detection results, and feature importance analysis are in the rlsecd post.

Getting Started

pip install security-gym

# Download the datasets
security-gym download

# Use it like any Gymnasium environment
import gymnasium as gym

env = gym.make("SecurityLogStream-v1", db_path="exp_30d_heavy.db")
obs, info = env.reset()

while True:
    action = agent.act(obs)  # plug in your own agent here
    obs, reward, terminated, truncated, info = env.step(action)
    agent.learn(obs, reward)
    if truncated:            # end of data; terminated is always False
        break

The environment is Gymnasium v1.0.0+ compliant. The standard RL training loop works unmodified — the only difference is that terminated is always False and truncated signals end of data.

What's Next

Security-Gym currently covers 5 of hundreds of ATT&CK techniques. Persistence (TA0003), privilege escalation (TA0004), and exfiltration (TA0010) modules are planned. I'd also like to build a streaming server mode for real-time agent evaluation and expand beyond single-host to multi-host network environments with lateral movement.

On the agent side, the current baseline results use hand-crafted features from rlsecd. The bigger challenge, and the whole point of presenting raw text observations, is getting an agent to learn its own representations directly from log lines. That's Step 2 (Representation II) of the Alberta Plan, and it's where I'm focusing right now, though it's tempting to jump ahead to control and see rlsecd act.

The datasets, environment, and all the code are open source. If you're working on RL for cybersecurity or continual learning on non-stationary data, I'd be interested to hear what you find.


Lab Setup:

  • OS: Debian 13.3 (Trixie)
  • GPU: NVIDIA RTX 3070 (8GB VRAM, 5,888 CUDA cores)
  • CPU: Intel i5-12400 (6 cores, 12 threads)
  • Memory: 24GB RAM

Data Collection Infrastructure:

  • Target VM: "Isildur" — Debian 11, SSH + Log4Shell + Nginx + Redis, BCC 0.18 for eBPF
  • Attack orchestration: 6 MITRE ATT&CK-aligned modules with YAML campaign configs
  • eBPF tracepoints: execve, connect, accept4, openat, unlinkat via BCC

References:

  • Elsayed, M., Vasan, G., & Mahmood, A. R. (2024). Streaming deep reinforcement learning finally works. arXiv:2410.14606.
  • Sutton, R. S., Bowling, M., & Pilarski, P. M. (2022). The Alberta Plan for AI research. arXiv:2208.11173.
  • Towers, M. et al. (2024). Gymnasium: A standard interface for reinforcement learning environments. arXiv:2407.17032.
  • MITRE Corporation. (2024). ATT&CK: Adversarial Tactics, Techniques, and Common Knowledge.