Is this a game, or is it real?

rlsecd: A Streaming RL Security Daemon

In my Security-Gym post I described the environment I built to generate realistic cybersecurity log streams for RL experiments. The next thing I wanted to test was whether a continual, online RL agent processing one event at a time can actually detect attacks in that data.

So I built rlsecd (RL Security Daemon), a standalone Linux daemon that monitors server logs from the Security-Gym observation space and classifies sessions as malicious or benign using a streaming RL agent. It grew out of my earlier chronos-sec experiment codebase, repackaged as a standalone project with the goal of running it as a systemd service on production servers. This post covers the architecture, the experiment results across three Security-Gym datasets, and what I learned running a continual learning agent on 9.4 million events.

Two Modes, Same Code Path

rlsecd has two operational modes:

  1. Production: tail real logs (e.g. auth.log, syslog) on a Linux server, collect kernel events from eBPF, classify sessions in real time, and emit alerts via syslog or webhooks. I haven't tested this mode yet.
  2. Experiment: Replay Security-Gym EventStore databases through the same pipeline (rlsecd --gym <db>)

The experiment mode uses the exact same daemon code path as production. The only difference is the log source: instead of tailing a file, it reads from Security-Gym's observation space generated from SQLite event stores. This means the results I get from experiments are representative of what the daemon will do in production.

Architecture of rlsecd

The codebase is organized into six packages with a strict JAX boundary: only one module imports JAX and alberta-framework. Everything else is pure Python and numpy.

Log sources are pluggable via a LogSourceRegistry that maps config type strings to source classes with a @register('type') decorator. This is how the same daemon handles both production log tailing and experiment replay — the YAML config (or --gym flag) selects the source at startup without any code changes.
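The registry pattern can be sketched like this. This is a minimal illustration of the idea, not rlsecd's actual code; the class and source names are stand-ins:

```python
# Minimal sketch of a pluggable source registry. LogSourceRegistry is named
# after the post, but this implementation and the source classes below are
# illustrative assumptions, not rlsecd's actual API.

class LogSourceRegistry:
    """Maps config 'type' strings to log source classes."""
    _sources = {}

    @classmethod
    def register(cls, type_name):
        def decorator(source_cls):
            cls._sources[type_name] = source_cls
            return source_cls
        return decorator

    @classmethod
    def create(cls, config):
        # The YAML config (or --gym flag) picks the source at startup.
        return cls._sources[config["type"]](config)


@LogSourceRegistry.register("file_tail")
class FileTailSource:
    def __init__(self, config):
        self.path = config["path"]


@LogSourceRegistry.register("gym_replay")
class GymReplaySource:
    def __init__(self, config):
        self.db = config["db"]


# Same daemon, different source, zero code changes:
source = LogSourceRegistry.create({"type": "gym_replay", "db": "events.sqlite"})
```

The decorator does all the wiring at import time, so adding a new source type is one class plus one line of config.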

The event pipeline:

Log Source (auth.log tail or EventStore replay)
    |  <- config-driven via LogSourceRegistry
    v
Session Tracker -> SessionState
    |  <- groups events by session ID, tracks auth attempts/methods/usernames
    v
State Builder -> 12-feature observation vector
    v
Security Agent (5-head MLP)
    |-- predict(obs) -> per-head scores
    \-- update(obs, labels) <- NaN-masked partial ground truth
    v
Session Score Tracker (per-session EMA, decay=0.8)
    |  <- smooths noisy per-event malicious predictions
    v
Alert Generator (tiered thresholds + per-session cooldown)
    v
Alert Sinks (syslog, SQLite, webhook/Slack)

The daemon runs an asyncio event loop that processes events one at a time as they arrive from log sources. For each event: track the session, build an observation, predict, learn from ground truth if available, smooth the score via session EMA, and check alert thresholds. There is no batching and no replay buffer. This is to adhere to the Alberta Plan's temporal uniformity.
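The per-event cycle can be sketched as a self-contained toy. Every class here is a stub standing in for rlsecd's real components; only the control flow matches the description above (one event in, one predict, one optional update, EMA smoothing, alert check):

```python
import asyncio

# Illustrative stubs; not rlsecd's actual classes.

class StubAgent:
    def predict(self, obs):
        return {"is_malicious": obs["failures"] / 10.0}

    def update(self, obs, label):
        pass  # an online gradient step would go here: no batching, no buffer


class EmaScores:
    """Per-session EMA smoothing (decay=0.8, seeded at the first prediction)."""
    def __init__(self, decay=0.8):
        self.decay, self.scores = decay, {}

    def update(self, sid, p):
        prev = self.scores.get(sid, p)
        self.scores[sid] = self.decay * prev + (1 - self.decay) * p
        return self.scores[sid]


async def process_stream(events, agent, ema, threshold=0.55):
    alerts = []
    for event in events:                       # a real source would be awaited
        obs = {"failures": event["failures"]}  # stand-in for the 12 features
        pred = agent.predict(obs)
        if event.get("label") is not None:
            agent.update(obs, event["label"])  # learn only when labels exist
        score = ema.update(event["sid"], pred["is_malicious"])
        if score > threshold:
            alerts.append(event["sid"])        # a real alerter adds cooldowns
        await asyncio.sleep(0)                 # yield between events
    return alerts


events = [{"sid": "a", "failures": 9, "label": 1.0}] * 5
alerts = asyncio.run(process_stream(events, StubAgent(), EmaScores()))
```

The `await asyncio.sleep(0)` is the only concession to the event loop; each event is otherwise fully processed before the next one is touched, which is exactly the temporal-uniformity property.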

The 12-Feature State Vector

These features aren't extracted from individual log lines, since no single event contains all 12 values. Instead, the SessionTracker groups events by session ID and incrementally updates a SessionState object as events arrive from any configured source. Each event contributes what it has: an AUTH_FAILURE increments the failure count and adds the username to the set, a SESSION_OPEN might only contribute an IP and timestamp, and an eBPF process_exec event currently maps to OTHER but still updates session timing. The ProductionStateBuilder then reads the accumulated session state to produce the feature vector. Features that haven't been populated yet use neutral defaults: 0.5 for "unknown" (ip_reputation, known_user, password_vs_key, success_rate) and 0 for "nothing observed" (auth attempts, unique usernames, session count). Temporal features are always available since every event has a timestamp. This means the very first event in a session produces a valid 12-dimensional vector that becomes more informative as the session progresses.
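The accumulation pattern, in miniature. The field names mirror the feature table below, but the classes themselves are illustrative sketches, not rlsecd's actual SessionState or ProductionStateBuilder:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative sketch; real SessionState tracks more than this.

@dataclass
class SessionState:
    auth_failures: int = 0
    usernames: set = field(default_factory=set)
    ip: Optional[str] = None
    first_ts: Optional[float] = None
    last_ts: Optional[float] = None

    def apply(self, event: dict) -> None:
        ts = event["ts"]
        self.first_ts = ts if self.first_ts is None else self.first_ts
        self.last_ts = ts                     # every event updates timing
        if event["type"] == "AUTH_FAILURE":
            self.auth_failures += 1
            self.usernames.add(event["user"])
        elif event["type"] == "SESSION_OPEN":
            self.ip = event.get("ip")
        # anything else (e.g. eBPF process_exec -> OTHER): timing only


def build_features(s: SessionState) -> dict:
    # Unpopulated features fall back to neutral defaults:
    # 0.5 for "unknown", 0 for "nothing observed yet".
    return {
        "ip_reputation": 0.5,                 # stubbed threat intel
        "known_user": 0.5,                    # unknown until checked
        "auth_attempts": float(s.auth_failures),
        "unique_usernames": float(len(s.usernames)),
    }
```

Even a session with a single event produces a complete (if mostly-default) feature dict, which is what lets the agent score from the first event onward.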

The ProductionStateBuilder extracts 12 features from each session's current state:

Feature             Description
ip_reputation       Threat intel score (stubbed at 0.5 for now)
auth_attempts_1min  Auth attempts in last 60 seconds
auth_attempts_5min  Auth attempts in last 5 minutes
unique_usernames    Distinct usernames tried in this session
password_vs_key     Auth method (0=password, 1=key)
known_user          Whether the username exists on the system
hour_sin, hour_cos  Time of day (cyclical encoding)
day_sin, day_cos    Day of week (cyclical encoding)
session_count_24h   Sessions from this IP in last 24 hours
success_rate        Historical auth success rate for this IP
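The cyclical encoding in the table works like this (a standard sin/cos trick, shown here as a generic sketch; which weekday maps to day 0 is an assumption):

```python
import math

# sin/cos encoding keeps hour 23 and hour 0 adjacent in feature space,
# which a raw 0-23 integer would not.

def cyclical(value: float, period: float) -> tuple:
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


hour_sin, hour_cos = cyclical(23, 24)  # 11 PM
day_sin, day_cos = cyclical(0, 7)      # day 0 of the week (mapping assumed)
```

With this encoding, midnight and 11 PM are close together, so the network doesn't have to learn that the hour scale wraps around.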

These are hand-crafted features, which I know is not ideal. The whole point of presenting raw text observations in Security-Gym is eventually getting an agent to learn its own representations. But for validating the daemon pipeline and establishing baselines, explicit features work.

The 5-Head MLP

The agent uses a single MultiHeadMLPLearner from alberta-framework with five prediction heads sharing a common MLP trunk (two hidden layers of 64 units):

Head  Name           Type         Range
0     is_malicious   Binary       0/1
1     attack_type    Categorical  8 types (normalized to [0,1])
2     attack_stage   Ordinal      5 stages (normalized to [0,1])
3     severity       Ordinal      0-3 (normalized to [0,1])
4     session_value  Continuous   [0,1] scaled

NaN-masked targets allow partial label updates — if only is_malicious is available, the other heads are skipped during gradient computation. The trunk uses Autostep optimization (Mahmood et al. 2012) with ObGD bounding, EMA normalization, layer normalization, and sparse initialization. This is the stack from Elsayed et al. 2024 with Autostep as the step-size algorithm for tuning-free adaptation.
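The NaN-masking idea reduces to a few lines. This is a numpy sketch of the loss-masking pattern with a squared-error loss; alberta-framework's actual loss and shapes may differ:

```python
import numpy as np

# Heads without ground truth get NaN targets and contribute zero loss
# (and therefore zero gradient). Shapes and loss choice are illustrative.

def masked_squared_error(preds: np.ndarray, targets: np.ndarray) -> float:
    mask = ~np.isnan(targets)                 # heads with labels available
    diff = np.where(mask, preds - np.nan_to_num(targets), 0.0)
    return float(np.sum(diff ** 2))           # unlabeled heads add nothing


preds = np.array([0.9, 0.2, 0.1, 0.3, 0.5])               # 5 heads
targets = np.array([1.0, np.nan, np.nan, np.nan, np.nan])  # only is_malicious
loss = masked_squared_error(preds, targets)   # only head 0 contributes
```

Because the masking happens inside the loss, a single update call handles any subset of labels without branching in the training loop.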

Results: 7 Days to 365 Days

I ran rlsecd against all three of Security-Gym's composed experiment streams. All runs used the default configuration: Autostep optimizer, detection threshold 0.55, two hidden layers of 64 units.

End-to-End Performance

Dataset         Events     Time    Throughput   is_mal MAE  atk_type MAE  atk_stage MAE  severity MAE
7d brute        140,315    71s     1,970 evt/s  0.0004      0.0080        0.0126         0.0245
30d heavy       1,003,075  380s    2,638 evt/s  0.0026      0.0052        0.0028         0.0045
365d realistic  9,410,946  6,063s  1,552 evt/s  0.0266      0.0080        0.0050         0.0100

The 365-day run processed 9.4 million events in 101 minutes on CPU. The agent learns continuously through the entire stream with no divergence.

365-Day Detection (threshold=0.55)

Per-event:

Metric                Value
Detection rate (TPR)  99.4%
False positive rate   0.0%
Precision             96.6%
F1                    97.9%
Confusion             TP=38,983  FP=1,387  TN=7,765,547  FN=251

Per-session:

Metric                Value
Detection rate (TPR)  99.8%
False positive rate   0.0%
Precision             100.0%
F1                    99.9%
Confusion             TP=4,443  FP=0  TN=0  FN=11

Zero false positive sessions over a full simulated year. Every alert the daemon produced was a real attack. The 11 missed sessions were all post-exploitation execution stage: an attacker who brute-forced in on one session later returns for command execution, and the execution session itself looks like normal authenticated activity. The 12-feature state vector has no cross-session context, so the agent can't link the reconnaissance/brute-force to subsequent execution from the same IP. A SessionCorrelator (planned for Phase 2) would close this gap.
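Since the SessionCorrelator doesn't exist yet, here is only a guess at its shape: remember IPs with recent malicious verdicts so a later clean-looking session from the same IP inherits suspicion. Everything below is a hypothetical Phase 2 sketch:

```python
# Hypothetical SessionCorrelator sketch; not implemented in rlsecd yet.
# The 24h window and 0.9 prior are arbitrary illustration values.

class SessionCorrelator:
    def __init__(self, window_s: float = 24 * 3600):
        self.window_s = window_s
        self.flagged = {}  # ip -> timestamp of last malicious verdict

    def record_malicious(self, ip: str, ts: float) -> None:
        self.flagged[ip] = ts

    def prior_for(self, ip: str, ts: float) -> float:
        last = self.flagged.get(ip)
        if last is not None and ts - last <= self.window_s:
            return 0.9     # returning attacker: boost the session score
        return 0.0


corr = SessionCorrelator()
corr.record_malicious("203.0.113.7", ts=1_000.0)
boost = corr.prior_for("203.0.113.7", ts=5_000.0)  # within the 24h window
```

Even a lookup table this crude would catch the 11 misses, since every one was preceded by a detected brute force from the same IP.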

Threshold Sweep

The detection threshold is the cutoff applied to the session EMA score. Sessions scoring above it are classified as malicious and generate alerts, sessions below are treated as benign. Too low and benign sessions trigger false alarms; too high and real attacks slip through undetected.

I ran threshold sweeps to find the optimal operating point. The agent's predictions have a sharp decision boundary around 0.50-0.51, so the threshold choice matters.
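The sweep itself is simple once per-session scores and labels are stored. This is a generic sketch with made-up data, not rlsecd's actual evaluation tooling:

```python
# Sweep candidate thresholds over final per-session EMA scores and
# pick the max-F1 operating point. Data below is invented for illustration.

def sweep(scores, labels, thresholds):
    best = None
    for t in thresholds:
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        fn = sum(s < t and y for s, y in zip(scores, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if best is None or f1 > best[1]:
            best = (t, f1)
    return best


scores = [0.52, 0.61, 0.70, 0.48, 0.66]   # session EMA scores (made up)
labels = [False, True, True, False, True]
best_t, best_f1 = sweep(scores, labels, [0.50, 0.55, 0.60, 0.65])
```

With a sharp decision boundary, F1 is flat across a band of thresholds and then falls off a cliff, which is why the sweep matters more than it would for a smoother scorer.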

Dataset       Threshold  TPR    FPR   Precision  F1
7d (max F1)   0.58       99.5%  0.0%  99.8%      99.6%
30d (max F1)  0.59       98.7%  0.1%  97.1%      97.9%

Per-session detection on the 7d dataset was perfect (100% TPR, 0% FPR) at any threshold up to 0.66. On the 30d dataset: 99.1% TPR, 0% FPR at threshold=0.60, with 12 missed sessions showing the same execution-stage failure mode.

The recommended default of 0.55 balances 99%+ recall with near-zero false positive rate across all datasets.

Feature Importance: What the Agent Learned

I ran Jacobian sensitivity analysis on the 365-day dataset to understand what the agent actually learned. This computes the mean absolute Jacobian $|\partial\text{Prediction}/\partial\text{Feature}|$ averaged over observation samples.
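The computation pattern, with a toy linear+tanh head standing in for the trained trunk. Only the averaging and normalization are the point; the model here is not the real 5-head MLP:

```python
import numpy as np

# Mean-|Jacobian| sensitivity for a toy scalar head pred = tanh(obs @ w).
# Analytic gradient: d pred / d obs = (1 - tanh^2(obs @ w)) * w.

rng = np.random.default_rng(0)
w = rng.normal(size=(12,))               # toy weights, one scalar head
obs = rng.normal(size=(256, 12))         # sampled observations

z = obs @ w
jac = (1 - np.tanh(z) ** 2)[:, None] * w[None, :]   # (256, 12) Jacobian rows

sensitivity = np.abs(jac).mean(axis=0)   # mean |dPred/dFeature| per feature
share = sensitivity / sensitivity.sum()  # fraction of total, as in the tables
```

For the real network the per-observation gradients come from autodiff rather than a hand-derived formula, but the averaging and share normalization are the same.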

is_malicious head (primary detection):

Feature              Sensitivity Share
auth_attempts_5min   41.9%
day_cos              22.3%
day_sin              10.4%
success_rate         7.7%
ip_reputation        7.6%
unique_usernames     4.5%
session_count_24h    2.3%
Others (5 features)  <3.3% combined

The top finding: auth_attempts_5min dominates detection at 41.9%. High authentication volume in a 5-minute window is the strongest attack signal, which makes intuitive sense since brute force is the most common attack type in the data. I'm working on adding exploit-based initial access to see how the agent performs when the initial compromise isn't brute force.

The temporal features split in an interesting way. day_cos and day_sin encode day of week and contribute 32.7% combined. The Security-Gym campaign scheduler uses Poisson rates that vary by day, so the agent learned that attack probability depends on which day it is. In contrast, hour_sin and hour_cos encode hour of day and are nearly irrelevant (<0.2% combined). Within a given day, attacks are spread across all hours rather than clustering around business hours or off-hours. The hour features are candidates for removal or replacement with something more useful.

ip_reputation shows 58.6% sensitivity on the attack_type head, but this is misleading since ip_reputation is currently stubbed at a constant 0.5 for all events. The agent has learned large weights on this constant input, effectively using it as a bias term for attack type classification rather than responding to actual threat intelligence data. Wiring this feature to a real threat intel feed would likely improve attack type discrimination significantly.

Top 5 features account for 90% of is_malicious sensitivity. The agent figured out what matters.

Throughput: From 48 to 2,514 Events/Second

The first version of rlsecd processed about 48 events per second. Profiling revealed the bottleneck: un-JIT'd JAX predict/update calls taking ~20ms each.

As of alberta-framework v0.13.0, JIT compilation is built into MultiHeadMLPLearner upstream. After warmup traces at init, predict/update calls drop to ~0.3ms, a ~52x end-to-end throughput gain.
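The JIT-at-init pattern looks roughly like this (toy function and shapes; not alberta-framework's actual code):

```python
import jax
import jax.numpy as jnp

# Trace and compile once with dummy inputs at init, so the first real
# event doesn't pay the multi-millisecond compilation cost.

@jax.jit
def predict(params, obs):
    return jnp.tanh(obs @ params)   # toy stand-in for the 5-head forward pass


params = jnp.zeros((12, 5))
dummy = jnp.zeros((12,))
predict(params, dummy).block_until_ready()   # warmup trace at init
# subsequent calls with the same shapes reuse the compiled kernel
```

The warmup only helps if every later call uses the same shapes and dtypes, which a fixed 12-feature observation vector guarantees.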

Component        Mean (us)  % Wall Time
update           307        60.7%
predict          148        29.3%
build_obs        12         2.4%
alert_check      5          1.0%
session_track    1.4        0.4%
source/IO/async  --         6.2%

The agent's predict/update cycle consumes 90% of wall time. The remaining pipeline (session tracking, observation building, alerting) is negligible. Further throughput gains would require either faster JAX kernels or pipelining the predict/update off the critical path.

What's Next

The immediate gap is cross-session correlation. The 11 missed sessions in the 365-day run are all from attackers who return after an initial brute force. A SessionCorrelator that links recent brute-force activity from the same IP to subsequent sessions would likely close this gap entirely.

Beyond that, the 12-feature state vector is a stopgap. The whole point of Security-Gym presenting raw text observations is to eventually learn representations directly from log lines for Step 2 (Representation II) of the Alberta Plan. rlsecd's Phase 2 will port the feature hashing pipeline from chronos-sec to replace hand-crafted features with learned ones.

The detection threshold is also hand-tuned. Alberta Plan Steps 3-4 (control and planning) describe how an agent should learn to set its own thresholds based on the cost structure of false positives vs false negatives. That's also on the todo list.

I'm eager to test this as an actual daemon on a public, Internet-facing server. I plan to install it on a server that hosts some of my personal sites to see how it performs. The daemon already has a systemd unit file, install/uninstall scripts, and config validation. The plan is to run it in shadow mode on a real server to validate performance on live traffic before moving into control.

rlsecd is still in the experiment phase, so I haven't published it publicly yet, but alberta-framework (where the RL algorithms are implemented) and Security-Gym (the Gymnasium environment these experiments used) are both available.


Lab Setup:

  • OS: Debian 13.3 (Trixie)
  • CPU: Intel i5-12400 (6 cores, 12 threads) — all rlsecd runs on CPU
  • Memory: 24GB RAM
  • rlsecd: v0.3.2, 139 tests passing, mypy strict, ruff clean

References:

  • Elsayed, M., Vasan, G., & Mahmood, A. R. (2024). Streaming deep reinforcement learning finally works. arXiv:2410.14606.
  • Mahmood, A. R., Sutton, R. S., Degris, T., & Pilarski, P. M. (2012). Tuning-free step-size adaptation. ICASSP 2012.
  • Sutton, R. S., Bowling, M., & Pilarski, P. M. (2022). The Alberta Plan for AI research. arXiv:2208.11173.