Is this a game, or is it real?

RL Roadmap - Updated 3 Month Plan

I'm a month into my initial Reinforcement Learning Roadmap and a couple of things have changed. First, I've been accepted to the University of Michigan-Dearborn to begin a Doctor of Engineering degree in January 2026. Second, my learnings from the first four weeks have convinced me that I need to focus more on the foundations of RL and less on other people's implementations. The Hugging Face course on RL is great, but it's higher-level than I want to be right now. This updated roadmap focuses on continued study of the basics and on my own implementations of the algorithms, which I have started in a Python module I've dubbed learnrl.

To prepare for the D.Eng., the updated plan removes all the Hugging Face courses and replaces that time with reading research papers and a capstone project. The revised plan is:

  1. Complete Coursera courses C1-C3 by end of Week 4 (✅ on track)
  2. Complete Coursera C4 capstone in 2 intensive weeks (Weeks 5-6)
  3. Use 6 full weeks (Weeks 7-12) for healthcare IAM research capstone
  4. Implement algorithms in learnrl module throughout, with production-quality implementations for the healthcare capstone

Why compress C4 into 2 weeks? This gives maximum time (6 weeks instead of 4) for the healthcare IAM capstone, which aligns with the Alberta Plan for AI Research (Step 12: Intelligence Amplification) and provides a stronger foundation for D.Eng. dissertation work. The C4 capstone will naturally drive implementation of the key algorithms needed for the IAM project.

For the capstone I'm currently leaning towards something related to Identity and Access Management (IAM), as that is a growing concern as we move more critical resources to the cloud. However, I'm also interested in autonomous networks and may look into something in that area. I'll have to see how things evolve as I continue to learn and dive deeper into the research papers.

3-Month RL Foundations Roadmap (~10 hrs/week)

Cadence: ~10 hrs/week × 12 weeks (completing before the D.Eng. start in January)
Stack: Python, NumPy, PyTorch, Gymnasium, pytest, MLflow, Docker
Focus: Deep understanding of RL fundamentals through from-scratch implementations, applied to healthcare cybersecurity (adaptive IAM?)


Phase 1: Foundations (Weeks 1–4)

| Week | Core RL Learning (7 hrs) | Engineering/Implementation (3 hrs) | Status |
|---|---|---|---|
| 1 | C1 Fundamentals + C2 Sample-based Learning; S&B Ch. 1–6 | Created learnrl/ package; EpsilonGreedyBandit and BanditTestEnvironment with comprehensive tests (sketched below); PolicyIteration/ValueIteration | ✅ COMPLETE |
| 2 | C3 M1–M3: Function approximation foundations, tile coding; S&B Ch. 9 | GridWorld environment, DP algorithm comparisons and visualizations | ✅ COMPLETE |
| 3 | C3 M4: Control with approximation, semi-gradient methods; S&B Ch. 10 | Continued refinement of DP implementations, test coverage to 92% | ✅ COMPLETE |
| 4 | C3 M5: Policy gradient methods; S&B Ch. 11, Ch. 13 | Basic policy gradient (REINFORCE) in learnrl/policy_gradient/; prepare for C4 capstone | 🔄 IN PROGRESS |
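
To give the Week 1 bandit work a concrete shape, here is a minimal sketch of an epsilon-greedy agent with incremental sample-average updates (S&B Section 2.4). The class and method names are illustrative, not the actual learnrl API:

```python
import numpy as np

class EpsilonGreedyAgent:
    """Minimal k-armed bandit agent (illustrative sketch, not the learnrl API)."""

    def __init__(self, k, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.q = np.zeros(k)  # estimated value of each arm
        self.n = np.zeros(k)  # number of pulls per arm
        self.rng = np.random.default_rng(seed)

    def select_action(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q)))
        return int(np.argmax(self.q))

    def update(self, action, reward):
        # Incremental sample-average update: Q += (R - Q) / N.
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]
```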

Math (2-3 hrs/week): 3Blue1Brown Linear Algebra + Khan Academy Probability basics

Key Achievement: On track to complete the C1, C2, and C3 Coursera courses by the end of Week 4, with solid implementations of bandit and dynamic programming algorithms.


Phase 2: Coursera C4 Capstone - Intensive 2-Week Sprint (Weeks 5–6)

Goal: Complete the full Coursera capstone in 2 weeks to maximize time for the healthcare IAM research capstone.

Strategy: Intensive focus (8-10 hrs/week) on getting algorithms working and meeting C4 requirements. Write code in learnrl/capstone_coursera/ during the capstone, then refactor it into clean library modules during Weeks 7-8.

| Week | Coursera Work (8-10 hrs) | Implementation Focus |
|---|---|---|
| 5 | C4 M1-M3 (intensive): formalize Coursera's problem as an MDP, implement the environment, apply 3 algorithms and compare performance, identify key parameters and explore the parameter space | Create the learnrl/capstone_coursera/ environment; implement or adapt TD algorithms (SARSA, Q-Learning, Expected SARSA; update targets sketched below); run algorithm comparison experiments; execute a parameter study (learning rate, epsilon, discount factor) |
| 6 | C4 M4-M5 (intensive): implement Expected SARSA or Q-Learning with neural networks + the RMSProp optimizer, verify correctness; complete the parameter study with statistical analysis, visualize learned agents, submit the C4 capstone | Implement neural network function approximation with PyTorch + RMSProp; statistical comparison utilities (confidence intervals, significance tests); policy/value visualization tools; C4 CAPSTONE SUBMITTED |
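
The three TD control methods in Week 5 share the same update, Q[s, a] += alpha * (target - Q[s, a]); they differ only in the backup target. A compressed tabular sketch of the two targets the comparison hinges on (SARSA simply uses reward + gamma * Q[s', a'] with the sampled next action); function name is mine:

```python
import numpy as np

def backup_targets(Q, next_state, reward, gamma, epsilon):
    """Targets for Q-learning and Expected SARSA under an epsilon-greedy policy."""
    q_next = Q[next_state]  # action values at the next state
    n_actions = len(q_next)

    # Q-learning bootstraps off the greedy action (off-policy).
    q_learning = reward + gamma * np.max(q_next)

    # Expected SARSA takes the expectation over the epsilon-greedy policy.
    pi = np.full(n_actions, epsilon / n_actions)
    pi[np.argmax(q_next)] += 1.0 - epsilon
    expected_sarsa = reward + gamma * np.dot(pi, q_next)

    return q_learning, expected_sarsa
```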

Math (2-3 hrs/week): Hyperparameter optimization, RMSProp and adaptive learning rates, statistical comparison of algorithms (t-tests, confidence intervals)
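
The core statistical tool here is simple enough to sketch: Welch's t-test plus a confidence interval on the difference in mean returns across independent runs. This assumes per-run mean returns have already been collected for two agents; the function name is mine:

```python
import numpy as np
from scipy import stats

def compare_mean_returns(runs_a, runs_b, confidence=0.95):
    """Welch's t-test and CI on the difference in mean return between two agents."""
    a, b = np.asarray(runs_a, float), np.asarray(runs_b, float)
    _, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test

    var_a, var_b = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    se = np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    dof = se**4 / (var_a**2 / (len(a) - 1) + var_b**2 / (len(b) - 1))
    margin = stats.t.ppf(0.5 + confidence / 2, dof) * se

    diff = a.mean() - b.mean()
    return p_value, (diff - margin, diff + margin)
```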

Healthcare/Cyber Reading: Function approximation in RL, safe RL and constrained optimization, begin Alberta Plan paper

Key Decision: Prioritize working code over perfect code during C4. Refactor into production-quality learnrl library modules during healthcare IAM capstone preparation (Weeks 7-8).


Phase 3: Healthcare IAM Research Capstone - Extended 6-Week Project (Weeks 7–12)

Project: Intelligence Amplification for Dynamic Authorization in Healthcare IAM
Alberta Plan Alignment: Step 12 (Intelligence Amplification), real-world IA in a safety-critical domain

Weeks 7-8: Foundation & Infrastructure

Goal: Establish solid theoretical foundation and build core infrastructure

| Week | Core Learning (5 hrs) | Capstone Implementation (5 hrs) |
|---|---|---|
| 7 | Deep dive: Alberta Plan paper (Sutton, Bowling, Pilarski 2023); GVF architecture and Horde; RLHF fundamentals; healthcare PBAC systems | MDP formulation: analyze hospital IAM data structure, formalize the state space (user attributes, role history, peer patterns), action space (role assignments, ITSM escalation), and reward function (security + usability + compliance); an environment skeleton is sketched below. Library refactoring: extract C4 code into clean learnrl/function_approx/ and learnrl/td/ modules |
| 8 | Baselines & evaluation: role mining algorithms, peer-based assignment, safe RL deployment; IA evaluation methodologies; continual learning in non-stationary environments | PBAC simulator: build a simulator using the real hospital PBAC group structure (300+ roles). Baseline policies: implement manual assignment, role-tier heuristics, peer-based recommendation. Evaluation framework: define IA metrics (decision quality, cognitive load, time-to-competence) |
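
To make the MDP formulation concrete before the simulator exists, a Gymnasium environment skeleton along these lines is a useful starting point. Everything here is a placeholder: the feature size, reward weights, class name, and stub methods are illustrative assumptions, not the planned design:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

N_ROLES = 300     # placeholder for the hospital's 300+ PBAC roles
N_FEATURES = 32   # placeholder user-attribute / role-history / peer-pattern features

class DynamicAuthEnv(gym.Env):
    """Skeleton of the dynamic-authorization MDP (shapes illustrative only)."""

    def __init__(self):
        # State: engineered features of the current access request.
        self.observation_space = spaces.Box(
            -np.inf, np.inf, shape=(N_FEATURES,), dtype=np.float32)
        # Action: assign one of N_ROLES roles, or escalate to a human via ITSM.
        self.action_space = spaces.Discrete(N_ROLES + 1)
        self.ESCALATE = N_ROLES

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self._next_request(), {}

    def step(self, action):
        # Reward balances security, usability, and compliance; weights are stand-ins.
        reward = (self._security(action)
                  + 0.5 * self._usability(action)
                  + 0.5 * self._compliance(action))
        return self._next_request(), reward, False, False, {}

    def _next_request(self):
        # Stub: the simulator would draw this from hospital PBAC data.
        return np.zeros(N_FEATURES, dtype=np.float32)

    def _security(self, action):
        return 0.0  # stub

    def _usability(self, action):
        return 0.0  # stub

    def _compliance(self, action):
        return 0.0  # stub
```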

Weeks 9-10: RL+RLHF Implementation

Goal: Implement core RL agent with human feedback integration

| Week | Core Learning (3 hrs) | Capstone Implementation (7 hrs) |
|---|---|---|
| 9 | GVF design: reward-respecting subtasks, prediction targets for IAM. Uncertainty estimation: ensemble disagreement, epistemic vs. aleatoric uncertainty | GVF ensemble: implement 5+ GVFs (predict future access needs per application, audit risk, approval probability). Explainability: use GVF predictions to explain recommendations ("94% of peers have this role"). RL agent: build using algorithms from learnrl |
| 10 | RLHF integration: learning from human preferences (Christiano et al.), reward modeling in practice. Active learning: query strategies for uncertainty | ITSM escalation policy: uncertainty-aware escalation so the agent knows when to ask a human (sketched below). Continual learning: update the reward model from every human decision (ITSM tickets). Temporal uniformity: learning on every timestep, as per the Alberta Plan |
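
The Week 10 escalation policy hinges on the agent knowing when it doesn't know. The recipe implied by the ensemble-disagreement reading in Week 9 treats spread across an ensemble of predictors as an epistemic-uncertainty signal. A sketch, assuming each ensemble member exposes a predict(state) method and that the threshold would be tuned empirically:

```python
import numpy as np

def escalation_decision(ensemble, state, threshold=0.15):
    """Escalate to a human (ITSM ticket) when the GVF ensemble disagrees.

    `ensemble` is any collection of predictors, e.g. each trained on a
    different bootstrap sample of past access decisions. High spread across
    members is read as epistemic uncertainty ("I haven't seen this before"),
    as opposed to aleatoric noise, which more data won't reduce.
    """
    preds = np.array([member.predict(state) for member in ensemble])
    disagreement = preds.std()
    return disagreement > threshold, preds.mean(), disagreement
```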

Weeks 11-12: Evaluation & Publication

Goal: Comprehensive evaluation and paper draft ready for submission

| Week | Core Learning (2 hrs) | Capstone Implementation (8 hrs) |
|---|---|---|
| 11 | Statistical validation: security-specific statistical testing. Safety analysis: HIPAA/PHIPA compliance validation, privilege escalation detection | Comprehensive evaluation: run experiments with IA metrics (decision quality, cognitive load, learning efficiency). Safety analysis: verify HIPAA/PHIPA compliance, test for privilege escalation vulnerabilities. Statistical comparison: compare RL+RLHF vs. baselines with confidence intervals |
| 12 | Academic writing: paper structure for IA/security research. Alberta Plan contribution: framing as a Step 12 validation in a real-world safety-critical domain | Paper draft: "Intelligence Amplification for Dynamic Authorization in Healthcare IAM" (Alberta Plan Step 12 contribution). Demo interface: build a dashboard for IT staff to interact with the agent. Results visualization: learning curves, decision quality over time, GVF prediction accuracy. Final deliverables: paper ready for workshop submission (USENIX HealthSec, SACMAT) |

Math (2-3 hrs/week): Statistics for A/B testing, confidence intervals, GVF prediction accuracy, IA metrics (decision quality, learning efficiency), statistical validation for security applications

Healthcare/Cyber Reading: Role mining and PBAC optimization, RLHF papers (Christiano et al. 2017), Alberta Plan and GVF papers (Sutton et al. 2011, 2023), IA case studies (Licklider 1960, Engelbart 1962), healthcare IAM challenges, safe RL deployment in production

Benefits of the 6-week timeline:

  • Deeper research: time to properly understand and implement Alberta Plan concepts (GVFs, temporal uniformity, continual learning)
  • Better baselines: can implement sophisticated comparison policies (peer-based, role mining algorithms)
  • Stronger evaluation: comprehensive IA metrics, rigorous statistical validation, safety analysis
  • Publication quality: time for thorough writing, peer review from colleagues, workshop submission
  • D.Eng. foundation: stronger starting point for dissertation research


Applied Math Track (2-3 hrs/week)

Focus: Practical application rather than theoretical depth

Weeks 1-4: Foundations

  • 3Blue1Brown – Essence of Linear Algebra (focus on matrix operations, eigenvectors in practice)
  • Khan Academy probability exercises (focus on distributions, expectation)
  • StatQuest – Probability & Bayes

Weeks 5-6: Optimization & Statistics (Intensive)

  • Gradient descent variations and practical considerations
  • Hyperparameter optimization methods (grid search, random search)
  • Statistical significance testing for ML (t-tests, confidence intervals)
  • RMSProp and adaptive learning rates (update rule sketched below)
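
Since RMSProp appears in both this math block and the C4 neural-network module, the update rule itself is worth writing out. A minimal NumPy sketch (no momentum; the default hyperparameters are conventional assumptions):

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update: divide the step by a running RMS of past gradients."""
    cache = decay * cache + (1.0 - decay) * grad**2  # moving average of grad^2
    w = w - lr * grad / (np.sqrt(cache) + eps)       # per-parameter adaptive step
    return w, cache

# Usage: carry `cache` (zeros-initialized, same shape as w) across steps.
```

In the actual C4 work this would come from torch.optim.RMSprop; the point of the sketch is the adaptive per-parameter step size.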

Weeks 7-12: Applied Statistics & Security (Extended)

  • A/B testing methodology for RL experiments
  • Confidence intervals and error analysis
  • Sample efficiency metrics and learning curves
  • Risk quantification and safety constraints in RL
  • Statistical validation for security applications
  • GVF prediction accuracy and uncertainty quantification
  • IA-specific metrics (decision quality, cognitive load, time-to-competence)

Milestone: By Week 12, I should be able to:

  • Implement gradient-based optimization from scratch
  • Design statistically valid RL experiments with safety constraints
  • Explain the mathematical concepts behind RL algorithms in engineering terms
  • Apply statistical methods to evaluate security/healthcare RL systems


Healthcare/Cybersecurity Paper Reading (1-2 hrs/week)

Focus: RL applications in healthcare and cybersecurity domains

Weeks 1-4: Foundations & Survey Papers

  • Survey papers on RL in healthcare and cybersecurity
  • Case studies of ML/RL in clinical decision support
  • Overview of adaptive security systems

Weeks 5-6: Function Approximation & Safe RL (Intensive)

  • Function approximation in RL (tile coding, neural networks)
  • Safe RL and constrained optimization
  • Hyperparameter tuning methodologies
  • Begin Alberta Plan paper background

Weeks 7-12: Alberta Plan + Healthcare IAM Specific (Extended)

  • Weeks 7-8: Alberta Plan paper (Sutton, Bowling, Pilarski 2023) deep dive; GVF papers (Horde architecture, Sutton et al. 2011); PBAC and role mining algorithms
  • Weeks 9-10: RLHF papers (Christiano et al. 2017, learning from human preferences); Intelligence Amplification historical context (Licklider 1960, Engelbart 1962); continual learning and temporal uniformity
  • Weeks 11-12: Healthcare IAM challenges and HIPAA/PHIPA compliance; RL deployment in security-critical systems; IA evaluation methodologies; academic writing for security/IA research; workshop paper structures (USENIX HealthSec, SACMAT)


Coursera Module Coverage

| Course / Module | Scheduled Week(s) |
|---|---|
| C1. Fundamentals of RL | ✅ COMPLETE |
| M1 Welcome | |
| M2 Intro to Sequential Decision-Making (bandits) | |
| M3 Markov Decision Processes | |
| M4 Value Functions & Bellman Equations | |
| M5 Dynamic Programming | |
| C2. Sample-based Learning Methods | ✅ COMPLETE |
| M1 Welcome | |
| M2 Monte Carlo (prediction & control) | |
| M3 TD for Prediction | |
| M4 TD for Control (SARSA, Q-Learning) | |
| M5 Planning, Learning & Acting (Dyna) | |
| C3. Prediction & Control w/ Function Approximation | W2–W4 |
| M1 Welcome | W2 ✅ |
| M2 On-policy Prediction w/ Approximation | W2 ✅ |
| M3 Constructing Features (tile coding) | W2 ✅ |
| M4 Control w/ Approximation (semi-gradient methods) | W3 ✅ |
| M5 Policy Gradient | W4 🔄 |
| C4. Capstone (intensive 2-week sprint) | W5–W6 |
| M1 Formalize problem as MDP (Coursera's problem) | W5 |
| M2 Choose and compare algorithms | W5 |
| M3 Parameter identification and exploration | W5 |
| M4 Neural network implementation with RMSProp | W6 |
| M5 Parameter study and statistical analysis | W6 |
| Healthcare IAM Capstone (extended 6-week project) | W7–W12 |
| Foundation (Alberta Plan, MDP, simulator) | W7–W8 |
| Implementation (GVFs, RL+RLHF, continual learning) | W9–W10 |
| Evaluation & Paper (IA metrics, draft) | W11–W12 |

Portfolio & Publication Goals

Technical Portfolio (github.com/j-klawson/learnrl):

  • Clean, well-tested implementations of core RL algorithms from scratch
  • Comprehensive documentation explaining algorithmic decisions
  • Reproducible experiments with statistical analysis
  • Coursera C4 Capstone: complete RL system applying the algorithms to Coursera's problem (Weeks 5-6, intensive)
  • Healthcare IAM Capstone: Intelligence Amplification for dynamic authorization in healthcare IAM (Weeks 7-12, extended)

Publication Target:

  • Paper: "Intelligence Amplification for Dynamic Authorization in Healthcare IAM Systems"
  • Framing: Alberta Plan Step 12 contribution, real-world IA in a safety-critical domain
  • Venues: USENIX HealthSec Workshop, SACMAT (ACM Symposium on Access Control Models and Technologies), IEEE Security & Privacy Workshop
  • Timeline: draft by Week 12, submit Q1 2026

D.Eng. Preparation:

  • Strong foundations in bandits → MDPs → DP → TD → function approximation → policy gradient
  • Experience with neural network function approximation and RMSProp optimization
  • Two capstone projects: intensive practice (C4, 2 weeks) + extended research (healthcare IAM, 6 weeks)
  • Deep understanding of safe RL and constrained optimization (critical for healthcare)
  • Alberta Plan alignment: GVFs, temporal uniformity, continual learning, intelligence amplification
  • Real-world problem formulation experience in a safety-critical domain
  • Publication-ready research demonstrating applied research capability
  • Clear dissertation trajectory: Intelligence Amplification for cybersecurity

Repository Structure:

learnrl/
├── bandits/              # Week 1: k-armed bandits ✅
│   ├── epsilon_greedy.py
│   ├── ucb.py (TODO)
│   └── thompson_sampling.py (TODO)
├── dp/                   # Weeks 1-2: Dynamic programming ✅
│   ├── value_iteration.py
│   └── policy_iteration.py
├── td/                   # Weeks 2-4: Temporal difference learning
│   ├── monte_carlo.py
│   ├── td_zero.py
│   ├── sarsa.py
│   ├── q_learning.py
│   └── expected_sarsa.py
├── policy_gradient/      # Week 4: Policy gradient methods
│   └── reinforce.py      # Basic REINFORCE algorithm
├── function_approx/      # Weeks 5-6, 7-8: Function approximation
│   ├── tile_coding.py
│   ├── linear_fa.py
│   ├── semi_gradient.py
│   └── nn_fa.py          # Neural network FA with RMSProp
├── capstone_coursera/    # Weeks 5-6: C4 capstone code
│   ├── environment.py    # Coursera's problem as a Gymnasium env
│   └── experiments.py    # Algorithm comparison experiments
├── capstone/             # Weeks 7-12: Healthcare IAM (extended)
│   ├── auth_env.py       # Authentication environment
│   ├── policies.py       # Baseline and RL policies
│   ├── safety.py         # Safety constraints
│   └── evaluation.py     # Metrics and analysis
├── utils/
│   ├── bandit_env.py
│   ├── gridworld_env.py
│   └── stats.py          # Statistical testing
└── tests/                # 156 tests, 92% coverage ✅
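
For a sense of what the test suite exercises, here is a pytest-style sketch written against the illustrative EpsilonGreedyAgent from the Phase 1 section; the real tests target learnrl's own classes, and the import path below is hypothetical:

```python
import numpy as np

# Hypothetical import: assumes the EpsilonGreedyAgent sketch shown earlier
# lives in a module named bandit_sketch; learnrl's real class is EpsilonGreedyBandit.
from bandit_sketch import EpsilonGreedyAgent

def test_zero_epsilon_is_purely_greedy():
    # With epsilon=0 the agent must always pick the highest-valued arm.
    agent = EpsilonGreedyAgent(k=3, epsilon=0.0, seed=0)
    agent.q = np.array([0.1, 0.9, 0.3])
    assert all(agent.select_action() == 1 for _ in range(100))

def test_update_tracks_sample_average():
    agent = EpsilonGreedyAgent(k=2, epsilon=0.1)
    for reward in (1.0, 0.0):
        agent.update(action=0, reward=reward)
    assert agent.q[0] == 0.5  # sample average of (1.0 + 0.0) / 2
```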

Outcomes at Week 12

Technical Mastery:

  • Deep understanding of RL foundations (bandits → MDPs → DP → TD → function approximation)
  • From-scratch implementations demonstrating algorithmic understanding
  • Experience with safe RL and constrained optimization for critical systems
  • Statistical methodology for evaluating RL systems

Research Capability:

  • Formulated a real-world problem as an MDP (dynamic authorization in healthcare IAM)
  • Applied Alberta Plan concepts to a safety-critical domain (GVFs, RLHF, continual learning)
  • Implemented an Intelligence Amplification system (Step 12 of the Alberta Plan)
  • Paper draft ready for workshop submission (USENIX HealthSec or SACMAT)
  • Experience with empirical evaluation, IA metrics, and statistical validation for security

D.Eng. Readiness:

  • Strong theoretical foundations for advanced coursework
  • Practical experience applying RL to healthcare cybersecurity
  • Publication demonstrating research capability
  • Clear dissertation direction (safe RL for healthcare/security)

Portfolio Artifacts:

  • github.com/j-klawson/learnrl: foundational RL implementations from scratch
  • C4 Capstone: complete RL system with NN function approximation (2-week intensive)
  • Healthcare IAM Capstone: Intelligence Amplification system for dynamic authorization (6-week extended project)
  • Paper: "Intelligence Amplification for Dynamic Authorization in Healthcare IAM" (Alberta Plan Step 12)
  • Comprehensive documentation and reproducible experiments with statistical validation


Resources

Reinforcement Learning

Engineering & Testing

Applied Math

Healthcare & Cybersecurity Applications

  • IEEE Security & Privacy - Security research journal
  • ACM CCS - Computer and communications security conference
  • Search terms for papers: "adaptive authentication", "risk-based access control", "reinforcement learning security", "safe reinforcement learning", "healthcare IAM"