Teaching an LLM to Coach Itself: Multi-Agent Math Tutoring with Reinforcement Learning
Training a Solver-Coach-Reviser system on Hendrycks MATH using Tinker RL and Qwen3-8B. What if a language model could not only solve math problems, but also review its own work, spot its mistakes, and fix them? That's exactly what we set out to build. In this project we trained a multi-agent math coaching system where a single LLM plays three distinct roles: 1. Solver attempts the problem step-by-step and produces a final answer. 2. Coach reviews the Solver's…
48 minutes ago · 8 min read


Recursive Language Models as procedural scaling
Long-context is often treated like a single knob. Increase the window, improve the model, and the problem goes away. That framing collapses under closer inspection, because “long context” is not one thing. There is the systems problem of making attention and training efficient at larger sequence lengths, and there is a more subtle problem that shows up even when efficiency is not the bottleneck: the data distribution that language models are trained on is not unbounded in length…
4 days ago · 5 min read


How RL Changed My Taste in AI Systems
I used to treat reinforcement learning as a mysterious corner of machine learning where agents somehow “figure it out” through trial and error. The more I read, the more I realized that the mystery comes from a single twist: the feedback is delayed, noisy, and often sparse. Once you accept that, RL stops being magic and starts being a very specific kind of optimization problem that punishes sloppy assumptions. What follows is the learning path that actually worked for me. It…
Feb 4 · 6 min read


Reinforcement learning vs “regular” training: the real difference is not the math, it is the loop
Most ML people grow up on a simple mental model: you have a dataset, you define a loss, you run gradient descent, you ship a checkpoint. That covers supervised learning and a lot of self-supervised pretraining. The model is learning from a fixed distribution of examples, and the training pipeline is basically a linear flow from data to gradients. Reinforcement learning (RL) breaks that mental model because the model is not only learning from data; it is also actively creating…
Jan 26 · 7 min read


GPT-OSS Safeguard as Policy-Executable Safety, and the Cabinet Briefing Risk Scanner Built on Top of It
This article presents a systems-focused account of how GPT-OSS Safeguard can be used as a policy-executable safety component and how that capability can be operationalized into a real workflow for high-stakes government communications. The case study is a Cabinet Briefing Risk Scanner, an AI tool that reviews draft communications prior to distribution by applying an explicit written risk policy, treating the analyzed text as untrusted, and emitting strict structured…
Jan 3 · 14 min read


2025: The Year I Bet on Myself
On December 30th, 2024, I finished my last day at IBM. It was the kind of ending that looks simple from the outside, but internally it carried years of thought and a lot of quiet pressure. I wasn’t leaving because I hated the work, and I wasn’t leaving because something broke. I was leaving because I could feel myself outgrowing the comfort of a structured path. IBM gave me discipline, exposure, and a solid environment to sharpen my skills, but I kept feeling a stronger pull…
Jan 1, 2026 · 7 min read