Member of Technical Staff, Post-Training, RL

Mirendil • United States, CA

About The Position

Mirendil is a tech-first company focused on solving core bottlenecks that unlock step-change acceleration across science and technology. Our first goal is to democratize frontier AI R&D across scientific disciplines. We believe accelerating scientific discovery is one of the most powerful ways to improve the future of humanity, and that AI will play a central role in making that possible.

We are building a frontier AI research company and training our own models end-to-end. Our work spans areas such as model training, reinforcement learning, reasoning systems, and infrastructure for large-scale experiments. Our team includes researchers and engineers from Anthropic, Google DeepMind, xAI, OpenAI, Microsoft, Apple, and MIT.

The Role

We are looking for research engineers to help build the post-training stack for frontier reasoning models. This role sits at the point where model capability, training dynamics, data, verification, and infrastructure all meet. You will design and run the experiments that turn a strong base model into a model that can solve difficult tasks reliably: choosing training objectives, shaping data mixtures, building verifiers, debugging reward signals, scaling runs, and understanding why a recipe works or fails.

Researchers are also expected to have strong engineering skills. The best work here will involve both: forming hypotheses about training behavior, implementing them in real systems, running large-scale experiments, reading the resulting traces carefully, and turning the lessons into the next training run.

Requirements

Strong engineering skills
Forming hypotheses about training behavior
Implementing hypotheses in real systems
Running large-scale experiments
Reading resulting traces carefully
Turning lessons into the next training run

Nice To Haves

Post-training recipes
Scaling RL
Long-horizon reasoning
Off-policy and asynchronous training
Verification and reward quality
Multi-task post-training
Experiment analysis and debugging
End-to-end execution

Responsibilities

Develop and iterate on RL, SFT, and distillation recipes.
Understand how choices in objectives, data mixtures, hyperparameters, rollout generation, and filtering affect efficiency, stability, capability, and final model behavior.
Make post-training work at larger scales: more tokens, longer trajectories, larger models, more steps, and larger compute budgets.
Identify the bottlenecks that appear only when an approach leaves the small-run regime.
Train models on tasks where success depends on many intermediate decisions.
Develop methods for assigning useful feedback across long trajectories, where sparse rewards, credit assignment, exploration, and verification all become harder.
Work on training regimes where data is generated by older policies, different policies, or partially filtered policies.
Build intuition and tooling for when off-policy data helps, when it hurts, and how to control the resulting instabilities.
Build robust verification pipelines for tasks where correctness can be checked automatically or semi-automatically.
Detect and reduce reward hacking, false positives, brittle verifiers, and other failure modes that make RL look better than it really is.
Scale recipes across different task families and domains.
Study the tradeoffs between specialization and generality, and design training mixtures that improve capabilities jointly rather than one at the expense of another.
Develop a deep empirical understanding of training runs.
Diagnose regressions, separate real improvements from noise, design better ablations, and build the probes and analyses needed to make post-training less opaque.
Work closely with systems, infrastructure, and data teams to get experiments from idea to production-scale runs.
Make training pipelines reliable, ensure data and verifier quality, and turn successful experiments into repeatable and scalable recipes.