Runtime Engineer

MatX•Mountain View, CA

$120,000 - $475,000•Hybrid

About The Position

MatX is building custom silicon for large-language-model inference and training, with HW/SW co-design across ISA, RTL, simulator, compiler, and kernels so each layer benefits from the others. The runtime owns the host-side stack and the contracts that bind those teams together.

Requirements

Strong experience in a systems programming language — Rust, C, C++, or Go — including memory management, allocator design, and FFI/ABI work
Have built Python interop layers in production (PyO3, ctypes, pybind11, or equivalent C-ABI bridging)
Have designed and maintained API or ABI contracts between teams — versioning, evolution, breaking-change discipline — not just consumed someone else's
Hands-on with at least one accelerator programming model (CUDA, ROCm, oneAPI Level Zero, TPU, or comparable) — enough to reason about device memory, async execution, and kernel launch
ML-systems literate — comfortable with the training and inference loop, what collectives do, what a tensor layout is. Research depth not required.

Nice To Haves

LLM inference internals — vLLM, TensorRT-LLM, or SGLang (paged attention, scheduler design)
Rust at depth, including proc macros, unsafe with soundness reasoning, and complex lifetime/trait work
Custom allocator design (slab, paged, arena) or other low-level memory work
ML framework integration experience (PyTorch custom backends, JAX/XLA, ONNX runtime)
Profiler or tracing infrastructure work (perfetto, Nsight, or a custom stack)
Driver-adjacent or kernel-bypass work, or prior new-silicon bring-up

Responsibilities

Build the host-side interface library — device memory management, DMA, streams and events, sync primitives — that every compiler-emitted program runs on top of
Own and extend the executable format: the compiler→runtime contract, its versioning, the weight and quantization layouts that let compiler and runtime evolve independently
Design the custom-kernel ABI — calling convention, sync semantics, lifecycle — and the host-side marshaling layer (DLPack, the buffer protocol, numpy) that gets Python tensors to the device
Build Python bindings via PyO3, with a C-ABI shim as the alternative integration path for downstream consumers
Build the LLM inference serving stack — paged KV cache, continuous batching, request scheduling, token streaming — and the cluster orchestration primitives underneath it
Bring up interconnect topology from the host and own the failure-detection and clean-teardown path for stop-restructure-resume recovery across racks
Design what the chip exposes to host-side profilers and debuggers — perf counters, traces, and the Python surfaces ML engineers actually use — and hit measurable performance targets on runtime overhead and serving throughput

Benefits

Generous equity, with an optional cash/equity swap at offer and the option for employees to exercise early
Company-subsidized health, dental, vision, and life insurance
Pre-tax Health Savings Accounts with a generous company contribution (even if you don’t contribute)
4 weeks paid time off (accrued)
12 company holidays
3 weeks remote/flexible work per year
Up to 12 weeks of paid parental leave, regardless of your path to parenthood
$1,500 per year toward your professional development, e.g., conferences, courses, and other learning opportunities
Team lunches, quarterly off-sites, and regular town halls
401(k) and/or Roth IRA, with a 5% company contribution, even if you don’t contribute!
Pre-tax spend accounts for medical, dental/vision, dependent care, parking, and transit expenses
For those commuting up to 1 hour, put your rideshare cost on our company card and reclaim the drive-time to get work done!
$50 per month to use on the perks you care about most
We work remotely on Mondays and Fridays, supported by a home-tech setup and remote Wi-Fi expense reimbursement