Runtime Engineer

MatX · Mountain View, CA
$120,000 - $475,000 · Hybrid

About The Position

MatX is building custom silicon for large-language-model inference and training, with HW/SW co-design across ISA, RTL, simulator, compiler, and kernels so each layer benefits from the others. The runtime owns the host-side stack and the contracts that bind those teams together.

Requirements

  • Strong experience in a systems programming language — Rust, C, C++, or Go — including memory management, allocator design, and FFI/ABI work
  • Have built Python interop layers in production (PyO3, ctypes, pybind11, or equivalent C-ABI bridging)
  • Have designed and maintained API or ABI contracts between teams — versioning, evolution, breaking-change discipline — not just consumed someone else's
  • Hands-on with at least one accelerator programming model (CUDA, ROCm, oneAPI Level Zero, TPU, or comparable) — enough to reason about device memory, async execution, and kernel launch
  • ML-systems literate — comfortable with the training and inference loop, what collectives do, what a tensor layout is. Research depth not required.

Nice To Haves

  • LLM inference internals — vLLM, TensorRT-LLM, or SGLang (paged attention, scheduler design)
  • Rust at depth, including proc macros, unsafe with soundness reasoning, and complex lifetime/trait work
  • Custom allocator design (slab, paged, arena) or other low-level memory work
  • ML framework integration experience (PyTorch custom backends, JAX/XLA, ONNX runtime)
  • Profiler or tracing infrastructure work (perfetto, Nsight, or a custom stack)
  • Driver-adjacent or kernel-bypass work, or prior new-silicon bring-up

Responsibilities

  • Build the host-side interface library — device memory management, DMA, streams and events, sync primitives — that every compiler-emitted program runs on top of
  • Own and extend the executable format: the compiler→runtime contract, its versioning, and the weight and quantization layouts that let compiler and runtime evolve independently
  • Design the custom-kernel ABI — calling convention, sync semantics, lifecycle — and the host-side marshaling layer (DLPack, the buffer protocol, numpy) that gets Python tensors to the device
  • Build Python bindings via PyO3, with a C-ABI shim as the alternative integration path for downstream consumers
  • Build the LLM inference serving stack — paged KV cache, continuous batching, request scheduling, token streaming — and the cluster orchestration primitives underneath it
  • Bring up interconnect topology from the host and own the failure-detection and clean-teardown path for stop-restructure-resume recovery across racks
  • Design what the chip exposes to host-side profilers and debuggers — perf counters, traces, and the Python surfaces ML engineers actually use — and hit measurable performance targets on runtime overhead and serving throughput

Benefits

  • Generous equity, with an optional cash/equity swap at offer and the option for employees to exercise early
  • Company-subsidized health, dental, vision, and life insurance
  • Pre-tax Health Savings Accounts with a generous company contribution (even if you don’t contribute)
  • 4 weeks paid time off (accrued)
  • 12 company holidays
  • 3 weeks remote/flexible work per year
  • Up to 12 weeks of paid parental leave, regardless of your path to parenthood
  • $1,500 yearly towards your professional development, e.g., conferences, courses, and other learning opportunities
  • Team lunches, quarterly off-sites, and regular town halls
  • 401(k) and/or Roth IRA, with a 5% company contribution, even if you don’t contribute
  • Pre-tax spend accounts for medical, dental/vision, dependent care, parking, and transit expenses
  • For those commuting up to 1 hour, put your rideshare cost on our company card and reclaim the drive time to get work done
  • $50 per month to use on the perks you care about most
  • We work remotely on Mondays and Fridays, supported by a home-tech setup and reimbursement of remote Wi-Fi expenses