Staff Machine Learning Engineer, ML Infrastructure

SimpliSafe•Boston, MA

Hybrid

About The Position

SimpliSafe is seeking a Staff ML Engineer to join its Cloud ML team. This is a senior individual contributor role focused on improving how ML systems are built, deployed, and operated at scale. The engineer will collaborate with other senior engineers to guide architecture, mentor team members, and define the technical direction for the ML platform. The work involves demanding workloads such as real-time computer vision inference for video processing and LLM/GenAI infrastructure for future intelligent applications. The ideal candidate has built ML infrastructure before, understands its challenges, and is passionate about improving speed and reliability for the teams that depend on it.

Requirements

8+ years of software/ML engineering experience, with a clear track record of building and operating production ML systems at scale.
Deep expertise in cloud ML infrastructure on Kubernetes, with hands-on production experience with Ray; experience with KServe, Triton, vLLM, Kubeflow, Argo, or similar is a strong plus.
Strong production experience on AWS (EKS, S3, IAM, networking) and with Kafka, containerized deployments, CI/CD, and infrastructure-as-code.
Demonstrated experience designing and operating high-throughput, low-latency inference systems, including GPU-aware scheduling, batching, autoscaling, and multi-tenancy.
Solid grounding in ML fundamentals: how models are trained, evaluated, versioned, deployed, monitored, and rolled back in production.
Proficiency in Python is required; experience with a systems language (Go, C++, Rust) for performance-sensitive components is a plus.
Staff-level technical leadership: ability to drive ambiguous, cross-cutting initiatives, align senior stakeholders, and elevate the engineers around you without formal authority.
Strong written and verbal communication — you can make complex technical tradeoffs legible to ML scientists, product, and other infra teams.

Nice To Haves

Hands-on experience with LLM serving in production (vLLM, TGI, TensorRT-LLM, SGLang), including KV cache management, continuous batching, speculative decoding, and quantization for serving.
Experience building real-time video or streaming ML pipelines (Kafka, Kinesis, Flink, or similar) at scale.
Background supporting CV workloads in production, including model formats, GPU/accelerator tradeoffs, and video codecs.
Experience with model lifecycle tooling (MLflow, Weights & Biases, model registries, drift detection, shadow deployments).
Open source contributions to the ML infrastructure ecosystem (Ray, KServe, Triton, vLLM, Kubeflow, etc.).
Experience operating in environments with strong security and compliance requirements.

Responsibilities

Set technical direction for ML infrastructure
Drive architecture decisions for our Kubernetes-based ML platform (anchored on Ray for inference, alongside KServe, Triton, and vLLM) across real-time and batch workloads.
Lead deep technical reviews on system design, capacity planning, and reliability for the highest-stakes ML systems at SimpliSafe.
Identify and remove the systemic bottlenecks in our ML deployment infrastructure, whether that's serving reliability, deployment friction, observability gaps, scaling, or cost.

Build and operate real-time CV inference at scale
Own the design and evolution of cloud-side inference systems that process live video and events from SimpliSafe devices in real time.
Drive throughput, latency, and cost improvements (batching strategies, GPU utilization, autoscaling, multi-model serving) for production CV models.
Build the feedback loops between cloud inference, edge devices, and the data flywheel that improves model quality over time.

Stand up LLM/GenAI serving infrastructure
Help shape how SimpliSafe serves LLMs in production: model serving patterns, KV-cache and batching strategies, evaluation pipelines, guardrails, and cost controls.
Partner with applied ML engineers to take new GenAI-powered product features from prototype to scaled deployment.

Raise the engineering bar across Cloud ML
Mentor engineers across the team through design reviews, code reviews, pairing, and written guidance, providing a meaningful uplift to everyone you work with.
Establish and evangelize best practices for model lifecycle management (registry, deployment, monitoring, rollback, drift) and on-call.
Write the documentation, runbooks, and architectural decision records that make the platform legible and durable.

Own reliability and operational excellence
Lead incident response and postmortems for critical ML systems; turn lessons learned into platform-level improvements.
Define SLOs, observability standards, and on-call practices for ML services in production.

Benefits

A mission- and values-driven culture and a safe, inclusive environment where you can build, grow, and thrive.
A comprehensive total rewards package that supports your wellness and provides security for SimpliSafers and their families.
Free SimpliSafe system and professional monitoring for your home.
Employee Resource Groups (ERGs) that bring people together, give opportunities to network, mentor and develop, and advocate for change.
