Staff Platform Engineer

Sanas • Palo Alto, CA

About The Position

We're looking for an experienced Platform Engineer to build and operate the hybrid infrastructure foundation for our advanced AI/ML research and product development. You'll architect, build, and run platform environments spanning AWS and other cloud environments, empowering our teams to train and deploy complex models at scale. This role focuses on creating a robust, self-service environment with Kubernetes, AWS, and Infrastructure-as-Code (Terraform), and on orchestrating high-demand GPU workloads.

Requirements

6+ years of Software Engineering experience, preferably in Platform Engineering or Site Reliability.
Strong fundamentals with a focus on writing clean and maintainable code.
Strong proficiency in Python or Rust.
Experience building large-scale distributed systems with high demands on model inference, performance, reliability, and observability.
Experience with high-performance compute (HPC) schedulers, capacity planning, and containerized deployments, and familiarity with managing GPU-intensive AI workloads.
Strong communication skills and the ability to own large-scope projects by working cross-functionally with Engineering, AI, Product, Research, and Business stakeholders.
Experience working with AWS (preferred), GCP, or Azure, and with EKS / Kubernetes.
Deep curiosity about the state of agentic coding tools and how to optimize agent-assisted workflows.
Bachelor’s degree in Computer Science or a related field, or equivalent experience.

Nice To Haves

Familiarity with real-time streaming protocols like WebTransport and SIP/SRTP.

Responsibilities

Architect and maintain our core computing platform using Kubernetes on AWS and on-premises, providing a stable, scalable environment for all applications, research initiatives, and Sanas’ services.
Provision, manage, and maintain our cloud infrastructure for high-performance GPU computing.
Lead comprehensive observability across the organization (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, on-call, incident response, and performance tuning.
Design and build low-latency, scalable, and reliable infrastructure that serves model inference and training for our cutting-edge speech models.
Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate and support their development cycle.
You'll have significant autonomy to shape our product infrastructure and directly impact how cutting-edge AI is applied to speech across various devices and applications.
