Staff Platform Engineer

Sanas • Palo Alto, CA

About The Position

We're looking for an experienced Platform Engineer to build and operate the hybrid infrastructure foundation for our advanced AI/ML research and product development. You'll architect, build, and run our platforms spanning AWS and on-premise deployments, empowering our teams to train and deploy complex models at scale. This role is focused on creating a robust, self-service environment using Kubernetes, AWS, and Infrastructure-as-Code (Terraform), and orchestrating high-demand GPU workloads.

Requirements

5+ years of Software Engineering experience, preferably in Platform Engineering or Site Reliability.
Strong fundamentals with a focus on writing clean, maintainable code.
Strong proficiency in scripting (Bash), Python, or Rust.
Experience building large-scale distributed systems with high demands on model inference, performance, reliability, and observability.
Experience with high-performance compute (HPC) schedulers, capacity planning, and containerized deployments, and familiarity with managing GPU-intensive AI workloads.
Strong communication skills, with the ability to own large-scope projects by working cross-functionally with Engineering, AI, Product, Research, and Business stakeholders.
Experience working with AWS (preferred), GCP or Azure, EKS / Kubernetes.
Deep curiosity about the state of agentic coding tools and how to optimize agent-assisted workflows.

Nice To Haves

Familiarity with real-time streaming protocols like WebTransport and SIP/SRTP.
Bachelor’s degree in Computer Science or a related field.

Responsibilities

Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications, research initiatives, and Sanas services.
Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
Lead comprehensive observability across the organization (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
Design and build low-latency, scalable, and reliable infrastructure that serves model inference and training for our cutting-edge speech AI models.
Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate and support their development cycle.
You'll have significant autonomy to shape our product infrastructure, and directly impact how cutting-edge AI is applied across various devices and applications in speech.
