Site Reliability Engineer

Latent
San Francisco, CA
14d
Onsite

About The Position

You are the infrastructure expert who enables our rapid product development and guarantees 99.9%+ stability and performance of our clinical AI platform for major health systems. Your focus on operational excellence is directly tied to a patient's access to life-saving treatment.

What We Look for in a Great Engineer

You have the intensity and technical mastery to own mission-critical infrastructure. You hold yourself and others to high standards and thrive in a high-energy, in-office culture where everyone is in it to win it.

  • Tool Proficiency: You are highly proficient with your tools; you speak command line fluently and have mastered keyboard shortcuts.
  • Ownership: You thrive on owning complex systems and have a proven track record of scaling mission-critical deployments.
  • Automation Drive: You love automating things, always finding new ways to increase your own leverage, and defining standards for operational excellence.
  • Problem Solver: You won't wait for someone else to solve a problem that you're in a position to solve; you are willing to jump into whatever needs to get done.

Requirements

  • Deep, demonstrable experience with Kubernetes, Helm, and Terraform.
  • Proven ability to architect and maintain complex, distributed systems with high-availability requirements.
  • Hands-on experience optimizing deployment pipelines for both application code (TypeScript) and machine learning models (Python/ML), plus experience with PostgreSQL, Redis, and Kafka.
  • Excitement about working five days per week in our San Francisco office.

Responsibilities

  • Infrastructure Ownership: Design, implement, and maintain the production environment, drawing on prior experience with 500+ machine deployments.
  • Kubernetes Mastery: Own our containerized infrastructure, leveraging deep expertise in Kubernetes and Helm to manage deployment, scaling, and operational health.
  • CI/CD & Deployment Optimization: Optimize and streamline both the TypeScript and Python/ML deployment pipelines to support high-velocity feature releases while maintaining the highest reliability.
  • DevX Support: Support Developer Experience (DevX) work to streamline developer workflows, enhance tool proficiency, and improve CI/CD systems.
  • Infrastructure as Code (IaC): Manage and maintain infrastructure definitions using Terraform.
© 2024 Teal Labs, Inc