Sr. Site Reliability Engineer

Tiger Analytics Inc.
Washington, DC
Hybrid

About The Position

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps—bridging the gap between model development and production-grade reliability.

Requirements

  • Expert-level knowledge of Kubernetes (K8s) and Docker.
  • Familiarity with MLOps tools such as Kubeflow, Vertex AI, MLflow, or DVC.
  • Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.
  • Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or vector databases such as Pinecone or Milvus).
  • Solid understanding of VPCs, load balancers, DNS, and secure service meshes (Istio/Anthos).

Responsibilities

  • Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
  • Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
  • Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.
  • Ensure the high availability of Vertex AI endpoints and custom inference services.
  • Monitor and optimize utilization of compute resources (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
  • Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
  • Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
  • Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
  • Develop custom Python or Go scripts that automate repetitive operational tasks, implement self-healing mechanisms, and perform resource cleanup.
  • Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud's operations suite (formerly Stackdriver).
  • Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
  • Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Benefits

  • Significant career development opportunities exist as the company grows.
  • The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.