Sr. Site Reliability Engineer

Tiger Analytics Inc.•Washington, DC

48d•Hybrid

About The Position

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role combines software engineering and systems architecture, with a specialized focus on MLOps: bridging the gap between model development and production-grade reliability.

Requirements

Expert-level knowledge of Kubernetes (K8s) and Docker.
Familiarity with MLOps tools such as Kubeflow, Vertex AI, MLflow, or DVC.
Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.
Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or Vector Databases like Pinecone/Milvus).
Solid understanding of VPCs, load balancers, DNS, and secure service meshes (Istio/Anthos).

Responsibilities

Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.
Ensure the high availability of Vertex AI endpoints and custom inference services.
Monitor and optimize the utilization of compute accelerators to ensure cost-efficient performance for Large Language Models (LLMs).
Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.
Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
Develop custom Python or Go scripts that automate repetitive operational tasks, implement self-healing mechanisms, and perform resource cleanup.
Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud's operations suite (formerly Stackdriver).
Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Benefits

Significant career development opportunities exist as the company grows.
The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.
