Staff Site Reliability Engineer

Stratitech Services LLC.
San Francisco, CA
$210,000 - $250,000
Onsite

About The Position

This is not a ticket-taking SRE role. You will define how mission-critical machine learning and real-time analytics systems operate in production, influencing reliability strategy, deployment standards, and infrastructure architecture across engineering.

This team operates in a highly collaborative, in-person engineering environment in SOMA. Infrastructure, ML, and engineering leaders work side by side to design, build, and operate complex systems in real time. The pace is fast, the feedback loops are tight, and decisions happen quickly.

If you’ve grown from Linux systems → DevOps → Staff-level SRE, and you now think in terms of systemic risk, scalability, and long-term reliability strategy, this role gives you direct influence and visibility.

This role is intentionally in-person because:

  • Reliability decisions happen at architectural depth
  • ML, data, and infrastructure teams collaborate continuously in real time
  • Post-incident reviews, system design debates, and performance tuning sessions are hands-on and high impact
  • You will have direct access to engineering leadership and decision-makers
  • The infrastructure you’re operating is mission-critical and evolving quickly

If you value deep technical collaboration, tight feedback loops, and being at the center of high-scale ML systems, this environment is built for that.

Requirements

  • Deep experience operating Linux infrastructure and networking in production environments
  • Proven impact as a Staff SRE, Senior SRE, or senior-level DevOps/Platform Engineer supporting distributed systems
  • Experience supporting complex, data-intensive or ML-driven systems in production
  • Strong hands-on experience with Docker and Kubernetes
  • Infrastructure-as-Code expertise
  • Strong scripting ability (Bash and/or Python)
  • CI/CD ownership experience (GitHub Actions, ArgoCD, or similar)
  • Experience with modern observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry)
  • Ability to debug systemic failures across infrastructure, deployments, and workloads
  • Clear communicator who works effectively across engineering and data teams

Nice To Haves

  • Experience operating ML platforms at scale (training + inference)
  • AWS or cloud-managed services experience
  • Exposure to data platforms such as Spark, Airflow, or Kafka
  • Experience in SOC 2 or regulated environments

Responsibilities

  • Production reliability for ML and real-time analytics workloads
  • CI/CD strategy, deployment automation, and rollback design
  • Observability frameworks (SLOs, alerting, monitoring, incident response)
  • Infrastructure-as-Code and Kubernetes environments
  • Capacity planning and performance optimization
  • Post-incident reviews that drive measurable, long-term reliability improvements
  • Reliability standards across teams, not just within a single service
  • Partnership with engineering and data science teams to ensure ML workloads are production-ready and reliable by design

Benefits

  • Competitive base compensation ($210K–$250K)