Systems Reliability Engineer

BackOps-AI
San Francisco, CA (Hybrid)

About The Position

BackOps AI is transforming supply chain operations with agentic AI solutions that automate complex workflows, freeing operations teams to focus on what matters most. Headquartered in the San Francisco Bay Area with flexible, remote-friendly options, we foster a culture of innovation, ownership, and measurable impact.

As a Systems Reliability Engineer (SRE), you’ll own the reliability, scalability, and security posture of the platforms that power our agentic workflows. You’ll build the guardrails and operational foundations that let product and AI teams ship quickly without sacrificing uptime, observability, or customer trust. We run primarily on AWS; familiarity with GCP is a plus.

Requirements

  • Experience: 4+ years in SRE/DevOps/Infrastructure roles supporting production systems with meaningful uptime requirements
  • AWS Expertise: Strong hands-on experience operating workloads in AWS (IAM, VPC/networking, compute, storage, monitoring, and security controls)
  • Systems Thinking: Solid understanding of distributed systems failure modes (timeouts, retries, cascading failures), and how to design for resilience
  • Operational Excellence: Strong incident leadership instincts; comfortable being the calm, methodical driver during outages
  • Automation Mindset: You automate first—repeatable environments, scripted operations, and minimal manual toil
  • Clear Communicator: Can write crisp runbooks, postmortems, and technical proposals; able to align engineering, product, and ops on priorities
  • Security & Quality: Proven ability to improve security posture and reliability without blocking delivery

Nice To Haves

  • CloudWatch: Strong experience with CloudWatch Logs/Metrics/Alarms, dashboarding, and alert hygiene
  • Sentry: Experience operating error monitoring and triage workflows (alert tuning, release health, actionable grouping)
  • LangSmith: Familiarity with LLM/agent observability (trace analysis, evals/monitoring signals, debugging agent failures)
  • incident.io: Experience running incident workflows (paging, incident timelines, postmortems, follow-up tracking)
  • GCP: Experience operating production systems on GCP (or hybrid/multi-cloud environments)
  • Kubernetes: Experience with Kubernetes (or deep experience with managed platforms and production deployment patterns)
  • Compliance: Strong background in compliance-oriented environments (SOC 2), audit readiness, and control implementation

Responsibilities

  • Reliability & Availability: Define and improve SLOs/SLIs, reduce error budget burn, and drive initiatives that improve uptime and customer experience
  • Incident Response: Lead and/or participate in on-call rotations; run incident response, coordinate remediation, and produce clear postmortems with measurable follow-ups
  • Observability: Build end-to-end observability (metrics, logs, tracing), dashboards, alerts, and runbooks that make issues diagnosable quickly across services and agents
  • Cloud Operations (AWS): Improve and maintain AWS foundations (IAM, VPC/networking, compute, storage, monitoring, logging, and security controls)
  • Infrastructure as Code: Build and maintain repeatable infrastructure using IaC; enforce consistency across environments (dev/stage/prod) and reduce configuration drift
  • Deployment & CI/CD: Improve deployment safety and velocity (progressive rollouts, rollback strategies, canary patterns, automation in CI/CD)
  • Security & Compliance: Implement and operationalize security best practices (least privilege IAM, secrets management, audit logging, network segmentation) and support SOC 2–aligned controls
  • Performance & Cost: Identify bottlenecks and reliability risks; tune compute/database/network performance and optimize cloud spend without compromising availability
  • Data Protection: Own backup/restore strategies, disaster recovery plans, retention/deletion execution, and periodic recovery testing

Benefits

  • Equity & Ownership: Competitive equity so you grow alongside the company
  • Impact & Visibility: Direct access to co-founders; your work directly improves customer trust and operational outcomes
  • Collaborative Culture: Tight-knit team of seasoned operators and AI experts
  • Flexible Work: Hybrid with core Bay Area presence and remote flexibility