Systems Reliability Engineer

BackOps AI • San Francisco, CA

Hybrid

About The Position

BackOps AI is transforming supply chain operations with agentic AI solutions that automate complex workflows, freeing operations teams to focus on what matters most. Headquartered in the San Francisco Bay Area with flexible remote-friendly options, we foster a culture of innovation, ownership, and measurable impact.

As a Systems Reliability Engineer (SRE), you’ll own the reliability, scalability, and security posture of the platforms that power our agentic workflows. You’ll build the guardrails and operational foundations that let product and AI teams ship quickly without sacrificing uptime, observability, or customer trust. We run primarily on AWS; familiarity with GCP is a plus.

Requirements

Experience: 4+ years in SRE/DevOps/Infrastructure roles supporting production systems with meaningful uptime requirements
AWS Expertise: Strong hands-on experience operating workloads in AWS (IAM, VPC/networking, compute, storage, monitoring, and security controls)
Systems Thinking: Solid understanding of distributed systems failure modes (timeouts, retries, cascading failures), and how to design for resilience
Operational Excellence: Strong incident leadership instincts; comfortable being the calm, methodical driver during outages
Automation Mindset: You automate first, favoring repeatable environments, scripted operations, and minimal manual toil
Clear Communicator: Can write crisp runbooks, postmortems, and technical proposals; able to align engineering, product, and ops on priorities
Security & Quality: Proven ability to improve security posture and reliability without blocking delivery

Nice To Haves

CloudWatch: Strong experience with CloudWatch Logs/Metrics/Alarms, dashboarding, and alert hygiene
Sentry: Experience operating error monitoring and triage workflows (alert tuning, release health, actionable grouping)
LangSmith: Familiarity with LLM/agent observability (trace analysis, evals/monitoring signals, debugging agent failures)
incident.io: Experience running incident workflows (paging, incident timelines, postmortems, follow-up tracking)
GCP: Experience operating production systems on GCP (or hybrid/multi-cloud environments)
Kubernetes: Experience with Kubernetes (or deep experience with managed platforms and production deployment patterns)
Compliance: Strong background in compliance-oriented environments (SOC 2), audit readiness, and control implementation

Responsibilities

Reliability & Availability: Define and improve SLOs/SLIs, reduce error budget burn, and drive initiatives that improve uptime and customer experience
Incident Response: Lead and/or participate in on-call rotations; run incident response, coordinate remediation, and produce clear postmortems with measurable follow-ups
Observability: Build end-to-end observability (metrics, logs, tracing), dashboards, alerts, and runbooks that make issues diagnosable quickly across services and agents
Cloud Operations (AWS): Improve and maintain AWS foundations (IAM, VPC/networking, compute, storage, monitoring, logging, and security controls)
Infrastructure as Code: Build and maintain repeatable infrastructure using IaC; enforce consistency across environments (dev/stage/prod) and reduce configuration drift
Deployment & CI/CD: Improve deployment safety and velocity (progressive rollouts, rollback strategies, canary patterns, automation in CI/CD)
Security & Compliance: Implement and operationalize security best practices (least privilege IAM, secrets management, audit logging, network segmentation) and support SOC 2–aligned controls
Performance & Cost: Identify bottlenecks and reliability risks; tune compute/database/network performance and optimize cloud spend without compromising availability
Data Protection: Own backup/restore strategies, disaster recovery plans, retention/deletion execution, and periodic recovery testing

Benefits

Equity & Ownership: Competitive equity so you grow alongside the company
Impact & Visibility: Direct access to co-founders; your work directly improves customer trust and operational outcomes
Collaborative Culture: Tight-knit team of seasoned operators and AI experts
Flexible Work: Hybrid with core Bay Area presence and remote flexibility
