Sr. SRE/DevOps Engineer

Madison Reed•San Francisco, CA

$170,000 - $175,000•Remote

About The Position

Madison Reed is seeking a hands-on Senior SRE / AI Platform DevOps Engineer to build, operate, and scale the infrastructure behind our AI-powered services, agents, and orchestration platforms. This role sits at the intersection of site reliability engineering, cloud infrastructure, DevOps automation, observability, and AI operations. You will own the systems and practices that ensure our AI-enabled services are reliable, secure, scalable, cost-effective, and production-ready.

The ideal candidate is infrastructure-first and operationally minded, with deep experience in cloud environments, CI/CD, production monitoring, incident response, and automation. You will help operationalize AI systems by building reliable deployment workflows, telemetry pipelines, monitoring frameworks, and governance processes for models, agents, and orchestration services. This is a highly hands-on engineering role for someone who enjoys building resilient platforms, reducing operational risk, improving deployment velocity, and making advanced technology dependable in real-world production environments.

Requirements

5+ years of experience in DevOps, Site Reliability Engineering, Platform Engineering, Cloud Infrastructure, or related roles.
Strong hands-on experience with cloud infrastructure, preferably AWS.
Experience building and maintaining CI/CD pipelines and automated deployment workflows.
Proficiency with infrastructure-as-code tools such as Terraform, CloudFormation, or similar.
Experience operating production systems with strong monitoring, alerting, logging, and incident response practices.
Strong scripting or programming skills in Python, Bash, Go, or similar languages.
Experience designing reliable, secure, scalable, and cost-conscious infrastructure.
Comfortable participating in on-call rotations and supporting production systems.

Nice To Haves

Experience operating AI, ML, agent-based, or data-intensive systems in production.
Familiarity with model deployment, model versioning, inference services, or MLOps workflows.
Experience with observability platforms such as Datadog, New Relic, Grafana, Prometheus, OpenTelemetry, Splunk, or similar.
Experience with event-driven architectures, queueing systems, streaming platforms, or telemetry pipelines.
Familiarity with AIOps concepts such as anomaly detection, alert correlation, automated remediation, and intelligent incident response.
Experience implementing SLOs, error budgets, production readiness reviews, and reliability scorecards.
Understanding of security, compliance, access control, and governance practices for production systems.

Responsibilities

Design, provision, and manage cloud infrastructure for AI-powered services, agents, orchestration systems, and supporting platforms.
Automate environment setup and configuration across development, staging, and production environments.
Build reusable infrastructure-as-code patterns that improve consistency, security, scalability, and maintainability.
Partner with engineering teams to ensure production systems are resilient, observable, performant, and cost-efficient.
Participate in on-call support, incident response, root cause analysis, and continuous reliability improvement.
Build, maintain, and optimize CI/CD pipelines for services, agents, orchestration layers, and supporting infrastructure.
Implement automated testing, validation, security, and reliability gates within deployment workflows.
Design safe deployment patterns including blue/green deployments, canary releases, feature flags, and automated rollback mechanisms.
Integrate health checks, service readiness checks, and reliability signals into release processes.
Improve deployment speed and confidence while reducing production risk.
Package, version, deploy, and manage AI models, agent services, and orchestration components across environments.
Support safe rollout, rollback, refresh, and retirement workflows for AI-powered services.
Monitor AI service performance across latency, throughput, availability, cost, quality, and business-critical reliability signals.
Implement operational controls for AI systems, including version tracking, environment promotion, access management, and change governance.
Partner with data, engineering, product, and support teams to ensure AI systems are production-ready and operationally accountable.
Design and operate scalable telemetry pipelines for logs, metrics, traces, model events, agent interactions, and operational signals.
Enable structured observability for AI services and orchestration systems to support real-time monitoring, alerting, and diagnostics.
Build dashboards, alerts, and reporting that provide actionable insight into system health, performance, reliability, and cost.
Improve incident detection, triage, and resolution through high-quality telemetry and operational data.
Support data-driven reliability practices, including SLOs, error budgets, service health reviews, and post-incident analysis.
Implement intelligent monitoring, alert correlation, anomaly detection, and automated incident response capabilities.
Integrate AIOps tools and workflows into existing DevOps, SRE, and engineering operations.
Build automation that reduces manual operational work and shortens mean time to detect and resolve issues.
Identify opportunities to use AI and automation to improve platform reliability, observability, supportability, and operational efficiency.
Define and maintain reliability standards for AI-powered production systems.
Establish and track service-level indicators, service-level objectives, and operational readiness requirements.
Lead reliability reviews, production readiness assessments, and infrastructure risk assessments.
Drive improvements in system resilience, scalability, security, performance, and cost optimization.
Champion SRE best practices across engineering teams.