Senior Site Reliability Engineer

Mastercard
Remote - Utah, UT
$96,000 - $163,000
Remote

About The Position

Commerce Media is hiring a Senior Site Reliability Engineer to lead the reliability, scalability, and production operations of a greenfield application within our enterprise platform. This is a high-impact, individual contributor role with end-to-end ownership of system reliability—from design influence through production operations. You will partner across engineering and platform teams to ensure services are resilient, observable, and production-ready from day one.

Requirements

  • Years of professional experience operating distributed systems at scale in production
  • Strong expertise in Kubernetes and containerized environments, observability (metrics, logging, tracing), and Spring Boot and/or Golang ecosystems
  • Hands-on across application, infrastructure, and release pipelines
  • Demonstrated ownership of service reliability, incident response, and operational strategy
  • Ability to influence system design through technical leadership and data-driven decisions
  • Pragmatic mindset—balancing automation, trade-offs, and system evolution
  • Experience navigating enterprise environments while maintaining delivery velocity
  • Leverages AI tools (e.g., Copilot, ChatGPT, Claude) to accelerate design, coding, and testing, and to improve code quality and operational outcomes
  • Integrates AI into workflows such as architecture reviews, code generation, testing, and documentation
  • Applies strong judgment in production-critical, low-latency environments

Nice To Haves

  • Spring Boot and/or Golang services

Responsibilities

  • Drive reliability-focused design in partnership with engineering and platform teams
  • Lead architecture and launch readiness reviews, including capacity planning and failure-mode and risk analysis
  • Define and enforce non-functional requirements (availability, latency, resilience)
  • Own production reliability and service health
  • Act as incident commander, leading triage, mitigation, and communication
  • Lead blameless post-mortems with clear, actionable follow-ups
  • Proactively identify and reduce operational risk across the system
  • Define and manage SLIs, SLOs, and error budgets
  • Design and operate monitoring and alerting using Prometheus, Grafana, OpenSearch / Elasticsearch, and Opsgenie
  • Build dashboards aligned to user impact and system health
  • Drive automation-first operations to scale systems sustainably
  • Enhance CI/CD pipelines (GitHub Actions) with deployment gating and validation
  • Identify and resolve performance and reliability bottlenecks
  • Improve developer experience through operational tooling and best practices

Benefits

  • insurance (including medical, prescription drug, dental, vision, disability, life insurance)
  • flexible spending account and health savings account
  • 16 weeks of new parent leave
  • up to 20 days of bereavement leave
  • 80 hours of Paid Sick and Safe Time
  • 25 days of vacation time
  • 5 personal days
  • 10 annual paid U.S. observed holidays
  • 401k with a best-in-class company match
  • deferred compensation for eligible roles
  • fitness reimbursement or on-site fitness facilities
  • eligibility for tuition reimbursement