Senior Site Reliability Engineer

Mastercard•Remote - Utah, UT

2d•$96,000 - $163,000•Remote

About The Position

Commerce Media is hiring a Senior Site Reliability Engineer to lead the reliability, scalability, and production operations of a greenfield application within our enterprise platform. This is a high-impact, individual contributor role with end-to-end ownership of system reliability—from design influence through production operations. You will partner across engineering and platform teams to ensure services are resilient, observable, and production-ready from day one.

Requirements

Years of professional experience operating distributed systems at scale in production
Strong expertise in Kubernetes and containerized environments, observability (metrics, logging, tracing), and Spring Boot and/or Golang ecosystems
Hands-on experience across application, infrastructure, and release pipelines
Demonstrated ownership of service reliability, incident response, and operational strategy
Ability to influence system design through technical leadership and data-driven decisions
Pragmatic mindset—balancing automation, trade-offs, and system evolution
Experience navigating enterprise environments while maintaining delivery velocity
Leverages AI tools (e.g., Copilot, ChatGPT, Claude) to accelerate design, coding, and testing, and to improve code quality and operational outcomes
Integrates AI into workflows such as architecture reviews, code generation, testing, and documentation
Applies strong judgment in production-critical, low-latency environments

Nice To Haves

Spring Boot and/or Golang services

Responsibilities

Drive reliability-focused design in partnership with engineering and platform teams
Lead architecture and launch readiness reviews, including capacity planning and failure-mode and risk analysis
Define and enforce non-functional requirements (availability, latency, resilience)
Own production reliability and service health
Act as incident commander, leading triage, mitigation, and communication
Lead blameless post-mortems with clear, actionable follow-ups
Proactively identify and reduce operational risk across the system
Define and manage SLIs, SLOs, and error budgets
Design and operate monitoring and alerting using Prometheus, Grafana, OpenSearch / Elasticsearch, and Opsgenie
Build dashboards aligned to user impact and system health
Drive automation-first operations to scale systems sustainably
Enhance CI/CD pipelines (GitHub Actions) with deployment gating and validation
Identify and resolve performance and reliability bottlenecks
Improve developer experience through operational tooling and best practices

Benefits

insurance (including medical, prescription drug, dental, vision, disability, and life)
flexible spending account and health savings account
16 weeks of new parent leave
up to 20 days of bereavement leave
80 hours of Paid Sick and Safe Time
25 days of vacation time
5 personal days
10 annual paid U.S. observed holidays
401k with a best-in-class company match
deferred compensation for eligible roles
fitness reimbursement or on-site fitness facilities
eligibility for tuition reimbursement