Senior Site Reliability Engineer

Mastercard • O'Fallon, MO

3d • Onsite

About The Position

The Business Operations (Biz Ops) Team is seeking a Business Operations Site Reliability Engineer (SRE). The Biz Ops organization serves as the production readiness steward for Mastercard products. As a Biz Ops SRE, you are responsible for ensuring platform stability, reliability, and health.

You will:

Break down barriers to operational excellence by fostering developer run ownership
Enable teams to build resilient, fault-tolerant, and scalable products
Support developers during application build phases with a focus on operational design, automation, capacity planning, and monitoring and observability

Biz Ops teams take a holistic view of systems and enforce operational standards while facilitating an agile, learning-focused culture.

We are seeking a Senior Site Reliability Engineer to play a critical role in ensuring the reliability, scalability, and performance of applications that power Mastercard’s global operations. As a thought leader, you will bring deep technical expertise, a strong automation mindset, and mentoring capabilities.

Requirements

Bachelor’s degree in Computer Science or related technical field (or equivalent experience)
Ability to read, write, and understand code in at least one programming language
Strong understanding of DevOps principles and configuration management
Experience designing and operating large-scale distributed systems
Strong analytical and problem-solving skills
Excellent communication skills and ownership mindset
Passion for observability, automation, and continuous improvement
Ability to collaborate across geographically distributed and matrixed teams
Programming experience with one or more of the following languages and technologies: Java, Spring Framework, Python, Go, C++, Spark, Big Data, gRPC
Experience with CI/CD tools such as: Git / Bitbucket, Jenkins, Maven, Artifactory, Chef, Groovy
Familiarity with cloud platforms (AWS, Azure, or GCP preferred)
Experience with messaging technologies: Kafka, RabbitMQ, ActiveMQ, Event Brokers
Experience with observability tools: Splunk, Dynatrace, Prometheus, Datadog, Grafana
Strong understanding of networking fundamentals (Layers 1–3), load balancers and application firewalls, operating systems, logging and monitoring standards, high availability and business continuity, and caching and configuration management

Nice To Haves

Hands-on experience with Kubernetes, Docker, and container registries
Public cloud architecture experience (AWS, Azure, GCP)
Blue/green deployment strategies
DevSecOps expertise and modern CI/CD practices
Azure certifications (AZ-203, AZ-400) preferred
Security fundamentals, including TLS/SSL, certificate lifecycle management, and encryption methods (symmetric/asymmetric)

Responsibilities

Support daily operations with a strong focus on incident triage, root cause analysis, business impact assessment, and blameless post-mortems
Engage early in the development lifecycle to proactively manage production readiness, optimize production changes, and enhance customer experience
Drive risk management, compliance, and resiliency across environments
Align product and customer priorities with operational requirements through continuous feedback throughout the application lifecycle
Own overall application health, performance, and capacity
Support services pre-launch through architecture reviews, capacity planning, and launch readiness reviews
Partner with product and development teams to establish monitoring and alerting strategies and zero-downtime deployment frameworks
Design and operate highly reliable and scalable systems
Perform root cause analysis and collaborate with development teams on remediation
Participate in on-call rotations and incident response
Implement sustainable incident management and blameless post-mortems
Define and improve Service Level Objectives (SLOs)
Automate data-driven alerting to proactively identify issues
Improve the full service lifecycle—from design to production and optimization
Support CI/CD pipelines with validation and operational gating
Lead DevOps automation and best practices
Reduce operational toil through increased automation and tooling
Optimize capacity planning and performance
Analyze ITSM activities and identify operational gaps
Provide continuous feedback to development teams on resiliency improvements

Benefits

Insurance (including medical, prescription drug, dental, vision, disability, and life insurance)
Flexible spending account and health savings account
Paid leaves (including 16 weeks of new parent leave and up to 20 days of bereavement leave)
80 hours of Paid Sick and Safe Time, 25 days of vacation time, and 5 personal days, pro-rated based on date of hire
10 annual paid U.S. observed holidays
401(k) with a best-in-class company match
Deferred compensation for eligible roles
Fitness reimbursement or on-site fitness facilities
Eligibility for tuition reimbursement
