Senior Site Reliability Engineer

Mastercard
O'Fallon, MO
Onsite

About The Position

The Business Operations (Biz Ops) Team is seeking a Business Operations Site Reliability Engineer (SRE). The Biz Ops organization serves as the production readiness steward for Mastercard products. As a Biz Ops SRE, you are responsible for ensuring platform stability, reliability, and health. You will:

  • Break down barriers to operational excellence by fostering developer run ownership
  • Enable teams to build resilient, fault-tolerant, and scalable products
  • Support developers during application build phases with a focus on: Operational design, Automation, Capacity planning, Monitoring and observability

Biz Ops teams take a holistic view of systems and enforce operational standards while facilitating an agile, learning-focused culture.

We are seeking a Senior Site Reliability Engineer to play a critical role in ensuring the reliability, scalability, and performance of applications that power Mastercard’s global operations. As a thought leader, you will bring deep technical expertise, a strong automation mindset, and mentoring capabilities.

Requirements

  • Bachelor’s degree in Computer Science or related technical field (or equivalent experience)
  • Ability to read, write, and understand code in at least one programming language
  • Strong understanding of DevOps principles and configuration management
  • Experience designing and operating large-scale distributed systems
  • Strong analytical and problem-solving skills
  • Excellent communication skills and ownership mindset
  • Passion for observability, automation, and continuous improvement
  • Ability to collaborate across geographically distributed and matrixed teams
  • Programming experience in one or more of: Java, Spring Framework, Python, Go, C++, Spark, Big Data, gRPC
  • Experience with CI/CD tools such as: Git / Bitbucket, Jenkins, Maven, Artifactory, Chef, Groovy
  • Familiarity with cloud platforms (AWS, Azure, or GCP preferred)
  • Experience with messaging technologies: Kafka, RabbitMQ, ActiveMQ, Event Brokers
  • Experience with observability tools such as: Splunk, Dynatrace, Prometheus, Datadog, Grafana
  • Strong understanding of: Networking fundamentals (Layers 1–3), Load balancers and application firewalls, Operating systems, Logging and monitoring standards, High availability and business continuity, Caching and configuration management

Nice To Haves

  • Hands-on experience with Kubernetes, Docker, and container registries
  • Public cloud architecture experience (AWS, Azure, GCP)
  • Blue/green deployment strategies
  • DevSecOps expertise and modern CI/CD practices
  • Azure certifications (AZ-203, AZ-400) preferred
  • Security fundamentals including: TLS/SSL, Certificate lifecycle management, Encryption methods (symmetric/asymmetric)

Responsibilities

  • Support daily operations with a strong focus on: Incident triage, Root cause analysis, Business impact assessment, Blameless post-mortems
  • Engage early in the development lifecycle to: Proactively manage production readiness, Optimize production changes, Enhance customer experience
  • Drive risk management, compliance, and resiliency across environments
  • Align product and customer priorities with operational requirements through continuous feedback throughout the application lifecycle
  • Own overall application health, performance, and capacity
  • Support services pre-launch through: Architecture reviews, Capacity planning, Launch readiness reviews
  • Partner with product and development teams to establish: Monitoring and alerting strategies, Zero-downtime deployment frameworks
  • Design and operate highly reliable and scalable systems
  • Perform root cause analysis and collaborate with development teams on remediation
  • Participate in on-call rotations and incident response
  • Implement sustainable incident management and blameless post-mortems
  • Define and improve Service Level Objectives (SLOs)
  • Automate data-driven alerting to proactively identify issues
  • Improve the full service lifecycle—from design to production and optimization
  • Support CI/CD pipelines with validation and operational gating
  • Lead DevOps automation and best practices
  • Reduce operational toil through increased automation and tooling
  • Optimize capacity planning and performance
  • Analyze ITSM activities and identify operational gaps
  • Provide continuous feedback to development teams on resiliency improvements

Benefits

  • Insurance (including medical, prescription drug, dental, vision, disability, life insurance)
  • Flexible spending account and health savings account
  • Paid leaves (including 16 weeks of new parent leave and up to 20 days of bereavement leave)
  • 80 hours of Paid Sick and Safe Time, 25 days of vacation time and 5 personal days, pro-rated based on date of hire
  • 10 annual paid U.S. observed holidays
  • 401(k) with a best-in-class company match
  • Deferred compensation for eligible roles
  • Fitness reimbursement or on-site fitness facilities
  • Eligibility for tuition reimbursement