Senior Reliability Engineer

Mastercard, O’Fallon, MO

About The Position

The Business Operations (Biz Ops) team serves as the production readiness steward for Mastercard products. As a Business Operations Site Reliability Engineer (SRE) / Operational Readiness Architect, your mission is to ensure platform stability, health, and resilience.

Requirements

  • BS in Computer Science or a related technical field, or equivalent practical experience.
  • Curiosity and appetite for automation, new technologies, and scalable architectures.
  • Strong problem‑solving and communication skills, a sense of ownership, and drive.
  • Interest in large‑scale distributed systems design, analysis, and troubleshooting.
  • Ability to work in diverse, matrix‑based, geographically distributed teams.
  • Ability to balance long‑term system health against short‑term fixes.
  • Ability to collaborate cross‑functionally with a clear understanding of expected system behavior and monitoring needs.
  • Experience with industry‑standard CI/CD tools such as Git/Bitbucket, Jenkins, Maven, Artifactory, and Chef. Experience designing and implementing an efficient CI/CD flow that moves code from development to production with high quality and minimal manual effort is desired.
  • Experience with one or more of the following programming languages is preferred: C, C++, Java, Python, Go, Perl, or Ruby.
  • Ability to work shifts and weekends when needed, based on the team's rotation schedule.

Nice To Haves

  • Experience with algorithms, data structures, scripting, pipeline management, and software design.
  • Experience working across development, operations, and product teams.
  • Prior SRE experience.
  • Expertise in RDBMS such as PostgreSQL and Oracle.
  • Proficiency in SQL, PL/SQL, and PostgreSQL features.
  • Strong understanding of database architecture, performance tuning, and query optimization.
  • Experience with monitoring tools (e.g., Splunk, Dynatrace).
  • Experience in production support and ITIL processes.
  • Experience with CI/CD tools: Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, Chef.
  • Understanding of:
    • Client‑server relationships
    • Network concepts (Layer 1–3)
    • Diagnostic analysis (stack traces, TCP dumps, heap/CPU/memory/thread dumps)
    • Load balancers and application firewalls
    • Operating system navigation
    • Logging and monitoring standards
    • High availability and business continuity
    • Caching concepts
    • Configuration management
  • Awareness of security implementations, including certificate lifecycle management, mutual TLS, the SSL/TLS handshake, SSH keys, and encryption methods (symmetric/asymmetric).

Responsibilities

  • Foster developer ownership and empower teams to build resilient, fault‑tolerant, scalable products.
  • Support developers during the build phase with operational design, automation, capacity planning, and monitoring.
  • Establish and enforce operational standards while promoting an agile, learning‑focused culture.
  • Lead triage and root‑cause analysis with a focus on business impact and blameless post‑mortems.
  • Engage early in the development lifecycle to proactively manage production and change activities.
  • Drive risk management, compliance, and mitigation across environments.
  • Align product and customer priorities with operational needs through continuous feedback.
  • Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead Mastercard in DevOps automation and best practices.
  • Practice sustainable incident response and blameless post-mortems.
  • Take a holistic approach to problem solving by connecting the dots across the platform's technology stack during a production event to minimize mean time to recover (MTTR).
  • Work with a global team spread across tech hubs in multiple geographies and time zones.
  • Share knowledge and mentor junior engineers.
  • Serve as the primary contact for application health, performance, and capacity.
  • Support services before launch through system design consulting, capacity planning, and launch reviews.
  • Partner with development and product teams to define monitoring and alerting strategies.
  • Build frameworks that enable zero‑downtime deployments.
  • Analyze ITSM activities and provide feedback to development teams on operational gaps and resiliency concerns.