Senior Reliability Engineer

Mastercard • O’Fallon, MO

About The Position

The Business Operations (Biz Ops) team serves as the production readiness steward for Mastercard products. As a Business Operations Site Reliability Engineer (SRE) / Operational Readiness Architect, your mission is to ensure platform stability, health, and resilience.

Requirements

BS in Computer Science or related technical field, or equivalent practical experience.
Curiosity and appetite for automation, new technologies, and scalable architectures.
Strong problem‑solving skills, communication abilities, ownership, and drive.
Interest in large‑scale distributed systems design, analysis, and troubleshooting.
Ability to work in diverse, matrix‑based, geographically distributed teams.
Ability to balance long‑term system health against short‑term fixes.
Ability to collaborate cross‑functionally with clear understanding of expected system behavior and monitoring needs.
Experience with industry-standard CI/CD tools such as Git/Bitbucket, Jenkins, Maven, Artifactory, and Chef. Experience designing and implementing an effective, efficient CI/CD flow that moves code from dev to prod with high quality and minimal manual effort is desired.
Experience in one or more of the following is preferred: C, C++, Java, Python, Go, Perl or Ruby.
Ability to work shifts and weekends when needed, based on team members' rotations and schedules.

Nice To Haves

Experience with algorithms, data structures, scripting, pipeline management, and software design.
Experience working across development, operations, and product teams.
Prior SRE experience.
Expertise in RDBMS such as PostgreSQL and Oracle.
Proficiency in SQL, PL/SQL, and PostgreSQL features.
Strong understanding of database architecture, performance tuning, and query optimization.
Experience with monitoring tools (e.g., Splunk, Dynatrace).
Experience in production support and ITIL processes.
Experience with CI/CD tools: Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, Chef.
Understanding of:
Client‑server relationships
Network concepts (Layers 1–3)
Stack trace analysis (TCP dumps, heap/CPU/memory/thread dumps)
Load balancers and application firewalls
Operating system navigation
Logging and monitoring standards
High availability and business continuity
Caching concepts
Configuration management
Awareness of security implementations, certificate lifecycle management, mutual TLS, SSL handshake, SSH keys, and encryption methods (symmetric/asymmetric).

Responsibilities

Foster developer ownership and empower teams to build resilient, fault‑tolerant, scalable products.
Support developers during the build phase with operational design, automation, capacity planning, and monitoring.
Establish and enforce operational standards while promoting an agile, learning‑focused culture.
Lead triage and root‑cause analysis with a focus on business impact and blameless post‑mortems.
Engage early in the development lifecycle to proactively manage production and change activities.
Drive risk management, compliance, and mitigation across environments.
Align product and customer priorities with operational needs through continuous feedback.
Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead Mastercard in DevOps automation and best practices.
Practice sustainable incident response and blameless post-mortems.
Take a holistic approach to problem solving by connecting the dots across the technology stack that makes up the platform during a production event, to optimize mean time to recover.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Share knowledge and mentor junior team members.
Serve as the primary contact for application health, performance, and capacity.
Support services before launch through system design consulting, capacity planning, and launch reviews.
Partner with development and product teams to define monitoring and alerting strategies.
Build frameworks that enable zero‑downtime deployments.
Analyze ITSM activities and provide feedback to development teams on operational gaps and resiliency concerns.