Site Reliability Engineer II

Mastercard•O'Fallon, MO

3d•$76,000 - $127,000•Onsite

About The Position

The Payment Network Business Operations team is seeking a highly motivated and experienced Site Reliability Engineer II (SRE). You will play a critical role in ensuring the reliability, scalability, and performance of our applications, supporting essential services that power Mastercard's global operations. As a thought leader in your field, you will bring technical expertise, a passion for automation, and the ability to mentor.

The Business Operations Site Reliability Engineer is the production readiness steward for Mastercard products. As Business Operations SREs, we are responsible for ensuring that our platform is stable and healthy. We break down barriers to running our products by fostering developer-run ownership and empowering developers to build resilient products. During the application build phase, we support our developers in software run principles, including operational design, automation, capacity planning, and monitoring, that lead to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile and learning culture. We support daily operations with a hyper focus on triage and root cause analysis, understanding the business impact of our products and subsequently performing blameless post-mortems.

The goal of every Business Operations team is to engage early in the development lifecycle so that we are proactive in the development process, and to proactively manage production and change activities to maximize customer experience and increase the overall value of supported applications. Business Operations teams also focus on risk management, tying all our activities together with an overarching responsibility for compliance and risk mitigation across all our environments. Ultimately, the role of Business Operations is to align Product and Customer Focused priorities with Operational needs by providing continuous feedback throughout the lifecycle.

Requirements

Ability to use scripting and tooling to implement observability solutions, enabling the collection, analysis, and visualization of metrics, logs, and traces to support incident detection, diagnosis, and continuous service improvement.
Ability to write and maintain code and scripts to automate tasks, build operational tools, and support monitoring, deployment, and incident response using languages such as Python, Go, Bash, or similar.
Ability to configure, operate, and troubleshoot Linux/Unix systems and network components, applying knowledge of networking concepts, protocols, security, and system reliability.
Ability to design, deploy, and manage applications and infrastructure on cloud platforms (e.g., AWS, Azure, GCP), ensuring scalability, security, availability, and operational efficiency.
Ability to design and operate systems for high availability, fault tolerance, and disaster recovery, while ensuring systems can scale to meet current and future demand.
Ability to apply DevOps principles and practices, including CI/CD pipelines, containerization, and orchestration, to enable faster, more reliable software delivery and operations.
Capability to systematically identify, diagnose, and resolve technical issues across systems, applications, and networks, using analytical methods and tools to restore functionality, minimize disruption, and ensure stable operations.
Ability to monitor resource utilization, forecast future capacity needs, and optimize system performance to support growth, scalability, and efficient infrastructure usage.
Ability to apply IT service management principles to incident, problem, and change management, ensuring reliable service delivery, effective incident response, and continuous service improvement aligned to business needs.
Ability to use application reliability signals to anticipate issues, identify risks, and drive preventative improvements that enhance application performance and availability.
Strong knowledge of ITSM practices, observability, and monitoring using tools such as Splunk and Dynatrace.
Experience operating and supporting applications on PCF and AWS platforms.
Proven ability to implement CI/CD pipelines using Jenkins, Bitbucket, and XLR for automated build and release management.

Responsibilities

Work independently on elements of projects/processes within the Site Reliability Engineering area by applying intermediate/practical knowledge and area best practices to meet organizational standards of quality and excellence.
Support the implementation and maintenance of high-availability systems to ensure operational stability.
Assist in evaluating operational needs and developing technical solutions under guidance.
Contribute to automation and scripting projects to streamline routine operational tasks.
Troubleshoot and resolve basic to moderate system issues, escalating more complex problems as needed.
Document operational procedures and share knowledge with team members.
Participate in quality checks and reviews to ensure system stability and reliability.
Utilize experience and a comprehensive understanding of area processes and tools to make minor adjustments or enhancements to resolve identifiable issues.
May manage smaller projects/initiatives as an experienced individual contributor with specialized knowledge within the Site Reliability Engineering area.

Benefits

insurance (including medical, prescription drug, dental, vision, disability, life insurance)
flexible spending account and health savings account
paid leaves (including 16 weeks of new parent leave and up to 20 days of bereavement leave)
80 hours of Paid Sick and Safe Time, 25 days of vacation time and 5 personal days, pro-rated based on date of hire
10 annual paid U.S. observed holidays
401k with a best-in-class company match
deferred compensation for eligible roles
fitness reimbursement or on-site fitness facilities
eligibility for tuition reimbursement
