Senior BizOps Engineer

Mastercard • O’Fallon, MO

About The Position

The Business Operations (Biz Ops) team is seeking a Business Operations Site Reliability Engineer (SRE) / Operational Readiness Architect. The Business Operations organization serves as the production readiness steward for Mastercard products. As Business Operations SREs, we are responsible for ensuring that our platform is stable and healthy. We break down barriers to running our products by fostering developer run ownership and empowering developers to build resilient products. During the application build phase, we support our developers in software run principles, including operational design, automation, capacity planning, and monitoring, that lead to fault-tolerant, scalable products. We see the big picture and help create and enforce operations standards while facilitating an agile, learning culture.

We support daily operations with a sharp focus on triage and root-cause analysis, grounded in an understanding of the business impact of our products, followed by blameless post-mortems. The goal of every Business Operations team is to engage early in the development lifecycle so that we can be proactive rather than reactive, and to manage production and change activities in a way that maximizes customer experience and increases the overall value of supported applications.

Business Operations teams also focus on risk management, tying all of our activities together under an overarching responsibility for compliance and risk mitigation across all of our environments. Ultimately, the role of Business Operations is to align product and customer-focused priorities with operational needs by providing continuous feedback throughout the lifecycle.

Requirements

BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience.
Appetite for change and for pushing the boundaries of what can be done with automation. Curiosity about new technology, infrastructure, and practices that scale our architecture and prepare for future growth.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Interest in designing, analyzing, and troubleshooting large-scale distributed systems.
Willingness and ability to learn, take on challenging opportunities, and work as a member of a matrix-based, diverse, and geographically distributed project team.
Ability to balance doing things right with fixing things quickly. Flexible and pragmatic, while working towards improving the long-term health of the system.
Comfortable collaborating with cross-functional teams to ensure that expected system behavior is understood and that monitoring exists to detect anomalies.

Nice To Haves

Experience with algorithms, data structures, scripting, pipeline management, and software design.
Experience working across development, operations, and product teams to prioritize needs and build relationships is a must.
Experience in an SRE role or related field.
Proven expertise in relational database management systems (RDBMS) such as PostgreSQL and Oracle.
Proficiency in SQL, PL/SQL, and PostgreSQL-specific features.
Strong understanding of database architecture, performance tuning, and query optimization.
Experience with monitoring tools such as Splunk and Dynatrace.
Experience in production support environments and ITIL processes.
Experience with industry-standard CI/CD tools such as Git/Bitbucket, Jenkins, Maven, Artifactory, Groovy, and Chef. Experience designing and implementing an effective, efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is required.
Understanding of:
Client-server relationships
Network concepts (Layer 1 to Layer 3)
Diagnostic analysis (stack traces, TCP dumps, heap dumps, thread dumps, CPU/memory analysis)
Load balancers and application firewalls
Operating system navigation
Logging and monitoring methods, standards, and tools
High availability and business continuity planning
Caching concepts
Configuration management
Awareness of security implementations, the certificate management lifecycle, mutual TLS, the SSL/TLS handshake, SSH keys, and symmetric and asymmetric encryption.

Responsibilities

Serve as the primary contact responsible for the overall application health, performance, and capacity.
Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
Partner with the development and product teams of new applications to establish the right monitoring and alerting strategy and to create a framework for achieving zero downtime during deployment.
Analyze the platform's ITSM activities and provide a feedback loop to development teams on operational gaps or resiliency concerns.