Principal Site Reliability Engineer

ACI Worldwide • Norcross, GA

Hybrid

About The Position

ACI powers the payments ecosystem globally, and you power ACI. You’ll innovate, collaborate, and grow in an energetic technology culture with decades of proven success. ACIers, in all roles and at all levels, are truly your colleagues, and many are your friends. Our size and reach allow you to see the global impact of your work. You are visible, your talents are valued, and you are empowered to shape the future of payments.

As a Principal Site Reliability Engineer in Norcross, GA or Omaha, NE, you will join a diverse, passionate team dedicated to powering the world’s payments ecosystem! The Principal Site Reliability Engineer is embedded directly with our product teams, working closely with them to design, code, test, run, and evolve the systems that help people around the world make payments. We work closely with ACI teams to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, follow-the-sun on-call, incident retrospectives, chaos testing, and end-to-end ownership.

Requirements

BS degree in Computer Science, related technical field, or equivalent practical experience.
Experience in data structures, database systems, algorithms, and software design.
Experience writing code in Java, Go, Shell, Python, or a similar language.
Ability to debug, optimize code, and automate routine tasks.
Practical skills with relational databases (such as PostgreSQL, Oracle), NoSQL key-value stores (such as Cassandra), and messaging systems (such as Kafka, RabbitMQ, and MQ), or equivalent.
Proven ability to drive organizational adoption of SRE practices such as SLOs, resilience, scaling, and performance.
15+ years of experience.

Nice To Haves

Experience in an SRE or Production Engineering role
Experience with a globally distributed team
Initiative in solving problems using a scientific approach
Ability to apply appropriate new technologies and processes
Skill in providing substantive feedback on distributed system designs
Collaboration skills

Responsibilities

Design, develop, deploy, and champion software and systems that increase product reliability and organizational efficiency.
Guide reliability practices across the entire software development lifecycle through activities like architecture reviews, code reviews, creating platforms and frameworks, capacity planning, and chaos testing.
Maintain service health by implementing and evolving monitoring, alerting, self-healing, and follow-the-sun incident response.
Improve service reliability through blameless post-incident reviews and by using code to prevent or respond to problem recurrence.
Function as a key technical and cultural leader throughout your assigned line of business.
Drive and evolve the overall resilience strategy of your line of business, leveraging industry and internal tools.
Ensure that local and cross-site redundancy mechanisms meet requirements, work as designed, and continue to evolve.
Set, maintain, and enforce standards across deployment practices, operations, and related areas.
Engage in change review as a key member.
Function as a key contributor to overall capacity, peak-season, and business continuity methodologies and testing for your space.
Interface directly with key clients as needed.
Support and help standardize sales responses for your space by helping business and DevOps teams craft go-forward offers that align costs, SLAs, and technology.
Perform other duties as assigned.
Understand and adhere to all corporate policies, including but not limited to the ACI Code of Business Conduct and Ethics.