Principal Site Reliability Engineer

ACI Worldwide
Norcross, GA
Hybrid

About The Position

ACI powers the payments ecosystem globally, and you power ACI. You’ll innovate, collaborate, and grow in an energetic technology culture with decades of proven success. ACIers, in all roles and levels, are truly your colleagues, and many are your friends. Our size and reach allow you to see the global impact of your work. You are visible, your talents are valued, and you are empowered to shape the future of payments.

As a Principal Site Reliability Engineer in Norcross, GA or Omaha, NE, you will join a diverse, passionate team dedicated to powering the world’s payments ecosystem. The Principal Site Reliability Engineer is embedded directly with our product teams, working closely with them to design, code, test, run, and evolve the systems that help people around the world make payments. We work closely with ACI teams to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, follow-the-sun on-call, incident retrospectives, chaos testing, and end-to-end ownership.

Requirements

  • BS degree in Computer Science, related technical field, or equivalent practical experience.
  • Experience in data structures, database systems, algorithms, and software design.
  • Experience writing code in Java, Go, Shell, Python, or a similar language.
  • Ability to debug, optimize code, and automate routine tasks.
  • Practical skills with relational databases (such as PostgreSQL, Oracle), NoSQL KV stores (such as Cassandra), and messaging systems (such as Kafka, RabbitMQ, and MQ), or equivalent.
  • Proven ability to drive organizational adoption of SRE practices such as SLOs, resilience, scaling, and performance.
  • 15+ years of experience

Nice To Haves

  • Experience in an SRE or Production Engineering role
  • Experience with a globally distributed team
  • Initiative to solve problems using a scientific approach
  • Ability to apply appropriate new technologies and processes
  • Skill in providing substantial feedback on distributed system designs
  • Collaboration skills

Responsibilities

  • Design, develop, deploy, and motivate the creation of software and systems to increase product reliability and organizational efficiency.
  • Guide reliability practices through the entire software development lifecycle through activities like architecture reviews, code reviews, creating platforms and frameworks, capacity planning, and chaos testing.
  • Maintain service health by implementing and evolving monitoring, alerting, self-healing and follow-the-sun incident response.
  • Improve service reliability through blameless post-incident reviews and using code to prevent or respond to problem recurrence.
  • Function as a key technical and culture leader throughout your assigned line of business.
  • Drive and evolve the overall resilience strategy of your given line of business, leveraging industry and internal tools.
  • Ensure that local and cross-site redundancy mechanisms meet requirements, work as designed, and continue to evolve.
  • Set, maintain, and enforce standards across deployment practices, operations, and related areas.
  • Engage in change review as a key member.
  • Function as a key contributor to overall capacity, peak-season, and business continuity methodologies and testing for your space.
  • Interface directly with key clients as needed.
  • Support and help standardize sales responses for your space by helping business and DevOps teams craft go-forward offers that align costs, SLAs, and technology.
  • Perform other duties as assigned.
  • Understand and adhere to all corporate policies, including but not limited to the ACI Code of Business Conduct and Ethics.

Benefits

  • Opportunities for growth
  • Career development
  • Competitive compensation and benefits package