Site Reliability Engineer - Global Commercial Services Tech

American Express • New York, NY

7d • $103,750 - $174,750 • Hybrid

About The Position

At American Express, our culture is built on a 175-year history of innovation, shared values and Leadership Behaviors, and an unwavering commitment to back our customers, communities, and colleagues. From delivering differentiated products to providing world-class customer service, we operate with a strong risk mindset, ensuring we continue to uphold our brand promise of trust, security, and service. As part of Team Amex, you'll experience this powerful backing with comprehensive support for your holistic well-being and many opportunities to learn new skills, develop as a leader, and grow your career. Here, your voice and ideas matter, your work makes an impact, and together, you will help us define the future of American Express.

How will you make an impact in this role?

Responsible for contacting clients with overdue accounts to secure settlement of the account. Also performs preventive work to avoid future overdue balances on accounts with high exposure.

Requirements

Experience in software development or technology operations, with a focus on Site Reliability Engineering
Experience in Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)
Bachelor's degree in Computer Science, Information Technology, or Engineering, and/or comparable experience; advanced degree preferred
Knowledge of IaC and CI/CD automation tools (e.g., Terraform, GitHub Actions)
Advanced knowledge of the modern observability stack (e.g., Splunk, Elasticsearch, Prometheus, Grafana)
Advanced knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
Advanced knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
Advanced knowledge of distributed systems architecture, RESTful services, and microservices
Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
Familiarity with test automation tools/frameworks (e.g., Postman, Karate)

Nice To Haves

Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus

Responsibilities

Collaborates across Software Engineering teams to design, develop, and implement features that enhance system resilience, scalability, and performance, proactively identifying and resolving system bottlenecks and failure points
Develops and refines sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline operational workflows, deployment processes, and infrastructure management, ensuring high system efficiency
Engages in architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are integrated into strategic decision-making processes
Designs and executes comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement robust improvements that enhance system robustness and recovery capabilities
Develops, optimizes, and maintains comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions
Advocates for observability practices by promoting and implementing best practices such as error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability
Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Benefits

Competitive base salaries
Bonus incentives
6% Company Match on retirement savings plan
Free financial coaching and financial well-being support
Comprehensive medical, dental, vision, life insurance, and disability benefits
Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
20+ weeks paid parental leave for all parents, regardless of gender, offered for pregnancy, adoption or surrogacy
Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
Free and confidential counseling support through our Healthy Minds program
Career development and training opportunities