Site Reliability Engineer - Global Commercial Services Tech

American Express
New York, NY
$103,750 - $174,750
Hybrid

About The Position

At American Express, our culture is built on a 175-year history of innovation, shared values and Leadership Behaviors, and an unwavering commitment to back our customers, communities, and colleagues. From delivering differentiated products to providing world-class customer service, we operate with a strong risk mindset, ensuring we continue to uphold our brand promise of trust, security, and service. As part of Team Amex, you'll experience this powerful backing with comprehensive support for your holistic well-being and many opportunities to learn new skills, develop as a leader, and grow your career. Here, your voice and ideas matter, your work makes an impact, and together, you will help us define the future of American Express.

How will you make an impact in this role?

Responsible for contacting clients with overdue accounts to secure settlement of the account, and for preventive work to avoid future overdue balances on accounts with high exposure.

Requirements

  • Experience in software development or technology operations, with a focus on Site Reliability Engineering
  • Experience in Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)
  • Bachelor's degree in Computer Science, Information Technology, Engineering, and/or comparable experience; advanced degree preferred
  • Knowledge of IaC and CI/CD automation tools - Terraform, GitHub Actions
  • Advanced knowledge of the modern observability stack - Splunk, Elasticsearch, Prometheus, Grafana
  • Advanced knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
  • Advanced knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
  • Advanced knowledge of distributed systems architecture, RESTful services, and microservices
  • Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
  • Familiarity with test automation tools/frameworks (e.g., Postman, Karate)

Nice To Haves

  • Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus

Responsibilities

  • Collaborates across Software Engineering teams to design, develop, and implement features that enhance system resilience, scalability, and performance, proactively identifying and resolving system bottlenecks and failure points
  • Develops and refines sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline operational workflows, deployment processes, and infrastructure management, ensuring high system efficiency
  • Engages in architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are integrated into strategic decision-making processes
  • Designs and executes comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement robust improvements that enhance system robustness and recovery capabilities
  • Develops, optimizes, and maintains comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions
  • Advocates for observability practices by promoting and implementing best practices such as error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability
  • Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives

Benefits

  • Competitive base salaries
  • Bonus incentives
  • 6% Company Match on retirement savings plan
  • Free financial coaching and financial well-being support
  • Comprehensive medical, dental, vision, life insurance, and disability benefits
  • Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
  • 20+ weeks paid parental leave for all parents, regardless of gender, offered for pregnancy, adoption or surrogacy
  • Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
  • Free and confidential counseling support through our Healthy Minds program
  • Career development and training opportunities