Site Reliability Engineer III

JPMorgan Chase & Co. • Plano, TX


About The Position

As a Site Reliability Engineer III at JPMorganChase within the Data Solutions team of Corporate Sector, you will play a key role in automating, troubleshooting, and monitoring AWS-based applications and infrastructure. You will work hands-on to enhance reliability, performance, and scalability, ensuring seamless operations and continuous improvement. Your expertise will help drive the adoption of SRE best practices and deliver impactful solutions for the business.

Requirements

Formal training or certification in software engineering concepts and 3+ years of applied experience
Proficient in site reliability engineering principles and their application within cloud environments
Skilled in at least one programming language such as Python, Java/Spring Boot, or .NET
Strong knowledge of software applications and technical processes within disciplines like Cloud or AI
Experience with observability tools (Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.)
Familiarity with CI/CD and infrastructure-as-code tools such as Jenkins, GitLab, or Terraform
Ability to proactively identify and address technical challenges
Demonstrates interest in learning new technologies to drive innovation
Capable of identifying and implementing relevant solutions to meet design constraints
Initiates and implements ideas to solve business problems
Effectively communicates and collaborates within large teams with limited supervision

Nice To Haves

Experience with the AWS platform and container orchestration (EKS)
Familiarity with troubleshooting common networking technologies and issues
Exposure to cloud security and compliance practices
Experience with infrastructure automation tools (Ansible, Chef, Puppet)
Knowledge of distributed systems and microservices architecture
Experience working in agile development environments

Responsibilities

Guides and assists others in building effective designs and achieving consensus within the team
Collaborates with software engineers and teams to implement automated CI/CD pipelines for deployment
Designs, develops, tests, and implements solutions to improve availability, reliability, and scalability
Implements infrastructure, configuration, and network as code for assigned applications and platforms
Works with technical experts, stakeholders, and team members to resolve complex issues
Understands and applies service level indicators and objectives to proactively address potential problems
Supports the adoption and implementation of site reliability engineering best practices
Drives automation initiatives to reduce manual intervention and improve operational efficiency
Troubleshoots AWS infrastructure and application issues to maintain high reliability
Enhances observability through monitoring, alerting, and telemetry collection