Site Reliability Engineer III

JPMorganChase • Jersey City, NJ

About The Position

This role focuses on designing, implementing, and maintaining robust CI/CD pipelines, managing AWS cloud infrastructure using Infrastructure as Code, and deploying containerized workloads. The engineer will establish monitoring, alerting, and SLOs, lead incident response, and apply security best practices. Key responsibilities include driving system reliability and cost efficiency, standardizing environment management, developing internal tooling, and partnering with development teams. The role also involves documenting processes, fostering a culture of continuous learning, and automating operational procedures.

Requirements

7+ years of experience in DevOps, SRE, or Cloud Automation.
Hands-on experience with AWS services (IAM, VPC, EC2, ALB/NLB, S3, RDS/Aurora, CloudWatch, EKS/ECS, Lambda, Route 53).
Experience building CI/CD pipelines with tools such as GitHub Actions, GitLab CI, Jenkins, or Azure DevOps.
Proficiency in at least one scripting or programming language or platform (Python, Bash, Java, .NET).
Solid understanding of Linux/Unix systems and networking fundamentals.
Experience with secrets and configuration management tools (AWS Secrets Manager, AWS Systems Manager (SSM) Parameter Store, Vault).
Experience with observability and monitoring tools (Grafana, Dynatrace, Prometheus, Datadog, Splunk).
Familiarity with containers and container orchestration (Docker, Kubernetes, ECS).
Strong communication skills and the ability to work both independently and in teams.
Proactive, innovative, and passionate about learning.

Nice To Haves

Familiarity with modern front-end technologies.
Experience with large-scale distributed systems.
Knowledge of networking and security best practices.
Strong collaboration and communication skills.

Responsibilities

Design, implement, and maintain end-to-end CI/CD pipelines for both application and infrastructure delivery, supporting release management and change control processes.
Build, manage, and govern AWS cloud infrastructure using Infrastructure as Code tools such as Terraform, CloudFormation, or CDK, ensuring consistency across environments.
Implement and manage containerized workloads and deployment workflows using Docker, Kubernetes/EKS, and ECS across the full software delivery lifecycle.
Establish monitoring and alerting, and define SLOs based on service level indicators (SLIs); lead incident response, root cause analysis, and postmortem processes to minimize customer impact.
Apply security best practices including IAM least privilege, secrets management, and policy-as-code to enforce governance and reduce risk across all environments.
Drive system reliability and cost efficiency through autoscaling strategies, right-sizing, performance tuning, and proactive issue resolution.
Standardize and automate environment management across dev, test, and production, enforcing governance controls and ensuring parity across stages.
Design and develop robust internal tooling and software solutions that enhance system performance, scalability, and operational efficiency.
Partner with development teams and stakeholders to identify reliability and scalability improvements, participate in the on-call rotation, and support cross-functional delivery.
Document processes, contribute to communities of practice, and foster a team culture grounded in diversity, inclusion, respect, and continuous learning.
Automate provisioning, configuration management, patching, backups, and operational procedures to reduce toil and improve system reliability.