System Admin/Engr II - AMZ13432.7

Amazon•Seattle, WA

14d•Remote

About The Position

This role involves creating new technology components, building software solutions for large-scale infrastructure management, and using AWS technologies such as S3, SQS, SNS, Step Function Workflows, and Lambda to solve data center problems. The position requires designing and implementing scalable, fault-tolerant architectures, working with databases such as Redshift and DynamoDB, and troubleshooting Java, Ruby, JavaScript, and Python-based applications. A key aspect of the role is continuous collaboration with teams to identify automation opportunities, design and implement automated workflows, and automate repetitive tasks in SOPs and runbooks. The role also includes implementing monitoring and alerting solutions, analyzing logs and metrics to identify root causes of problems, and developing runbooks for ticket resolution. Collaborating with development teams to enhance runbooks, providing incident response support, and performing periodic Change Management executions are also part of the responsibilities. The person in this role will also track Continuous Deployment implementation, support development teams, implement constraint and value manager changes, manage cross-organizational campaigns, address OS and fleet updates, implement infrastructure best practices, resolve Application Security action items, and participate in major engineering projects with global stakeholders. Additionally, the role involves mentoring System Engineers and interns on Amazon Tools and Technologies.

Requirements

Bachelor's degree or foreign equivalent in Computer Science, Information Science, Information Technology, Engineering, Mathematics, Physics, or a related field, followed by 5 years of progressively responsible experience in software development or engineering.
Alternatively, a Master's degree or foreign equivalent in Computer Science, Information Science, Information Technology, Engineering, Mathematics, Physics, or a related field and 1 year of experience in software development or engineering.
Must have one year of experience in systems engineering or site reliability engineering in a large, distributed environment, with a focus on automation.
Experience with DevOps tools, processes, and culture.
Knowledge of the complete software deployment life cycle, from design and build through test and deployment.
Designing and implementing CI/CD workflows using Infrastructure as Code (IaC) and Git in Production environments.
Troubleshooting and debugging technical systems or Java-based applications.
Working with Linux or Unix operating systems.
Experience with XML/SOAP, REST, and HTTP protocols.
Debugging database-related issues and working with databases such as Redshift, DynamoDB, or Aurora RDS.
Working with one of the following programming or scripting languages: Python, JavaScript, Ruby, or Bash.
Utilizing AWS technologies such as S3, SQS, SNS, Step Function Workflows, or Lambda to create solutions to data center problems.

Nice To Haves

All applicants must meet all of the requirements listed above.

Responsibilities

Create new technology components needed as part of a service or system implementation.
Build software solutions to manage infrastructure needed to operate our system at large scale.
Utilize AWS technologies such as S3, SQS, SNS, Step Function Workflows, and Lambda to create solutions for data center problems.
Design and implement scalable and fault-tolerant architectures using AWS services.
Work with databases including Redshift and DynamoDB.
Troubleshoot and debug Java, Ruby, JavaScript, and Python-based applications to identify and resolve issues.
Continuously collaborate with teams and identify opportunities to automate away manual processes.
Design and implement automated workflows and processes to streamline system operations and reduce manual intervention.
Automate repetitive tasks in Standard Operating Procedures (SOPs) and runbooks.
Implement monitoring and alerting solutions to proactively detect and address system anomalies.
Analyze logs, metrics, and other diagnostic data to pinpoint root causes of problems.
Develop and maintain comprehensive runbooks for ticket resolution by use case.
Collaborate with development teams to investigate and enhance runbooks for tickets that cannot be resolved using existing steps.
Provide incident response support for various types of tickets (e.g., Large Scale Events).
Perform periodic Change Management (MCM) executions for onboarded services, ensuring MCM templates remain current.
Track progress of Continuous Deployment (CD) implementation for onboarded services and support development teams.
Implement constraint and value manager changes without new attribute modeling for Fulfillment Planner.
Manage cross-organizational campaigns and tasks.
Address critical OS updates, fleet updates (e.g., AL2 migration), and implement infrastructure best practices.
Resolve Application Security (AppSec) action items to improve overall infrastructure health.
Participate in major engineering projects end-to-end with globally dispersed cross-functional stakeholders.
Mentor System Engineers and interns on Amazon Tools and Technologies while helping them progress in their careers.

Benefits

Telecommuting benefits available.
A sign-on bonus and restricted stock units may be provided as part of the compensation package, in addition to a full range of medical, financial, and/or other benefits, dependent on the position offered.
