Systems Reliability Engineer

Leidos•Chantilly, VA

1d•Onsite

About The Position

GEOAxIS is looking for a Systems Reliability Engineer to work with the rest of the operations team to help drive program technical execution, innovation, and modernization. The GEOAxIS system provides Identity, Credential, and Access Management (ICAM) for all web applications. GEOAxIS enables online, on-demand access to NGA GEOINT content based on users' authoritative attributes and roles. Our mission is to maintain highly available ICAM services that protect those critical mission applications across all security domains. The GxNext contract was awarded to Leidos in 2021 and runs until 2031.

Requirements

BS degree and 4+ years of prior relevant experience, or Master's degree with 2+ years of prior relevant experience
Requires a TS/SCI clearance and the ability to obtain and maintain a polygraph post-hire
Strong communication skills, both verbal and written
Ability to quickly learn new software and IT concepts
Strong problem solving and decision making skills
Self-starter with an ability to work in a team environment and independently
Intimately familiar with the COTS products that the program leverages: Oracle Identity and Access Management (IdAM) suite, Apache webgates, and Computer Associates (CA) API Gateway
Experience scripting in a Linux environment using Shell and Bash
Deep understanding and background in COTS integration and custom code development
Experience in at least one of the following languages: Bash, Python, Java, NodeJS
Local to DMV (DC/Maryland/Virginia) with ability to be physically present at the team’s work location in Chantilly
Strong interpersonal skills and proven track record of leading technical teams, conveying technical solutions to technical and non-technical audiences
Candidate must be able to physically be in Chantilly, VA a minimum of 5 days a week to work with the team with occasional meetings in Reston and/or Springfield, VA
All candidates must be US CITIZENS to be considered for the position
Security+ certification within 60 days of hire

Nice To Haves

Kubernetes experience using Rancher RKE2 or OpenShift
Strong understanding of containers
Experience containerizing existing custom software
Knowledge of common DevOps tools such as Ansible, ArgoCD, GitLab, and Nexus3
Kubernetes Certifications in any of the following: RHCSA/RHCE, AWS Solutions Architect/DevOps Engineer, CKA/CKAD
Familiarity with modern authentication flows such as SAML, OAuth2 and OIDC

Responsibilities

Troubleshoot and resolve system/operational incidents
Perform root cause analysis for operational incidents
Analyze system performance and take corrective actions as needed
Coordinate with mission partners, consumer applications, and other external entities in troubleshooting enterprise incidents and integration problems
Design, develop, and implement automated solutions to proactively monitor system health, identify performance bottlenecks, and resolve system issues through automated remediation, reducing manual intervention and improving system reliability
Collect data, identify and analyze trends in Operational Incidents, and provide suggestions to mitigate common issues
Work closely with Ops Tech Lead and Development Lead to identify baseline enhancements to improve operational stability
Work with deployment and ISP teams to support baseline deployments to operations
Willingness to support off-hour calls to assist in troubleshooting when high priority operational incidents occur
