Site Reliability Engineer, Senior Advisor

Peraton • Annapolis Junction, MD


About The Position

Join the Peraton Team as a Site Reliability Engineer (SRE3) and Help Secure Mission-Critical Systems!

We are seeking a highly experienced Site Reliability Engineer (SRE) to support large-scale, highly distributed systems in a mission-critical environment. This role requires a strong blend of software development and system administration expertise, with a focus on designing and implementing sustainable automation solutions that improve reliability, efficiency, and operational consistency.

The ideal candidate will leverage extensive experience managing large systems to develop tools that:
Reduce risk to production environments
Minimize human error
Eliminate labor-intensive and repetitive manual processes
Improve adherence to operational procedures
Serve as a force multiplier for monitoring and system administration teams

Automation solutions may include configuration management tools (e.g., Salt, Puppet), custom-developed GUIs for shift operations, or fully automated cluster-level solutions. The goal is to deliver sustainable tools that perform at or above the reliability of manual processes.

Peraton offers enhanced benefits to employees supporting our critical National Security programs, including heavily subsidized medical, dental, and vision coverage for employees and their dependents, eligibility to participate in a competitive bonus plan, and a generous PTO plan.

Requirements

Bachelor’s Degree with 12+ years relevant experience
Master’s Degree with 10+ years relevant experience
PhD with 7+ years relevant experience
Active TS/SCI with current polygraph
AWS Developer – Associate | AWS Solutions Architect (Associate or Professional) | AWS SysOps Administrator – Associate | CKA/CKAD | Elastic Certified Engineer | Elastic Certified Observability Engineer
7+ years software development/engineering experience including requirements analysis, development, integration, installation, testing, maintenance, and issue resolution
7+ years system engineering/architecture in large-scale environments
7+ years supporting distributed/parallel systems (e.g., HBase, Hadoop, Accumulo, Bigtable, Cassandra, Scality)
7+ years scripting/automation using Python, Perl, or Ruby
4+ years managing and monitoring cloud-based systems
Experience in system integration, health monitoring, incident management, and postmortem analysis
Cloud certification will be verified during the interview or offer process.

Responsibilities

Design and implement automation solutions for large-scale distributed systems
Develop software tools to support monitoring and system administration teams
Provide technical direction for development, integration, and testing of hardware/software systems
Manage and monitor large cloud-based environments
Conduct postmortem analysis and support incident management processes
Improve operational processes and system health visibility
Support distributed, massively parallel data environments

Benefits

Heavily subsidized medical, dental, and vision coverage for employees and their dependents
Eligibility to participate in a competitive bonus plan
Generous PTO plan
