Platform Operations Engineer

RTX•Marlborough, MA

Hybrid

About The Position

The Platform Operations Engineer will report to the Digital Infrastructure Services organization and support the design, implementation, and maintenance of enterprise-wide orchestration and container management platforms based on Kubernetes. These platforms support program software, solutions, and products across the organization. The Raytheon Orchestration and Container Kubernetes Service (ROCKS) team provides a standardized container management platform built on Kubernetes.

ROCKS supports:
A secure, enterprise container orchestration foundation for the Digital Ecosystem
Deployment in air-gapped classified environments
Non-production services in shared unclassified environments
Integration across cloud and on-premises environments

Requirements

Typically requires a Bachelor's degree in Science, Technology, Engineering, or Mathematics (STEM) or equivalent experience and a minimum of 5 years of prior relevant experience, or an advanced degree in a related field and a minimum of 3 years of experience
Experience installing, deploying, monitoring, and supporting Kubernetes clusters in on-premises and cloud environments
Experience with Kubernetes platforms including Rancher RKE2, upstream Kubernetes, OpenShift Container Platform, VMware TKG/Tanzu, or similar Kubernetes distributions
Experience with Kubernetes-related tools and technologies including Terraform, Helm, Python, Go, and Bash
Experience with observability and monitoring tools including Grafana, Prometheus, Alertmanager, and Loki
The ability to obtain and maintain a U.S. security clearance
U.S. citizenship is required, as only U.S. citizens are eligible for a security clearance

Nice To Haves

Advanced knowledge of Kubernetes architecture, operations, and supporting tools
Experience deploying, configuring, maintaining, and supporting Kubernetes clusters
Experience in cloud and hybrid environments including AWS GovCloud, Azure Government, VMware, bare metal, and restricted networks
Experience with highly available and resilient cluster design and operations
Experience implementing observability, monitoring, and alerting solutions for distributed systems
Deploying cloud native platforms and systems in classified and unclassified environments
Designing and operating scalable, secure, high-performance systems, platforms, and Kubernetes clusters
Working with VMware, AWS GovCloud, and Azure Government environments
Oral and written communication
Executing projects within schedules and budgets
Translating business and functional requirements into technical requirements and tasks
Documenting and diagramming technical systems
Working with CNCF Kubernetes components including service mesh, service discovery, package management, observability and monitoring, runtimes, and security
Working with GitOps and Kubernetes package management tools including ArgoCD, Packer, Helm, and Kustomize
Working in Agile environments with product owners and scrum masters
Implementing Kubernetes in air-gapped and regulated network environments
Root cause analysis of distributed system failures
Monitoring CNCF ecosystem developments and applying technologies to Kubernetes platforms

Responsibilities

Implement, support, and optimize Kubernetes-based container orchestration platforms across both unclassified and closed-area systems
Collaborate with engineering, program teams, and cross-functional partners to improve platform usage and identify enhancements
Diagnose and resolve complex Kubernetes-related issues in partnership with internal teams and stakeholders
Support escalation and resolution of higher-severity platform issues in alignment with established processes
Develop and enhance observability and monitoring capabilities to support error detection, defect reduction, and improved system performance
Improve Mean Time to Detect (MTTD), Mean Time to Resolve (MTTR), service availability, and customer experience
Implement and maintain monitoring tools, dashboards, and alerting systems aligned with operational best practices
Work with infrastructure, networking, and application teams to forecast capacity, scaling requirements, and system demand