Senior Site Reliability Engineer

Umbra • Reston, VA

52d • Onsite

About The Position

We are seeking an experienced Site Reliability Engineer to help design, build, operate, and scale mission- and business-critical infrastructure. This role requires a deep understanding of the full technology stack and system architecture, along with the ability to manage technical debt thoughtfully and make sound trade-offs that support long-term scalability and reliability. The Site Reliability Engineer will play a key role in evolving our architecture to meet future requirements, taking ownership of broad architectural direction while proactively identifying opportunities to improve team processes and drive technical excellence. Success in this role requires strong communication skills and the ability to collaborate effectively with customers, product managers, cross-functional partners, and external stakeholders. The person in this position is expected to lead impactful technical and organizational improvements that benefit multiple teams and support Umbra’s broader objectives. We aim to fill this position in one of our offices: Santa Barbara/Goleta, CA; Arlington, VA; or Reston, VA (coming soon).

Requirements

Bachelor’s degree in Computer Science or a related field, or equivalent professional experience.
8+ years of experience in a Site Reliability Engineering, DevOps, or similar role, with demonstrated expertise in managing and scaling complex distributed systems.
Extensive experience with AWS services (EC2, S3, Lambda, VPC networking) or the equivalent services of other cloud providers, and a deep understanding of cloud infrastructure, networking, and security best practices.
Proven experience in architecting and managing large-scale Kubernetes deployments in production environments.
Advanced proficiency in Infrastructure-as-Code (IaC) tools, preferably Terraform, as well as GitOps practices and automation frameworks.
Demonstrated ability to lead cross-team projects and initiatives, providing technical leadership and driving high-impact outcomes.
Strong expertise in infrastructure and software architecture, with the ability to design and evolve complex systems for scalability and reliability.
Experience in developing and managing comprehensive monitoring, alerting, and incident response strategies.

Nice To Haves

12+ years of experience in a Site Reliability Engineering, DevOps, or similar role, with demonstrated expertise in managing and scaling complex distributed systems.
Advanced understanding of cloud and application security, identity management, and compliance.
Expertise in service mesh and service registration technologies, focusing on performance and reliability.
Experience in the aerospace industry.

Responsibilities

Lead the design and evolution of Umbra's critical infrastructure, ensuring scalability, reliability, and alignment with both current and future business needs.
Mentor and guide engineers across multiple teams, fostering a culture of continuous learning and serving as a key resource for technical expertise and professional growth.
Make strategic decisions about architecture and technology, balancing innovation with the management of technical debt and system reliability.
Lead initiatives to introduce and integrate new technologies and tools, develop proofs of concept, and establish best practices across the organization.
Collaborate effectively across teams, projects, and departments to solve complex problems and drive technical innovations that support organizational goals.
Participate in on-call rotations, providing support and resolving complex technical issues.