Site Reliability Engineer

Jobgether

3d•$118,000 - $158,000

About The Position

The Site Reliability Engineer will play a critical role in maintaining and scaling complex systems, ensuring the reliability, performance, and availability of infrastructure across cloud and on-premises environments. This role blends deep technical expertise in Linux systems, virtualization, container orchestration, Kubernetes, and CI/CD pipelines with proactive monitoring and operational excellence. You will collaborate closely with development and platform teams to implement best practices, automate workflows, and manage high-throughput services in large-scale datacenters. The position offers the opportunity to influence architecture, improve system resilience, and participate in incident response and root cause analysis. Ideal candidates thrive in fast-paced, distributed teams, are comfortable with both strategic planning and hands-on implementation, and are passionate about building robust and scalable systems.

Requirements

Bachelor's degree in Computer Science, Engineering, or related field; advanced degree preferred
5+ years of experience in site reliability engineering or similar roles, with a focus on production systems, containers, microservices, and service delivery
Strong expertise in Linux systems, virtualization, and large-scale datacenter operations
Hands-on experience with CI/CD pipelines, GitOps workflows, ArgoCD, Helm, and Kustomize
Proficiency with observability tools such as Prometheus, ELK Stack, Grafana, and log collection frameworks
Familiarity with networking concepts and protocols within Linux environments
Excellent troubleshooting, problem-solving, and cross-functional collaboration skills

Nice To Haves

Experience with Kubernetes and container orchestration is highly desirable

Responsibilities

Monitor, troubleshoot, and optimize system performance, reliability, and availability across bare metal, virtualized, and cloud environments
Design, implement, and maintain scalable infrastructure using containers, Kubernetes, and microservices architectures
Manage CI/CD pipelines and GitOps workflows, including ArgoCD, Helm charts, and Kustomize configurations for automated application deployment
Oversee configuration management using tools like Ansible to ensure consistent and reliable software releases across datacenter infrastructure
Design and operate high-throughput Kafka clusters for event streaming, including replication, consumer lag monitoring, and disaster recovery strategies
Collaborate with development teams to guide system design, operational policies, and performance optimization
Create and maintain technical documentation, runbooks, architectural diagrams, and network topology maps for operational excellence

Benefits

Competitive base salary range: $118,000 - $158,000 USD
Comprehensive medical, dental, and vision coverage, including HSA funding support
Employer-paid income protection (life, AD&D, short- and long-term disability)
401(k) plan with employer match and Roth options, Employee Stock Purchase Plan (ESPP)
Paid time off, sick leave, and corporate holidays
Employee assistance programs and life balance benefits including travel assistance and identity theft protection
Additional perks: discount programs, credit union membership, Medicare assistance, and more
