Site Reliability Engineer

Jobgether

9h•$118,000 - $158,000•Remote

About The Position

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States. This role is responsible for ensuring the reliability, scalability, and performance of complex systems across cloud and on-premises environments. The Site Reliability Engineer will work closely with development, operations, and product teams to design and maintain resilient infrastructure, implement CI/CD pipelines, and manage containerized applications and Kubernetes clusters. You will proactively monitor system performance, troubleshoot critical issues, and optimize operational processes to maintain high service availability. This position involves hands-on management of large-scale data centers, automation of deployment workflows, and integration of observability tools. The ideal candidate is highly analytical, detail-oriented, and experienced in both infrastructure engineering and operational best practices. Success in this role directly impacts system uptime, operational efficiency, and overall customer satisfaction.

Requirements

Bachelor's degree in Computer Science, Engineering, or a related field; advanced degree preferred.
5+ years of experience in site reliability engineering or a related field focused on production systems and service delivery.
Strong Linux systems expertise, including configuration, tuning, and troubleshooting.
Hands-on experience with containers, Kubernetes, and microservices architecture.
Proficiency in CI/CD pipeline management and GitOps workflows, including ArgoCD, Helm charts, and automation tools.
Experience with observability tools such as Prometheus, Grafana, and ELK Stack.
Proven ability to manage large on-premises data centers with hundreds of bare metal servers and VMs.
Familiarity with networking concepts, protocols, and configuration management tools.
Strong analytical and troubleshooting skills with the ability to resolve complex system issues.
Excellent communication skills and experience collaborating across cross-functional teams.

Responsibilities

Design, implement, and maintain scalable, highly available infrastructure using containers, microservices, and Kubernetes.
Monitor system performance, troubleshoot reliability issues, and ensure optimal operation of both cloud-based and on-premises systems.
Manage CI/CD pipelines and GitOps workflows, including ArgoCD, Helm charts, and Kustomize configurations for efficient software deployment.
Implement configuration management processes using tools like Ansible to ensure consistent environments across data centers.
Operate and optimize high-throughput Kafka clusters for event streaming, including replication, partitioning, and disaster recovery strategies.
Collaborate with development teams to influence system design, operational policies, and best practices.
Maintain comprehensive technical documentation, runbooks, architectural diagrams, and incident response procedures.
Participate in on-call rotations and conduct blameless post-mortems for critical incidents.
Continuously evaluate emerging technologies to enhance operational efficiency and reliability.

Benefits

Competitive salary: $118,000–$158,000 USD, depending on experience and location.
Comprehensive medical, dental, and vision coverage for employees and dependents.
Employer-paid income protection benefits, including life, AD&D, and short- and long-term disability.
Flexible spending accounts for healthcare and dependent care.
Retirement plan with 401(k) and employer match, plus Roth options.
Employee Stock Purchase Plan (ESPP) and potential bonuses.
Paid time off, sick leave, and company-observed holidays.
Employee Assistance Program and additional perks such as commuter benefits, discount programs, and identity theft protection.
