Site Reliability Engineer

Jobgether
9h | $118,000 - $158,000 | Remote

About The Position

This position is posted by Jobgether on behalf of a partner company. We are currently looking for a Site Reliability Engineer in the United States. This role is responsible for ensuring the reliability, scalability, and performance of complex systems across cloud and on-premises environments. The Site Reliability Engineer will work closely with development, operations, and product teams to design and maintain resilient infrastructure, implement CI/CD pipelines, and manage containerized applications and Kubernetes clusters. You will proactively monitor system performance, troubleshoot critical issues, and optimize operational processes to maintain high service availability. This position involves hands-on management of large-scale data centers, automation of deployment workflows, and integration of observability tools. The ideal candidate is highly analytical, detail-oriented, and experienced in both infrastructure engineering and operational best practices. Success in this role directly affects system uptime, operational efficiency, and overall customer satisfaction.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field; advanced degree preferred.
  • 5+ years of experience in site reliability engineering or a related field focused on production systems and service delivery.
  • Strong Linux systems expertise, including configuration, tuning, and troubleshooting.
  • Hands-on experience with containers, Kubernetes, and microservices architecture.
  • Proficiency in CI/CD pipeline management and GitOps workflows, including ArgoCD, Helm charts, and automation tools.
  • Experience with observability tools such as Prometheus, Grafana, and ELK Stack.
  • Proven ability to manage large on-premises data centers with hundreds of bare metal servers and VMs.
  • Familiarity with networking concepts, protocols, and configuration management tools.
  • Strong analytical and troubleshooting skills with the ability to resolve complex system issues.
  • Excellent communication skills and experience collaborating across cross-functional teams.

Responsibilities

  • Design, implement, and maintain scalable, highly available infrastructure using containers, microservices, and Kubernetes.
  • Monitor system performance, troubleshoot reliability issues, and ensure optimal operation of both cloud-based and on-premises systems.
  • Manage CI/CD pipelines and GitOps workflows, including ArgoCD, Helm charts, and Kustomize configurations for efficient software deployment.
  • Implement configuration management processes using tools like Ansible to ensure consistent environments across data centers.
  • Operate and optimize high-throughput Kafka clusters for event streaming, including replication, partitioning, and disaster recovery strategies.
  • Collaborate with development teams to influence system design, operational policies, and best practices.
  • Maintain comprehensive technical documentation, runbooks, architectural diagrams, and incident response procedures.
  • Participate in on-call rotations and conduct blameless post-mortems for critical incidents.
  • Continuously evaluate emerging technologies to enhance operational efficiency and reliability.

Benefits

  • Competitive salary: $118,000–$158,000 USD, depending on experience and location.
  • Comprehensive medical, dental, and vision coverage for employees and dependents.
  • Employer-paid income protection benefits, including life, AD&D, and short- and long-term disability insurance.
  • Flexible spending accounts for healthcare and dependent care.
  • 401(k) retirement plan with employer match and Roth options.
  • Employee Stock Purchase Plan (ESPP) and potential bonuses.
  • Paid time off, sick leave, and company-observed holidays.
  • Employee Assistance Program and additional perks such as commuter benefits, discount programs, and identity theft protection.