Senior Site Reliability Engineer

Red Hat
Raleigh, OR

About The Position

The Sr. Site Reliability Engineer is responsible for driving the reliability, performance, and scalability of services with minimal instruction. This role involves tackling non-routine assignments and resolving moderately complex issues that directly impact the service's stability and effectiveness. The Sr. Site Reliability Engineer applies a deep understanding of software and systems engineering principles to design and implement solutions that enhance service reliability. This position requires good judgment and the ability to prioritize work effectively while contributing to the overall goals of the SRE team and organization.

Note: This role may come into contact with confidential or sensitive customer information requiring special treatment in accordance with Red Hat policies and applicable privacy laws.

Requirements

  • A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required. However, hands-on experience that demonstrates your ability and interest in Site Reliability Engineering is valuable to us and may be considered in lieu of degree requirements.
  • You must have some experience programming in at least one of these languages: Python, Golang, C, C++, or another object-oriented language.
  • You must have experience working with public clouds such as AWS, GCP, or Azure.
  • You must also have the ability to collaboratively troubleshoot and solve problems in a team setting.
  • As an SRE, you will be most successful if you have some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems.
  • We like to see a demonstrated ability to debug and optimize code and to automate routine tasks.
  • We are Red Hat, so you need a basic understanding of Unix/Linux operating systems.

Nice To Haves

  • Direct experience with Kubernetes or OpenShift is a plus.
  • 5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
  • 3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
  • 3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
  • 2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
  • 2+ years of experience delivering a hosted service
  • Demonstrated ability to quickly and accurately troubleshoot system issues
  • Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
  • Solid communication skills and experience working directly with and presenting to customers
  • 1+ years of experience with Kubernetes is a plus
  • 1+ years of experience with Docker-based containers is a plus

Responsibilities

  • Lead the development and implementation of robust code and automation scripts to improve service reliability and scalability.
  • Conduct thorough code reviews and testing to ensure the highest quality standards in the codebase.
  • Work to solve moderately complex issues, making decisions that impact the service's reliability and performance.
  • Mentor and guide junior engineers, fostering a collaborative environment focused on continuous improvement.
  • Engage in a regular on-call rotation, taking responsibility for critical incidents and ensuring timely resolution.
  • Lead incident response and postmortem processes, implementing solutions to prevent recurrence of issues.
  • Collaborate with cross-functional teams to design, develop, and refine SRE tools and systems that support service objectives.
  • Take ownership of tasks and projects, prioritizing them according to their impact on service health and team goals.

Benefits

  • Comprehensive medical, dental, and vision coverage
  • Flexible Spending Account - healthcare and dependent care
  • Health Savings Account - high-deductible medical plan
  • Retirement 401(k) with employer match
  • Paid time off and holidays
  • Paid parental leave plans for all new parents
  • Leave benefits including disability, paid family medical leave, and paid military leave
  • Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!