Senior Site Reliability Engineer

Red Hat • Raleigh, NC

About The Position

The Sr. Site Reliability Engineer is responsible for driving the reliability, performance, and scalability of services with minimal instruction. This role involves tackling non-routine assignments and resolving moderately complex issues that directly impact the service’s stability and effectiveness. The Sr. Site Reliability Engineer applies a deep understanding of software and systems engineering principles to design and implement solutions that enhance service reliability. This position requires good judgment and the ability to prioritize work effectively while contributing to the overall goals of the SRE team and organization.

Note: This role may come into contact with confidential or sensitive customer information requiring special treatment in accordance with Red Hat policies and applicable privacy laws.

Requirements

A bachelor's degree in Computer Science or a related technical field involving software or systems engineering is required. However, hands-on experience that demonstrates your ability and interest in Site Reliability Engineering is valuable to us and may be considered in lieu of degree requirements.
You must have some experience programming in at least one of these languages: Python, Golang, C, C++ or another object-oriented language.
You must have experience working with public clouds such as AWS, GCP, or Azure.
You must also have the ability to collaboratively troubleshoot and solve problems in a team setting.
As an SRE you will be most successful if you have some experience troubleshooting an as-a-service offering (SaaS, PaaS, etc.) and some experience working with complex distributed systems.
We like to see a demonstrated ability to debug and optimize code and to automate routine tasks.
We are Red Hat, so you need a basic understanding of Unix/Linux operating systems.

Nice To Haves

Direct experience with Kubernetes or OpenShift is a plus.
5+ years of experience managing Linux servers running Red Hat Enterprise Linux (RHEL), CentOS, or Fedora hosted at a cloud provider such as Amazon Web Services (AWS), Google Compute Engine (GCE), or Microsoft Azure
3+ years of experience with enterprise systems monitoring; knowledge of Prometheus is a plus
3+ years of experience with enterprise configuration management software like Ansible by Red Hat, Puppet, or Chef
2+ years of experience programming with at least one object-oriented language; Golang, Java, or Python are preferred
2+ years of experience delivering a hosted service
Demonstrated ability to quickly and accurately troubleshoot system issues
Solid understanding of standard TCP/IP networking and common protocols like DNS and HTTP
Solid communication skills and experience working directly with and presenting to customers
1+ year(s) of experience with Kubernetes is a plus
1+ year(s) of experience with Docker-based containers is a plus

Responsibilities

Lead the development and implementation of robust code and automation scripts to improve service reliability and scalability.
Conduct thorough code reviews and testing processes to ensure the highest quality standards in the codebase.
Work to solve moderately complex issues, making decisions that impact the service's reliability and performance.
Mentor and guide junior engineers, fostering a collaborative environment focused on continuous improvement.
Engage in a regular on-call rotation, taking responsibility for critical incidents and ensuring timely resolution.
Lead incident response and postmortem processes, implementing solutions to prevent recurrence of issues.
Collaborate with cross-functional teams to design, develop, and refine SRE tools and systems that support service objectives.
Take ownership of tasks and projects, prioritizing them according to their impact on service health and team goals.

Benefits

Comprehensive medical, dental, and vision coverage
Flexible Spending Account - healthcare and dependent care
Health Savings Account - high deductible medical plan
Retirement 401(k) with employer match
Paid time off and holidays
Paid parental leave plans for all new parents
Leave benefits including disability, paid family medical leave, and paid military leave
Additional benefits including employee stock purchase plan, family planning reimbursement, tuition reimbursement, transportation expense account, employee assistance program, and more!
