Sr. Manager - Site Reliability Engineering (SRE)

Analog Devices•Wilmington, MA

155d•$144,000 - $198,000

About The Position

We are seeking an experienced Site Reliability Engineering Manager with strong leadership skills to maintain and improve the reliability, availability, and performance of our cloud-based and on-premises applications and infrastructure. You will be responsible for designing, automating, and managing systems in a hybrid environment, ensuring that both on-prem and cloud services run seamlessly. This role requires close collaboration with a globally distributed team to implement scalable, automated solutions and foster continuous improvement in operational processes.

Requirements

Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent work experience).
10+ years of experience in SRE, DevOps, or Systems Engineering with expertise in managing hybrid and on-prem infrastructure in addition to cloud-based systems.
Strong leadership skills and at least 5 years of experience managing globally distributed teams.
Expertise in cloud platforms (AWS, GCP, Azure) and in containerization and orchestration tools (Docker, Kubernetes).
Proficiency in programming/scripting languages such as Python, Go, Ruby, or Shell.
5+ years of experience hosting enterprise-level applications on a range of platforms, such as Windows, Linux, cloud, and on-prem.
Extensive experience with Infrastructure as Code (IaC) tools and automation frameworks (Terraform, Ansible, etc.).
Solid knowledge of monitoring and logging tools (Prometheus, Grafana, ELK Stack, Datadog, etc.).
Familiarity with CI/CD tools and processes (Jenkins, GitLab CI, etc.).
Strong experience with service-oriented architecture (SOA), microservices, and managing distributed systems across geographically diverse regions.
Excellent communication, interpersonal, and leadership skills with the ability to work effectively with remote and distributed teams.
Familiarity with VMs, containers, networking, VPNs, and load balancing in hybrid environments.

Nice To Haves

Experience in the manufacturing industry or a similar domain.
Experience with SAP Datasphere and/or Databricks platform support and management.
Certifications in cloud platforms.

Responsibilities

Take an active, hands-on role in managing and improving the performance, scalability, and reliability of both cloud and on-prem systems.
Lead by example in resolving complex issues and troubleshooting system failures.
Design, implement, and manage infrastructure across on-premises, cloud-based, and hybrid environments to ensure high availability, performance, and cost-effectiveness.
Lead automation initiatives to streamline the management of on-prem and cloud environments, including infrastructure provisioning, deployment pipelines, monitoring, and incident resolution.
Ensure systems are highly available, resilient, and scalable, with proactive monitoring and incident response mechanisms across on-prem, hybrid, and cloud infrastructures.
Manage the challenges associated with globally distributed systems and teams, ensuring that systems are synchronized and can perform consistently across multiple regions.
Leverage Infrastructure as Code (IaC) tools like Terraform, Ansible, and CloudFormation to manage both cloud and on-prem infrastructure efficiently and consistently.
Collaborate with development teams to refine and improve CI/CD pipelines, ensuring seamless integration and delivery across a hybrid infrastructure environment.
Oversee incident management, ensuring that any issues across both cloud and on-prem infrastructure are resolved quickly.
Lead root cause analysis and postmortem activities to improve future reliability.
Lead and guide a team of App Hosting engineers across different geographies, ensuring effective communication, coordination, and collaboration with globally distributed teams.
Work closely with product, development, and infrastructure teams across different locations to design and implement solutions that meet the needs of our hybrid and global environments.
Provide mentorship to engineers, promoting a culture of continuous learning and fostering best practices in both hybrid and on-prem infrastructure management.
Oversee and prioritize initiatives that span multiple regions, ensuring that global infrastructure needs and timelines are met.
Define and implement processes to ensure the stability, scalability, and efficiency of systems across both cloud and on-prem environments.

Benefits

Medical, vision and dental coverage
401k
Paid vacation
Holidays
Sick time
Discretionary performance-based bonus

What This Job Offers

Job Type

Full-time

Career Level

Senior

Education Level

Bachelor's degree

Number of Employees

5,001-10,000 employees
