Site Reliability Engineer

Nusano•West Valley City, UT

Onsite

About The Position

The Site Reliability Engineer supports the deployment and operation of Linux-based, containerized microservice clusters that power a state-of-the-art particle accelerator used for medical radioisotope production supporting cancer care, scientific research, and advanced technology applications. This role works directly with physical accelerator systems, where software reliability, infrastructure resilience, and real-time data availability are critical to continuous operations. Reporting to the Director of CSEE, the Site Reliability Engineer collaborates closely with engineers and scientists to integrate software and hardware systems responsible for accelerator control, data acquisition, and data management. This hands-on role focuses on automation, observability, and operational stability within an on-premise production environment, helping ensure reliable platform operations and secure storage, monitoring, and retrieval of accelerator time-series and imaging data essential to system performance.

Requirements

Bachelor's degree in Computer Science or related field
3-5 years of professional experience in a DevOps/SRE capacity
Hands-on experience working in a production environment with on-premise hardware
Proficiency in deploying and configuring containers and container orchestration platforms (Docker, Kubernetes)
Experience implementing and maintaining real-time observability tools (Grafana, Prometheus, or similar)
Experience deploying and configuring CI/CD pipelines (GitLab CI, Jenkins, Argo, or similar)
Understanding of network fundamentals including enterprise routing and switching
Strong Linux command-line skills and system administration experience
Working knowledge of cybersecurity best practices

Nice To Haves

Exposure to or familiarity with the EPICS framework
Experience maintaining Linux systems (Red Hat, Rocky, CentOS)
Programming experience in Python and/or C/C++ with object-oriented principles
Experience with databases and object stores (ElasticSearch, MariaDB, S3, MongoDB)
Experience in highly regulated or safety-critical environments

Responsibilities

Respond to and resolve production issues and outages under the guidance of senior team members
Support observability initiatives by implementing and maintaining monitoring dashboards and alerting systems
Develop and maintain infrastructure-as-code scripts, automation tools, system images, virtual machines, and containers to support accelerator operations
Assist in identifying and documenting system vulnerabilities and potential points of failure
Document processes, procedures, and troubleshooting steps in the group knowledge base
Deploy and configure vendor-supplied software utilities and packages built on the EPICS framework

Benefits

Comprehensive medical, dental, and vision coverage for employees and their eligible dependents
401(k) Retirement Plan
Company-paid life insurance & AD&D coverage
Company-paid short-term and long-term disability coverage
High-Deductible Health Plan (HDHP) option with company-funded Health Savings Account (HSA)
Healthcare Flexible Spending Account (FSA)
Dependent Care Reimbursement Account (DCRA)
Voluntary Life Insurance
Voluntary benefits such as Critical Illness, Accident, Hospital, and Pet Insurance
Employee Assistance Program (EAP)
Vacation, Sick Time, and Holidays