Senior Research Systems Administrator

Stony Brook University • Stony Brook, NY

Hybrid

About The Position

The Senior Research Systems Administrator will support on-campus research systems at Stony Brook University. This includes maintaining High-Performance Computing clusters used by researchers and providing assistance with Linux servers and workstations. The incumbent will help researchers design, procure, implement, monitor, and troubleshoot their HPC Linux systems. This individual will collaborate with stakeholders to define, analyze, and communicate system specifications.

Requirements

Bachelor’s degree. In lieu of a degree, four (4) years of directly related full-time experience, or an equivalent combination of education and experience totaling four (4) years, may be considered.
Minimum four (4) years of full-time systems administration experience.
Demonstrated expertise with Linux/Unix operating systems (e.g., Red Hat Enterprise Linux, Debian, Solaris).
Experience with hardware installation, configuration, upgrades, monitoring, and troubleshooting.
Experience managing user accounts, system security, and enterprise infrastructure environments.

Nice To Haves

Experience supporting High-Performance Computing (HPC) clusters.
Experience supporting research/clinical technologies & software.
Experience supporting large-scale datasets & databases.
Experience supporting file/web servers.
Experience with scripting or programming languages (e.g., Bash, Python).
Experience with virtualization technologies.
Familiarity with basic networking concepts and troubleshooting.

Responsibilities

High-Performance Computing Operations:
Support day-to-day operations of institutional HPC clusters.
Manage account provisioning, queue configuration, system monitoring, and storage integration.
Ensure system stability, uptime, and operational continuity.
Monitor system performance and proactively address bottlenecks.
Collaborate with infrastructure and security teams to maintain compliance and data protection standards.

Infrastructure Architecture, Capacity Planning & Strategic Growth:
Evaluate research computing growth trends and recommend scalable upgrade paths.
Participate in long-term capacity planning for compute, storage, and networking resources.
Provide technical recommendations supporting institutional research computing strategy.
Align infrastructure planning with emerging research demands and funding trajectories.

Procurement & Vendor Engagement:
Develop detailed technical specifications for research computing hardware and software acquisitions.
Evaluate vendor proposals and participate in formal bid and RFP processes.
Provide technical analysis supporting capital investment decisions.
Coordinate purchasing workflows, contract alignment, and delivery logistics.
Manage hardware refresh cycles, warranty agreements, and vendor support relationships.
Support grant-funded equipment planning and budget justification efforts.

Linux Systems Engineering & Lifecycle Management:
Design, deploy, and maintain Linux-based research servers and workstations.
Apply patches, maintain system libraries, and perform performance tuning.
Rack, configure, and integrate new compute nodes and hardware.
Implement industry best practices for system hardening and resilience.
Conduct hardware diagnostics and component replacement as needed.

Advanced Performance Optimization & Researcher Support:
Serve as a technical resource for researchers utilizing institutional HPC systems (e.g., SeaWulf, NVWulf, AMA-27, and related clusters).
Diagnose performance bottlenecks involving CPU, memory, I/O, and networking.
Translate system architecture constraints into actionable recommendations for research workloads.

Research Data & Storage Integration:
Assist researchers in migrating external storage solutions (e.g., DAC, iSCSI, NFS) to centralized storage environments.
Assess backup and disaster recovery requirements.
Collaborate with storage and backup administrators to ensure secure and resilient research data management.

Troubleshooting & Incident Resolution:
Diagnose and resolve complex hardware and software issues.
Conduct root cause analysis for outages or degraded performance.
Restore services in a timely manner to minimize research disruption.

Additional Duties:
Perform other duties and special projects as assigned in support of departmental and institutional research missions.