IT Engineer

Keysight Technologies, Inc. - Colorado Springs, CO

About The Position

Keysight is at the forefront of technology innovation, delivering breakthroughs and trusted insights in electronic design, simulation, prototyping, test, manufacturing, and optimization. Our ~15,000 employees create world-class solutions in communications, 5G, automotive, energy, quantum, aerospace, defense, and semiconductor markets for customers in over 100 countries. Our award-winning culture embraces a bold vision of where technology can take us and a passion for tackling challenging problems with industry-first solutions. We believe that when people feel a sense of belonging, they can be more creative and innovative, and thrive at all points in their careers.

Requirements

  • Bachelor's degree or equivalent in Computer Science
  • 5+ years of experience in advanced Linux system administration
  • Experience in Linux automation through shell scripting is desired
  • Strong Linux system administration experience
  • Hands-on experience with HPC schedulers (Slurm preferred)
  • Knowledge of MPI, parallel computing, and job profiling
  • Experience with high-speed interconnects (InfiniBand, Omni-Path)
  • Storage expertise (GPFS/Spectrum Scale, Lustre, NFS)
  • Scripting and automation (Bash, Python)
  • Understanding of CPU/GPU architectures and NUMA
  • Networking fundamentals (TCP/IP, RDMA, firewalls)
  • Experience with xCAT, Bright Cluster Manager, or similar

Nice To Haves

  • Knowledge of containerization and orchestration technologies (Docker, Kubernetes) is desirable

Responsibilities

  • Serve as an IT Engineer supporting HPC (High Performance Computing) operations
  • This role is responsible for the strategic vision of Keysight’s Software Transformation. Responsibilities include, but are not limited to, handling L1/L2 calls related to operations of the COS HPC and supporting the team in handling calls for other HPCs in the HCH, BAN, and VBR DCs
  • Provide L1/L2 operational support for HPC business users at the OS level and troubleshoot issues related to HPC hardware
  • Monitor Slurm and overall HPC health
  • Work closely with business teams in the USA and support them in their day-to-day activities related to HPC
  • Handle all operational issues through SNOW with strict adherence to the SLA
  • Monitor HPC cluster health (nodes, storage, interconnect, schedulers)
  • Respond to alerts from monitoring tools (e.g., Nagios, Prometheus, Grafana)
  • Restart failed services and jobs where procedures exist
  • Coordinate with the hardware vendor on hardware-related issues and follow them through to closure
  • Monitor SNOW tickets and perform basic triage
  • Perform storage checks (quota and utilization)
  • Monitor Slurm (queue, job state) and escalate to the next level where necessary
  • Attend regular standup meetings and participate actively
  • Analyze performance issues and work with other team members to clear bottlenecks
  • Work closely with the patching team to ensure that HPC services are available after critical patching
  • Restart nodes that are in a hung state

Benefits

  • Medical, dental and vision
  • Health Savings Account
  • Health Care and Dependent Care Flexible Spending Accounts
  • Life, Accident, Disability insurance
  • Business Travel Accident and Business Travel Health
  • 401(k) Plan
  • Flexible Time Off, Paid Holidays
  • Paid Family Leave
  • Discounts, Perks
  • Tuition Reimbursement
  • Adoption Assistance
  • ESPP (Employee Stock Purchase Plan)