Leidos • posted 3 months ago
$126,100 - $227,950/Yr
Full-time • Senior
Bethesda, MD
11-50 employees

At Leidos, we deliver innovative solutions through the efforts of our diverse and talented people who are dedicated to our customers’ success. We empower our teams, contribute to our communities, and operate sustainably. Everything we do is built on a commitment to do the right thing for our customers, our people, and our community. Our Mission, Vision, and Values guide the way we do business. Employees enjoy career enrichment opportunities through mobility and development, and rewarding relationships with supportive supervisors and talented colleagues and customers. Your most important work is ahead. If this sounds like the kind of environment where you can thrive, keep reading!

Leidos is looking for a highly skilled Systems Engineer with deep expertise in operating systems, hardware, GPUs, and high-speed networking. In this role, you will design, develop, and optimize GPU clusters that power enterprise AI for our mission customers.

This is a 100% on-site position. All work must be performed at the customer site in Bethesda at the Intelligence Community Campus.

Responsibilities:
  • Design, configure, and maintain GPU clusters.
  • Collaborate with a multidisciplinary team to define and optimize architectures, ensuring they meet performance, power efficiency, and feature requirements.
  • Work closely with AI/ML engineers to ensure smooth GPU integration with Linux-based systems.
  • Optimize GPU drivers for compatibility, reliability, and performance.
  • Provide regular maintenance and updates.
  • Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers.
  • Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments.
  • Leverage scripting and configuration tools such as Bash, Python, Ansible, Puppet, and Salt.
  • Maintain technical documentation, architectural specifications, and Linux best practices.
  • Support ATO (Authority to Operate) and ensure compliance with federal security standards.

Basic qualifications:
  • Bachelor's or higher degree in Computer Science, Computer Engineering, Electrical Engineering, or a related field with at least 12 years of related technical experience.
  • 10+ years of relevant systems engineering experience.
  • Experience managing NVIDIA GPU data center platforms (DGX, HGX, H200, H100, L4s).
  • Knowledge of enterprise server components (storage/network controllers, HBA, SSDs).
  • Strong expertise with Linux distributions (RHEL, Ubuntu, Oracle Linux, and Rocky Linux).
  • Excellent problem-solving skills and the ability to collaborate within a team.
  • Candidate must meet DoD 8570.11 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification).
  • An IAT Level III certification would also be acceptable (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, CCSP).
  • Active TS/SCI clearance with Polygraph required, OR active TS/SCI and willingness to obtain and maintain a Polygraph.
  • US Citizenship is required due to the nature of the government contracts we support.

Preferred qualifications:
  • Experience with Kubernetes cluster management and AI/ML workflow orchestration (Argo, Airflow, and Kubeflow).
  • Familiarity with GPU virtualization and cloud computing.
  • Experience with Prometheus/Grafana for monitoring.
  • Knowledge of distributed resource scheduling systems (Slurm (preferred), LSF, etc.).

Benefits:
  • Competitive compensation.
  • Health and Wellness programs.
  • Income Protection.
  • Paid Leave.
  • Retirement.