Senior/Principal - Artificial Intelligence Infrastructure - Hybrid

Sandia Corporation, Albuquerque, NM
Posted 61d ago, Hybrid

About The Position

Sandia's artificial intelligence (AI) team is building the U.S. Department of Energy's (DOE) next-generation AI Platform, an integrated scientific AI capability that delivers rapid, high-impact solutions for national security, science, and applied energy missions. The Platform is based on three pillars: Models, Infrastructure, and Data. You will join the Infrastructure Pillar team to design, deploy, and operate the unified compute-and-data fabric that underpins all mission workflows from AI model training and simulation steering to real-time inference at experimental and production facilities. We anticipate multiple hires for the Infrastructure Pillar that collectively span the set of responsibilities and skills described below. Likewise, new hires will be expected to work in conjunction with existing Sandia staff and teams from other DOE laboratories to deliver on this ambitious, fast-paced project. Importantly, we anticipate that while AI Platform development will leverage existing AI and data science tools extensively, success will also require considerable innovation and problem solving to address the unique needs of DOE applications. If this sounds like an exciting challenge to you, we look forward to reading your application!

Requirements

  • Bachelor's degree in Computer Science, Electrical Engineering, Mathematics, or a related STEM field plus five (5) years of directly relevant experience, or an equivalent combination of education and experience
  • Ability to acquire and maintain a DOE Q clearance

Nice To Haves

  • Graduate degree in a relevant computationally intensive discipline in which an independent research project (e.g., thesis or dissertation) was a graduation requirement.
  • Experience in developing software and AI systems for enterprise and national security applications.
  • Demonstrated software development skills and familiarity with modern software development practices.
  • Proven ability to work and communicate effectively in a collaborative, interdisciplinary team environment, including mentoring junior engineers
  • Ph.D. in a STEM field with a focus on high-performance or distributed computing (Data Science, Data and Computing Systems, Informatics, or a related STEM field with a significant data systems research component)
  • Experience with HPC systems administration (Slurm, PBS, Flux) and cloud platforms (AWS, Azure, GCP)
  • Proficiency in container orchestration (Kubernetes, Docker) and infrastructure as code (Terraform, Ansible)
  • Networking background: ESnet, VLANs, WAN overlays, encryption, and failover design
  • Hands-on experience with storage architectures: Lustre, GPFS, object stores, multi-tier caching
  • Experience implementing DevSecOps principles and security best practices in containerized infrastructure, including network policies, classification management, and vulnerability scanning.
  • Experience with federated Kubernetes and large-scale container platforms across classification domains
  • Familiarity with secure enclave technologies and zero-trust security models
  • Background in digital-twin or real-time simulation steering architectures
  • Experience with SIEM environments, such as Splunk, and operational management of application infrastructure services
  • Proficiency in observability toolchains (OpenTelemetry, Prometheus, Grafana, ELK) and automated log analytics
  • Knowledge of DOE/NNSA compute and networking environments (Frontier, Aurora, Perlmutter, ESnet)
  • Integrating experimental facilities, robotics, or 3D-printing systems into automated AI workflows
  • Deploying large-scale secure enclaves for CUI and RD applications
  • Coordinating public-private partnerships on HPC and AI infrastructure deployments
  • Working in cross-lab federated teams with shared governance and risk models
  • Ability to obtain and maintain an SCI clearance, which may require a polygraph test.

Responsibilities

  • Architect and implement the hybrid compute fabric
      • Integrate exascale HPC systems with elastic cloud resources and specialized AI accelerator clusters (on-prem and in-cloud)
      • Deploy ruggedized edge servers and digital-twin infrastructure for sub-millisecond inference and real-time physics simulations
  • Develop infrastructure services and orchestration
      • Build federated Kubernetes clusters, container registry services, resource registry, and job scheduling abstractions
      • Implement self-configuring distributed clusters with intelligent network overlays, AI-driven traffic steering, and sensor-driven control loops
  • Design secure networking and enclaves
      • Configure ESnet-backed, multi-tier WAN overlays with low-latency, geo-diverse routing, failover, and encryption protocols
      • Provide software-defined, dynamic security enclaves for CUI/Restricted Data with attested runtime and curated egress
  • Enable observability, provenance & monitoring
      • Deploy unified logging, metrics, dashboards, and trace-analysis across cloud and on-prem environments using OpenTelemetry, Prometheus, ELK, or equivalent
      • Automate provenance capture for compute jobs, data movements, and AI workflows
  • Support federated identity and access control
      • Integrate multiple identity providers, attribute-based access controls, and allocation models for risk-shared governance
  • Manage enterprise licensing, token agreements, and software audits for AI and HPC frameworks
  • Manage the full lifecycle of the AI platform's infrastructure, including capacity planning, upgrades, documentation, and performance monitoring
  • Implement and enforce security best practices within container environments, including Role-Based Access Control (RBAC), secrets management, network policies, and vulnerability scanning.

Benefits

  • Generous vacation
  • Strong medical and other benefits
  • Competitive 401k
  • Learning opportunities
  • Relocation assistance
  • Amenities aimed at creating a solid work/life balance

What This Job Offers

  • Job Type: Full-time
  • Career Level: Senior
  • Industry: National Security and International Affairs
  • Number of Employees: 5,001-10,000
