Senior/Principal Artificial Intelligence Models - Hybrid

Sandia Corporation•Albuquerque, NM

About The Position

Sandia's artificial intelligence (AI) team is building the U.S. Department of Energy's (DOE) next-generation AI Platform, an integrated scientific AI capability that delivers rapid, high-impact solutions for national security, science, and applied energy missions. The Platform is based on three pillars: Models, Infrastructure, and Data. You will join the Models Pillar team to architect, develop, and deploy fine-tuned reasoning models, domain foundation models, high-fidelity surrogate models, and autonomous agents. Your work will compress mission timelines by enabling scientists and engineers to explore design spaces, evaluate outcomes, and steer experiments and simulations with transparent, high-assurance AI workflows. We anticipate multiple hires for the Models Pillar that collectively span the set of responsibilities and skills described below. Likewise, new hires will be expected to work in conjunction with existing Sandia staff and teams from other DOE laboratories to deliver on this ambitious, fast-paced project. Importantly, we anticipate that while AI Platform development will leverage existing AI and data science tools extensively, success will also require considerable innovation and problem solving to address the unique needs of DOE applications. If this sounds like an exciting challenge to you, we look forward to reading your application!

Requirements

Bachelor's degree in Computer Science, Electrical Engineering, Mathematics, or a related STEM field plus five (5) years of directly relevant experience, or an equivalent combination of education and experience
Ability to obtain and maintain a DOE Q clearance

Nice To Haves

Graduate degree in a relevant computationally intensive discipline where independent research (e.g., a project, thesis, or dissertation) was a graduation requirement.
Experience in developing software and AI systems for enterprise and national security applications.
Demonstrated software development skills and familiarity with modern software development practices.
Proven ability to work and communicate effectively in a collaborative and interdisciplinary team environment.
Demonstrated expertise with deep learning frameworks (PyTorch, TensorFlow) and proficiency in Python.
Experience with distributed computing frameworks (MPI, Horovod, Ray) and orchestration tools (Kubernetes).
Proficiency with C++, CUDA, or other performance-oriented languages/environments.
Familiarity with distributed training frameworks (MPI, Horovod, Ray), hyperparameter tuning, and HPC systems.
Hands-on experience with model optimization techniques (quantization, pruning, distillation) and hardware acceleration.
Proficiency with MLOps toolchains for CI/CD, experiment tracking, and monitoring (MLflow, Kubeflow, TFX).
Knowledge of human-centered AI principles and UX design for model-driven applications.
Knowledge of high-assurance AI: formal methods, red-teaming, interpretability, and runtime safety.
Strong collaboration skills in dynamic, interdisciplinary teams and experience mentoring junior engineers.
Experience developing and deploying large language models, multimodal AI systems, or advanced reinforcement-learning agents.
Experience integrating AI workflows with robotics, experimental facilities, or digital twins.
Experience contributing to open-source AI frameworks or publishing peer-reviewed research.
Experience implementing secure AI workflows in classified or regulated environments.
Ability to obtain and maintain an SCI clearance, which may require a polygraph test.

Responsibilities

Research, fine-tune, and certify large reasoning models (LLMs, graph neural nets, vision transformers, etc.) for domain tasks in materials science, chemistry, physics, grid controls, and nuclear security
Develop and integrate domain foundation models trained or adapted on DOE simulation, experimental, and production data
Build AI surrogates to accelerate exascale multiphysics simulations, enabling millisecond-scale predictions
Design and implement multi-agent frameworks (hypothesizers, planners, executors, retrievers, assessors) with transparent decision graphs, uncertainty quantification, and audit logs
Embed continuous learning pipelines: connect model training/evaluation to live telemetry from HPC clusters, experiments, and autonomous labs
Establish a model repository with metadata, SBOMs, versioning, drift/poisoning surveillance, and periodic recertification
Implement high-assurance controls: least-privilege execution, runtime shields/tripwires, deterministic fallbacks, cryptographic provenance, and enclave attestation for sensitive workloads
Collaborate with Data and Infrastructure teams to align model requirements with data lakehouses, compute fabric, and edge inference systems
Contribute to open-source and internal AI frameworks, toolkits, and best practices for agentic workflows

Benefits

Career advancement and enrichment opportunities
Flexible work arrangements for many positions include 9/80 (work 80 hours every two weeks, with every other Friday off) and 4/10 (work 4 ten-hour days each week) compressed workweeks, part-time work, and telecommuting (a mix of onsite work and working from home)
Generous vacation, strong medical and other benefits, competitive 401(k), learning opportunities, relocation assistance, and amenities aimed at creating a solid work/life balance
