Senior AI Architect (Hybrid Eligible)

Oak Ridge National Laboratory • Oak Ridge, TN

About The Position

As a Senior AI Architect at Oak Ridge National Laboratory (ORNL), you will operate at the intersection of advanced machine learning, software and systems architecture, and ORNL’s enterprise and high-performance computing (HPC) environments. The role is typically less about building individual models and more about designing, integrating, and governing end-to-end AI systems that can be deployed and sustained across research programs and mission operations. This includes architecting solutions that use large language models (LLMs), multimodal and foundation models, and hybrid deployments spanning on-premises HPC resources, controlled networks, and approved cloud services, while ensuring these systems meet ORNL’s requirements for scale, security, reliability, and scientific reproducibility.

In a practical sense, the Senior AI Architect defines reference architectures for “AI at ORNL,” including patterns for data ingestion and curation, training and fine-tuning pipelines, model serving and inference at scale, and integration into scientific workflows (e.g., simulation, experimental facilities, and analysis platforms). You will guide technology selection and implementation for core capabilities such as distributed training and inference, workflow orchestration, GPU/accelerator utilization, model registries and artifact management, vector search and retrieval-augmented generation (RAG), and LLMOps/MLOps practices (CI/CD for models, automated evaluation, and monitoring). You will also establish performance and cost models to decide when workloads should run on HPC versus cloud, and how to engineer systems for throughput, latency, and resource efficiency under real-world constraints.

Because ORNL systems often involve sensitive data, regulated collaborations, and tightly controlled computing environments, this role places strong emphasis on secure-by-design architectures. The Senior AI Architect works closely with cybersecurity and compliance stakeholders to implement robust identity and access management, network segmentation, audit logging, and data-governance controls that support needs such as least-privilege access, provenance, and reproducibility. You will also define validation and assurance approaches appropriate for mission use: rigorous benchmarking, red-teaming and prompt-injection testing for LLM applications, model risk assessments, and continuous monitoring for drift and unexpected behavior.

Finally, the Senior AI Architect acts as a technical leader and integrator across the division. You will lead architecture reviews, produce technical roadmaps, and coordinate cross-functional teams of researchers, platform engineers, and application developers to move AI capabilities from prototype to production. This often includes mentoring teams on best practices, standardizing reusable components (shared model-serving stacks, data connectors, evaluation harnesses), and ensuring that AI systems remain maintainable and supportable over time, aligned with ORNL’s mission priorities, operational constraints, and evolving research needs.

Requirements

Bachelor's or Master's degree in Computer Science, AI/ML, Software Engineering, or a related field
7+ years of experience in software architecture, machine learning systems, or distributed systems
Strong understanding of machine learning fundamentals and LLM architectures
Strong understanding of API-based and local model inference workflows
Experience using cloud-hosted AI services (e.g., OpenAI, AWS Bedrock, Azure AI, Vertex AI, or similar)
Experience using locally deployed models (e.g., via Hugging Face, Ollama, vLLM, or similar)
Familiarity with Model Context Protocol (MCP) or comparable orchestration/integration patterns
Proficiency in Python and backend system design
Experience with containerization and deployment (Docker, Kubernetes)

Nice To Haves

Experience architecting hybrid AI systems that dynamically route between local and cloud models
Hands-on experience setting up on-demand local model inference (GPU-backed or CPU-optimized deployments)
Deep familiarity with LLM system design patterns: retrieval-augmented generation (RAG), tool use / function calling, and agent-based workflows
Experience with vector databases (e.g., FAISS, Pinecone, Weaviate, Milvus)
Experience with inference optimization frameworks and techniques (e.g., vLLM, TensorRT, ONNX Runtime, quantization)
Knowledge of model serving infrastructure (Ray Serve, KServe, Triton, FastAPI-based services, MLflow, etc.)
Experience designing low-latency and high-throughput inference pipelines
Familiarity with fine-tuning approaches (LoRA, PEFT, instruction tuning)
Experience integrating AI systems into enterprise applications and data platforms
Understanding of observability for AI systems (logging, tracing, evaluation metrics, drift detection)
Experience with GPU infrastructure, scheduling, and cost optimization strategies
Familiarity with security practices for AI systems, including prompt injection mitigation and data isolation
Experience working with multimodal models (text, image, geospatial, etc.)

Responsibilities

Design, integrate, and govern end-to-end AI systems.
Architect solutions using LLMs, multimodal and foundation models, and hybrid deployments.
Define reference architectures for AI at ORNL, including patterns for data ingestion, training, fine-tuning, and model serving.
Guide technology selection and implementation for distributed training, inference, workflow orchestration, GPU utilization, model registries, vector search, RAG, and LLMOps/MLOps.
Establish performance and cost models for workload deployment on HPC versus cloud.
Implement secure-by-design architectures, working with cybersecurity and compliance stakeholders.
Define validation and assurance approaches, including benchmarking, red-teaming, and model risk assessments.
Act as a technical leader and integrator across the division.
Lead architecture reviews and produce technical roadmaps.
Coordinate cross-functional teams to move AI capabilities from prototype to production.
Mentor teams on best practices and standardize reusable components.
Ensure AI systems remain maintainable and supportable.