Senior Research Software Engineer

Oak Ridge National Laboratory • Oak Ridge, TN

Onsite

About The Position

We are seeking a Senior Research Software Engineer to join the Incident Modeling and Computational Sciences (IMCS) Group in the National Security Sciences Directorate (NSSD) at Oak Ridge National Laboratory (ORNL). IMCS develops and maintains state-of-the-art modeling and simulation tools supporting nuclear forensics, nuclear weapon effects, and radiological consequence management for DOE, DOD, and DHS sponsors. In this role, you will serve as a senior technical leader responsible for the architecture, development, and sustained operation of enterprise AI and data infrastructure, including Docker-based microservices, large language model (LLM) inference servers on GPU clusters, vector database and retrieval-augmented generation (RAG) pipelines, and observability stacks that advance AI capabilities across the laboratory. The successful candidate will work independently and lead collaboratively, driving technical decisions, mentoring junior staff, and partnering with multidisciplinary teams of scientists, data engineers, and system administrators to deliver reliable, secure, and high-performance AI services to ORNL researchers.

Requirements

A PhD in computer science, software engineering, or a related technical field and a minimum of 8 years of relevant experience, or an MS in these areas with a minimum of 12 years of relevant experience.
Demonstrated experience designing, deploying, and operating complex software systems or AI/ML infrastructure in a research, national security, or comparable production environment.
Experience leading or making significant technical contributions to multi-component software projects, including ownership of architecture decisions and delivery of results to stakeholders.
Experience deploying and managing containerized applications using Docker and Docker Compose or equivalent technologies in multi-service environments.
Demonstrated proficiency in Python and at least one additional language (e.g., JavaScript, Bash, C++).
Experience with Linux shell scripting and working in HPC or GPU cluster environments.
Experience presenting technical work to diverse audiences, including both technical peers and non-specialist stakeholders.

Nice To Haves

Deep expertise deploying and operating LLM inference infrastructure, including serving frameworks such as vLLM, Ollama, or comparable tools, and model routing or proxy solutions such as LiteLLM.
Experience architecting or administering vector database and RAG pipelines (e.g., Milvus, ChromaDB, or similar) at scale.
Expertise in reverse proxy and web infrastructure, including Nginx configuration, TLS/mTLS certificate management, WebSocket proxying, and authentication subrequest patterns.
Experience designing and operating observability stacks using OpenTelemetry, Prometheus, Grafana, Loki, Tempo, or equivalent tooling.
Experience maintaining security-sensitive forks of open-source projects, including upstream merge management, CVE triage, patch backporting, and coordinated disclosure workflows.
Familiarity with JavaScript or TypeScript and component-based frontend frameworks such as Svelte or React.
Demonstrated experience mentoring junior engineers or leading multidisciplinary technical teams.
Experience contributing to research proposals, white papers, or program development activities with federal sponsors or comparable R&D organizations.
Experience working with DOE National Laboratories or other federal research institutions.
Excellent written and oral communication skills.
Ability to function well in a fast-paced research environment, set priorities to accomplish multiple tasks within deadlines, and adapt to ever-changing needs.

Responsibilities

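Serve as a senior technical leader responsible for the architecture, development, and sustained operation of enterprise AI and data infrastructure, including Docker-based microservices, LLM inference servers on GPU clusters, vector database and RAG pipelines, and observability stacks.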
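Develop and maintain state-of-the-art modeling and simulation tools supporting nuclear forensics, nuclear weapon effects, and radiological consequence management for DOE, DOD, and DHS sponsors.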
Drive technical decisions and mentor junior staff.
Partner with multidisciplinary teams of scientists, data engineers, and system administrators to deliver reliable, secure, and high-performance AI services to ORNL researchers.
Work independently and lead collaboratively.