Research Software Engineer

Oak Ridge National Laboratory • Oak Ridge, TN

Onsite

About The Position

We are seeking a Research Software Engineer to join the Incident Modeling and Computational Sciences (IMCS) Group in the National Security Sciences Directorate at Oak Ridge National Laboratory (ORNL). IMCS develops and maintains state-of-the-art modeling and simulation tools supporting nuclear forensics, nuclear weapon effects, radiological consequence management, and other needs for DOE, DOW, and DHS sponsors. In this role, you will design, develop, and operate enterprise AI and data infrastructure, helping to build, maintain, and scale Docker-based microservices, large language model (LLM) inference servers on GPU clusters, vector database and retrieval-augmented generation (RAG) pipelines, and observability stacks that advance AI capabilities across the laboratory. The successful candidate will work independently and collaboratively with a multidisciplinary team of scientists, data engineers, and system administrators to deliver reliable, secure, and high-performance AI services to ORNL researchers.

Requirements

A BS degree in computer science, software engineering, or a related technical field and a minimum of five years of relevant experience. A combination of education and experience may also be considered.
Experience with the software development life cycle, including version control with Git, code review practices, and collaborative development workflows.
Experience writing and maintaining production-quality code in Python, with exposure to one or more additional languages (e.g., JavaScript, Bash, C++).
Experience deploying and debugging containerized applications using Docker and Docker Compose, including multi-service environments.
Experience with Linux shell scripting in a command-line environment.
Experience working in multidisciplinary teams across all phases of the software development life cycle.
Ability to obtain and maintain a Sensitive Compartmented Information (SCI) clearance from the Department of Energy.
Must be able to pass a pre-placement drug test and participate in an ongoing random drug testing program.
May be subject to random polygraph testing due to the SCI clearance.

Nice To Haves

Experience deploying or operating AI/ML serving infrastructure, including LLM serving frameworks such as vLLM, Ollama, or similar.
Familiarity with model routing or proxy tools such as LiteLLM or comparable API gateway solutions.
Experience with vector databases or retrieval-augmented generation (RAG) pipelines (e.g., Milvus, ChromaDB, Weaviate, or similar).
Knowledge of reverse proxy and web infrastructure concepts, including Nginx configuration, TLS/mTLS certificate management, WebSocket proxying, and authentication subrequests.
Experience with relational databases, including PostgreSQL administration and schema management.
Familiarity with observability tooling such as OpenTelemetry, Prometheus, Grafana, Loki, or Tempo.
Experience with HPC environments and job schedulers such as SLURM, or general experience deploying services on remote GPU clusters.
Experience maintaining forks of open-source projects, including upstream merge management, patch backporting, and dependency CVE remediation.
Familiarity with JavaScript or TypeScript and component-based frontend frameworks such as Svelte or React.
Excellent written and oral communication skills.
Motivated self-starter with the ability to work independently and to participate creatively in collaborative teams across the laboratory.
Ability to function well in a fast-paced research environment, set priorities to accomplish multiple tasks within deadlines, and adapt to ever-changing needs.

Responsibilities

Design, develop, and operate enterprise AI and data infrastructure.
Build, maintain, and scale Docker-based microservices.
Manage large language model (LLM) inference servers on GPU clusters.
Develop vector database and retrieval-augmented generation (RAG) pipelines.
Implement observability stacks that advance AI capabilities across the laboratory.
Work independently and collaboratively with a multidisciplinary team of scientists, data engineers, and system administrators.
Deliver reliable, secure, and high-performance AI services to ORNL researchers.