DevOps Engineer

LLNL
Livermore, CA
Hybrid

About The Position

We have an opening for a Development Operations (DevOps) Engineer. You will develop and support a robust, scalable, and operational infrastructure at the intersection of High-Performance Computing (HPC), on-prem cloud-native technologies, and AI/ML software stacks to support, develop, and deploy collaboration tools and services for users of LLNL’s high-performance computers. You will work independently, applying software engineering and DevOps skills on a variety of hardware platforms to enable state-of-the-art collaboration and productivity tools for developers and scientists located worldwide. This position is in the Livermore Computing Division within the Computing Directorate.

This position will be filled at either the SES.1 or SES.2 level depending on your qualifications. Additional job responsibilities (outlined below) will be assigned if you are selected at the higher level.

You will:

  • Build, deploy, support, and enhance LLNL containerized applications and software stacks deployed in our LC OpenShift/Kubernetes clusters.
  • Identify issues and propose solutions to technical problems across a wide range of projects and efforts to improve design and implementation of DevOps best practices.
  • Perform software engineering using established development practices, tools, and processes for achieving robust software quality, including testing, configuration management, change management, and documentation.
  • Collaborate closely with other technical teams/developers to ensure solutions are secure and integrated with other services as appropriate.
  • Engage directly with HPC customers who use our tools and systems, delivering timely, customer-focused support and guidance.
  • Assist with managing OpenShift/Kubernetes container orchestration infrastructure in Linux, to support complex operational, development, and security requirements.
  • Investigate and deploy infrastructure monitoring, alerting, and logging tools for the DevOps infrastructure.
  • Support and improve automated deployments of infrastructure services and applications, following the design principles of high availability and zero-downtime updates.
  • Work with users and LC/LLNL security regarding use of on-premises and third-party, cloud-based AI offerings, while understanding what is available, what LC users want to use, and whether it satisfies LC security.
  • Perform other duties as assigned.

Additional job responsibilities at the SES.2 level:

  • Implement automation tools to help with deploying, troubleshooting, and maintaining cluster environments within container orchestration environments.
  • Design, implement, and manage build and release pipelines.
  • Extend Kubernetes to help simplify researchers’ usage and operations.
  • Provide solutions to moderately complex problems involving largely identifiable factors.

Requirements

  • Ability to obtain and maintain a U.S. DOE Q-level security clearance, which requires U.S. citizenship.
  • Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or the equivalent combination of education and related experience.
  • Familiarity with deploying web applications and/or micro-services in a containerized environment (e.g., Docker, Podman, Kubernetes, OpenShift).
  • Experience with Python, Bash, JavaScript, or similar scripting languages.
  • Familiarity with software testing and implementing Continuous Integration pipelines.
  • Fundamental written and verbal communication skills necessary to effectively collaborate in a multi-disciplinary team environment, as well as the ability to work effectively with minimal guidance and as part of a team.
  • Experience creating CI pipelines that automate builds, tests, workflows, tasks or other processes.
  • Fundamental knowledge of the Git version control system, including push/pull, rebase, cherry-pick, and branching.
  • Experience integrating SSL/TLS certificates within applications and services according to security policy.
  • Experience providing innovative approaches and applying new technologies to broadly defined tasks and projects.
  • Ability to set priorities and independently resolve complex problems in a fast-paced environment.
  • Comprehensive experience creating CI pipelines that automate builds, tests, workflows, tasks or other processes.
  • Proficient experience integrating SSL/TLS certificates within applications and services according to security policy.
  • Broad experience providing innovative approaches and applying new technologies to broadly defined tasks and projects.

Nice To Haves

  • Experience with configuration management systems such as Ansible, Puppet, Chef or Salt.
  • Familiarity with LC LlamaMe, a Large Language Model inference software stack.
  • Familiarity with OpenShift/Kubernetes GPU device plugins and with using the Kubernetes scheduler to manage GPU resource allocation, interpret requirements for GPU load distribution, and configure LLMs in various environments.
  • Familiarity with LC LaunchIT, a full-stack web service provisioning application.

Responsibilities

  • Build, deploy, support, and enhance LLNL containerized applications and software stacks deployed in our LC OpenShift/Kubernetes clusters.
  • Identify issues and propose solutions to technical problems across a wide range of projects and efforts to improve design and implementation of DevOps best practices.
  • Perform software engineering using established development practices, tools, and processes for achieving robust software quality, including testing, configuration management, change management, and documentation.
  • Collaborate closely with other technical teams/developers to ensure solutions are secure and integrated with other services as appropriate.
  • Engage directly with HPC customers who use our tools and systems, delivering timely, customer-focused support and guidance.
  • Assist with managing OpenShift/Kubernetes container orchestration infrastructure in Linux, to support complex operational, development, and security requirements.
  • Investigate and deploy infrastructure monitoring, alerting, and logging tools for the DevOps infrastructure.
  • Support and improve automated deployments of infrastructure services and applications, following the design principles of high availability and zero-downtime updates.
  • Work with users and LC/LLNL security regarding use of on-premises and third-party, cloud-based AI offerings, while understanding what is available, what LC users want to use, and whether it satisfies LC security.
  • Perform other duties as assigned.
  • Implement automation tools to help with deploying, troubleshooting, and maintaining cluster environments within container orchestration environments.
  • Design, implement, and manage build and release pipelines.
  • Extend Kubernetes to help simplify researchers’ usage and operations.
  • Provide solutions to moderately complex problems involving largely identifiable factors.

Benefits

  • Flexible Benefits Package
  • 401(k)
  • Relocation Assistance
  • Education Reimbursement Program
  • Flexible schedules (depending on project needs)